13. ML – Data Science Basics

Data Types

  • Numerical
    • Represents some sort of quantitative measurement
    • Discrete Data: Integer based; often counts of some event
    • Continuous Data: Has an infinite number of possible values
  • Categorical
    • Qualitative data that has no inherent mathematical meaning
    • You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have mathematical meaning
  • Ordinal
    • A mixture of numerical and categorical
    • Categorical data that has mathematical meaning because the categories have an inherent order (e.g., 1–5 star ratings)
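A quick pandas sketch of the three data types; the column names and values are made up purely for illustration:

    import pandas as pd

    df = pd.DataFrame({
        "purchases": [3, 1, 4],                   # numerical, discrete (event counts)
        "height_cm": [170.2, 165.5, 180.1],       # numerical, continuous
        "color": pd.Categorical(["red", "blue", "red"]),         # categorical, no order
        "rating": pd.Categorical([1, 3, 2], categories=[1, 2, 3],
                                 ordered=True),   # ordinal: ordered categories
    })
    print(df.dtypes)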

Data Distributions

  • Normal distribution
    • The classic bell-shaped curve for continuous data, defined by its mean and standard deviation
  • Probability Mass Function
    • aka probability function, frequency function, or discrete probability density function
    • A function that gives the probability that a discrete random variable is exactly equal to some value; it is often the primary means of defining a discrete probability distribution, and such functions exist for scalar or multivariate random variables whose domain is discrete
  • Poisson Distribution
    • expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event.
    • A classic example used to motivate the Poisson distribution is the number of radioactive decay events during a fixed observation period.
  • Binomial Distribution
    • The discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question with a Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p)
    • A single success/failure experiment is called a Bernoulli trial, and a sequence of outcomes is called a Bernoulli process
    • For a single trial (n = 1), the binomial distribution is a Bernoulli distribution
  • Bernoulli Distribution
    • Special case of binomial distribution
    • Has a single trial (n=1)
    • Can think of a binomial distribution as the sum of n independent Bernoulli trials (see the scipy sketch after this list)
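A short scipy sketch of these discrete distributions; the parameter values are arbitrary:

    import numpy as np
    from scipy import stats

    # Poisson PMF: probability of exactly k events when the mean rate is mu
    print(stats.poisson.pmf(k=2, mu=3))        # P(X = 2) for lambda = 3

    # Binomial PMF: probability of k successes in n independent trials
    print(stats.binom.pmf(k=7, n=10, p=0.5))   # P(X = 7)

    # Bernoulli is the n = 1 special case of the binomial
    print(stats.bernoulli.pmf(k=1, p=0.3))     # P(X = 1) = 0.3
    print(stats.binom.pmf(k=1, n=1, p=0.3))    # same value

    # A binomial draw is the sum of n Bernoulli draws
    rng = np.random.default_rng(42)
    print(rng.binomial(n=1, p=0.5, size=10).sum())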

Time Series Analysis

  • Trends
  • Seasonality
  • Noise
  • Seasonality + Trends + Noise = time series
    • Additive model
      • A data model in which the effects of individual factors are differentiated and added together to model the data.
    • Seasonal variation is constant
  • Seasonality × Trends × Noise = time series
    • Multiplicative model
      • Assumes that as the data increase, so does the seasonal pattern; most time series plots exhibit such a pattern
    • Seasonal variation increases as the trend increases (both models are decomposed in the sketch below)
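A minimal statsmodels sketch of both decompositions; the monthly series below is synthetic, built only to exercise the API:

    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Synthetic monthly series: upward trend plus yearly seasonality
    idx = pd.date_range("2020-01-01", periods=48, freq="MS")
    values = [100 + 2 * i + 10 * ((i % 12) - 6) for i in range(48)]
    series = pd.Series(values, index=idx)

    # Additive model: series = trend + seasonal + residual (noise)
    additive = seasonal_decompose(series, model="additive")
    print(additive.seasonal.head())

    # Multiplicative model: series = trend * seasonal * residual;
    # use it when seasonal swings grow with the trend (values must be positive)
    multiplicative = seasonal_decompose(series, model="multiplicative")
    print(multiplicative.seasonal.head())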

Confusion Matrix

                           Predicted Positive (PP)    Predicted Negative (PN)
Actual Positive (P)        True Positive (TP)         False Negative (FN)
Actual Negative (N)        False Positive (FP)        True Negative (TN)
Total population = P + N
  • aka Error Matrix, or matching matrix
  • Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa – both variants are found in the literature. The diagonal of the matrix therefore represents all instances that are correctly predicted.
  • Total population = P + N
    • True Positive Rate (TPR) = Recall = Sensitivity (SEN) = Hit Rate = Completeness = TP / P = TP / (TP + FN) = 1 – FNR
      • how well the model identifies true positives
      • Good choice of metric when you care a lot about false negatives, e.g., fraud detection
      • The higher, the better
      • Raising TPR usually raises FPR as well, so be careful
    • False Negative Rate (FNR) = Miss Rate = Type II Error = FN / P = 1 – TPR
    • False Positive Rate (FPR) = Type I Error = FP / N = 1 – TNR
      • The lower, the better
    • True Negative Rate (TNR) = Specificity (SPC) = Selectivity = TN / N = 1 – FPR
  • Prevalence = P / (P+N)
    • Positive Predictive Value (PPV) = Precision = Correct Positives = TP / (TP + FP) = 1 – FDR
      • the quality of positive predictions
      • Good choice of metric when you care a lot about false positives, e.g., medical screening, drug testing
    • False Omission Rate (FOR) = FN / (TN + FN) = 1 – NPV
  • Accuracy (ACC) = (TP + TN) / (P + N)
    • False Discovery Rate (FDR) = FP / (TP + FP) = 1 – PPV
    • Negative Predictive Value (NPV) = TN / (TN + FN) = 1 – FOR
  • Balanced Accuracy (BA) = (TPR + TNR) / 2
    • F-Score = ( (1 + β²) × Precision × Recall ) / ( β² × Precision + Recall )
      • When β = 1, i.e., the F1-Score: F1 = 2 × (PPV × TPR) / (PPV + TPR) = 2TP / (2TP + FP + FN)
        • The F1 Score provides a balanced measure of a model’s performance by considering both precision (true positives among all predicted positives) and recall (true positives among all actual positives)
    • Threat Score (TS) = Critical Success Index (CSI) = TP / (TP + FN + FP)
  • Root mean squared error (RMSE) is an accuracy measure for regression rather than classification
    • Measures the differences between predicted values and actual values in a regression problem
    • The square root of the average squared differences between predicted and actual values: RMSE = √( Σ(ŷᵢ − yᵢ)² / n )
    • the lower, the better
  • Cut-off (or threshold) is a value used to convert the model’s predicted probabilities or scores into binary class predictions.
  • Receiver Operating Characteristic Curve (ROC)
    • Plot of true positive rate (recall) vs. false positive rate at various threshold settings
    • The more the curve bends toward the upper-left, the better
  • Area Under the Curve (AUC)
    • The area under the ROC curve
    • probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
    • ROC AUC of 0.5 is a useless classifier (Random), 1.0 is perfect
  • P-R Curve
    • Precision / Recall curve
    • Higher area under the curve (PR-AUC) is better, i.e., the more the curve hugs the upper-right, the better (both AUCs are computed in the sklearn sketch after this list)
    • When to use
      • Imbalanced Datasets: When the positive class is rare, and the dataset is heavily imbalanced, the PR curve is more informative than the ROC curve.
        • Examples include fraud detection and disease diagnosis.
      • Costly False Positives: If false positives are more costly or significant than false negatives, such as in spam email detection, the PR curve is more suitable as it focuses on precision.
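A worked scikit-learn sketch of these metrics; the labels and scores are made up for illustration:

    from sklearn.metrics import (average_precision_score, confusion_matrix,
                                 f1_score, precision_score, recall_score,
                                 roc_auc_score)

    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]                         # actual condition
    y_prob = [0.9, 0.4, 0.7, 0.2, 0.1, 0.6, 0.8, 0.3, 0.05, 0.55]   # model scores

    # Apply a cut-off (threshold) to turn scores into class predictions
    threshold = 0.5
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    # sklearn lays the matrix out as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    print("Recall (TPR):   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
    print("Precision (PPV):", precision_score(y_true, y_pred))   # TP / (TP + FP)
    print("F1:             ", f1_score(y_true, y_pred))
    print("ROC AUC:        ", roc_auc_score(y_true, y_prob))     # ranking quality
    print("PR AUC:         ", average_precision_score(y_true, y_prob))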

Feature Engineering

  • A feature is an individual measurable property within a recorded dataset. In machine learning and statistics, features are often called “variables” or “attributes.” Relevant features have a correlation or bearing (called feature importance) on a model’s use case.
  • Applying your knowledge of the data – and the model you’re using – to create better features to train your model with.
    • Which features should I use?
    • Do I need to transform these features in some way?
    • How do I handle missing data?
    • Should I create new features from the existing ones?
  • You can’t just throw in raw data and expect good results
  • The Curse of Dimensionality
    • Too many features can be a problem – leads to sparse data
    • Every feature is a new dimension
    • Much of feature engineering is selecting the features most relevant to the problem at hand
    • This often is where domain knowledge comes into play
    • Unsupervised dimensionality reduction techniques can also be employed to distill many features into fewer features
      • Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
      • K-Means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
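A minimal PCA sketch with scikit-learn, using the iris dataset as a stand-in for high-dimensional data:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X = load_iris().data                  # shape (150, 4)

    # Distill the 4 features into 2 principal components
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)      # shape (150, 2)

    # Fraction of the original variance each component preserves
    print(pca.explained_variance_ratio_)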

Imputing Missing Data / Imputation

  • Mean Replacement
    • Replace missing values with the mean value from the rest of the column (columns, not rows! A column represents a single feature; it only makes sense to take the mean from other samples of the same feature.)
    • Fast & easy, won’t affect mean or sample size of overall data set
    • Median may be a better choice than mean when outliers are present
    • But it’s generally pretty terrible.
      • Only works on column level, misses correlations between features
      • Can’t use on categorical features (imputing with most frequent value can work in this case, though)
      • Not very accurate
  • Dropping
    • Drops rows (or columns) that contain missing values, reducing the size of the final data set
  • Machine Learning
    • KNN: Find K “nearest” (most similar) rows and average their values
      • Assumes numerical data, not categorical
    • Deep Learning
      • Build a machine learning model to impute data for your machine learning model!
      • Works well for categorical data. Really well. But it’s complicated.
    • Regression
      • Find linear or non-linear relationships between the missing feature and other features
      • Most advanced technique: MICE (Multiple Imputation by Chained Equations)
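A scikit-learn sketch of these approaches; IterativeImputer is sklearn’s MICE-style chained-equations imputer and is still flagged experimental, hence the extra import:

    import numpy as np
    from sklearn.impute import KNNImputer, SimpleImputer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0], [7.0, np.nan]])

    # Mean replacement, column by column (strategy="median" with outliers,
    # strategy="most_frequent" for categorical features)
    print(SimpleImputer(strategy="mean").fit_transform(X))

    # KNN: average each missing feature over the k most similar rows
    print(KNNImputer(n_neighbors=2).fit_transform(X))

    # Regression-based imputation via chained equations
    print(IterativeImputer(random_state=0).fit_transform(X))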

Unbalanced Data

  • Large discrepancy between “positive” and “negative” cases
    • “positive” means the thing you’re testing for is what happened
    • e.g., fraud detection. Fraud is rare, and most rows will be not-fraud
  • Techniques for re-balance
    • Oversampling
      • Duplicate samples from the minority class
      • Synthetic Minority Over-sampling TEchnique (SMOTE)
        • Artificially generates new samples of the minority class using nearest neighbors
          • Run K-nearest-neighbors on each sample of the minority class
          • Create a new sample from the KNN result (mean of the neighbors)
        • Both generates new samples and undersamples the majority class
        • Generally better than just simple oversampling (see the imbalanced-learn sketch after this list)
    • Undersampling
      • Instead of creating more positive samples, remove “some” negative ones
      • Throwing data away is usually not the right answer
  • Adjusting thresholds
    • When making predictions about a classification (fraud / not fraud), you have some sort of threshold of probability at which point you’ll flag something as the positive case (fraud)
    • If you have too many false positives, one way to fix that is to simply increase that threshold.
      • Guaranteed to reduce false positives
      • But, could result in more false negatives
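A sketch of both techniques; SMOTE lives in the separate imbalanced-learn package (imblearn), and the data here is synthetic:

    import numpy as np
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

    # Synthetic ~95/5 unbalanced dataset
    X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
    print(np.bincount(y))                      # roughly [950, 50]

    # SMOTE: synthesize new minority samples from nearest neighbors
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(np.bincount(y_res))                  # classes now balanced

    # Adjusting the threshold instead: raising it cuts false positives,
    # at the cost of possibly more false negatives
    y_prob = np.array([0.2, 0.55, 0.7, 0.95])  # made-up model scores
    print((y_prob >= 0.5).astype(int))         # default cut-off  -> [0 1 1 1]
    print((y_prob >= 0.8).astype(int))         # stricter cut-off -> [0 0 0 1]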

Outliers

  • Variance (σ²) is simply the average of the squared differences from the mean
    • Measures how “spread-out” the data is
  • Standard Deviation (σ) is just the square root of the variance
    • Data points that lie more than one standard deviation from the mean can be considered unusual
  • You can describe how extreme a data point is by how many “sigmas” it is away from the mean (see the z-score sketch below)
  • Dealing with Outliers
    • AWS’s Random Cut Forest algorithm appears in many of its services; it is purpose-built for outlier detection
    • It takes a set of random data points, cuts them down to the same number of points, and builds a collection of models, where each model corresponds to a decision tree, hence the name “forest”
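A numpy sketch of flagging outliers by how many sigmas they sit from the mean, on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=50, scale=10, size=1000)
    data = np.append(data, [200.0])        # inject an obvious outlier

    mean, sigma = data.mean(), data.std()  # sigma = sqrt(variance)

    # z-score: how many sigmas each point is from the mean
    z_scores = (data - mean) / sigma

    # Flag anything more than 3 standard deviations out
    print(data[np.abs(z_scores) > 3])      # includes 200.0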

Binning / Bucketing

  • Bucket observations together based on ranges of values.
  • Quantile binning categorizes data by their place in the data distribution
    • Ensures even sizes of bins
  • Transforms numeric data to categorical/ordinal data
  • Especially useful when there is uncertainty in the measurements
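A pandas sketch of both kinds of binning; the ages are made up:

    import pandas as pd

    ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 71])

    # Fixed-width bins: bucket by ranges of values
    print(pd.cut(ages, bins=[0, 30, 50, 100], labels=["young", "middle", "older"]))

    # Quantile binning: bucket by position in the distribution,
    # which keeps the bin sizes even
    print(pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"]))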

Transforming

  • Feature data with an exponential trend may benefit from a logarithmic transform
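A numpy sketch of the log transform; log1p computes log(1 + x), which also handles zeros safely, and the values are hypothetical:

    import numpy as np

    # Feature with an exponential trend
    x = np.array([1_000, 10_000, 100_000, 1_000_000], dtype=float)

    x_log = np.log1p(x)       # roughly evenly spaced after the transform
    print(x_log)

    print(np.expm1(x_log))    # invert the transform to recover original values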

Encoding

  • Transforming data into some new representation required by the model
  • One-hot encoding
    • Create “buckets” for every category
    • The bucket for your category has a 1, all others have a 0
    • Very common in deep learning, where categories are represented by individual output “neurons”
Decimal    Binary    Unary      One-hot
0          000       0000000    00000001
1          001       0000001    00000010
2          010       0000011    00000100
3          011       0000111    00001000
4          100       0001111    00010000
5          101       0011111    00100000
6          110       0111111    01000000
7          111       1111111    10000000
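A pandas sketch of one-hot encoding with made-up categories:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # One "bucket" (column) per category; the sample's category gets a 1
    print(pd.get_dummies(df, columns=["color"], dtype=int))
    #    color_blue  color_green  color_red
    # 0           0            0          1
    # 1           0            1          0
    # 2           1            0          0
    # 3           0            1          0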

Scaling / Normalization

  • Some models prefer feature data to be normally distributed around 0 (most neural nets)
  • Most models require feature data to at least be scaled to comparable values
  • Otherwise features with larger magnitudes will have more weight than they should
  • Example: modeling age and income as features – incomes will be much higher values than ages
  • Remember to scale your results back up
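A scikit-learn sketch of both common scalers, using the age/income example above:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Age and income live on very different scales
    X = np.array([[25, 40_000], [35, 85_000], [50, 120_000]], dtype=float)

    # StandardScaler: center each feature at 0 with unit variance
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # MinMaxScaler squeezes each feature into [0, 1] instead
    X_minmax = MinMaxScaler().fit_transform(X)

    # "Scale your results back up": invert the transform when needed
    X_original = scaler.inverse_transform(X_scaled)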

Shuffling

  • Many algorithms benefit from shuffling their training data
  • Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
    • The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
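A minimal sketch using sklearn’s shuffle helper, which randomizes row order while keeping features and labels aligned:

    import numpy as np
    from sklearn.utils import shuffle

    X = np.arange(10).reshape(5, 2)   # rows possibly ordered by collection time
    y = np.array([0, 0, 1, 1, 1])

    # Shuffle features and labels together so rows stay paired
    X_shuf, y_shuf = shuffle(X, y, random_state=42)
    print(X_shuf, y_shuf)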

Term Frequency and Inverse Document Frequency (TF-IDF)

  • figures out what terms are most relevant for a document
  • Term Frequency just measures how often a word occurs in a document
    • A word that occurs frequently is probably important to that document’s meaning
  • Document Frequency is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page
    • This tells us about common words that just appear everywhere no matter what the topic, like “a”, “the”, “and”, etc.
  • So a measure of the relevancy of a word to a document might be: Term Frequency / Document Frequency (aka Term Frequency × Inverse Document Frequency)
  • That is, take how often the word appears in a document, over how often it just appears everywhere. That gives you a measure of how important and unique this word is for this document
  • In practice, TF-IDF typically uses TF × the log of the inverse document frequency, since word frequencies are distributed exponentially.
  • n-gram
    • An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or (rarely) whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome.
    • Unigrams: single word
    • Bi-grams: two adjacent terms
    • Tri-grams: three adjacent terms
  • Example:
    • The TF-IDF Matrix for unigram and bigrams of these two sentences
      • Please call the number below.
      • Please do not call us.
    • the matrix dimension would be (2, 16)
      • 2 – two input source (documents)
      • 16 – 8 unigrams (“please”, “call”, “the”, “number”, “below”, “do”, “not”, “us”) + 8 bigrams (“please call”, “call the”, “the number”, “number below”, “please do”, “do not”, “not call”, “call us”)
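A scikit-learn sketch reproducing the (2, 16) matrix above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "Please call the number below.",
        "Please do not call us.",
    ]

    # ngram_range=(1, 2) extracts both unigrams and bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(docs)

    print(tfidf.shape)                         # (2, 16): 2 documents, 16 terms
    print(vectorizer.get_feature_names_out())  # 8 unigrams + 8 bigrams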