13. ML – Data Science Basics

Data Types

  • Numerical
    • Represents some sort of quantitative measurement
    • Discrete Data: Integer based; often counts of some event
    • Continuous Data: Has an infinite number of possible values
  • Categorical
    • Qualitative data that has no inherent mathematical meaning
    • You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have mathematical meaning
  • Ordinal
    • A mixture of numerical and categorical
    • Categorical data whose values have a meaningful order, e.g., star ratings from 1 to 5

Data Distributions

  • Normal distribution
  • Probability Mass Function
    • aka, probability function, frequency function, discrete probability density function
    • A function that gives the probability that a discrete random variable is exactly equal to some value; it is the usual way to define a discrete probability distribution, and exists for scalar or multivariate random variables with a discrete domain
  • Poisson Distribution
    • expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event.
    • A classic example used to motivate the Poisson distribution is the number of radioactive decay events during a fixed observation period.
  • Binomial Distribution
    • The discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each succeeding with probability p and failing with probability q = 1 − p. A single success/failure experiment is called a Bernoulli trial, and a sequence of such trials is a Bernoulli process; for a single trial (n = 1) the binomial distribution reduces to the Bernoulli distribution
  • Bernoulli Distribution
    • Special case of binomial distribution
    • Has a single trial (n=1)
    • A binomial random variable can be thought of as the sum of n independent Bernoulli random variables (see the sketch below)
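
A minimal sketch (assuming scipy is installed) that evaluates the discrete PMFs above and illustrates the binomial/Bernoulli relationship:

```python
# Sketch: evaluating the discrete PMFs above with scipy.stats (assumes scipy is installed).
from scipy.stats import binom, poisson, bernoulli

n, p, lam = 10, 0.3, 4.0

print(binom.pmf(3, n, p))      # P(exactly 3 successes in 10 trials, p = 0.3)
print(poisson.pmf(2, lam))     # P(exactly 2 events when the mean rate is 4 per interval)
print(bernoulli.pmf(1, p))     # P(success) for a single trial; equals binom.pmf(1, 1, p)

# A Binomial(n, p) variable is the sum of n independent Bernoulli(p) variables.
samples = bernoulli.rvs(p, size=(100_000, n)).sum(axis=1)
print(samples.mean(), n * p)   # empirical mean is close to the theoretical mean n*p
```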

Time Series Analysis

  • Trends
  • Seasonality
  • Noise
  • Seasonality + Trends + Noise = time series
    • Additive model
      • A data model in which the effects of individual factors are differentiated and added together to model the data.
    • Seasonal variation is constant
  • Seasonality * Trends * Noise = time series
    • Multiplicative model
      • Assumes that as the level of the data increases, so does the size of the seasonal pattern; many real-world time series behave this way
    • Seasonal variation increases as the trend increases (see the decomposition sketch below)
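
A minimal sketch (assuming pandas and statsmodels are installed) contrasting additive and multiplicative decomposition on a made-up monthly series:

```python
# Sketch: additive vs. multiplicative decomposition (assumes pandas + statsmodels).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")        # 4 years of monthly data
trend = np.linspace(100, 200, 48)
season = 1 + 0.2 * np.sin(2 * np.pi * np.arange(48) / 12)
noise = np.random.normal(1, 0.02, 48)
series = pd.Series(trend * season * noise, index=idx)           # seasonal swing grows with the trend

additive = seasonal_decompose(series, model="additive", period=12)
multiplicative = seasonal_decompose(series, model="multiplicative", period=12)
print(multiplicative.seasonal.head(12))   # .trend, .seasonal, .resid hold the components
```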

Confusion Matrix

                               Predicted condition
                               Positive (PP)           Negative (PN)
  Actual     Positive (P)      True positive (TP)      False negative (FN)
  condition  Negative (N)      False positive (FP)     True negative (TN)
  Total population = P + N
  • aka Error Matrix, or matching matrix
  • Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa – both variants are found in the literature. The diagonal of the matrix therefore represents all instances that are correctly predicted.
  • Total population = P + N
    • True Positive Rate (TPR) = Recall = Sensitivity (SEN) = Hit Rate = Completeness = TP / P = TP / (TP + FN) = 1 – FNR
      • how well the model identifies true positives
      • Good choice of metric when you care a lot about false negatives, e.g., fraud detection
      • The higher, the better
      • Raising TPR usually raises FPR as well, so be careful
    • False Negative Rate (FNR) = Miss Rate = Type II Error = FN / P = 1 – TPR
    • False Positive Rate (FPR) = Type I Error = FP / N = 1 – TNR
      • The lower, the better
    • True Negative Rate (TNR) = Specificity (SPC) = Selectivity = TN / N = 1 – FPR
  • Prevalence = P / (P+N)
    • Positive Predictive Value (PPV) = Precision = Correct Positives = TP / (TP + FP) = 1 – FDR
      • the quality of positive predictions
      • Good choice of metric when you care a lot about false positives, e.g., medical screening, drug testing
    • False Omission Rate (FOR) = FN / (TN + FN) = 1 – NPV
  • Accuracy (ACC) = (TP + TN) / (P + N)
    • False Discovery Rate (FDR) = FP / (TP + FP) = 1 – PPV
    • Negative Predictive Value (NPV) = TN / (TN + FN) = 1 – FOR
  • Balanced Accuracy (BA) = (TPR + TNR) / 2
    • F-Score (Fβ) = (1 + β²) x Precision x Recall / (β² x Precision + Recall)
      • When β = 1, i.e., the F1-Score = 2 x PPV x TPR / (PPV + TPR) = 2TP / (2TP + FP + FN)
        • The F1 Score provides a balanced measure of a model’s performance by considering both precision (true positives among all predicted positives) and recall (true positives among all actual positives)
    • Threat Score (TS) = Critical Success Index (CSI) = TP / (TP + FN + FP)
  • Root mean squared error (RMSE) is an accuracy measure for regression problems
    • measure the differences between predicted values and actual values in a regression problem
    • the square root of the average squared differences between the predicted and actual values
    • the lower, the better
  • Cut-off (or threshold) is a value used to convert the model’s predicted probabilities or scores into binary class predictions.
  • Receiver Operating Characteristic Curve (ROC)
    • Plot of true positive rate (recall) vs. false positive rate at various threshold settings
    • The more the curve bends toward the “upper-left”, the better
  • Area Under the Curve (AUC)
    • The area under the ROC curve
    • probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
    • ROC AUC of 0.5 is a useless classifier (Random), 1.0 is perfect
  • P-R Curve
    • Precision / Recall curve
    • Higher area under the curve (PR-AUC) is better, i.e., the more the curve bends toward the “upper-right”, the better
    • When to use
      • Imbalanced Datasets: When the positive class is rare, and the dataset is heavily imbalanced, the PR curve is more informative than the ROC curve.
        • Examples include fraud detection and disease diagnosis.
      • Costly False Positives: If false positives are more costly or significant than false negatives, such as in spam email detection, the PR curve is more suitable as it focuses on precision.
  • Metric summary by data type
    • Binary / Category
      • Recall / Sensitivity / True Positive Rate: TP / P
      • True Negative Rate / Specificity / Selectivity: TN / N
      • Precision: TP / (TP + FP)
      • Accuracy: (TP + TN) / (P + N)
      • F1-Score: 2TP / (2TP + FP + FN)
      • Root mean squared error (RMSE)
      • Area Under the Curve (AUC)
      • P-R Curve: Precision / Recall curve
      • Log loss: a classification metric designed for models that output probabilities for class predictions
    • Numeric
      • Root mean squared error (RMSE): standard metric for evaluating regression models
      • Average Weighted Quantile Loss (Average wQL): for probabilistic forecasting, where one predicts a range (distribution) of possible future values rather than a single point; beneficial in time-series models like those used in weather forecasting, where uncertainty and quantiles matter
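
A minimal sketch (assuming scikit-learn) computing several of the metrics above from made-up labels and predicted scores, including the cut-off step:

```python
# Sketch: confusion-matrix metrics with scikit-learn (toy labels/scores for illustration).
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score,
                             mean_squared_error)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.65, 0.2, 0.1, 0.55, 0.8, 0.3])   # predicted probabilities
y_pred  = (y_score >= 0.5).astype(int)                            # apply a 0.5 cut-off

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))          # 2TP / (2TP + FP + FN)
print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # ranking quality, 0.5 = random
print("PR AUC:   ", average_precision_score(y_true, y_score))

# RMSE compares predictions to actual values; shown here on the scores for illustration only.
print("RMSE:     ", np.sqrt(mean_squared_error(y_true, y_score)))
```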

Feature Engineering

  • A feature is an individual measurable property within a recorded dataset. In machine learning and statistics, features are often called “variables” or “attributes.” Relevant features have a correlation or bearing (called feature importance) on a model’s use case.
  • Applying your knowledge of the data – and the model you’re using – to create better features to train your model with.
    • Which features should I use?
    • Do I need to transform these features in some way?
    • How do I handle missing data?
    • Should I create new features from the existing ones?
  • You can’t just throw in raw data and expect good results
  • The Curse of Dimensionality
    • Too many features can be a problem – leads to sparse data
    • Every feature is a new dimension
    • Much of feature engineering is selecting the features most relevant to the problem at hand
    • This often is where domain knowledge comes into play
    • Unsupervised dimensionality reduction techniques can also be employed to distill many features into fewer features
      • Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
      • K-Means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
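
A minimal sketch (assuming scikit-learn) of the two unsupervised techniques named above, on random data:

```python
# Sketch: PCA for dimensionality reduction and K-Means for clustering (assumes scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)                  # 200 samples, 10 features

X_2d = PCA(n_components=2).fit_transform(X)  # project onto the top 2 principal components
print(X_2d.shape)                            # (200, 2)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))                   # samples per cluster
```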

Imputing Missing Data / Imputation

  • Mean Replacement
    • Replace missing values with the mean value from the rest of the column (columns, not rows! A column represents a single feature; it only makes sense to take the mean from other samples of the same feature.)
    • Fast & easy, won’t affect mean or sample size of overall data set
    • Median may be a better choice than mean when outliers are present
    • But it’s generally pretty terrible.
      • Only works on column level, misses correlations between features
      • Can’t use on categorical features (imputing with most frequent value can work in this case, though)
      • Not very accurate
  • Dropping
    • Drop rows (or columns) with missing values, which shrinks the final data set
  • Machine Learning
    • KNN: Find K “nearest” (most similar) rows and average their values
      • Assumes numerical data, not categorical
    • Deep Learning
      • Build a machine learning model to impute data for your machine learning model!
      • Works well for categorical data. Really well. But it’s complicated.
    • Regression
      • Find linear or non-linear relationships between the missing feature and other features
      • Most advanced technique: MICE (Multiple Imputation by Chained Equations)
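
A minimal sketch (assuming scikit-learn) of the imputation strategies above; IterativeImputer is scikit-learn's regression-based, MICE-style imputer:

```python
# Sketch: mean/median, KNN, and regression-style imputation (assumes scikit-learn).
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the import below
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 6.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))     # column-wise mean replacement
print(SimpleImputer(strategy="median").fit_transform(X))   # more robust when outliers are present
print(KNNImputer(n_neighbors=2).fit_transform(X))          # average of the 2 most similar rows
print(IterativeImputer(random_state=0).fit_transform(X))   # model each feature from the others
```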

Unbalanced Data

  • Large discrepancy between “positive” and “negative” cases
    • “positive” means the thing you’re testing for is what happened
    • i.e., fraud detection. Fraud is rare, and most rows will be not-fraud
  • Techniques for re-balance
    • Oversampling
      • Duplicate samples from the minority class
      • Synthetic Minority Over-sampling TEchnique (SMOTE)
      • Artificially generate new samples of the minority class using nearest neighbors
        • Run K-nearest-neighbors of each sample of the minority class
        • Create a new sample from the KNN result (mean of the neighbors)
      • Can both generate new minority samples and undersample the majority class
      • Generally better than just simple oversampling
    • Undersampling
      • Instead of creating more positive samples, remove “some” negative ones
      • Throwing data away is usually not the right answer
  • Adjusting thresholds
    • When making predictions about a classification (fraud / not fraud), you have some sort of threshold of probability at which point you’ll flag something as the positive case (fraud)
    • If you have too many false positives, one way to fix that is to simply increase that threshold.
      • Guaranteed to reduce false positives
      • But, could result in more false negatives
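
A minimal sketch of both techniques: SMOTE via the third-party imbalanced-learn package (an assumption; not part of scikit-learn itself), and threshold adjustment on predicted probabilities:

```python
# Sketch: SMOTE oversampling (assumes imbalanced-learn) and threshold adjustment (scikit-learn).
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)    # synthesize minority samples via KNN
print("after: ", Counter(y_res))

# Adjusting the threshold: a higher cut-off means fewer false positives, more false negatives.
proba = LogisticRegression(max_iter=1000).fit(X_res, y_res).predict_proba(X)[:, 1]
for threshold in (0.5, 0.8):
    print(threshold, "positives flagged:", int((proba >= threshold).sum()))
```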

Outliers

  • Variance (σ²) is simply the average of the squared differences from the mean
    • measures how “spread-out” the data is
  • Standard Deviation 𝜎 is just the square root of the variance.
    • Data points lying more than one standard deviation from the mean can be considered unusual
  • You can describe how extreme a data point is by talking about how many “sigmas” away from the mean it is
  • Dealing with Outliers
    • AWS’s Random Cut Forest algorithm creeps into many of its services – it is made for outlier detection
    • It takes a set of random data points, cuts them into the same number of points, and builds a collection of models; each model corresponds to a decision tree, hence the name “forest”
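
A minimal sketch (NumPy only, with injected values) of the simple sigma-based approach described above; Random Cut Forest itself is an AWS-provided algorithm and is not reproduced here:

```python
# Sketch: flag outliers more than N standard deviations ("sigmas") from the mean.
import numpy as np

data = np.append(np.random.normal(50, 5, 1000), [120.0, -40.0])   # two injected outliers

mean, sigma = data.mean(), data.std()
sigmas_away = np.abs(data - mean) / sigma    # how many sigmas each point is from the mean
outliers = data[sigmas_away > 3]             # a common cut-off; choose per problem
print(outliers)
```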

Binning / Bucketing

  • Bucket observations together based on ranges of values.
  • Quantile binning categorizes data by their place in the data distribution
    • Ensures even sizes of bins
  • Transforms numeric data to categorical/ordinal data
  • Especially useful when there is uncertainty in the measurements
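
A minimal sketch (assuming pandas) contrasting fixed-width binning with quantile binning on made-up ages:

```python
# Sketch: fixed-width vs. quantile binning with pandas.
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 61, 70, 85])

print(pd.cut(ages, bins=4))     # equal-width ranges; counts per bin may be uneven
print(pd.qcut(ages, q=4))       # quantile bins; roughly even number of samples per bin
print(pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"]))   # ordinal labels
```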

Transforming

  • Feature data with an exponential trend may benefit from a logarithmic transform
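
A tiny sketch (NumPy, with made-up income values) of the logarithmic transform described above:

```python
# Sketch: log transform for feature data with an exponential trend.
import numpy as np

incomes = np.array([20_000, 35_000, 60_000, 120_000, 1_500_000])
log_incomes = np.log1p(incomes)   # log(1 + x) also handles zeros gracefully
print(log_incomes)
```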

Encoding

  • Transforming data into some new representation required by the model
  • One-Hot encoding
    • Create “buckets” for every category
    • The bucket for your category has a 1, all others have a 0
    • Very common in deep learning, where categories are represented by individual output “neurons”
    Decimal   Binary   Unary      One-hot
    0         000      00000000   00000001
    1         001      00000001   00000010
    2         010      00000011   00000100
    3         011      00000111   00001000
    4         100      00001111   00010000
    5         101      00011111   00100000
    6         110      00111111   01000000
    7         111      01111111   10000000
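
A minimal sketch (assuming pandas and scikit-learn) of one-hot encoding a categorical column:

```python
# Sketch: one-hot encoding with pandas and scikit-learn.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

print(pd.get_dummies(colors, columns=["color"]))       # one 0/1 "bucket" column per category

encoder = OneHotEncoder(handle_unknown="ignore")
print(encoder.fit_transform(colors[["color"]]).toarray())   # same idea, as a NumPy array
```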

Scaling / Normalization

  • Some models prefer feature data to be normally distributed around 0 (most neural nets)
  • Most models require feature data to at least be scaled to comparable values
  • Otherwise features with larger magnitudes will have more weight than they should
  • Example: modeling age and income as features – incomes will be much higher values than ages
  • Remember to scale your results back up
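
A minimal sketch (assuming scikit-learn) of scaling the age/income example above, including scaling back up with the fitted scaler:

```python
# Sketch: scaling features to comparable ranges with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25, 40_000], [35, 60_000], [45, 90_000], [60, 150_000]])  # age, income

scaler = StandardScaler()                   # zero mean, unit variance per feature
X_scaled = scaler.fit_transform(X)
print(X_scaled)

print(MinMaxScaler().fit_transform(X))      # alternative: squash each feature into [0, 1]

print(scaler.inverse_transform(X_scaled))   # scale results back up to original units
```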

Shuffling

  • Many algorithms benefit from shuffling their training data
  • Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
    • The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
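
A minimal sketch (assuming scikit-learn) of shuffling features and labels together so the rows stay aligned:

```python
# Sketch: shuffling training rows and labels together before training.
import numpy as np
from sklearn.utils import shuffle

X = np.arange(20).reshape(10, 2)   # rows collected in some meaningful order
y = np.arange(10)

X_shuffled, y_shuffled = shuffle(X, y, random_state=42)   # rows and labels stay paired
print(y_shuffled)
```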

Term Frequency and Inverse Document Frequency (TF-IDF)

  • figures out what terms are most relevant for a document
  • Term Frequency just measures how often a word occurs in a document
    • A word that occurs frequently is probably important to that document’s meaning
  • Document Frequency is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page
    • This tells us about common words that just appear everywhere no matter what the topic, like “a”, “the”, “and”, etc.
  • So a measure of the relevancy of a word to a document might be: Term Frequency / Document Frequency (aka Term Frequency * Inverse Document Frequency)
  • That is, take how often the word appears in a document, over how often it just appears everywhere. That gives you a measure of how important and unique this word is for this document
  • In practice, TF-IDF often uses TF * the log of the Inverse Document Frequency, since word frequencies are distributed exponentially
  • n-gram
    • An n-gram is a sequence of n adjacent symbols in a particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome
    • Unigrams: single word
    • Bi-grams: two close-by terms
    • Tri-grams: three sequenced terms
  • Example:
    • The TF-IDF Matrix for unigram and bigrams of these two sentences
      • Please call the number below.
      • Please do not call us.
    • the matrix dimension would be (2, 16)
      • 2 – two input source (documents)
      • 16 – 8 unigrams (“please”, “call”, “the”, “number”, “below”, “do”, “not”, “us”) + 8 bigrams (“please call”, “call the”, “the number”, “number below”, “please do”, “do not”, “not call”, “call us”)
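
A minimal sketch (assuming scikit-learn) reproducing the (2, 16) unigram + bigram matrix from the example above:

```python
# Sketch: unigram + bigram TF-IDF matrix for the two example sentences.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Please call the number below.", "Please do not call us."]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
tfidf = vectorizer.fit_transform(docs)

print(tfidf.shape)                        # (2, 16): 2 documents, 8 unigrams + 8 bigrams
print(vectorizer.get_feature_names_out())
```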

Embeddings

  • Embeddings are numerical representations of real-world objects that machine learning (ML) and artificial intelligence (AI) systems use to understand complex knowledge domains like humans do.
  • As an example, computing algorithms understand that the difference between 2 and 3 is 1, indicating a close relationship between 2 and 3 as compared to 2 and 100.
  • However, real-world data includes more complex relationships. For example, a bird-nest and a lion-den are analogous pairs, while day-night are opposite terms.
  • Embeddings convert real-world objects into complex mathematical representations that capture inherent properties and relationships between real-world data.
  • Benefits
    • Reduce data dimensionality
    • Train large language models
    • Build innovative applications
  • relevant models/algorithms
    • Principal component analysis
    • Singular value decomposition 
    • Word2Vec / Object2Vec
    • BERT
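
A minimal illustration with made-up 4-dimensional vectors; real embeddings come from models like Word2Vec or BERT, and cosine similarity is a common way to compare them:

```python
# Sketch: comparing (made-up) embedding vectors with cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for illustration only; real models learn these values.
nest = np.array([0.9, 0.1, 0.3, 0.0])
den  = np.array([0.8, 0.2, 0.4, 0.1])
day  = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(nest, den))   # analogous concepts -> higher similarity
print(cosine_similarity(nest, day))   # unrelated concepts -> lower similarity
```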

Model Evaluation

  • Binary Model Insights
    • Metrics
      • Accuracy (ACC): (TP + TN) / (P + N), to maximize the ratio of correct predictions to total predictions
      • Precision: TP / (TP + FP), to minimize false positives
      • Recall: TP / (TP + FN), to minimize false negatives
      • False Positive Rate (FPR): FP / (FP + TN)
      • Area Under the (Receiver Operating Characteristic) Curve (AUC)
        • measures the ability of the model to predict a higher score for positive examples as compared to negative examples
        • a tradeoff between false positives and false negatives
  • Multiclass Model Insights
    • Metrics
      • F1 Score: 2TP / (2TP + FP + FN)
  • Regression Model Insights
    • Metrics
      • standard root mean square error (RMSE)
  • Cross-Validation
    • training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data
    • Good for checking whether the model is overfitting
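
A minimal sketch (assuming scikit-learn, using a built-in dataset) of k-fold cross-validation:

```python
# Sketch: 5-fold cross-validation to evaluate a model on complementary data subsets.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")   # 5 train/validate splits
print(scores, scores.mean())   # large spread or a big train/validation gap suggests overfitting
```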