Discrete Data: Integer based; often counts of some event
Continuous Data: Has an infinite number of possible values
Categorical
Qualitative data that has no inherent mathematical meaning
You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have mathematical meaning
Ordinal
A mixture of numerical and categorical data
Categorical data that has inherent mathematical meaning (an ordering), e.g., movie ratings on a 1–5 scale
Data Distributions
Normal distribution
Probability Mass Function
aka, probability function, frequency function, discrete probability density function
a function that gives the probability that a discrete random variable is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete.
Poisson Distribution
expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event.
A classic example used to motivate the Poisson distribution is the number of radioactive decay events during a fixed observation period.
Binomial Distribution
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p). A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.
Bernoulli Distribution
Special case of binomial distribution
Has a single trial (n=1)
Can think of a binomial distribution as the sum of Bernoulli distributions
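A minimal sketch of these three distributions using scipy.stats; the parameter values (n = 10, p = 0.5, mean rate = 3) are invented purely for illustration.

```python
# Illustrative only: the parameter values below are made up for demonstration.
from scipy.stats import binom, poisson, bernoulli

n, p = 10, 0.5      # binomial: 10 independent yes/no trials, success probability 0.5
lam = 3             # Poisson: mean rate of 3 events per interval

print(binom.pmf(4, n, p))       # P(exactly 4 successes in 10 trials)
print(poisson.pmf(2, lam))      # P(exactly 2 events in one interval)
print(bernoulli.pmf(1, p))      # P(success) for a single trial
print(binom.pmf(1, 1, p))       # same value: a binomial with n = 1 is a Bernoulli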
Time Series Analysis
Trends
Seasonality
Noise
Seasonality + Trends + Noise = time series
Additive model
A data model in which the effects of individual factors are differentiated and added together to model the data.
Seasonal variation is constant
Multiplicative model
time series = seasonality * trends * noise
Assumes that as the data increase, so does the seasonal pattern; most time series plots exhibit such a pattern
Seasonal variation increases as the trend increases
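As a sketch (assuming statsmodels is available), seasonal_decompose can split a series into trend, seasonal, and residual (noise) components under either model; the monthly series below is synthetic.

```python
# Sketch only: synthetic monthly data invented for illustration.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(10, 50, 48)
seasonal = 5 * np.sin(2 * np.pi * np.arange(48) / 12)
noise = np.random.normal(0, 1, 48)
series = pd.Series(trend + seasonal + noise, index=idx)

# Additive: series = trend + seasonal + residual
add = seasonal_decompose(series, model="additive")
# Multiplicative: series = trend * seasonal * residual (series must stay positive)
mult = seasonal_decompose(series, model="multiplicative")
print(add.trend.dropna().head())
```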
Confusion Matrix
Total population = P + N | Predicted Positive (PP) | Predicted Negative (PN)
Actual Positive (P)      | True positive (TP)      | False negative (FN)
Actual Negative (N)      | False positive (FP)     | True negative (TN)
aka Error Matrix, or matching matrix
Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa – both variants are found in the literature. The diagonal of the matrix therefore represents all instances that are correctly predicted.
The F-beta score with β = 1 is the F1 score: F1 = 2 × (PPV × TPR) / (PPV + TPR) = 2TP / (2TP + FP + FN), where PPV = precision and TPR = recall
The F1 score provides a balanced measure of a model's performance by considering both precision (the fraction of predicted positives that are truly positive) and recall (the fraction of actual positives that are predicted positive).
Root mean squared error (RMSE) is another accuracy measurement
Measures the differences between predicted values and actual values in a regression problem
the square root of the average squared differences between the predicted and actual values
the lower, the better
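A sketch of these metrics with scikit-learn; the labels and values below are toy data invented for illustration.

```python
# Toy labels invented for illustration only.
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, mean_squared_error)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # 2TP / (2TP + FP + FN)
print(tp, fp, fn, tn, precision, recall, f1)

# RMSE for a regression problem: square root of the mean squared difference
y_actual = [3.0, 5.0, 2.5]
y_hat = [2.5, 5.0, 4.0]
rmse = np.sqrt(mean_squared_error(y_actual, y_hat))
print(rmse)  # lower is better
```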
Cut-off (or threshold) is a value used to convert the model’s predicted probabilities or scores into binary class predictions.
Receiver Operating Characteristic Curve (ROC)
Plot of true positive rate (recall) vs. false positive rate at various threshold settings
The more the curve bends toward the upper-left, the better
Area Under the Curve (AUC)
The area under the ROC curve
probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
ROC AUC of 0.5 is a useless classifier (Random), 1.0 is perfect
P-R Curve
Precision / Recall curve
Good = higher area under the curve (PR-AUC), i.e., the more the curve bends toward the upper-right, the better
When to use
Imbalanced Datasets: When the positive class is rare, and the dataset is heavily imbalanced, the PR curve is more informative than the ROC curve.
Examples include fraud detection and disease diagnosis.
Costly False Positives: If false positives are more costly or significant than false negatives, such as in spam email detection, the PR curve is more suitable as it focuses on precision.
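A sketch of both curves and their AUC values with scikit-learn, using invented scores:

```python
# Scores invented for illustration only.
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # predicted probabilities

# ROC: true positive rate vs. false positive rate across thresholds
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))   # 0.5 = random, 1.0 = perfect

# Precision-Recall: more informative when positives are rare
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))           # PR-AUC, higher is better
```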
Feature Engineering
A feature is an individual measurable property within a recorded dataset. In machine learning and statistics, features are often called “variables” or “attributes.” Relevant features have a correlation or bearing (called feature importance) on a model’s use case.
Applying your knowledge of the data – and the model you’re using – to create better features to train your model with.
Which features should I use?
Do I need to transform these features in some way?
How do I handle missing data?
Should I create new features from the existing ones?
You can’t just throw in raw data and expect good results
The Curse of Dimensionality
Too many features can be a problem – leads to sparse data
Every feature is a new dimension
Much of feature engineering is selecting the features most relevant to the problem at hand
This often is where domain knowledge comes into play
Unsupervised dimensionality reduction techniques can also be employed to distill many features into fewer features
Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
K-Means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
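A sketch of PCA (with K-Means run on the reduced features) using scikit-learn, on random data invented for illustration:

```python
# Random data for illustration only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(100, 10)          # 100 samples, 10 features (10 dimensions)

pca = PCA(n_components=2)            # distill 10 features down to 2 components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # (100, 2)
print(pca.explained_variance_ratio_) # how much variance each component preserves

# Cluster the reduced data into 3 prototype groups
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_reduced)
print(labels[:10])
```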
Imputing Missing Data / Imputation
Mean Replacement
Replace missing values with the mean value from the rest of the column (columns, not rows! A column represents a single feature; it only makes sense to take the mean from other samples of the same feature.)
Fast & easy, won’t affect mean or sample size of overall data set
Median may be a better choice than mean when outliers are present
But it’s generally pretty terrible.
Only works on column level, misses correlations between features
Can’t use on categorical features (imputing with most frequent value can work in this case, though)
Not very accurate
Dropping
Reduces the number of records in the final data set
Machine Learning
KNN: Find K “nearest” (most similar) rows and average their values
Assumes numerical data, not categorical
Deep Learning
Build a machine learning model to impute data for your machine learning model!
Works well for categorical data. Really well. But it’s complicated.
Regression
Find linear or non-linear relationships between the missing feature and other features
Most advanced technique: MICE (Multiple Imputation by Chained Equations)
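A sketch of these imputation strategies with scikit-learn; IterativeImputer is scikit-learn's (still experimental) chained-equations imputer in the spirit of MICE, and the tiny matrix below is invented:

```python
# Toy matrix with missing values, invented for illustration.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])

# Mean (or median) replacement, column by column
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN: average the K nearest rows' values for the missing feature
print(KNNImputer(n_neighbors=2).fit_transform(X))

# Regression-based, MICE-style chained equations
print(IterativeImputer(max_iter=10, random_state=0).fit_transform(X))
```

IterativeImputer models each feature with missing values as a function of the other features, which is the regression-based idea described above.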
Unbalanced Data
Large discrepancy between “positive” and “negative” cases
“positive” means the thing you’re testing for is what happened
e.g., fraud detection: fraud is rare, and most rows will be not-fraud
SMOTE (Synthetic Minority Over-sampling Technique)
Artificially generates new samples of the minority class using nearest neighbors
Run K-nearest-neighbors of each sample of the minority class
Create a new sample from the KNN result (mean of the neighbors)
Both generates new samples and undersamples majority class
Generally better than just simple oversampling
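A sketch of this nearest-neighbor oversampling, assuming the imbalanced-learn package's SMOTE implementation and a synthetic dataset:

```python
# Synthetic imbalanced data for illustration; assumes the imbalanced-learn package.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))                       # heavily imbalanced: ~95% negative, ~5% positive

X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_res))                   # minority class synthesized up to balance
```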
Undersampling
Instead of creating more positive samples, remove “some” negative ones
Throwing data away is usually not the right answer
Adjusting thresholds
When making predictions about a classification (fraud / not fraud), you have some sort of threshold of probability at which point you’ll flag something as the positive case (fraud)
If you have too many false positives, one way to fix that is to simply increase that threshold.
Guaranteed to reduce false positives
But, could result in more false negatives
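A sketch of threshold adjustment on hypothetical predicted probabilities, showing the false-positive / false-negative trade-off:

```python
# Hypothetical predicted probabilities for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.2, 0.6, 0.4, 0.9, 0.3, 0.7, 0.55, 0.8, 0.1, 0.45])

for threshold in (0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)   # flag as positive above the cut-off
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
# Raising the threshold reduces false positives but can increase false negatives.
```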
Outliers
Variance (σ²) is simply the average of the squared differences from the mean
measures how “spread-out” the data is
Standard deviation (σ) is just the square root of the variance.
Data points that lie more than one standard deviation from the mean can be considered unusual.
You can describe how extreme a data point is by talking about how many "sigmas" it is away from the mean.
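A sketch of sigma-based outlier flagging on made-up data (the 2-sigma cutoff here is just one common choice):

```python
# Made-up data with one obvious outlier.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

mean, sigma = data.mean(), data.std()
z_scores = (data - mean) / sigma          # "how many sigmas" from the mean
outliers = data[np.abs(z_scores) > 2]     # cutoff of 2 sigmas is a common choice
print(outliers)                           # [95.]
```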
Dealing with Outliers
AWS’s Random Cut Forest algorithm creeps into many of its services – it is made for outlier detection
It takes a set of random data points, cuts them down to the same number of points, and then builds a collection of models; each model corresponds to a tree, hence the name forest.
Binning / Bucketing
Bucket observations together based on ranges of values.
Quantile binning categorizes data by their place in the data distribution
Ensures even sizes of bins
Transforms numeric data to categorical/ordinal data
Especially useful when there is uncertainty in the measurements
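A sketch with pandas, contrasting fixed-width binning (cut) with quantile binning (qcut) on made-up ages:

```python
# Made-up ages for illustration.
import pandas as pd

ages = pd.Series([22, 25, 31, 35, 41, 47, 52, 58, 63, 70])

# Fixed-width binning: bucket by ranges of values
print(pd.cut(ages, bins=3))

# Quantile binning: categorize by place in the distribution, even-sized bins
print(pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"]))
```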
Transforming
Feature data with an exponential trend may benefit from a logarithmic transform
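A small sketch of such a log transform with NumPy, on made-up income values:

```python
# Made-up, exponentially spread feature values.
import numpy as np

incomes = np.array([20_000, 45_000, 90_000, 250_000, 1_000_000], dtype=float)
log_incomes = np.log1p(incomes)   # log(1 + x) compresses the exponential spread
print(log_incomes)
```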
Encoding
Transforming data into some new representation required by the model
One-hot encoding
Create “buckets” for every category
The bucket for your category has a 1, all others have a 0
Very common in deep learning, where categories are represented by individual output “neurons”
Decimal | Binary | Unary    | One-hot
0       | 000    | 00000000 | 00000001
1       | 001    | 00000001 | 00000010
2       | 010    | 00000011 | 00000100
3       | 011    | 00000111 | 00001000
4       | 100    | 00001111 | 00010000
5       | 101    | 00011111 | 00100000
6       | 110    | 00111111 | 01000000
7       | 111    | 01111111 | 10000000
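A sketch of one-hot encoding with pandas; the column and category names are hypothetical:

```python
# Hypothetical categorical column for illustration.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One "bucket" (column) per category: 1 for the sample's category, 0 elsewhere
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
print(one_hot)
```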
Scaling / Normalization
Some models prefer feature data to be normally distributed around 0 (most neural nets)
Most models require feature data to at least be scaled to comparable values
Otherwise features with larger magnitudes will have more weight than they should
Example: modeling age and income as features – incomes will be much higher values than ages
Remember to scale your results back up
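A sketch with scikit-learn's scalers on made-up age/income features, including scaling the results back up:

```python
# Made-up age/income features with very different magnitudes.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25, 40_000], [32, 85_000], [47, 120_000], [51, 62_000]], dtype=float)

scaler = StandardScaler()                    # centers each feature around 0 with unit variance
X_std = scaler.fit_transform(X)

X_minmax = MinMaxScaler().fit_transform(X)   # squashes each feature into [0, 1]

# Remember to scale results back up when needed
X_back = scaler.inverse_transform(X_std)
print(X_std, X_minmax, X_back, sep="\n")
```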
Shuffling
Many algorithms benefit from shuffling their training data
Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
Term Frequency and Inverse Document Frequency (TF-IDF)
figures out what terms are most relevant for a document
Term Frequency just measures how often a word occurs in a document
A word that occurs frequently is probably important to that document’s meaning
Document Frequency is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page
This tells us about common words that just appear everywhere no matter what the topic, like “a”, “the”, “and”, etc.
So a measure of the relevancy of a word to a document might be: Term Frequency / Document Frequency (aka Term Frequency × Inverse Document Frequency)
That is, take how often the word appears in a document, over how often it just appears everywhere. That gives you a measure of how important and unique this word is for this document
In practice, TF-IDF often uses TF × the log of the Inverse Document Frequency, since word frequencies are distributed exponentially.
n-gram
An n-gram is a sequence of n adjacent symbols in particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome.
Unigrams: single word
Bi-grams: sequences of two adjacent terms
Tri-grams: sequences of three adjacent terms
Example:
The TF-IDF matrix for unigrams and bigrams of two sentences, as sketched below
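The original example's two sentences aren't reproduced in these notes, so the sketch below uses two placeholder sentences; TfidfVectorizer with ngram_range=(1, 2) builds the TF-IDF matrix over unigrams and bigrams:

```python
# Placeholder sentences; the original example's sentences are not reproduced here.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the quick brown fox", "the lazy brown dog"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
tfidf = vectorizer.fit_transform(sentences)

matrix = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(matrix)
```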