Python & Relevant Libraries
- Pandas: A Python library for slicing and dicing your data
- Data Frames
- Series
- Interoperates with numpy
- Matplotlib
- Seaborn
- scikit-learn: Python library for machine learning models
- Jupyter notebooks
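A minimal sketch (assuming pandas, numpy, and scikit-learn are installed) of how these pieces fit together in a notebook:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# A DataFrame is a 2-D table; each column is a Series backed by a numpy array
df = pd.DataFrame({"x": np.arange(10.0), "y": 2.0 * np.arange(10.0) + 1.0})

# scikit-learn accepts pandas/numpy data directly
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)  # ~[2.0], ~1.0
```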
Data Types
- Numerical
- Represents some sort of quantitative measurement
- Discrete Data: Integer based; often counts of some event
- Continuous Data: Has an infinite number of possible values
- Categorical
- Qualitative data that has no inherent mathematical meaning
- You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have mathematical meaning
- Ordinal
- A mixture of numerical and categorical
- Categorical data that has mathematical meaning
Data Distributions
- Normal distribution
- Probability Mass Function
- aka, probability function, frequency function, discrete probability density function
- a function that gives the probability that a discrete random variable is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete.
- Poisson Distribution
- expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event.
- A classic example used to motivate the Poisson distribution is the number of radioactive decay events during a fixed observation period.
- Binomial Distribution
- In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question with a Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p). A single success/failure experiment is called a Bernoulli trial, and a sequence of outcomes is called a Bernoulli process. For a single trial (n = 1), the binomial distribution reduces to a Bernoulli distribution.
- Bernoulli Distribution
- Special case of binomial distribution
- Has a single trial (n=1)
- Can think of a binomial distribution as the sum of Bernoulli distributions
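A short sketch (assuming scipy is available) evaluating the PMFs of the distributions above:

```python
from scipy.stats import poisson, binom, bernoulli

# Poisson: P(k events) given a known constant mean rate (e.g., decays per interval)
print(poisson.pmf(k=3, mu=5))        # P(exactly 3 events when the mean is 5)

# Binomial: P(k successes in n independent yes/no trials, each with probability p)
print(binom.pmf(k=7, n=10, p=0.5))   # P(exactly 7 heads in 10 fair flips)

# Bernoulli = binomial with a single trial (n = 1)
print(bernoulli.pmf(k=1, p=0.3))     # equals binom.pmf(k=1, n=1, p=0.3)
```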


Time Series Analysis
- Trends
- Seasonality
- Noise
- Additive model
- time series = seasonality + trends + noise
- A data model in which the effects of individual factors are differentiated and added together to model the data
- Seasonal variation is constant
- Multiplicative model
- time series = seasonality * trends * noise
- Assumes that as the data increases, so does the seasonal pattern; most time series plots exhibit such a pattern
- Seasonal variation increases as the trend increases
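A minimal decomposition sketch (statsmodels assumed; the monthly series below is made up to show a trend plus a constant seasonal swing):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: linear trend + repeating seasonal pattern
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = np.arange(48) + np.tile([5, 3, 1, 0, 1, 3, 6, 8, 9, 7, 6, 5], 4)
ts = pd.Series(values, index=idx, dtype=float)

# model="additive" when seasonal variation is constant;
# use model="multiplicative" when it grows with the trend
result = seasonal_decompose(ts, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())        # the repeating seasonal component
print(result.resid.dropna().head())  # the noise component
```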
Amazon Athena
- Interactive query service for S3 (SQL)
- Serverless
- Supported data formats
- CSV (human readable)
- JSON (human readable)
- ORC (columnar, splittable)
- Parquet (columnar, splittable)
- Avro (splittable)
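A hedged sketch of querying S3 data through Athena with boto3 (the database, table, and bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Start an interactive SQL query against data in S3; results also land in S3
resp = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page",
    QueryExecutionContext={"Database": "my_logs_db"},            # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical
)
print(resp["QueryExecutionId"])
```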
Amazon QuickSight
- Business analytics and visualizations in the cloud
- Serverless
- Data sources
- Redshift
- Aurora / RDS
- Athena
- EC2-hosted databases
- Files (S3 or on-premises)
- Excel
- CSV, TSV
- Common or extended log format
- AWS IoT Analytics
- Data preparation allows limited ETL
- SPICE
- Data sets are imported into SPICE
- Each user gets 10GB of SPICE
- Scales to hundreds of thousands of users
- Use cases
- Interactive ad-hoc exploration / visualization of data
- Dashboards and KPI’s
- Analyze / visualize data from:
- Logs in S3
- On-premise databases
- AWS (RDS, Redshift, Athena, S3)
- SaaS applications, such as Salesforce
- Any JDBC/ODBC data source
- QuickSight Q
- Machine learning-powered
- Answers business questions with Natural Language Processing
- Must set up topics associated with datasets
- QuickSight Security
- Multi-factor authentication on your account
- VPC connectivity
- Row-level security
- Column-level security (CLS) as well (Enterprise edition only)
- Private VPC access
- Elastic Network Interface, AWS Direct Connect
- QuickSight Visual Types
- AutoGraph
- Bar Charts: comparison and distribution (histograms)
- Line graphs: changes over time
- Scatter plots, heat maps: correlation
- Pie graphs: aggregation
- Tree maps: hierarchical aggregation
- Pivot tables: tabular data
- KPIs: compare key value to its target value
- Geospatial Charts (maps)
- Donut Charts: Percentage of Total Amount
- Gauge Charts: Compare values in a measure
- Word Clouds: word or phrase frequency

Elastic MapReduce (EMR)
- Managed Hadoop framework on EC2 instances
- Hadoop is an open source framework based on Java that manages the storage and processing of large amounts of data for applications.
- Hadoop Distributed File System (HDFS): a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage
- Yet Another Resource Negotiator (YARN): a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.
- MapReduce: a programming model for large-scale data processing. In the MapReduce model, subsets of larger datasets and instructions for processing the subsets are dispatched to multiple different nodes, where each subset is processed by a node in parallel with other processing jobs. After processing the results, individual subsets are combined into a smaller, more manageable dataset.
- Hadoop Common: the libraries and utilities
- Includes Spark, HBase, Presto, Flink, Hive & more
- Apache Spark: uses in-memory caching and optimized query execution for fast analytic queries against data of any size. Its foundational concept is a read-only set of data distributed over a cluster of machines, which is called a resilient distributed dataset (RDD).
- Apache HBase: An open source non-relational distributed database often paired with Hadoop
- Presto: a distributed, open-source SQL query engine used to run interactive analytical queries. It can handle queries of any size, from gigabytes to petabytes.
- Flink: a stream processing framework that can also handle batch processing. It is designed for real-time analytics and can process data as it arrives.
- Apache Hive: A data warehouse that allows programmers to work with data in HDFS using a query language called HiveQL, which is similar to SQL
- Zeppelin: Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R and more.
- Apache Ranger: an open-source framework designed to enable, monitor and manage comprehensive data security across the Hadoop platform

Aspect | Apache Hadoop | Apache Spark | Apache Flink |
---|---|---|---|
Data Processing | Mainly designed for batch processing; very efficient at processing large datasets. | Supports batch processing as well as stream processing. | Supports both batch and stream processing, with a single runtime for both. |
Stream Engine | Takes the complete dataset as input at once and produces the output. | Processes data streams in micro-batches. | A true streaming engine that uses streams for every workload: streaming, micro-batch, SQL, and batch. |
Data Flow | Data flow contains no loops; supports only linear data flow. | Represents data flow as a directed acyclic graph (DAG). | Uses a controlled cyclic dependency graph at runtime, which efficiently expresses ML algorithms. |
Computation Model | MapReduce supports the batch-oriented model. | Supports the micro-batching computational model. | Supports a continuous operator-based streaming model. |
Performance | Slower than Spark and Flink. | Faster than Hadoop, slower than Flink. | Highest of the three. |
Memory Management | Configurable; supports both dynamic and static memory management. | Recent releases provide automatic memory management. | Supports automatic memory management. |
Fault Tolerance | Highly fault-tolerant via a replication mechanism. | RDDs provide fault tolerance through lineage. | Based on Chandy-Lamport distributed snapshots, which allows high throughput. |
Scalability | Highly scalable; can scale up to tens of thousands of nodes. | Highly scalable. | Highly scalable. |
Iterative Processing | Not supported. | Supported. | Supported; iterates over data using its streaming architecture. |
Supported Languages | Java, C, C++, Python, Perl, Groovy, Ruby, etc. | Java, Python, R, Scala. | Java, Python, R, Scala. |
Cost | Uses commodity hardware, which is less expensive. | Needs lots of RAM, so cost is relatively high. | Also needs lots of RAM, so cost is relatively high. |
Abstraction | No abstraction in MapReduce. | RDD abstraction. | Dataset abstraction for batch, DataStream for streaming. |
SQL Support | Users can run SQL queries using Apache Hive. | Users can run SQL queries using Spark SQL; Hive is also supported. | Supports a Table API with SQL-like expressions; a SQL interface is planned for a future release. |
Caching | MapReduce cannot cache data. | Can cache data in memory. | Can cache data in memory. |
Hardware Requirements | Runs well on less expensive commodity hardware. | Needs higher-end hardware. | Needs higher-end hardware. |
Machine Learning | Apache Mahout is used for ML. | Strong ML support via its own MLlib library. | FlinkML is used for ML. |
High Availability | Configurable in high-availability mode. | Configurable in high-availability mode. | Configurable in high-availability mode. |
Amazon S3 Connector | Supported. | Supported. | Supported. |
Backpressure Handling | Handled through manual configuration. | Handled through manual configuration. | Handled implicitly through the system architecture. |

- An EMR Cluster
- Master node: manages the cluster
- Single EC2 instance
- Core node: Hosts HDFS data and runs tasks
- Can be scaled up & down, but with some risk
- Task node: Runs tasks, does not host data
- No risk of data loss when removing
- Good use of spot instances
- EMR Usage
- Transient vs Long-Running Clusters
- Can spin up task nodes using Spot instances for temporary capacity
- Can use reserved instances on long-running clusters to save $
- Connect directly to master to run jobs
- Submit ordered steps via the console
- Serverless
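A hedged sketch of spinning up a transient cluster and submitting a step with boto3 (the cluster name, release label, instance counts, and S3 job path are all hypothetical):

```python
import boto3

emr = boto3.client("emr")

# Transient cluster: runs its steps, then terminates (KeepJobFlowAliveWhenNoSteps=False)
resp = emr.run_job_flow(
    Name="my-transient-cluster",               # hypothetical
    ReleaseLabel="emr-6.15.0",                 # hypothetical release
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,                    # 1 master + 2 core
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "my-spark-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],  # hypothetical
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(resp["JobFlowId"])
```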
- AWS Integration
- Amazon EC2 for the instances that comprise the nodes in the cluster
- Amazon VPC to configure the virtual network in which you launch your instances
- Amazon S3 to store input and output data
- Amazon CloudWatch to monitor cluster performance and configure alarms
- AWS IAM to configure permissions
- AWS CloudTrail to audit requests made to the service
- AWS Data Pipeline to schedule and start your clusters
- EMR Storage
- HDFS
- EMR File System (EMRFS): access S3 as if it were HDFS
- [End-of-Support] EMRFS Consistent View – Optional for S3 consistency; uses DynamoDB to track consistency
- Local file system
- EBS for HDFS
- Apache Spark
- Can work with Hadoop
- Spark MLLib
- Classification: logistic regression, naïve Bayes
- Regression
- Decision trees
- Recommendation engine (ALS)
- Clustering (K-Means)
- LDA (topic modeling)
- ML workflow utilities (pipelines, feature transformation, persistence)
- SVD, PCA, statistics
- Spark Structured Streaming
- Data stream as an unbounded Input Table
- New data in stream = new rows appended to input table
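A minimal Structured Streaming sketch (pyspark assumed; the localhost socket source is hypothetical) treating the stream as an unbounded input table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unbounded-table-demo").getOrCreate()

# Each new line arriving on the socket appends a row to an unbounded input table
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")   # hypothetical source
         .option("port", 9999)
         .load())

# A running count over the unbounded table, updated as rows arrive
counts = lines.groupBy("value").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```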



- EMR Notebook
- Notebooks backed up to S3
- Provision clusters from the notebook!
- Hosted inside a VPC
- Accessed only via AWS console
- EMR Security
- IAM policies
- Kerberos (a computer-network authentication protocol that works on the basis of tickets to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.)
- SSH
- IAM roles
- Security configurations may be specified for Lake Formation
- Native integration with Apache Ranger
- EMR: Choosing Instance Types
- Master node:
- m4.large if < 50 nodes, m4.xlarge if > 50 nodes
- Core & task nodes:
- m4.large is usually good
- If cluster waits a lot on external dependencies (e.g., a web crawler), t2.medium
- Improved performance: m4.xlarge
- Computation-intensive applications: high CPU instances
- Database, memory-caching applications: high memory instances
- Network / CPU-intensive (NLP, ML) – cluster compute instances
- Accelerated Computing / AI – GPU instances (g3, g4, p2, p3)
- Spot instances
- Good choice for task nodes
- Only use on core & master if you’re testing or very cost-sensitive; you’re risking partial data loss
Feature Engineering
- A feature is an individual measurable property within a recorded dataset. In machine learning and statistics, features are often called “variables” or “attributes.” Relevant features have a correlation or bearing (called feature importance) on a model’s use case.
- Applying your knowledge of the data – and the model you’re using – to create better features to train your model with.
- Which features should I use?
- Do I need to transform these features in some way?
- How do I handle missing data?
- Should I create new features from the existing ones?
- You can’t just throw in raw data and expect good results
- The Curse of Dimensionality
- Too many features can be a problem – leads to sparse data
- Every feature is a new dimension
- Much of feature engineering is selecting the features most relevant to the problem at hand
- This often is where domain knowledge comes into play
- Unsupervised dimensionality reduction techniques can also be employed to distill many features into fewer features
- Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
- K-Means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
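A brief sketch (scikit-learn assumed) of distilling many features into fewer with PCA, as described above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))   # 200 samples, 50 features (hypothetical data)

# Project onto the 5 directions of greatest variance
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                        # (200, 5)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained
```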
Imputing Missing Data / Imputation
- Mean Replacement
- Replace missing values with the mean value from the rest of the column (columns, not rows! A column represents a single feature; it only makes sense to take the mean from other samples of the same feature.)
- Fast & easy, won’t affect mean or sample size of overall data set
- Median may be a better choice than mean when outliers are present
- But it’s generally pretty terrible.
- Only works on column level, misses correlations between features
- Can’t use on categorical features (imputing with most frequent value can work in this case, though)
- Not very accurate
- Dropping
- Reduces the number of records in the final data set
- Machine Learning
- KNN: Find K “nearest” (most similar) rows and average their values
- Assumes numerical data, not categorical
- Deep Learning
- Build a machine learning model to impute data for your machine learning model!
- Works well for categorical data. Really well. But it’s complicated.
- Regression
- Find linear or non-linear relationships between the missing feature and other features
- Most advanced technique: MICE (Multiple Imputation by Chained Equations)
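A hedged sketch of the ML-based imputers above using scikit-learn (IterativeImputer is scikit-learn's experimental MICE-style imputer; the data is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer (MICE-style) is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])

# Mean replacement: fast and easy, but ignores correlations between features
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN: average the missing feature over the k most similar rows
print(KNNImputer(n_neighbors=2).fit_transform(X))

# MICE-style: model each feature with missing values as a function of the others
print(IterativeImputer(random_state=0).fit_transform(X))
```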
- KNN: Find K “nearest” (most similar) rows and average their values
Unbalanced Data
- Large discrepancy between “positive” and “negative” cases
- “positive” means the thing you’re testing for is what happened
- e.g., fraud detection. Fraud is rare, and most rows will be not-fraud
- Techniques for re-balance
- Oversampling
- Duplicate samples from the minority class
- Synthetic Minority Over-sampling TEchnique (SMOTE)
- Artificially generate new samples of the minority class using nearest neighbors
- Run K-nearest-neighbors of each sample of the minority class
- Create a new sample from the KNN result (mean of the neighbors)
- Both generates new samples and undersamples majority class
- Generally better than just simple oversampling
- Undersampling
- Instead of creating more positive samples, remove “some” negative ones
- Throwing data away is usually not the right answer
- Adjusting thresholds
- When making predictions about a classification (fraud / not fraud), you have some sort of threshold of probability at which point you’ll flag something as the positive case (fraud)
- If you have too many false positives, one way to fix that is to simply increase that threshold.
- Guaranteed to reduce false positives
- But, could result in more false negatives
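A hedged sketch of SMOTE and threshold adjustment (assumes the third-party imbalanced-learn package; the data is synthetic):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE           # third-party: imbalanced-learn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic unbalanced data: ~5% positives (think fraud vs. not-fraud)
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print(Counter(y))

# SMOTE: synthesize new minority samples from nearest neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))   # classes are now balanced

# Threshold adjustment: raising the probability cutoff cuts false positives
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba = clf.predict_proba(X)[:, 1]
flagged_default = (proba >= 0.5).sum()
flagged_strict = (proba >= 0.8).sum()   # fewer false positives, more false negatives
print(flagged_default, flagged_strict)
```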


Outliers
- Variance (σ²) is simply the average of the squared differences from the mean
- Measures how “spread out” the data is
- Standard deviation (σ) is just the square root of the variance
- Data points lying more than one standard deviation from the mean can be considered unusual
- You can describe how extreme a data point is by “how many sigmas” it is away from the mean
- Dealing with Outliers
- AWS’s Random Cut Forest algorithm creeps into many of its services – it is made for outlier detection
- It takes a set of random data points, cuts them down to the same number of points, and builds a collection of models; each model corresponds to a decision tree, hence the name “forest”
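A small sketch of sigma-based outlier flagging with numpy (the data is made up):

```python
import numpy as np

data = np.array([12.0, 11.5, 12.3, 11.9, 12.1, 35.0, 12.2])  # 35.0 is suspect

mean = data.mean()
sigma = data.std()                    # square root of the variance
z_scores = (data - mean) / sigma      # "how many sigmas" from the mean

# Flag points more than 2 sigmas out (the threshold depends on the domain)
print(data[np.abs(z_scores) > 2])     # -> [35.]
```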

Binning / Bucketing
- Bucket observations together based on ranges of values.
- Quantile binning categorizes data by their place in the data distribution
- Ensures even sizes of bins
- Transforms numeric data to categorical/ordinal data
- Especially useful when there is uncertainty in the measurements
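A short pandas sketch contrasting fixed-width and quantile binning (the ages are made up):

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 35, 41, 48, 55, 67, 71, 84])

# Fixed-width bins: equal ranges, possibly uneven counts per bin
fixed = pd.cut(ages, bins=3)
print(fixed.value_counts())

# Quantile bins: uneven ranges, but roughly even counts per bin
quantile = pd.qcut(ages, q=3, labels=["young", "middle", "older"])
print(quantile.value_counts())   # numeric -> categorical/ordinal
```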
Transforming
- Feature data with an exponential trend may benefit from a logarithmic transform
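A one-line sketch of the log transform (numpy assumed; the values are made up):

```python
import numpy as np

# Hypothetical exponentially distributed feature (e.g., income)
income = np.array([20_000.0, 45_000.0, 120_000.0, 1_500_000.0])

# log1p compresses the long right tail (and handles zeros safely)
log_income = np.log1p(income)
print(log_income)
```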

Encoding
- Transforming data into some new representation required by the model
- One-hot encoding
- Create “buckets” for every category
- The bucket for your category has a 1, all others have a 0
- Very common in deep learning, where categories are represented by individual output “neurons”
Decimal | Binary | Unary | One-hot |
---|---|---|---|
0 | 000 | 00000000 | 00000001 |
1 | 001 | 00000001 | 00000010 |
2 | 010 | 00000011 | 00000100 |
3 | 011 | 00000111 | 00001000 |
4 | 100 | 00001111 | 00010000 |
5 | 101 | 00011111 | 00100000 |
6 | 110 | 00111111 | 01000000 |
7 | 111 | 01111111 | 10000000 |
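A quick sketch of one-hot encoding with pandas (the categories are made up):

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"])

# One column ("bucket") per category; exactly one 1 per row
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
#    blue  green  red
# 0     0      0    1
# 1     0      1    0
# 2     1      0    0
# 3     0      1    0
```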

Scaling / Normalization
- Some models prefer feature data to be normally distributed around 0 (most neural nets)
- Most models require feature data to at least be scaled to comparable values
- Otherwise features with larger magnitudes will have more weight than they should
- Example: modeling age and income as features – incomes will be much higher values than ages
- Remember to scale your results back up
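A sketch of scaling and un-scaling with scikit-learn (the data is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features with very different magnitudes: [age, income]
X = np.array([[25.0, 40_000.0], [40.0, 85_000.0], [60.0, 52_000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0, std 1

# "Remember to scale your results back up": invert the transform when done
X_back = scaler.inverse_transform(X_scaled)
print(np.allclose(X, X_back))        # True
```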
Shuffling
- Many algorithms benefit from shuffling their training data
- Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
- The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
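A minimal shuffling sketch with scikit-learn (the arrays are made up):

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)   # hypothetical features, in collection order
y = np.arange(5)                  # labels aligned with X

# Shuffle rows of X and y together so feature/label pairs stay aligned
X_shuf, y_shuf = shuffle(X, y, random_state=42)
print(y_shuf)
```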
SageMaker Ground Truth
- Ground Truth manages humans who will label your data for training purposes
- Ground Truth creates its own model as images are labeled by people
- As this model learns, only images the model isn’t sure about are sent to human labelers
- This can reduce the cost of labeling jobs by 70%
- Ground Truth Plus is a turnkey solution
- Track progress via the Ground Truth Plus Project Portal
- Get labeled data from S3 when done
- Other ways to generate training labels
- Rekognition
- AWS service for image recognition
- Automatically classify images
- Comprehend
- AWS service for text analysis and topic modeling
- Automatically classify text by topics, sentiment
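A hedged sketch of generating labels with these services via boto3 (the bucket, key, and text are hypothetical):

```python
import boto3

# Rekognition: automatically label the contents of an image in S3
rek = boto3.client("rekognition")
labels = rek.detect_labels(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "photos/cat.jpg"}},  # hypothetical
    MaxLabels=5,
)
print([label["Name"] for label in labels["Labels"]])

# Comprehend: automatically classify text, e.g., by sentiment
comp = boto3.client("comprehend")
sentiment = comp.detect_sentiment(Text="I love this product!", LanguageCode="en")
print(sentiment["Sentiment"])
```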
Amazon Mechanical Turk
- a crowdsourcing marketplace that makes it easier for individuals and businesses to outsource their processes and jobs to a distributed workforce who can perform these tasks virtually.
Term Frequency and Inverse Document Frequency (TF-IDF)
- figures out what terms are most relevant for a document
- Term Frequency just measures how often a word occurs in a document
- A word that occurs frequently is probably important to that document’s meaning
- Document Frequency is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page
- This tells us about common words that just appear everywhere no matter what the topic, like “a”, “the”, “and”, etc.
- So a measure of the relevancy of a word to a document might be: Term Frequency / Document Frequency (aka Term Frequency * Inverse Document Frequency)
- That is, take how often the word appears in a document, over how often it just appears everywhere. That gives you a measure of how important and unique this word is for this document
- In practice, TF-IDF often uses TF * the log of the Inverse Document Frequency, since word frequencies are distributed exponentially.
- n-gram
- An n-gram is a sequence of n adjacent symbols in particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome.
- Unigrams: single words
- Bigrams: sequences of two adjacent terms
- Trigrams: sequences of three adjacent terms
- Example:
- The TF-IDF matrix for unigrams and bigrams of these two sentences
- Please call the number below.
- Please do not call us.
- the matrix dimension would be (2, 16)
- 2 – two input sources (documents)
- 16 – 8 unigrams (“please”, “call”, “the”, “number”, “below”, “do”, “not”, “us”) + 8 bigrams (“please call”, “call the”, “the number”, “number below”, “please do”, “do not”, “not call”, “call us”)
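A sketch reproducing the example above with scikit-learn (the two sentences are from the notes):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Please call the number below.",
        "Please do not call us."]

# Unigrams + bigrams, as in the example above
vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vec.fit_transform(docs)

print(tfidf.shape)                  # (2, 16): 2 documents x (8 unigrams + 8 bigrams)
print(vec.get_feature_names_out())
```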

AWS Lake Formation
- Centrally govern, secure, and share data for analytics and machine learning
