12. ML – Exploratory Data Analysis

Python & Relevant Libraries

  • Pandas: A Python library for slicing and dicing your data
    • Data Frames
    • Series
    • Interoperates with NumPy
  • Matplotlib
  • Seaborn
  • scikit-learn: Python library for machine learning models
  • Jupyter notebooks
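
A minimal sketch of how these tools fit together for EDA (the CSV path and the "age" column are hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load a dataset into a pandas DataFrame (file path is hypothetical)
    df = pd.read_csv("data.csv")

    # Quick structural checks
    print(df.head())        # first few rows
    print(df.describe())    # summary statistics for numerical columns
    print(df.isna().sum())  # count of missing values per column

    # A pandas Series is a single column of a DataFrame
    ages = df["age"]        # assumes an 'age' column exists

    # Visualize a distribution with Seaborn on top of Matplotlib
    sns.histplot(ages)
    plt.show()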

Data Types

  • Numerical
    • Represents some sort of quantitative measurement
    • Discrete Data: Integer based; often counts of some event
    • Continuous Data: Has an infinite number of possible values
  • Categorical
    • Qualitative data that has no inherent mathematical meaning
    • You can assign numbers to categories in order to represent them more compactly, but the numbers don’t have mathematical meaning
  • Ordinal
    • A mixture of numerical and categorical
    • Categorical data that has mathematical meaning (e.g., a 1–5 star rating: the order matters, even if the spacing between values may not)

Data Distributions

  • Normal distribution
  • Probability Mass Function
    • aka, probability function, frequency function, discrete probability density function
    • a function that gives the probability that a discrete random variable is exactly equal to some value. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete.
  • Poisson Distribution
    • expresses the probability of a given number of events occurring in a fixed interval of time if these events occur with a known constant mean rate and independently of the time since the last event.
    • A classic example used to motivate the Poisson distribution is the number of radioactive decay events during a fixed observation period.
  • Binomial Distribution
    • In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each succeeding with probability p or failing with probability q = 1 − p
    • A single success/failure experiment is called a Bernoulli trial, and a sequence of outcomes is a Bernoulli process
    • For a single trial (n = 1), the binomial distribution is a Bernoulli distribution
  • Bernoulli Distribution
    • Special case of binomial distribution
    • Has a single trial (n=1)
    • Can think of a binomial random variable as the sum of n independent Bernoulli trials
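
These PMFs can be evaluated directly with SciPy; a small sketch (parameter values chosen arbitrarily):

    from scipy import stats

    # Poisson: probability of exactly 3 events when the mean rate is 5 per interval
    p3 = stats.poisson.pmf(k=3, mu=5)

    # Binomial: probability of exactly 7 successes in n=10 trials with p=0.5
    b7 = stats.binom.pmf(k=7, n=10, p=0.5)

    # Bernoulli is the n=1 special case of the binomial
    bern = stats.bernoulli.pmf(k=1, p=0.5)  # equals stats.binom.pmf(k=1, n=1, p=0.5)

    print(p3, b7, bern)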

Time Series Analysis

  • Trends
  • Seasonality
  • Noise
  • Seasonality + Trends + Noise = time series
    • Additive model
      • A data model in which the effects of individual factors are differentiated and added together to model the data.
    • Seasonal variation is constant
  • Seasonality * Trends * Noise = time series
    • Multiplicative model
      • assumes that as the data increase, so does the seasonal pattern. Most time series plots exhibit such a pattern
    • Seasonal variation increases as the trend increases
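
A sketch of decomposing a series into these components with statsmodels (the monthly data is synthetic, for illustration only):

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Hypothetical monthly series: upward trend plus a repeating seasonal pattern
    idx = pd.date_range("2020-01-01", periods=48, freq="MS")
    trend = np.arange(48, dtype=float)
    seasonal = np.tile([5, 3, 1, 0, 1, 3, 5, 7, 9, 10, 9, 7], 4)
    ts = pd.Series(trend + seasonal, index=idx)

    # Additive model: seasonal variation is constant
    result = seasonal_decompose(ts, model="additive")

    # Multiplicative model: seasonal variation grows with the trend
    # result = seasonal_decompose(ts, model="multiplicative")

    print(result.trend.dropna().head())   # trend component
    print(result.seasonal.head())         # seasonal component
    print(result.resid.dropna().head())   # noise / residual component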

Amazon Athena

  • Interactive query service for S3 (SQL)
  • Serverless
  • Supported data formats
    • CSV (human readable)
    • JSON (human readable)
    • ORC (columnar, splittable)
    • Parquet (columnar, splittable)
    • Avro (splittable)
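
A hedged sketch of running an Athena query from Python with boto3 (the database, table, and bucket names are hypothetical):

    import boto3

    athena = boto3.client("athena")

    # Start a SQL query against data in S3 (names are hypothetical)
    response = athena.start_query_execution(
        QueryString="SELECT * FROM my_table LIMIT 10",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
    )

    # Athena is asynchronous: poll for completion, then fetch results
    query_id = response["QueryExecutionId"]
    results = athena.get_query_results(QueryExecutionId=query_id)  # once the query succeeds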

Amazon QuickSight

  • Business analytics and visualizations in the cloud
  • Serverless
  • Data sources
    • Redshift
    • Aurora / RDS
    • Athena
    • EC2-hosted databases
    • Files (S3 or on-premises)
    • Excel
    • CSV, TSV
    • Common or extended log format
    • AWS IoT Analytics
    • Data preparation allows limited ETL
  • SPICE (Super-fast, Parallel, In-memory Calculation Engine)
    • Data sets are imported into SPICE
    • Each user gets 10GB of SPICE
    • Scales to hundreds of thousands of users
  • Use cases
    • Interactive ad-hoc exploration / visualization of data
    • Dashboards and KPIs
    • Analyze / visualize data from:
      • Logs in S3
      • On-premises databases
      • AWS (RDS, Redshift, Athena, S3)
      • SaaS applications, such as Salesforce
      • Any JDBC/ODBC data source
  • QuickSight Q
    • Machine learning-powered
    • Answers business questions with Natural Language Processing
    • Must set up topics associated with datasets
  • QuickSight Security
    • Multi-factor authentication on your account
    • VPC connectivity
    • Row-level security
      • Column-level security (CLS) is also available – Enterprise edition only
    • Private VPC access
      • Elastic Network Interface, AWS Direct Connect
  • QuickSight Visual Types
    • AutoGraph: automatically selects an appropriate visual type for your data
    • Bar Charts: comparison and distribution (histograms)
    • Line graphs: changes over time
    • Scatter plots, heat maps: correlation
    • Pie graphs: aggregation
    • Tree maps: hierarchical aggregation
    • Pivot tables: tabular data
    • KPIs: compare a key value to its target value
    • Geospatial charts (maps)
    • Donut charts: percentage of a total amount
    • Gauge charts: compare values in a measure
    • Word clouds: word or phrase frequency

Elastic MapReduce (EMR)

  • Managed Hadoop framework on EC2 instances
    • Hadoop is an open source framework based on Java that manages the storage and processing of large amounts of data for applications.
      • Hadoop Distributed File System (HDFS): a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage
      • Yet Another Resource Negotiator (YARN): a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.
      • MapReduce: a programming model for large-scale data processing. In the MapReduce model, subsets of larger datasets and instructions for processing the subsets are dispatched to multiple different nodes, where each subset is processed by a node in parallel with other processing jobs. After processing the results, individual subsets are combined into a smaller, more manageable dataset.
      • Hadoop Common: the shared libraries and utilities used by the other Hadoop modules
  • Includes Spark, HBase, Presto, Flink, Hive & more
    • Apache Spark: uses in-memory caching and optimized query execution for fast analytic queries against data of any size. Its foundational concept is a read-only set of data distributed over a cluster of machines, which is called a resilient distributed dataset (RDD). 
    • Apache HBase: An open source non-relational distributed database often paired with Hadoop
    • Presto: a distributed, open-source SQL query engine used to run interactive analytical queries. It can handle queries of any size, from gigabytes to petabytes.
    • Flink: a stream processing framework that can also handle batch processing. It is designed for real-time analytics and can process data as it arrives.
    • Apache Hive: A data warehouse that allows programmers to work with data in HDFS using a query language called HiveQL, which is similar to SQL
    • Zeppelin: Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R and more.
    • Apache Ranger: an open-source framework designed to enable, monitor and manage comprehensive data security across the Hadoop platform
  • Hadoop vs. Spark vs. Flink
    • Data processing: Hadoop is designed mainly for batch processing and is very efficient on large datasets. Spark supports both batch and stream processing. Flink also supports both, with a single runtime for batch and streaming.
    • Stream engine: Hadoop takes the complete dataset as input at once and produces the output. Spark processes data streams in micro-batches. Flink is a true streaming engine, using streams for every workload: streaming, micro-batch, SQL, batch.
    • Data flow: Hadoop's data flow is linear, with no loops. Spark represents data flow as a directed acyclic graph (DAG). Flink uses a controlled cyclic dependency graph at run time, which efficiently expresses ML algorithms.
    • Computation model: Hadoop MapReduce is batch-oriented. Spark uses a micro-batching model. Flink uses a continuous, operator-based streaming model.
    • Performance: Hadoop is slower than Spark and Flink; Spark is faster than Hadoop but slower than Flink; Flink is the fastest of the three.
    • Memory management: Hadoop's memory management is configurable (dynamic or static). Recent Spark releases have automatic memory management. Flink supports automatic memory management.
    • Fault tolerance: Hadoop is highly fault-tolerant via replication. Spark RDDs provide fault tolerance through lineage. Flink uses Chandy-Lamport distributed snapshots, which allow high throughput.
    • Scalability: all three are highly scalable; Hadoop can scale to tens of thousands of nodes.
    • Iterative processing: Hadoop does not support iterative processing. Spark does. Flink supports it and iterates over data using its streaming architecture.
    • Supported languages: Hadoop supports Java, C, C++, Python, Perl, Groovy, Ruby, etc. Spark and Flink support Java, Python, R, and Scala.
    • Cost: Hadoop uses less expensive commodity hardware. Spark and Flink need lots of RAM, so their cost is relatively high.
    • Abstraction: Hadoop MapReduce has no abstraction. Spark has the RDD abstraction. Flink has the Dataset abstraction for batch and DataStreams for streaming.
    • SQL support: Hadoop via Apache Hive. Spark via Spark SQL (it also supports Hive). Flink offers a Table API similar to SQL expressions; a SQL interface is planned for a future release.
    • Caching: Hadoop MapReduce cannot cache data. Spark and Flink can cache data in memory.
    • Hardware requirements: Hadoop runs well on less expensive commodity hardware. Spark and Flink need higher-end hardware.
    • Machine learning: Hadoop uses Apache Mahout for ML. Spark implements ML algorithms with its own MLlib. Flink uses its FlinkML library.
    • High availability: all three are configurable in high-availability mode.
    • Amazon S3 connector: all three provide an Amazon S3 connector.
    • Backpressure handling: Hadoop and Spark handle backpressure through manual configuration. Flink handles it implicitly through its system architecture.
  • An EMR Cluster
    • Master node: manages the cluster
      • Single EC2 instance
    • Core node: Hosts HDFS data and runs tasks
      • Can be scaled up & down, but with some risk
    • Task node: Runs tasks, does not host data
      • No risk of data loss when removing
      • Good use of spot instances
  • EMR Usage
    • Transient vs Long-Running Clusters
      • Can spin up task nodes using Spot instances for temporary capacity
      • Can use reserved instances on long-running clusters to save $
    • Connect directly to master to run jobs
    • Submit ordered steps via the console
    • EMR Serverless option: run workloads without managing clusters
  • AWS Integration
    • Amazon EC2 for the instances that comprise the nodes in the cluster
    • Amazon VPC to configure the virtual network in which you launch your instances
    • Amazon S3 to store input and output data
    • Amazon CloudWatch to monitor cluster performance and configure alarms
    • AWS IAM to configure permissions
    • AWS CloudTrail to audit requests made to the service
    • AWS Data Pipeline to schedule and start your clusters
  • EMR Storage
    • HDFS
    • EMR File System (EMRFS): access S3 as if it were HDFS
      • [End-of-Support] EMRFS Consistent View – optional S3 consistency; uses DynamoDB to track consistency
    • Local file system
    • EBS for HDFS
  • Apache Spark
    • Can work with Hadoop
    • Spark MLlib
      • Classification: logistic regression, naïve Bayes
      • Regression
      • Decision trees
      • Recommendation engine (ALS)
      • Clustering (K-Means)
      • LDA (topic modeling)
      • ML workflow utilities (pipelines, feature transformation, persistence)
      • SVD, PCA, statistics
    • Spark Structured Streaming
      • Models a data stream as an unbounded input table
      • New data in the stream = new rows appended to the input table (see the sketch at the end of this section)
  • EMR Notebook
    • Notebooks backed up to S3
    • Provision clusters from the notebook!
    • Hosted inside a VPC
    • Accessed only via AWS console
  • EMR Security
    • IAM policies
    • Kerberos (a computer-network authentication protocol that works on the basis of tickets to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.)
    • SSH
    • IAM roles
    • Security configurations may be specified for Lake Formation
    • Native integration with Apache Ranger
  • EMR: Choosing Instance Types
    • Master node:
      • m4.large if < 50 nodes, m4.xlarge if > 50 nodes
    • Core & task nodes:
      • m4.large is usually good
      • If cluster waits a lot on external dependencies (e.g., a web crawler), t2.medium
      • Improved performance: m4.xlarge
      • Computation-intensive applications: high CPU instances
      • Database, memory-caching applications: high memory instances
      • Network / CPU-intensive (NLP, ML) – cluster compute instances
      • Accelerated Computing / AI – GPU instances (g3, g4, p2, p3)
    • Spot instances
      • Good choice for task nodes
      • Only use on core & master if you’re testing or very cost-sensitive; you’re risking partial data loss
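
A minimal Spark Structured Streaming sketch of the unbounded-input-table model described above (the S3 path and schema are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window

    spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

    # Treat new files landing in S3 as rows appended to an unbounded input table
    # (path and schema are hypothetical)
    stream = (spark.readStream
              .format("json")
              .schema("user STRING, amount DOUBLE, ts TIMESTAMP")
              .load("s3://my-bucket/incoming/"))

    # Aggregate over the growing table as new rows arrive
    counts = stream.groupBy(window("ts", "10 minutes")).count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()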

Feature Engineering

  • A feature is an individual measurable property within a recorded dataset. In machine learning and statistics, features are often called “variables” or “attributes.” Relevant features have a correlation or bearing (called feature importance) on a model’s use case.
  • Applying your knowledge of the data – and the model you’re using – to create better features to train your model with.
    • Which features should I use?
    • Do I need to transform these features in some way?
    • How do I handle missing data?
    • Should I create new features from the existing ones?
  • You can’t just throw in raw data and expect good results
  • The Curse of Dimensionality
    • Too many features can be a problem – leads to sparse data
    • Every feature is a new dimension
    • Much of feature engineering is selecting the features most relevant to the problem at hand
    • This often is where domain knowledge comes into play
    • Unsupervised dimensionality reduction techniques can also be employed to distill many features into fewer features
      • Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing (see the sketch after this list)
      • K-Means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.
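
A short sketch of PCA as an unsupervised dimensionality reduction step with scikit-learn (the feature matrix is random, for illustration only):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical feature matrix: 100 samples, 10 features
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 10))

    # Distill 10 features down to 3 principal components
    pca = PCA(n_components=3)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (100, 3)
    print(pca.explained_variance_ratio_)  # variance captured by each component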

Imputing Missing Data / Imputation

  • Mean Replacement
    • Replace missing values with the mean value from the rest of the column (columns, not rows! A column represents a single feature; it only makes sense to take the mean from other samples of the same feature.)
    • Fast & easy, won’t affect mean or sample size of overall data set
    • Median may be a better choice than mean when outliers are present
    • But it’s generally pretty terrible.
      • Only works on column level, misses correlations between features
      • Can’t use on categorical features (imputing with most frequent value can work in this case, though)
      • Not very accurate
  • Dropping
    • Dropping rows (or columns) with missing values reduces the size of the final dataset
  • Machine Learning
    • KNN: Find K “nearest” (most similar) rows and average their values (see the sketch after this list)
      • Assumes numerical data, not categorical
    • Deep Learning
      • Build a machine learning model to impute data for your machine learning model!
      • Works well for categorical data. Really well. But it’s complicated.
    • Regression
      • Find linear or non-linear relationships between the missing feature and other features
      • Most advanced technique: MICE (Multiple Imputation by Chained Equations)
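
A hedged sketch of mean and KNN imputation with scikit-learn (the toy matrix is purely illustrative):

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    # Toy feature matrix with missing values (np.nan)
    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, 6.0],
                  [4.0, np.nan]])

    # Mean replacement: column-wise mean (a column is a single feature)
    mean_imputer = SimpleImputer(strategy="mean")  # strategy="median" for outliers,
    X_mean = mean_imputer.fit_transform(X)         # "most_frequent" for categorical

    # KNN imputation: average the values of the k most similar rows
    knn_imputer = KNNImputer(n_neighbors=2)
    X_knn = knn_imputer.fit_transform(X)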

Unbalanced Data

  • Large discrepancy between “positive” and “negative” cases
    • “positive” means the thing you’re testing for is what happened
    • e.g., fraud detection: fraud is rare, and most rows will be not-fraud
  • Techniques for re-balancing
    • Oversampling
      • Duplicate samples from the minority class
      • SMOTE (Synthetic Minority Over-sampling TEchnique)
        • Artificially generates new samples of the minority class using nearest neighbors:
          • Run K-nearest-neighbors on each sample of the minority class
          • Create a new sample from the KNN result (mean of the neighbors)
        • Both generates new samples and undersamples the majority class
        • Generally better than simple oversampling (see the sketch after this section)
    • Undersampling
      • Instead of creating more positive samples, remove “some” negative ones
      • Throwing data away is usually not the right answer
  • Adjusting thresholds
    • When making predictions about a classification (fraud / not fraud), you have some sort of threshold of probability at which point you’ll flag something as the positive case (fraud)
    • If you have too many false positives, one way to fix that is to simply increase that threshold.
      • Guaranteed to reduce false positives
      • But, could result in more false negatives
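
A sketch of SMOTE using the third-party imbalanced-learn package (the dataset is synthetic):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Synthetic unbalanced dataset: ~95% negative, ~5% positive
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
    print(Counter(y))  # heavily skewed toward class 0

    # SMOTE synthesizes new minority samples from nearest neighbors
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y_res))  # classes now balanced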

Outliers

  • Variance (σ²) is the average of the squared differences from the mean
    • Measures how “spread out” the data is
  • Standard deviation (σ) is the square root of the variance
    • Data points that lie more than one standard deviation from the mean can be considered unusual
  • You can describe how extreme a data point is by how many “sigmas” it is away from the mean (see the sketch below)
  • Dealing with Outliers
    • AWS’s Random Cut Forest algorithm appears in many of its services – it is made for outlier detection
    • It takes a set of random data points, cuts them down to the same number of points, and builds a collection of models; each model corresponds to a decision tree, hence the name “forest”
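
A simple sigma-based outlier filter with NumPy (the 3σ threshold is a common but arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=50, scale=10, size=1000)
    data = np.append(data, [150, -40])  # inject two obvious outliers

    mean, sigma = data.mean(), data.std()

    # Keep points within 3 standard deviations of the mean
    z_scores = (data - mean) / sigma
    filtered = data[np.abs(z_scores) < 3]

    print(len(data) - len(filtered), "outliers removed")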

Binning / Bucketing

  • Bucket observations together based on ranges of values.
  • Quantile binning categorizes data by their place in the data distribution
    • Ensures even sizes of bins
  • Transforms numeric data to categorical/ordinal data
  • Especially useful when there is uncertainty in the measurements
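
A sketch of fixed-width vs. quantile binning with pandas (the values are arbitrary):

    import pandas as pd

    ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 67, 73, 80])

    # Fixed-width bins: equal ranges, possibly uneven counts per bin
    fixed = pd.cut(ages, bins=4)

    # Quantile bins: each bin gets (roughly) the same number of observations
    quantile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

    print(fixed.value_counts())
    print(quantile.value_counts())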

Transforming

  • Feature data with an exponential trend may benefit from a logarithmic transform
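
For example, a logarithmic transform with NumPy (the values are illustrative; log1p is assumed here to sidestep log(0)):

    import numpy as np

    # Hypothetical exponentially distributed feature (e.g., income, page hits)
    values = np.array([1, 10, 100, 1000, 10000], dtype=float)

    # Logarithmic transform compresses the exponential trend
    log_values = np.log1p(values)  # log(1 + x) avoids log(0) for zero entries
    print(log_values)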

Encoding

  • Transforming data into some new representation required by the model
  • One-hot encoding
    • Create “buckets” for every category
    • The bucket for your category has a 1, all others have a 0
    • Very common in deep learning, where categories are represented by individual output “neurons”
  Decimal | Binary | Unary    | One-hot
  0       | 000    | 00000000 | 00000001
  1       | 001    | 00000001 | 00000010
  2       | 010    | 00000011 | 00000100
  3       | 011    | 00000111 | 00001000
  4       | 100    | 00001111 | 00010000
  5       | 101    | 00011111 | 00100000
  6       | 110    | 00111111 | 01000000
  7       | 111    | 01111111 | 10000000
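
A one-hot encoding sketch with pandas (the "color" column is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # One bucket (column) per category; the sample's category gets a 1, others 0
    one_hot = pd.get_dummies(df["color"], dtype=int)
    print(one_hot)
    #    blue  green  red
    # 0     0      0    1
    # 1     0      1    0
    # 2     1      0    0
    # 3     0      1    0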

Scaling / Normalization

  • Some models prefer feature data to be normally distributed around 0 (most neural nets)
  • Most models require feature data to at least be scaled to comparable values
  • Otherwise features with larger magnitudes will have more weight than they should
  • Example: modeling age and income as features – incomes will be much higher values than ages
  • Remember to scale your results back up
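
A scaling sketch with scikit-learn (the age/income values are illustrative):

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # Age and income on wildly different scales
    X = np.array([[25, 40_000],
                  [35, 70_000],
                  [50, 120_000]], dtype=float)

    # Standardize to zero mean, unit variance (what most neural nets prefer)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Remember to scale results back up when interpreting predictions
    X_original = scaler.inverse_transform(X_scaled)

    # Alternative: squash each feature into [0, 1]
    X_minmax = MinMaxScaler().fit_transform(X)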

Shuffling

  • Many algorithms benefit from shuffling their training data
  • Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
    • The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
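
A shuffling sketch (either pandas or scikit-learn works; the DataFrame is hypothetical):

    import pandas as pd
    from sklearn.utils import shuffle

    df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

    # Shuffle rows so the model can't learn from collection order
    df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

    # Equivalent with scikit-learn, shuffling features and labels together
    X_shuffled, y_shuffled = shuffle(df[["x"]], df["y"], random_state=42)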

SageMaker Ground Truth

  • Ground Truth manages humans who will label your data for training purposes
  • Ground Truth creates its own model as images are labeled by people
  • As this model learns, only images the model isn’t sure about are sent to human labelers
  • This can reduce the cost of labeling jobs by 70%
  • Ground Truth Plus is a turnkey solution
    • track progress via the Ground Truth Plus Project Portal
    • Get labeled data from S3 when done
  • Other ways to generate training labels
    • Rekognition
      • AWS service for image recognition
      • Automatically classify images
    • Comprehend
      • AWS service for text analysis and topic modeling
      • Automatically classify text by topics, sentiment

Amazon Mechanical Turk

  • a crowdsourcing marketplace that makes it easier for individuals and businesses to outsource their processes and jobs to a distributed workforce who can perform these tasks virtually.

Term Frequency and Inverse Document Frequency (TF-IDF)

  • figures out what terms are most relevant for a document
  • Term Frequency just measures how often a word occurs in a document
    • A word that occurs frequently is probably important to that document’s meaning
  • Document Frequency is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page
    • This tells us about common words that just appear everywhere no matter what the topic, like “a”, “the”, “and”, etc.
  • So a measure of the relevancy of a word to a document might be: Term Frequency / Document Frequency (a.k.a. Term Frequency * Inverse Document Frequency)
  • That is, take how often the word appears in a document, over how often it just appears everywhere. That gives you a measure of how important and unique this word is for this document
  • In practice, TF-IDF often uses TF * log(Inverse Document Frequency), since word frequencies are distributed exponentially
  • n-gram
    • An n-gram is a sequence of n adjacent symbols in particular order. The symbols may be n adjacent letters (including punctuation marks and blanks), syllables, or rarely whole words found in a language dataset; or adjacent phonemes extracted from a speech-recording dataset, or adjacent base pairs extracted from a genome.
    • Unigrams: single words
    • Bigrams: sequences of two adjacent terms
    • Trigrams: sequences of three adjacent terms
  • Example:
    • The TF-IDF matrix for unigrams and bigrams of these two sentences:
      • Please call the number below.
      • Please do not call us.
    • the matrix dimension would be (2, 16)
      • 2 – two input sources (documents)
      • 16 – 8 unigrams (“please”, “call”, “the”, “number”, “below”, “do”, “not”, “us”) + 8 bigrams (“please call”, “call the”, “the number”, “number below”, “please do”, “do not”, “not call”, “call us”)
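
This example can be reproduced with scikit-learn's TfidfVectorizer (it lowercases and strips punctuation by default):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["Please call the number below.",
            "Please do not call us."]

    # ngram_range=(1, 2) produces both unigrams and bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(docs)

    print(tfidf.shape)                         # (2, 16)
    print(vectorizer.get_feature_names_out())  # 8 unigrams + 8 bigrams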

AWS Lake Formation

  • Centrally govern, secure, and share data for analytics and machine learning