12. ML – Exploratory Data Analysis

Python & relevant libraries

  • Pandas: A Python library for slicing and dicing your data
    • Data Frames
    • Series
    • Interoperates with numpy
  • Matplotlib
  • Seaborn
  • scikit-learn: Python library for machine learning models
  • Jupyter notebooks
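A minimal EDA sketch using these libraries; the file name (data.csv) and column name (age) are made up for illustration:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load a dataset into a DataFrame (hypothetical file and columns)
df = pd.read_csv("data.csv")

# Quick structural checks
print(df.head())          # first few rows
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column

# Distribution of one numeric column, plus pairwise correlations
sns.histplot(df["age"], bins=20)
plt.show()
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```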

Amazon Athena

  • Interactive query service for S3 (SQL)
  • Serverless
  • Supports data formats
    • CSV (human readable)
    • JSON (human readable)
    • ORC (columnar, splittable)
    • Parquet (columnar, splittable)
    • Avro (splittable)
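A hedged sketch of running an Athena SQL query from Python with boto3; the database, table, and S3 output bucket names are hypothetical:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against data in S3 (database, table, and output bucket are made up)
resp = athena.start_query_execution(
    QueryString="SELECT label, COUNT(*) AS n FROM my_table GROUP BY label",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```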

Amazon QuickSight

  • Business analytics and visualizations in the cloud
  • Serverless
  • Data sources
    • Redshift
    • Aurora / RDS
    • Athena
    • EC2-hosted databases
    • Files (S3 or on-premises)
    • Excel
    • CSV, TSV
    • Common or extended log format
    • AWS IoT Analytics
    • Data preparation allows limited ETL
  • SPICE
    • Data sets are imported into SPICE
    • Each user gets 10GB of SPICE
    • Scales to hundreds of thousands of users
  • Use cases
    • Interactive ad-hoc exploration / visualization of data
    • Dashboards and KPIs
    • Analyze / visualize data from:
      • Logs in S3
      • On-premise databases
      • AWS (RDS, Redshift, Athena, S3)
      • SaaS applications, such as Salesforce
      • Any JDBC/ODBC data source
  • QuickSight Q
    • Machine learning-powered
    • Answers business questions with Natural Language Processing
    • Must set up topics associated with datasets
  • QuickSight Security
    • Multi-factor authentication on your account
    • VPC connectivity
    • Row-level security
      • Column-level security (CLS) too – Enterprise edition only
    • Private VPC access
      • Elastic Network Interface, AWS Direct Connect
  • QuickSight Visual Types
    • AutoGraph (automatically selects the most appropriate visual type)
    • Bar Charts: comparison and distribution (histograms)
      • A histogram is a type of chart that displays the distribution of numerical data by dividing it into intervals or bins. Each bar represents the frequency or count of data points falling within each interval, providing insights into the data’s distribution and density.
    • Line graphs: changes over time
    • Scatter plot, heat maps: correlation
      • A heatmap is a visualization method that uses color gradients to represent values within a matrix. It displays data in a two-dimensional format where color intensity indicates the magnitude of values
    • Pie graphs: aggregation
    • Tree maps: hierarchical aggregation
    • Pivot tables: tabular data
    • KPIs: compare key value to its target value
    • Geospatial Charts (maps)
    • Donut Charts: Percentage of Total Amount
    • Gauge Charts: Compare values in a measure
    • Word Clouds: word or phrase frequency
    • Radar Chart
    • Sankey diagrams: show flows from one category to another, or paths from one stage to the next
    • Waterfall chart: visualize a sequential summation as values are added or subtracted
    • Density Plot (not provided by QuickSight)
      • aka Kernel Density Plot or Density Trace Graph
      • visualises the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise. The peaks of a Density Plot help display where values are concentrated over the interval.
    • Bar graph vs. histogram
      • Bar graph: graphical representation of categorical data; equal space between each pair of consecutive bars; the height of the bars shows the frequency, and all bars have the same width
      • Histogram: graphical representation of quantitative data (the x-axis is numeric or ordinal); no space between consecutive bars; the area of each bar shows the frequency, and bar widths need not be the same
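A small matplotlib/seaborn sketch on made-up data contrasting a bar chart (categorical), a histogram (binned numeric), and a density/KDE plot (kernel-smoothed numeric):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=1000)   # numeric data -> histogram / KDE
categories = {"A": 30, "B": 45, "C": 25}           # categorical data -> bar chart

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Bar chart: one bar per category, with gaps between bars
axes[0].bar(list(categories.keys()), list(categories.values()))
axes[0].set_title("Bar chart")

# Histogram: numeric data divided into bins, no gaps between bars
axes[1].hist(values, bins=20)
axes[1].set_title("Histogram")

# Density plot: kernel-smoothed distribution of the same numeric data
sns.kdeplot(values, ax=axes[2])
axes[2].set_title("Density (KDE)")

plt.tight_layout()
plt.show()
```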

Elastic MapReduce (EMR)

  • Managed Hadoop framework on EC2 instances
    • Hadoop is an open source framework based on Java that manages the storage and processing of large amounts of data for applications.
      • Hadoop Distributed File System (HDFS): a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage
      • Yet Another Resource Negotiator (YARN): a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.
      • MapReduce: a programming model for large-scale data processing. In the MapReduce model, subsets of a larger dataset, along with instructions for processing them, are dispatched to multiple nodes, and each subset is processed in parallel with the other processing jobs. The results from the individual subsets are then combined into a smaller, more manageable dataset (a toy sketch follows this list).
      • Hadoop Common: the libraries and utilities
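To make the MapReduce model concrete, here is a toy, single-process word-count sketch in plain Python that mimics the map, shuffle, and reduce phases; a real Hadoop job would distribute these steps across nodes:

```python
from collections import defaultdict

documents = ["the cat sat", "the dog sat", "the cat ran"]

# Map: emit (key, 1) pairs from each input record
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group values by key (done by the framework in Hadoop)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into a single result
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
```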
  • Includes Spark, HBase, Presto, Flink, Hive & more
    • Apache Spark: uses in-memory caching and optimized query execution for fast analytic queries against data of any size. Its foundational concept is the resilient distributed dataset (RDD): a read-only set of data distributed over a cluster of machines (a toy PySpark sketch follows the comparison below).
    • Apache HBase: An open source non-relational distributed database often paired with Hadoop
    • Presto: a distributed, open-source SQL query engine used to run interactive analytical queries. It can handle queries of any size, from gigabytes to petabytes.
    • Flink: a stream processing framework that can also handle batch processing. It is designed for real-time analytics and can process data as it arrives.
    • Apache Hive: A data warehouse that allows programmers to work with data in HDFS using a query language called HiveQL, which is similar to SQL
    • Zeppelin: Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R and more.
    • Apache Ranger: an open-source framework designed to enable, monitor and manage comprehensive data security across the Hadoop platform
  • Hadoop vs. Spark vs. Flink
    • Data processing: Hadoop is designed mainly for batch processing and is very efficient on large datasets; Spark supports batch as well as stream processing; Flink supports both, with a single runtime for batch and streaming.
    • Stream engine: Hadoop takes the complete dataset as input at once and produces the output; Spark processes data streams in micro-batches; Flink is a true streaming engine that uses streams for all workloads (streaming, micro-batch, SQL, batch).
    • Data flow: Hadoop's data flow contains no loops (linear); Spark supports cyclic data flow, represented as a directed acyclic graph (DAG); Flink uses a controlled cyclic dependency graph at run time, which suits ML algorithms.
    • Computation model: Hadoop MapReduce uses a batch-oriented model; Spark uses a micro-batching model; Flink uses a continuous, operator-based streaming model.
    • Performance: Hadoop is slower than Spark and Flink; Spark is faster than Hadoop but slower than Flink; Flink has the highest performance of the three.
    • Memory management: Hadoop's memory management is configurable (static or dynamic); recent Spark releases have automatic memory management; Flink supports automatic memory management.
    • Fault tolerance: Hadoop is highly fault-tolerant via replication; Spark RDDs provide fault tolerance through lineage; Flink uses Chandy-Lamport distributed snapshots, giving high throughput.
    • Scalability: all three are highly scalable; Hadoop scales to tens of thousands of nodes.
    • Iterative processing: Hadoop does not support it; Spark supports it; Flink supports it and iterates data through its streaming architecture.
    • Supported languages: Hadoop: Java, C, C++, Python, Perl, Groovy, Ruby, etc.; Spark: Java, Python, R, Scala; Flink: Java, Python, R, Scala.
    • Cost: Hadoop uses inexpensive commodity hardware; Spark and Flink need lots of RAM, so cost is relatively high.
    • Abstraction: MapReduce has no abstraction; Spark has the RDD abstraction; Flink has Dataset (batch) and DataStream abstractions.
    • SQL support: Hadoop via Apache Hive; Spark via Spark SQL (also supports Hive); Flink via the Table API, which offers SQL-like expressions.
    • Caching: MapReduce cannot cache data; Spark can cache data in memory; Flink can also cache data in memory.
    • Hardware requirements: Hadoop runs well on less expensive commodity hardware; Spark and Flink need higher-end hardware.
    • Machine learning: Hadoop uses Apache Mahout; Spark implements ML algorithms with its own ML libraries (MLlib); Flink uses the FlinkML library.
    • High availability: all three can be configured in high-availability mode.
    • Amazon S3 connector: all three provide an Amazon S3 connector.
    • Backpressure handling: Hadoop and Spark handle backpressure through manual configuration; Flink handles it implicitly through its system architecture.
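A toy PySpark RDD word count illustrating the RDD concept and in-memory caching; on EMR the input would usually come from S3 or HDFS rather than an in-memory list:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Build an RDD from an in-memory collection
lines = sc.parallelize(["the cat sat", "the dog sat", "the cat ran"])

# Transformations are lazy; cache() keeps the intermediate RDD in memory for reuse
words = lines.flatMap(lambda line: line.split()).cache()

# Classic word count: map to (word, 1), then reduce by key
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())

spark.stop()
```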
  • An EMR Cluster
    • Master node: manages the cluster
      • Single EC2 instance
    • Core node: Hosts HDFS data and runs tasks
      • Can be scaled up & down, but with some risk
    • Task node: Runs tasks, does not host data
      • No risk of data loss when removing
      • Good use of spot instances
  • EMR Usage
    • Transient vs Long-Running Clusters
      • Can spin up task nodes using Spot instances for temporary capacity
      • Can use reserved instances on long-running clusters to save $
    • Connect directly to master to run jobs
    • Submit ordered steps via the console
    • EMR Serverless option (no cluster to manage)
  • AWS Integration
    • Amazon EC2 for the instances that comprise the nodes in the cluster
    • Amazon VPC to configure the virtual network in which you launch your instances
    • Amazon S3 to store input and output data
    • Amazon CloudWatch to monitor cluster performance and configure alarms
    • AWS IAM to configure permissions
    • AWS CloudTrail to audit requests made to the service
    • AWS Data Pipeline to schedule and start your clusters
  • EMR Storage
    • HDFS
    • EMR File System (EMRFS): access S3 as if it were HDFS (sketch below)
      • [End-of-Support] EMRFS Consistent View – optional for S3 consistency; uses DynamoDB to track consistency
    • Local file system
    • EBS for HDFS
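A hedged PySpark sketch of EMRFS in use: on EMR, an s3:// URI can be passed wherever an HDFS path is expected (the bucket, prefixes, and event_type column are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-sketch").getOrCreate()

# Read input from S3 via EMRFS, aggregate, and write results back to S3
df = spark.read.parquet("s3://my-bucket/input/events/")
df.groupBy("event_type").count().write.parquet("s3://my-bucket/output/event_counts/")

spark.stop()
```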
  • Apache Spark
    • Can work with Hadoop
    • Spark MLlib (a toy sketch follows this sub-list)
      • Classification: logistic regression, naïve Bayes
      • Regression
      • Decision trees
      • Recommendation engine (ALS)
      • Clustering (K-Means)
      • LDA (topic modeling)
      • ML workflow utilities (pipelines, feature transformation, persistence)
      • SVD, PCA, statistics
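A toy Spark MLlib logistic regression sketch using the DataFrame-based spark.ml API; the training data below is made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training data: (label, feature vector)
train = spark.createDataFrame(
    [
        (0.0, Vectors.dense([0.0, 1.1, 0.1])),
        (1.0, Vectors.dense([2.0, 1.0, -1.0])),
        (0.0, Vectors.dense([2.0, 1.3, 1.0])),
        (1.0, Vectors.dense([0.0, 1.2, -0.5])),
    ],
    ["label", "features"],
)

# Fit a logistic regression classifier and show predictions on the training set
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()
```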
    • Spark Structured Streaming (minimal sketch below)
      • Data stream as an unbounded Input Table
      • New data in stream = new rows appended to input table
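A minimal Structured Streaming word-count sketch that treats a socket text stream as an unbounded input table; the host and port are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Treat a socket text stream as an unbounded input table
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Each new line appends rows to the unbounded table; aggregate word counts over it
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously emit the updated counts to the console
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```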
  • EMR Notebook
    • Notebooks backed up to S3
    • Provision clusters from the notebook!
    • Hosted inside a VPC
    • Accessed only via AWS console
  • EMR Security
    • IAM policies
    • Kerberos (a computer-network authentication protocol that works on the basis of tickets to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.)
    • SSH
    • IAM roles
    • Security configurations may be specified for Lake Formation
    • Native integration with Apache Ranger
  • EMR: Choosing Instance Types
    • Master node:
      • m4.large if < 50 nodes, m4.xlarge if > 50 nodes
    • Core & task nodes:
      • m4.large is usually good
      • If the cluster waits a lot on external dependencies (e.g., a web crawler), t2.medium
      • Improved performance: m4.xlarge
      • Computation-intensive applications: high CPU instances
      • Database, memory-caching applications: high memory instances
      • Network / CPU-intensive (NLP, ML) – cluster compute instances
      • Accelerated Computing / AI – GPU instances (g3, g4, p2, p3)
    • Spot instances
      • Good choice for task nodes
      • Only use on core & master if you’re testing or very cost-sensitive; you’re risking partial data loss

SageMaker Ground Truth

  • Ground Truth manages humans who will label your data for training purposes
  • Ground Truth creates its own model as images are labeled by people
  • As this model learns, only images the model isn’t sure about are sent to human labelers
  • This can reduce the cost of labeling jobs by 70%
  • Offers a combination of automated data labeling and human labeling to ensure efficiency and accuracy
  • Uses active learning to reduce the manual labeling required, automatically labeling data when it has high confidence in its predictions
  • Integrates seamlessly with Amazon S3
  • Handles various data types, such as text, videos, images, and 3D point clouds
  • Provides built-in support for labeling tasks like text classification, object detection, and semantic segmentation
  • Incorporates feedback from human labelers to ensure the accuracy of automated labels
  • Ground Truth Plus is a turnkey solution
  • Track progress via the Ground Truth Plus Project Portal
  • Get labeled data from S3 when done
  • Other ways to generate training labels (a boto3 sketch follows this list)
    • Rekognition
      • AWS service for image recognition
      • Automatically classify images
    • Comprehend
      • AWS service for text analysis and topic modeling
      • Automatically classify text by topics, sentiment
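A hedged boto3 sketch of generating labels with Rekognition (image labels) and Comprehend (text sentiment); the bucket, object key, and text are made up:

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")
comprehend = boto3.client("comprehend", region_name="us-east-1")

# Image labels from an S3 object (bucket and key are hypothetical)
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-training-images", "Name": "images/dog_001.jpg"}},
    MaxLabels=5,
)
print([l["Name"] for l in labels["Labels"]])

# Sentiment label for a piece of text
sentiment = comprehend.detect_sentiment(
    Text="The product arrived late and the box was damaged.",
    LanguageCode="en",
)
print(sentiment["Sentiment"])  # e.g. NEGATIVE
```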

Amazon Mechanical Turk

  • a crowdsourcing marketplace that makes it easier for individuals and businesses to outsource their processes and jobs to a distributed workforce who can perform these tasks virtually.

AWS Lake Formation

  • Centrally govern, secure, and share data for analytics and machine learning

Data Build Tool (dbt)

  • an open-source command line tool that helps analysts and engineers transform data in their warehouse more effectively
  • dbt enables analytics engineers to transform data in their warehouses by writing select statements, which it turns into tables and views. dbt does the transformation (T) in extract, load, transform (ELT) processes – it does not extract or load data, but is designed to be performant at transforming data already inside a warehouse.
  • dbt uses YAML files to declare properties. A seed is a type of reference table used in dbt for static or infrequently changed data (for example, country codes or lookup tables); seeds are CSV-based and typically stored in a seeds folder.