12. ML – Exploratory Data Analysis

Python and related tools

  • Pandas: A Python library for slicing and dicing your data
    • Data Frames
    • Series
    • Interoperates with numpy
  • Matplotlib
  • Seaborn
  • scikit-learn: Python library for machine learning models
  • Jupyter notebooks
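
As a minimal sketch of the pandas objects listed above (the column names and values are made up for illustration, and pandas and NumPy are assumed to be installed): a DataFrame holds labeled columns, each column is a Series, and both convert to NumPy arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51],
        "income": [40000.0, 52000.0, 88000.0, 61000.0],
        "segment": ["a", "b", "a", "b"],
    }
)

# A single column of a DataFrame is a Series.
ages = df["age"]

# Slice rows with a boolean mask, then aggregate by group.
over_30 = df[df["age"] > 30]
mean_income_by_segment = df.groupby("segment")["income"].mean()

# Interoperate with NumPy: extract the numeric columns as an ndarray.
features = df[["age", "income"]].to_numpy()

# NumPy ufuncs accept a Series and return a Series.
log_income = np.log(df["income"])
```

In a Jupyter notebook, evaluating `df.describe()` in a cell is a common first step for summarizing the numeric columns.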

Amazon Athena

  • Interactive query service for S3 (SQL)
  • Serverless
  • Supported data formats
    • CSV (human readable)
    • JSON (human readable)
    • ORC (columnar, splittable)
    • Parquet (columnar, splittable)
    • Avro (splittable)
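
A hedged sketch of submitting a SQL query to Athena from Python with boto3's `start_query_execution` call. The database, table, and bucket names are hypothetical, and the submit step needs boto3 plus valid AWS credentials, so it is kept in a separate function that is not called here.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical database, table, and bucket names.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="example_db",
    output_location="s3://example-athena-results/",
)
```

Calling `run_query(request)` would return a query execution ID; the results are then read from the S3 output location once the query finishes.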

Amazon QuickSight

  • Business analytics and visualizations in the cloud
  • Serverless
  • Data sources
    • Redshift
    • Aurora / RDS
    • Athena
    • EC2-hosted databases
    • Files (S3 or on-premises)
      • Excel
      • CSV, TSV
      • Common or extended log format
    • AWS IoT Analytics
  • Data preparation allows limited ETL
  • SPICE
    • Data sets are imported into SPICE
    • Each user gets 10GB of SPICE
    • Scales to hundreds of thousands of users
  • Use cases
    • Interactive ad-hoc exploration / visualization of data
    • Dashboards and KPIs
    • Analyze / visualize data from:
      • Logs in S3
      • On-premises databases
      • AWS (RDS, Redshift, Athena, S3)
      • SaaS applications, such as Salesforce
      • Any JDBC/ODBC data source
  • QuickSight Q
    • Machine learning-powered
    • Answers business questions with Natural Language Processing
    • Must set up topics associated with datasets
  • QuickSight Security
    • Multi-factor authentication on your account
    • VPC connectivity
    • Row-level security
      • Column-level security (CLS) too (Enterprise edition only)
    • Private VPC access
      • Elastic Network Interface, AWS Direct Connect
  • QuickSight Visual Types
    • AutoGraph
    • Bar Charts: comparison and distribution (histograms)
      • A histogram is a type of chart that displays the distribution of numerical data by dividing it into intervals or bins. Each bar represents the frequency or count of data points falling within each interval, providing insights into the data’s distribution and density.
    • Line graphs: changes over time
    • Scatter plot, heat maps: correlation
      • A heat map is a visualization method that uses color gradients to represent values within a matrix. It displays data in a two-dimensional format where color intensity indicates the magnitude of values.
    • Pie graphs: aggregation
    • Tree maps: hierarchical aggregation
    • Pivot tables: tabular data
    • KPIs: compare key value to its target value
    • Geospatial Charts (maps)
    • Donut Charts: percentage of total amount
    • Gauge Charts: Compare values in a measure
    • Word Clouds: word or phrase frequency
    • Radar Chart
    • Sankey diagrams: show flows from one category to another, or paths from one stage to the next
    • Waterfall chart: visualize a sequential summation as values are added or subtracted
    • (not provided) Density Plot
      • Also known as a kernel density plot or density trace graph
      • Visualizes the distribution of data over a continuous interval or time period. It is a variation of a histogram that uses kernel smoothing to plot values, which smooths out noise and gives a smoother distribution. The peaks of a density plot show where values are concentrated over the interval.
  • Bar graph vs. histogram
    • Type of data
      • Bar graph: graphical representation of categorical data
      • Histogram: graphical representation of quantitative data (the x-axis presents the data as either numeric or ordinal)
    • Spacing between bars
      • Bar graph: equal space between each pair of consecutive bars
      • Histogram: no space between consecutive bars
    • What the bars encode
      • Bar graph: the height of each bar shows the frequency, and all bars have the same width
      • Histogram: the area of each bar shows the frequency, and the bars need not have the same width
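
The histogram and density plot descriptions above can be sketched in plain Python. This is an illustrative sketch, not how QuickSight computes its visuals: the sample data, the equal-width binning, the Gaussian kernel, and the bandwidth value are all assumptions chosen for the example.

```python
import math


def histogram(values, bins):
    """Count how many values fall in each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:
        # All values are identical, so they all land in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def gaussian_kde(values, bandwidth):
    """Return a density function: the average of Gaussian kernels centered on each value."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical sample with values concentrated near 2 and near 8.
data = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 7.0, 8.0, 8.0, 9.0]

counts = histogram(data, bins=4)  # counts == [5, 1, 0, 4]
density = gaussian_kde(data, bandwidth=0.5)
```

Here `density(2.0)` and `density(8.0)` are both larger than `density(5.0)`, which matches the two peaks a density plot of this sample would show. A smaller bandwidth follows the data more closely, and a larger one smooths more.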

SageMaker Ground Truth

  • Ground Truth manages humans who will label your data for training purposes
  • Ground Truth creates its own model as images are labeled by people
  • As this model learns, only images the model isn’t sure about are sent to human labelers
  • This can reduce the cost of labeling jobs by up to 70%
  • Combines automated data labeling with human labeling for efficiency and accuracy
  • By using active learning, Ground Truth reduces the manual labeling required: it labels data automatically when it has high confidence in its predictions
  • Integrates with Amazon S3
  • Handles various data types, such as text, videos, images, and 3D point clouds
  • Provides built-in support for labeling tasks such as text classification, object detection, and semantic segmentation
  • Incorporates feedback from human labelers to check the accuracy of automated labels
  • Ground Truth Plus is a turnkey solution
    • Track progress via the Ground Truth Plus Project Portal
    • Get labeled data from S3 when done
  • Other ways to generate training labels
    • Rekognition
      • AWS service for image recognition
      • Automatically classify images
    • Comprehend
      • AWS service for text analysis and topic modeling
      • Automatically classify text by topics, sentiment
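
The active-learning idea behind Ground Truth (label automatically when the model is confident, send the rest to people) can be sketched as follows. This is not Ground Truth's actual implementation: the function name, the 0.9 threshold, and the sample predictions are all hypothetical.

```python
def route_for_labeling(predictions, confidence_threshold=0.9):
    """Split model predictions into auto-labeled items and items for human review.

    `predictions` maps an item id to a (label, confidence) pair.
    """
    auto_labeled = {}
    needs_human = []
    for item_id, (label, confidence) in predictions.items():
        if confidence >= confidence_threshold:
            # The model is confident enough: accept its label.
            auto_labeled[item_id] = label
        else:
            # The model is unsure: queue the item for a human labeler.
            needs_human.append(item_id)
    return auto_labeled, needs_human


# Hypothetical model output for four images.
predictions = {
    "img-001": ("cat", 0.98),
    "img-002": ("dog", 0.55),
    "img-003": ("cat", 0.91),
    "img-004": ("dog", 0.72),
}

auto_labeled, needs_human = route_for_labeling(predictions)
```

In this example, two of the four images are labeled automatically and two go to human labelers. In a real loop, the human labels would be fed back to retrain the model, so fewer items need human review over time.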

Amazon Mechanical Turk

  • A crowdsourcing marketplace that makes it easier for individuals and businesses to outsource processes and jobs to a distributed workforce that performs these tasks virtually

AWS Lake Formation

  • Centrally govern, secure, and share data for analytics and machine learning