11. ML – Data Engineering

S3

  • Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf
  • Storage Classes
    • Amazon S3 Standard – General Purpose
    • Amazon S3 Standard-Infrequent Access (IA)
    • Amazon S3 One Zone-Infrequent Access
      • data lost when AZ is destroyed
    • Amazon S3 Glacier Instant Retrieval
      • Millisecond retrieval
      • Minimum storage duration of 90 days
    • Amazon S3 Glacier Flexible Retrieval
      • Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours)
      • Minimum storage duration of 90 days
    • Amazon S3 Glacier Deep Archive
      • Standard (12 hours), Bulk (48 hours)
      • Minimum storage duration of 180 days
    • Amazon S3 Intelligent Tiering
      • Moves objects automatically between Access Tiers based on usage
      • There are no retrieval charges in S3 Intelligent-Tiering
    • Can be managed with S3 Lifecycle
  • Lifecycle Rules
    • Transition Actions
    • Expiration Actions
      • Can be used to delete old versions of files (if versioning is enabled)
      • Can be used to delete incomplete Multi-Part uploads
  • Enable S3 Versioning in order to have object versions, so that “deleted objects” are in fact hidden by a “delete marker” and can be recovered
  • S3 Analytics
    • Helps you decide when to transition objects to the right storage class
  • S3 Security
    • User-Based (IAM Policies)
    • Resource-Based (Bucket Policies)
      • Grant public access to the bucket
      • Force objects to be encrypted at upload
      • Grant access to another account (Cross Account)
  • Object Encryption
    • Server-Side Encryption (SSE)
      • Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3) – Default
        • automatically applied to new objects stored in S3 bucket
        • Encryption type is AES-256
        • header “x-amz-server-side-encryption”: “AES256”
      • Server-Side Encryption with KMS Keys stored in AWS KMS (SSE-KMS)
        • KMS advantages: user control + audit key usage using CloudTrail
        • header “x-amz-server-side-encryption”: “aws:kms”
        • When you upload, it calls the GenerateDataKey KMS API
        • When you download, it calls the Decrypt KMS API
      • Server-Side Encryption with Customer-Provided Keys (SSE-C)
        • Amazon S3 does NOT store the encryption key you provide
        • HTTPS must be used
        • Encryption key must be provided in HTTP headers, for every HTTP request made
    • Client-Side Encryption
  • Encryption in transit (SSL/TLS)
    • aka “Encryption in flight”
    • Amazon S3 exposes two endpoints:
      • HTTP Endpoint – non encrypted
      • HTTPS Endpoint – encryption in flight
        • mandatory for SSE-C
        • Set Condition in the Bucket Policy, with “aws:SecureTransport”
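
  The “aws:SecureTransport” condition above can be sketched as a bucket policy that denies any non-HTTPS request. A minimal sketch; the bucket name is a hypothetical placeholder, and the boto3 call is shown commented since it needs real AWS credentials:

  ```python
  import json

  # Hypothetical bucket name for illustration
  BUCKET = "my-ml-data-bucket"

  # Bucket policy that denies any request not made over HTTPS,
  # using the aws:SecureTransport condition key
  policy = {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "DenyInsecureTransport",
              "Effect": "Deny",
              "Principal": "*",
              "Action": "s3:*",
              "Resource": [
                  f"arn:aws:s3:::{BUCKET}",
                  f"arn:aws:s3:::{BUCKET}/*",
              ],
              # aws:SecureTransport is "false" when the request came over plain HTTP
              "Condition": {"Bool": {"aws:SecureTransport": "false"}},
          }
      ],
  }

  # Applying it would look like (requires valid AWS credentials):
  # import boto3
  # boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
  ```

  Because the effect is Deny on all of s3:*, this also makes HTTPS effectively mandatory for SSE-C uploads to the bucket.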

Kinesis Data Streams

  • Collect and store streaming data in real-time
  • Retention configurable from 1 day up to 365 days
  • Data up to 1MB (typical use case is a lot of “small” real-time data)
  • Data ordering guarantee for data with the same “Partition ID”
  • Capacity Modes
    • Provisioned mode
      • Each shard gets 1MB/s in (or 1000 records per second)
      • Each shard gets 2MB/s out
    • On-demand mode
      • Default capacity provisioned (4 MB/s in or 4000 records per second)
  • [ML] create real-time machine learning applications
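
  A minimal sketch of producing to a stream, assuming a hypothetical stream name. The partition key is what provides the ordering guarantee noted above: records with the same key always land on the same shard, in order.

  ```python
  import json

  # Hypothetical stream name for illustration
  STREAM = "sensor-stream"

  def make_record(sensor_id: str, temperature: float) -> dict:
      """Build the put_record arguments for one sensor reading (record <= 1 MB)."""
      return {
          "StreamName": STREAM,
          "Data": json.dumps({"sensor_id": sensor_id, "temperature": temperature}),
          "PartitionKey": sensor_id,  # same key -> same shard -> ordered
      }

  record = make_record("sensor-42", 21.5)

  # Sending it would look like (requires valid AWS credentials):
  # import boto3
  # boto3.client("kinesis").put_record(**record)
  ```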

Amazon Data Firehose

  • aka Kinesis Data Firehose
  • Collect and store streaming data in real-time
  • Near Real-Time
  • Custom data transformations using AWS Lambda
  • [ML] ingest massive data near-real time
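
  The Lambda transformation mentioned above follows a fixed request/response contract: Firehose hands the function base64-encoded records, and it must return each record with a `recordId`, a `result`, and re-encoded `data`. A sketch of such a handler; the `name` field and the upper-casing transform are hypothetical:

  ```python
  import base64
  import json

  def lambda_handler(event, context):
      """Firehose data-transformation Lambda: decodes each record,
      applies a (hypothetical) transform, and re-encodes it."""
      output = []
      for record in event["records"]:
          payload = json.loads(base64.b64decode(record["data"]))
          payload["name"] = payload.get("name", "").upper()  # example transform
          output.append({
              "recordId": record["recordId"],       # must echo the original id
              "result": "Ok",                        # or "Dropped" / "ProcessingFailed"
              "data": base64.b64encode(json.dumps(payload).encode()).decode(),
          })
      return {"records": output}
  ```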

Amazon Managed Service for Apache Flink

  • aka Kinesis Data Analytics
  • Flink does not read from Amazon Data Firehose
  • Serverless
  • Common cases
    • Streaming ETL
    • Continuous metric generation
    • Responsive analytics
  • Use IAM permissions to access streaming source and destination(s)
  • Schema discovery
  • [ML] real-time ETL / ML algorithms on streams

Kinesis Video Streams

  • Video playback capability
  • Keep data for 1 hour to 10 years
  • [ML] real-time video stream to create ML applications

Glue Data Catalog

  • Metadata repository for all your tables
    • Automated Schema Inference
    • Schemas are versioned
  • Integrates with Athena or Redshift Spectrum (schema & data discovery)
  • Glue Crawlers can help build the Glue Data Catalog
    • Works with JSON, Parquet, CSV, relational stores
    • Crawlers work for: S3, Amazon Redshift, Amazon RDS
    • Run the Crawler on a Schedule or On Demand
    • Need an IAM role / credentials to access the data stores
    • Glue crawler will extract partitions based on how your S3 data is organized
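
  Defining a scheduled crawler can be sketched with boto3; the crawler name, IAM role, database, and S3 path below are all hypothetical placeholders:

  ```python
  # Hypothetical configuration: a crawler that catalogs partitioned data in S3
  crawler_config = {
      "Name": "sales-data-crawler",
      "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # needs S3 read access
      "DatabaseName": "sales_db",
      "Targets": {"S3Targets": [{"Path": "s3://my-ml-data-bucket/sales/"}]},
      "Schedule": "cron(0 2 * * ? *)",  # run daily at 02:00 UTC
  }

  # Creating and running it would look like (requires valid AWS credentials):
  # import boto3
  # glue = boto3.client("glue")
  # glue.create_crawler(**crawler_config)
  # glue.start_crawler(Name=crawler_config["Name"])  # On Demand run
  ```

  Partition columns are inferred from the prefix layout under the S3 path (e.g. `sales/year=2024/month=01/`).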

Glue ETL

  • Extract, Transform, Load
  • Transform data, Clean Data, Enrich Data (before doing analysis)
    • Bundled Transformations
      • DropFields, DropNullFields – remove (null) fields
      • Filter – specify a function to filter records
      • Join – to enrich data
      • Map – add fields, delete fields, perform external lookups
    • Machine Learning Transformations
      • FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
      • Apache Spark transformations (example: K-Means)
  • Jobs are run on a serverless Spark platform
  • Glue Scheduler to schedule the jobs
  • Glue Triggers to automate job runs based on “events”
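
  An event-based Glue Trigger can be sketched with boto3: the trigger below starts one job when another job succeeds. The job and trigger names are hypothetical:

  ```python
  # Hypothetical trigger: start an enrichment job after a cleaning job succeeds
  trigger_config = {
      "Name": "run-enrich-after-clean",
      "Type": "CONDITIONAL",                      # fire on another job's state change
      "Actions": [{"JobName": "enrich-job"}],
      "Predicate": {
          "Conditions": [{
              "LogicalOperator": "EQUALS",
              "JobName": "clean-job",
              "State": "SUCCEEDED",
          }]
      },
      "StartOnCreation": True,
  }

  # Creating it would look like (requires valid AWS credentials):
  # import boto3
  # boto3.client("glue").create_trigger(**trigger_config)
  ```

  A `Type` of `"SCHEDULED"` with a cron expression covers the Glue Scheduler case instead.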

AWS Glue DataBrew

  • Allows you to clean and normalize data without writing any code
  • Reduces ML and analytics data preparation time by up to 80%

AWS Data Stores for Machine Learning

  • Redshift
    • Data Warehousing, SQL analytics (OLAP – Online analytical processing)
    • Load data from S3 to Redshift
    • Use Redshift Spectrum to query data directly in S3 (no loading)
  • RDS, Aurora
    • Relational Store, SQL (OLTP – Online Transaction Processing)
    • Must provision servers in advance
  • DynamoDB
    • NoSQL data store, serverless, provision read/write capacity
    • Useful to store a machine learning model served by your application
  • S3
    • Object storage
    • Serverless, infinite storage
    • Integration with most AWS Services
  • OpenSearch (previously ElasticSearch)
    • Indexing of data
    • Search amongst data points
    • Clickstream Analytics
  • ElastiCache
    • Caching mechanism
    • Not really used for Machine Learning

AWS Data Pipeline

  • Destinations include S3, RDS, DynamoDB, Redshift and EMR
  • Manages task dependencies
  • Retries and notifies on failures
  • Data sources may be on-premises

AWS Batch

  • Run batch jobs via Docker images
  • Dynamic provisioning of the instances (EC2 & Spot Instances)
  • Serverless
  • Schedule Batch Jobs using CloudWatch Events
  • Orchestrate Batch Jobs using AWS Step Functions
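
  Submitting a batch job can be sketched with boto3; the queue, job definition, and command below are hypothetical, and `containerOverrides` shows how the Docker container's command and environment are customized per run:

  ```python
  # Hypothetical job submission against an existing queue and job definition
  job_request = {
      "jobName": "nightly-feature-build",
      "jobQueue": "ml-batch-queue",            # assumed existing job queue
      "jobDefinition": "feature-build:3",      # assumed Docker-based job definition
      "containerOverrides": {
          "command": ["python", "build_features.py", "--date", "2024-01-01"],
          "environment": [{"name": "OUTPUT_BUCKET", "value": "my-ml-data-bucket"}],
      },
  }

  # Submitting it would look like (requires valid AWS credentials):
  # import boto3
  # boto3.client("batch").submit_job(**job_request)
  ```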

DMS – Database Migration Service

  • Continuous Data Replication using CDC
  • You must create an EC2 instance to perform the replication tasks
  • Homogeneous migrations: e.g. Oracle to Oracle
  • Heterogeneous migrations: e.g. Microsoft SQL Server to Aurora

AWS Step Functions

  • Use to design workflows
  • Advanced Error Handling and Retry mechanism outside the (Lambda) code
  • Audit of the history of workflows
  • Ability to “Wait” for an arbitrary amount of time
  • Max execution time of a State Machine is 1 year
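
  The Retry-outside-the-code and “Wait” ideas above can be sketched as an Amazon States Language definition, here built as a Python dict. The Lambda ARN, state names, and retry values are hypothetical:

  ```python
  # Minimal ASL sketch: a Task with retries handled by the state machine,
  # then a Wait state before finishing
  state_machine = {
      "StartAt": "ProcessData",
      "States": {
          "ProcessData": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
              "Retry": [{
                  "ErrorEquals": ["States.TaskFailed"],
                  "IntervalSeconds": 5,
                  "MaxAttempts": 3,
                  "BackoffRate": 2.0,  # waits 5s, 10s, 20s between attempts
              }],
              "Next": "CoolDown",
          },
          "CoolDown": {"Type": "Wait", "Seconds": 60, "Next": "Done"},
          "Done": {"Type": "Succeed"},
      },
  }

  # Deploying it would look like (requires valid AWS credentials):
  # import json, boto3
  # boto3.client("stepfunctions").create_state_machine(
  #     name="ml-workflow",
  #     definition=json.dumps(state_machine),
  #     roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole")
  ```

  The retry policy lives in the workflow definition, so the Lambda code itself stays free of retry logic.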

AWS DataSync

  • on-premises -> AWS storage services
  • A DataSync Agent is deployed as a VM and connects to your internal storage (NFS, SMB, HDFS)
  • Encryption and data validation

MQTT

  • Standard messaging protocol, for IoT (Internet of Things)
  • Think of it as how lots of sensor data might get transferred to your machine learning model
  • The AWS IoT Device SDK can connect via MQTT
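
  A sketch of what a device-side publish looks like; the topic name and payload fields are hypothetical, and the SDK calls are shown commented since they need a real IoT Core endpoint and device certificates:

  ```python
  import json

  # Hypothetical sensor reading; MQTT topics are plain strings,
  # conventionally hierarchical like "factory/line1/temp"
  topic = "factory/line1/temp"
  payload = json.dumps({"sensor_id": "temp-7", "celsius": 21.5})

  # With the AWS IoT Device SDK v2, publishing would look roughly like:
  # from awscrt import mqtt
  # ... build an MQTT connection over TLS to your IoT Core endpoint, then:
  # connection.publish(topic=topic, payload=payload, qos=mqtt.QoS.AT_LEAST_ONCE)
  ```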