S3
Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf
Storage Classes
Amazon S3 Standard – General Purpose
Amazon S3 Standard-Infrequent Access (IA)
Amazon S3 One Zone-Infrequent Access
Data is lost if the AZ is destroyed
Amazon S3 Glacier Instant Retrieval
Millisecond retrieval
Minimum storage duration of 90 days
Amazon S3 Glacier Flexible Retrieval
Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours)
Minimum storage duration of 90 days
Amazon S3 Glacier Deep Archive
Standard (12 hours), Bulk (48 hours)
Minimum storage duration of 180 days
Amazon S3 Intelligent Tiering
Moves objects automatically between Access Tiers based on usage
There are no retrieval charges in S3 Intelligent-Tiering
Can be managed with S3 Lifecycle
Lifecycle Rules
Transition Actions
Expiration actions
Can be used to delete old versions of files (if versioning is enabled)
Can be used to delete incomplete Multi-Part uploads
Enable S3 Versioning in order to have object versions, so that “deleted objects” are in fact hidden by a “delete marker” and can be recovered
S3 Analytics
decide when to transition objects to the right storage class
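The lifecycle rules above can be sketched as a configuration document. This is a minimal illustration, not a recommended policy: the rule ID, the prefix, and every day count are assumptions chosen for the example. The dictionary follows the shape that boto3's put_bucket_lifecycle_configuration expects for its LifecycleConfiguration argument, but nothing here calls AWS.

```python
import json

# Hypothetical rule: the ID, prefix, and day counts are illustrative assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "training-data/"},
            # Transition actions: move objects to colder storage classes over time.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            # Expiration action: delete current versions after two years.
            "Expiration": {"Days": 730},
            # Delete old versions (only has an effect if versioning is enabled).
            "NoncurrentVersionExpiration": {"NoncurrentDays": 60},
            # Clean up incomplete multipart uploads.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}


def transitions_are_ordered(rule):
    """Check that each transition happens strictly later than the previous one."""
    days = [t["Days"] for t in rule["Transitions"]]
    return days == sorted(days) and len(set(days)) == len(days)


print(json.dumps(lifecycle_configuration, indent=2))
print(transitions_are_ordered(lifecycle_configuration["Rules"][0]))
```

The ordering check is only a local sanity test for the sketch; S3 applies its own validation when a configuration is submitted.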
S3 Security
User-Based (IAM Policies)
Resource-Based (Bucket Policies)
Grant public access to the bucket
Force objects to be encrypted at upload
Grant access to another account (Cross Account)
Object Encryption
Server-Side Encryption (SSE)
Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3) – Default
automatically applied to new objects stored in S3 bucket
Encryption type is AES-256
header “x-amz-server-side-encryption”: “AES256”
Server-Side Encryption with KMS Keys stored in AWS KMS (SSE-KMS)
KMS advantages: user control + audit key usage using CloudTrail
header “x-amz-server-side-encryption”: “aws:kms”
When you upload, it calls the GenerateDataKey KMS API
When you download, it calls the Decrypt KMS API
Server-Side Encryption with Customer-Provided Keys (SSE-C)
Amazon S3 does NOT store the encryption key you provide
HTTPS must be used
Encryption key must be provided in HTTP headers, for every HTTP request made
Client-Side Encryption
Encryption in transit (SSL/TLS)
aka “Encryption in flight”
Amazon S3 exposes two endpoints:
HTTP Endpoint – non encrypted
HTTPS Endpoint – encryption in flight
mandatory for SSE-C
Set Condition in the Bucket Policy, with “aws:SecureTransport”
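A bucket policy that uses the “aws:SecureTransport” condition, and one that forces encryption at upload, might look like the sketch below. The bucket name is a hypothetical placeholder, and requiring SSE-KMS in the second statement (rather than SSE-S3) is an assumption made for the example. The code only builds and prints the JSON document.

```python
import json

BUCKET = "example-ml-bucket"  # hypothetical bucket name

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Reject any request that does not arrive over HTTPS.
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {
            # Reject uploads whose encryption header is not "aws:kms".
            "Sid": "DenyUploadsWithoutSseKms",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        },
    ],
}

policy_json = json.dumps(bucket_policy, indent=2)
print(policy_json)
```

Both statements are explicit denies, so they take precedence over any allow granted elsewhere by IAM or bucket policies.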
Kinesis Data Streams
Collect and store streaming data in real-time
Retention of up to 365 days
Records up to 1 MB each (typical use case is lots of “small” real-time data)
Data ordering guarantee for data with the same partition key (records with the same key go to the same shard)
Capacity Modes
Provisioned mode
Each shard gets 1MB/s in (or 1000 records per second)
Each shard gets 2MB/s out
On-demand mode
Default capacity provisioned (4 MB/s in or 4000 records per second)
[ML] create real-time machine learning applications
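The provisioned-mode limits above translate into a simple sizing calculation, sketched below together with a simplified model of how a partition key selects a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The assumption that the ranges are split evenly holds for a freshly created stream but not necessarily after resharding. The function names are made up for this example.

```python
import hashlib
import math


def shards_needed(in_mb_per_s, in_records_per_s, out_mb_per_s):
    """Smallest shard count satisfying all three per-shard limits."""
    return max(
        1,
        math.ceil(in_mb_per_s / 1.0),      # 1 MB/s in per shard
        math.ceil(in_records_per_s / 1000),  # 1000 records/s in per shard
        math.ceil(out_mb_per_s / 2.0),     # 2 MB/s out per shard
    )


def shard_for_key(partition_key, shard_count):
    """Simplified model: MD5 hash mapped onto evenly split hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # 128-bit integer
    range_size = 2**128 // shard_count
    return min(hash_value // range_size, shard_count - 1)


print(shards_needed(in_mb_per_s=3.5, in_records_per_s=2500, out_mb_per_s=6))
print(shard_for_key("sensor-42", 4) == shard_for_key("sensor-42", 4))
```

Because the same key always hashes to the same value, all records for one key land on one shard, which is what makes per-key ordering possible.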
Amazon Data Firehose
aka Kinesis Data Firehose
Collect and store streaming data in real-time
Near Real-Time
Custom data transformations using AWS Lambda
[ML] ingest massive data near-real time
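A custom Lambda transformation for Firehose could be sketched as follows. Firehose passes a batch of base64-encoded records and expects each one back with the same recordId, a result of “Ok”, “Dropped”, or “ProcessingFailed”, and the (possibly modified) base64 data. The temperature fields are hypothetical, invented only to give the handler something to transform.

```python
import base64
import json


def lambda_handler(event, context):
    """Add a Fahrenheit field to each JSON record; mark unparseable records as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: temperature_c is an assumed input field.
            payload["temperature_f"] = payload["temperature_c"] * 9 / 5 + 32
            line = json.dumps(payload) + "\n"  # newline-delimited output
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": data}
            )
        except (ValueError, KeyError, TypeError):
            # Return the original data untouched and flag the record as failed.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

The handler can be exercised locally by passing a dictionary shaped like the Firehose event, with no AWS connection needed.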
Amazon Managed Service for Apache Flink
aka Kinesis Data Analytics
Flink does not read from Amazon Data Firehose
Serverless
Common cases
Streaming ETL
Continuous metric generation
Responsive analytics
Use IAM permissions to access streaming source and destination(s)
Schema discovery
[ML] real-time ETL / ML algorithms on streams
Kinesis Video Streams
Video playback capability
Keep data for 1 hour to 10 years
[ML] real-time video stream to create ML applications
Glue Data Catalog
Metadata repository for all your tables
Automated Schema Inference
Schemas are versioned
Integrates with Athena or Redshift Spectrum (schema & data discovery)
Glue Crawlers can help build the Glue Data Catalog
Works with JSON, Parquet, CSV, relational stores
Crawlers work for: S3, Amazon Redshift, Amazon RDS
Run the Crawler on a Schedule or On Demand
Need an IAM role / credentials to access the data stores
Glue crawler will extract partitions based on how your S3 data is organized
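To make the partition extraction concrete, here is a rough sketch of how partition columns can be read from the way S3 keys are organized. It only handles the Hive-style name=value folder convention and is not the crawler's actual implementation. The function name and the example key are assumptions.

```python
def partitions_from_key(key):
    """Return partition columns found as name=value folders in an S3 key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the object name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


print(partitions_from_key("sensors/year=2024/month=01/day=15/data.parquet"))
```

Organizing data this way lets query engines that read the catalog skip whole folders when a query filters on a partition column.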
Glue ETL
Extract, Transform, Load
Transform data, Clean Data, Enrich Data (before doing analysis)
Bundled Transformations
DropFields, DropNullFields – remove (null) fields
Filter – specify a function to filter records
Join – to enrich data
Map – add fields, delete fields, perform external lookups
Machine Learning Transformations
FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
Apache Spark transformations (example: K-Means)
Jobs are run on a serverless Spark platform
Glue Scheduler to schedule the jobs
Glue Triggers to automate job runs based on “events”
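The bundled transformations listed above can be pictured with plain Python lists of dictionaries. This is deliberately NOT the Glue API: a real job would apply the Glue transforms to DynamicFrames on Spark. The helper names and sample records are hypothetical, and the sketch only shows what each transformation does to the data.

```python
# Hypothetical sample data standing in for two tables.
records = [
    {"user_id": 1, "country": "FR", "age": 34, "legacy_flag": None},
    {"user_id": 2, "country": "US", "age": None, "legacy_flag": None},
    {"user_id": 3, "country": "FR", "age": 17, "legacy_flag": None},
]
countries = [{"country": "FR", "region": "EU"}, {"country": "US", "region": "NA"}]


def drop_fields(rows, fields):
    """DropFields analogue: remove the named fields from every row."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def drop_null_fields(rows):
    """DropNullFields analogue: remove fields that are null in every row."""
    keys = {k for row in rows for k in row}
    all_null = {k for k in keys if all(row.get(k) is None for row in rows)}
    return drop_fields(rows, all_null)


def filter_rows(rows, predicate):
    """Filter analogue: keep rows for which the function returns True."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Join analogue: inner join on one key to enrich the left rows."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def map_rows(rows, fn):
    """Map analogue: apply a function to a copy of each row."""
    return [fn(dict(row)) for row in rows]


cleaned = drop_null_fields(records)
adults = filter_rows(cleaned, lambda r: r["age"] is not None and r["age"] >= 18)
enriched = join(adults, countries, "country")
result = map_rows(enriched, lambda r: {**r, "is_eu": r["region"] == "EU"})
print(result)
```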
AWS Glue DataBrew
Allows you to clean and normalize data without writing any code
Reduces ML and analytics data preparation time by up to 80%
AWS Data Stores for Machine Learning
Redshift
Data Warehousing, SQL analytics (OLAP – Online analytical processing)
Load data from S3 to Redshift
Use Redshift Spectrum to query data directly in S3 (no loading)
RDS, Aurora
Relational Store, SQL (OLTP – Online Transaction Processing)
Must provision servers in advance
DynamoDB
NoSQL data store, serverless, provision read/write capacity
Useful to store a machine learning model served by your application
S3
Object storage
Serverless, infinite storage
Integration with most AWS Services
OpenSearch (previously Elasticsearch)
Indexing of data
Search amongst data points
Clickstream Analytics
ElastiCache
Caching mechanism
Not really used for Machine Learning
AWS Data Pipeline
Destinations include S3, RDS, DynamoDB, Redshift and EMR
Manages task dependencies
Retries and notifies on failures
Data sources may be on-premises
AWS Batch
Run batch jobs via Docker images
Dynamic provisioning of the instances (EC2 & Spot Instances)
serverless
Schedule Batch Jobs using CloudWatch Events
Orchestrate Batch Jobs using AWS Step Functions
DMS – Database Migration Service
Continuous Data Replication using CDC
You must create a replication instance (EC2-based) to perform the replication tasks
Homogeneous migrations: ex Oracle to Oracle
Heterogeneous migrations: ex Microsoft SQL Server to Aurora
AWS Step Functions
Use to design workflows
Advanced Error Handling and Retry mechanism outside the (Lambda) code
Audit of the history of workflows
Ability to “Wait” for an arbitrary amount of time
Max execution time of a State Machine is 1 year
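The retry, error-handling, and “Wait” features above are declared in the workflow definition rather than in the Lambda code. Below is a minimal sketch of an Amazon States Language definition built as a Python dictionary. The Lambda ARN, state names, and timing values are hypothetical, and the helper is just a local sanity check, not part of any AWS SDK.

```python
import json

# Hypothetical Lambda ARN, for illustration only.
PROCESS_FN_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-batch"

state_machine = {
    "Comment": "Retry a task with backoff, wait, then finish.",
    "StartAt": "ProcessBatch",
    "States": {
        "ProcessBatch": {
            "Type": "Task",
            "Resource": PROCESS_FN_ARN,
            # Retries are handled by Step Functions, outside the Lambda code.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Anything still failing after the retries goes to a failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "CoolDown",
        },
        # Wait for an arbitrary amount of time (one hour here).
        "CoolDown": {"Type": "Wait", "Seconds": 3600, "Next": "Done"},
        "Done": {"Type": "Succeed"},
        "NotifyFailure": {
            "Type": "Fail",
            "Error": "BatchFailed",
            "Cause": "Retries exhausted",
        },
    },
}


def undefined_targets(definition):
    """Return state names that are referenced but not defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


print(json.dumps(state_machine, indent=2))
print(undefined_targets(state_machine))
```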
AWS DataSync
Move data from on-premises to AWS storage services
A DataSync Agent is deployed as a VM and connects to your internal storage (NFS, SMB, HDFS)
Encryption and data validation
MQTT
Standard messaging protocol, for IoT (Internet of Things)
Think of it as how lots of sensor data might get transferred to your machine learning model
The AWS IoT Device SDK can connect via MQTT