
S3
- Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf
- Storage Classes
- Amazon S3 Standard – General Purpose
- Amazon S3 Standard-Infrequent Access (IA)
- Amazon S3 One Zone-Infrequent Access
- data lost when AZ is destroyed
- Amazon S3 Glacier Instant Retrieval
- Millisecond retrieval
- Minimum storage duration of 90 days
- Amazon S3 Glacier Flexible Retrieval
- Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours)
- Minimum storage duration of 90 days
- Amazon S3 Glacier Deep Archive
- Standard (12 hours), Bulk (48 hours)
- Minimum storage duration of 180 days
- Amazon S3 Intelligent Tiering
- Moves objects automatically between Access Tiers based on usage
- There are no retrieval charges in S3 Intelligent-Tiering
- Can be managed with S3 Lifecycle rules

- Lifecycle Rules (see the boto3 sketch after this list)
- Transition Actions
- Expiration actions
- Can be used to delete old versions of files (if versioning is enabled)
- Can be used to delete incomplete Multi-Part uploads
- Enable S3 Versioning in order to have object versions, so that “deleted objects” are in fact hidden by a “delete marker” and can be recovered
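A minimal boto3 sketch of a lifecycle configuration that combines a transition action, expiration of old object versions, and cleanup of incomplete multi-part uploads. The bucket name, prefix, day counts, and storage classes are placeholder assumptions:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-training-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Transition actions: move objects to cheaper storage classes over time.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # Expiration actions: drop old object versions (requires versioning)
                # and abort incomplete multi-part uploads.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 60},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```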
- Amazon S3 Event Notifications allow you to receive notifications when certain events occur in your S3 bucket, such as object creation or deletion.
- S3 Analytics
- Helps you decide when to transition objects to the right storage class
- S3 Security
- User-Based (IAM Policies)
- Resource-Based (Bucket Policies)
- Grant public access to the bucket
- Force objects to be encrypted at upload
- Grant access to another account (Cross Account)
- Object Encryption
- Server-Side Encryption (SSE)
- Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3) – Default
- automatically applied to new objects stored in S3 bucket
- Encryption type is AES-256
- header “x-amz-server-side-encryption”: “AES256”
- Server-Side Encryption with KMS Keys stored in AWS KMS (SSE-KMS)
- KMS advantages: user control + audit key usage using CloudTrail
- header “x-amz-server-side-encryption”: “aws:kms”
- When you upload, it calls the GenerateDataKey KMS API
- When you download, it calls the Decrypt KMS API
- Server-Side Encryption with Customer-Provided Keys (SSE-C)
- Amazon S3 does NOT store the encryption key you provide
- HTTPS must be used
- Encryption key must be provided in HTTP headers, for every HTTP request made
- Client-Side Encryption
- Encryption in transit (SSL/TLS)
- aka “Encryption in flight”
- Amazon S3 exposes two endpoints:
- HTTP Endpoint – non encrypted
- HTTPS Endpoint – encryption in flight
- mandatory for SSE-C
- Force HTTPS by setting a Deny condition in the Bucket Policy on "aws:SecureTransport": "false" (see the sketch below)
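Two small boto3 sketches tying the encryption notes together: an upload with SSE-KMS (which sets the "x-amz-server-side-encryption: aws:kms" header) and a bucket policy that denies non-HTTPS requests via "aws:SecureTransport". The bucket name and KMS key alias are placeholder assumptions:

```python
import json
import boto3

s3 = boto3.client("s3")

# SSE-KMS upload: S3 calls GenerateDataKey on the KMS key on your behalf.
s3.put_object(
    Bucket="my-ml-bucket",              # hypothetical bucket
    Key="datasets/train.parquet",
    Body=b"example bytes",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-ml-key",      # hypothetical KMS key alias
)

# Bucket policy: deny any request not made over HTTPS (encryption in flight).
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": ["arn:aws:s3:::my-ml-bucket", "arn:aws:s3:::my-ml-bucket/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket="my-ml-bucket", Policy=json.dumps(policy))
```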

Kinesis Data Streams
- Collect and store streaming data in real-time
- data retention, data replication, and automatic load balancing
- Retention from 1 day up to 365 days
- Record size up to 1 MB (typical use case is a lot of "small" real-time data)
- Data ordering guarantee for records with the same "Partition Key"
- Capacity Modes
- Provisioned mode
- Each shard gets 1MB/s in (or 1000 records per second)
- Each shard gets 2MB/s out
- On-demand mode
- Default capacity provisioned (4 MB/s in or 4000 records per second)
- [ML] create real-time machine learning applications
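A minimal boto3 sketch of producing into a stream; records sharing a partition key land on the same shard, which is what gives the ordering guarantee. The stream name and payload are hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"sensor_id": "sensor-42", "temperature": 21.7}

# Records with the same PartitionKey go to the same shard, preserving their order.
kinesis.put_record(
    StreamName="ml-clickstream",             # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),  # up to 1 MB per record
    PartitionKey=event["sensor_id"],
)
```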
Amazon Data Firehose
- aka Kinesis Data Firehose
- Collect and store streaming data in real-time
- Near real-time; in other words, data is buffered and delivered in batches rather than processed record by record
- Custom data transformations using AWS Lambda
- [ML] ingest massive data near-real time
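A minimal boto3 sketch of pushing records into a Firehose delivery stream; Firehose buffers them (by size/time) and delivers them in batches to the configured destination, optionally running a Lambda transformation first. The delivery stream name and payload are hypothetical:

```python
import json
import boto3

firehose = boto3.client("firehose")

records = [
    {"Data": (json.dumps({"user_id": i, "action": "click"}) + "\n").encode("utf-8")}
    for i in range(100)
]

# Firehose buffers these records before writing them to the destination (e.g. S3).
firehose.put_record_batch(
    DeliveryStreamName="ml-ingest-stream",   # hypothetical delivery stream
    Records=records,
)
```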

Amazon Kinesis Data Analytics
- aka Amazon Managed Service for Apache Flink
- Apache Flink is an open-source distributed processing engine for stateful computations over data streams. It provides a high-performance runtime and a powerful stream processing API that supports stateful computations, event-time processing, and accurate fault-tolerance guarantees.
- real-time data transformations, filtering, and enrichment
- Flink does not read from Amazon Data Firehose
- Under the hood, AWS manages the Apache Flink clusters for you
- EC2 instances
- Apache Flink runtime
- Apache ZooKeeper
- Serverless from the user's point of view
- Common cases
- Streaming ETL
- Continuous metric generation
- Responsive analytics
- Use IAM permissions to access streaming source and destination(s)
- Schema discovery
- [ML] real-time ETL / ML algorithms on streams
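A minimal PyFlink Table API sketch of the streaming-ETL / continuous-metric use case: read a Kinesis Data Stream, aggregate, and write to a sink. The stream name, schema, and Kinesis connector options are assumptions and vary with the Flink/connector version packaged into the Managed Service for Apache Flink application:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table environment; Managed Service for Apache Flink runs this job for you.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kinesis Data Stream of click events (connector options are version-dependent).
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url     STRING
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'ml-clickstream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# Sink: the print connector keeps the sketch self-contained (real jobs write to Kinesis/S3/etc.).
t_env.execute_sql("""
    CREATE TABLE click_counts (user_id STRING, cnt BIGINT)
    WITH ('connector' = 'print')
""")

# Continuous metric generation: a running count of clicks per user.
t_env.execute_sql("""
    INSERT INTO click_counts
    SELECT user_id, COUNT(*) AS cnt FROM clicks GROUP BY user_id
""")
```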

Kinesis Video Streams
- Video playback capability
- Keep data for 1 hour to 10 years
- [ML] real-time video stream to create ML applications
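A minimal boto3 sketch of reading media from a video stream: fetch the stream-specific data endpoint, then call GetMedia against it. The stream name is hypothetical:

```python
import boto3

kvs = boto3.client("kinesisvideo")

# Media is read through a stream-specific data endpoint.
endpoint = kvs.get_data_endpoint(
    StreamName="ml-camera-feed",          # hypothetical video stream
    APIName="GET_MEDIA",
)["DataEndpoint"]

media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
response = media.get_media(
    StreamName="ml-camera-feed",
    StartSelector={"StartSelectorType": "NOW"},  # start reading live
)

# response["Payload"] streams MKV fragments you could feed to an ML / vision pipeline.
first_chunk = response["Payload"].read(1024)
```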

| | Data Streams | Data Firehose | Data Analytics (Amazon Managed Service for Apache Flink) | Video Streams |
|---|---|---|---|---|
| Short definition | Scalable and durable real-time data streaming service. | Capture, transform, and deliver streaming data into data lakes, data stores, and analytics services. | Transform and analyze streaming data in real time with Apache Flink. | Stream video from connected devices to AWS for analytics, machine learning, playback, and other processing. |
| Data sources | Any data source (servers, mobile devices, IoT devices, etc.) that can call the Kinesis API to send data. | Any data source (servers, mobile devices, IoT devices, etc.) that can call the Kinesis API to send data. | Amazon MSK, Amazon Kinesis Data Streams, servers, mobile devices, IoT devices, etc. | Any streaming device that supports the Kinesis Video Streams SDK. |
| Data consumers | Kinesis Data Analytics, Amazon EMR, Amazon EC2, AWS Lambda | Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, Datadog, New Relic, MongoDB, and Splunk | Analysis results can be sent to another Kinesis stream, a Firehose stream, or a Lambda function | Amazon Rekognition, Amazon SageMaker, Apache MXNet, TensorFlow, HLS-based media playback, custom media processing applications |
| Use cases | Log and event data collection; real-time analytics; mobile data capture; gaming data feeds | IoT analytics; clickstream analytics; log analytics; security monitoring | Streaming ETL; real-time analytics; stateful event processing | Smart technologies; video-related AI/ML; video processing |



Glue Data Catalog
- Metadata repository for all your tables
- Automated Schema Inference
- Schemas are versioned
- Integrates with Athena or Redshift Spectrum (schema & data discovery)
- Glue Crawlers can help build the Glue Data Catalog
- Works with JSON, Parquet, CSV, relational stores
- Crawlers work for: S3, Amazon Redshift, Amazon RDS
- Run the Crawler on a Schedule or On Demand
- Need an IAM role / credentials to access the data stores
- Glue crawler will extract partitions based on how your S3 data is organized
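A minimal boto3 sketch of creating and running a crawler that populates the Data Catalog from partitioned S3 data. The crawler name, IAM role ARN, database, S3 path, and schedule are placeholder assumptions:

```python
import boto3

glue = boto3.client("glue")

# The crawler infers the schema (and partitions) of the S3 data and registers
# or updates a table in the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="ml_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-ml-bucket/raw/sales/"}]},
    Schedule="cron(0 3 * * ? *)",  # run daily at 03:00 UTC; omit to run on demand only
)

glue.start_crawler(Name="sales-crawler")
```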
Glue ETL
- Extract, Transform, Load
- primarily used for batch data processing and not real-time data ingestion and processing
- Transform data, Clean Data, Enrich Data (before doing analysis)
- Bundled Transformations
- DropFields, DropNullFields – remove (null) fields
- Filter – specify a function to filter records
- Join – to enrich data
- Map – add fields, delete fields, perform external lookups
- Machine Learning Transformations
- FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
- Apache Spark transformations (example: K-Means)
- Jobs are run on a serverless Spark platform
- Glue Scheduler to schedule the jobs
- Glue Triggers to automate job runs based on “events”
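A minimal Glue job sketch using bundled transformations on a DynamicFrame. It assumes the awsglue library provided inside a Glue Spark job, plus a hypothetical Catalog table with an "amount" column and hypothetical S3 output path:

```python
# Runs inside a Glue Spark job, where the awsglue library is provided.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields, Filter

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Extract: read a table registered in the Glue Data Catalog (hypothetical names).
dyf = glue_ctx.create_dynamic_frame.from_catalog(database="ml_catalog", table_name="sales")

# Transform: bundled transformations to clean the data before analysis.
dyf = DropNullFields.apply(frame=dyf)
dyf = Filter.apply(frame=dyf, f=lambda row: row["amount"] > 0)

# Load: write the cleaned data back to S3 as Parquet.
glue_ctx.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-ml-bucket/clean/sales/"},
    format="parquet",
)
```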

| Format | Definition | Property | Usage |
|---|---|---|---|
| CSV | Unstructured | minimal, row-based | not suited to large-scale data |
| XML | Semi-structured | not row- nor column-based | not suited to large-scale data |
| JSON | Semi-structured | not row- nor column-based | |
| Parquet | Structured (columnar) | performance-oriented, column-based | large datasets (analytical queries), with data compression and encoding algorithms |
| Avro-RecordIO | Structured | performance-oriented, row-based | large datasets (streaming, event data) |
| grokLog | Structured | | |
| Ion | Structured | | |
| ORC | Structured | performance-oriented, column-based | |
| Feature | Avro | Parquet |
|---|---|---|
| Storage Format | Row-based (stores entire records sequentially) | Columnar-based (stores data by columns) |
| Best For | Streaming, event data, schema evolution | Analytical queries, big data analytics |
| Read Performance | Slower for analytics since entire rows must be read | Faster for analytics as only required columns are read |
| Write Performance | Faster – appends entire rows quickly | Slower – columnar storage requires additional processing |
| Query Efficiency | Inefficient for analytical queries due to row-based structure | Highly efficient for analytical queries since only required columns are scanned |
| File Size | Generally larger due to row-based storage | Smaller file sizes due to better compression techniques |
| Use Cases | Event-driven architectures, Kafka messaging systems, log storage | Data lakes, data warehouses, ETL processes, analytical workloads |
| Processing Frameworks | Works well with Apache Kafka, Hadoop, Spark | Optimized for Apache Spark, Hive, Presto, Snowflake |
| Support for Nested Data | Supports nested data, but requires schema definition | Optimized for nested structures, making it better suited for hierarchical data |
| Interoperability | Widely used in streaming platforms | Preferred for big data processing and analytical workloads |
| Primary Industry Adoption | Streaming platforms, logging, real-time pipelines | Data warehousing, analytics, business intelligence |
AWS Glue DataBrew
- Allows you to clean and normalize data without writing any code
- Reduces ML and analytics data preparation time by up to 80%
- features
- Transformations, such as filtering rows, replacing values, splitting and combining columns; or applying NLP to split sentences into phrases.
- Data Formats and Data Sources
- Job and Scheduling
- Security
- Integration
- Components
- Project
- Dataset
- Recipe
- Job
- Data Lineage
- Data Profile
AWS Data Stores for Machine Learning
- Redshift
- Data Warehousing, SQL analytics (OLAP – Online analytical processing)
- Load data from S3 to Redshift
- Use Redshift Spectrum to query data directly in S3 (no loading)
- RDS, Aurora
- Relational Store, SQL (OLTP – Online Transaction Processing)
- Must provision servers in advance
- DynamoDB
- NoSQL data store, serverless, provision read/write capacity
- Useful to store a machine learning model served by your application
- S3
- Object storage
- Serverless, infinite storage
- Integration with most AWS Services
- OpenSearch (previously Amazon Elasticsearch Service)
- Indexing of data
- Search amongst data points
- Clickstream Analytics
- ElastiCache
- Caching mechanism
- Not really used for Machine Learning
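A minimal boto3 sketch of the DynamoDB pattern noted above, serving precomputed model output to an application. The table name, attributes, and "user_id" partition key are hypothetical:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ModelPredictions")   # hypothetical table, partition key "user_id"

# Store a precomputed prediction (kept as a string to sidestep Decimal handling).
table.put_item(Item={"user_id": "u-123", "churn_score": "0.87", "model_version": "v3"})

# Serve it back to the application with low-latency key lookups.
item = table.get_item(Key={"user_id": "u-123"})["Item"]
print(item["churn_score"])
```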
AWS Data Pipeline
- Destinations include S3, RDS, DynamoDB, Redshift and EMR
- Manages task dependencies
- Retries and notifies on failures
- Data sources may be on-premises


AWS Batch
- Run batch jobs via Docker images
- Dynamic provisioning of the instances (EC2 & Spot Instances)
- serverless
- Schedule Batch Jobs using CloudWatch Events (Amazon EventBridge)
- Orchestrate Batch Jobs using AWS Step Functions
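A minimal boto3 sketch of submitting a containerized job to an existing job queue and job definition. All names, the command, and the environment variable are placeholder assumptions:

```python
import boto3

batch = boto3.client("batch")

# The job definition points at a Docker image; AWS Batch provisions the compute.
batch.submit_job(
    jobName="nightly-feature-build",
    jobQueue="ml-batch-queue",
    jobDefinition="feature-builder:3",
    containerOverrides={
        "command": ["python", "build_features.py", "--date", "2024-01-01"],
        "environment": [{"name": "OUTPUT_BUCKET", "value": "my-ml-bucket"}],
    },
)
```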



DMS – Database Migration Service
- Continuous Data Replication using CDC
- You must create a replication instance (an EC2 instance) to perform the replication tasks
- Homogeneous migrations: ex Oracle to Oracle
- Heterogeneous migrations: ex Microsoft SQL Server to Aurora


AWS Step Functions
- Use to design workflows
- Advanced Error Handling and Retry mechanism outside the (Lambda) code
- Audit of the history of workflows
- Ability to “Wait” for an arbitrary amount of time
- Max execution time of a State Machine is 1 year
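A minimal boto3 sketch of a one-state workflow that keeps retry logic in the state machine rather than in the Lambda code. The Lambda function ARN, role ARN, and state machine name are placeholders:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "Preprocess",
    "States": {
        "Preprocess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
            # Error handling and retries live outside the Lambda code.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="ml-preprocessing-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # hypothetical role
)
```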
AWS DataSync
- Moves data from on-premises storage into AWS storage services (e.g., S3, EFS, FSx)
- A DataSync Agent is deployed as a VM and connects to your internal storage (NFS, SMB, HDFS)
- Encryption and data validation
MQTT
- Standard messaging protocol, for IoT (Internet of Things)
- Think of it as how lots of sensor data might get transferred to your machine learning model
- The AWS IoT Device SDK can connect via MQTT
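A minimal sketch of publishing one sensor reading over MQTT with the generic paho-mqtt client; against AWS IoT Core you would typically use the AWS IoT Device SDK over TLS instead. The broker host, topic, and payload are hypothetical:

```python
import json

import paho.mqtt.publish as publish

reading = {"sensor_id": "sensor-42", "temperature": 21.7}

# Publish one reading; a downstream pipeline could route such messages into
# Kinesis, S3, or a model endpoint for inference.
publish.single(
    topic="factory/line1/temperature",
    payload=json.dumps(reading),
    qos=1,
    hostname="broker.example.com",  # hypothetical MQTT broker
    port=1883,
)
```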
Amazon Keyspaces (for Apache Cassandra)
- a scalable, highly available, and managed Apache Cassandra–compatible database service
| | Apache Cassandra | MongoDB |
|---|---|---|
| Data model | Cassandra uses a wide-column data model more closely related to relational databases. | MongoDB moves completely away from the relational model by storing data as documents. |
| Basic storage unit | Sorted string tables. | Serialized JSON documents. |
| Indexing | Cassandra supports secondary indexes and SASI to index by column or columns. | MongoDB indexes at a collection level and field level and offers multiple indexing options. |
| Query language | Cassandra uses CQL. | MongoDB uses MQL. |
| Concurrency | Cassandra achieves concurrency with row-level atomicity and tunable consistency. | MongoDB uses MVCC and document-level locking to ensure concurrency. |
| Availability | Cassandra has multiple master nodes, node partitioning, and key replication to offer high availability. | MongoDB uses a single primary node and multiple replica nodes. Combined with sharding, MongoDB provides high availability and scalability. |
| Partitioning | Consistent hashing algorithm, less control for users. | Users define sharding keys and have more control over partitioning. |
| AWS counterpart | Amazon Keyspaces | Amazon DynamoDB |
Aspect | Flink | Spark | Kafka |
---|---|---|---|
Type | Hybrid (batch and stream) | Hybrid (batch and stream) | Stream-only |
Support for 3rd party systems | Multiple source and sink | Yes (Kafka, HDFS, Cassandra, etc.) | Tightly coupled with Kafka (Kafka Connect) |
Stateful | Yes (RocksDB) | Yes (with checkpointing) | Yes (with Kafka Streams, RocksDB) |
Complex event processing | Yes (native support) | Yes (with Spark Structured Streaming) | No (developer needs to handle) |
Streaming window | Tumbling, Sliding, Session, Count | Time-based and count-based | Tumbling, Hopping/Sliding, Session |
Data Processing | Batch/Stream (native) | Batch/Stream (micro Batch) | Stream-only |
Iterations | Supports iterative algorithms natively | Supports iterative algorithms with micro-batches | No |
SQL | Table, SQL API | Spark SQL | Supports SQL queries on streaming data with Kafka SQL API (KSQL) |
Optimization | Auto (data flow graph and the available resources) | Manual (directed acyclic graph (DAG) and the available resources) | No native support |
State Backend | Memory, file system, RocksDB or custom backends | Memory, file system, HDFS or custom backends | Memory, file system, RocksDB or custom backends |
Language | Java, Scala, Python and SQL APIs | Java, Scala, Python, R, C#, F# and SQL APIs | Java, Scala and SQL APIs |
Geo-distribution | Flink Stateful Functions API | No native support | Kafka MirrorMaker tool |
Latency | Streaming: very low latency (milliseconds) | Micro-batching: near real-time latency (seconds) | Log-based: very low latency (milliseconds) |
Data model | True streaming with bounded and unbounded data sets | Micro-batching with RDDs and DataFrames | Log-based streaming |
Processing engine | One unified engine for batch and stream processing | Separate engines for batch (Spark Core) and stream processing (Spark Streaming) | Stream processing only |
Delivery guarantees | Exactly-once for both batch and stream processing | Exactly-once for batch processing, at-least-once for stream processing | At-least-once |
Throughput | High throughput due to pipelined execution and in-memory caching | High throughput due to in-memory caching and parallel processing | High throughput due to log compaction and compression |
State management | Rich support for stateful operations with various state backends and time semantics | Limited support for stateful operations with mapWithState and updateStateByKey functions | No native support for stateful operations, rely on external databases or Kafka Streams API |
Machine learning support | Yes (Flink ML library) | Yes (Spark MLlib library) | No (use external libraries like TensorFlow or H2O) |
Architecture | True streaming engine that treats batch as a special case of streaming with bounded data. Uses a streaming dataflow model that allows for more optimization than Spark’s DAG model. | Batch engine that supports streaming as micro-batching (processing small batches of data at regular intervals). Uses a DAG model that divides the computation into stages and tasks. | Stream engine that acts as both a message broker and a stream processor. Uses a log model that stores and processes records as an ordered sequence of events. |
Delivery Guarantees | Supports exactly-once processing semantics by using checkpoints and state snapshots. Also supports at-least-once and at-most-once semantics. | Supports at-least-once processing semantics by using checkpoints and write-ahead logs. Can achieve exactly-once semantics for some output sinks by using idempotent writes or transactions. | Supports exactly-once processing semantics by using transactions and idempotent producers. Also supports at-least-once and at-most-once semantics. |
Performance | Achieves high performance and low latency by using in-memory processing, pipelined execution, incremental checkpoints, network buffers, and operator chaining. Also supports batch and iterative processing modes for higher throughput. | Achieves high performance and low latency by using in-memory processing, lazy evaluation, RDD caching, and code generation. However, micro-batching introduces some latency overhead compared to true streaming engines. | Achieves high performance and low latency by using log compaction, zero-copy transfer, batch compression, and client-side caching. However, Kafka does not support complex stream processing operations natively. |