
S3
- Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf
- Storage Classes
- Amazon S3 Standard – General Purpose
- Amazon S3 Standard-Infrequent Access (IA)
- Amazon S3 One Zone-Infrequent Access
- data lost when AZ is destroyed
- Amazon S3 Glacier Instant Retrieval
- Millisecond retrieval
- Minimum storage duration of 90 days
- Amazon S3 Glacier Flexible Retrieval
- Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours)
- Minimum storage duration of 90 days
- Amazon S3 Glacier Deep Archive
- Standard (12 hours), Bulk (48 hours)
- Minimum storage duration of 180 days
- Amazon S3 Intelligent Tiering
- Moves objects automatically between Access Tiers based on usage
- There are no retrieval charges in S3 Intelligent-Tiering
- Can be managed with S3 Lifecycle

- Lifecycle Rules
- Transition Actions
- Expiration actions
- Can be used to delete old versions of files (if versioning is enabled)
- Can be used to delete incomplete Multi-Part uploads
- Enable S3 Versioning in order to have object versions, so that “deleted objects” are in fact hidden by a “delete marker” and can be recovered
- Amazon S3 Event Notifications allow you to receive notifications when certain events occur in your S3 bucket, such as object creation or deletion.
- S3 Analytics
- decide when to transition objects to the right storage class
- S3 Security
- User-Based (IAM Policies)
- Resource-Based (Bucket Policies)
- Grant public access to the bucket
- Force objects to be encrypted at upload
- Grant access to another account (Cross Account)
- Object Encryption
- Server-Side Encryption (SSE)
- Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3) – Default
- automatically applied to new objects stored in the S3 bucket
- Encryption type is AES-256
- header “x-amz-server-side-encryption”: “AES256”
- Server-Side Encryption with KMS Keys stored in AWS KMS (SSE-KMS)
- KMS advantages: user control + audit key usage using CloudTrail
- header “x-amz-server-side-encryption”: “aws:kms”
- When you upload, it calls the GenerateDataKey KMS API
- When you download, it calls the Decrypt KMS API
- Server-Side Encryption with Customer-Provided Keys (SSE-C)
- Amazon S3 does NOT store the encryption key you provide
- HTTPS must be used
- Encryption key must be provided in HTTP headers, for every HTTP request made
- Client-Side Encryption
- Encryption in transit (SSL/TLS)
- aka “Encryption in flight”
- Amazon S3 exposes two endpoints:
- HTTP Endpoint – non encrypted
- HTTPS Endpoint – encryption in flight
- mandatory for SSE-C
- Set a Condition in the Bucket Policy with “aws:SecureTransport” to deny non-HTTPS requests (see the policy sketch at the end of this section)
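To make the bucket-policy items above concrete (“force objects to be encrypted at upload”, “aws:SecureTransport”), here is a minimal boto3 sketch; the bucket name is hypothetical and the two statements are illustrative, not a complete security baseline:

```python
import json

import boto3

s3 = boto3.client("s3")
bucket = "my-ml-data-bucket"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Deny uploads that are not encrypted with SSE-KMS
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        },
        {
            # Deny any request made over plain HTTP (encryption in flight)
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```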
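Similarly, a minimal boto3 sketch of the Lifecycle Rules above (transition actions, expiration actions, cleaning up noncurrent versions and incomplete multipart uploads); the bucket name, prefix, and day counts are made up:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and thresholds -- adjust to your retention requirements
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-and-clean-up",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                # Transition Actions
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # Expiration actions (requires versioning for noncurrent versions)
                "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```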

Kinesis Data Streams
- Collect and store streaming data in real-time
- data retention, data replication, and automatic load balancing
- Retention from 1 day (default) up to 365 days
- Data up to 1MB per record (typical use case is lots of “small” real-time data)
- Data ordering guarantee for data with the same “Partition Key”
- Capacity Modes
- Provisioned mode
- Each shard gets 1MB/s in (or 1000 records per second)
- Each shard gets 2MB/s out
- On-demand mode
- Default capacity provisioned (4 MB/s in or 4000 records per second)
- [ML] create real-time machine learning applications
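A minimal producer sketch with boto3 (stream name and payload are hypothetical); records that share a partition key go to the same shard, which is what provides the per-key ordering guarantee noted above:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

record = {"device_id": "sensor-42", "temperature": 21.7}

# Records with the same PartitionKey are routed to the same shard,
# so their relative order is preserved for consumers.
kinesis.put_record(
    StreamName="ml-telemetry-stream",             # hypothetical stream
    Data=json.dumps(record).encode("utf-8"),      # payload must stay <= 1 MB
    PartitionKey=record["device_id"],
)
```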
Amazon Data Firehose
- aka Kinesis Data Firehose
- Collect and store streaming data in real-time
- Near real-time; in other words, suitable for batch processing
- Custom data transformations using AWS Lambda
- [ML] ingest massive data near-real time
- Kinesis Data Streams is for when you want to control and manage the flow of data yourself in real time, while Amazon Data Firehose is for when you want the data to be automatically processed near real time and delivered to a specific destination (Redshift, S3, Splunk, etc.)
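For comparison, a minimal Amazon Data Firehose producer sketch (the delivery stream name is hypothetical); Firehose buffers records and delivers them to the configured destination, such as S3, in near real time:

```python
import json

import boto3

firehose = boto3.client("firehose")

event = {"user_id": "u-123", "action": "add_to_cart"}

# Firehose buffers records (by size or time) and writes the batch
# to the configured destination (S3, Redshift, OpenSearch, Splunk, ...)
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical delivery stream
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```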



Amazon Kinesis Data Analytics
- aka Amazon Managed Service for Apache Flink
- Apache Flink is an open-source distributed processing engine for stateful computations over data streams. It provides a high-performance runtime and a powerful stream processing API that supports stateful computations, event-time processing, and accurate fault-tolerance guarantees.
- real-time data transformations, filtering, and enrichment
- Flink does not read from Amazon Data Firehose
- AWS manages the underlying Apache Flink clusters for you:
- EC2 instances
- Apache Flink runtime
- Apache ZooKeeper
- Serverless from the user’s point of view
- Common cases
- Streaming ETL
- Continuous metric generation
- Responsive analytics
- Use IAM permissions to access streaming source and destination(s)
- Schema discovery
- [ML] real-time ETL / ML algorithms on streams
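As a sketch of the streaming ETL / continuous metric generation use case, a small PyFlink Table API program. It assumes the Flink Kinesis SQL connector is available in the runtime; the stream name, fields, and connector option names are illustrative and may vary by connector version, and in the managed service this would be packaged and deployed as a Flink application rather than run as a script:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kinesis data stream of click events (names/options are assumptions)
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url STRING,
        proctime AS PROCTIME()
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'clickstream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# Continuous metric generation: clicks per user per 1-minute tumbling window
t_env.execute_sql("""
    SELECT user_id,
           TUMBLE_START(proctime, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id, TUMBLE(proctime, INTERVAL '1' MINUTE)
""").print()
```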

Kinesis Video Streams
- Video playback capability
- Keep data for 1 hour to 10 years
- [ML] real-time video stream to create ML applications
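A sketch of reading media from a video stream with boto3 (the stream name is hypothetical): first resolve the stream’s data endpoint, then call GetMedia against it:

```python
import boto3

STREAM = "factory-camera-01"  # hypothetical stream name

# 1) Ask Kinesis Video Streams which endpoint serves GET_MEDIA for this stream
kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(StreamName=STREAM, APIName="GET_MEDIA")["DataEndpoint"]

# 2) Read the live media (MKV fragments) from that endpoint
media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
response = media.get_media(
    StreamName=STREAM,
    StartSelector={"StartSelectorType": "NOW"},
)
chunk = response["Payload"].read(1024 * 1024)  # raw MKV bytes for downstream ML
```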

| Data Streams | Data Firehose | Data Analytics (Amazon Managed Service for Apache Flink) | Video Streams |
Short definition | Scalable and durable real-time data streaming service. | Capture, transform, and deliver streaming data into data lakes, data stores, and analytics services. | Transform and analyze streaming data in real time with Apache Flink. | Stream video from connected devices to AWS for analytics, machine learning, playback, and other processing. |
Data sources | Any data source (servers, mobile devices, IoT devices, etc) that can call the Kinesis API to send data. | Any data source (servers, mobile devices, IoT devices, etc) that can call the Kinesis API to send data. | Amazon MSK, Amazon Kinesis Data Streams, servers, mobile devices, IoT devices, etc. | Any streaming device that supports Kinesis Video Streams SDK. |
Data consumers | Kinesis Data Analytics, Amazon EMR, Amazon EC2, AWS Lambda | Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, Datadog, New Relic, MongoDB, and Splunk | Analysis results can be sent to another Kinesis stream, a Firehose stream, or a Lambda function | Amazon Rekognition, Amazon SageMaker, MxNet, TensorFlow, HLS-based media playback, custom media processing application |
Use cases | – Log and event data collection – Real-time analytics – Mobile data capture – Gaming data feed | – IoT Analytics – Clickstream Analytics – Log Analytics – Security monitoring | – Streaming ETL – Real-time analytics – Stateful event processing | – Smart technologies – Video-related AI/ML – Video processing |



Glue Data Catalog
- Metadata repository for all your tables
- Automated Schema Inference
- Schemas are versioned
- Integrates with Athena or Redshift Spectrum (schema & data discovery)
- Glue Crawlers can help build the Glue Data Catalog
- Works with JSON, Parquet, CSV, relational stores
- Crawlers work for: S3, Amazon Redshift, Amazon RDS
- Run the Crawler on a Schedule or On Demand
- Need an IAM role / credentials to access the data stores
- Glue crawler will extract partitions based on how your S3 data is organized
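A minimal sketch of creating and running a crawler with boto3 (crawler name, IAM role, database, and S3 path are all hypothetical); the crawler infers the schema and registers or updates tables in the Glue Data Catalog:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- the role must be able to read the S3 path
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-ml-data-bucket/raw/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: run nightly; omit to run on demand
)

# Run it immediately (on demand)
glue.start_crawler(Name="sales-data-crawler")
```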
Glue ETL
- Extract, Transform, Load
- primarily used for batch data processing and not real-time data ingestion and processing
- Transform data, Clean Data, Enrich Data (before doing analysis)
- Bundled Transformations
- DropFields, DropNullFields – remove (null) fields
- Filter – specify a function to filter records
- Join – to enrich data
- Map – add fields, delete fields, perform external lookups
- Machine Learning Transformations
- FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
- Apache Spark transformations (example: K-Means)
- Jobs are run on a serverless Spark platform
- Glue Scheduler to schedule the jobs
- Glue Triggers to automate job runs based on “events”
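A sketch of a Glue ETL job script that uses the bundled transformations named above (DropNullFields, Filter); the catalog database/table, field names, and output path are hypothetical, and the script runs inside the Glue Spark job environment rather than locally:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields, Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Glue Data Catalog (hypothetical names)
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_catalog", table_name="raw_sales"
)

# Bundled transformations: drop null fields, then keep only completed orders
cleaned = DropNullFields.apply(frame=sales)
completed = Filter.apply(frame=cleaned, f=lambda row: row["status"] == "COMPLETED")

# Write the result back to S3 as Parquet for analytics / ML training
glue_context.write_dynamic_frame.from_options(
    frame=completed,
    connection_type="s3",
    connection_options={"path": "s3://my-ml-data-bucket/curated/sales/"},
    format="parquet",
)

job.commit()
```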

Format | Structure | Properties | Usage |
CSV | Unstructured | minimal, row-based | not suited to large-scale data |
XML | Semi-structured | neither row- nor column-based | not suited to large-scale data |
JSON | Semi-structured | neither row- nor column-based | |
JSON Lines (JSONL) | Structured | performance-oriented, row-based | large datasets (streaming, event data) |
Parquet | Structured (columnar) | performance-oriented, column-based | large datasets (analytical queries), with data compression and encoding algorithms |
Avro-RecordIO | Structured | performance-oriented, row-based | large datasets (streaming, event data) |
grokLog | Structured | | |
Ion | Structured | | |
ORC | Structured | performance-oriented, column-based | |
Feature | Avro | Parquet |
Storage Format | Row-based (stores entire records sequentially) | Columnar-based (stores data by columns) |
Best For | Streaming, event data, schema evolution | Analytical queries, big data analytics |
Read Performance | Slower for analytics since entire rows must be read | Faster for analytics as only required columns are read |
Write Performance | Faster – appends entire rows quickly | Slower – columnar storage requires additional processing |
Query Efficiency | Inefficient for analytical queries due to row-based structure | Highly efficient for analytical queries since only required columns are scanned |
File Size | Generally larger due to row-based storage | Smaller file sizes due to better compression techniques |
Use Cases | Event-driven architectures, Kafka messaging systems, log storage | Data lakes, data warehouses, ETL processes, analytical workloads |
Processing Frameworks | Works well with Apache Kafka, Hadoop, Spark | Optimized for Apache Spark, Hive, Presto, Snowflake |
Support for Nested Data | Supports nested data, but requires schema definition | Optimized for nested structures, making it better suited for hierarchical data |
Interoperability | Widely used in streaming platforms | Preferred for big data processing and analytical workloads |
Primary Industry Adoption | Streaming platforms, logging, real-time pipelines | Data warehousing, analytics, business intelligence |
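To make the row-based vs column-based distinction concrete, a small pandas sketch (file names and values are made up; writing Parquet assumes pyarrow or fastparquet is installed): the same data is written as CSV and Parquet, and an analytical read pulls only the column it needs from the Parquet file:

```python
import pandas as pd

# A small example dataset (hypothetical values)
df = pd.DataFrame({
    "user_id": range(1_000),
    "country": ["US", "DE", "JP", "BR"] * 250,
    "purchase_amount": [19.99] * 1_000,
})

# Row-based, plain text: every query has to read every byte
df.to_csv("purchases.csv", index=False)

# Columnar and compressed: suited to analytical scans
df.to_parquet("purchases.parquet", index=False)

# Analytical queries can read just the columns they need
amounts = pd.read_parquet("purchases.parquet", columns=["purchase_amount"])
print(amounts.mean())
```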
AWS Glue DataBrew
- Allows you to clean and normalize data without writing any code
- Reduces ML and analytics data preparation time by up to 80%
- features
- Transformations, such as filtering rows, replacing values, splitting and combining columns; or applying NLP to split sentences into phrases.
- Data Formats and Data Sources
- Job and Scheduling
- Security
- Integration
- Components
- Project
- Dataset
- Recipe
- Job
- Data Lineage
- Data Profile
AWS Data Stores for Machine Learning
- Redshift
- Data Warehousing, SQL analytics (OLAP – Online analytical processing)
- Load data from S3 to Redshift
- Use Redshift Spectrum to query data directly in S3 (no loading)
- RDS, Aurora
- Relational Store, SQL (OLTP – Online Transaction Processing)
- Must provision servers in advance
- DynamoDB
- NoSQL data store, serverless, provision read/write capacity
- Useful to store a machine learning model served by your application
- S3
- Object storage
- Serverless, infinite storage
- Integration with most AWS Services
- OpenSearch (previously ElasticSearch)
- Indexing of data
- Search amongst data points
- Clickstream Analytics
- ElastiCache
- Caching mechanism
- Not really used for Machine Learning
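As a sketch of the DynamoDB bullet above (serving model outputs to an application), a minimal boto3 example; the table name and attributes are hypothetical and assume a table with partition key user_id already exists:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("churn-predictions")  # hypothetical table, partition key "user_id"

# Write a model output so the application can serve it with low-latency reads
table.put_item(
    Item={
        "user_id": "u-123",
        "churn_probability": "0.87",  # stored as a string; the resource API does not accept float (use Decimal for numbers)
        "model_version": "2024-06-01",
    }
)

# Read it back at serving time
item = table.get_item(Key={"user_id": "u-123"}).get("Item")
print(item)
```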
AWS Data Pipeline
- Destinations include S3, RDS, DynamoDB, Redshift and EMR
- Manages task dependencies
- Retries and notifies on failures
- Data sources may be on-premises


AWS Batch
- Run batch jobs via Docker images
- Dynamic provisioning of the instances (EC2 & Spot Instances)
- serverless
- Schedule Batch Jobs using CloudWatch Events
- Orchestrate Batch Jobs using AWS Step Functions
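A minimal sketch of submitting a containerized job with boto3; the job queue, job definition, and command are hypothetical and assume they have already been created in AWS Batch:

```python
import boto3

batch = boto3.client("batch")

# Job queue and job definition (a registered Docker image) are hypothetical
response = batch.submit_job(
    jobName="nightly-model-training",
    jobQueue="ml-training-queue",
    jobDefinition="train-churn-model:3",
    containerOverrides={
        "command": ["python", "train.py", "--epochs", "10"],
    },
)
print(response["jobId"])
```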



AWS DMS – Database Migration Service
- Continuous Data Replication using CDC
- You must create an EC2 instance to perform the replication tasks
- Homogeneous migrations: ex Oracle to Oracle
- Heterogeneous migrations: ex Microsoft SQL Server to Aurora


AWS Step Functions
- Use to design workflows
- Advanced Error Handling and Retry mechanism outside the (Lambda) code
- Audit of the history of workflows
- Ability to “Wait” for an arbitrary amount of time
- Max execution time of a State Machine is 1 year
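A sketch of a small state machine that demonstrates the Wait state and retries; the Lambda and IAM role ARNs are placeholders, and the Amazon States Language definition is passed as JSON to boto3:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Amazon States Language: wait 60 seconds, then invoke a (placeholder) Lambda with retries
definition = {
    "StartAt": "WaitABit",
    "States": {
        "WaitABit": {"Type": "Wait", "Seconds": 60, "Next": "TrainModel"},
        "TrainModel": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:train-model",
            "Retry": [
                {"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30, "MaxAttempts": 3}
            ],
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="ml-training-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",  # placeholder
)
```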
AWS DataSync
- on-premises -> AWS storage services
- A DataSync Agent is deployed as a VM and connects to your internal storage (NFS, SMB, HDFS)
- Encryption and data validation
MQTT
- Standard messaging protocol, for IoT (Internet of Things)
- Think of it as how lots of sensor data might get transferred to your machine learning model
- The AWS IoT Device SDK can connect via MQTT
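A sketch of publishing a sensor reading over MQTT with the v1 AWS IoT Device SDK for Python (AWSIoTPythonSDK); the endpoint, certificate paths, and topic are placeholders:

```python
import json

from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient

# Placeholders: your AWS IoT endpoint, root CA, device private key and certificate
client = AWSIoTMQTTClient("sensor-42")
client.configureEndpoint("xxxxxxxx-ats.iot.us-east-1.amazonaws.com", 8883)
client.configureCredentials("AmazonRootCA1.pem", "sensor-42.private.key", "sensor-42.cert.pem")

client.connect()

# Publish a reading with QoS 1 (at least once); an IoT rule can route it onward (e.g. to Kinesis or S3) for ML
reading = {"device_id": "sensor-42", "temperature": 21.7}
client.publish("factory/sensors/temperature", json.dumps(reading), 1)

client.disconnect()
```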
Amazon Keyspaces (for Apache Cassandra)
- a scalable, highly available, and managed Apache Cassandra–compatible database service
| Apache Cassandra | MongoDB |
Data model | Cassandra uses a wide-column data model more closely related to relational databases. | MongoDB moves completely away from the relational model by storing data as documents. |
Basic storage unit | Sorted string tables. | Serialized JSON documents. |
Indexing | Cassandra supports secondary indexes and SASI to index by column or columns. | MongoDB indexes at a collection level and field level and offers multiple indexing options. |
Query language | Cassandra uses CQL. | MongoDB uses MQL. |
Concurrency | Cassandra achieves concurrency with row-level atomicity and tunable consistency. | MongoDB uses MVCC and document-level locking to ensure concurrency. |
Availability | Cassandra has multiple master nodes, node partitioning, and key replication to offer high availability. | MongoDB uses a single primary node and multiple replica nodes. Combined with sharding, MongoDB provides high availability and scalability. |
Partitioning | Consistent hashing algorithm, less control to users. | Users define sharding keys and have more control over partitioning. |
AWS equivalent | Amazon Keyspaces | Amazon DynamoDB |
Aspect | Flink | Spark | Kafka |
---|---|---|---|
Type | Hybrid (batch and stream) | Hybrid (batch and stream) | Stream-only |
Support for 3rd party systems | Multiple sources and sinks | Yes (Kafka, HDFS, Cassandra, etc.) | Tightly coupled with Kafka (Kafka Connect) |
Stateful | Yes (RocksDB) | Yes (with checkpointing) | Yes (with Kafka Streams, RocksDB) |
Complex event processing | Yes (native support) | Yes (with Spark Structured Streaming) | No (developer needs to handle) |
Streaming window | Tumbling, Sliding, Session, Count | Time-based and count-based | Tumbling, Hopping/Sliding, Session |
Data Processing | Batch/Stream (native) | Batch/Stream (micro Batch) | Stream-only |
Iterations | Supports iterative algorithms natively | Supports iterative algorithms with micro-batches | No |
SQL | Table, SQL API | Spark SQL | Supports SQL queries on streaming data with Kafka SQL API (KSQL) |
Optimization | Auto (data flow graph and the available resources) | Manual (directed acyclic graph (DAG) and the available resources) | No native support |
State Backend | Memory, file system, RocksDB or custom backends | Memory, file system, HDFS or custom backends | Memory, file system, RocksDB or custom backends |
Language | Java, Scala, Python and SQL APIs | Java, Scala, Python, R, C#, F# and SQL APIs | Java, Scala and SQL APIs |
Geo-distribution | Flink Stateful Functions API | No native support | Kafka MirrorMaker tool |
Latency | Streaming: very low latency (milliseconds) | Micro-batching: near real-time latency (seconds) | Log-based: very low latency (milliseconds) |
Data model | True streaming with bounded and unbounded data sets | Micro-batching with RDDs and DataFrames | Log-based streaming |
Processing engine | One unified engine for batch and stream processing | Separate engines for batch (Spark Core) and stream processing (Spark Streaming) | Stream processing only |
Delivery guarantees | Exactly-once for both batch and stream processing | Exactly-once for batch processing, at-least-once for stream processing | At-least-once |
Throughput | High throughput due to pipelined execution and in-memory caching | High throughput due to in-memory caching and parallel processing | High throughput due to log compaction and compression |
State management | Rich support for stateful operations with various state backends and time semantics | Limited support for stateful operations with mapWithState and updateStateByKey functions | No native support for stateful operations, rely on external databases or Kafka Streams API |
Machine learning support | Yes (Flink ML library) | Yes (Spark MLlib library) | No (use external libraries like TensorFlow or H2O) |
Architecture | True streaming engine that treats batch as a special case of streaming with bounded data. Uses a streaming dataflow model that allows for more optimization than Spark’s DAG model. | Batch engine that supports streaming as micro-batching (processing small batches of data at regular intervals). Uses a DAG model that divides the computation into stages and tasks. | Stream engine that acts as both a message broker and a stream processor. Uses a log model that stores and processes records as an ordered sequence of events. |
Delivery Guarantees | Supports exactly-once processing semantics by using checkpoints and state snapshots. Also supports at-least-once and at-most-once semantics. | Supports at-least-once processing semantics by using checkpoints and write-ahead logs. Can achieve exactly-once semantics for some output sinks by using idempotent writes or transactions. | Supports exactly-once processing semantics by using transactions and idempotent producers. Also supports at-least-once and at-most-once semantics. |
Performance | Achieves high performance and low latency by using in-memory processing, pipelined execution, incremental checkpoints, network buffers, and operator chaining. Also supports batch and iterative processing modes for higher throughput. | Achieves high performance and low latency by using in-memory processing, lazy evaluation, RDD caching, and code generation. However, micro-batching introduces some latency overhead compared to true streaming engines. | Achieves high performance and low latency by using log compaction, zero-copy transfer, batch compression, and client-side caching. However, Kafka does not support complex stream processing operations natively. |

AWS Glue can be used to extract and combine data from various sources like Amazon RDS databases, Amazon DynamoDB, and other data stores, creating a unified dataset for training the machine learning model. This combined dataset can then be stored in Amazon S3, a cost-effective and scalable object storage service, making it accessible for both training and inference purposes. For analyzing large volumes of purchase history data, which is crucial for understanding customer behavior, the company can utilize Amazon EMR, a managed platform for distributed processing frameworks such as Apache Spark and Hadoop, which enables efficient processing of big data. Additionally, to incorporate real-time social media activity data into the model, Amazon Data Firehose can be employed to ingest, transform, and load this streaming data directly into Amazon S3, where it can be combined with the other datasets. By leveraging these AWS services in tandem, the e-commerce company can streamline the entire data processing pipeline, from data ingestion and storage to analysis and model training, ultimately enhancing the accuracy and effectiveness of their customer churn prediction model.