11. ML – Data Engineering

S3

  • Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf
  • Storage Classes
    • Amazon S3 Standard – General Purpose
    • Amazon S3 Standard-Infrequent Access (IA)
    • Amazon S3 One Zone-Infrequent Access
      • data lost when AZ is destroyed
    • Amazon S3 Glacier Instant Retrieval
      • Millisecond retrieval
      • Minimum storage duration of 90 days
    • Amazon S3 Glacier Flexible Retrieval
      • Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours)
      • Minimum storage duration of 90 days
    • Amazon S3 Glacier Deep Archive
      • Standard (12 hours), Bulk (48 hours)
      • Minimum storage duration of 180 days
    • Amazon S3 Intelligent Tiering
      • Moves objects automatically between Access Tiers based on usage
      • There are no retrieval charges in S3 Intelligent-Tiering
    • Transitions between storage classes can be managed with S3 Lifecycle rules
  • Lifecycle Rules
    • Transition Actions
    • Expiration actions
      • Can be used to delete old versions of files (if versioning is enabled)
      • Can be used to delete incomplete Multi-Part uploads
  • Enable S3 Versioning in order to have object versions, so that “deleted objects” are in fact hidden by a “delete marker” and can be recovered
  • Amazon S3 Event Notifications allow you to receive notifications when certain events occur in your S3 bucket, such as object creation or deletion.
  • S3 Analytics
    • decide when to transition objects to the right storage class
  • S3 Security
    • User-Based (IAM Policies)
    • Resource-Based (Bucket Policies)
      • Grant public access to the bucket
      • Force objects to be encrypted at upload
      • Grant access to another account (Cross Account)
  • Object Encryption
    • Server-Side Encryption (SSE)
      • Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3) – Default
        • automatically applied to new objects stored in the bucket
        • Encryption type is AES-256
        • header “x-amz-server-side-encryption”: “AES256”
      • Server-Side Encryption with KMS Keys stored in AWS KMS (SSE-KMS)
        • KMS advantages: user control + audit key usage using CloudTrail
        • header “x-amz-server-side-encryption”: “aws:kms”
        • When you upload, it calls the GenerateDataKey KMS API
        • When you download, it calls the Decrypt KMS API
      • Server-Side Encryption with Customer-Provided Keys (SSE-C)
        • Amazon S3 does NOT store the encryption key you provide
        • HTTPS must be used
        • Encryption key must be provided in HTTP headers, for every HTTP request made
    • Client-Side Encryption
  • Encryption in transit (SSL/TLS)
    • aka “Encryption in flight”
    • Amazon S3 exposes two endpoints:
      • HTTP Endpoint – non encrypted
      • HTTPS Endpoint – encryption in flight
        • mandatory for SSE-C
        • Enforce by setting a Condition in the Bucket Policy with “aws:SecureTransport” (see the sketch below)
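
As an illustration of the encryption points above (not part of the original notes), a minimal boto3 sketch that uploads an object with SSE-KMS and attaches a bucket policy denying non-TLS requests via aws:SecureTransport; the bucket name, key alias, and object key are placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")

# SSE-KMS upload: S3 sets the "x-amz-server-side-encryption: aws:kms" header.
s3.put_object(
    Bucket="my-ml-bucket",                      # placeholder bucket
    Key="training/data.parquet",
    Body=open("data.parquet", "rb"),
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-kms-key",             # placeholder key alias
)

# Deny any request that does not use HTTPS (encryption in flight).
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": ["arn:aws:s3:::my-ml-bucket", "arn:aws:s3:::my-ml-bucket/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket="my-ml-bucket", Policy=json.dumps(policy))
```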

Kinesis Data Streams

  • Collect and store streaming data in real-time
    • data retention, data replication, and automatic load balancing
  • Retention configurable from 24 hours up to 365 days
  • Records up to 1MB (typical use case is lots of “small” real-time data)
  • Data ordering guarantee for data with the same “Partition ID”
  • Capacity Modes
    • Provisioned mode
      • Each shard gets 1MB/s in (or 1000 records per second)
      • Each shard gets 2MB/s out
    • On-demand mode
      • Default capacity provisioned (4 MB/s in or 4000 records per second)
  • [ML] create real-time machine learning applications
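
A minimal boto3 producer sketch (stream name and payload are made up): records sharing a PartitionKey land on the same shard, which is what gives the per-key ordering guarantee noted above.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"sensor_id": "sensor-42", "temperature": 21.7}
kinesis.put_record(
    StreamName="ml-ingest-stream",            # placeholder stream name
    Data=json.dumps(event).encode("utf-8"),   # payload must stay under 1 MB
    PartitionKey=event["sensor_id"],          # same key -> same shard -> ordered
)
```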

Amazon Data Firehose

  • aka Kinesis Data Firehose
  • Collect and store streaming data in real-time
  • Near real-time; in other words, data is buffered and delivered in batches rather than record by record
  • Custom data transformations using AWS Lambda
  • [ML] ingest massive data near-real time
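
A sketch of the Lambda transformation hook mentioned above, assuming the standard Firehose record-transformation contract (base64-encoded records in; recordId, result, and re-encoded data out); the enrichment itself is a placeholder.

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True            # trivial enrichment for the example
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                    # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```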

Amazon Kinesis Data Analytics

  • aka Amazon Managed Service for Apache Flink
  • Apache Flink is an open-source distributed processing engine for stateful computations over data streams. It provides a high-performance runtime and a powerful stream processing API that supports stateful computations, event-time processing, and accurate fault-tolerance guarantees.
    • real-time data transformations, filtering, and enrichment
  • Flink does not read from Amazon Data Firehose
  • Under the hood, AWS manages Apache Flink clusters built from
    • EC2 instances
    • Apache Flink runtime
    • Apache ZooKeeper
  • Serverless
  • Common cases
    • Streaming ETL
    • Continuous metric generation
    • Responsive analytics
  • Use IAM permissions to access streaming source and destination(s)
  • Schema discovery
  • [ML] real-time ETL / ML algorithms on streams
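
A rough PyFlink Table API sketch of streaming ETL / continuous metric generation. It assumes the Kinesis SQL connector is available to the application; the stream name, region, fields, and connector options are illustrative, not definitive.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Expose the Kinesis stream as a SQL table (placeholder names and options).
t_env.execute_sql("""
    CREATE TABLE sensor_events (
        sensor_id   STRING,
        temperature DOUBLE,
        event_time  TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'ml-ingest-stream',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    )
""")

# Continuous per-minute average temperature per sensor.
result = t_env.sql_query("""
    SELECT sensor_id,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           AVG(temperature) AS avg_temp
    FROM sensor_events
    GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
result.execute().print()   # in a real job, insert into a sink table instead
```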

Kinesis Video Streams

  • Video playback capability
  • Keep data for 1 hour to 10 years
  • [ML] real-time video stream to create ML applications
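
A small boto3 sketch of reading media from a stream (stream name is a placeholder): GetMedia requires resolving a data endpoint first.

```python
import boto3

kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    StreamName="ml-camera-stream",            # placeholder stream name
    APIName="GET_MEDIA",
)["DataEndpoint"]

media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
stream = media.get_media(
    StreamName="ml-camera-stream",
    StartSelector={"StartSelectorType": "NOW"},
)
chunk = stream["Payload"].read(1024)          # raw MKV fragments
```
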
| | Data Streams | Data Firehose | Data Analytics (Amazon Managed Service for Apache Flink) | Video Streams |
|---|---|---|---|---|
| Short definition | Scalable and durable real-time data streaming service. | Capture, transform, and deliver streaming data into data lakes, data stores, and analytics services. | Transform and analyze streaming data in real time with Apache Flink. | Stream video from connected devices to AWS for analytics, machine learning, playback, and other processing. |
| Data sources | Any data source (servers, mobile devices, IoT devices, etc.) that can call the Kinesis API to send data. | Any data source (servers, mobile devices, IoT devices, etc.) that can call the Kinesis API to send data. | Amazon MSK, Amazon Kinesis Data Streams, servers, mobile devices, IoT devices, etc. | Any streaming device that supports the Kinesis Video Streams SDK. |
| Data consumers | Kinesis Data Analytics, Amazon EMR, Amazon EC2, AWS Lambda | Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, Datadog, New Relic, MongoDB, Splunk | Analysis results can be sent to another Kinesis stream, a Firehose stream, or a Lambda function | Amazon Rekognition, Amazon SageMaker, MxNet, TensorFlow, HLS-based media playback, custom media processing applications |
| Use cases | Log and event data collection; real-time analytics; mobile data capture; gaming data feeds | IoT analytics; clickstream analytics; log analytics; security monitoring | Streaming ETL; real-time analytics; stateful event processing | Smart technologies; video-related AI/ML; video processing |

Glue Data Catalog

  • Metadata repository for all your tables
    • Automated Schema Inference
    • Schemas are versioned
  • Integrates with Athena or Redshift Spectrum (schema & data discovery)
  • Glue Crawlers can help build the Glue Data Catalog
    • Works with JSON, Parquet, CSV, and relational stores
    • Crawlers work for: S3, Amazon Redshift, Amazon RDS
    • Run the Crawler on a Schedule or On Demand
    • Need an IAM role / credentials to access the data stores
    • Glue crawler will extract partitions based on how your S3 data is organized
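
A boto3 sketch of creating and running a crawler over an S3 prefix (crawler name, role ARN, database, path, and schedule are placeholders).

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="ml-raw-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role
    DatabaseName="ml_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-ml-bucket/raw/"}]},
    Schedule="cron(0 2 * * ? *)",            # optional: run nightly
)
glue.start_crawler(Name="ml-raw-data-crawler")

# After the crawl, tables (with partitions inferred from the S3 layout) are
# visible to Athena / Redshift Spectrum through the Data Catalog:
tables = glue.get_tables(DatabaseName="ml_catalog")["TableList"]
```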

Glue ETL

  • Extract, Transform, Load
  • primarily used for batch data processing and not real-time data ingestion and processing
  • Transform data, Clean Data, Enrich Data (before doing analysis)
    • Bundled Transformations
      • DropFields, DropNullFields – remove (null) fields
      • Filter – specify a function to filter records
      • Join – to enrich data
      • Map – add fields, delete fields, perform external lookups
    • Machine Learning Transformations
      • FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
      • Apache Spark transformations (example: K-Means)
  • Jobs are run on a serverless Spark platform
  • Glue Scheduler to schedule the jobs
  • Glue Triggers to automate job runs based on “events”
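
A sketch of a Glue ETL job script using the bundled transformations listed above; the catalog database/table names, filter condition, and output path are placeholders.

```python
from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields, Filter, Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read two catalog tables as DynamicFrames (placeholder database/table names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="ml_catalog", table_name="orders")
customers = glue_context.create_dynamic_frame.from_catalog(
    database="ml_catalog", table_name="customers")

# Clean, filter, then enrich via a join.
orders = DropNullFields.apply(frame=orders)
orders = Filter.apply(frame=orders, f=lambda row: row["amount"] > 0)
enriched = Join.apply(orders, customers, "customer_id", "customer_id")

# Write the curated output back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=enriched,
    connection_type="s3",
    connection_options={"path": "s3://my-ml-bucket/curated/orders/"},
    format="parquet",
)
```
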
| Format | Definition | Property | Usage |
|---|---|---|---|
| CSV | Unstructured | minimal, row-based | not suited to large-scale data |
| XML | Semi-structured | neither row- nor column-based | not suited to large-scale data |
| JSON | Semi-structured | neither row- nor column-based | |
| Parquet | Structured (columnar) | performance-oriented, column-based | large datasets (analytical queries), with data compression and encoding algorithms |
| Avro-RecordIO | Structured | performance-oriented, row-based | large datasets (streaming, event data) |
| grokLog | Structured | | |
| Ion | Structured | | |
| ORC | Structured | performance-oriented, column-based | |

| Feature | Avro | Parquet |
|---|---|---|
| Storage Format | Row-based (stores entire records sequentially) | Columnar (stores data by columns) |
| Best For | Streaming, event data, schema evolution | Analytical queries, big data analytics |
| Read Performance | Slower for analytics since entire rows must be read | Faster for analytics as only required columns are read |
| Write Performance | Faster – appends entire rows quickly | Slower – columnar storage requires additional processing |
| Query Efficiency | Inefficient for analytical queries due to row-based structure | Highly efficient for analytical queries since only required columns are scanned |
| File Size | Generally larger due to row-based storage | Smaller file sizes due to better compression techniques |
| Use Cases | Event-driven architectures, Kafka messaging systems, log storage | Data lakes, data warehouses, ETL processes, analytical workloads |
| Processing Frameworks | Works well with Apache Kafka, Hadoop, Spark | Optimized for Apache Spark, Hive, Presto, Snowflake |
| Support for Nested Data | Supports nested data, but requires schema definition | Optimized for nested structures, better suited for hierarchical data |
| Interoperability | Widely used in streaming platforms | Preferred for big data processing and analytical workloads |
| Primary Industry Adoption | Streaming platforms, logging, real-time pipelines | Data warehousing, analytics, business intelligence |
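
To make the row-based vs. columnar distinction concrete, a small sketch writing the same records as Avro (via fastavro) and Parquet (via pandas/pyarrow); the schema and file names are made up.

```python
import pandas as pd
from fastavro import parse_schema, writer

records = [
    {"user_id": 1, "event": "click", "value": 0.7},
    {"user_id": 2, "event": "view",  "value": 0.1},
]

# Avro: rows appended one after another, following an explicit schema.
schema = parse_schema({
    "name": "Event", "type": "record",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "event",   "type": "string"},
        {"name": "value",   "type": "double"},
    ],
})
with open("events.avro", "wb") as fo:
    writer(fo, schema, records)

# Parquet: the same data stored column by column (pyarrow under the hood),
# so an analytical query can read only the columns it needs.
pd.DataFrame(records).to_parquet("events.parquet")
```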

AWS Glue DataBrew

  • Allows you to clean and normalize data without writing any code
  • Reduces ML and analytics data preparation time by up to 80%
  • features
    • Transformations, such as filtering rows, replacing values, splitting and combining columns; or applying NLP to split sentences into phrases.
    • Data Formats and Data Sources
    • Job and Scheduling
    • Security
    • Integration 
  • Components
    • Project
    • Dataset
    • Recipe
    • Job
    • Data Lineage
    • Data Profile

AWS Data Stores for Machine Learning

  • Redshift
    • Data Warehousing, SQL analytics (OLAP – Online analytical processing)
    • Load data from S3 to Redshift
    • Use Redshift Spectrum to query data directly in S3 (no loading)
  • RDS, Aurora
    • Relational Store, SQL (OLTP – Online Transaction Processing)
    • Must provision servers in advance
  • DynamoDB
    • NoSQL data store, serverless, provision read/write capacity
    • Useful to store a machine learning model served by your application
  • S3
    • Object storage
    • Serverless, infinite storage
    • Integration with most AWS Services
  • OpenSearch (previously ElasticSearch)
    • Indexing of data
    • Search amongst data points
    • Clickstream Analytics
  • ElastiCache
    • Caching mechanism
    • Not really used for Machine Learning
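
A small boto3 sketch of the DynamoDB use case above: precomputed model output written by a batch job and read back by the serving application (table and attribute names are placeholders).

```python
import boto3

table = boto3.resource("dynamodb").Table("user-recommendations")   # placeholder table

# Batch job writes the model's output...
table.put_item(Item={
    "user_id": "user-123",
    "recommended_items": ["sku-1", "sku-9", "sku-42"],
    "model_version": "2024-06-01",
})

# ...and the application reads it back at request time.
item = table.get_item(Key={"user_id": "user-123"}).get("Item")
```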

AWS Data Pipeline

  • Destinations include S3, RDS, DynamoDB, Redshift and EMR
  • Manages task dependencies
  • Retries and notifies on failures
  • Data sources may be on-premises

AWS Batch

  • Run batch jobs via Docker images
  • Dynamic provisioning of the instances (EC2 & Spot Instances)
  • serverless
  • Schedule Batch Jobs using CloudWatch Events
  • Orchestrate Batch Jobs using AWS Step Functions
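
A boto3 sketch of submitting a containerized job to an existing Batch job queue; the queue, job definition, and environment variable are placeholders.

```python
import boto3

batch = boto3.client("batch")

# The job definition points at a Docker image that does the actual work.
batch.submit_job(
    jobName="nightly-feature-build",
    jobQueue="ml-batch-queue",                 # placeholder queue
    jobDefinition="feature-build:3",           # placeholder job definition
    containerOverrides={
        "environment": [{"name": "RUN_DATE", "value": "2024-06-01"}],
    },
)
```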

DMS – Database Migration Service

  • Continuous Data Replication using CDC (change data capture)
  • You must create an EC2 instance to perform the replication tasks
  • Homogeneous migrations: ex Oracle to Oracle
  • Heterogeneous migrations: ex Microsoft SQL Server to Aurora

AWS Step Functions

  • Use to design workflows
  • Advanced Error Handling and Retry mechanism outside the (Lambda) code
  • Audit of the history of workflows
  • Ability to “Wait” for an arbitrary amount of time
  • Max execution time of a State Machine is 1 year
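
A sketch of a minimal state machine definition (Amazon States Language) with a Retry policy and a Wait state, created with boto3; the Lambda function and role ARNs are placeholders.

```python
import json
import boto3

definition = {
    "StartAt": "Train",
    "States": {
        "Train": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:train",  # placeholder
            "Retry": [{                      # retry handled outside the Lambda code
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Next": "CoolDown",
        },
        "CoolDown": {"Type": "Wait", "Seconds": 300, "Next": "Done"},
        "Done": {"Type": "Succeed"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="ml-training-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",   # placeholder
)
```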

AWS DataSync

  • on-premises -> AWS storage services
  • A DataSync Agent is deployed as a VM and connects to your internal storage (NFS, SMB, HDFS)
  • Encryption and data validation

MQTT

  • Standard messaging protocol, for IoT (Internet of Things)
  • Think of it as how lots of sensor data might get transferred to your machine learning model
  • The AWS IoT Device SDK can connect via MQTT
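
A rough sketch using the AWS IoT Device SDK v2 for Python (awsiotsdk) to publish a sensor reading over MQTT; the endpoint, certificate paths, client ID, and topic are placeholders.

```python
import json
from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Mutual-TLS connection to an AWS IoT Core endpoint (placeholder paths/IDs).
connection = mqtt_connection_builder.mtls_from_path(
    endpoint="xxxxxxxx-ats.iot.us-east-1.amazonaws.com",
    cert_filepath="device.pem.crt",
    pri_key_filepath="private.pem.key",
    ca_filepath="AmazonRootCA1.pem",
    client_id="sensor-42",
)
connection.connect().result()

# Publish one reading; an IoT rule could route this into Kinesis/S3 for ML.
connection.publish(
    topic="sensors/temperature",
    payload=json.dumps({"sensor_id": "sensor-42", "temp": 21.7}),
    qos=mqtt.QoS.AT_LEAST_ONCE,
)
connection.disconnect().result()
```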

Amazon Keyspaces (for Apache Cassandra)

  • a scalable, highly available, and managed Apache Cassandra–compatible database service

| | Apache Cassandra | MongoDB |
|---|---|---|
| Data model | Cassandra uses a wide-column data model, more closely related to relational databases. | MongoDB moves completely away from the relational model by storing data as documents. |
| Basic storage unit | Sorted string tables. | Serialized JSON documents. |
| Indexing | Cassandra supports secondary indexes and SASI to index by column or columns. | MongoDB indexes at a collection level and field level and offers multiple indexing options. |
| Query language | Cassandra uses CQL. | MongoDB uses MQL. |
| Concurrency | Cassandra achieves concurrency with row-level atomicity and tunable consistency. | MongoDB uses MVCC and document-level locking to ensure concurrency. |
| Availability | Cassandra has multiple master nodes, node partitioning, and key replication to offer high availability. | MongoDB uses a single primary node and multiple replica nodes. Combined with sharding, MongoDB provides high availability and scalability. |
| Partitioning | Consistent hashing algorithm, less control for users. | Users define sharding keys and have more control over partitioning. |
| AWS counterpart | Amazon Keyspaces | Amazon DynamoDB |

| Aspect | Flink | Spark | Kafka |
|---|---|---|---|
| Type | Hybrid (batch and stream) | Hybrid (batch and stream) | Stream-only |
| Support for 3rd-party systems | Multiple sources and sinks | Yes (Kafka, HDFS, Cassandra, etc.) | Tightly coupled with Kafka (Kafka Connect) |
| Stateful | Yes (RocksDB) | Yes (with checkpointing) | Yes (with Kafka Streams, RocksDB) |
| Complex event processing | Yes (native support) | Yes (with Spark Structured Streaming) | No (developer needs to handle) |
| Streaming window | Tumbling, Sliding, Session, Count | Time-based and count-based | Tumbling, Hopping/Sliding, Session |
| Data processing | Batch/stream (native) | Batch/stream (micro-batch) | Stream-only |
| Iterations | Supports iterative algorithms natively | Supports iterative algorithms with micro-batches | No |
| SQL | Table, SQL API | Spark SQL | Supports SQL queries on streaming data with Kafka SQL API (KSQL) |
| Optimization | Automatic (data flow graph and the available resources) | Manual (directed acyclic graph (DAG) and the available resources) | No native support |
| State backend | Memory, file system, RocksDB, or custom backends | Memory, file system, HDFS, or custom backends | Memory, file system, RocksDB, or custom backends |
| Language | Java, Scala, Python, and SQL APIs | Java, Scala, Python, R, C#, F#, and SQL APIs | Java, Scala, and SQL APIs |
| Geo-distribution | Flink Stateful Functions API | No native support | Kafka MirrorMaker tool |
| Latency | Streaming: very low latency (milliseconds) | Micro-batching: near real-time latency (seconds) | Log-based: very low latency (milliseconds) |
| Data model | True streaming with bounded and unbounded data sets | Micro-batching with RDDs and DataFrames | Log-based streaming |
| Processing engine | One unified engine for batch and stream processing | Separate engines for batch (Spark Core) and stream processing (Spark Streaming) | Stream processing only |
| Delivery guarantees | Exactly-once for both batch and stream processing | Exactly-once for batch processing, at-least-once for stream processing | At-least-once |
| Throughput | High throughput due to pipelined execution and in-memory caching | High throughput due to in-memory caching and parallel processing | High throughput due to log compaction and compression |
| State management | Rich support for stateful operations with various state backends and time semantics | Limited support for stateful operations with mapWithState and updateStateByKey functions | No native support for stateful operations; relies on external databases or the Kafka Streams API |
| Machine learning support | Yes (Flink ML library) | Yes (Spark MLlib library) | No (use external libraries like TensorFlow or H2O) |
| Architecture | True streaming engine that treats batch as a special case of streaming with bounded data. Uses a streaming dataflow model that allows for more optimization than Spark’s DAG model. | Batch engine that supports streaming as micro-batching (processing small batches of data at regular intervals). Uses a DAG model that divides the computation into stages and tasks. | Stream engine that acts as both a message broker and a stream processor. Uses a log model that stores and processes records as an ordered sequence of events. |
| Delivery guarantees (detail) | Supports exactly-once processing semantics by using checkpoints and state snapshots. Also supports at-least-once and at-most-once semantics. | Supports at-least-once processing semantics by using checkpoints and write-ahead logs. Can achieve exactly-once semantics for some output sinks by using idempotent writes or transactions. | Supports exactly-once processing semantics by using transactions and idempotent producers. Also supports at-least-once and at-most-once semantics. |
| Performance | Achieves high performance and low latency by using in-memory processing, pipelined execution, incremental checkpoints, network buffers, and operator chaining. Also supports batch and iterative processing modes for higher throughput. | Achieves high performance and low latency by using in-memory processing, lazy evaluation, RDD caching, and code generation. However, micro-batching introduces some latency overhead compared to true streaming engines. | Achieves high performance and low latency by using log compaction, zero-copy transfer, batch compression, and client-side caching. However, Kafka does not support complex stream processing operations natively. |