03. Storage

Instance Store

  • Block-level storage on disks that are physically attached to the host computer (this is not EBS, which is network-attached)
  • Temporary/ephemeral, ideal for
    • temp info that changes frequently such as caches, buffers, scratch data,
    • data that is replicated across a fleet of instances where you can afford to lose a copy once in a while and the data is still replicated across other instances
  • Very high performance and low latency
  • Can be cost effective since the cost is included in the instance cost
  • Instance store data survives a reboot, but is lost when the instance stops, hibernates, or terminates. Hibernation saves what’s in memory to the EBS root volume, so it requires an EBS-backed instance and does not preserve instance store data.

EBS

  • General Purpose SSD (gp2, gp3) – for low-latency interactive apps, dev/test environments.
    • gp2 can burst IOPS (not CPU) for short periods using I/O credits, but cannot sustain the burst; gp3 has a 3,000 IOPS baseline with no bursting.
  • Provisioned IOPS SSD (io1, io2) – for sub-millisecond latency, sustained IOPS performance.
    • Be sure to distinguish: IOPS solves I/O aka disk wait time, not CPU performance
    • The IOPS you can provision is tied to volume size, as a maximum IOPS-per-GB ratio.
    • These cost more than General Purpose SSD volumes.
  • In contrast to SSD volumes, EBS also offers HDD volumes:
    • EBS Cold HDD (sc1): lowest-cost option, for infrequently accessed data and sequential access patterns.
    • EBS Throughput Optimized HDD (st1): for frequently accessed, throughput-intensive workloads such as MapReduce, Kafka, log processing, data warehouse and ETL workloads. Costs more than sc1.
    • Note, however, that the HDD volumes have no IOPS SLA.
  • An EBS volume lives in a single AZ and can only attach to instances in that AZ (the EBS Multi-Attach feature is also single-AZ, and only works with Provisioned IOPS SSD volumes, i.e. io1 and io2). EBS is considered a “single point of failure”.
  • To implement a shared storage layer of files, you could replace multiple EBS with a single EFS
  • Not fully managed, doesn’t auto-scale (as opposed to EFS)
  • Use EBS Data Lifecycle Manager (DLM) to manage backup snapshots. Snapshots are incremental, but the deletion process is designed so that you only need to retain the most recent snapshot to restore the volume.
  • iSCSI is block protocol, whereas NFS is a file protocol
  • EBS supports encryption of data at rest and encryption of data in transit between the instance and the EBS volume. 
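
As a rough illustration of how IOPS relates to volume size, the sketch below computes the gp2 baseline and the io1 provisioning ceiling. The figures used (3 IOPS per GiB with a floor of 100 and a cap of 16,000 for gp2; a 50:1 IOPS-to-GiB ratio with a cap of 64,000 for io1) are not from these notes; they are assumptions based on AWS documentation and may change. The function names are hypothetical.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume: 3 IOPS per GiB, clamped to [100, 16000] (assumed figures)."""
    return max(100, min(16_000, 3 * size_gib))


def io1_max_provisioned_iops(size_gib: int) -> int:
    """Upper bound on IOPS you can provision for an io1 volume: 50 IOPS per GiB, capped (assumed figures)."""
    return min(64_000, 50 * size_gib)


if __name__ == "__main__":
    for size in (20, 100, 1000, 6000):
        print(size, gp2_baseline_iops(size), io1_max_provisioned_iops(size))
```

The point of the sketch is the one the notes make: a small volume cannot be given arbitrarily high IOPS, so a question about sustained I/O on a small disk may really be a question about volume type or size.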

EFS

  • can attach to many instances across multiple AZs, whereas EBS cannot (EBS Multi-Attach is single-AZ only, and limited to Provisioned IOPS SSD volumes, i.e. io1 and io2)
  • fully managed, auto-scales (whereas EBS is not)
  • Linux only, not Windows!
  • Since it is Linux, use POSIX permissions to restrict access to files
  • Lifecycle management can transition files that have not been accessed for a configurable period (for example 30 or 90 days) to EFS IA
  • Protected by security groups on the EFS mount targets, which control network traffic and act as a firewall
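
Because EFS exposes an ordinary POSIX file system, restricting access to a file works the same way as on a local disk. The sketch below is a minimal illustration using only the Python standard library; the helper name is hypothetical, and a temporary file stands in for a file on an EFS mount (a path such as /mnt/efs would be an assumption about your setup).

```python
import os
import stat
import tempfile


def restrict_to_owner(path: str) -> int:
    """Set mode 0o600 (owner read/write only) and return the resulting permission bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # A temp file stands in for a file on an EFS mount (hypothetical location).
    with tempfile.NamedTemporaryFile(delete=False) as f:
        demo_path = f.name
    print(oct(restrict_to_owner(demo_path)))
    os.remove(demo_path)
```

Security groups decide which instances can reach the file system over the network; POSIX permissions then decide which users on those instances can read or write each file.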

S3

  • durable (99.999999999%)
  • a best practice is to enable versioning and MFA Delete on S3 buckets
  • S3 lifecycle 2 types of actions:
    1. transition actions (define when to transition to another storage class)
    2. expiration actions (objects expire, then S3 deletes them on your behalf)
  • objects must be stored in S3 for at least 30 days before a lifecycle policy can transition them to Standard-IA or One Zone-IA. (Transitions to the Glacier classes do not have this minimum.)
  • Intelligent Tiering automatically moves data to the most cost-effective storage
  • Standard-IA is multi-AZ whereas One Zone-IA is not
  • A pre-signed URL gives the holder access to the object identified in the URL (the URL is made up of bucket name, object key, HTTP method, expiration timestamp). To share an S3 object with an outside partner, a pre-signed URL is more secure (and easier) than creating an AWS account for them and handing over the login, which is more work to manage and error-prone if the account isn’t locked down properly.
  • Long-term archive data usually reaches Glacier through S3: upload it to a bucket and let a lifecycle transition move it, or upload it with a Glacier storage class set on the object
  • Accessed via API, if you want to access S3 directly it can require modifying the app to use the API which is extra effort
  • Can host a static website but not over HTTPS. For HTTPS use CloudFront+S3 instead. 
  • Best practice: use IAM policies to grant users fine-grained control to your S3 buckets rather than using bucket ACLs
  • Can use multi-part upload to speed up uploads of large files to S3
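
The two kinds of lifecycle action described above can be sketched as a single rule. The dictionary below follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; the rule ID, prefix, and day counts are hypothetical, and the small checker function is an illustration of the 30-day minimum for the IA classes, not an AWS API.

```python
import json

# Hypothetical rule: the ID, prefix, and day counts are illustrative only.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # Transition actions: move objects to cheaper storage classes over time.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expiration action: S3 deletes the objects on your behalf.
            "Expiration": {"Days": 365},
        }
    ]
}


def ia_transitions_too_early(config: dict, minimum_days: int = 30) -> list:
    """Return IDs of rules that move objects to an IA class before minimum_days."""
    ia_classes = {"STANDARD_IA", "ONEZONE_IA"}
    return [
        rule["ID"]
        for rule in config["Rules"]
        for t in rule.get("Transitions", [])
        if t["StorageClass"] in ia_classes and t.get("Days", 0) < minimum_days
    ]


if __name__ == "__main__":
    print(json.dumps(lifecycle_configuration, indent=2))
    print(ia_transitions_too_early(lifecycle_configuration))
```

Nothing here talks to AWS. To apply the rule you would pass the dictionary to an S3 client together with a bucket name of your own.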

Glacier

  • slow to retrieve, but you can use Expedited Retrieval to bring it down to just 1-5min.

Amazon FSx

  • to replace Microsoft Windows file server
  • can be multi-AZ
  • supports the SMB protocol and DFS (Distributed File System) namespaces
  • integrates with AD
  • FSx for Lustre is for high-performance computing (HPC) – does not support Windows

Amazon Aurora Global Database

  • for globally distributed applications. 1 DB can span multiple regions
  • If too much read traffic is clogging up write requests, create an Aurora replica and direct read traffic to the replica. The replica serves as both standby instance and target for read traffic. 
  • “Amazon Aurora Serverless” is different from “Amazon Aurora” – it automatically scales capacity and is ideal for infrequently used applications. 

RDS

  • Transactional DB (OLTP)
  • If too much read traffic is clogging up write requests, create an RDS read replica and direct read traffic to the replica. The read replica is updated asynchronously. Multi-AZ is a separate feature: it keeps a standby instance in another AZ, replicated synchronously, for failover rather than for serving reads
  • RDS is a managed database, not a data store. Careful in some questions if they ask about migrating a data store to AWS, RDS would not be suitable.
  • To encrypt an existing RDS database, take a snapshot, make an encrypted copy of the snapshot, then restore the encrypted copy to a new RDS instance. Since data may have changed during the snapshot/encrypt/restore operation, use AWS DMS (Database Migration Service) to sync the data.
  • RDS can be restored to a point as recent as about 5 minutes ago using point-in-time restore (PITR). A restore always creates a new instance, so you need to point the application to the new instance’s endpoint.
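
The snapshot, encrypted copy, and restore sequence for encrypting an existing database can be sketched as below. This is a minimal sketch, not a production script: it assumes the `rds` argument behaves like `boto3.client("rds")`, all identifiers and the KMS key are hypothetical, and the follow-up DMS sync is left out.

```python
def encrypt_rds_via_snapshot(rds, source_instance: str, target_instance: str, kms_key_id: str) -> str:
    """Snapshot -> encrypted copy -> restore to a new instance. Returns the new instance ID.

    `rds` is expected to behave like boto3.client("rds"); all identifiers are hypothetical.
    """
    plain = f"{source_instance}-plain"
    encrypted = f"{source_instance}-encrypted"

    # 1. Snapshot the unencrypted instance and wait for the snapshot to finish.
    rds.create_db_snapshot(DBSnapshotIdentifier=plain, DBInstanceIdentifier=source_instance)
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=plain)

    # 2. Copy the snapshot; supplying a KMS key makes the copy encrypted.
    rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=plain,
        TargetDBSnapshotIdentifier=encrypted,
        KmsKeyId=kms_key_id,
    )
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=encrypted)

    # 3. Restore the encrypted copy; this creates a NEW instance with its own endpoint.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=target_instance, DBSnapshotIdentifier=encrypted
    )
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=target_instance)
    return target_instance
```

With boto3 installed and credentials configured, a call might look like `encrypt_rds_via_snapshot(boto3.client("rds"), "mydb", "mydb-encrypted", "alias/my-key")`, where every name is a placeholder. The application must then be pointed at the new instance.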

ElastiCache

  • Database cache. Put in front of DBs such as RDS or Redshift, or in front of certain types of DB data in S3, to improve performance
  • As a cache, it is an in-memory key/value store, not a system of record; it sits in front of OLTP or OLAP databases rather than replacing them
  • Redis vs. Memcached
    • Redis has replication and high availability, whereas Memcached does not. Memcached, however, is multi-threaded and can use multiple cores.
    • Redis can be token-protected (i.e. require a password). Set the auth token when you create the Redis cluster (this requires in-transit encryption), and have clients send the AUTH command when they connect, before any other commands.
    • For Redis, ElastiCache in-transit encryption is an optional feature to increase security of data in transit as it is being replicated (with performance trade-off)
  • Use case: accelerate autocomplete in a web page form
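
The autocomplete use case is a typical cache-aside pattern: check the cache first and only query the database on a miss. The sketch below illustrates the idea with an in-process dictionary standing in for Redis or Memcached and a word list standing in for the database; the class and its names are hypothetical.

```python
import time


class CachedAutocomplete:
    """Cache-aside lookup: check the cache first, fall back to the database on a miss."""

    def __init__(self, words, ttl_seconds=60.0):
        self._words = sorted(words)   # stand-in for a database table
        self._ttl = ttl_seconds
        self._cache = {}              # stand-in for Redis/Memcached: prefix -> (expiry, results)
        self.db_queries = 0           # counts how often the slow path runs

    def _query_database(self, prefix):
        """Pretend database lookup; a real one would be a query against RDS."""
        self.db_queries += 1
        return [w for w in self._words if w.startswith(prefix)]

    def suggest(self, prefix):
        entry = self._cache.get(prefix)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]                                  # cache hit
        results = self._query_database(prefix)               # cache miss
        self._cache[prefix] = (time.monotonic() + self._ttl, results)
        return results


if __name__ == "__main__":
    ac = CachedAutocomplete(["banana", "apple", "apricot"])
    print(ac.suggest("ap"), ac.suggest("ap"), ac.db_queries)
```

Repeated lookups for the same prefix are served from memory until the TTL expires, which is what takes load off the database.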

DynamoDB

  • Use when the question talks about key/value storage, near-real time performance, millisecond responsiveness, and very high requests per second
  • Not compatible with relational data such as what would be stored in a MySQL or RDS DB
  • No concept of read replica like in RDS and Aurora. For read-heavy or bursty workloads, use DAX, an in-memory cache, to accelerate performance. 
  • DynamoDB measures RCUs (read capacity units, basically reads per second)  and WCUs (write capacity units)
  • DynamoDB auto scaling uses the AWS Application Auto Scaling service to dynamically adjust throughput capacity based on traffic. 
  • Best practices:
    • keep item sizes small (the hard limit is 400 KB per item); store larger objects in S3 and keep a pointer to them in DynamoDB
    • store more frequently and less frequently accessed data in different tables 
    • if storing data that will be accessed by timestamp, use separate tables for days, weeks, months
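
To make RCUs and WCUs concrete, the sketch below estimates the capacity a workload needs. It assumes the standard DynamoDB definitions, which are not spelled out in these notes: one RCU covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one WCU covers one write per second of an item up to 1 KB. The function names are hypothetical.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed; item size is rounded up to the next 4 KB per read."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """WCUs needed; item size is rounded up to the next 1 KB per write."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


if __name__ == "__main__":
    print(read_capacity_units(6000, 10))
    print(read_capacity_units(6000, 10, strongly_consistent=False))
    print(write_capacity_units(1500, 5))
```

The rounding per item is one reason the best practices favor small items: a 6,000-byte item costs two RCUs per strongly consistent read, not one and a half.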

AWS Storage Gateway

  • Replace on-prem without changing workflow
  • Types: File Gateway (for NFS and SMB), Volume Gateway, Tape Gateway. 
  • Stores data in S3 (e.g. for file gateway type, it stores files as objects in S3)
  • Provides a cache that can be accessed at low latency, whereas EFS and EBS do not have a cache

Copying and Converting

  • Use AWS Schema Conversion Tool (SCT) to convert a DB schema from one type of DB to another, e.g. from Oracle to Redshift
  • Use Database Migration Service (DMS) to copy database. Sometimes you do SCT convert, then DMS copy. 
  • Use AWS DataSync to copy large amounts of data from on-prem to S3, EFS, FSx, NFS shares, SMB shares, or AWS Snowcone (optionally via Direct Connect). It is for copying files and objects, not databases.

Analytics (OLAP)

  • Redshift is a columnar data warehouse that you can use for complex querying across petabytes of structured data. It’s not serverless, it uses EC2 instances that must be running. Use Amazon Redshift Spectrum to query data in S3 from a Redshift cluster with massive parallelism
  • Athena is a serverless, pay-per-query solution for running SQL queries on S3 data and writing the results back to S3. Works natively with client-side and server-side encryption. Not the same as QuickSight, which is just a BI dashboard.
  • Amazon S3 Select – analyze and process large amounts of data faster with SQL, without moving it to a data warehouse