Databricks

Unity Catalog

Delta Sharing

Apache Iceberg

  • Iceberg aims to bring database-like reliability, consistency, and manageability to data lakes built on open file formats such as Parquet, Avro, and ORC.
  • Think of Iceberg as a modern table format that sits on top of your files and manages them like a relational database would—with support for versioning, transactions, and intelligent metadata management.
  • True Schema Evolution (Without File Rewrites): Iceberg stores schema information in its own metadata layer, so columns can be added, dropped, renamed, or reordered without rewriting the underlying data files.
  • Full ACID Transactions at Scale: Iceberg supports serializable isolation through snapshot-based transactional writes
  • Time Travel and Snapshot Isolation: Every write in Iceberg creates a new snapshot, recorded in the metadata.
  • Partitioning Without Pain: hidden partitioning tracks partition transforms in metadata, so queries get partition pruning without filtering on separate partition columns (see the Spark SQL sketch below).
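A minimal Spark SQL sketch of these features, assuming a Spark session with the iceberg-spark-runtime jar on the classpath; the catalog name (demo), namespace, table, warehouse path, and timestamp are all hypothetical:

```python
# A minimal sketch of Iceberg's table features via Spark SQL.
# Assumes the iceberg-spark-runtime jar is on the classpath; catalog name,
# namespace, table, and warehouse path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")

# Hidden partitioning: days(ts) is a metadata-tracked transform, so queries
# filtering on ts are pruned without a separate date column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT, ts TIMESTAMP, payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution is a metadata-only change; no Parquet files are rewritten.
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN payload TO body")

# Every write creates a snapshot; list them, then time-travel to one
# (illustrative timestamp).
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'").show()
```
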
| Feature | Apache Parquet (Columnar Storage Format) | Apache Iceberg (Table Format) |
| --- | --- | --- |
| Storage Format | Stores data in a highly efficient, columnar, binary format. | Organizes Parquet/ORC/Avro files into structured tables using rich metadata. |
| Schema Evolution | Limited: adding columns is easy, but renaming/reordering requires rewriting files. | Fully supports add/drop/rename/reorder without rewriting underlying files. |
| ACID Transactions | Not supported; updates/deletes require rewriting files manually. | Full transactional support with isolation and atomicity across operations. |
| Time Travel | Not natively supported; manual versioning needed. | Built-in snapshot-based versioning for point-in-time queries and rollback. |
| Performance | Optimized for scan-heavy, read-mostly workloads. | Optimized for dynamic datasets with concurrent writes, updates, and schema changes. |
| Best For | Analytical queries, static datasets, feature stores. | Data lakes, CDC pipelines, evolving schemas, and transactional workloads. |

Parquet

  • an open, columnar file format that is common in big data environments and great for automated workflows and storage. If your team uses Hadoop, this is most likely their favorite format. Parquet is self-describing: each file embeds metadata covering its schema and structure (see the pyarrow sketch after the lists below).
  • Strengths
    • Read speed
    • File size, thanks to built-in compression and encoding
    • Splittable for parallel processing
    • Rich set of included data types
    • Schema evolution – supports adding or dropping fields
  • Weaknesses
    • Not human-readable
    • Write overhead and latency (encoding is expensive)
    • Inefficient for row-level access
    • Tooling overhead
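A minimal pyarrow sketch of these trade-offs; the file path and column names are illustrative:

```python
# A minimal sketch of Parquet's columnar strengths with pyarrow
# (file path and column names are illustrative).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]})

# Compression + encoding keep files small (Snappy is the common default).
pq.write_table(table, "/tmp/scores.parquet", compression="snappy")

# Self-describing: the schema travels with the file's metadata.
print(pq.read_schema("/tmp/scores.parquet"))

# Columnar reads: fetch only the columns a query needs.
scores_only = pq.read_table("/tmp/scores.parquet", columns=["score"])
```
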
| Feature | Parquet | CSV | JSON | Avro | ORC |
| --- | --- | --- | --- | --- | --- |
| Storage Type | Columnar | Row-based | Row-based | Row-based | Columnar |
| Compression | High (Snappy, Gzip, Brotli) | None (manual) | Moderate (manual) | Moderate (Deflate) | Very High (Zlib, LZO) |
| Read Performance | Excellent (esp. for selective columns) | Poor | Poor | Moderate | Excellent |
| Write Performance | Moderate to slow (due to encoding) | Fast | Fast | Fast | Moderate |
| Schema Support | Strong (with evolution) | None | Weak (schema-less) | Strong (with evolution) | Strong (with evolution) |
| Nested Data Support | Excellent (via Arrow) | None | Good (but inefficient) | Moderate | Excellent |
| Human Readable | No | Yes | Yes | No | No |
| Best Use Cases | Analytics, data lakes, BI tools | Quick inspection, debugging | Logging, config files | Streaming, serialization | Data warehousing (esp. Hive) |
| Cloud Compatibility | Universal (AWS, Azure, GCP) | Universal | Universal | Universal | Mostly Hadoop ecosystems |

Delta Lake

  • Delta Lake builds a metadata layer on top of existing data files, which are typically in the Parquet format.
  • This metadata layer, or transaction log, records all changes to the table.
  • The transaction log allows Delta Lake to perform ACID transactions and track data versions, which is the foundation for features like time travel and schema enforcement.
  • It is fully compatible with Apache Spark APIs, allowing it to integrate easily into existing big data and streaming workflows (see the sketch below).
  • Lakehouse: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
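A minimal PySpark sketch of the transaction log and time travel, assuming the delta-spark package is installed; the table path is illustrative:

```python
# A minimal sketch of Delta Lake's transaction log and time travel in PySpark.
# Assumes the delta-spark package is installed; the table path is illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/delta/events"
df = spark.range(5).withColumnRenamed("id", "event_id")

# Each commit writes a JSON entry to <path>/_delta_log/, which is what makes
# ACID transactions and versioning possible on top of plain Parquet files.
df.write.format("delta").mode("overwrite").save(path)  # creates version 0
df.write.format("delta").mode("append").save(path)     # creates version 1

# Time travel: read an earlier version recorded in the transaction log.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 5 rows, the table's state at version 0
```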

Medallion Architecture

  • a medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables). Medallion architectures are sometimes also referred to as “multi-hop” architectures (a minimal multi-hop sketch follows the layer descriptions below).
  • Data Quality Levels/Layers
  • Bronze
    • raw data, landed “as-is” from external source systems
    • the table structures in this layer correspond to the source system table structures, along with any additional metadata columns that capture the load date/time, process ID, etc. The focus in this layer is quick Change Data Capture and the ability to provide a historical archive of source data (cold storage), data lineage, auditability, and reprocessing if needed without rereading the data from the source system.
  • Silver
    • cleansed and conformed data
    • the data from the Bronze layer is matched, merged, conformed and cleansed (“just-enough”) so that the Silver layer can provide an “Enterprise view” of all its key business entities, concepts and transactions. 
    • Speed and agility to ingest and deliver data into the data lake are prioritized at this layer; most project-specific, complex transformations and business rules are deferred until data is loaded from the Silver to the Gold layer.
    • From a data modeling perspective, the Silver layer typically uses data models closer to Third Normal Form (3NF).
  • Gold
    • curated, business-level tables
    • built for reporting, using more de-normalized, read-optimized data models with fewer joins
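
A minimal multi-hop sketch in PySpark, assuming a Delta-enabled Spark session as configured in the Delta Lake sketch above; paths, column names, and cleansing rules are illustrative:

```python
# A minimal sketch of a Bronze => Silver => Gold multi-hop pipeline on Delta
# tables (paths, column names, and cleansing rules are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land source data "as-is", plus load metadata for audit/lineage.
bronze = (
    spark.read.json("/tmp/raw/orders/")            # raw feed from the source
    .withColumn("_load_ts", F.current_timestamp())
)
bronze.write.format("delta").mode("append").save("/tmp/bronze/orders")

# Silver: cleanse and conform "just enough" for an enterprise view.
silver = (
    spark.read.format("delta").load("/tmp/bronze/orders")
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
)
silver.write.format("delta").mode("overwrite").save("/tmp/silver/orders")

# Gold: de-normalized, read-optimized aggregates for reporting.
gold = (
    spark.read.format("delta").load("/tmp/silver/orders")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("lifetime_value"))
)
gold.write.format("delta").mode("overwrite").save("/tmp/gold/customer_ltv")
```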