Copy On Write (CoW)

In the Copy On Write storage type, data is stored exclusively in columnar file formats (e.g., Parquet). Updates rewrite the affected Parquet files with the new values, using a synchronous merge during the write. Because entire files are rewritten on every update, reads are fast but writes are slow, which makes CoW ideal for batch ingestion and analytics (OLAP) use cases.
Configuration for CoW Table:
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.table.name", tableName)
  .save(basePath)
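The snippet above only sets the table type; a real Hudi write also needs a record key and a precombine field. Below is a more complete, hedged sketch of a CoW upsert. The table name, path, and sample columns are illustrative assumptions, not part of the original example; the hoodie.* option keys are standard Hudi datasource configs.

// Illustrative CoW upsert sketch; tableName, basePath, and columns are assumptions.
import spark.implicits._

val tableName = "trips_cow"
val basePath  = "file:///tmp/hudi/trips_cow"

// Sample records: ts acts as the precombine field (latest ts wins on duplicate keys).
val df = Seq(
  ("id-1", "rider-A", 19.10, 1695159649L),
  ("id-2", "rider-B", 27.70, 1695091554L)
).toDF("uuid", "rider", "fare", "ts")

df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")  // rewrites the affected Parquet files
  .mode("append")
  .save(basePath)

Re-running this write with changed fare values for the same uuid would rewrite the Parquet files containing those keys, which is exactly the write amplification described above.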
Concepts
  • Storage: Data in columnar format (Parquet) only; updates rewrite the affected files with the new values via a synchronous merge during the write.
  • Writes: Higher write latency and cost due to file rewriting.
  • Reads: Fast, as queries only read clean, full Parquet files (no logs to merge).
  • Best For: Read-heavy analytics (OLAP), daily batch pipelines, data marts.
  • Compaction: Not needed.
Pros:
  • Simplest operational model.
  • Best read performance.
  • No compaction needed.
Cons:
  • Higher write latency.
  • Higher write amplification.

Merge On Read (MoR)

Merge On Read stores data using a combination of columnar (Parquet) and row-based (Avro) file formats. Updates are logged to row-based delta files and later merged into new versions of the columnar base files; this process is called compaction. Compaction reduces the number of files a query must touch and improves read performance once it finishes, and it is triggered after a certain number of commits or according to the configured compaction trigger.
Configuration for MoR Table:
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.table.name", tableName)
  .save(basePath)
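To make the compaction trigger concrete, here is a hedged sketch of an MoR upsert with inline compaction enabled. hoodie.compact.inline and hoodie.compact.inline.max.delta.commits are standard Hudi configs; the df, tableName, basePath, and key/precombine fields reuse the illustrative schema assumed in the CoW example above.

// Illustrative MoR upsert with inline compaction; names and values are assumptions.
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")        // updates land in Avro log files
  .option("hoodie.compact.inline", "true")                      // run compaction on the writer itself
  .option("hoodie.compact.inline.max.delta.commits", "5")       // compact after every 5 delta commits
  .mode("append")
  .save(basePath)

Inline compaction blocks the writer while it runs; for latency-sensitive pipelines, compaction is usually scheduled asynchronously instead.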
Concepts
  • Storage: Base Parquet files + separate Avro (row-based) log files for changes.
  • Writes: Lower write latency, as updates are appended to log files (lower write amplification); compaction merges them later.
  • Reads: Slower before compaction, as queries must merge Parquet base files with log files; after compaction, reads become faster (see the query sketch after this list).
  • Best For: High-frequency updates and deletes, CDC, and near real-time data needs.
  • Compaction: Required.
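The read-side trade-off shows up in the query type, which Hudi exposes as a read option. A minimal sketch follows, assuming basePath points at an MoR table like the one written above:

// Snapshot query: merges Parquet base files with Avro logs (fresh, but slower before compaction).
val snapshotDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)

// Read-optimized query: reads only compacted base files (fast, but may lag behind the logs).
val readOptimizedDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)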
Pros:
  • Lower write latency.
  • Lower write amplification.
  • Near real-time data availability.
Cons:
  • Higher read latency.
  • Operational complexity (compaction management).

Comparison: CoW vs MoR

| Feature | Copy on Write (CoW) | Merge on Read (MoR) |
|---|---|---|
| Data Storage | Parquet files only | Parquet + Avro |
| Write Mechanism | Updates trigger rewriting of entire base files, synchronously during the write operation. | Updates are logged to delta files; compaction later merges them into new versions of the columnar base files. |
| Read Mechanism | Queries read only the base files, requiring no dynamic merging. | Queries merge Parquet base files with log files to provide up-to-date data, which can be slower before compaction. |
| Data Freshness | Less frequent | Near real-time |
| Compaction | Not required | Required (async or sync), otherwise read performance suffers. |
| Use Case | Read-heavy workloads, batch processing | Write-heavy workloads, streaming |

To get started with creating Apache Hudi tables, visit our Getting Started with Apache Hudi guide.

Need help choosing the right Architecture?


Contact Avocado Datalake for expert consultation on designing scalable data lakes.
Visit our product pages for more information and contact us for a free consultation.