Copy On Write (CoW)

In the Copy On Write storage type, data is stored exclusively in columnar file formats (e.g., Parquet). Updates rewrite the affected Parquet files with the new values, using a synchronous merge during the write. Because entire files are rewritten on every update, reads are fast but writes are slow, which makes CoW ideal for batch ingestion and analytics (OLAP) use cases.
Configuration for CoW Table:
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.table.name", tableName)
  .save(basePath)
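The snippet above only sets the table type; a real Hudi write also needs a record key and a precombine field. Below is a more complete, hedged sketch of a CoW upsert. The table name, path, and sample columns are illustrative assumptions, not part of the original example; the hoodie.* option keys are standard Hudi datasource configs.

// Illustrative CoW upsert sketch; tableName, basePath, and columns are assumptions.
import spark.implicits._

val tableName = "trips_cow"
val basePath  = "file:///tmp/hudi/trips_cow"

// Sample records: ts acts as the precombine field (latest ts wins on duplicate keys).
val df = Seq(
  ("id-1", "rider-A", 19.10, 1695159649L),
  ("id-2", "rider-B", 27.70, 1695091554L)
).toDF("uuid", "rider", "fare", "ts")

df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")  // rewrites the affected Parquet files
  .mode("append")
  .save(basePath)

Re-running this write with changed fare values for the same uuid would rewrite the Parquet files containing those keys, which is exactly the write amplification described above.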
Concepts
  • Storage: Data in columnar format (Parquet) only; updates rewrite the affected files with the new values via a synchronous merge during the write.
  • Writes: Higher write latency and cost due to file rewriting.
  • Reads: Fast, as queries only read clean, full Parquet files (no logs to merge).
  • Best For: Read-heavy analytics (OLAP), daily batch pipelines, data marts.
  • Compaction: Not needed.
Pros:
  • Simplest operational model.
  • Best read performance.
  • No compaction needed.
Cons:
  • Higher write latency.
  • Higher write amplification.

Merge On Read (MoR)

Merge On Read stores data using a combination of columnar (Parquet) and row-based (Avro) file formats. Updates are logged to row-based delta files and later merged into new versions of the columnar base files; this process is called compaction. Compaction reduces the number of files a query must touch and improves read performance once it finishes, and it is triggered after a certain number of commits or according to the configured compaction trigger.
Configuration for MoR Table:
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.table.name", tableName)
  .save(basePath)
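To make the compaction trigger concrete, here is a hedged sketch of an MoR upsert with inline compaction enabled. hoodie.compact.inline and hoodie.compact.inline.max.delta.commits are standard Hudi configs; the df, tableName, basePath, and key/precombine fields reuse the illustrative schema assumed in the CoW example above.

// Illustrative MoR upsert with inline compaction; names and values are assumptions.
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")        // updates land in Avro log files
  .option("hoodie.compact.inline", "true")                      // run compaction on the writer itself
  .option("hoodie.compact.inline.max.delta.commits", "5")       // compact after every 5 delta commits
  .mode("append")
  .save(basePath)

Inline compaction blocks the writer while it runs; for latency-sensitive pipelines, compaction is usually scheduled asynchronously instead.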
Concepts
  • Storage: Base Parquet files + separate Avro (row-based) log files for changes.
  • Writes: Lower write latency, as updates are appended to log files (lower write amplification); compaction merges them later.
  • Reads: Slower before compaction, as queries must merge Parquet base files with log files; after compaction, reads become faster (see the query sketch after this list).
  • Best For: High-frequency updates and deletes, CDC, and near real-time data needs.
  • Compaction: Required.
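The read-side trade-off shows up in the query type, which Hudi exposes as a read option. A minimal sketch follows, assuming basePath points at an MoR table like the one written above:

// Snapshot query: merges Parquet base files with Avro logs (fresh, but slower before compaction).
val snapshotDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)

// Read-optimized query: reads only compacted base files (fast, but may lag behind the logs).
val readOptimizedDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)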
Pros:
  • Lower write latency.
  • Lower write amplification.
  • Near real-time data availability.
Cons:
  • Higher read latency.
  • Operational complexity (compaction management).

Comparison: CoW vs MoR

| Feature | Copy on Write (CoW) | Merge on Read (MoR) |
|---|---|---|
| Data Storage | Parquet files only | Parquet + Avro |
| Write Mechanism | Updates trigger rewriting of entire base files, synchronously during the write operation. | Updates are logged to delta files; compaction later merges them into new versions of the columnar base files. |
| Read Mechanism | Queries read only the base files, requiring no dynamic merging. | Queries merge Parquet base files with log files to provide up-to-date data, which can be slower before compaction. |
| Data Freshness | Less frequent | Near real-time |
| Compaction | Not required | Required (async or sync), otherwise read performance suffers. |
| Use Case | Read-heavy workloads, batch processing | Write-heavy workloads, streaming |

To get started with creating Apache Hudi tables, visit our Getting Started with Apache Hudi guide.

Need help choosing the right Architecture?


Contact Avocado Datalake for expert consultation on designing scalable data lakes.
Visit our product pages for more information and contact us for a free consultation.