Data lakes require an open table format for storage, and as per the official Apache Iceberg documentation: Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time. See the official docs at Apache Iceberg.
Getting Started with Apache Iceberg
This blog helps you quickly explore Apache Iceberg and get started on your local machine with Apache Spark 3.5.2 and the latest Apache Iceberg version (1.9.2 as of 7th Sept 2025), using the Spark Scala shell.
Install Apache Spark 3.5.1 by following the official guide. After installation, start Apache Spark with Apache Iceberg using the following command. You need to provide a local storage path, since Apache Iceberg manages that path for each dataset's storage, and we will use `db` as the default database:
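The exact command is not reproduced here; the sketch below is based on the official Apache Iceberg Spark quickstart and assumes a Hadoop-type catalog named `local` with its warehouse under `$PWD/warehouse` (adjust the warehouse path and versions to your setup):

spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse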
To keep this blog post as simple as possible, let's start by creating a basic DataFrame, storing the data in Apache Iceberg table format, and then reading it back.
Since Apache Iceberg handles the storage of datasets automatically, we don't need to specify a path while writing the DataFrame into Iceberg table format.
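The original sample DataFrame isn't reproduced here; the sketch below uses illustrative id/name/country values, with the column names inferred from the insert and merge examples later in this post:

// Illustrative sample data; spark.implicits._ is already imported in the Spark shell
val df = Seq(
  ("1", "Apple", "US"),
  ("2", "Banana", "IN"),
  ("3", "Cherry", "UK"),
  ("4", "Dragonfruit", "VN"),
  ("5", "Elderberry", "JP")
).toDF("id", "name", "country")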
Finally, store it in Apache Iceberg table format:
df.writeTo("local.db.products").create()
Let's analyze the storage path and metadata of the Apache Iceberg files. In the image below, you can see all the Parquet files generated inside the data folder under the unique dataset path `warehouse/db/products`, along with the metadata folder. So, by default, Apache Iceberg keeps two directories, `data` and `metadata`, for each dataset in the warehouse path.
Apache Iceberg file storage structures.
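Besides browsing the warehouse directory, you can inspect the same information from Spark itself through Iceberg's built-in metadata tables, for example `snapshots` and `files`:

spark.sql("SELECT snapshot_id, operation, committed_at FROM local.db.products.snapshots").show(false)
spark.sql("SELECT file_path, record_count FROM local.db.products.files").show(false)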
Read the Apache Iceberg table using Spark SQL
val icebergTable = spark.sql("SELECT * FROM local.db.products")
icebergTable.show(false)
Read the Apache Iceberg table using the DataFrame API
spark.table("local.db.products").show(false)
Both of the above queries will print the same result, as shown below:
Reading an Apache Iceberg table from the Spark shell
Getting Started with Apache Iceberg - Inline insert and update into an Apache Iceberg table
Apache Iceberg brings the simplicity of SQL to the data warehouse, so let's see how simple it is to insert and update records in large datasets.
Insert new records into an existing Apache Iceberg table
spark.sql(
  "INSERT INTO local.db.products VALUES (6, 'Avocado', 'IN'), (7, 'NAUKRI', 'IN')"
)
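The heading also mentions updates; an UPDATE statement isn't shown in the original, but a minimal example following the same pattern could look like this (the new country value is purely illustrative):

spark.sql("UPDATE local.db.products SET country = 'US' WHERE name = 'NAUKRI'")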
For more information on UPDATE-based and INSERT-based SQL queries, visit the official Apache Iceberg site.
Getting Started with Apache Iceberg - CDC: Merge statements into an existing Apache Iceberg table
Imagine you need to update lots of records in an existing Apache Iceberg table in bulk; writing a separate update or insert for each record would be a difficult task. This is where the MERGE statement simplifies the overall CDC operation on an existing Apache Iceberg table.
MERGE INTO prod.db.target t   -- a target table
USING (SELECT ...) s          -- the source updates
ON t.id = s.id                -- condition to find updates for target rows
WHEN ...                      -- updates
Merging into an existing table with a MERGE INTO statement requires a source to merge from, so we first create a temporary table, use it to transfer the data into the main table, and then truncate or delete the temp table. Let's create the temp table as below:
val data = Seq(
  ("1", "Kayako", "US"),
  ("8", "M&M", "IN"),
  ("9", "K&K", "UK"),
  ("10", "L&L", "JP")
).toDF("id", "name", "country")
data.writeTo("local.db.temp_products").create()
spark.table("local.db.temp_products").show(false)
Now let's use the MERGE INTO statement to merge the temp table into the main table `local.db.products`:
1spark.sql("MERGE INTO local.db.products t \
2 USING (SELECT id, name, country FROM local.db.temp_products) s \
3 ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")
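As mentioned earlier, the temp table can be cleaned up once the merge is done; a simple way to verify the result and drop the staging table is:

spark.table("local.db.products").show(false)
spark.sql("DROP TABLE local.db.temp_products")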
As the blog's content keeps growing, let's deep dive into a real example where you want to capture delete, update and insert events into the main table. Let's create a temp table for the CDC data as below:
val data = Seq(
  ("1", "Kayako", "IN", "update"),
  ("8", "M&M", "IN", "delete"),
  ("10", "L&L", "JP", "delete"),
  ("11", "S&S", "JP", "insert")
).toDF("id", "name", "country", "event")
data.writeTo("local.db.temp_products4").create()
Let's merge these events into the main table `local.db.products`:
1spark.sql("MERGE INTO local.db.products t \
2 USING (SELECT * FROM local.db.temp_products4) s \
3 ON t.id = s.id WHEN MATCHED AND s.event ='delete' THEN DELETE \
4 WHEN MATCHED THEN UPDATE SET * \
5 WHEN NOT MATCHED THEN INSERT *")
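As with the first merge, the CDC staging table can be dropped once its events have been applied:

spark.sql("DROP TABLE local.db.temp_products4")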
All the operations discussed above are captured in the screenshot below; let's see all the reflected data in the main table.
Events captured in an Apache Iceberg table, covering all possible cases.
Contact Avocado Data lake for expert data lake implementation and consulting.
For the complete codebase and implementation details, please visit our GitHub page, AvocadoData.
As we discussed in the example above, when the CDC data comes from Kafka or an RDBMS database, simply updating the table with MERGE INTO statements is not as simple as it looks. Avocado Data lake engineers can isolate all of that complexity into a Load class, so that you eventually only need to pass in the DataFrame, as shown below, using the concept of `.andThen` pipelines.
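The actual Load class is part of Avocado Data Lake's codebase and isn't shown here; the hypothetical sketch below only illustrates the `.andThen` idea of composing DataFrame transformation steps into a single pipeline (all names and steps are placeholders, not the real implementation):

import org.apache.spark.sql.DataFrame

// Hypothetical placeholder steps; the real Load class hides the CDC complexity
val deduplicate: DataFrame => DataFrame = _.dropDuplicates("id")
val latestEventPerKey: DataFrame => DataFrame = df => df  // placeholder for the real CDC logic
val writeToIceberg: DataFrame => Unit = _.writeTo("local.db.products").append()

// Function1.andThen composes the steps into one pipeline that the caller passes a DataFrame to
val pipeline = deduplicate.andThen(latestEventPerKey).andThen(writeToIceberg)
// pipeline(cdcDf)  // cdcDf is the incoming CDC DataFrame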
Visit our product pages for more information, and our contact us page for a free half-hour consultation on data lake implementation using any open table format.