Getting Started with Apache Iceberg

This blog helps you quickly explore the Apache Iceberg and help you to start on your local machine with Apache Spark 3.5.2 and Apache Iceberg latest version 1.9.2 as of 7th Sept 2025 on Spark Scala shell.

Install the Apache Spark 3.5.1 by following the official guide. After installation, start Apache Spark with Apache Iceberg using the following command, you need to provide the local storage path as Apache Iceberg managed that path for each dataset storage and default database as `db`:

1spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2 \
2    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
3    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
4    --conf spark.sql.catalog.spark_catalog.type=hive \
5    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
6    --conf spark.sql.catalog.local.type=hadoop \
7    --conf spark.sql.catalog.local.warehouse=/Users/avocadodata/data/apache-iceberg/warehouse \
8    --conf spark.sql.defaultCatalog=local

Let's keep this blog post as simple as possible, let's start by creating a basic DataFrame and store the data into Apache Iceberg table format and then reading it back.

A simple data frame construction:val data = Seq(("1", "Kayako", "IN"), ("2", "Zomato", "IN"), ("3", "Google", "US"), ("4", "Tesla", "US"), ("5", "Nippon", "JP"))
val schema = Seq("id", "name", "country")
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF(schema:_*)

As Apache Iceberg handle the storage of datasets automatically so we don't need to specify the path while writing the data frame into iceberg table format.:

Finally store into Apache Iceberg table format:df.writeTo("local.db.products").create()

Let's analyze the storage path and metadata of Apache Iceberg files. In the image below, you can see all the parquet files generated inside the data folders with unique dataset path as `warehouse/db/products` along with metadata folder, So Apache Iceberg has two directory by default `data` and `metadata` for datasets in the warehouse path.

Apache Iceberg storage structure — Apache Iceberg file storage structures.

Read the Apache Iceberg table Using the Spark SQLval icebergTable = spark.sql("SELECT * FROM local.db.products")
icebergTable.show(false)

Read the Apache Iceberg table Using DataFrame APIspark.table("local.db.products").show(false)

Both of the above query will print the same requests, as shown below:

Apache Iceberg table reading from spark shell — Reading Apache Iceberg table from spark shell

Getting Started with Apache Iceberg - Inline insert and update into Apache Iceberg table

Apache Iceberg brings the simplicity of SQL into data warehouse, and let's see it as how simple is to insert and update the records into large datasets.

Insert new records into existing Apache Iceberg tablespark.sql(
        "INSERT INTO local.db.products VALUES (6, 'Avocado', 'IN'), (7, 'NAUKRI', 'IN')",
    )

Read back and visualize the inserted recordsscala> spark.table("local.db.products").show(false)
    +---+-------+-------+
    |id |name   |country|
    +---+-------+-------+
    |6  |Avocado|IN     |
    |1  |Kayako |IN     |
    |7  |NAUKRI |IN     |
    |2  |Zomato |IN     |
    |3  |Google |US     |
    |4  |Tesla  |US     |
    |5  |Nippon |JP     |
    +---+-------+-------+
        

Similarly updating the records is as simple as a one line SQL queryspark.sql("UPDATE local.db.products SET country = 'IN' WHERE id = '4'")
spark.table("local.db.products").show(false)

Read back and visualize the updated recordscala> spark.table("local.db.products").show(false)
    +---+-------+-------+
    |id |name   |country|
    +---+-------+-------+
    |6  |Avocado|IN     |
    |1  |Kayako |IN     |
    |7  |NAUKRI |IN     |
    |2  |Zomato |IN     |
    |3  |Google |US     |
    |4  |Tesla  |IN     | # updated
    |5  |Nippon |JP     |
    +---+-------+-------+

For more information visit the official Apache Iceberg site for updates based SQL queries and insert based SQL queries at here

Getting Started with Apache Iceberg - CDC:Merge statements into existing Apache Spark Table

Imagine you going to update lots of records into the existing apache iceberg table in bulk and make the set and insert for each will be difficult tasks so here is the merge statements that simplify the overall CDC operation into the exist Apache Iceberg table.

Here is the official MERGE INTO statements:

1MERGE INTO prod.db.target t   -- a target table
2USING (SELECT ...) s          -- the source updates
3ON t.id = s.id                -- condition to find updates for target rows
4WHEN ...                      -- updates

As merging into a existing with merge into statements requires a creation of temporary table and then use that table to transfer the data into main table and then truncate or delete the temp table, so let's create a temp table as below:

1val data = Seq(
2  ("1", "Kayako", "US"),
3  ("8", "M&M", "IN"),
4  ("9", "K&K", "UK"),
5  ("10", "L&L", "JP")
6).toDF("id", "name", "country")
7data.writeTo("local.db.temp_products").create()
8spark.table("local.db.temp_products").show(false)

temp table values as shown

1+---+------+-------+                                                            
2|id |name  |country|
3+---+------+-------+
4|1  |Kayako|US     |
5|8  |M&M   |IN     |
6|9  |K&K   |UK     |
7|10 |L&L   |JP     |
8+---+------+-------+

Now let's utilize the merge into statements and merge the temp table into main table `local.db.products` as:

1spark.sql("MERGE INTO local.db.products t \
2     USING (SELECT id, name, country FROM local.db.temp_products) s \
3     ON t.id = s.id  WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")

As the blogs contents getting bigger and bigger so let's deep dive with real example where you want to capture the delete, update and insert events into the main tables, so let's create the temp table for the CDC data as below:

1val data = Seq(
2  ("1", "Kayako", "IN", "update"),
3  ("8", "M&M", "IN", "delete"),
4  ("10", "L&L", "JP", "delete"),
5  ("11", "S&S", "JP", "insert"),
6).toDF("id", "name", "country", "event")
7data.writeTo("local.db.temp_products4").create()

Let's merge this event into main table `local.db.products` as

1spark.sql("MERGE INTO local.db.products t \
2    USING (SELECT * FROM local.db.temp_products4) s \
3    ON t.id = s.id WHEN MATCHED AND s.event = 'delete' THEN DELETE \
4    WHEN MATCHED THEN UPDATE SET * \
5    WHEN NOT MATCHED THEN INSERT *")

Let's see the result:

1scala> spark.table("local.db.products").show(false)
2+---+-------+-------+
3|id |name   |country|
4+---+-------+-------+
5|6  |Avocado|IN     |
6|7  |NAUKRI |IN     |
7|1  |Kayako |IN     |
8|11 |S&S    |JP     |
9|5  |Nippon |IN     |
10|9  |K&K    |UK     |
11|2  |Zomato |IN     |
12|3  |Google |US     |
13|4  |Tesla  |IN     |
14+---+-------+-------+

All the operations discussed above are captured in a screenshot, let's see the all reflected data into the main table.

event captured into an existing table of Apache Iceberg — A Events captured into an Apache Iceberg with all possible cases.

Contact Avocado Data lake for expert data lake implementation and consulting.

Email us at:

For complete codebase and implementation details, please visit our github page AvocadoData

As we discussed in the above example, while the CDC data come from Kafka or some RDBMS databases then simple update the table with merge into statements is not as simple as its look above, Avocado Data lake engineers can isolated all those complexity into Load class and you will eventually needs to pass the data frame as show below with the concepts of `.andThen` pipelines.

1val pipeline = sparkStage
2  .andThen(extract)
3  .andThen(transform)
4  .andThen(load)
5
6pipeline(())

Visit our product pages for more information and contact us page for a free consultation of half an hour on data lake implementation using any open table format.