The Technology Stack

Below are the component versions configured for this implementation, aligning with Google Cloud Dataproc Serverless runtime compatibility.

Component	Version
Scala	2.13.14
Apache Spark	3.5.4
Apache Iceberg	1.9.1
Dataproc Serverless	v2.2
SBT	1.10.7

Dependency Strategy & SBT Build

When running Spark jobs in serverless cloud environments like Dataproc, handling dependencies correctly is critical to avoid classpath conflicts. Spark and Iceberg are declared as provided in our build.sbt:

At compile & test time: SBT resolves and downloads them normally so your IDE, autocomplete, and local tests work out of the box.
In the assembled JAR: They are excluded. Dataproc Serverless v2.2 pre-installs Spark 3.5, and the Iceberg runtime JARs are supplied via the --jars submission argument. This keeps the assembled JAR small: it contains only the application code.
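
One way to confirm that the provided scope had the intended effect is to inspect the entry listing of the assembled JAR. The sketch below is illustrative only: check_thin_jar is a hypothetical helper, not part of the project, and it assumes the JDK jar tool is available to produce the listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a JAR entry listing on stdin (one path per line)
# and fails if Spark or Iceberg classes were bundled by mistake.
check_thin_jar() {
  local bundled
  # grep exits non-zero when nothing matches, so "|| true" keeps the function going
  bundled="$(grep -E '^org/apache/(spark|iceberg)/' || true)"
  if [ -n "${bundled}" ]; then
    echo "ERROR: provided dependencies found in the JAR:" >&2
    echo "${bundled}" | head -n 5 >&2
    return 1
  fi
  echo "OK: no Spark or Iceberg classes bundled"
}

# Usage (assumes 'sbt assembly' has already produced the JAR):
#   jar tf target/scala-2.13/gcp-iceberg-lakehouse-assembly-0.1.0.jar | check_thin_jar
```

If the check fails, the usual cause is a dependency that lost its "provided" qualifier in build.sbt.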

SBT build definition (build.sbt):

// build.sbt
ThisBuild / organization := "com.avocado"
ThisBuild / version      := "0.1.0"
ThisBuild / scalaVersion := "2.13.14"

lazy val root = (project in file("."))
  .settings(
    name := "gcp-iceberg-lakehouse",

    // Dependencies
    libraryDependencies ++= Seq(
      // --- Apache Spark (provided: Dataproc Serverless v2.2 ships Spark 3.5) ---
      "org.apache.spark" %% "spark-core" % "3.5.4" % "provided",
      "org.apache.spark" %% "spark-sql"  % "3.5.4" % "provided",

      // --- Apache Iceberg (provided: supplied at submit time via --jars) ---
      "org.apache.iceberg" % "iceberg-spark-runtime-3.5_2.13" % "1.9.1" % "provided",
      "org.apache.iceberg" % "iceberg-gcp-bundle"              % "1.9.1" % "provided",

      // --- Test ---
      "org.scalatest" %% "scalatest" % "3.2.19" % Test
    ),

    // sbt-assembly: fat JAR configuration
    assembly / assemblyJarName := s"${name.value}-assembly-${version.value}.jar",

    assembly / assemblyMergeStrategy := {
      case PathList("META-INF", "MANIFEST.MF")            => MergeStrategy.discard
      case PathList("META-INF", xs @ _*) if xs.exists(f =>
        f.endsWith(".SF") || f.endsWith(".DSA") || f.endsWith(".RSA"))
                                                          => MergeStrategy.discard
      case PathList("META-INF", "services", _*)           => MergeStrategy.concat
      case "reference.conf"                               => MergeStrategy.concat
      case PathList("META-INF", _*)                       => MergeStrategy.discard
      case _                                              => MergeStrategy.first
    },

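    // "provided" dependencies are left out of the assembly automatically by sbt-assembly.
    // Also leave out the Scala standard library, which the Spark runtime already supplies.
    assembly / assemblyOption := (assembly / assemblyOption).value
      .withIncludeScala(false),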

    scalacOptions ++= Seq(
      "-encoding", "utf8",
      "-deprecation",
      "-feature",
      "-unchecked",
      "-Xlint:unused"
    )
  )

Compile and build commands:

# Compile the project
sbt compile

# Run tests
sbt test

# Produce assembled/fat JAR (app code only — Spark/Iceberg marked provided/excluded)
sbt assembly

Scala Iceberg Implementation

Our Spark program demonstrates the end-to-end Iceberg lifecycle. Crucially, catalog configurations are not hardcoded. Instead, the Spark session is instantiated dynamically, and the BigLake catalog parameters are injected at runtime via Dataproc command properties.

Scala application entrypoint (com.avocado.lakehouse.IcebergQuickstart):

package com.avocado.lakehouse

import org.apache.spark.sql.SparkSession

/**
 * IcebergQuickstart: Scala equivalent of quickstart.py from the Google Cloud documentation:
 * https://docs.cloud.google.com/lakehouse/docs/use-lakehouse-metastore-iceberg-rest-catalog
 *
 * Demonstrates end-to-end Iceberg table lifecycle on GCP Lakehouse using the
 * BigLake Iceberg REST catalog:
 *   1. Create namespace (idempotent)
 *   2. Create (or replace) an Iceberg table
 *   3. Insert sample rows
 *   4. Read back and print results
 */
object IcebergQuickstart {

  private val CatalogName   = "quickstart_catalog"
  private val NamespaceName = "quickstart_namespace"
  private val TableName     = "test_table_from_jar"
  private val FullTableRef  = s"`$CatalogName`.$NamespaceName.$TableName"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iceberg-quickstart")
      .getOrCreate()

    try {
      run(spark)
    } finally {
      spark.stop()
    }
  }

  def run(spark: SparkSession): Unit = {
    // Step 1: Create namespace (dataset) if it doesn't already exist
    println(s"[1/4] Creating namespace '$NamespaceName' in catalog '$CatalogName'...")
    spark.sql(
      s"CREATE NAMESPACE IF NOT EXISTS `$CatalogName`.$NamespaceName"
    )
    println(s"      Namespace ready.")

    // Step 2: Create (or replace) the Iceberg table
    println(s"[2/4] Creating table $FullTableRef ...")
    spark.sql(
      s"""CREATE OR REPLACE TABLE $FullTableRef (
         |  id   INT,
         |  name STRING
         |) USING iceberg
         |""".stripMargin
    )
    println(s"      Table created.")

    // Step 3: Insert sample data
    println(s"[3/4] Inserting rows into $FullTableRef ...")
    spark.sql(
      s"""INSERT INTO $FullTableRef
         |VALUES (1, 'one'), (2, 'two'), (3, 'three')
         |""".stripMargin
    )
    println(s"      Rows inserted.")

    // Step 4: Read back and print results
    println(s"[4/4] Reading back rows from $FullTableRef ...")
    val df = spark.sql(s"SELECT * FROM $FullTableRef ORDER BY id")
    df.show(truncate = false)
    println(s"      Total rows: ${df.count()}")
  }
}

Submitting to Dataproc Serverless

To run the Spark code in your GCP environment, first build the fat JAR with SBT. The submit script then checks that the JAR exists, uploads it to Google Cloud Storage (GCS), and submits the batch job. In the submit arguments, we pass the HTTPS URLs of the Apache Iceberg JARs via --jars and the BigLake Iceberg REST Catalog settings (e.g. URI, OAuth credentials endpoint, warehouse storage path) via --properties.
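
Because the submit script ships with placeholder values such as *Your-Region*, a small guard can stop a run before anything is uploaded. The following is a minimal sketch under that assumption; require_configured is a hypothetical helper and is not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical guard: fail if a setting is empty or still looks like "*Your-...*".
require_configured() {
  local name="$1" value="$2"
  case "${value}" in
    ''|\**\*)
      # Empty, or starts and ends with a literal asterisk (placeholder style)
      echo "ERROR: ${name} is not configured (got '${value}')" >&2
      return 1
      ;;
  esac
}

# Illustrative values only
REGION="us-central1"
PROJECT_ID="*Your-Project-ID*"   # left unconfigured on purpose

require_configured REGION "${REGION}" && echo "REGION ok"
require_configured PROJECT_ID "${PROJECT_ID}" || echo "PROJECT_ID needs a real value"
```

In a script that runs with set -e, calling require_configured for each setting right after the configuration block would abort on the first placeholder.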

GCP submit script (scripts/submit_quickstart.sh):

#!/bin/bash
# submit_quickstart.sh
set -euo pipefail

# Configuration — adjust these values to match your GCP environment
REGION="*Your-Region*"
PROJECT_ID="*Your-Project-ID*"
LAKEHOUSE_CATALOG_ID="*Your-Catalog-ID*"

# GCS bucket and path for the Scala JAR
JAR_BUCKET="*Your-Bucket-Name*"
JAR_PREFIX="scala/jar"
JAR_VERSION="0.1.0"
JAR_NAME="gcp-iceberg-lakehouse-assembly-${JAR_VERSION}.jar"
JAR_LOCAL="target/scala-2.13/${JAR_NAME}"
JAR_GCS="gs://${JAR_BUCKET}/${JAR_PREFIX}/${JAR_NAME}"

# Iceberg JARs, fetched over HTTPS from a Maven Central mirror.
# The Scala suffix (_2.13) must match the scalaVersion in build.sbt.
MAVEN_BASE="https://storage-download.googleapis.com/maven-central/maven2/org/apache/iceberg"
ICEBERG_VERSION="1.9.1"
ICEBERG_RUNTIME_JAR="${MAVEN_BASE}/iceberg-spark-runtime-3.5_2.13/${ICEBERG_VERSION}/iceberg-spark-runtime-3.5_2.13-${ICEBERG_VERSION}.jar"
ICEBERG_GCP_BUNDLE_JAR="${MAVEN_BASE}/iceberg-gcp-bundle/${ICEBERG_VERSION}/iceberg-gcp-bundle-${ICEBERG_VERSION}.jar"

# Main class in the JAR
MAIN_CLASS="com.avocado.lakehouse.IcebergQuickstart"

# 1. Verify local JAR exists
if [[ ! -f "${JAR_LOCAL}" ]]; then
  echo "ERROR: JAR not found at ${JAR_LOCAL}. Run 'sbt assembly' first."
  exit 1
fi

# 2. Upload JAR to GCS
echo "[1/2] Uploading JAR to GCS..."
gcloud storage cp "${JAR_LOCAL}" "${JAR_GCS}"

# 3. Submit Spark batch job to Dataproc Serverless
echo "[2/2] Submitting Dataproc Serverless batch job..."
gcloud dataproc batches submit spark \
    --class="${MAIN_CLASS}" \
    --jars="${JAR_GCS},${ICEBERG_RUNTIME_JAR},${ICEBERG_GCP_BUNDLE_JAR}" \
    --project="${PROJECT_ID}" \
    --region="${REGION}" \
    --version=2.2 \
    --properties="\
spark.sql.defaultCatalog=quickstart_catalog,\
spark.sql.catalog.quickstart_catalog=org.apache.iceberg.spark.SparkCatalog,\
spark.sql.catalog.quickstart_catalog.type=rest,\
spark.sql.catalog.quickstart_catalog.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,\
spark.sql.catalog.quickstart_catalog.warehouse=gs://${LAKEHOUSE_CATALOG_ID},\
spark.sql.catalog.quickstart_catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO,\
spark.sql.catalog.quickstart_catalog.header.x-goog-user-project=${PROJECT_ID},\
spark.sql.catalog.quickstart_catalog.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager,\
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,\
spark.sql.catalog.quickstart_catalog.header.X-Iceberg-Access-Delegation=vended-credentials,\
spark.sql.catalog.quickstart_catalog.gcs.oauth2.refresh-credentials-endpoint=https://oauth2.googleapis.com/token"
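
The --properties value above is a single comma-separated string, which is easy to break with a stray space or a missing comma. As an alternative, the string can be assembled from one setting per argument. This is a sketch, not the author's script: join_props is a hypothetical helper, the values are placeholders, only a subset of the settings is shown, and it assumes no value contains a comma.

```shell
#!/usr/bin/env bash
# Placeholder values for illustration only
CATALOG="quickstart_catalog"
PROJECT_ID="my-project"
WAREHOUSE="gs://my-bucket"

# Hypothetical helper: joins all arguments with commas, the separator --properties expects
join_props() {
  local IFS=','
  printf '%s\n' "$*"
}

PREFIX="spark.sql.catalog.${CATALOG}"
PROPERTIES="$(join_props \
  "spark.sql.defaultCatalog=${CATALOG}" \
  "${PREFIX}=org.apache.iceberg.spark.SparkCatalog" \
  "${PREFIX}.type=rest" \
  "${PREFIX}.warehouse=${WAREHOUSE}" \
  "${PREFIX}.header.x-goog-user-project=${PROJECT_ID}")"

echo "${PROPERTIES}"
# The result could then be passed to the existing command as: --properties="${PROPERTIES}"
```

Keeping one setting per line also makes diffs easier to review when a catalog option is added or removed.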

Submit commands:

# 1. Build the JAR
sbt assembly

# 2. Submit Spark job to Dataproc Serverless
chmod +x scripts/submit_quickstart.sh
./scripts/submit_quickstart.sh

Once the job completes successfully, the Apache Iceberg table is registered automatically in GCP. You can then query the resulting table directly from BigQuery:

SELECT * FROM `scenic-treat-435221-t1.avocado_lakehouse_catalog.quickstart_namespace.test_table_from_jar`;

Getting Started with GCP Iceberg Lakehouse and BigQuery

Contact Avocado Datalake for expert data lake implementation and consulting.

Implementing serverless Spark pipelines with BigLake and Apache Iceberg catalog synchronization involves orchestrating GCP credentials, schema configurations, and partition strategies. Our engineers at Avocado Datalake specialize in building, testing, and optimizing cloud-native lakehouses.

We can guide you through setting up your lakehouse with Terraform as infrastructure as code (IaC), so that the setup is easier to replicate across your organization in the future.

Visit our products page for more information and our contact page for a free 30-minute consultation.