As per the official documentation: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. For more, see the official Delta Lake documentation.
Getting Started with Delta Lake
This blog helps you quickly explore Delta Lake and get started on your local machine with Apache Spark 3.5 and Delta Lake 3.1.0 using the Spark Scala shell.
Install Apache Spark 3.5.1 by following the official guide. After installation, start Apache Spark with Delta Lake using the following command:
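For reference, the launch command from the official Delta Lake quickstart looks roughly like this (adjust the Scala and Delta versions to match your Spark build):

```bash
# Launch the Spark Scala shell with the Delta Lake package and SQL extensions
bin/spark-shell \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```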
Getting Started with Delta Lake + Apache Spark using the Spark Shell on a local machine
To keep this blog post as simple as possible, let's start by creating a basic DataFrame, storing it in Delta Lake table format as a non-partitioned table, and then reading it back.
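Here is a minimal sketch of that step; the sample rows, column names, and `basePath` value are illustrative assumptions rather than the exact dataset used in this post:

```scala
// Assumed base path for the non-partitioned table (adjust to your machine)
val basePath = "file:///Users/avocadodata/data/delta-lake/getting-started/"

// Build a small sample DataFrame; spark.implicits._ is already imported in spark-shell
val df = Seq(
  (1, "Alice", "India"),
  (2, "Bob", "USA"),
  (3, "Carol", "Germany")
).toDF("id", "name", "country")

// Write it out as a non-partitioned Delta table
df.write.format("delta").mode("overwrite").save(basePath)
```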
Let's analyze the storage path and metadata (_delta_log) of the Delta Lake files. In the image below, you can see all the Parquet files generated in the base path along with the _delta_log directory. However, this approach may not be efficient if your data is expected to scale to terabytes.
Delta Lake file storage structure.
Read the Delta Lake table back from the base path above:
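Assuming the same `basePath` as in the sketch above, reading it back looks like this:

```scala
// Load the Delta table from the base path and inspect it
val readDf = spark.read.format("delta").load(basePath)
readDf.show()
```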
Getting Started with Delta Lake - a partitioned table with an existing column:
As discussed above, a non-partitioned Delta Lake table might not be efficient for a large dataset that is expected to scale to terabytes. For a dataset that will grow to terabyte size in the future, we should partition the Delta Lake table while writing it to disk. I will use the same DataFrame created above; only the way it is stored in Delta Lake changes. We need to select a relatively static column from the DataFrame and partition the physical storage of the Delta Lake table by that column, as shown below:
Initialize a separate base path for the partitioned table and store it partitioned by country:
```scala
val partitionBasePath = "file:///Users/avocadodata/data/delta-lake/partition/getting-started/"

// Store the table partitioned by country
df.write.format("delta").partitionBy("country").mode("overwrite").save(partitionBasePath)
```
Let's analyze the physical storage of the data on disk for the Delta Lake table written by the code above:
A partitioned table storage structure for the Delta Lake table files.
Focus on `.partitionBy("country")`: it creates a physical storage folder (prefix) for each country, so the data is distributed across different prefixes. This makes reading and writing more efficient as the data scales out quickly.
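As a quick illustration of that read-side benefit (the country value here is just taken from the sample data assumed above), filtering on the partition column lets Spark prune down to a single folder:

```scala
import org.apache.spark.sql.functions.col

// Only the country=India folder needs to be scanned, thanks to partition pruning
val indiaDf = spark.read.format("delta")
  .load(partitionBasePath)
  .where(col("country") === "India")
indiaDf.show()
```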
Getting Started with Delta Lake - creating a derived column that will serve as the partition column:
Imagine you have a dataset with a finite number of countries; that could work well. But when the partition column grows quickly and is not static, it can become a challenge if each category (country, in this case) expands into terabytes of data. This is where the magic of the derived column comes into play. You can treat this column as metadata and utilize it for partitioning when ingesting data into the Delta Lake storage format in the data lake.
To optimize partitioning, you can create a derived column. Since the primary key may grow rapidly, creating a separate partition for each ID is not practical. Instead, use a function (such as modulo) to limit the number of partitions according to your expected data growth.
```scala
// Derive the partition column from an existing column.
// Here we create only 3 partitions (you might use 100 in practice) based on the id.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType, LongType}

val partitionLabel = "_partition"

// Cast column "id" to long and derive the partition column with a modulo
val cdf = df.withColumn("id", col("id").cast(LongType))
val fdf = cdf.withColumn(partitionLabel, col("id") % 3)

// Store into Delta Lake format, partitioned by the derived column
val basePathCustomPartition = "file:///Users/avocadodata/data/delta-lake/custom_partition"
fdf.write.format("delta").partitionBy(partitionLabel).mode("overwrite").save(basePathCustomPartition)
```
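As a small follow-up sketch (reusing the variables from the block above), filtering on the derived column at read time keeps the same pruning benefit:

```scala
// Read back only one bucket of the derived partition column
val oneBucket = spark.read.format("delta")
  .load(basePathCustomPartition)
  .where(col(partitionLabel) === 1)
oneBucket.show()
```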
Let's analyze the physical storage of the data on disk for this Delta Lake table:
A derived-partition storage structure for the Delta Lake table.
With this approach, it is now up to you how many partitions you create and how much read and write throughput you want for your table in the near future. This is the beauty of a data lake built on open source and cloud providers: there is effectively no limit on data storage or on read and write throughput. This can be achieved through open storage frameworks like Apache Hudi, Apache Iceberg, or Delta Lake, each offering table-level ACID properties.
Contact Avocado Data lake for expert data lake implementation and consulting.
For the complete codebase and implementation details, please visit our GitHub page AvocadoData.
Visit our product pages for more information, and our contact us page for a free half-hour consultation on data lake implementation using any open table format.