As per the official documentation: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. For more, see the official Delta Lake documentation.
Getting Started with Delta Lake
This blog helps you quickly explore Delta Lake and get started on your local machine with Apache Spark 3.5 and Delta Lake 3.1.0 using the Spark Scala shell.
Install Apache Spark 3.5.1 by following the official guide. After installation, start Apache Spark with Delta Lake using the following command:
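For reference, the launch command from the official Delta Lake quickstart looks roughly like this (adjust the Scala and Delta versions to match your Spark build):

```bash
# Launch the Spark Scala shell with the Delta Lake package and SQL extensions
bin/spark-shell \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
```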
Getting Started with Delta Lake + Apache Spark using the Spark Shell on a local machine
To keep this blog post as simple as possible, let's start by creating a basic DataFrame, storing it in Delta Lake table format as a non-partitioned table, and then reading it back.
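Here is a minimal sketch of that step; the sample rows, column names, and `basePath` value are illustrative assumptions rather than the exact dataset used in this post:

```scala
// Assumed base path for the non-partitioned table (adjust to your machine)
val basePath = "file:///Users/avocadodata/data/delta-lake/getting-started/"

// Build a small sample DataFrame; spark.implicits._ is already imported in spark-shell
val df = Seq(
  (1, "Alice", "India"),
  (2, "Bob", "USA"),
  (3, "Carol", "Germany")
).toDF("id", "name", "country")

// Write it out as a non-partitioned Delta table
df.write.format("delta").mode("overwrite").save(basePath)
```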
Let's analyze the storage path and metadata (_delta_log) of the Delta Lake files. In the image below, you can see all the Parquet files generated in the base path along with the _delta_log directory. However, this approach may not be efficient if your data is expected to scale to terabytes.
Delta Lake file storage structure.
Read the Delta Lake table back from the base path above:
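Assuming the same `basePath` as in the sketch above, reading it back looks like this:

```scala
// Load the Delta table from the base path and inspect it
val readDf = spark.read.format("delta").load(basePath)
readDf.show()
```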
Getting Started with Delta Lake - a partitioned table with an existing column:
As discussed above, a non-partitioned Delta Lake table might not be efficient for a large dataset that is expected to scale to terabytes. For a dataset that will grow to terabyte size in the future, we should partition the Delta Lake table while writing it to disk. I will use the same DataFrame created above; only the way it is stored in Delta Lake changes. We need to select a relatively static column from the DataFrame and partition the physical storage of the Delta Lake table by that column, as shown below:
Initialize a separate base path for the partitioned table and store it partitioned by country:
```scala
val partitionBasePath = "file:///Users/avocadodata/data/delta-lake/partition/getting-started/"

// Store the table partitioned by country
df.write.format("delta").partitionBy("country").mode("overwrite").save(partitionBasePath)
```
Let's analyze the physical storage of the data on disk for the Delta Lake table written by the code above:
A partitioned table storage structure for the Delta Lake table files.
Focus on `.partitionBy("country")`: it creates a physical storage folder (prefix) for each country, so the data is distributed across different prefixes. This makes reading and writing more efficient as the data scales out quickly.
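As a quick illustration of that read-side benefit (the country value here is just taken from the sample data assumed above), filtering on the partition column lets Spark prune down to a single folder:

```scala
import org.apache.spark.sql.functions.col

// Only the country=India folder needs to be scanned, thanks to partition pruning
val indiaDf = spark.read.format("delta")
  .load(partitionBasePath)
  .where(col("country") === "India")
indiaDf.show()
```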
Getting Started with Delta Lake - creating a derived column that will serve as the partition column:
Imagine you have a dataset with a finite number of countries; that could work well. But when the partition column grows quickly and is not static, it can become a challenge if each category (country, in this case) expands into terabytes of data. This is where the magic of the derived column comes into play. You can treat this column as metadata and utilize it for partitioning when ingesting data into the Delta Lake storage format in the data lake.
To optimize partitioning, you can create a derived column. Since the primary key may grow rapidly, creating a separate partition for each ID is not practical. Instead, use a function (such as modulo) to limit the number of partitions according to your expected data growth.
```scala
// Derive the partition column from an existing column.
// Here we create only 3 partitions (you might use 100 in practice) based on the id.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType, LongType}

val partitionLabel = "_partition"

// Cast column "id" to long and derive the partition column with a modulo
val cdf = df.withColumn("id", col("id").cast(LongType))
val fdf = cdf.withColumn(partitionLabel, col("id") % 3)

// Store into Delta Lake format, partitioned by the derived column
val basePathCustomPartition = "file:///Users/avocadodata/data/delta-lake/custom_partition"
fdf.write.format("delta").partitionBy(partitionLabel).mode("overwrite").save(basePathCustomPartition)
```
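As a small follow-up sketch (reusing the variables from the block above), filtering on the derived column at read time keeps the same pruning benefit:

```scala
// Read back only one bucket of the derived partition column
val oneBucket = spark.read.format("delta")
  .load(basePathCustomPartition)
  .where(col(partitionLabel) === 1)
oneBucket.show()
```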
Let's analyze the physical storage of the data on disk for this Delta Lake table:
A derived-partition storage structure for the Delta Lake table.
With this approach, it is now up to you how many partitions you create and how much read and write throughput you want for your table in the near future. This is the beauty of a data lake built on open source and cloud providers: there is effectively no limit on data storage or on read and write throughput. This can be achieved through open storage frameworks like Apache Hudi, Apache Iceberg, or Delta Lake, each offering table-level ACID properties.
Contact Avocado Data lake for expert data lake implementation and consulting.
For the complete codebase and implementation details, please visit our GitHub page AvocadoData.
Visit our product pages for more information, and our contact us page for a free half-hour consultation on data lake implementation using any open table format.