Getting Started with Apache Hudi

This blog helps you quickly explore Apache Hudi and get started on your local machine with Apache Spark 3.5.2 and the latest Apache Hudi version, 1.0.2 as of 30th Sept 2025, using the Spark Scala shell.
Install Apache Spark 3.5.2 by following the official guide. After installation, start Apache Spark with Apache Hudi using the spark-shell command below. You need to provide a local storage path, because Apache Hudi manages that path while writing the dataset; each dataset (table) can be named differently, and you can provide a different base path for each.
spark-shell --packages org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.2 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'
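If you would rather set these options inside a standalone Scala application instead of passing them on the spark-shell command line, a minimal sketch looks like the following (the app name is arbitrary, and in practice you may prefer to supply the Hudi bundle through your build or spark-submit rather than `spark.jars.packages`):

import org.apache.spark.sql.SparkSession

// Same settings as the spark-shell flags above, applied programmatically.
val spark = SparkSession.builder()
  .appName("hudi-getting-started")
  .master("local[*]")
  .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.2")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .config("spark.kryo.registrator", "org.apache.spark.HoodieSparkKryoRegistrar")
  .getOrCreate()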
To keep this blog post as simple as possible, let's start by creating a basic DataFrame, storing the dummy data in Apache Hudi table format, and then reading it back, purely for demo purposes.
A simple DataFrame construction:
val data = Seq(("1", "Kayako", "IN"), ("2", "Zomato", "IN"), ("3", "Google", "US"), ("4", "Tesla", "US"), ("5", "Nippon", "JP"))
val schema = Seq("id", "name", "country")
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF(schema:_*)
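Before writing anything, you can sanity-check the DataFrame in the shell; it should show three string columns, `id`, `name`, and `country`, with the five demo rows:

// Quick sanity check of the schema and the demo rows.
df.printSchema()
df.show(false)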
Since Apache Hudi handles the storage of each dataset at write time, let's initialize a local path for local development along with a table name:
val tableName = "getting-started"
val basePath = "file:///Users/avocadodata/data/apache-hudi/getting-started/no-partition/"
Finally, store the DataFrame in Apache Hudi table format as shown below:
import org.apache.spark.sql.SaveMode

df.write.format("hudi")
  .option("hoodie.table.name", tableName)
  .mode(SaveMode.Overwrite)
  .save(basePath)
Let's analyze the storage path and metadata of the Apache Hudi files. In the image below, you can see the parquet files generated inside the base path folder, along with the `.hoodie` metadata directory. In other words, Apache Hudi keeps both data and metadata for a dataset under the single base path you provided.
Apache Hudi file storage structure, with the `.hoodie` metadata directory alongside the data files.
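If you prefer to inspect the layout from the same shell session instead of a file browser, a small sketch using the Hadoop FileSystem API already on the classpath works too:

import org.apache.hadoop.fs.{FileSystem, Path}

// List the top-level contents of the base path; expect the .hoodie metadata
// directory alongside the parquet data files.
val fs = FileSystem.get(new java.net.URI(basePath), spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(basePath)).foreach(status => println(status.getPath))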
Read the Apache Hudi table using the DataFrame API
val testTable = spark.read.format("hudi").load(basePath)
testTable.show(false)
Read the Apache Hudi table using the Spark SQL API
// Create a temp view from the above DataFrame: testTable
testTable.createOrReplaceTempView("testTable")
// Read using Spark SQL
spark.sql("SELECT _hoodie_commit_time, _hoodie_record_key, id, name, country FROM testTable").show(false)
Both of the above queries will print the same result, as shown below:
Reading an Apache Hudi table in the Spark shell, either via the DataFrame API or via a Spark SQL query against a temp view created in the Spark session.
Summary: You can expect a non-partitioned table to become a bottleneck quickly if your dataset grows rapidly, and in the era of AI/ML there is a high chance that data growth will be significant. So, let's explore partitioned table storage in Apache Hudi next, to reach petabyte scale without any concerns about storage size.

Managing Apache Hudi table partition configuration to handle data at large scale

In this section I will describe the simplest approach to partitioning an Apache Hudi table within a data lake. For this post the storage is on your local machine, but the storage location could equally be AWS S3, GCS, or any other cloud storage.
The partitioning concept in Apache Hudi is simple: it organizes data storage within the base path using different prefixes (for ease of understanding, literally a separate directory for each distinct group of data), based on the instructions provided when writing or loading the data into Apache Hudi table format. Let's define the base path and table name for the partitioned table:
val basePath = "file:///Users/avocadodata/data/apache-hudi/getting-started/partition"
val tableName = "partition_example"
Now, store the same DataFrame we defined previously:
df.write.format("hudi")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.partitionpath.field", "country")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .mode(SaveMode.Overwrite)
  .save(basePath)
Focus on the `hoodie.datasource.write.partitionpath.field` option in the snippet above. Its value can be any static column of your DataFrame, or you can derive another column from an existing one using a `.withColumn` transformation (see the sketch after the screenshot below). Now see the storage of the partitioned table on physical disk:
An example of an Apache Hudi partitioned table: the physical storage file structure for a partitioned table.
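As a hypothetical example of the `.withColumn` approach mentioned above, you could derive a coarser `region` column from `country` and partition on that instead; the column name, the mapping, and the separate base path here are purely illustrative:

import org.apache.spark.sql.functions.{col, when}

// Derive an illustrative "region" column and use it as the partition path.
val dfWithRegion = df.withColumn(
  "region",
  when(col("country") === "US", "AMER").otherwise("APAC")
)

// Written to its own base path so it does not overwrite the table above.
dfWithRegion.write.format("hudi")
  .option("hoodie.table.name", "partition_by_region_example")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .mode(SaveMode.Overwrite)
  .save("file:///Users/avocadodata/data/apache-hudi/getting-started/partition-by-region/")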
Now, let's query the table in the Spark shell and look at the column information in the Apache Hudi table:
Partition column info for the Apache Hudi table, surfaced in the metadata column `_hoodie_partition_path`.
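The output above can be reproduced with a query along the following lines (the exact commit times will differ on your machine):

// Read the partitioned table back; _hoodie_partition_path carries the
// country value that was used as the partition path.
val partitionedTable = spark.read.format("hudi").load(basePath)
partitionedTable
  .select("_hoodie_commit_time", "_hoodie_partition_path", "id", "name", "country")
  .show(false)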
Summary: Partitioning here is based on a single column from the source table, which should have a finite, reasonably small set of values. For instance, using a country column and storing the data for each country in a separate partition can significantly enhance data retrieval efficiency. Another example is using a date or datetime column, where partitions can be organized daily, weekly, monthly, or yearly, largely depending on the volume of data you intend to store in your data lake table and how you plan to retrieve it; a sketch of this follows below. For more details on how these strategies help improve read and write performance, please check my next blog post on Apache Hudi table partitioning for handling petabyte-scale data.
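To make the date-based idea concrete, here is a hedged sketch: it assumes a source DataFrame `sourceDf` with an event timestamp column `event_ts` (neither exists in the demo data above) and derives a monthly partition column from it; the table name and path are illustrative only.

import org.apache.spark.sql.functions.{col, date_format}

// Hypothetical: derive a yyyy-MM partition column from an event timestamp.
val monthly = sourceDf.withColumn("event_month", date_format(col("event_ts"), "yyyy-MM"))

monthly.write.format("hudi")
  .option("hoodie.table.name", "events_by_month")
  .option("hoodie.datasource.write.partitionpath.field", "event_month")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .mode("append")
  .save("file:///Users/avocadodata/data/apache-hudi/getting-started/monthly/")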

Contact Avocado Data Lake for expert data lake implementation using the Apache Hudi table format


For the complete codebase and implementation details, please visit our GitHub page, AvocadoData.
As discussed in the example above, when CDC data comes from Kafka or an RDBMS, a simple insert into the table will not work; you need to handle the CDC while writing into Apache Hudi format with either a CoW or MoR table type. Avocado Data Lake engineers can isolate all of that complexity into a Load class, so you eventually only need to pass the DataFrame, as shown below, using `.andThen` pipelines.
val pipeline = sparkStage
  .andThen(extract)
  .andThen(transform)
  .andThen(load)

pipeline(())
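Purely for illustration, the stages above compose as ordinary Scala functions via `.andThen`; a minimal sketch with hypothetical sources, paths, and table names (not the actual Avocado Data Lake Load class) could look like this:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical stage definitions; the real implementation that handles CDC
// upserts into CoW/MoR tables is encapsulated elsewhere.
val sparkStage: Unit => SparkSession = _ =>
  SparkSession.builder().appName("cdc-pipeline").getOrCreate()

val extract: SparkSession => DataFrame = spark =>
  spark.read.format("parquet").load("file:///tmp/cdc/incoming") // placeholder CDC source

val transform: DataFrame => DataFrame = df =>
  df.dropDuplicates("id") // placeholder transformation

val load: DataFrame => Unit = df =>
  df.write.format("hudi")
    .option("hoodie.table.name", "cdc_target")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("file:///tmp/cdc/target")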
Visit our product pages for more information, and our contact us page for a free half-hour consultation on data lake implementation using any open table format.