Semi-Structured Data Sources (NoSQL databases) to Data Lake
List of semi-structured data sources / databases we support for ingestion into your data lake:
- MongoDB
- AWS DynamoDB
- Google Cloud Bigtable
- Kafka messages
- Any other NoSQL databases
- Semi-structured data formats like XML, Avro, and Parquet
- Semi-structured files like JSON or CSV
Our semi-structured data pipeline ETL job lets you process and transform data from sources like MongoDB and AWS DynamoDB into your centralized data lake. This enables you to gain deeper insights from your data and improve your decision-making using AI/ML, data insights, or data analytics.
MongoDB Ingestion: We use the MongoDB Connector for Apache Kafka to read data from MongoDB (CDC) and push it to a Kafka topic. From there, we can consume the data via streaming or batch jobs based on business requirements.
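As an illustration of this step, a minimal MongoDB Kafka Source Connector configuration might look like the following; the connection string, database, collection, and topic prefix are placeholders, not values from our setup:

```json
{
  "name": "mongo-cdc-source",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://mongo-host:27017",
    "database": "appdb",
    "collection": "orders",
    "topic.prefix": "cdc",
    "output.format.value": "json",
    "publish.full.document.only": "true"
  }
}
```

With this config deployed on a Kafka Connect cluster, change events for the collection land on the topic `cdc.appdb.orders`, ready for the batch or streaming consumers described below.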
For batch processing, we utilize GCP Cloud Dataflow, AWS Glue, or Azure Data Factory, creating configurable jobs for each MongoDB collection to write into the Data Lake in open table formats like Apache Iceberg, Apache Hudi, or Delta Lake. For real-time needs, Spark Streaming jobs can consume directly from Kafka.
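A sketch of the transform step such a job might run before writing to the lake: the CDC envelope fields follow the MongoDB change-stream event format, and the Spark wiring in the comments is illustrative, not a complete job definition:

```python
import json
from typing import Optional


def extract_full_document(raw_event: str) -> Optional[dict]:
    """Pull the current state of a document out of a MongoDB
    change-stream event, skipping deletes (which carry no fullDocument)."""
    event = json.loads(raw_event)
    doc = event.get("fullDocument")
    if doc is None:
        return None
    # Normalize the Mongo _id so it lands as a plain string column.
    oid = doc.get("_id")
    if isinstance(oid, dict) and "$oid" in oid:
        doc["_id"] = oid["$oid"]
    return doc


# In the actual batch job this would run inside Spark, roughly:
#   df = (spark.read.format("kafka")
#         .option("kafka.bootstrap.servers", "broker:9092")
#         .option("subscribe", "cdc.appdb.orders").load())
#   rows = (df.selectExpr("CAST(value AS STRING) AS raw")
#           .rdd.map(lambda r: extract_full_document(r.raw)))
# and the result is written out in Delta/Iceberg/Hudi format.
```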
Example of incremental batch job for MongoDB to Data Lake using Kafka:
MongoDB -> MongoDB Kafka Connector -> Kafka -> Spark Structured Streaming/batch -> Data Lake

DynamoDB Ingestion: A similar concept applies to DynamoDB, which has a Kinesis connector to capture changes (CDC) and push them to a Kinesis stream. We can then consume this data via batch or streaming jobs.
For batch jobs, we use cloud-native ETL tools (Dataflow, Glue, ADF) to read from Kinesis and write to the Data Lake in your preferred open table format. Spark Streaming can also be used for real-time ingestion.
Example of incremental batch job for DynamoDB to Data Lake using Kinesis:
DynamoDB -> DynamoDB Stream -> Kinesis -> Spark Structured Streaming / batch -> Data Lake

Specifically for DynamoDB, we can replace Kinesis with Kafka as well: we write an AWS Lambda function to read data from the DynamoDB Stream and push it to a Kafka topic. Example of incremental batch job for DynamoDB to Data Lake using Kafka:

DynamoDB -> DynamoDB Stream -> AWS Lambda -> Kafka -> Spark Structured Streaming / batch -> Data Lake

Semi-Structured Data Sources (NoSQL databases) to Data Lake high level design flow:

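The Lambda hop in the DynamoDB-to-Kafka flow above could be sketched as follows. The topic name is illustrative, and the Kafka producer (e.g. kafka-python's `KafkaProducer`) is passed in as an argument here for clarity; a real Lambda handler takes `(event, context)` and would create the producer outside the function:

```python
import json


def ddb_to_plain(attr):
    """Convert a DynamoDB-typed attribute value ({'S': ...}, {'N': ...},
    {'M': ...}, {'L': ...}, {'BOOL': ...}, {'NULL': ...}) to plain Python."""
    (tag, value), = attr.items()
    if tag == "S":
        return value
    if tag == "N":  # DynamoDB sends numbers as strings
        return float(value) if "." in value else int(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "M":
        return {k: ddb_to_plain(v) for k, v in value.items()}
    if tag == "L":
        return [ddb_to_plain(v) for v in value]
    return value  # pass through types not handled here (B, SS, NS, ...)


def handler(event, producer, topic="ddb.orders.cdc"):
    """Forward each DynamoDB Stream record's new image to Kafka as JSON."""
    for record in event["Records"]:
        image = record["dynamodb"].get("NewImage")
        if image is None:  # REMOVE events carry no new image
            continue
        row = {k: ddb_to_plain(v) for k, v in image.items()}
        producer.send(topic, json.dumps(row).encode("utf-8"))
```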
More details about the offering, including pipeline orchestration and setup timeline, are available in a half-hour or one-hour presentation. If interested, reach out to us at the email id below: [email protected]
For more on data flows such as Bigtable NoSQL to Data Lake, see the high level design flow below:
BigTable -> BigTable Stream -> PubSub -> Spark Structured Streaming / batch -> Data Lake

Once the data is ingested into the data lake in an open table format like Apache Iceberg, Delta Lake, or Hudi, it can be consumed by various tools such as AWS Athena, AWS QuickSight, AWS SageMaker, AWS EMR, AWS Redshift, Databricks SQL, Databricks ML, Databricks BI, etc., for analytics, machine learning, and business intelligence. We will help your organization get the right access to the data lake and provide the right tools to consume the data as per the business requirement.
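As a small consumption example, a lake table registered in the Glue Catalog can be queried through Athena. The database, table, and result-bucket names below are illustrative, and the boto3 call is sketched in comments so the query-building step stands alone:

```python
def build_preview_query(database: str, table: str, limit: int = 10) -> str:
    """Build a simple Athena preview query for a lake table."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'


# With boto3, the query would be submitted roughly like this:
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=build_preview_query("lake_db", "orders"),
#       ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
#   )
```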
AWS S3 -> AWS Glue Catalog -> AWS Lake Formation -> AWS IAM Role -> AWS Athena / AWS QuickSight / AWS SageMaker / AWS EMR / AWS Redshift

List of the products we offer as ingestion other than semi-structured data sources to data lake:
- Relational Data Sources (RDBMS) to Data Lake
- Unstructured Data Sources to Data Lake
- Data Lake to Data Warehouse


