Leverage unstructured data within your centralized data lake for advanced analytics and machine learning using our highly configurable ETL pipelines. We specialize in ingesting data from unstructured sources by setting up automated pipelines in your preferred cloud environment. These pipelines continuously process raw data and write it into your data lake in optimized open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi, while enabling a data catalog for seamless discovery and governance. This empowers your organization to harness the full potential of your data for Data Science, Machine Learning, and AI.
For example, if your organization receives bank statements as CSV files via AWS S3, SFTP, or FTP servers, our custom pipeline handles the ingestion using secure credentials (an AWS IAM role ARN, GCP service account credentials, or Azure AD). It first dumps the raw data into an internal S3 bucket, then processes and transforms the data according to your organization's requirements, writing it into an optimized open table format such as Apache Iceberg, Delta Lake, or Apache Hudi. Finally, it enables data cataloging for seamless discovery and governance.
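As a rough illustration of that transform-and-write step, the PySpark sketch below reads raw CSV statements from a staging bucket and appends them to a Delta Lake table. The bucket paths, table location, and transformations are hypothetical placeholders, not our fixed implementation.

# Minimal PySpark sketch of the CSV-to-open-table step (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("bank-statement-ingestion")
    # Assumes the Delta Lake package is available on the cluster.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# 1. Read the raw CSV dump from the internal staging bucket.
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://example-raw-bucket/bank-statements/")  # hypothetical path
)

# 2. Apply organization-specific transformations (placeholder logic).
cleaned = raw.dropDuplicates().withColumn("ingested_at", F.current_timestamp())

# 3. Append into the data lake in an open table format (Delta Lake here).
cleaned.write.format("delta").mode("append").save(
    "s3a://example-lake-bucket/tables/bank_statements"  # hypothetical path
)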
Unstructured data can be ingested from a variety of sources, including:
SFTP and FTP servers, AWS S3, GCP Cloud Storage, Azure Blob Storage
These sources can contain any type of file; we support the following kinds of unstructured and semi-structured data (a reading sketch follows the list):
  • Parquet Files
  • JSON Files
  • CSV Files
  • Avro Files
  • XML Files
  • Web APIs
  • Text Documents
  • ORC Files
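To make the configuration-driven reading concrete, here is a minimal sketch of how a format-to-reader mapping might look in PySpark; the mapping, function name, and paths are illustrative assumptions rather than a documented API.

# Hypothetical configuration-driven reader for the formats listed above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-format-reader").getOrCreate()

# Spark reads most of these formats natively; Avro needs the spark-avro
# package, XML typically needs spark-xml, and Web API responses are
# fetched separately and landed as JSON before this step.
READERS = {
    "parquet": lambda path: spark.read.parquet(path),
    "json":    lambda path: spark.read.json(path),
    "csv":     lambda path: spark.read.option("header", "true").csv(path),
    "avro":    lambda path: spark.read.format("avro").load(path),
    "orc":     lambda path: spark.read.orc(path),
    "text":    lambda path: spark.read.text(path),
}

def read_source(fmt: str, path: str):
    # Pick a reader based on the configured source format.
    if fmt not in READERS:
        raise ValueError(f"Unsupported format: {fmt}")
    return READERS[fmt](path)

df = read_source("csv", "s3a://example-raw-bucket/incoming/")  # hypothetical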
We have developed a custom pipeline to ingest data from these diverse sources into the data lake. First, the pipeline reads files from the source and dumps them as-is into a raw storage layer (S3/GCS/Azure Blob). Then, using custom configurations, it processes and validates the data against the existing table schema. This process can be orchestrated as multiple tasks in Airflow, Azure Data Factory (ADF), or GCP Dataflow, all easily achieved with our solution.
Spark/PySpark -> Connect to file sources
        -> dump to internal S3/GCS/Azure Blob -> Spark/PySpark
        -> process as per configuration
        -> write to Delta Lake/Apache Iceberg/Apache Hudi
        -> Data Catalog -> Data Science/ML/AI
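For the first hop in this flow, a minimal sketch of pulling files from an SFTP source and landing them as-is in a raw S3 bucket might look like the following; the host, directory, credentials, and bucket names are all hypothetical.

# Hypothetical SFTP-to-raw-bucket dump using paramiko and boto3.
import boto3
import paramiko

SFTP_HOST = "sftp.example.com"     # hypothetical host
RAW_BUCKET = "example-raw-bucket"  # hypothetical staging bucket

s3 = boto3.client("s3")

transport = paramiko.Transport((SFTP_HOST, 22))
transport.connect(username="ingest_user", password="***")  # use a secret store in practice
sftp = paramiko.SFTPClient.from_transport(transport)

try:
    for name in sftp.listdir("/outgoing"):
        # Stream each remote file straight into the raw layer, unchanged.
        with sftp.open(f"/outgoing/{name}", "rb") as remote_file:
            s3.upload_fileobj(remote_file, RAW_BUCKET, f"landing/{name}")
finally:
    sftp.close()
    transport.close()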
Once the data lands in our internal S3 bucket (or GCS/Azure Blob), we process it according to the organization's requirements and apply transformations so that the incoming data conforms to the existing table schema.
CSV/JSON/Text documents -> Spark/PySpark batch/streaming
        -> S3/GCS/Azure Blob Storage (open table format)
        -> Data Catalog -> Data Science/ML/AI
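One way to sketch that schema-validation step is to project the incoming data onto the columns and types the target table already declares before appending. The table name, paths, and column handling below are illustrative assumptions, not our exact logic.

# Hypothetical schema alignment against an existing lake table.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("schema-align").getOrCreate()

def align_to_table(incoming: DataFrame, table_name: str) -> DataFrame:
    # Cast each incoming column to the type the target table declares,
    # failing fast if a required column is missing.
    target_schema = spark.table(table_name).schema
    missing = [f.name for f in target_schema if f.name not in incoming.columns]
    if missing:
        raise ValueError(f"Incoming data is missing columns: {missing}")
    return incoming.select([incoming[f.name].cast(f.dataType) for f in target_schema])

aligned = align_to_table(
    spark.read.json("s3a://example-raw-bucket/landing/"),  # hypothetical path
    "lake_db.bank_statements",                             # hypothetical table
)
aligned.write.format("delta").mode("append").saveAsTable("lake_db.bank_statements")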
The diagram below illustrates our high-level architecture:
Unstructured to centralized Data Lake Architecture: configurable or pattern-based reading of unstructured files, with the required cleaning and transformation, into a centralized data lake.
Our unstructured data lake solutions enable you to leverage data from sources like CSV files, JSON files, web APIs, and text documents for advanced analytics and machine learning. Once your data is available in a centralized data lake, you can unlock its full potential and gain a competitive advantage.
The implementation described above is cloud-agnostic and can be deployed on AWS, Azure, or GCP. For instance, using an Airflow DAG, the workflow would involve:
  • Reading from sources like SFTP/FTP, S3, Azure Blob, GCS, Web APIs, or Text Documents
  • Writing into our own S3 bucket (or equivalent) as-is
  • Reading from the raw bucket, applying required transformations, and writing to the data lake in Parquet format with open table standards
  • Enabling the Data Catalog (e.g., AWS Glue) for discovery and governance
Each of the above steps can be implemented as an Airflow task, as an AWS Step Functions state combined with Lambda functions, or as an Azure Data Factory or GCP Dataflow pipeline, yielding an end-to-end pipeline for unstructured data ingestion and transformation; a minimal Airflow sketch follows.
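As a minimal sketch of the Airflow variant, assuming a recent Airflow 2.x, the DAG below wires the four steps together; the DAG name and task bodies are placeholders that would call the actual ingestion and Spark jobs.

# Hypothetical Airflow DAG wiring the four steps above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def pull_from_source(): ...   # read from SFTP/FTP, S3, Azure Blob, GCS, or APIs
def dump_raw(): ...           # write the files as-is to the raw bucket
def transform_to_lake(): ...  # Spark job: raw -> Parquet in open table format
def update_catalog(): ...     # e.g. trigger an AWS Glue crawler

with DAG(
    dag_id="unstructured_ingestion",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="pull_from_source", python_callable=pull_from_source)
    t2 = PythonOperator(task_id="dump_raw", python_callable=dump_raw)
    t3 = PythonOperator(task_id="transform_to_lake", python_callable=transform_to_lake)
    t4 = PythonOperator(task_id="update_catalog", python_callable=update_catalog)

    t1 >> t2 >> t3 >> t4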
Beyond unstructured data sources, we also offer a range of other ingestion services and ETL pipeline products.