The classic ETL process involves extracting data from the source, storing it in raw format in bucket storage, and then sequentially processing the files before loading them into an analytical database. The transformation of the raw data and loading it into ClickHouse previously took 45 to 60 minutes for 60 million records. In this post, we will explore how we utilised DuckDB to expedite the process, which now takes only 5 to 7 minutes for 60 million records.
The 5 to 7 minutes covers only the "TL" part of the process (Transform with DuckDB and Load into ClickHouse). Extraction depends on how quickly the source API responds, and is not included in this figure.

Above image shows how this works. We take SAP as the data source; it exposes multiple API endpoints such as Accounts, Customers, Vendors, Companies, Profit Centres (Cost Centres), Journal Entries, Settlements, and so on. The master data (Accounts, Customers, Vendors) is stored in Postgres because we need it often for tagging, enriching, and applying custom permissions, such as which user can see which line items. Journal entries are huge in volume, so we save them in bucket storage (S3).
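To make the enrichment later in the post concrete, here is a minimal sketch of how the master data could be laid out in Postgres. The table and column names (parties, party_id, party_name, party_type, tags) are illustrative assumptions, not our actual schema.

-- Hypothetical master data table in Postgres used for enrichment and permissions.
-- Names and columns are illustrative only.
CREATE TABLE parties (
    party_id   TEXT PRIMARY KEY,  -- identifier as it appears in SAP journal entries
    party_name TEXT NOT NULL,
    party_type TEXT NOT NULL,     -- e.g. 'customer', 'vendor'
    tags       TEXT[]             -- custom tags used for enrichment and filtering
);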
Extract
Download data from the source and save it to S3 in Parquet format, partitioned by date (you can also use Hive-style partitioning); a partitioned write is sketched after the directory layout below.
tenant_name/source_documents
├── journal_entries
│ ├── 2025-03-21-rand_string.parquet
│ ├── 2025-03-22-rand_string.parquet
│ ├── 2025-03-23-rand_string.parquet
│ ├── 2025-03-24-rand_string.parquet
│ └── ...
├── settlements
│ ├── 2025-03-21-rand_string.parquet
│ ├── 2025-03-22-rand_string.parquet
│ ├── 2025-03-23-rand_string.parquet
│ ├── 2025-03-24-rand_string.parquet
│ └── ...
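If the raw records are already queryable by DuckDB, a hive-partitioned write to S3 can be done in a single statement. This is a minimal sketch, assuming a staging table named raw_journal_entries with an entry_date column and S3 credentials already configured; in our pipeline the extraction itself is done by API workers, not by DuckDB.

-- Minimal sketch: write hive-partitioned Parquet to S3 with DuckDB.
-- 'raw_journal_entries' and 'entry_date' are assumed names for illustration.
INSTALL httpfs;
LOAD httpfs;

COPY (SELECT * FROM raw_journal_entries)
TO 's3://tenant_name/source_documents/journal_entries'
(FORMAT PARQUET, PARTITION_BY (entry_date));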
Transform
The transformation step is where the raw data is enriched into a format that's useful for analytics.
The raw data consists of an organization's journal entries; however, it doesn't contain all the necessary details. For example, a record may include only a party_id without additional information such as the party name, party type, or relevant tags; this supplementary information is retrieved from Postgres.
DuckDB allows us to efficiently read data from both S3 and Postgres in the same query. Below is an example illustrating how the code could look.
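This is a minimal sketch of the enrichment query, assuming the hypothetical parties table from earlier plus placeholder paths and credentials; the real query joins more columns and tables.

-- Sketch: join raw journal entries on S3 with master data in Postgres.
-- Table names, paths, and credentials are placeholders.
INSTALL httpfs;
LOAD httpfs;
INSTALL postgres;
LOAD postgres;

CREATE SECRET s3_secret (TYPE S3, KEY_ID 'xxx', SECRET 'xxx', REGION 'eu-west-3');
ATTACH 'dbname=app_db user=app host=127.0.0.1' AS pg (TYPE POSTGRES, READ_ONLY);

SELECT je.*, p.party_name, p.party_type, p.tags
FROM read_parquet('s3://source_data/2025-3-*.parquet') AS je
LEFT JOIN pg.public.parties AS p ON p.party_id = je.party_id;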
For small organizations we process a full year of data at a time, for medium-sized organizations half a year, and for large organizations one month, which improves performance and reduces server load, including on ClickHouse.
--- small
select * from read_parquet('s3://source_data/2025-*.parquet');

--- medium
select *
from read_parquet([
    's3://source_data/2025-2-*.parquet',
    's3://source_data/2025-3-*.parquet',
    's3://source_data/2025-4-*.parquet'
]);

--- large
select * from read_parquet('s3://source_data/2025-3-*.parquet');
DuckDB can also write the result directly back to S3, as shown below.
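A minimal sketch, assuming the transformed result lives in a table or view called enriched_entries; the destination key mirrors the one used in the Load step below.

-- Sketch: persist the transformed result as Parquet on S3.
-- 'enriched_entries' is a placeholder for the transformed result.
COPY (SELECT * FROM enriched_entries)
TO 's3://tenant/parsed_docs/Y1-rand_string.parquet'
(FORMAT PARQUET);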
Load
Once DuckDB writes the parsed data to the S3 bucket, ClickHouse can read it from S3 and insert it into the desired table.
INSERT INTO tenant_name.tenant_name_20250128192345_ledger_entries
SELECT *
FROM s3(s3_creds, url = 'https://s3.eu-west-3.amazonaws.com/tenant/parsed_docs/Y1-rand_string.parquet');
We could also do the whole DuckDB transform step directly in ClickHouse, but it is 3x slower than DuckDB, and we didn't want to add load to ClickHouse: it is managed by us and doesn't autoscale RAM and CPU, whereas our background workers autoscale based on tenant configuration, so we went with DuckDB. chDB is just as capable as DuckDB, but some features are still in development, so we may try it in the future; another option worth trying is Limbo.
Our ClickHouse server has 4 cores (8 vCPU) and 72 GB RAM; the DuckDB processing server is an N2D machine with 2 cores (4 vCPU) and 32 GB RAM.
All of this is done in Ruby on Rails; thanks to the Rails team, and to the authors of the DuckDB and ClickHouse gems, whose work has collectively taken our application to the next level.
Limitations
In DuckDB, we were not able to use window functions once the data exceeds 20 million records; such queries require more CPU, RAM, and storage than our processing server provides.
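For context, the kind of query that hit this limit looks roughly like the following hypothetical running-balance window over journal entries (column names are illustrative).

-- Hypothetical window function that becomes expensive beyond ~20 million rows:
-- a running balance per account, ordered by posting date.
SELECT
    account_id,
    posting_date,
    amount,
    sum(amount) OVER (
        PARTITION BY account_id
        ORDER BY posting_date
    ) AS running_balance
FROM read_parquet('s3://source_data/2025-*.parquet');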