In this post we will see how to parse a large Excel file with more than a million rows and insert the data in batches without crashing the server.
Parsing Excel
Install the libraries below. If you are using a Ruby Docker image, first install python3 and python3-pip:
pip3 install openpyxl==3.1.2
pip3 install pep517==0.13.1
pip3 install polars==0.20.22
pip3 install xlsx2csv==0.8.2
pip3 install sqlalchemy==2.0.29
pip3 install pandas==2.2.2
pip3 install pyarrow==16.0.0
Here is the Python code to read the Excel file and save it as a SQLite file (why SQLite? I will explain later).
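What follows is a minimal sketch rather than our exact production script: it assumes polars' xlsx2csv engine and a pandas to_sql dump into a table named test (the table the insert job queries later), and the "date" cast driven by the headers JSON is illustrative. The command-line arguments match the Ruby call below.

import json
import sys

import polars as pl
from sqlalchemy import create_engine

# Usage: python3 read_xlsx.py <input.xlsx> <output.db> '<headers_json>'
xlsx_path, db_path, headers_json = sys.argv[1], sys.argv[2], sys.argv[3]

# Optional column -> type mapping passed in from Ruby (may be an empty hash)
column_types = json.loads(headers_json)

# polars delegates to xlsx2csv, which streams the sheet instead of
# loading the whole workbook into memory the way openpyxl does
df = pl.read_excel(xlsx_path, engine="xlsx2csv")

# Excel sometimes stores dates as plain strings; cast the columns the
# caller declared so they land in SQLite with the right type
for column, type_name in column_types.items():
    if type_name == "date":
        df = df.with_columns(pl.col(column).str.to_date())

# Dump everything into a single "test" table inside the SQLite file;
# pandas + SQLAlchemy write it in chunks to keep memory usage flat
engine = create_engine(f"sqlite:///{db_path}")
df.to_pandas().to_sql("test", engine, if_exists="replace", index=False, chunksize=10_000)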
Here is the Ruby code to run the above Python script:
temp_file = Tempfile.new([rand_str(60), '.db']) # rand_str is our helper for a random file name
data_file.active_storage_attached_file.open do |file|
  # Pass the headers you want to read from the Excel file.
  # We pass this because sometimes dates are stored as strings in Excel,
  # and the mapping lets the Python script cast them to the exact data type.
  # You can skip this here and in the Python code and it will still work.
  headers = {}.to_json
  `python3 #{Rails.root}/python/read_xlsx.py #{file.path} #{temp_file.path} '#{headers}'`
end
Inserting Data
Now that we have parsed the Excel file, we could also save it in JSON format, which is easier to parse and insert. But dumping millions of rows into Postgres (our production DB) by reading one large JSON file is not memory efficient. So the trick is to dump the data into a SQLite file, then have background jobs read it with LIMIT and OFFSET and insert the data in batches.
# Enqueue one job per 30k-row window; batch.min is that window's offset
(0...total_row_count_in_sqlite).each_slice(InsertExcelDataJob::LIMIT) do |batch|
  InsertExcelDataJob.perform_async(temp_file.path, batch.min)
end
require 'sqlite3'
require 'activerecord-import'

class InsertExcelDataJob
  include Sidekiq::Job

  LIMIT = 30_000

  def perform(file, offset = 0)
    db = SQLite3::Database.new(file)
    db.results_as_hash = true
    # Order by the first column so every job pages through a stable row order
    rows = db.execute("SELECT * FROM test ORDER BY 1 LIMIT #{LIMIT} OFFSET #{offset}")
    # activerecord-import turns the hashes into a single multi-row INSERT
    PostgresTable.import(rows, validate: false)
  ensure
    db&.close
  end
end
The code above is just an overview of how it can be done. In our implementation we use sidekiq-batch, and once all the jobs have finished we do some cleanup work, such as deleting the temp SQLite file.
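For completeness, here is a minimal sketch of that cleanup step, assuming the sidekiq-batch gem (Sidekiq Pro batches expose the same API); CleanupCallback is a hypothetical class name:

batch = Sidekiq::Batch.new
batch.on(:success, CleanupCallback, 'db_path' => temp_file.path)
batch.jobs do
  (0...total_row_count_in_sqlite).each_slice(InsertExcelDataJob::LIMIT) do |rows|
    InsertExcelDataJob.perform_async(temp_file.path, rows.min)
  end
end

class CleanupCallback
  def on_success(_status, options)
    # Every insert job has finished, so the temp SQLite file can go
    File.delete(options['db_path']) if File.exist?(options['db_path'])
  end
end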
Most of the above code was implemented by my teammate Mohan Prasad. Thanks to him, and thank you for reading!