Parse large excel files in Ruby on Rails

In this post we will see how to parse a large Excel file with more than a million rows and insert the data in batches without crashing the server.

Though we are doing this in Rails, most of the script is in Python. We tried most of the Ruby gems, and nothing worked for us at this scale.

Parsing Excel

Install the libraries below. If you are using a Ruby Docker image, install python3 and python3-pip first.

pip3 install openpyxl==3.1.2
pip3 install pep517==0.13.1
pip3 install polars==0.20.22
pip3 install xlsx2csv==0.8.2
pip3 install sqlalchemy==2.0.29
pip3 install pandas==2.2.2
pip3 install pyarrow==16.0.0

Here is the Python code to read the Excel file and save it as an SQLite file. Why SQLite? I will explain later.

import sys
import json
import polars as pl


def read_rows_sheet(filename, db_file, dtypes=None):
    if dtypes is None:
        df = pl.read_excel(source=filename, read_options={"has_header": True, "ignore_errors": True})
    else:
        # dtypes arrives as a JSON string mapping column names to polars dtype
        # expressions (e.g. "pl.Date"); eval turns them into real polars types
        dtypes = json.loads(dtypes)
        dtypes = {col: eval(dtype) for col, dtype in dtypes.items()}
        df = pl.read_excel(source=filename, read_options={"has_header": True, "ignore_errors": True, "dtypes": dtypes})

    # Drop columns whose header cell is empty
    non_empty_headers = [col for col in df.columns if col]
    df = df.select(non_empty_headers)

    # Dump the DataFrame into a single "test" table inside an SQLite file
    dbname = 'sqlite:///' + db_file
    df.write_database(table_name="test", connection=dbname, if_table_exists="replace")
    del df
    return db_file


filename = sys.argv[1]
db_file = sys.argv[2]
dtypes = sys.argv[3]
print(read_rows_sheet(filename, db_file, dtypes))

Save this script as {Rails.root}/python/read_xlsx.py.

Here is the Ruby code to run the above Python script:

temp_file = Tempfile.new([rand_str(60), '.db'])
data_file.active_storage_attached_file.open do |file|
  # Pass the columns (and their polars dtypes) that you want coerced explicitly.
  # We do this because sometimes dates are stored as strings in Excel,
  # and the dtype mapping lets polars parse them into the exact data type.
  # You can also skip this here and in the Python code and it will still work.
  column_hash_json = {}.to_json
  `python3 #{Rails.root}/python/read_xlsx.py #{file.path} #{temp_file.path} '#{column_hash_json}'`
end
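
For reference, column_hash_json is just a JSON object mapping column names to the polars dtype expressions that the Python script evals. A minimal sketch of a populated hash (the column names here are made up for illustration):

column_hash = {
  'invoice_date' => 'pl.Date',     # dates stored as text in the sheet
  'amount'       => 'pl.Float64',
  'order_id'     => 'pl.Utf8'      # keep long IDs as strings
}
column_hash_json = column_hash.to_json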

Inserting Data

Now that we have parsed the Excel file, we could also have saved it as JSON, which is easier to parse and insert. But dumping millions of rows to Postgres (our production DB) by reading a large JSON file is not memory efficient. So the hack is to dump it to an SQLite file, then use background jobs to read it with LIMIT & OFFSET and insert the data in batches.

(0..total_row_count_in_sqlite).each_slice(InsertExcelDataJob::LIMIT) do |slice|
  InsertExcelDataJob.perform_async(temp_file.path, slice.min)
end
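
One way to get the total_row_count_in_sqlite used above (a sketch, assuming the same test table the Python script writes to) is to ask the SQLite file directly:

require 'sqlite3'

# Read the row count straight from the temp SQLite file
db = SQLite3::Database.new(temp_file.path)
total_row_count_in_sqlite = db.get_first_value('SELECT COUNT(*) FROM test')
db.close

And here is the worker that reads each slice from SQLite and bulk-inserts it into Postgres:
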
require 'sqlite3'
require 'activerecord-import'

class InsertExcelDataJob
  include Sidekiq::Job

  LIMIT = 30_000

  def perform(file, offset = 0)
    db = SQLite3::Database.new(file)
    db.results_as_hash = true
    # Each row comes back as a hash keyed by column name
    rows = db.execute("SELECT * FROM test ORDER BY 1 LIMIT #{LIMIT} OFFSET #{offset}")
    # Bulk insert with activerecord-import, skipping validations for speed
    PostgresTable.import(rows, validate: false)
  ensure
    db&.close
  end
end

The above code is just an overview of how it can be done. We use Sidekiq batches, and after all the jobs are finished we do some cleanup work, such as deleting the temp SQLite DB file and a few other things.
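
A rough sketch of that batch flow, assuming the Sidekiq::Batch API (from Sidekiq Pro or the sidekiq-batch gem); the CleanupCallback class here is hypothetical:

batch = Sidekiq::Batch.new
batch.on(:success, CleanupCallback, 'db_path' => temp_file.path)
batch.jobs do
  (0..total_row_count_in_sqlite).each_slice(InsertExcelDataJob::LIMIT) do |slice|
    InsertExcelDataJob.perform_async(temp_file.path, slice.min)
  end
end

class CleanupCallback
  def on_success(_status, options)
    # Delete the temporary SQLite file once every job in the batch has finished
    File.delete(options['db_path']) if File.exist?(options['db_path'])
  end
end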

Most of the above code was implemented by my teammate Mohan Prasad.

Thanks to him, and thank you for reading!