
Graceful Data Ingestion with SQLAlchemy and Pandas

Published Nov 19, 2018 · Last updated Apr 02, 2019

When the data is not large enough to warrant a distributed computing framework (like Apache Spark), processing it on a single machine with pandas is efficient. But inserting the contents of a dataframe into a database in an elegant way is a real challenge. As we know, Python has a good database toolkit, SQLAlchemy, with solid ORM integration, and a good data processing library, pandas. In this article we explore several implementations and discuss their pros and cons.

Problem

Given a pandas dataframe with thousands of rows and complex data types, how do we load the data into the target database quickly while keeping the code easy to maintain?

Note: here we use PostgreSQL as the target database.


Simple Idea - Use Pandas df.to_sql function

With this function, you simply insert your data through the pandas API df.to_sql, and you're done!
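
A minimal sketch, with a placeholder connection string, dataframe, and table name:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table for illustration.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")
df = pd.DataFrame({"name": ["alice", "bob"]})

# One call appends the dataframe rows to the target table.
df.to_sql("users", engine, if_exists="append", index=False)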

Advantages

Easiest way to implement.

Drawbacks

  • Very slow!
  • If you need to truncate the table first, to_sql offers no clean way to do it: if_exists="replace" drops and recreates the table rather than truncating it, so the original schema is lost (a workaround is sketched below).
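
A minimal sketch of the truncate-then-append workaround, assuming the users table from above already exists:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Truncate explicitly so the existing table definition is preserved,
# then append; if_exists="replace" would drop and recreate the table.
with engine.begin() as conn:
    conn.execute(text("TRUNCATE TABLE users"))
df.to_sql("users", engine, if_exists="append", index=False)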

Another Simple Idea - Inserting Data With CSV Files

Another naive idea is to output the dataframe as a CSV file and use the copy command (or its Python equivalent) to import the data into the database. Here is a code example:

import logging

def bulkload_csv_data_to_database(engine, tablename, columns, data, sep=","):
    logging.info("Start ingesting data into postgres ...")
    logging.info("Table name: {table}".format(table=tablename))
    logging.info("CSV schema: {schema}".format(schema=columns))
    conn = engine.connect().connection
    cursor = conn.cursor()
    # copy_from expects a file-like object, not a path or a dataframe.
    cursor.copy_from(data, tablename, columns=columns, sep=sep, null="null")
    conn.commit()
    conn.close()
    logging.info("Finish ingesting")

# Write nulls as the literal "null" so they match copy_from's null marker.
df.to_csv(csv_path, index=False, header=False, na_rep="null")
with open(csv_path) as f:
    bulkload_csv_data_to_database(engine, tablename, columns, f)
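
To skip the round trip through the filesystem entirely, a common variant of the same idea streams the CSV through an in-memory buffer. A minimal sketch, reusing the function above:

import io

# Render the dataframe as CSV into an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False, na_rep="null")
buffer.seek(0)  # rewind so copy_from reads from the beginning

bulkload_csv_data_to_database(engine, tablename, columns, buffer)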

Advantages

The loading speed is fast!

Drawbacks

You need to maintain and massage the data into a CSV format that the target database library can parse. This gets painful when your schema is complicated (think of data with JSON fields and arrays ...). You have to align the options of df.to_csv and cursor.copy_from very carefully, as the sketch below shows.
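
A minimal sketch of the alignment required (the tab separator and the "null" marker are illustrative choices, not requirements):

# The writer and the reader must agree on the separator and the null marker.
df.to_csv(csv_path, index=False, header=False, sep="\t", na_rep="null")

with open(csv_path) as f:
    cursor.copy_from(f, tablename, columns=columns, sep="\t", null="null")

# copy_from uses PostgreSQL's text COPY format, which does not strip CSV
# quoting, so quoted JSON fields or arrays need extra escaping by hand.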


Third Idea - Insert Data by SQLAlchemy ORM

To get rid of the huge effort of maintaining the CSV format, another solution is to use the same method as in web development: create a table object from each pandas row and add the objects to a session one by one. Here is a simple example:

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()
# Build one ORM object per dataframe row and stage it in the session.
for _, row in df.iterrows():
    user = User(name=row["name"])
    session.add(user)
session.commit()
session.close()

Advantages

  • Easy to maintain
  • Enjoy ORM's advantages

Drawbacks

  • Slow, because it executes the INSERT statements one by one.

Fourth Idea - Insert Data with Pandas and SQLAlchemy ORM

Exploring the SQLAlchemy documentation, we find that the ORM component offers bulk operations. In particular, bulk_insert_mappings accepts a list of plain dictionaries, which is exactly what df.to_dict(orient="records") produces. With this, we can easily develop a bulk insert with maintainable code from a pandas dataframe.

Here is the code example:

Session = sessionmaker(bind=dest_db_con)
session = Session()
# Convert the dataframe to a list of dicts and bulk-insert them in one call.
session.bulk_insert_mappings(MentorInformation, df.to_dict(orient="records"))
session.commit()
session.close()
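
For very large dataframes you may want to bound the size of each statement and transaction. A chunked variant (a sketch; the chunk size is an arbitrary assumption):

records = df.to_dict(orient="records")
chunk_size = 10000  # arbitrary; tune to your data and database

for start in range(0, len(records), chunk_size):
    # Insert and commit one slice at a time to keep transactions bounded.
    session.bulk_insert_mappings(MentorInformation, records[start:start + chunk_size])
    session.commit()
session.close()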

Advantages

  • Fast
  • Simple
  • Easy to maintain
  • Enjoy ORM's advantages

Drawbacks

None from my point of view.


Conclusion & Discussion

Because pandas can only process data on a single machine, it is also worth thinking about how to solve the same problem in a distributed environment. Frameworks like Apache Spark address that extended problem.

But for data that fits on a single machine, pandas + SQLAlchemy is more than powerful enough to solve the data ingestion problem!
