
Graceful Data Ingestion with SQLAlchemy and Pandas

Published Nov 19, 2018 · Last updated Apr 02, 2019

When the data is not large enough to warrant a distributed computing framework (like Apache Spark), processing it on a single machine with pandas is efficient. But inserting the contents of a dataframe into a database in an elegant way is a real challenge. As we know, Python has a good database toolkit, SQLAlchemy, with solid ORM integration, and a good data processing library, pandas. In this article we explore several implementations and discuss their pros and cons.

Problem

Given a pandas dataframe with thousands of rows and complex data types, how do we load the data into the target database quickly while keeping the code easy to maintain?

Note: here we use PostgreSQL as the target database.


Simple Idea - Use Pandas df.to_sql function

With this function, you simply insert your data through the pandas API df.to_sql, and you're done!
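
A minimal sketch, with a placeholder connection string, dataframe, and table name:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table for illustration.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")
df = pd.DataFrame({"name": ["alice", "bob"]})

# One call appends the dataframe rows to the target table.
df.to_sql("users", engine, if_exists="append", index=False)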

Advantages

Easiest way to implement.

Drawbacks

  • Very slow!
  • If you need to truncate the table first, to_sql offers no clean way to do it: if_exists="replace" drops and recreates the table rather than truncating it, so the original schema is lost (a workaround is sketched below).
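
A minimal sketch of the truncate-then-append workaround, assuming the users table from above already exists:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Truncate explicitly so the existing table definition is preserved,
# then append; if_exists="replace" would drop and recreate the table.
with engine.begin() as conn:
    conn.execute(text("TRUNCATE TABLE users"))
df.to_sql("users", engine, if_exists="append", index=False)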

Another Simple Idea - Inserting Data With CSV Files

Another naive idea is to output the dataframe as a CSV file and use the copy command (or its Python equivalent) to import the data into the database. Here is a code example:

import logging

def bulkload_csv_data_to_database(engine, tablename, columns, data, sep=","):
    logging.info("Start ingesting data into postgres ...")
    logging.info("Table name: {table}".format(table=tablename))
    logging.info("CSV schema: {schema}".format(schema=columns))
    conn = engine.connect().connection
    cursor = conn.cursor()
    # copy_from expects a file-like object, not a path or a dataframe.
    cursor.copy_from(data, tablename, columns=columns, sep=sep, null="null")
    conn.commit()
    conn.close()
    logging.info("Finish ingesting")

# Write nulls as the literal "null" so they match copy_from's null marker.
df.to_csv(csv_path, index=False, header=False, na_rep="null")
with open(csv_path) as f:
    bulkload_csv_data_to_database(engine, tablename, columns, f)
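
To skip the round trip through the filesystem entirely, a common variant of the same idea streams the CSV through an in-memory buffer. A minimal sketch, reusing the function above:

import io

# Render the dataframe as CSV into an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False, na_rep="null")
buffer.seek(0)  # rewind so copy_from reads from the beginning

bulkload_csv_data_to_database(engine, tablename, columns, buffer)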

Advantages

The loading speed is fast!

Drawbacks

You need to maintain and massage the data into a CSV format that the target database library can parse. This gets painful when your schema is complicated (think of data with JSON fields and arrays ...). You have to align the options of df.to_csv and cursor.copy_from very carefully, as the sketch below shows.
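
A minimal sketch of the alignment required (the tab separator and the "null" marker are illustrative choices, not requirements):

# The writer and the reader must agree on the separator and the null marker.
df.to_csv(csv_path, index=False, header=False, sep="\t", na_rep="null")

with open(csv_path) as f:
    cursor.copy_from(f, tablename, columns=columns, sep="\t", null="null")

# copy_from uses PostgreSQL's text COPY format, which does not strip CSV
# quoting, so quoted JSON fields or arrays need extra escaping by hand.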


Third Idea - Insert Data by SQLAlchemy ORM

To get rid of the huge effort of maintaining the CSV format, another solution is to use the same method as in web development: create a table object from each pandas row and add the objects to a session one by one. Here is a simple example:

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()
# Build one ORM object per dataframe row and stage it in the session.
for _, row in df.iterrows():
    user = User(name=row["name"])
    session.add(user)
session.commit()
session.close()

Advantages

  • Easy to maintain
  • Enjoy ORM's advantages

Drawbacks

  • Slow, because it executes the INSERT statements one by one.

Fourth Idea - Insert Data with Pandas and SQLAlchemy ORM

Exploring the SQLAlchemy documentation, we find that the ORM component offers bulk operations. In particular, bulk_insert_mappings accepts a list of plain dictionaries, which is exactly what df.to_dict(orient="records") produces. With this, we can easily develop a bulk insert with maintainable code from a pandas dataframe.

Here is the code example:

Session = sessionmaker(bind=dest_db_con)
session = Session()
# Convert the dataframe to a list of dicts and bulk-insert them in one call.
session.bulk_insert_mappings(MentorInformation, df.to_dict(orient="records"))
session.commit()
session.close()
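
For very large dataframes you may want to bound the size of each statement and transaction. A chunked variant (a sketch; the chunk size is an arbitrary assumption):

records = df.to_dict(orient="records")
chunk_size = 10000  # arbitrary; tune to your data and database

for start in range(0, len(records), chunk_size):
    # Insert and commit one slice at a time to keep transactions bounded.
    session.bulk_insert_mappings(MentorInformation, records[start:start + chunk_size])
    session.commit()
session.close()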

Advantages

  • Fast
  • Simple
  • Easy to maintain
  • Enjoy ORM's advantages

Drawbacks

None from my point of view.


Conclusion & Discussion

Because pandas can only process data on a single machine, it is also worth thinking about how to solve the same problem in a distributed environment. Frameworks like Apache Spark address that extended problem.

But for data that fits on a single machine, pandas + SQLAlchemy is more than powerful enough to solve the data ingestion problem!
