
Optimized I/O operations in Python

Published Dec 04, 2018

Input/Output hacks to speed up analytics using Python’s data science stack

While working on data-intensive applications, I am often confronted with input/output (I/O) challenges, which are the bottleneck for every performance-critical application. As data volumes grow, data that no longer fits in RAM has to be kept on disk and moved between disk and RAM as needed. I/O operations are therefore essential when processing financial data or scientific data in general.

Through this post, I am trying to shed some light on a few libraries and their tricks of the trade. Python has built-in capabilities for storing objects on disk and reading them back into RAM. Python is also robust when it comes to processing text files and SQL databases. The pandas library, in addition, provides a multitude of classes and methods to read and write files in a wide range of formats.

We are going to study the following data storage and retrieval methods:

  1. Serialized storage using Pickle module
  2. I/O operations on textual data
  3. SQL databases
  4. I/O with PyTables

The two major factors to consider when optimizing I/O operations in Python are efficiency (performance) and flexibility. Let’s dive right into it:

Serialized storage using Pickle module

There are numerous modules in the Python standard library that can be used readily in large-scale deployment settings; pickle is one of them.

Using the pickle module to read and write files

You need to store data on disk in order to share it, document it, or use it later. The pickle module serializes Python objects so that they can be written to and read from disk quickly.
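The snippet that produces the timing below is not reproduced here; a minimal sketch along these lines, where the file path and the size of the sample data are assumptions, would be:

import pickle
import random
import numpy as np

path = './data/'  # hypothetical output directory
a1 = [random.gauss(0, 1) for _ in range(1000000)]  # random floats, roughly 9 MB once pickled

pkl_file = open(path + 'data.pkl', 'wb')
%time pickle.dump(a1, pkl_file)  # serialize the list and write it to disk
pkl_file.close()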

# On running the above code snippet, you'll see:

CPU times: user 40.9 ms, sys: 14 ms, total: 54.9 ms
Wall time: 54.5 ms

The random floats build up a roughly 9 MB file, which is serialized to a byte stream and written to disk in about 55 ms. You use the dump and load functions of the pickle module to write and read the file, respectively.
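The read step is not reproduced in the original snippet either; a minimal sketch, reusing the file written above, might look like this:

pkl_file = open(path + 'data.pkl', 'rb')
a2 = pickle.load(pkl_file)  # deserialize the byte stream back into a Python list
pkl_file.close()

To check that the deserialized data matches the original, you can use NumPy’s allclose method. Here is how you can do it: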

np.allclose(np.array(a1), np.array(a2))

# here a2 is the deserialized object after reading the same file
# using the load function.

Thus, the pickle module stores Python lists, dicts, and so on by converting them to a byte stream on disk. The crux is that this byte stream contains all the information necessary to reconstruct the object in another Python script.

I/O operations on textual data

Python has long been a go-to language for dealing with text files because of its robustness and ease of processing textual data. There are several options for operating on string objects and on text files in general.
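For instance, a couple of the built-in string operations that come in handy when assembling or parsing lines of a text file (the sample line here is made up):

line = '2018-12-04,0.152,1.731,-0.891'
fields = line.split(',')          # break a CSV-style line into its fields
line2 = ','.join(fields) + '\n'   # join fields back into a writable line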

To write a CSV (comma-separated values) file, we can make use of the write and readlines methods of a plain file object:

# open the file for writing and write a header line
# (path, the column names, and the time/data arrays are assumed to exist)
csv_file = open(path + 'data.csv', 'w')
header = 'time,Num1,Num2,Num3,Num4,Num5\n'
csv_file.write(header)

# time is a time array and data is the dummy NumPy array with five columns
for t, (a, b, c, d, e) in zip(time, data):
    s = '%s,%f,%f,%f,%f,%f\n' % (t, a, b, c, d, e)
    csv_file.write(s)

csv_file.close()

# to read the file back, reopen it and use the readlines function
csv_file = open(path + 'data.csv', 'r')
content = csv_file.readlines()
csv_file.close()

Though Python provides methods to process text files, the pandas library can read and write a variety of data formats and is much easier and more convenient to work with, be it CSV (comma-separated values), SQL (Structured Query Language), XLS/XLSX (Microsoft Excel files), JSON (JavaScript Object Notation), or HTML (Hypertext Markup Language).
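For illustration, the corresponding pandas readers look like this; the file names are placeholders, and the Excel and HTML readers additionally require an engine/parser such as openpyxl or lxml to be installed:

import pandas as pd

df_xlsx = pd.read_excel('data.xlsx')   # Excel workbooks
df_json = pd.read_json('data.json')    # JSON documents
df_list = pd.read_html('page.html')    # HTML tables; returns a list of DataFrames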

pandas makes the entire process of reading and writing CSV files more convenient, concise, and faster.
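The to_csv call below assumes that the dummy data has been put into a pandas DataFrame; a minimal sketch of that setup, with the column names and file name as assumptions, would be:

import numpy as np
import pandas as pd

filename = path + 'data'  # base file name; '.csv' is appended below
data = pd.DataFrame(np.random.standard_normal((1000000, 5)),
                    columns=['Num1', 'Num2', 'Num3', 'Num4', 'Num5'])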

%time data.to_csv(filename + '.csv')

# CPU times: user 5.59 s, sys: 137 ms, total: 5.69 s

# And to read the files back from the disk

pd.read_csv(<path to the CSV file>)

SQL databases

Python ships with built-in support for the SQLite3 database via the sqlite3 module, and adapters exist for working with almost any kind of database (SQL or NoSQL) from Python.

SQL queries are written as string objects, where the exact syntax and data types depend on the database in use. Here is an illustration of creating a Todo table in an SQLite database through Python:

import sqlite3 as sq

# query string to create the table

query = 'CREATE TABLE TODO_NUMBER (Num1 real, Num2 real, Num3 real)'
con = sq.connect(path + 'todo.db')
con.execute(query)
con.commit()

Let’s try to insert some data into the table we just created:

data = np.random.standard_normal((1000000, 3))

%%time
con.executemany('INSERT INTO TODO_NUMBER VALUES (?, ?, ?)', data)
con.commit()

# CPU times: user 10.3 s, sys: 316 ms, total: 10.6 s
# Wall time: 11 s

Writing 1 million rows into the database is a somewhat heavy and time-consuming task. Reading the data back is much faster:

con.execute('SELECT * FROM TODO_NUMBER').fetchall()

If you are dealing with lots of numbers and arrays in your database, you can read the query results directly into a NumPy ndarray:

np_query = 'SELECT * FROM TODO_NUMBER WHERE Num1 > 0 AND Num2 < 0'
res = np.array(con.execute(np_query).fetchall()).round(3)
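For example, a quick plot of the filtered result; the matplotlib import and the plot style are assumptions here:

import matplotlib.pyplot as plt

plt.plot(res[:, 0], res[:, 1], 'ro')  # Num1 against Num2 for the filtered rows
plt.show()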

This is a handy trick for reading and plotting query results without any hassle. To make reading even more efficient, we can pull whole tables and query results into pandas. Analytics and processing become much faster once the entire table is loaded into memory. This is achieved using the sublibrary pandas.io.sql:

import pandas.io.sql as pds
data_df = pds.read_sql('SELECT * FROM TODO_NUMBER', con)

The table is now loaded in memory, which allows for much faster processing. The SQL query that takes a few seconds with SQLite3 finishes in milliseconds with pandas in memory:

%time data_df[(data_df['Num1'] > 0) & (data_df['Num2'] < 0)].head()

# CPU times: user 50 ms, sys: 0 ns, total: 50 ms

# Wall time: 49.9 ms

We can express far more complex queries with pandas, and they will yield results much faster than SQL, but pandas cannot replace SQL. Wherever pandas is able to replicate a SQL query, though, its in-memory processing can speed up the analytics significantly.
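As a hypothetical illustration of a slightly more involved in-memory selection (the thresholds and column choices are arbitrary):

# rows where |Num1| > 0.5 and Num2 < -1, keeping only two columns
res = data_df[(data_df['Num1'].abs() > 0.5) & (data_df['Num2'] < -1)][['Num1', 'Num2']]
res.head()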

Note, however, that pandas was not made to replace SQL databases, nor can it do so at this point in time; for example, pandas has no support for relational data structures.

I/O with PyTables

PyTables is a Python binding for the HDF5 database/file standard. It is specifically designed to enhance the performance of I/O operations and make the best possible use of the available hardware, and it does a very good job of speeding up analytics. A PyTables database can accommodate many tables, and it supports compression, indexing, and nontrivial queries on tables.

PyTables uses a file-based database format. Let’s walk through how tables work, as sketched below.
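The original table-creation snippet is not reproduced here; a minimal sketch, in which the file name, the number of rows, and the column layout are assumptions (the names h5, rows, and filters are reused by the snippets below), could look like this:

import numpy as np
import tables as tb

h5 = tb.open_file(path + 'data.h5', 'w')  # path as assumed earlier

rows = 2000000                      # assumed number of rows
row_des = {
    'Num1': tb.Int32Col(pos=1),     # first integer column
    'Num2': tb.Int32Col(pos=2),     # second integer column
}
filters = tb.Filters(complevel=0)   # no compression for now

tab = h5.create_table('/', 'ints', row_des, title='Integers',
                      expectedrows=rows, filters=filters)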

This will create the table for us with the required fields and the specified dtypes. Now let’s populate the database: we have to generate some random values and write them row by row into the table like this:
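The population loop is also not shown in the original post; a sketch along the same lines, with an arbitrary integer range, could be:

ran_int = np.random.randint(0, 10000, size=(rows, 2))  # random integers to store

pointer = tab.row                # row pointer used to append rows one by one
for i in range(rows):
    pointer['Num1'] = ran_int[i, 0]
    pointer['Num2'] = ran_int[i, 1]
    pointer.append()             # finalize this row and move on to the next
tab.flush()                      # commit the appended rows to the file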

But again, there is a more optimized and Pythonic way to achieve the same result, making use of NumPy structured arrays:

dty = np.dtype([('Num1', '<i4'), ('Num2', '<i4')])

sarray = np.zeros(len(ran_int), dtype=dty)

# copy the random integers into the structured array
sarray['Num1'] = ran_int[:, 0]
sarray['Num2'] = ran_int[:, 1]

Now that the complete data set is ready in the structured array, creating and filling the table boils down to a single call:

%%time
h5.create_table('/', 'ints_from_array', sarray,
                title='Integers', expectedrows=rows,
                filters=filters)

This approach is faster, and we accomplish the same result with fewer lines of code. We can delete the now-duplicate table like this:

h5.remove_node('/', 'ints_from_array')

Both pandas and PyTables are capable of processing complex, SQL-like queries, indexing, and selections. They are both designed and optimized for speed when it comes to I/O operations.

A prime advantage of PyTables is its approach to compression: it uses compression not only to save disk space but also to improve the performance of I/O operations.
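As a brief sketch of how that looks, compression is switched on through the Filters object passed to create_table; the compression level and library below are arbitrary choices:

# enable Blosc compression at level 5 (both choices are arbitrary here)
filters = tb.Filters(complevel=5, complib='blosc')

tab_c = h5.create_table('/', 'ints_compressed', sarray,
                        title='Integers (compressed)',
                        expectedrows=rows, filters=filters)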

Conclusions

Many enterprise-grade application domains, in finance or in science in general, can succeed with array-based data modeling alone. In most of these cases, noticeable performance gains can be achieved by combining the I/O capabilities of NumPy and PyTables. HDF5-based stores have proven to be a valuable addition to all of the approaches discussed.

Relational databases have their advantages if your data structures are complex and exhibit many relations between single objects/tables. In such circumstances, their performance disadvantage compared with pure NumPy ndarray-based or pandas DataFrame-based approaches may well be justified.
