
Airflow Case Study: ProofPort

Published Jan 15, 2020

This is the first in a series of case studies on how I designed and developed a production-grade data ingestion pipeline (ETL pipeline) using Apache Airflow.

Airflow is an open source tool under the Apache Software Foundation, originally developed by Airbnb. Simply put, Airflow is a workflow orchestration platform. Even so, it is most commonly used for data processing (ETL). It has been very successful and has become the industry standard for batch data processing.

BACKGROUND

The Company
I worked for a startup called ProofPort for the last 6 months. Before it shut down, ProofPort was the fastest and simplest way to automatically exchange and maintain up-to-date Proof of Compliance data with providers, vendors, and B2B customers.

I started working at ProofPort on July 1st, 2019. They were trying to launch within 2 or 3 months. And, after several iterations of data pipelines that were not cutting it, they were (understandably) eager to get the data side of the app working.

The Requirements

•	Ingest a total of 8,000+ records from 8 different data sources weekly

•	Require minimal code changes when adding new data sources in the future

•	Compare the two most recent data runs, and capture the differences between them: adds, deletes, and changes

•	Detect company name changes (e.g. when a company name changes from "Federal Express" to "FedEx")

•	Ingest data from web pages, Excel files, and PDFs

•	Perform Optical Character Recognition (OCR) on the PDFs to harvest meaningful data

•	Facilitate manual human intervention for data audits to ensure premium data quality

•	Capture the data at each stage in the ETL process for auditing purposes

•	Capture and log high-level analytics on the data such as the number of adds, changes, and deletes

•	Load the data into ProofPort's backend system using GraphQL APIs 

•	Uphold the ProofPort measure of quality: "The data in our app must match the data as it appears in the public sources at the time of data capture."

TL;DR: a production-grade, end-to-end ETL pipeline in 3 months. From scratch.

UNDERSTAND

As tempting as it was to jump right into the work and start coding into all hours of the night, I remained calm and spent the first few days asking tons of questions and trying to grasp the problem. I went to the whiteboard, sketched out the system architecture and domain entities as best as I understood them, and asked for feedback from my new colleagues. In particular, I asked a lot of “Why?" questions:

Why was the system designed this way? Why was this technology chosen? Why not that alternative? Why had the data pipeline failed in the past?

Those first few days were critical. Ultimately, the goal of software is to deliver incredible business value. So it is imperative that you understand the business problem. Then, with discipline, you can use that understanding to evaluate the potential tools and frameworks.

Ultimately, the goal of software is to deliver incredible business value.

To give an example, with ProofPort we evaluated Stitch as a potential ETL platform. While this comparison could easily merit its own blog post, the short version is that Stitch is a UI-based ETL system focused on connecting various data sources to one another. It does not allow for writing and injecting custom transformations into the pipeline. In comparison, Airflow is a workflow orchestration platform, so it has a more generic scope by design. You can extend Airflow to orchestrate pretty much any type of workflow imaginable. Once I understood ProofPort's need to inject some fairly complex analytics (detect company name changes, etc.) and even facilitate a few points of manual intervention, it was clear that Stitch would not work for our use case.

After considering one or two alternatives, we decided to use Airflow. Its flexibility and out-of-the-box integration into the de facto data science ecosystem (Python!) is hard to beat.

A few more words on understanding: a lot of people will seek to understand only current needs, under the banner of Agile Software Development (another blog post I'll have to write). Future needs are, at best, unpredictable and, at worst, unknown. Yet this does not mean they are unimportant. When making architectural decisions, you need to ask, "What is likely going to change in the future?" This simple question will ground your decisions. It will prevent you from traveling far down a road only to later realize you have new requirements that completely ruin your architecture.

At ProofPort, the decision to go with Airflow became even more clear as I asked questions about future requirements. There were simply too many parts that would need custom business logic. This was a clear sign that we needed to ditch the out-of-the-box solution (Stitch) in favor of a more custom solution (Airflow).

ITERATE

By the end of the first week or so, I had a very rough sketch drawn up, which looked like this:

In short, the plan was to use Import.io for web scraping the HTML pages, simple Python packages for handling the Excel files, and then AWS Textract for OCR (we later ended up using Camelot instead). Then, we would pull all the data sources into a common data storage format (a Pandas DataFrame). Finally, we would write code to compare two DataFrames and generate the differences.
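To make that comparison step concrete, here is a minimal sketch of what diffing two data runs held in DataFrames can look like. This is illustrative only, not ProofPort's actual library code; the diff_runs name, the company_id key column, and the assumption that every source provides a stable key are all mine.

```python
import pandas as pd


def diff_runs(previous: pd.DataFrame, current: pd.DataFrame, key: str = "company_id") -> dict:
    """Compare two data runs and return adds, deletes, and changes keyed on `key`.

    Assumes both runs share the same columns and that `key` values are unique.
    """
    prev_idx = previous.set_index(key)
    curr_idx = current.set_index(key)

    added_keys = curr_idx.index.difference(prev_idx.index)
    removed_keys = prev_idx.index.difference(curr_idx.index)
    common_keys = curr_idx.index.intersection(prev_idx.index)

    # A "change" is any record present in both runs whose non-key columns differ.
    # (NaN handling is omitted here to keep the sketch short.)
    common_prev = prev_idx.loc[common_keys].sort_index()
    common_curr = curr_idx.loc[common_keys].sort_index()
    changed_mask = (common_prev != common_curr).any(axis=1)

    return {
        "adds": curr_idx.loc[added_keys].reset_index(),
        "deletes": prev_idx.loc[removed_keys].reset_index(),
        "changes": common_curr.loc[changed_mask].reset_index(),
    }
```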

Within a few weeks, I had a small prototype up and running that grabbed data from Import.io, created DataFrames, and compared them.

Pro Tip: One of the keys to getting Airflow running locally and deployed quickly is to use Astronomer. Astronomer offers an "Airflow as a Service" solution, and it is fantastic. They offer a free CLI tool that will get you up and running with Airflow on your local machine using Docker in no time. Additionally, they offer a hosted Airflow solution; it makes deployment a cinch.

Together with the terrific team at ProofPort, we iterated on the data pipeline many times. I would show a bit of functionality, and the data auditor would audit the data and give feedback on any issues he found. Those issues would either turn into bug fixes or new features to be added. Rinse and repeat.

This points to the importance of iterative development, not just in software in general, but particularly in data-related work. You simply do not know what you do not know about the data. Often the data sets are so large and unruly that idiosyncrasies and data corruptions do not present themselves until you begin to work with the data more. That is why it is critical that you develop your pipeline iteratively.

Testing

The benefits of (read: "need for") proper testing in software development have been thoroughly established elsewhere. Here, I will simply examine its relationship to an Airflow data pipeline.

Because of the incredible diversity of data that can come through a data pipeline, if you do not have tests to validate your core business logic, it’s not a matter of "if" but "when" your data will become corrupted.

Someone will regularly check the outputs of the pipeline, but as more functionality is added, he or she certainly will not be checking every possible combination of data features with each new release. Instead, you will likely implement some sort of Six Sigma quality assurance strategy, or a related concept. This is a powerful way to measure quality, but it assumes that the underlying system remains constant! This entire assumption is invalidated if you do not have automated testing to verify your core business logic.

While developing the ETL pipeline at ProofPort, I wrote extensive unit tests around the core code library (which was responsible for Extracting, Transforming, and Loading the data). These tests were further verified by comparing the results of our diff tool to Import.io's diff tool. This illustrates another important and familiar idiom: "Trust but verify."
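To illustrate the flavor of those tests, here is a hedged example written against the illustrative diff_runs sketch from earlier rather than ProofPort's actual library; the module path and column names are hypothetical.

```python
import pandas as pd

from pipeline.diff import diff_runs  # hypothetical module path for the earlier sketch


def test_company_rename_shows_up_as_add_and_delete():
    # A rename appears as one added record and one removed record in the diff.
    previous = pd.DataFrame(
        {"company_id": [1, 2], "name": ["Federal Express", "Acme Corp"]}
    )
    current = pd.DataFrame(
        {"company_id": [3, 2], "name": ["FedEx", "Acme Corp"]}
    )

    result = diff_runs(previous, current, key="company_id")

    assert result["adds"]["name"].tolist() == ["FedEx"]
    assert result["deletes"]["name"].tolist() == ["Federal Express"]
    assert result["changes"].empty
```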

Automation

“We reduced the time spent in manual audits by 90%”

To my surprise, we had a working, production ETL pipeline at the end of 3 months, and so we launched the app. I then spent the next few months focused on adding new features as well as automating as much of the process as possible. This was a huge learning takeaway for us. Any time spent automating the manual human intervention process of the data load paid off tremendously. In the diagram above you can see "statistical audits”. These were not required for launch, but the idea was to automate the majority of the human auditor's work away.

We developed several automations, but one of the best examples was automating company name changes. A company name change would result in both an added record (e.g. "FedEx") and a removed record (e.g. "Federal Express") in a diff output. So if the diff dataset did not contain both added and removed records, no rename was possible, and we could go ahead and load it to production. Another simple automation was checking whether anything had changed at all between the two data runs. If not, the output of the diff tool would be an empty dataset, and we would just update some metadata for that particular data run.
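As a rough sketch of that decision logic (the function names and diff structure follow the earlier illustrative example, not our actual code):

```python
def possible_name_change(diff: dict) -> bool:
    """A rename only shows up as a simultaneous add and delete, so a diff
    missing either side can safely skip the manual rename review."""
    return not diff["adds"].empty and not diff["deletes"].empty


def handle_diff(diff: dict) -> str:
    if diff["adds"].empty and diff["deletes"].empty and diff["changes"].empty:
        # Nothing changed between runs: just record metadata for this data run.
        return "update_metadata_only"
    if not possible_name_change(diff):
        # No add/delete pair means no rename to confirm: safe to auto-load.
        return "load_to_production"
    # Otherwise route to the human auditor to confirm or deny the rename.
    return "manual_audit"
```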

With several such optimizations, our auditor estimates that we reduced the time spent in manual data audits by 90%.

Slack proved to be a helpful tool for spinning up a simple UI for the ETL pipeline. Before closing shop, we were experimenting with a custom Slack Bot that would enable the data auditor to confirm or deny company name changes with the click of a button, all within the Slack channel window.
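As a hedged sketch of that idea, a Slack Block Kit message with confirm/deny buttons might be posted like the following; the token, channel handling, and action IDs are placeholders, and the actual bot never shipped.

```python
from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token")  # illustrative bot token


def ask_auditor_about_rename(channel: str, old_name: str, new_name: str) -> None:
    """Post a yes/no prompt so the auditor can confirm or deny a suspected rename."""
    client.chat_postMessage(
        channel=channel,
        text=f"Possible rename: '{old_name}' -> '{new_name}'. Confirm?",
        blocks=[
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"Possible rename: *{old_name}* -> *{new_name}*",
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Confirm"},
                        "style": "primary",
                        "action_id": "confirm_rename",
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Deny"},
                        "style": "danger",
                        "action_id": "deny_rename",
                    },
                ],
            },
        ],
    )
```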

DELIVER

One of the key requirements was being able to link data runs across multiple Airflow DAGs (directed acyclic graphs). At its core, this is a fairly common challenge in ETL development: you need some way to persist metadata, both for record-keeping and for re-running the data as needed. We solved this by generating a unique ID (a GUID) for each data run (which consisted of 3 steps, or phases). We created a directory inside an Amazon S3 bucket named with this ID, and stored the files representing each stage of transformation for that data run in that directory. The unique ID was also included in the Slack message notifications. So essentially we created a system with clear separation of concerns, whose pieces were still linked together to present a complete story of the data.
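A minimal sketch of that pattern, assuming boto3 and an illustrative bucket name (the real bucket, prefixes, and helper names differed):

```python
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "proofport-etl-runs"  # illustrative bucket name


def start_data_run() -> str:
    """Generate the unique ID (GUID) that links all three phases of a data run."""
    return str(uuid.uuid4())


def save_stage_output(run_id: str, stage: str, filename: str, body: bytes) -> str:
    """Store one stage's output under the run's S3 prefix, e.g.
    s3://proofport-etl-runs/<run_id>/extract/source_a.csv"""
    key = f"{run_id}/{stage}/{filename}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return key
```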

“your plan for ETL using AirFlow transformed an unreliable headache of an ETL process to a strategic asset for our business that customers can rely upon”
— Mark Hamm, CEO ProofPort

The human user (the data auditor) would update Airflow variables to point to particular import.io runs (if desired), and then trigger the first phase DAG for each data source. The DAG would run and update files in S3 and metadata in DynamoDB. It also updated some metadata in the Airflow Variables. When the DAG finished, it would kick out a notification to Slack, which included a link to the dataset file to audit on S3 and a link to the second phase DAG to run. The auditor would download the file from S3, make any necessary edits, upload it to S3, and then kick off the next DAG. This process would be repeated one more time, and it was finished.
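To give a feel for the shape of a phase-one DAG, here is a stripped-down sketch using Airflow 1.10-era imports. The DAG ID, Variable keys, and Slack webhook handling are illustrative, and the real extraction logic lived in our core code library.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import


def extract_and_stage(**context):
    # Illustrative: look up which import.io run the auditor pointed us at,
    # then run the core library's extract/transform code and stage the output in S3.
    importio_run_id = Variable.get("source_a_importio_run_id")
    ...  # extraction and staging logic lived in the core code library


def notify_slack(**context):
    # Post a simple completion message; the real notification included links
    # to the staged S3 file and to the next DAG to trigger.
    webhook_url = Variable.get("slack_webhook_url")
    requests.post(
        webhook_url,
        json={"text": "Phase 1 finished; dataset is ready for audit."},
        timeout=10,
    )


with DAG(
    dag_id="source_a_phase_1",  # illustrative DAG name
    start_date=datetime(2019, 9, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    stage = PythonOperator(
        task_id="extract_and_stage",
        python_callable=extract_and_stage,
        provide_context=True,
    )
    notify = PythonOperator(
        task_id="notify_slack",
        python_callable=notify_slack,
        provide_context=True,
    )
    stage >> notify
```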

Before the startup closed down, I was working on automating this process even further so that the auditor would simply be presented with yes/no decisions in Slack messages with interactive buttons.

This approach gave us the flexibility to automate what we could automate while still allowing for the data auditor to inject changes, if necessary, at various points in the pipeline.

In 3 months we had developed a production ETL pipeline that ingested thousands of records weekly, compared them against the most recent data run, and loaded the differences in a cleaned format to our web app via a GraphQL API.
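For the load step, the general pattern looks something like the following sketch; the endpoint, mutation, and field names are invented for illustration, since ProofPort's actual GraphQL schema is not public.

```python
import requests

GRAPHQL_ENDPOINT = "https://api.example.com/graphql"  # illustrative endpoint

# Illustrative mutation; ProofPort's actual schema and field names differed.
UPSERT_RECORD = """
mutation UpsertComplianceRecord($input: ComplianceRecordInput!) {
  upsertComplianceRecord(input: $input) { id }
}
"""


def load_record(record: dict, token: str) -> str:
    """Send one cleaned diff record to the backend via a GraphQL mutation."""
    response = requests.post(
        GRAPHQL_ENDPOINT,
        json={"query": UPSERT_RECORD, "variables": {"input": record}},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]["upsertComplianceRecord"]["id"]
```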

The CEO of ProofPort said that this ETL pipeline “transformed an unreliable headache of an ETL process to a strategic asset for our business that customers can rely upon”. He added that “wiring AirFlow into slack data approvals was transformational and eliminated some major manual audit costs”.
