
Writing Millions of Records in DynamoDB

Published Mar 29, 2020

The biggest challenge was writing millions of rows efficiently into DynamoDB. I slowly worked my way up from 100 rows/second to around 1,500 rows/second, and here’s how I got there.

The problem

DynamoDB does not offer integrated bulk load directly. The recommended approach is to create an AWS Data Pipeline to import from S3. We found deal-breaker issues with this approach, including that Data Pipeline and EMR are finicky to set up, lag behind Spark updates, and do not provide sufficient—or often any—logging for key failures.
The example Data Pipeline scenarios read files from S3 and write the corresponding DynamoDB items. Apache Spark distributes the Dataset in memory across EC2 instances in a cluster. The cluster is large when the data is large. This makes perfect sense when you’re playing to Spark’s strengths by operating on the data. However, when writing to DynamoDB we only need a few items at a time to batch writes efficiently. Reading a whole Dataset into large-scale distributed memory, for example, is wasteful.
Import tasks must also map the input schema to the exact format of a DynamoDB Item. We do not assume Rue Data Science jobs know anything about how data is stored and delivered. In fact, we deliberately align the output format of Data Science tasks to the input schema, usually from our Snowflake Data Warehouse. This removes surprises with naming of output columns.
AWS Glue is promising, but does not directly support DynamoDB as an endpoint as of this writing.

Auto-scaling - Better turn that OFF

Writing data at scale to DynamoDB must be done with care to be correct and cost-effective. We started by having the Airflow tasks or scheduled Databricks notebooks for each API data import set the provisioned capacity high (25,000+ writes per second) until the import was complete, then set it back down to 1 when the importers finished. We sometimes forgot, which cost money, sometimes a lot, for write capacity we weren’t consuming. And sometimes the initial capacity increase failed, which caused the imports to take too long or simply fail.
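For context, the capacity bump itself is a single API call. Here is a minimal sketch in Python with boto3; the table name and read capacity are placeholders, not our production values:

```python
import boto3

dynamodb = boto3.client("dynamodb")

def set_write_capacity(table_name, write_units, read_units=5):
    """Raise (or drop) provisioned throughput around a bulk import."""
    dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )

set_write_capacity("api-import-table", 25000)   # before the import
# ... run the importer ...
set_write_capacity("api-import-table", 1)       # after it completes (easy to forget)
```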
Our first fix was CloudWatch alarms on write throttling to modulate capacity, and we had some success with that approach. Then Amazon announced DynamoDB autoscaling. It works for some important use cases where capacity demands increase gradually, but not for others, like all-or-nothing bulk loads.
There’s also an unpleasant side-effect you need to know about that was a deal-breaker for our use case. More on that shortly.
We had to pay particular attention to the following:
The console UI does not expose two important autoscaling properties we found valuable: ScaleInCooldown and ScaleOutCooldown, which default to 0. These settings limit the rate of successive scale-in and scale-out events, requesting a pause of at least the given number of seconds between changes. To give these parameters non-zero values you must use the SDK, the AWS CLI, or CloudFormation (see the sketch at the end of this section).
Even with the cool-down parameters set to limit the frequency of capacity changes, we had little or no control over the magnitude of changes. Consequently, capacity was reduced to minimum even when there was just a pause in imports. This can happen, for example, because all data API importers share bounded compute clusters in AWS Batch and are competing for resources. We’re trying to limit the maximum size of these clusters since unchecked they’re a significant expense.
You’re only allowed four capacity decreases per table per day, and a transient decrease counts as one of them. Before you know it, you’re out of available decreases, so you must remember to decrease manually at the next window, currently after an additional two hours.
But perhaps the worst surprise of all was that autoscaling is only triggered when there’s traffic. So after we scale up during a bulk import, we stay at the expensive high capacity when importing stops abruptly and our traffic goes to zero. We have a nice solution to this issue, which we’ll publish shortly.
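For reference, here is a minimal sketch of setting those cooldowns with boto3's Application Auto Scaling client; the table name, capacity bounds, target utilization, and cooldown values are illustrative, not our production settings:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target (bounds are illustrative).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/api-import-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=25000,
)

# Attach a target-tracking policy with non-zero cooldowns,
# which the console does not let you set.
autoscaling.put_scaling_policy(
    PolicyName="write-capacity-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/api-import-table",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
        "ScaleInCooldown": 300,   # seconds between scale-in events
        "ScaleOutCooldown": 60,   # seconds between scale-out events
    },
)
```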

Work with Batch Write Item

The first mistake I made was a data modelling error. I needed to take a single field (IC Number) and return all the phone numbers in the breach for it. So I initially modelled the phone numbers as an array within an item (what you’d call a row in regular DB speak).

Strictly speaking this is fine: DynamoDB has an update command that lets you update/insert an existing item. The problem is that you can’t batch an update command; each update call can only update/insert one item at a time.
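To make that concrete, here is roughly what the one-item-at-a-time approach looks like in Python with boto3; the table and attribute names (breach-data, ICNumber, phones) are hypothetical stand-ins, not the actual schema:

```python
import boto3

table = boto3.resource("dynamodb").Table("breach-data")  # hypothetical name

def add_phone(ic_number, phone):
    # Appends one phone number to a list attribute on the item for this IC Number.
    # UpdateItem has no batch equivalent, so every phone number costs a round trip.
    table.update_item(
        Key={"ICNumber": ic_number},
        UpdateExpression="SET phones = list_append(if_not_exists(phones, :empty), :new)",
        ExpressionAttributeValues={":empty": [], ":new": [phone]},
    )
```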

Running a script that updated one row at a time in DynamoDB was painfully slow: around 100 items/second on my machine. Even when I copied that script to an EC2 instance in the same data center as the DynamoDB table, I got no more than 150 items/second.

At that rate, a 10 million row file would take nearly 18 hours to insert. That wasn’t very efficient.

So I destroyed the old paradigm, and re-built.

Instead of phone numbers being arrays within an item, phone numbers became the items themselves. I kept IC Number as the partition key (which isn’t what Amazon recommends), which allowed me to query for an IC Number and get back an array of items.
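In table terms, that looks something like the sketch below. I’m assuming here that the phone number doubles as the sort key, so a single IC Number (the partition key) can hold many items; the table and attribute names are again hypothetical:

```python
import boto3

boto3.client("dynamodb").create_table(
    TableName="breach-data",
    AttributeDefinitions=[
        {"AttributeName": "ICNumber", "AttributeType": "S"},
        {"AttributeName": "Phone", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "ICNumber", "KeyType": "HASH"},   # partition key
        {"AttributeName": "Phone", "KeyType": "RANGE"},     # one item per phone number
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 1000},
)
```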

This allowed me to use DynamoDB’s batch_write_item functionality, which handles up to 25 put requests at once (up to a maximum of 16 MB per batch). Since my items weren’t anywhere near 16 MB, I would theoretically get a 25-fold increase in speed.
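With boto3, the simplest way to use batch_write_item is through a table’s batch_writer, which takes care of the 25-item grouping and resends unprocessed items for you. A minimal sketch, using the same hypothetical names as above:

```python
import boto3

table = boto3.resource("dynamodb").Table("breach-data")

def bulk_insert(rows):
    """rows: an iterable of (ic_number, phone) tuples parsed from the dump file."""
    with table.batch_writer() as batch:
        # batch_writer buffers puts into BatchWriteItem calls of up to 25 items
        # and automatically retries any unprocessed items.
        for ic_number, phone in rows:
            batch.put_item(Item={"ICNumber": ic_number, "Phone": phone})
```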

In practice though, I got ‘just’ a 10-fold increase, allowing me to write 1,000 items/second instead of 100. This meant I could push through a 10 million row file in under 3 hours.

First rule of thumb when trying to write lots of rows into DynamoDB: make sure the data is modelled so that you can batch insert. Anything else is painfully slow.

Latency matters

[Image: Latency_1.png]

Even with Batch write item, latency matters.

Writing from my laptop in Singapore to the AWS Singapore region, I sensed that latency was causing issues. Just FYI, Singapore has world-class infrastructure, my home has a 1 Gbps connection, and the country is just 60 km across end to end, which meant I couldn’t have been more than 30 km from the actual data center hosting my DynamoDB table.

Even then, when I spun up an EC2 instance in the same region and ran the script from there, I got a 50% increase in throughput: from 1,000 items/second to roughly 1,500 items/second, peaking at around 1,800 items/second.

It wasn’t hard cloning my scripts (they were on GitHub), but I made sure I encrypted the data file before SFTP-ing it to the EC2 instance. Once transferred, I could sit back and relax. If you buy spot instances, they can cost as little as 3-4 cents per hour, which is well worth the money.

Multi-threading

From here the obvious next option was to multi-thread the whole thing. If one thread can do 1,500, then 20 threads can do 30,000. I was keeping Troy Hunt’s 22,500 rows/second record as my benchmark.
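One way to wire that up would be a thread pool with one batch writer per thread, sketched below with the same hypothetical table and key names. I never took this to production, so treat it as an illustration of the idea rather than a tuned solution:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

def write_chunk(chunk):
    # Each thread gets its own session: the default boto3 session is not thread-safe.
    session = boto3.session.Session()
    table = session.resource("dynamodb").Table("breach-data")
    with table.batch_writer() as batch:
        for ic_number, phone in chunk:
            batch.put_item(Item={"ICNumber": ic_number, "Phone": phone})

def parallel_insert(rows, workers=20, chunk_size=10_000):
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(write_chunk, chunks))  # list() surfaces worker exceptions here
```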

But hold on a gosh-darn second there!

Troy uses Azure Table Storage, which is very different from DynamoDB.

Table Storage charges per operation, while DynamoDB charges per Capacity Unit, and those are billed per hour. Plus, there’s a soft limit of 10,000 units per table, which you’d need to write to Amazon to have increased.

In other words, Azure charges per operation with seemingly no limit on throughput, while Amazon charges entirely on throughput. You need to optimize your throughput to balance cost against performance if you’re hoping to make DynamoDB work for you.

At 1,000 rows per second, it takes 3 hours to insert 10 million rows, and costs $2.22.

At 10,000 rows per second, it takes 17 minutes to insert 10 million rows, but costs $7.44.
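Those numbers line up with a quick back-of-envelope calculation, assuming roughly $0.00074 per write capacity unit per hour (approximately the Singapore-region provisioned write price at the time; check current pricing before relying on it):

```python
import math

PRICE_PER_WCU_HOUR = 0.00074   # assumed ap-southeast-1 provisioned write price
ROWS = 10_000_000

for wcu in (1_000, 10_000):
    hours_writing = ROWS / wcu / 3600          # wall-clock time spent inserting
    hours_billed = math.ceil(hours_writing)    # provisioned capacity bills per hour
    cost = wcu * PRICE_PER_WCU_HOUR * hours_billed
    print(f"{wcu:>6} WCU: {hours_writing:.2f} h of writing, about ${cost:.2f}")

# 1,000 WCU: ~2.78 h of writing, billed for 3 hours, about $2.22
# 10,000 WCU: ~0.28 h (17 minutes), billed for a full hour, about $7.40
```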

You basically pay more to spend less time; or, my way of thinking, why spend money when time is cheaper? The keen-eyed would point out that I’m using EC2 instances to push this, so each extra hour costs me something, but spot instances (even of high-end servers) can run to just $0.03 per hour.

And honestly speaking, I haven’t really mastered this multi-threading stuff, so I decided to give it a skip and just stick to a best effort of 1,500 rows per second. Three hours to process 10 million rows is fine by me: just leave it overnight and look at the logs in the morning.

In summary

If you want to write millions of rows into DynamoDB at once, here’s my advice:

Model the data right, so you can batch write everything
Turn off auto-scaling, and manually manage the throughput
Run the insertion from an EC2 instance in the same region
Consider multi-threading, but also consider the cost associated with it
Sorry I don’t have the CloudWatch graphs; I learnt today that they only last 2 weeks, and it’s been more than 2 weeks since I pushed this data into DynamoDB.

Lastly, consider this: a full-blown DynamoDB table in Singapore costs $7.44 per hour. That’s a NoSQL database capable of doing 10,000 writes per second, with backup/restore, mirroring, and clustering all taken care of for you. Serverless is ridiculously cheap, and getting cheaper. The only thing holding most people back is the feature set, but I expect that over the next 3-5 years DynamoDB will have 70% of the features existing NoSQL databases have today, and that’s all 99% of folks will ever need.

Comments
James Daily
10 months ago

For brand new tables, DynamoDB now supports importing data directly from S3: https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/

Existing tables can use AWS DMS or other approaches such as the one you describe.

Lars U
3 years ago

Nice article :)
Are you using BatchWrite with multi-threading? Can you point me to any code snippets?
