
Best Practices for Data Sync in AWS S3 and Other Platforms

Published Jan 04, 2018 · Last updated Jan 09, 2018

Over the last few years, Amazon S3 has become the most popular object storage solution in the cloud. This is not only due to the high durability and availability provided by S3, but also to the simplicity of its API, SDKs, and CLI, which make it straightforward to sync data into S3 from different sources. In addition, AWS offers tools such as AWS Data Pipeline and AWS Glue for common use cases, and there are third-party tools built on AWS services that handle use cases such as large-scale S3 data sync using AWS EMR. This article discusses different use cases and best practices for the S3 data sync process. Let's identify several use cases where S3 is used as a data store to synchronize data from internal AWS services and external third-party platforms.

Using S3 for Cloud File Storage Sync

One of the most common use cases is using Amazon S3 as a cloud file storage system (more details). This allows you not only to store files remotely, but also to replicate and version them for change tracking. However, efficient S3 data sync from external sources requires following several best practices.

Practice 1: Synchronize files directly from Client to S3

One of the most important practices you can follow is to allow S3 data sync operations directly from the client whenever possible. For example, file uploads in a web application can be done using JavaScript running in the web browser.

You might wonder whether this is possible for private buckets. Yes, it is: use a web server, API, or serverless endpoint (an endpoint that is not in the middle of the S3 data sync flow itself) to generate either signed URLs or temporary access credentials, and then perform the S3 data sync operations directly from the client. This lets you utilize the maximum available bandwidth to S3 without being limited by a server that would otherwise have to carry out sequential or parallel uploads on the client's behalf.
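For illustration, here is a minimal sketch of a server-side helper that issues a pre-signed PUT URL using the AWS SDK for JavaScript v2; the bucket name, expiry, and content type are placeholder assumptions, not values from this article:

```typescript
// Minimal sketch: a server-side helper that returns a pre-signed PUT URL,
// so the client can upload directly to a private bucket.
// The bucket name and expiry are assumptions for illustration only.
import * as AWS from 'aws-sdk';

const s3 = new AWS.S3({ region: 'us-east-1' });

export function getUploadUrl(key: string): string {
  return s3.getSignedUrl('putObject', {
    Bucket: 'my-private-bucket',          // hypothetical bucket
    Key: key,
    Expires: 300,                         // URL valid for 5 minutes
    ContentType: 'application/octet-stream',
  });
}
```

The client then performs an HTTP PUT against the returned URL, so the file bytes never pass through your server.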

Data Backup Sync

When backing up files to S3, it is important to split the files into appropriate sizes. When large files are uploaded, even the tiniest change requires uploading the entire file to S3 again. This also makes it challenging to recover from partial failures, since in most cases the upload has to be restarted from the beginning.

Practice 2: Split large files into smaller ones

When archiving files, backups, or snapshots, it is important to synchronize them reliably while also allowing retries. Breaking large files into smaller ones lets you replace smaller units whenever necessary. In a failure situation, retrying small files is far more practical than retrying large ones, and it saves bandwidth.
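As a rough sketch (the part size and paths are illustrative assumptions), a backup file could be split into fixed-size parts before syncing, so that each part can be uploaded and retried independently:

```typescript
// Rough sketch: split a backup file into fixed-size parts so a failed part
// can be retried on its own. For very large files a streaming read would be
// preferable to loading the whole file into memory as done here.
import * as fs from 'fs';

const PART_SIZE = 64 * 1024 * 1024; // 64 MB per part (illustrative assumption)

export function splitFile(inputPath: string, outputDir: string): string[] {
  const data = fs.readFileSync(inputPath);
  const partPaths: string[] = [];
  for (let offset = 0, i = 0; offset < data.length; offset += PART_SIZE, i++) {
    const partPath = `${outputDir}/part-${String(i).padStart(5, '0')}`;
    fs.writeFileSync(partPath, data.subarray(offset, offset + PART_SIZE));
    partPaths.push(partPath);
  }
  return partPaths; // each part is now an independent unit of sync and retry
}
```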

Practice 3: Compress files

Since AWS S3 pricing involves both storage and data transfer (see the official pricing page), it is beneficial to compress files before uploading. It is also important to select a compression format that still allows the files to be separated and reassembled.
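For example, a minimal sketch of gzip-compressing a single part before upload, using Node.js streams (the file paths are placeholders):

```typescript
// Minimal sketch: gzip-compress one part before uploading it, using the
// Node.js built-in zlib module. Paths are placeholders; error handling
// on the streams is omitted for brevity.
import * as fs from 'fs';
import * as zlib from 'zlib';

export function compressFile(inputPath: string, outputPath: string): void {
  fs.createReadStream(inputPath)
    .pipe(zlib.createGzip())              // each part stays independently decompressible
    .pipe(fs.createWriteStream(outputPath));
}
```

Compressing each part separately, rather than one huge archive, preserves the ability to replace or retry individual parts as discussed in Practice 2.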

Continuous Change Events Sync

If you are planning to do an S3 data sync for large amounts of continuously changing small data sets, it is efficient to use streaming solutions to build a reliable and efficient event storage system.
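As a rough illustration of this pattern, small change events can be pushed through a delivery stream that buffers them and writes batched objects to S3; the delivery stream name below is a placeholder and the stream must already be configured with an S3 destination:

```typescript
// Rough sketch: push small change events to a Kinesis Firehose delivery
// stream, which buffers them and delivers batched objects to S3.
// The stream name is hypothetical and must already exist.
import * as AWS from 'aws-sdk';

const firehose = new AWS.Firehose({ region: 'us-east-1' });

export function sendChangeEvent(event: object) {
  return firehose
    .putRecord({
      DeliveryStreamName: 'change-events-to-s3',       // hypothetical stream
      Record: { Data: JSON.stringify(event) + '\n' },  // newline-delimited JSON
    })
    .promise();
}
```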

Practice 4: For small files, use streaming solutions

When a large number of small file changes happen, it is efficient to use a streaming storage ingestion service such as AWS Kinesis Firehose to upload the data efficiently and in parallel. You can also consider using the newly introduced Amazon S3 Managed Uploader in the AWS SDK for JavaScript (more details), which allows large buffers, blobs, or streams to be uploaded more easily and efficiently, with intelligent selection of how the data is split into parts.
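As a sketch, the S3 Managed Uploader is exposed as s3.upload in the AWS SDK for JavaScript v2; it splits the body into parts and uploads them in parallel. The bucket, key, and tuning values here are illustrative assumptions:

```typescript
// Sketch: the S3 Managed Uploader (s3.upload) splits the body into parts
// and uploads them in parallel. Bucket, key, and tuning values are
// illustrative only.
import * as AWS from 'aws-sdk';
import * as fs from 'fs';

const s3 = new AWS.S3();

s3.upload(
  {
    Bucket: 'my-backup-bucket',                    // hypothetical bucket
    Key: 'backups/2018-01-04/part-00001.gz',
    Body: fs.createReadStream('./part-00001.gz'),
  },
  { partSize: 10 * 1024 * 1024, queueSize: 4 },    // 10 MB parts, 4 in parallel
  (err, data) => {
    if (err) {
      console.error('Upload failed:', err);
      return;
    }
    console.log('Uploaded to', data.Location);
  }
);
```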

Continuously Changing Files Sync

It can get challenging to keep continuously changing files in sync, since it is also necessary to keep track of partial syncs with respect to the changes. For example, when a set of files is changed, it is important not only to sync the changes, but also to keep track of the files so that it is possible to fall back to a previous consistent state if the sync fails partway.
