Amazon Athena vs. Amazon Redshift - Setup and Management Comparison
Amazon Athena and Amazon Redshift are cloud-based data services that both have quite different use cases.
Amazon Athena is a portable solution that allows you to quickly query data stored in the Amazon S3 web storage platform without setting up any infrastructure. The portability and low cost of Athena makes it helpful for running quick ad-hoc queries to troubleshoot issues with your applications.
Redshift is a data warehouse that pulls together data from disparate sources for high-level reporting and analysis purposes (see this overview of different types of data warehouses).
Comparing both the setup and management of these cloud-based data services can help you better understand when to choose which service in addition to the potential resources you'll need when using either service.
This article is based on the full Athena vs Redshift comparison written by Panoply (more details).
Partitioning allows for quicker querying because it divides the database or table into separate parts, so only relevant data is scanned during a query.
Athena lets you partition using any key, and the maximum number of partitions is 20,000 per table.
The Redshift data warehouse doesn't support table partitioning by default. Rather, you must define distribution keys when loading data to make the parallel processing work most efficiently. It's important to choose the right distribution style for each table to maximize performance.
Amazon has recently added the ability to perform table partitioning using Amazon Spectrum. It is possible to partition external tables on one or more columns.
Partitioning the external tables improves performance, because the Amazon Redshift query optimizer eliminates partitions that don’t contain data.
Primary Key Constraint
An SQL PRIMARY KEY Constraint is a unique identifier used for each value stored in a table.
Redshift doesn't require a Primary Key Constraint, meaning you can upload the same data any number of times. Such replicated data is sometimes dangerous because it can give inaccurate results. If you want to use a Primary Key Constraint, you can declare it before loading data into the data warehouse.
Athena doesn't enforce Primary Keys. Duplication is only possible if your underlying S3 data sets contain duplicate values.
Data Formats & Data Types
Athena supports CSV, JSON, columnar formats such as ORC, and Parquet. Additionally, Athena supports all compressed data formats with the exception of LZO data.
Redshift supports JSON, CSV, TSV, and Apache logs. You can also load compressed data files from Amazon S3, including GZIP, LZOP, or BZIP formats.
Athena supports a range of disparate data types, including arrays, maps, and structs. Redshift does not support arrays or Object Identifier Types.
Setting Up Athena
One of Athena's advantages is the speed at which you can set it up.
First, you need to open the Catalog Manager in Athena. Create a database, create a table, and specify the location of your data on Amazon S3.
Next, choose the format of the data. You can choose any one of several format options.
After you choose a format, set the column names and types. A typical column name and type could be something such as Age (string). You can add columns in bulk using the Athena Bulk Add interface.
Setting Up Redshift
Redshift requires setting up an infrastructure so it takes longer to get started with. Redshift uses clusters, which are composed of a series of servers named nodes that communicate with each other.
First, you need to configure the properties of your cluster.
Then, choose a node type. Each node has its own RAM, capacity, and CPU.
You also need to also specify additional settings such as parameter groups before waiting for the cluster to initialize.
The waiting time before you can sue a cluster depends on how much data you load.
Athena uses Amazon Web Services Identity Access Management to restrict access to Athena operations. Users who want to query data must have access to the Amazon S3 locations where data is stored. Additionally, AWS IAM makes it easy to control who can perform queries by using one of two managed policies.
Note: When accessing data on another AWS resource, you need permission to access the resource and its data. AWS Identity and Access Management are used to provide the permission, either through an IAM role or through an AWS access key. For more information, see Create an IAM Role.
Security for Redshift is more complex. To grant other users access to a cluster, you need to define a cluster security group, add users to it, and associate the group with your cluster. You can also protect your Redshift data by using Amazon’s virtual networking environment VPC.
Redshift has the clear advantage when it comes to upgrading. You can scale a cluster up by simply purchasing more nodes to cater for more data.
Amazon Athena has several service limits. Athena users are limited to 100 S3 buckets per account. The number of databases is also limited to 100. To scale Athena you need to request higher limits from Amazon.
Amazon Athena quickly queries S3 data using Presto, which is a distributed SQL engine used by multinational corporations such as Facebook. Presto targets data analysts who need lightning fast response times from queries. You can partition tables in Athena to further improve query speed and performance. You can also convert to convert data to columnar formats such as Parquet for better results, as this blogger showed when querying a dataset with over 1 billion lines of taxi journey data in just 6.63 seconds.
Using Redshift, you have to load the data into the data warehouse, which takes a few minutes. After the loading completes, you can expect quicker query times than Athena. Redshift is the winner for high-performance, but you need to gauge whether the improvement in performance justifies the cost of setting up the Redshift infrastructure.
Redshift requires the use of a visual data tool such as SQL Workbench to test query performance. For Athena, query results are displayed directly in the console, reducing dependency on other tools to assess performance.
Creating tables is quicker in Redshift. Redshift uses PostgreSQL to create tables. In Athena, you use Apache Hive data definition language (DDL) statements to define tables. You need to specify an S3 bucket location for underlying data when creating tables in Athena.
The following experiment shows how long it took to create the same table in each service.
It takes just over 2.5 times longer to create a given table in Athena than Redshift.
Simple Read Query
Athena is optimized for simple read queries, as the results of the following experiment show. This query, executed in both Athena and Redshift, filters out rows from the data set.
Athena comprehensively outperformed Redshift with this simple read query, executing it over 40 times quicker.
Aggregated queries such as sums and average functions are used to get single-figure summary answers to questions. The following images show the results of executing the same aggregated query in both Redshift and Athena.
Athena outperformed Redshift with this query too, but the difference was marginal.
A join is a more complex type of query—it is the type of query Redshift is optimized for, as the results of performing the same query in Redshift and Athena show.
Note that you must push the data into a table using a COPY command before joining. Loading the table to Redshift added an extra 25.52 seconds to the results of this query.
Even taking the extra 25 seconds to load the data into consideration, Redshift significantly outperforms Athena for performing complex join queries.
For quick ad hoc queries, you can’t beat Amazon Athena for its combination of low cost and stellar performance.
Athena does not require you to provision any computing resources—you can get started with it right away. Athena supports several data formats and complex data types that Redshift doesn't currently support. Partitioning is straightforward in Athena, while it's not supported in Redshift, making query optimization a trickier topic because it requires the correct choice of distribution keys for a given data set. Athena is the far more accessible option, and it performs simple read queries quicker than Redshift. Even for aggregated queries, Athena performs well, and organizations with less data might find Athena meets their needs for aggregations.
Redshift's computational advantages come to the fore when you need to perform lots of aggregations on huge data sets and many complex joins. Redshift's infrastructure is also scalable for future data growth. Redshift is therefore suitable for organizations or applications with huge data sets.
For everyday data querying needs, Athena is a solid option.
(All screenshots are taken from Panoply's comparison, except where specified otherwise.)