Cost Optimization for Large Datasets on AWS: A Case for S3, Athena, and Glue

Managing large datasets is not only an operational issue, but a financial one as well.

S3, Athena, and Glue, in particular, provide reliable tools for reining in ballooning costs.

Achieving long-term cost optimization as a business scales may mean moving to other AWS services.
 

Key Takeaways
 

  • Thoroughly assess your data, especially streaming data, which is often a main culprit behind skyrocketing costs.
     
  • Transform raw data before storing it. This may include using transformation tools such as AWS Glue and converting the data into formats, such as Parquet, that are easier to query and require less storage space.
     
  • If needed, consider different services that better support your complex data processing needs.
     

Introduction: The Challenge of Managing Large Datasets on AWS


As cloud computing continues to transform how organizations handle large datasets, the sheer volume of information being generated, processed, and stored on platforms like Amazon Web Services (AWS) has exploded. For many enterprises, managing these large datasets poses a unique challenge—not just from an operational perspective, but from a financial one. Scaling in the cloud offers incredible flexibility, but without the right cost optimization strategies, you may find yourself grappling with skyrocketing bills that negate the very efficiencies cloud computing promises. And in addition to cost optimization, you need to plan properly so you build a solution that meets your needs from the beginning (easier said than done).

The question for organizations is simple: how can you continue to scale your data processing in the cloud while keeping costs in check? This article provides actionable insights and strategies for tackling the rising costs associated with large datasets on AWS, focusing on key services like S3, Athena, and Glue, among others.
 

Understanding Your Cost Drivers in AWS: Storage, Compute, and Queries


One of the first steps in controlling cloud costs is understanding where the major expenses originate. For most organizations, storage and compute are the primary cost drivers. Services like Amazon S3 are incredibly scalable and offer low-cost storage, but improper data management—such as storing millions of tiny files, allowing unoptimized datasets to pile up, or lacking appropriate S3 lifecycle policies—can quickly inflate costs. Similarly, AWS Athena, a serverless interactive query service, is cost-efficient when optimized but can become costly when dealing with large datasets that require multiple scans.

The real issue often lies in how data is stored and queried. Without efficient file consolidation and partitioning, querying massive datasets can cause Athena to scan hundreds of terabytes of data, leading to disproportionately high costs. Understanding, monitoring, and mitigating these cost drivers is essential for maintaining financial control in the cloud.
 

S3 Cost Management: Reducing Costs of Large Amounts of Data

Streaming large amounts of data such as metrics or small data points from many IoT devices often presents a specific challenge in S3 storage costs, especially when data is written as it streams in, creating millions of small files. This leads to substantial S3 GET costs, as well as inefficiencies in querying this fragmented data. One successful fix is to configure a buffering interval in Kinesis so that incoming records are consolidated into larger files before they are written to S3.

This approach results in a considerable decrease in the number of files stored in S3, transitioning from millions to thousands. As a result, retrieval costs will drop significantly, making it a powerful example of how minor adjustments in data stream processing can lead to substantial cost reductions.
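As a rough illustration, the sketch below simulates how a buffering window turns a per-record stream into a much smaller number of S3 objects. The window length, record rate, and batching helper here are hypothetical; in practice the buffering is configured on the Kinesis delivery stream itself rather than in your own code.

```python
from datetime import datetime, timedelta

def batch_by_window(records, window_seconds=300):
    """Group (timestamp, payload) records into batches that would each
    become one S3 object, instead of one object per record."""
    batches = []
    current, window_end = [], None
    for ts, payload in sorted(records):
        if window_end is None or ts >= window_end:
            if current:
                batches.append(current)
            current = []
            window_end = ts + timedelta(seconds=window_seconds)
        current.append(payload)
    if current:
        batches.append(current)
    return batches

# 10,000 metrics arriving one per second, batched into 5-minute windows
start = datetime(2024, 1, 1)
records = [(start + timedelta(seconds=i), f"metric-{i}") for i in range(10_000)]
batches = batch_by_window(records)
print(len(records), "records ->", len(batches), "S3 objects")
```

Ten thousand per-second records collapse into a few dozen objects, which is the same order-of-magnitude reduction the buffering interval achieves on a live stream.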
 

Optimizing Athena Queries for Large Datasets

As your dataset grows, the cost of querying in Athena can quickly spiral out of control too. Your costs can rapidly triple due to unoptimized queries scanning terabytes of data. The issue stems from Athena’s need to scan entire datasets unless they are partitioned and stored efficiently. Enter Parquet, a columnar file format that allows for faster and cheaper queries.

By partitioning your data by key fields, such as client IDs or timestamps, and storing it in a columnar format like Parquet, you reduce the amount of data scanned during queries. Now Athena only needs to access specific partitions rather than the entire dataset. Additionally, Parquet files offer compression benefits, further decreasing storage and query costs.
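To see why partition pruning matters, here is a minimal sketch, using a hypothetical bucket layout and partition columns, of how Hive-style keys let a query engine skip every file whose partition values don't match the WHERE clause:

```python
def partition_prune(keys, **predicates):
    """Keep only S3 keys whose Hive-style partition values match the
    predicates -- a rough model of what Athena does when a query's
    WHERE clause references partition columns."""
    def matches(key):
        parts = dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)
        return all(parts.get(col) == val for col, val in predicates.items())
    return [k for k in keys if matches(k)]

# hypothetical layout: two clients, three days of data
keys = [
    f"metrics/client_id={c}/dt=2024-11-{d:02d}/part-0000.parquet"
    for c in ("acme", "globex") for d in (1, 2, 3)
]
scanned = partition_prune(keys, client_id="acme", dt="2024-11-02")
print(f"scanning {len(scanned)} of {len(keys)} files")
```

One partition out of six is read; on a real table with thousands of partitions, the same pruning is what keeps the bytes-scanned line on your Athena bill small.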

Athena's cost-based optimizer (CBO) uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Using these statistics, the CBO selects the most efficient query execution plan, giving you faster queries at lower cost.
 

Using AWS Glue ETL for Efficient Data Processing

Ingesting raw data directly into S3 for querying can result in high processing costs due to inefficient file structures and formats. A more optimized approach leverages AWS Glue ETL (Extract, Transform, Load) to preprocess and transform raw data before storage. Glue can convert unstructured data formats, such as CSV or plain text, into highly efficient formats like Parquet. This further improves performance while reducing both storage and query costs.
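The core reason columnar formats like Parquet cut query costs can be sketched in a few lines: pivoting row-oriented records into columns means a query that touches one column reads only that column's data. The example below models the idea with plain CSV in memory; a real Glue job would write actual Parquet files, and the sample fields are hypothetical.

```python
import csv
import io

def to_columnar(csv_text):
    """Pivot row-oriented CSV into a column-oriented layout, the core idea
    behind Parquet: a query that needs one column reads only that column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: [row[name] for row in rows] for name in rows[0]}

raw = "device,cpu,mem\nd1,0.42,512\nd2,0.75,1024\nd3,0.10,256\n"
columns = to_columnar(raw)
print(columns["cpu"])  # a cpu-only query touches none of the other columns
```

Parquet adds per-column compression and encoding on top of this layout, which is why it shrinks both the storage footprint and the bytes Athena scans.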
 

Use Cases


Streamlining CloudWatch Metrics Processing

One common challenge is the skyrocketing cost of processing real-time CloudWatch metrics. When every metric is written to S3 as it arrives, the stream creates millions of small files, each triggering a costly retrieval process. The solution here involves applying a buffering window in Kinesis (five minutes, for example) to allow multiple metrics to be consolidated before storage.

But what about all the historical CloudWatch data we’ve gathered? 

To optimize the file size of your existing data, you can write a Python script to restructure and consolidate these tiny files into larger, more manageable files, reducing S3 GET costs and making the dataset much easier to query in Athena. A straightforward adjustment like this can quickly result in a dramatic reduction in overall costs while improving data access performance.
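A minimal sketch of such a consolidation script is below. The directory names, file sizes, and rollover threshold are illustrative, and a production version would read from and write back to S3 (for example with boto3) rather than the local filesystem:

```python
import os
import tempfile

def consolidate(src_dir, dst_dir, target_bytes=128 * 1024 * 1024):
    """Merge many small newline-delimited files into fewer large ones,
    rolling to a new output file once target_bytes is reached."""
    os.makedirs(dst_dir, exist_ok=True)
    out, out_size, out_index = None, 0, 0
    for name in sorted(os.listdir(src_dir)):
        with open(os.path.join(src_dir, name), "rb") as f:
            data = f.read()
        if out is None or out_size + len(data) > target_bytes:
            if out:
                out.close()
            out = open(os.path.join(dst_dir, f"part-{out_index:05d}"), "wb")
            out_size, out_index = 0, out_index + 1
        out.write(data)
        out_size += len(data)
    if out:
        out.close()

# demo: 1,000 tiny metric files collapse into a handful of larger ones
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
for i in range(1000):
    with open(os.path.join(src, f"metric-{i:04d}.json"), "w") as f:
        f.write('{"value": %d}\n' % i)
consolidate(src, dst, target_bytes=4096)
print(len(os.listdir(src)), "files ->", len(os.listdir(dst)), "files")
```

The 128 MB default target is a common sweet spot for Athena: large enough to amortize per-object request costs, small enough to keep parallelism.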
 

AWS Glue ETL and Parquet for Optimizing Ingested Data

Raw data from CloudWatch and other third-party metric sources often requires heavy lifting when queried in Athena, leading to unnecessary cost escalations. Glue ETL offers a way to flatten and transform these datasets into Parquet files. Combined with effective partitioning of the data, you can achieve significant improvements in the amount of data scanned during each query, the time required for queries, and the total cost incurred.
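Here is a sketch of the flattening step, modeled in plain Python. A real Glue job would typically use the Relationalize transform or a Spark job for this, and the sample record shape is hypothetical:

```python
import json

def flatten(record, prefix=""):
    """Flatten nested JSON into dot-separated columns -- the kind of flat,
    tabular shape a Glue ETL job produces before writing Parquet."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

raw = ('{"metric": "CPUUtilization", '
       '"dimensions": {"InstanceId": "i-0abc", "Region": "us-west-2"}, '
       '"value": 71.5}')
row = flatten(json.loads(raw))
print(row)
```

Once records are flat, each dot-separated name becomes a Parquet column, so Athena can project and filter on nested fields without parsing JSON at query time.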
 

Managing AWS Glue Crawler Costs

AWS Glue Crawler is a powerful tool for automatically scanning and cataloging data in S3, but its usage comes with a cost. In large-scale operations, the continuous need to update database partitions and schemas can result in high Glue Crawler expenses. One way to mitigate this is to create custom ETL logic that updates partitions without relying entirely on the Glue Crawler.

By doing so, you can slash Glue Crawler costs, streamlining processes while ensuring that the schema remains up-to-date. This solution also improves performance by avoiding redundant scans of the same data, enhancing both cost efficiency and operational speed.
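One way to sketch that custom partition logic: derive the distinct partition value tuples directly from the S3 keys your ETL job just wrote, then register only those with the Glue Data Catalog (for example via boto3's glue.batch_create_partition) instead of re-crawling the whole table. The key layout below is hypothetical:

```python
def partitions_from_keys(keys):
    """Derive the distinct Hive-style partition value tuples present in a
    batch of S3 keys. In a real job, these tuples would be registered with
    the Glue Data Catalog (e.g. glue.batch_create_partition) rather than
    discovered by running a crawler over the entire table."""
    found = set()
    for key in keys:
        values = tuple(
            seg.split("=", 1)[1] for seg in key.split("/") if "=" in seg
        )
        if values:
            found.add(values)
    return sorted(found)

keys = [
    "metrics/dt=2024-11-01/hour=00/part-0000.parquet",
    "metrics/dt=2024-11-01/hour=00/part-0001.parquet",
    "metrics/dt=2024-11-01/hour=01/part-0000.parquet",
]
print(partitions_from_keys(keys))
```

Because the ETL job already knows exactly which keys it wrote, this registers new partitions in a single cheap API call per batch, with no crawler run at all.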
 

Achieving Cost-Efficient Scalability of Large Datasets on AWS


Applying best practices for attaining long-term cost efficiency in AWS requires a proactive approach. One crucial practice is the regular auditing of cloud usage and identifying areas of waste and inefficiency.

Additionally, you may need to consider options beyond S3 and Athena. This strategic direction often includes exploring Amazon Redshift, especially if you’re experiencing any of the following scenarios:

  • Increased data volume and complexity
  • Need for low-latency queries
  • Data warehousing and BI reporting

As businesses grow and their data requirements become more complex, transitioning from S3 and Athena to Redshift can provide enhanced performance, scalability, and advanced analytics capabilities.
 

Scale and Stabilize Costs with Amazon Redshift

As data volumes grow and require more complex queries, Redshift becomes more efficient than Athena. It is optimized for large datasets with complex joins and aggregations, offering better performance for data warehousing tasks.

While Athena offers a cost-effective solution for ad-hoc queries on datasets stored in S3, Amazon Redshift is designed to handle high-performance, large-scale analytics workloads:

  • Improved performance for complex queries: Redshift is ideal for use cases that require low-latency responses. Its Massively Parallel Processing (MPP) architecture allows for the rapid execution of queries that span billions of rows, outperforming Athena in scenarios where query performance is critical.
     
  • Zero-ETL integrations: Supporting zero-ETL integrations with other data sources like Amazon Aurora and DynamoDB, Redshift allows for seamless data synchronization and reduces the complexity of managing ETL processes.
     
  • Scalable storage and compute: Redshift Spectrum lets you query data in place in S3, separating storage from compute so you can scale each based on demand. This means you can store large data volumes cost-effectively in S3 while using Redshift for complex data processing.
     
  • Enhanced data security and compliance: With advanced security features like data encryption, IAM integration, and fine-grained access control, Redshift is suitable for businesses with strict compliance requirements.
     

For organizations that start with S3 and Athena due to their flexibility and low cost, transitioning to Amazon Redshift as data needs grow can provide the performance and scalability necessary for advanced analytics, making it a logical next step in a data strategy.
 

Conclusion


Managing large datasets on AWS requires a careful balance between scalability and cost efficiency. With tools like S3, Athena, and Glue, organizations have the capabilities to handle vast amounts of data, but without proper optimization, costs can spiral out of control. 

By applying targeted strategies—such as consolidating files in Kinesis, partitioning data, and utilizing Parquet formats—businesses can significantly reduce their cloud expenses while maintaining high performance and scalability.

Of course, foundational cost optimization techniques, from rightsizing to reservations, also play an essential role in managing costs. 

Seeking better visibility into your AWS cost drivers? Easily track and monitor your compute, storage, and network costs with Kalos by Stratus10. Leverage Kalos AI to deliver tailored recommendations to cost-optimize your unique infrastructure. Try Kalos free today >>
 

Published November 2024

Get in Touch

Connect with Stratus10's AWS experts on how to set up your infrastructure to handle your large datasets. Schedule a free consultation today.