Bringing more to the table: How Amazon S3 Tables rapidly delivered new capabilities in the first 5 months

Amazon S3 redefined data storage when it launched as the first generally available AWS service in 2006 to deliver highly reliable, durable, secure, low-latency storage with virtually unlimited scale. While designed to deliver simple storage, S3 has proven capable of handling the explosive growth of data we have seen in the last 19 years. Just 10 years ago, fewer than 100 S3 customers were storing 1PB+ of data. Today there are thousands of customers storing more than that, and in fact, there are individual customers storing exabytes of data. It’s something most of us take for granted, but S3 has taken away the challenges of scaling storage for customers, while doing it cost-effectively, durably, and securely.

Scalability, performance, cost-effectiveness, ease of use, and durability are some of the reasons why S3 is the foundation for more than a million data lakes that power latency-sensitive applications like interactive data analytics, financial modeling, real-time advertising, and AI. While one of the best compliments we regularly hear is “S3 just works”, we constantly ask, “How do we make S3 just work … better?”

Working closely with customers revealed a need: make S3 more powerful for analytics workloads. Apache Parquet has become the preferred format for large datasets, many S3 customers store millions or even billions of Parquet files, and Apache Iceberg has emerged as the most popular solution for managing those files. Together, these trends gave us an opportunity to simplify data lake management.

While Iceberg provides a powerful table format that enables transactional consistency and SQL querying across massive datasets, managing it at scale creates operational complexity. It requires dedicated teams to build custom systems that keep tables optimized for cost and performance, which demands specialized expertise that many organizations lack.

That’s why we launched Amazon S3 Tables at AWS re:Invent 2024. S3 Tables introduced purpose-built tabular storage and a new bucket type for Iceberg tables that makes it simple to store structured data in S3. S3 Tables automatically handles maintenance tasks like compaction, snapshot management, and unreferenced file removal, so you get continuously optimized query performance and cost, even as your data lake scales.

The momentum since launch has been nothing short of extraordinary. In five months, the S3 team has driven innovation by responding directly to customer feedback. For example, S3 Tables expanded from 3 to 30 AWS Regions, launched powerful new capabilities, introduced a migration solution, and built integrations with both AWS and third-party analytics services.

In this post, I recap key launches in S3 Tables and how you can use them in your analytics workflows.


Updates since launch

Seamless integration across AWS and third-party analytics applications

S3 Tables integrated with Amazon SageMaker Lakehouse to provide unified S3 Tables data access across various analytics engines and tools. With this integration, you can access SageMaker Lakehouse from Amazon SageMaker Unified Studio, a single data and AI development environment that brings together functionality and tools from AWS analytics and AI/ML services. All S3 Tables data integrated with SageMaker Lakehouse can be queried from SageMaker Unified Studio and engines such as Amazon Athena, Amazon EMR, and Amazon Redshift, as well as Apache Iceberg-compatible engines like Apache Spark, Trino, or PyIceberg. This simplifies building secure analytics workflows where you can read and write to S3 Tables and join them with data in Redshift data warehouses and third-party and federated data sources, such as Amazon DynamoDB or PostgreSQL. This unified data management experience lets you analyze data using a variety of AWS and third-party query engines and applications, while managing security through centralized, fine-grained permissions in SageMaker Unified Studio. Read the blog post for more information.
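
To give a concrete sense of the workflow, here is a minimal boto3 sketch that runs an Athena query against a table bucket surfaced through this integration. The federated catalog identifier ("s3tablescatalog/my-table-bucket"), namespace, table name, and query results location are placeholder assumptions; adapt them to your environment.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# The catalog identifier below reflects how table buckets surface through the
# Glue Data Catalog / SageMaker Lakehouse integration; the catalog, namespace,
# table, and results bucket names are assumptions for illustration.
query = """
SELECT *
FROM "s3tablescatalog/my-table-bucket"."my_namespace"."my_table"
LIMIT 10
"""

run = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)

# Poll until the query finishes, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=run["QueryExecutionId"])
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if status == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=run["QueryExecutionId"])
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```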

Access S3 Tables using the Apache Iceberg REST Catalog standard from any compatible engine

S3 Tables added table management APIs that are compatible with the Apache Iceberg REST Catalog standard, enabling you to use any Iceberg-compatible query engine (e.g., Spark, Trino, PyIceberg, or DuckDB) to access tabular data directly from S3 Tables. The S3 Tables Iceberg REST endpoint can be used to access tables from AWS Partner Network (APN) catalog implementations or custom catalog implementations, or if you only need basic read/write access to a single table bucket. With an ever-growing community of applications supporting Iceberg, these APIs make it easier to integrate your preferred applications at every step of your data pipeline. Read the documentation to get started.
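
As a quick illustration, the following PyIceberg sketch connects to the S3 Tables Iceberg REST endpoint with SigV4 signing and reads a table. The endpoint URL, table bucket ARN, namespace, and table name are assumptions for illustration; check the documentation for the exact values in your Region.

```python
from pyiceberg.catalog import load_catalog

# Endpoint, signing details, and the table bucket ARN below are assumptions;
# confirm them against the S3 Tables documentation for your Region.
catalog = load_catalog(
    "s3tables",
    **{
        "type": "rest",
        "uri": "https://s3tables.us-east-1.amazonaws.com/iceberg",
        "warehouse": "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": "us-east-1",
    },
)

# Load a table and scan a few rows with the Iceberg-native reader.
table = catalog.load_table("my_namespace.my_table")
print(table.scan(limit=10).to_arrow())
```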

It’s easier to get started from the S3 console with Athena

We’ve simplified getting started with S3 Tables through the S3 console. You can create tables, populate them with data, and query them using Athena, all within the S3 console. This integration also enables automatic data discovery across AWS analytics services, so it’s easier than ever to query new or existing table buckets.

Enhanced schema definition capabilities

We’ve added schema definition support to the CreateTable API, enabling you to easily create a table with its complete schema through CLI commands without having to spin up an Iceberg-compatible engine. After a table is created with its schema, you can begin streaming transactional, log, or other data from various sources like Apache Kafka, Apache Flink, and Amazon Data Firehose. This streamlined workflow helps you build data infrastructure more efficiently while maintaining precise control over table structures.
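
For example, here is a minimal boto3 sketch that creates an Iceberg table with its schema in a single call. The metadata shape mirrors the schema support described above, and the table bucket ARN, namespace, and field definitions are placeholder assumptions (the namespace is assumed to already exist).

```python
import boto3

s3tables = boto3.client("s3tables", region_name="us-east-1")

# Table bucket ARN, namespace, and field definitions are placeholders; the
# metadata shape follows the CreateTable schema support described above.
response = s3tables.create_table(
    tableBucketARN="arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket",
    namespace="my_namespace",
    name="web_events",
    format="ICEBERG",
    metadata={
        "iceberg": {
            "schema": {
                "fields": [
                    {"name": "event_id", "type": "string", "required": True},
                    {"name": "event_time", "type": "timestamp"},
                    {"name": "user_id", "type": "long"},
                    {"name": "payload", "type": "string"},
                ]
            }
        }
    },
)
print(response)
```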

Scaled table quota

We have significantly increased the scalability of S3 Tables by supporting the creation of up to 10,000 tables within each table bucket. This means that data teams can scale up to 100,000 tables across 10 table buckets within a single AWS Region and AWS account. This enhancement allows organizations to manage growing data needs with greater efficiency and flexibility.

Guidance for migrating tabular data from S3 to S3 Tables

This solution guidance demonstrates how to migrate tabular data from general purpose S3 buckets to S3 Tables. It shows you how to set up an automated migration process for moving Apache Iceberg and Apache Hive tables using AWS Step Functions, Amazon EMR, and the AWS Glue Data Catalog. After the migration, you benefit from improved query performance and cost savings.
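
As a simplified illustration of the kind of copy the automated process performs, the following PySpark sketch moves an existing Glue-cataloged table into a table bucket with a single CREATE TABLE AS SELECT. The catalog class and configuration keys follow the open source Amazon S3 Tables catalog for Apache Iceberg, and the table bucket ARN, namespace, and source table names are assumptions to adapt to your environment.

```python
from pyspark.sql import SparkSession

# Requires the Iceberg Spark runtime and the Amazon S3 Tables catalog JARs on
# the classpath; the bucket ARN and table names below are placeholders.
spark = (
    SparkSession.builder.appName("migrate-to-s3-tables")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog")
    .config(
        "spark.sql.catalog.s3tablesbucket.catalog-impl",
        "software.amazon.s3tables.iceberg.S3TablesCatalog",
    )
    .config(
        "spark.sql.catalog.s3tablesbucket.warehouse",
        "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket",
    )
    .getOrCreate()
)

# Copy an existing Hive/Glue table from a general purpose bucket into a new
# Iceberg table in the table bucket with a single CREATE TABLE AS SELECT.
spark.sql("""
    CREATE TABLE s3tablesbucket.my_namespace.orders
    USING iceberg
    AS SELECT * FROM source_db.orders
""")
```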

Server-side encryption using AWS KMS

S3 Tables now offer enhanced encryption options with support for server-side encryption using AWS Key Management Service (SSE-KMS) customer managed keys. While tables are encrypted by default using S3 managed keys, you can now use your own KMS keys for specific tables or entire table buckets. This feature enables better compliance with regulatory requirements, includes S3 Bucket Keys for cost efficiency, and provides AWS CloudTrail logging for security auditing.
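
For instance, here is a minimal boto3 sketch that creates a table bucket encrypted with a customer managed KMS key. The bucket name and key ARN are placeholders, and the encryptionConfiguration shape follows the SSE-KMS support described above.

```python
import boto3

s3tables = boto3.client("s3tables", region_name="us-east-1")

# Bucket name and KMS key ARN are placeholders; the encryptionConfiguration
# shape follows the SSE-KMS support described above.
response = s3tables.create_table_bucket(
    name="my-encrypted-table-bucket",
    encryptionConfiguration={
        "sseAlgorithm": "aws:kms",
        "kmsKeyArn": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    },
)
print(response["arn"])
```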

Regional availability

S3 Tables are now available in 30 AWS Regions, with more coming soon. Check the documentation for the current list of supported AWS Regions.

We are listening and delivering

We are continuously gathering feedback from customers and partners to enhance S3 Tables. By incorporating these valuable insights, we’re improving S3 Tables’ performance for data lake workloads.

Many customers are using S3 Tables to scale their production workloads. Genesys, a global cloud leader in AI-powered experience orchestration, highlights how S3 Tables’ managed Iceberg support simplifies their complex data workflows while boosting performance. At Pendulum, where they analyze data from hundreds of millions of social channels, S3 Tables have transformed their data lake management by automating critical maintenance tasks, allowing their team to focus on deriving actionable insights. Healthcare technology provider Zus Health emphasizes how S3 Tables’ managed optimization capabilities are particularly valuable for handling frequently changing patient data, while SnapLogic notes how the feature helps companies optimize analytics costs while maintaining regulatory compliance.

Based on customer demand, we are also working with partners to build seamless integrations. Support for Apache Iceberg REST APIs enables straightforward interoperability with Dremio and DuckDB. Snowflake highlights how their customers can now read and process S3 Tables data with remarkable simplicity, while StreamNative emphasizes how the integration makes real-time, AI-ready data more accessible and cost-effective. Partners across the spectrum, from Starburst to PuppyGraph, are using S3 Tables to enhance their offerings in areas ranging from graph analytics to industrial DataOps, demonstrating S3 Tables’ versatility in supporting diverse use cases and workloads.

Conclusion

The rapid evolution of Amazon S3 Tables demonstrates our commitment to simplifying data lake management while enabling powerful analytics capabilities. These improvements are already helping organizations across industries unlock new insights from their tabular data. We’re excited to continue innovating based on your feedback – stay tuned for more developments!

