pg_mooncake v0.1, out now

wth are S3 tables

Zhou Sun
s3
Iceberg

Amazon announced table buckets during re:Invent last year.

An S3 bucket that's also a table? What does that mean?

TLDR: S3 tables allow Iceberg tables to exist without a catalog.

Let me explain.

Iceberg vs Delta Lake

What? I thought Iceberg and Delta are just table formats? Why are we comparing how they're implemented?

A key difference between Iceberg and Delta Lake is that "a Delta table can exist without a catalog, but an Iceberg table cannot", due to their metadata designs:

1. Delta Lake relies on the file system to get ACID in the metadata layer.

The metadata layer in _delta_log is an ever-growing, append-only list of JSON files. If a directory contains a Delta table, you just need to read the files in _delta_log one by one, in file-name order, to get the current state of the table.

You don't need any extra information or 'catalog' to read from or write to it.
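A minimal sketch of that idea, assuming a hypothetical local _delta_log directory containing newline-delimited JSON commit files with "add" and "remove" actions (a simplification of the real Delta protocol):

```python
import json
import os

def read_delta_state(table_dir):
    """Reconstruct the set of live data files by replaying _delta_log in order.

    No catalog needed: the file names themselves (zero-padded commit versions
    like 00000000000000000000.json) define the commit order.
    """
    log_dir = os.path.join(table_dir, "_delta_log")
    live = set()
    for name in sorted(os.listdir(log_dir)):  # sort = replay in commit order
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live
```

Everything an engine needs to know is recoverable by listing and reading files, which is exactly the property S3 table buckets internalize for Iceberg.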

2. Iceberg, on the other hand, is storage agnostic and relies on a catalog to get ACID.

The metadata layer is a reverse linked list of snapshot nodes (from SNAPSHOT N+1 you can find SNAPSHOT N, but not vice versa).

You need an external 'catalog' to atomically swap and retrieve the pointer to the latest snapshot.

Wait, what if the file system itself could consistently 'remember' the snapshot pointer and atomically swap it? Bingo: that's an S3 table bucket.
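The catalog's whole job, and therefore what a table bucket has to replicate, boils down to a compare-and-swap on one pointer per table. A toy sketch (TinyCatalog is a hypothetical name, not a real API):

```python
import threading

class TinyCatalog:
    """Toy catalog: one atomically-swapped metadata pointer per table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointer = {}  # table name -> current metadata file location

    def current(self, table):
        """Retrieve the latest snapshot pointer (the list head you can't
        otherwise discover, since snapshots only link backwards)."""
        with self._lock:
            return self._pointer.get(table)

    def commit(self, table, expected, new_location):
        """Compare-and-swap: succeed only if nobody committed since we read
        `expected`. This single atomic step is what gives Iceberg ACID."""
        with self._lock:
            if self._pointer.get(table) != expected:
                return False  # a concurrent writer won; re-read and retry
            self._pointer[table] = new_location
            return True
```

A stale writer's commit fails instead of silently clobbering someone else's snapshot, which is the whole point.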

How does writing to S3 table buckets actually work?

I spent some time on the S3 Tables docs so you don't have to. P.S. they're hard to understand; even DeepSeek struggles.

We'll compare writing to an S3 Table with writing through a 'normal' Iceberg catalog:

1. [Both] The engine writes data files (Parquet).

2. [Both] The engine writes Iceberg metadata files.

3. [Catalog Only] Request the catalog to commit.

4. [S3 Table Only] Call the S3 Tables API update-table-metadata-location to perform the commit.

Yep, that's the only difference.
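That last step can be sketched as below, assuming boto3's s3tables client, where get_table returns a versionToken that update_table_metadata_location uses as the compare-and-swap guard (parameter names per the S3 Tables API; the fake-client wiring here is illustrative):

```python
def commit_to_table_bucket(client, bucket_arn, namespace, table, new_metadata_location):
    """Commit new Iceberg metadata to an S3 table bucket.

    Steps 1-2 (writing Parquet data files and Iceberg metadata files) happen
    before this, exactly as with a regular catalog. Only the commit differs.
    """
    # Read the table's current version token.
    info = client.get_table(
        tableBucketARN=bucket_arn, namespace=namespace, name=table
    )
    # Atomically swap the metadata pointer. If another writer committed
    # in between, the stale versionToken makes this call fail (a conflict
    # the caller should handle by re-reading and retrying).
    return client.update_table_metadata_location(
        tableBucketARN=bucket_arn,
        namespace=namespace,
        name=table,
        versionToken=info["versionToken"],
        metadataLocation=new_metadata_location,
    )
```

The bucket itself is playing the catalog's compare-and-swap role from step 3.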

What are the benefits of S3 tables then?

I am sure Amazon is doing a lot of work on performance (both on compaction and just bucket layout).

But the functionality difference is:

S3 tables allow you to read and write any file in the bucket, but not list files.

I thought it was dumb at first, but it actually makes sense.

It means the only way to 'probe' a table bucket is to read the metadata-location the bucket itself reports. So by design, any user can only read files that are part of the table, because there is no way to discover any other file.

So, who should use S3 tables?

I think the use cases might be very different from what people think.

For an enterprise-scale lakehouse, it actually makes sense to have a proper centralized external catalog and table maintenance services. And S3 Tables are still more expensive than regular S3.

I don't see much use of S3 tables there.

To me, S3 Tables are great for smaller teams and lightweight analytics, potentially embedded into other workflows. They give you Iceberg without managing an external catalog.

Here's my take: the first engine to write to S3 Tables is Spark, and the second will be Postgres.

🥮