The data world, my friends, is in a fascinating state of flux. For years, we've chased the elusive "single source of truth," battling data silos and wrestling with the inherent trade-offs between the flexibility of data lakes and the robust governance of data warehouses. But recent developments, particularly in the realm of open table formats and the platforms embracing them, tell a compelling story: the lakehouse vision is not just coming into focus, it's becoming a deeply practical, sturdy reality.
Having spent countless hours digging into the latest iterations, testing the boundaries, and yes, occasionally banging my head against the wall, I'm genuinely excited about where we are. This isn't about marketing fluff; it's about tangible engineering advancements that are fundamentally changing how we build and operate data platforms. Let's dive deep into what's truly shaping the open data stack right now.
The Lakehouse Paradigm Shift: From Vision to Production Reality
The conceptual allure of the lakehouse has always been strong: combine the vast, cost-effective storage of data lakes with the ACID transactions, schema enforcement, and governance capabilities of data warehouses. For a long time, this was easier said than done. But the maturation of open table formats, especially Apache Iceberg, has been the linchpin in making this architectural pattern not just viable, but efficient and genuinely practical for production workloads.
The core problem open table formats solve is bringing transactional integrity and metadata intelligence to files stored in cloud object storage. Without them, your data lake is, well, just a pile of files. It's a Wild West of ad-hoc schema changes, inconsistent reads, and manual data management nightmares. Iceberg, Delta Lake, and Hudi have transformed this by introducing a rich metadata layer that tracks file manifests, snapshots, and schema evolution, enabling atomicity, consistency, isolation, and durability (ACID) properties directly on your S3, ADLS, or GCS buckets. This is genuinely impressive because it means we no longer need to move data between systems for different workloads; the same data can power batch analytics, real-time dashboards, and machine learning models, all with consistent semantics and strong data quality guarantees.
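To ground this, here's a minimal Spark SQL sketch; the catalog, namespace, and column names ('lake', 'analytics', and so on) are illustrative, and it assumes an Iceberg catalog already configured against your object store:
-- The data files land in S3/ADLS/GCS, but Iceberg's metadata layer gives the
-- table ACID commits, schema enforcement, and hidden partitioning.
CREATE TABLE lake.analytics.page_views (
  user_id  BIGINT,
  url      STRING,
  event_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts));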
Apache Iceberg's Ascent: Performance and Specification Evolution
Apache Iceberg continues its relentless march towards becoming the open table format standard. The project's focus in late 2025 and early 2026 has been on solidifying its core capabilities, enhancing performance, and expanding its specification to handle increasingly complex data types and workloads. We've seen significant strides in how Iceberg manages underlying data files to optimize query performance, moving beyond basic partitioning to more sophisticated techniques. This work is a key component of the DuckDB, Arrow, and Parquet analytical stack that many teams are adopting for 2026.
One of the most notable recent advancements is the introduction of deletion vectors in Iceberg Format Version 3. This is a big deal for mutable data. Previously, row-level deletes or updates often necessitated rewriting entire data files, which could be resource-intensive and lead to write amplification, especially in high-churn scenarios. With deletion vectors, Iceberg can track deleted rows without immediately rewriting the base data files. Instead, it maintains a separate, small file (the deletion vector) that marks specific row positions as deleted. Query engines then apply these deletion vectors at read time. This mechanism significantly improves the efficiency of DELETE and UPDATE operations, making Iceberg tables much more performant for transactional workloads that frequently modify existing records.
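To illustrate, here's a hedged Spark SQL sketch; the table name is illustrative, and it assumes a runtime recent enough to write Format Version 3 tables with merge-on-read deletes:
-- Opt the table into format v3 and merge-on-read, so deletes are tracked as
-- delete files/deletion vectors rather than forcing data-file rewrites.
ALTER TABLE my_db.event_logs SET TBLPROPERTIES (
  'format-version'    = '3',
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read'
);

-- Readers apply the resulting deletes at scan time; the base files are untouched.
DELETE FROM my_db.event_logs WHERE user_id = 42;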
Furthermore, Format Version 3 has also expanded type support, notably including a VARIANT type for semi-structured data and GEOSPATIAL types. This is crucial for handling the increasingly diverse data payloads we encounter, especially from streaming sources or API integrations.
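As a hedged sketch (the table is hypothetical, and variant support depends on your engine and Iceberg runtime versions), a V3 table with a semi-structured column might look like this:
CREATE TABLE my_db.api_events (
  event_id BIGINT,
  payload  VARIANT   -- semi-structured request/response bodies
)
USING iceberg
TBLPROPERTIES ('format-version' = '3');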
Scan Planning via REST Catalog: A Game Changer for Interoperability
I've been waiting for this: the evolution of the Iceberg REST Catalog specification to include a Scan Planning endpoint. This is a fundamental shift in how query engines interact with Iceberg tables and promises to dramatically improve interoperability and performance across the ecosystem.
With the Scan Planning endpoint, the responsibility for generating the list of files to be scanned can be delegated to the catalog itself. This opens up incredible possibilities:
- Optimized Scan Planning with Caching: The catalog can cache frequently accessed scan plans, reducing redundant metadata reads.
- Enhanced Interoperability: By centralizing scan planning, the catalog can potentially bridge different table formats.
- Decoupling: It further decouples the query engine from the intricacies of the table format's physical layout.
Snowflake's Hybrid Play: Unistore and First-Class Iceberg Tables
Snowflake, a traditional data warehouse powerhouse, has made truly impressive moves to embrace the open lakehouse. Initially, Snowflake's support for Iceberg was primarily through external tables. That has changed significantly. In a major development, Snowflake announced full DML support (INSERT, UPDATE, DELETE, MERGE) for externally managed Iceberg tables via catalog-linked databases and the Iceberg REST Catalog.
But the real showstopper is the introduction of Hybrid Tables as part of their Unistore initiative. This is genuinely impressive because it blurs the lines between OLTP and OLAP within a single platform. Hybrid tables are optimized for both low-latency, high-throughput transactional workloads and analytical queries.
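As a hedged sketch in Snowflake SQL (table and column names are illustrative; hybrid tables require a primary key):
CREATE HYBRID TABLE orders (
  order_id    NUMBER(38,0) PRIMARY KEY,  -- backs fast point lookups
  customer_id NUMBER(38,0),
  status      VARCHAR(16),
  updated_at  TIMESTAMP_NTZ
);

-- Low-latency transactional write against the row store...
UPDATE orders SET status = 'SHIPPED' WHERE order_id = 1001;

-- ...and analytical aggregation over the same table.
SELECT status, COUNT(*) AS order_count FROM orders GROUP BY status;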
The technical nuance here lies in their dual-storage architecture:
- Row-based storage: Primarily used for transactional applications, offering fast retrieval and modification of individual rows.
- Columnar storage: Used for analytical queries, optimized for data aggregation and large scans.
To integrate with external Iceberg, Snowflake uses new account-level objects: EXTERNAL VOLUME and CATALOG INTEGRATION. While this integration is robust, a minor quibble remains: for externally managed Iceberg tables, if Snowflake isn't the primary catalog, you still need to manage metadata refreshes carefully.
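To make the plumbing concrete, here's a hedged sketch; every name, bucket, ARN, and metadata path below is a placeholder, and the exact parameters (especially for REST-based catalog integrations) should be verified against Snowflake's current documentation:
-- Where the Iceberg data and metadata files live.
CREATE EXTERNAL VOLUME iceberg_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'lake-us-east-1'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-lakehouse-bucket/warehouse/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-iceberg-access'
    )
  );

-- How Snowflake reads table metadata (an object-store catalog is shown; a REST
-- catalog integration takes connection and authentication details instead).
CREATE CATALOG INTEGRATION lake_catalog
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;

-- Register an externally managed Iceberg table against those objects, then
-- refresh explicitly when the external catalog commits new metadata.
CREATE ICEBERG TABLE analytics.public.event_logs
  EXTERNAL_VOLUME = 'iceberg_vol'
  CATALOG = 'lake_catalog'
  METADATA_FILE_PATH = 'event_logs/metadata/v12.metadata.json';

ALTER ICEBERG TABLE analytics.public.event_logs
  REFRESH 'event_logs/metadata/v13.metadata.json';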
Databricks and the Open Lakehouse: Unity Catalog's Iceberg Embrace
Databricks, the originator of Delta Lake, has made significant strides in embracing Apache Iceberg, particularly through its Unity Catalog. This isn't just about coexistence; it's about deep integration and providing a unified governance layer across formats.
A major announcement was the Public Preview of full Apache Iceberg support in Databricks, enabling read and write operations for Managed Iceberg tables via Unity Catalog's implementation of the Iceberg REST Catalog API.
The configuration for connecting a Spark client to Unity Catalog as an Iceberg REST Catalog typically involves setting Spark properties like:
spark.sql.catalog.<catalog-name> = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog-name>.catalog-impl = org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.<catalog-name>.uri = <databricks-workspace-url>/api/2.1/unity-catalog/iceberg-rest
spark.sql.catalog.<catalog-name>.credential = <oauth_client_id>:<oauth_client_secret>
spark.sql.catalog.<catalog-name>.warehouse = <uc-catalog-name>
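Once those properties are in place, the Unity Catalog endpoint behaves like any other Iceberg REST catalog from Spark's perspective; a hedged sketch with hypothetical schema and table names:
-- Read and write UC-managed Iceberg tables through the configured catalog.
SELECT COUNT(*) FROM <catalog-name>.analytics.page_views;

INSERT INTO <catalog-name>.analytics.page_views
SELECT * FROM staging_page_views;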
Databricks' UniForm and Cross-Format Reality
Databricks' commitment to interoperability is also evident in UniForm, a feature that allows Delta Lake tables to be read as Iceberg or Hudi tables without data duplication. UniForm essentially generates Iceberg (or Hudi) metadata for an existing Delta Lake table, enabling engines that primarily speak Iceberg to query Delta tables.
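Enabling it is a table-property change on the Delta side; a hedged sketch with an illustrative table name:
-- Ask Databricks to maintain Iceberg metadata alongside the Delta log.
ALTER TABLE main.analytics.orders SET TBLPROPERTIES (
  'delta.enableIcebergCompatV2'          = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);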
While UniForm is a practical solution for enabling read interoperability, it's important to acknowledge the trade-offs. It acts as a metadata translation layer, but it doesn't fundamentally alter the underlying data organization. For instance, advanced Iceberg-specific optimizations or deletion vector capabilities might not be fully leveraged when reading a UniForm-enabled Delta table as Iceberg.
The Unseen Force: Advanced Indexing and Query Optimizers
Beyond the table formats themselves, the major cloud data platforms are relentlessly pushing the boundaries of query performance. For Apache Iceberg, the community is actively exploring advanced indexing capabilities. While Iceberg's metadata already provides file-level statistics for powerful pruning, the addition of features like Bloom filters for high-cardinality columns is a key area of development.
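One example already available today is Iceberg's Parquet Bloom filter table property; a hedged sketch with an illustrative table and column:
-- New data files will carry a Bloom filter for user_id; existing files are
-- unaffected until they are rewritten (e.g., by compaction).
ALTER TABLE my_db.event_logs SET TBLPROPERTIES (
  'write.parquet.bloom-filter-enabled.column.user_id' = 'true'
);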
Snowflake's query engine is being extended to work seamlessly with Iceberg tables, leveraging its existing Search Optimization Service and Query Acceleration Service. Databricks, too, has a suite of query optimization techniques (illustrated briefly after this list) including:
- Cost-Based Optimizer (CBO): Leverages table statistics for efficient execution plans.
- Dynamic File Pruning (DFP): Skips irrelevant data files during query execution based on runtime filters.
- Auto Optimize: Includes Optimized Writes and Auto Compaction to manage file sizes.
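To make the first two items concrete, here's a hedged sketch against a hypothetical Delta table: fresh statistics feed the CBO, and clustering the data improves file skipping:
-- Collect column-level statistics for the cost-based optimizer.
ANALYZE TABLE main.analytics.orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- Compact and co-locate data on a common filter column to maximize pruning.
OPTIMIZE main.analytics.orders ZORDER BY (customer_id);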
Schema Evolution and Data Contracts: Navigating Change with Confidence
One of Iceberg's most celebrated features is its robust and safe schema evolution. Iceberg lets you add, drop, rename, and update column types at the metadata level, so these operations don't require expensive full-table rewrites. Instead of manually altering Parquet files, you issue simple SQL DDL commands:
-- A metadata-only operation: existing data files are not rewritten.
ALTER TABLE my_iceberg_table
ADD COLUMN event_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP();
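The other operations follow the same pattern; a hedged sketch with illustrative column names (type changes must be safe widenings, such as INT to BIGINT):
ALTER TABLE my_iceberg_table RENAME COLUMN event_timestamp TO event_ts;
ALTER TABLE my_iceberg_table ALTER COLUMN user_id TYPE BIGINT;
ALTER TABLE my_iceberg_table DROP COLUMN legacy_flag;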
These changes are transactional and immediately available. However, with great flexibility comes the need for strong governance. Best practices include planning for future growth, using meaningful default values, and maintaining version control for auditing and rollback.
Expert Insight: The Converging Metadata Layer and Streaming-First Future
Looking into 2026 and beyond, my expert prediction is that the open data ecosystem will increasingly gravitate towards a converging metadata layer. Projects like Apache Polaris (currently incubating) are at the forefront of this trend. Polaris aims to be a shared catalog and governance layer for Iceberg tables across multiple query engines.
Furthermore, the shift towards "streaming-first" lakehouses is undeniable. We're moving away from treating streaming as an afterthought to expecting continuous ingestion, processing, and query serving as the default. This demands robust, incremental commits and efficient changelog management.
Reality Check: The Road Ahead and Lingering Quirks
While the advancements are exhilarating, it's crucial to maintain a reality check. The journey to a truly seamless open lakehouse still has its rough edges:
- Interoperability Trade-offs: Fundamental differences between formats mean that perfect, feature-for-feature interoperability isn't always a given.
- Operational Complexity: Managing compaction and expiring snapshots still requires careful planning.
- Snowflake Iceberg Limitations: Snowflake-managed Iceberg tables still have restrictions compared to externally managed ones.
- Catalog Fragmentation: Multiple catalogs still exist, adding a layer of configuration complexity.
Practical Walkthrough: Automating Iceberg Table Maintenance (Compaction)
One of the most critical aspects of maintaining Iceberg table performance is effective file compaction. We'll use Iceberg's rewrite_data_files Spark procedure (built on the RewriteDataFiles action) to consolidate small files into larger ones.
-- Assuming a catalog named 'spark_catalog' is configured for Iceberg.
-- Compact cold partitions into ~512 MB files using the bin-pack strategy.
-- The cutoff timestamp is shown as a literal; in practice your scheduler would
-- compute "now minus two hours" and substitute it into the predicate.
CALL spark_catalog.system.rewrite_data_files(
  table => 'my_db.event_logs',
  strategy => 'binpack',
  where => 'event_hour < TIMESTAMP ''2026-02-10 10:00:00''',
  options => map(
    'target-file-size-bytes', '536870912'  -- 512 MB
  )
);
This rewrite_data_files procedure call is a transactional operation. It reads the specified data files, writes new, larger files, and then atomically commits a new snapshot to the table's metadata. After a successful compaction, you might also consider running the remove_orphan_files maintenance procedure to reclaim storage. The open data stack is no longer just a promise; it's a rapidly evolving reality.
About the Authors
This article was published by the DataFormatHub Editorial Team, a group of developers and data enthusiasts dedicated to making data transformation accessible and private. Our goal is to provide high-quality technical insights alongside our suite of privacy-first developer tools.
