🔍 **The Missing Links in Open Table Formats with Open Source Catalogs**

As the data ecosystem continues to evolve, open table formats like Apache Iceberg, Delta Lake, and Apache Hudi have gained significant traction. However, when paired with open-source catalogs, several critical features are still missing:

1. **RBAC (Role-Based Access Control):** Security is paramount, but many open-source catalogs lack robust RBAC support, making it challenging to control access to sensitive data effectively. Polaris and Unity Catalog are working on this; it is great work in progress, but still far from production ready. An open question is whether RBAC will cover row-level security and not only table-level access, which is a real need for enterprises.
2. **Automatic Maintenance:** Automated tasks like compaction, vacuuming, and optimization are essential for maintaining performance over time. Yet these features are often absent or require complex configuration in open-source catalogs. A question for discussion: will the catalogs provide this themselves, or will they integrate with external engines that trigger these procedures?
3. **Column Masking:** Protecting sensitive data at the column level is crucial, especially in industries dealing with personal or financial data. Unfortunately, straightforward column masking capabilities are missing in these environments.
4. **Schema Evolution Management:** While some open formats handle schema evolution well, open-source catalogs often struggle, leading to potential data inconsistencies.
5. **Governance and Auditing:** Comprehensive data governance and auditing capabilities are still lacking in many open-source tools, making it difficult to track data lineage and ensure compliance.

These missing features highlight areas where the open data community needs to focus as we strive to build more robust and secure data infrastructures. Addressing these gaps will be key to unlocking the full potential of open table formats.

Would love to hear your thoughts.
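To make the RBAC, row-level security, and column masking points concrete, here is a minimal Python sketch of what a catalog-side policy could look like when applied to query results. Everything in it (`TablePolicy`, `apply_policy`, `mask_email`, the sample rows) is hypothetical and for illustration only; it is not the API of Polaris, Unity Catalog, or any other catalog.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class TablePolicy:
    """Hypothetical policy attached to one table for one role."""
    can_select: bool = False                             # table-level grant
    row_filter: Optional[Callable[[Row], bool]] = None   # row-level security
    masked_columns: Dict[str, Callable[[object], object]] = field(default_factory=dict)


def apply_policy(rows: List[Row], policy: TablePolicy) -> List[Row]:
    """Return only the rows and column values the role is allowed to see."""
    if not policy.can_select:
        raise PermissionError("role may not read this table")
    visible = []
    for row in rows:
        # Drop rows the role is not entitled to.
        if policy.row_filter is not None and not policy.row_filter(row):
            continue
        # Replace sensitive values with their masked form.
        visible.append({
            col: policy.masked_columns[col](val) if col in policy.masked_columns else val
            for col, val in row.items()
        })
    return visible


def mask_email(value: object) -> str:
    """Keep the first character and the domain, hide the rest."""
    name, _, domain = str(value).partition("@")
    return name[:1] + "***@" + domain


rows = [
    {"id": 1, "region": "EU", "email": "alice@example.com"},
    {"id": 2, "region": "US", "email": "bob@example.com"},
]
eu_analyst = TablePolicy(
    can_select=True,
    row_filter=lambda r: r["region"] == "EU",
    masked_columns={"email": mask_email},
)
result = apply_policy(rows, eu_analyst)
```

In this sketch the table-level grant, the row filter, and the column mask are three separate checks, which is why table-level RBAC alone does not cover the enterprise cases mentioned above.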
Chango (http://www.cloudchef-labs.com/) supports these Iceberg-related needs well: storage security with RBAC, automatic Iceberg table maintenance, a REST catalog, and query auditing.
RBAC and authZ are crucial and will be among the first features to implement. It is important that the catalog does not become an IdP. Automatic maintenance is rather the job of the query engine: Spark will write the new data files, unless the catalog uses an embedded query engine for this type of task. #DuckDB and #DataFusion are quite interesting approaches here.
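The idea that maintenance runs in the engine rather than in the catalog can be sketched for Iceberg, which exposes maintenance as Spark SQL stored procedures (`rewrite_data_files`, `expire_snapshots`, `remove_orphan_files`). The helper below only builds the statements; the function name, the catalog name `lake`, the table `db.events`, and the seven-day retention are assumptions for illustration, and a scheduler or catalog integration would still have to hand them to an engine.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def maintenance_statements(
    catalog: str,
    table: str,
    retain_days: int = 7,
    now: Optional[datetime] = None,
) -> List[str]:
    """Build Iceberg Spark procedure calls for routine table maintenance."""
    now = now or datetime.now(timezone.utc)
    # Snapshots older than this cutoff become eligible for expiry.
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compaction: rewrite small data files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Drop old snapshots so their data files can be cleaned up.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Remove files no longer referenced by any table metadata.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


statements = maintenance_statements(
    "lake", "db.events", retain_days=7,
    now=datetime(2024, 1, 8, tzinfo=timezone.utc),
)

# With a SparkSession configured for Iceberg (not created here), each
# statement would be executed by the engine, for example:
#     for stmt in statements:
#         spark.sql(stmt)
```

Whoever triggers these calls, the rewritten data files are still produced by the engine, which is the distinction the comment draws.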