Amit Gilad’s Post

Data Engineer / Apache Iceberg Enthusiast

4mo

🔍 **The Missing Links in Open Table Formats with Open Source Catalogs**

As the data ecosystem continues to evolve, open table formats like Apache Iceberg, Delta Lake, and Apache Hudi have gained significant traction. However, when paired with open-source catalogs, several critical features are still missing:

1. **RBAC (Role-Based Access Control):** Security is paramount, but many open-source catalogs lack robust RBAC support, making it challenging to control access to sensitive data effectively. Polaris and Unity are working on this, and it is great work in progress, but it is still far from production ready. Will RBAC include row-level security and not only table-level security? Enterprises need this.
2. **Automatic Maintenance:** Automated tasks like compaction, vacuuming, and optimization are essential for maintaining performance over time. Yet these features are often absent or require complex configuration in open-source catalogs. An open question for discussion: will the catalogs enable this themselves, or will they integrate with external engines that trigger these procedures?
3. **Column Masking:** Protecting sensitive data at the column level is crucial, especially in industries dealing with personal or financial data. Unfortunately, straightforward column masking capabilities are missing in these environments.
4. **Schema Evolution Management:** While some open formats handle schema evolution well, open-source catalogs often struggle, leading to potential data inconsistencies.
5. **Governance and Auditing:** Comprehensive data governance and auditing capabilities are still lacking in many open-source tools, making it difficult to track data lineage and ensure compliance.

These missing features highlight areas where the open data community needs to focus as we strive to build more robust and secure data infrastructures. Addressing these gaps will be key to unlocking the full potential of open table formats. Would love to hear your thoughts.

3 Comments

Viktor Kessler

Unify Lakehouse technology and Data Mesh principles. Build real governance based on actionable data contracts

3mo

RBAC and authZ are crucial and will be among the first features to implement. It is important that the catalog does not become an IdP. Automatic maintenance is rather part of the query engine: Spark will write the new data files unless the catalog uses some embedded query engine for this type of task. #DuckDB and #DataFusion are quite interesting approaches.

1 Reaction

Kidong Lee

Founder of Cloud Chef Labs | Chango | Unified Data Lakehouse Platform | Iceberg centric Data Lakehouses

4mo

Chango (http://www.cloudchef-labs.com/) supports Iceberg-related needs perfectly: storage security with RBAC, automatic Iceberg table maintenance, REST catalog, and query audit.

1 Reaction


More Relevant Posts

Manish Kumar Singh

BI Specialist @BOLD, Ex-Harman, Ex-PWC US | Business Intelligence Specialist | 9+ Years of experience into various roles - BI Lead, ETL Developer, Data Engineer & Data Analyst | Big data | Python | SQL | Azure
6mo
Apache Iceberg - Open Source & Powerful Data Storage

Apache Iceberg is an open-source data table format for data warehouses. It acts like a super-organized filing system, keeping track of your data, eliminating duplicates, and working with various data tools. This makes managing massive datasets efficient and flexible. Note that Apache Iceberg is primarily a table storage format, not a data processing tool.

Let’s understand it this way: imagine you have a warehouse full of boxes containing all sorts of products. Apache Iceberg is like a super-organized system for this warehouse. Here's what it does:

✅ Keeps things labeled -> Iceberg tracks what's in each box (data file or table) and adds labels describing its contents (data format, date added). This makes finding specific information much easier and faster.
✅ No duplicates -> Iceberg ensures you don't have the same product in multiple places, saving space and avoiding confusion.
✅ Plays well with others -> Since everything is well organised, anyone (other tools) can come and access any product (data) without much difficulty.
✅ Keeps track of changes -> If you add or remove boxes (data files), Iceberg remembers these changes, so you can always see the history of your data.

In simple terms, Apache Iceberg helps you manage your massive amounts of data efficiently. But do you know what makes Apache Iceberg stand out from other data management solutions? Let’s go through this as well:

✅ Focus on Open Source -> Unlike many others, Apache Iceberg is entirely open-source.
✅ Flexibility -> Connects seamlessly with various compute layers, giving users more choice in how they analyze and interact with their data.
✅ Focus on Efficiency -> Efficient file management, ACID properties, time travel.

Manish Kumar Singh #apacheiceberg #opendata #bigdata #datamanagement
Sameer Wadkar

Principal Engineer | Distributed Systems Engineering, Data Pipelines, IAM, K8s, Cloud Native Monitoring, MLOps, Feature Store, Large Language Models
2w
The excitement around S3 Tables got me reading about Iceberg. I had the impression that it was just another Hive-like system to support ACID transactions for file-based databases. But Hive LLAP (Live Long and Process) had this capability back in 2017. So what was all the excitement about?

It turns out Iceberg takes it a lot further and makes it reasonably performant (though not to the level of OLTP). It claims to support a lot of use cases (including ETL), and it does. But the real value of Iceberg is in two of its capabilities:

1. It supports time-travel queries efficiently because of how it stores metadata about Parquet files.
2. Evolving schemas and handling partitions and bucketing are a lot more transparent to the user than they were with Hive.

Essentially, its real value proposition is that you can stop paying licensing fees for vendor-based data warehouse products and run your own. For cloud-native architectures and products this is an incredible value proposition.

But there is a caveat (there always is with such products). The flexibility you get from Iceberg, as well as its near-infinite horizontal scalability, comes at a significant operational cost. With vendor-based data warehouse products, that part is simplified significantly. Now you have to own it. While you save on licensing costs, you will now divert your DW operations team to managing operations for Iceberg.

There is also a disaster recovery angle. Because metadata is stored alongside the data files and is decentralized, you cannot afford to let it get corrupted, or your data will be inaccessible. S3 Tables provides some guarantees there due to the natural versioning that S3 supports. But that is a cost which will add up at petabyte scale.

Like most open-source products that have evolved in the data space over the past 15 years, Apache Iceberg requires strong in-house expertise. Managing Iceberg effectively demands deep knowledge of complex software and data architectures, cloud-native compute and storage systems, and robust access control management. While Iceberg offers flexibility and cost savings, vendor-provided data warehouse systems abstract away this complexity, providing fully managed solutions that handle performance optimization, security, and operational overhead out of the box.
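
To make the time-travel point concrete, here is a minimal sketch of the idea in Python. It is a toy model, not Iceberg's actual metadata layout or API: the `Snapshot` class, the `snapshot_as_of` function, and the file names are all hypothetical. It only assumes that table metadata keeps an ordered history of snapshots, each recording which data files made up the table at that moment, so a query "as of" some time just picks the right snapshot instead of scanning or rewriting data.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one entry in a table's snapshot history."""
    snapshot_id: int
    timestamp_ms: int
    data_files: tuple  # names of the data files visible in this snapshot


def snapshot_as_of(snapshots, as_of_ms):
    """Return the latest snapshot committed at or before as_of_ms.

    Assumes `snapshots` is sorted by timestamp_ms in ascending order.
    """
    timestamps = [s.timestamp_ms for s in snapshots]
    idx = bisect_right(timestamps, as_of_ms)
    if idx == 0:
        raise LookupError("no snapshot at or before the requested time")
    return snapshots[idx - 1]


# Illustrative history: each commit produces a new snapshot; old files stay
# on storage until maintenance expires the snapshots that reference them.
history = [
    Snapshot(1, 1000, ("a.parquet",)),
    Snapshot(2, 2000, ("a.parquet", "b.parquet")),
    Snapshot(3, 3000, ("b.parquet", "c.parquet")),
]

selected = snapshot_as_of(history, 2500)
print(selected.snapshot_id, selected.data_files)
```

The lookup touches only metadata, which is why reading an older version is cheap. It is also why the disaster recovery caveat matters: if the snapshot history is lost, nothing says which files belong to the table.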

Alex Campos

Digital Data Strategist
5mo Edited
Lakehouse is gaining wider adoption and support from vendors and technology providers. Help yourself to this "Lakehouse Landscape 2024", which I developed to summarize the current state of the art of the different components and flavors you can choose from to evolve your Data Strategy.

💡 Format: Open source is predominant, and Iceberg, Hudi, and Delta Lake are leading the enterprise solutions. XTable is worth mentioning for cross-table interoperability.
💡 Catalog: As Lakehouse is a truly decoupled architecture, catalogs play a crucial role in maintaining the current table state and ensuring smooth operability with engines and tools, making them an integral part of the system's functionality.
💡 Data: In the Lakehouse landscape, the actual business data is compressed for performance purposes and, again, interoperability. Parquet, a popular columnar file format, leads the race and is preferred for common use cases.
💡 Engines/Tools: The variety of tools and engines spans data ingestion, processing, integrations, ad-hoc querying, and more. Most of these tools were also very popular in the Data Lake era, and many of them are ready to work with Lakehouse.
💡 Vendors: Enterprise support and world-class software are required for most industries, and Data Lake incumbents and new players, alongside cloud service providers, are making their space in the Lakehouse era.

🌎 Notably, open source and interoperability are predominant in the Lakehouse landscape, avoiding vendor lock-in and providing freedom to use the company's preferred tools.

Read more here: https://lnkd.in/dnvSCyQB
#Lakehouse #DataLakehouse #ApacheIceberg Apache Iceberg Apache Hudi Delta Lake
Bruno Laï
7mo
In today's data landscape, open table formats are becoming increasingly popular. According to recent statistics, 51% of organizations have adopted Delta tables and 27% have adopted Apache Iceberg. Teradata's agnostic #OTF support and open catalog integration are designed to enable the platform to read various catalogs with predictable execution. The platform also boasts a unique implementation and approach to parallel processing, workload management, and query optimization of shared data, delivering the best performance in the market. Learn more about Teradata's embrace of open table formats and why it's the platform of choice for organizations looking for trusted solutions. #trustedA https://lnkd.in/eiPkKbxh

Teradata Embraces Open Table Formats, Iceberg and Delta Lake

teradata.com
Fabian Pascal

Editor & Publisher DATABASE DEBUNKINGS, Data and Relational Fundamentalist, Consultant, Analyst, Author, Educator, Speaker
5mo
New paper in the PRACTICAL DATABASE FOUNDATION series

“Facts about reality are statements that are unequivocally true or false, regardless of whether we know which is the case, or not. Guaranteeing that only facts and no falsehoods are represented in databases—and, therefore, query results are provably logically correct with respect to the real world—requires that authorized users assert only what they know to be true in the real world, and do not assert what they know to be false. If knowledge is imperfect and they do not know whether certain facts are true or false, there is no sound basis on which the DBMS can accept tuples in the database, or reject them. Either way could be wrong, with loss of data integrity and of provability of logical correctness. Simplicity and intuitiveness are also reduced. Neither the old default values treated as exceptions by applications, nor many-valued logic schemes such as SQL's flawed 3VL based on NULLs resolve correctly and cost-effectively the problem of missing data. Data users and professionals tend to focus on data representation and miss the costly implications for data manipulation, integrity enforcement and most other aspects of database management.”

Available to order from the PAPERS page @DBDebunk.com

Table of Contents
Introduction
1. “Inapplicable Data”: Nothing's Missing
2. Missing Data: Into the Unknown
3. SQL NULL: What-Valued Logic?
4. Known Unknowns: Metadata
5. A Relational Solution
5.1. The Practicality of Theory
5.2. A Real World Example
5.3. Relation Proliferation
6. Questions/Comments/Objections
Conclusion
Christin Nataly

Enterprise Sales Director, APJ at SingleStore
1mo
SingleStore's database branching is like a Swiss Army knife for data management, boosting developer productivity, mitigating risks and enabling seamless testing environments. 🌿 Check out our new blog post and learn how to boost developer productivity!

SingleStore Database Branching: How to Boost Developer Productivity

singlestore.com
The New Stack

20,130 followers
4mo Edited
As the race to build out data capabilities picks up, organizations find that DBaaS convenience comes with unexpected costs. https://lnkd.in/gMZZCfbD #CloudNative #Database by Ann Schlemmer thanks to Percona

Database as a Service: The Hidden Cost of Convenience

https://thenewstack.io
Vedran B.

Junior DBA and Full-Stack Developer
2mo Edited
So, the new PostgreSQL 17 has optimized IN filters, for example id IN (1, 2, 3). Quote:

> During lookups, a B-tree is scanned, with Postgres descending down through its hierarchy from the root until it finds a target value on one of its leaf pages. Previously, multi-value lookups like id IN (1, 2, 3) or id = any(1, 2, 3) would require that process be repeated multiple times, once for each of the requested values.

If you're using a pre-17 version like everyone else at this point, a way around this performance limitation is to use arrays like this: id = any(array[1, 2, 3])

You're welcome.

Real World Performance Gains With Postgres 17 B-tree Bulk Scans | Crunchy Data Blog

crunchydata.com
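
From application code, the array form mentioned in the post is usually written with a single bound parameter instead of one placeholder per value. Below is a minimal Python sketch of that pattern. The helper `build_any_lookup` and the `orders` table are hypothetical, and the execution step is left as a comment because it assumes a psycopg-style driver, which adapts a Python list to a Postgres array. Whether this form is actually faster on a given pre-17 version is something to measure rather than assume.

```python
from typing import Iterable, List, Tuple


def build_any_lookup(table: str, column: str, values: Iterable[int]) -> Tuple[str, Tuple[List[int]]]:
    """Build a parameterized multi-value lookup using `= ANY(<array>)`.

    Identifiers are validated because they are formatted into the SQL text;
    the values themselves are passed as one bound parameter.
    """
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f"SELECT * FROM {table} WHERE {column} = ANY(%s)"
    return sql, (list(values),)


sql, params = build_any_lookup("orders", "id", [1, 2, 3])
print(sql)
print(params)

# Assumption: with a psycopg connection `conn`, the query would be run as
#   conn.cursor().execute(sql, params)
# and the list in `params` would be sent to Postgres as an array.
```

One practical benefit of the single-parameter form is that the SQL text stays the same no matter how many ids are passed.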
Priyansh Khodiyar

DevRel @Datazip | @Vaquill
1mo
Can Debezium Lose Events? Yes it can.

Debezium, by design, should capture every change (insert, update, delete) from your database and send it downstream. It operates on an “at-least-once” basis, meaning duplicate events can happen if the connector shuts down improperly, but missing events would be a serious issue, marked as a bug to fix ASAP.

When Might Event Loss Happen?

Event loss can occur if part of the database’s transaction log gets deleted before Debezium captures it. This usually happens in two cases:

Connector Downtime: If Debezium isn’t running for a while, and the transaction log reaches its max retention time, parts of it may get discarded. For example, in MySQL, the log retention can be set with binlog_expire_logs_seconds, and in SQL Server, it's managed with the CDC job.

Disk Space Limit in Postgres: Postgres handles things differently using replication slots to retain logs until Debezium reads them. But if the connector is down for too long, these logs pile up, using more disk space. In extreme cases, this could lead to a full disk.

Preventing Event Loss

Log Retention Configuration: Adjust transaction log retention times if you expect any downtime for your Debezium connectors.

Replication Slot Control (Postgres): Set a max size for WAL retention to prevent disk space issues.

Monitoring: Set up alerts (like from the Kafka Connect REST API) to notify you if Debezium is down for too long.
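
The connector-downtime case comes down to comparing how long the connector has been down with how long the database keeps its transaction log. Here is a minimal Python sketch of that check. The function name, the status labels, and the `warn_at` threshold are hypothetical; in practice the retention value would come from your database configuration (for example the MySQL `binlog_expire_logs_seconds` setting mentioned above) and the downtime from your own monitoring.

```python
def classify_downtime(retention_seconds: int, downtime_seconds: int, warn_at: float = 0.5) -> str:
    """Classify connector downtime against the log retention window.

    Returns "event-loss-possible" once downtime reaches the retention window,
    "warn" once it has used at least `warn_at` of the window, else "ok".
    """
    if retention_seconds <= 0:
        raise ValueError("retention_seconds must be positive")
    used = downtime_seconds / retention_seconds
    if used >= 1.0:
        return "event-loss-possible"
    if used >= warn_at:
        return "warn"
    return "ok"


SEVEN_DAYS = 7 * 24 * 3600  # illustrative retention window, in seconds

print(classify_downtime(SEVEN_DAYS, 3600))           # one hour of downtime
print(classify_downtime(SEVEN_DAYS, 4 * 24 * 3600))  # four days of downtime
print(classify_downtime(SEVEN_DAYS, 8 * 24 * 3600))  # eight days of downtime
```

Alerting well before the window is used up leaves time to restart the connector or extend retention while the log is still intact.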
Will Graves

Enterprise Account Executive @ Materialize
4mo Edited
We all recognize the value of data-driven organizations, but many still struggle to act on fast-changing data. Providing customers with real-time insights (e.g. Where’s my order 📦? My luggage 🧳?) or identifying and acting on an anomaly in a dataset (e.g. a manufacturing issue ⛓️💥, a fraudulent transaction 💸) is A LOT easier said than done. How fast can you really move? Incremental View Maintenance Replicas, or IVMRs, are little known but provide 1000x performance for read-heavy workloads, without sacrificing data freshness, and do so at a fraction of the price of a traditional replica. Check out Nate Stewart's article on IVMRs, and why they are a critical (and probably missing) piece of your data strategy 📊 https://lnkd.in/eABztQmU #materialize #data #postgres #mysql #oltp

Incremental View Maintenance Replicas: Improve Database Stability and Accelerate Workloads

materialize.com
