Kidong Lee’s Post

With data virtualization, your distributed data sources can be joined and queried without data movement. Nowadays, data virtualization is essential for building data lakehouses in your organization. But if you want to build your data lakehouses with Iceberg support, your data virtualization layer needs to provide these capabilities:
- data lakehouse access control with storage security
- an Iceberg catalog
- automated Iceberg table maintenance
You may also want the query engines used for your data virtualization to offer the following:
- no downtime to run queries
- scale-out query engines
The picture below shows how data virtualization works in Chango (https://lnkd.in/gHvp6Eud). Users send queries just to the endpoints of Chango data virtualization. The queries are executed by the Trino and Spark query engines, which can join your distributed data sources. Chango Trino Gateway provides zero downtime for Trino queries by scaling out several small Trino clusters rather than running one monolithic, giant Trino cluster. For Iceberg support, Chango provides Chango REST Catalog, an Iceberg REST Catalog implementation, along with automated Iceberg table maintenance. Strong data lakehouse access control is also provided at the catalog, schema, and table level. A sketch of what querying such an endpoint can look like follows below.
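As an illustration of what querying such an endpoint can look like, here is a minimal sketch using the open-source trino Python client. This is not Chango's documented API; the hostname, credentials, and the catalog, schema, and table names are all hypothetical.

```python
# A minimal sketch (not Chango's official API): querying a Trino-based
# data virtualization endpoint with the open-source trino Python client.
# Hostname, credentials, and catalog/schema/table names are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="chango-gateway.example.com",  # hypothetical gateway endpoint
    port=8443,
    user="analyst",
    http_scheme="https",
    auth=trino.auth.BasicAuthentication("analyst", "secret"),
)
cur = conn.cursor()

# A federated join: an Iceberg lakehouse table joined against a live
# operational database, with no data movement ahead of time.
cur.execute("""
    SELECT o.order_id, o.total, c.segment
    FROM iceberg.sales.orders AS o      -- hypothetical Iceberg catalog
    JOIN postgresql.crm.customers AS c  -- hypothetical RDBMS catalog
      ON o.customer_id = c.id
    WHERE o.order_date >= DATE '2024-01-01'
""")
for row in cur.fetchmany(10):
    print(row)
```

Because the join runs inside the query engine, the client only ever talks to the virtualization endpoint; neither source needs its data copied out ahead of time.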
More Relevant Posts
-
#datavirtualization has played a pivotal role in helping organizations break down silos by integrating diverse and distributed data sources with ease. However, data virtualization has some challenges when used in isolation:
➡️ Additional load on data sources
➡️ Performance issues and lack of scalability
➡️ Limitations in comprehensive #datamanagement
The go-to approach proposed by data virtualization solutions to tackle these challenges is caching. While caching has its benefits, it also has several shortcomings. To meet the full spectrum of business needs, we need to integrate data virtualization and #datareplication. In our latest blog post, we explore the use of caching and replication and outline the ideal use cases for each approach. Check it out - https://lnkd.in/duXSCrVY #dataintegration
-
Considering data virtualization? Here are the top 3 advantages:
Top 3 Advantages of Data Virtualization | Olmstead
https://www.olmst.com
-
The Impact of Big Data on Infrastructure Strategies Is Expected to Be Substantial
The utilization of Big Data is transforming the way companies use information for their strategies and operations, yet it also presents significant challenges within the Data Center. These challenges can manifest in various aspects of operations due to the disruptive nature of Big Data. However, the most prominent impact on the Data Center is observed in the infrastructure segment of operations.
Three significant obstacles in Big Data infrastructure
According to a report from Data Center Knowledge, Big Data is now influencing the development of infrastructure strategies in various areas of the Data Center, particularly in storage. The challenges posed by Big Data in terms of scale and performance are significant and cannot be ignored.
Accessibility
Big Data poses a significant infrastructure challenge because it relies on users accessing information from various sources, leading to structured and unstructured data entering the storage environment. This necessitates a more accessible storage environment compared to traditional models. As Big Data grows in prominence, cross-referencing data and establishing connections between key pieces of information become more important, highlighting the need for infrastructure models capable of handling diverse source systems.
Flexibility
The report emphasizes the close relationship between flexibility and accessibility in the context of Big Data. IT managers are faced with the challenge of managing capacity and performance, necessitating a storage environment that is both adaptable and responsive. The ability to make significant changes without a complete data migration is crucial, especially considering the impact of the business intelligence layer on storage functionality and performance.
Scalability
IT managers facing the challenge of quickly expanding capacity for Big Data must be able to scale without disrupting other storage systems. The scalability challenge is compounded by the necessity of adding structure to data stored in Big Data archives, with metadata alone creating significant capacity demands.
Tackling the Obstacles Presented by Big Data
To reduce expenses, it is important for IT managers to find ways to cut costs without compromising their storage architecture. This can be achieved by avoiding unnecessary hardware upgrades and expensive support warranties, and by ensuring a reliable, resilient storage setup that can function beyond the end-of-service-life dates of equipment.
-
I just learned about a very cool topic in one of my classes: 𝗱𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗲𝗱 𝗱𝗮𝘁𝗮𝗯𝗮𝘀𝗲𝘀!
In a nutshell, a distributed database is a system where data is stored across multiple physical locations—these could be different servers, data centers, or even geographical regions. Instead of having all your data in one place, distributed databases ensure data is spread out and can be accessed efficiently, no matter where it’s stored.
𝗛𝗼𝘄 𝗜𝘁 𝗪𝗼𝗿𝗸𝘀:
• 𝗗𝗮𝘁𝗮 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻: Data is split into smaller chunks and stored across multiple servers (nodes), reducing the load on any single system.
• 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻: Copies of the data are stored across multiple nodes to ensure availability and fault tolerance. If one server goes down, another can step in without data loss.
• 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆: Changes to data must be synchronized across all copies to ensure everything stays up to date.
• 𝗤𝘂𝗲𝗿𝘆𝗶𝗻𝗴: Even though data is spread out, you can access it as if it’s in one place. The system fetches data from across the network and combines it seamlessly.
𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀:
• 𝗦𝗰𝗮𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆: As data grows, distributed databases allow for horizontal scaling by adding more servers without sacrificing performance.
• 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆: If one server or data center goes down, the system continues running smoothly thanks to replication.
• 𝗦𝗽𝗲𝗲𝗱: Storing data closer to users reduces latency and improves response times.
𝗥𝗲𝗮𝗹-𝗪𝗼𝗿𝗹𝗱 𝗘𝘅𝗮𝗺𝗽𝗹𝗲𝘀:
• 𝗚𝗹𝗼𝗯𝗮𝗹 𝗘-𝗖𝗼𝗺𝗺𝗲𝗿𝗰𝗲: When millions of users are accessing a website from different parts of the world, a distributed database helps by storing user and product data close to each region, speeding up page loads and checkout times.
• 𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 𝗦𝗲𝗿𝘃𝗶𝗰𝗲𝘀: Platforms like Netflix use distributed databases to ensure users can stream content from the server closest to their location, reducing buffering times.
• 𝗙𝗶𝗻𝗮𝗻𝗰𝗶𝗮𝗹 𝗧𝗿𝗮𝗻𝘀𝗮𝗰𝘁𝗶𝗼𝗻𝘀: Banks use distributed databases to ensure that transactions are recorded and replicated across multiple data centers, ensuring both reliability and security.
Learning about distributed databases gave me a new appreciation for the systems that keep our apps and services fast, reliable, and scalable on a global level!
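To make the distribution and replication ideas above concrete, here is a toy Python sketch (illustrative only, not production code) of a cluster that hashes each key to a set of replica nodes. The node names and replication factor are invented for the example.

```python
# A toy illustration of hash-based data distribution plus N-way
# replication across nodes. Node names are hypothetical.
import hashlib

class Cluster:
    def __init__(self, nodes, replication_factor=2):
        self.nodes = {name: {} for name in nodes}  # node name -> local store
        self.rf = replication_factor

    def _replicas(self, key):
        # Pick rf consecutive nodes, starting from the hash of the key,
        # so the same key always maps to the same replica set.
        names = sorted(self.nodes)
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.rf)]

    def put(self, key, value):
        # The write is applied to every replica to keep copies in sync.
        for name in self._replicas(key):
            self.nodes[name][key] = value

    def get(self, key):
        # Any live replica can serve the read (fault tolerance).
        for name in self._replicas(key):
            if key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)

cluster = Cluster(["us-east", "eu-west", "ap-south"])
cluster.put("user:42", {"name": "Ada"})
print(cluster.get("user:42"))
```

Real systems layer consensus protocols and consistent hashing on top of this, but the core idea is the same: route each key to a deterministic set of replicas.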
-
What is a FlexCache volume in NetApp?
It is a remote cache volume of an existing volume that contains actively accessed, or hot, data. The origin can be on the same cluster, a remote cluster, or a CVO instance. This helps to defeat data silos and reduces data access time. The prerequisites are almost the same as for SnapMirror: peering the clusters, followed by peering the data SVMs. While peering the SVM, you need to specify flexcache under applications. FlexCache traffic uses the intercluster LIFs, hence no overhead on the data LIFs. The FlexCache volume is a sparse volume, which means it only keeps the hot data instead of the entire data set on the origin volume.
Some important terminology:
1. Remote access layer (RAL): Enables or facilitates the read/write operations for FlexCache
2. Remote entry file (REM): Located at the origin; keeps entries for data that is cached
3. Remote index metafile (RIM): Located at the destination; keeps entries for files that are cached locally on the FlexCache volume
4. Remote lock entry metafile (RLEM): Located at both source and destination; holds lock authority information
When a FlexCache is created, how the data is accessed:
1. A read request comes from the NAS layer.
2. ONTAP understands that the request is for a FlexCache volume and knows it cannot load the inode to serve the request. It invokes RAL.
3. RAL on the remote storage initiates a connection to the RAL that holds the origin volume over the intercluster LIF.
4. A remote storage operation is started and the data is sent to the node holding the FlexCache volume.
5. RAL creates an entry in the REM file at the source.
6. After the data is received, the inode is written to disk and a RIM entry is made at the destination.
7. Now the data on the FlexCache volume is served locally.
On the next read request, the data will not be requested over the WAN but will be served locally by checking the RIM file. There can be a scenario where the data is modified at the source: RAL will verify whether there was any delegation on the data, and the new write deltas will again be requested over the intercluster LIFs.
What if there is a write request? FlexCache follows a write-around concept, where the data is written at the source. The REM and RIM files are kept in sync with each other, and for any change to a file, RAL sends a message across the caches that the data has been modified and a new read needs to come from the source, since the cached copy has been invalidated. A toy model of this read path follows below the links.
Refer to the white paper for more information: https://lnkd.in/gQDuuvsZ
How to create a FlexCache? https://lnkd.in/g-8Q4X-P
FlexCache in ONTAP 9.11.1 | TR-4743
netapp.com
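As a rough mental model of the read path described above, here is a toy Python sketch. It is purely conceptual, not NetApp's implementation or API; the class names, the REM/RIM bookkeeping, and the paths are simplified stand-ins for what ONTAP actually does.

```python
# A conceptual toy model (NOT NetApp's implementation or API) of the
# FlexCache read path: the first read is fetched from the origin and
# recorded; later reads are served locally until invalidated.
class OriginVolume:
    def __init__(self, data):
        self.data = data
        self.rem = set()          # stand-in for REM: blocks remote caches hold

    def fetch(self, path):
        self.rem.add(path)        # origin records the delegation
        return self.data[path]

class FlexCacheVolume:
    def __init__(self, origin):
        self.origin = origin
        self.rim = {}             # stand-in for RIM: locally cached blocks (sparse)

    def read(self, path):
        if path in self.rim:                # cache hit: served locally
            return self.rim[path]
        value = self.origin.fetch(path)     # cache miss: RAL-style remote read
        self.rim[path] = value
        return value

    def invalidate(self, path):
        self.rim.pop(path, None)  # a write at the origin invalidates the copy

origin = OriginVolume({"/vol/app/config": "v1"})
cache = FlexCacheVolume(origin)
print(cache.read("/vol/app/config"))  # remote fetch over "intercluster" path
print(cache.read("/vol/app/config"))  # now served locally from the cache
```

The sparse-volume property shows up in the model as the cache holding only the paths that were actually read, never the origin's full data set.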
-
Namaste techies :-
Data partitioning is a technique used in system design to manage large volumes of data by dividing it into smaller, more manageable pieces. This approach helps improve the performance, scalability, and maintainability of the system.
Important topics for data partitioning techniques in system design:
- Horizontal Partitioning/Sharding
- Vertical Partitioning
- Key-based Partitioning
- Range Partitioning
- Hash-based Partitioning
- Round-robin Partitioning
Horizontal Partitioning/Sharding in simple words:
Horizontal partitioning (or sharding) is a way to manage large amounts of data by splitting it into smaller, more manageable pieces based on rows or records. Here’s how it works:
What it is:
- Dividing data by rows: Imagine you have a huge table with lots of rows of data. Instead of keeping everything in one big table, you split the rows into smaller tables, each with a portion of the data.
- Distributed storage: These smaller tables (partitions or shards) are stored on different servers or storage systems.
Advantages of horizontal partitioning/sharding (see the sketch below):
- Greater scalability: Horizontal partitioning splits a large dataset into smaller pieces that can be stored and processed on multiple servers. As your data grows, you can simply add more servers to handle additional partitions, allowing the system to scale up easily.
- Load balancing: By dividing data into partitions, the workload is spread across multiple servers or nodes. This helps prevent any single server from being overwhelmed, balancing the load and improving overall system performance.
- Data separation: Each partition operates independently, so problems or failures in one partition don’t affect the others. This isolation improves fault tolerance and ensures that the system remains functional even if one part encounters issues.
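As a concrete illustration of hash-based routing for horizontal partitioning, here is a minimal Python sketch; the shard names and row layout are hypothetical.

```python
# A minimal sketch of hash-based horizontal partitioning: each row is
# routed to a shard by hashing its key. Shard names are hypothetical.
import hashlib

SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]

def shard_for(key: str) -> str:
    # A stable hash, so the same key always lands on the same shard.
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

rows = [{"user_id": f"user{i}", "country": "IN"} for i in range(6)]
for row in rows:
    print(row["user_id"], "->", shard_for(row["user_id"]))
```

Note that adding a shard to this naive scheme remaps most keys; production systems typically use consistent hashing to limit that reshuffling.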
-
🚀 Unlocking the Power of Sharding: Enhancing Database Performance and Scalability 🚀
In today's data-driven world, managing large datasets efficiently is crucial. One powerful technique to achieve this is sharding. Let's dive into what sharding is and why it's a game-changer for database architecture.
What is sharding?
Sharding is a database architecture pattern that involves partitioning a large database into smaller, more manageable pieces called shards. This approach significantly improves the performance, scalability, and manageability of extensive databases.
Highlights:
💡 Performance boost: Sharding distributes data across multiple databases or servers, enhancing query response times and reducing the load on individual servers.
💡 Scalability: It allows databases to scale horizontally. As the dataset grows, simply add more shards to handle the increased load.
💡 Manageability: Smaller, distributed databases are easier to manage, back up, and restore compared to a single monolithic database.
💡 Availability & fault tolerance: Sharding improves availability. If one shard goes down, others can continue operating, ensuring minimal disruption.
💡 Complexity: Implementing sharding introduces complexity in database design, query routing, and data distribution.
💡 Data consistency: Maintaining data consistency across multiple shards can be challenging in distributed systems, but the benefits often outweigh the difficulties.
By adopting sharding, organizations can achieve greater performance, flexibility, and reliability in their database management. It's a powerful tool for any engineer dealing with large-scale data systems.
#DatabaseManagement #Sharding #TechInnovation #Scalability #DataManagement #PerformanceOptimization #FaultTolerance #BigData #CloudastraTechnologies #Cloudastra
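To make the query-routing complexity mentioned above concrete, here is a toy scatter-gather sketch in Python (illustrative only): the router fans a query out to every shard and merges the partial results. The shard contents and filter are invented for the example.

```python
# A toy scatter-gather read across shards (illustrative only): the
# query runs on every shard in parallel and the router merges results.
from concurrent.futures import ThreadPoolExecutor

shards = [
    [{"order": 1, "total": 40}, {"order": 4, "total": 90}],
    [{"order": 2, "total": 15}],
    [{"order": 3, "total": 70}, {"order": 5, "total": 5}],
]

def query_shard(shard, min_total):
    # Each shard filters its own rows independently.
    return [r for r in shard if r["total"] >= min_total]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(query_shard, shards, [30] * len(shards)))

# The router merges and orders the partial results.
merged = sorted((r for part in partials for r in part), key=lambda r: r["order"])
print(merged)
```

This is also where the consistency caveat bites: the merge step can only be as fresh as the slowest, most lagging shard it reads from.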
-
Data virtualization is a new approach that allows users to access data from various sources without moving or copying it. It provides self-service for data users and reduces the workload for IT specialists. Data virtualization complements data warehouses and can be used for updating applications via service interfaces. #datavirtualization #datawarehouse #selfserviceanalytics https://lnkd.in/erWM7FtE
Data virtualization challenges traditional technologies - IBM Nordic Blog
https://www.ibm.com/blogs/nordic-msp
-
In distributed computing, data is replicated across multiple nodes for two main reasons: 1) to improve latency, so that the data center closest to the user (geographically) serves the read request, and 2) fault tolerance: if one of the nodes goes down, the requests it was serving can be routed to another node. Thus, the key to distributed systems is replication of data across multiple nodes. However, the architect needs to decide whether to use synchronous or asynchronous replication: with synchronous replication the leader waits for followers to confirm a write before acknowledging it, while with asynchronous replication it acknowledges immediately and the followers catch up later.
I'll try to make the mechanics easy to understand with the example of a write request sent to a distributed database. First, we need to understand what the leader-follower (master-slave) configuration is. The leader is the node to which all write requests are directed; changes are made in the leader's local storage and are also recorded in the replication log. This replication log is then sent to each of the followers so that the latest changes can be applied to them. Now, if we want to read the data, we have multiple replicas of the latest data, so read requests are sent to the followers with the help of a load balancer. In this way, we take advantage of an architecture with multiple followers and a single leader, and get low-latency, high-throughput reads.
As read requests increase, we can attach new followers by taking a snapshot of one of the existing followers (which stores the current state of the data) and creating a new node from that snapshot. Meanwhile, any changes made while this node was being created are stored in the leader's replication log. We can detect the incremental changes by comparing the leader's replication log with the newly created node's replication log, and apply those incremental changes to the new node. The follower then stores the latest changes and is ready to serve read requests. In this way, we can set up as many followers as required. This is how auto-scaling is done in distributed systems. A sketch of this flow follows below.
#datascience #dataengineering #distributedcomputing #bigdata
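Here is a toy Python sketch of the leader-follower flow described above (illustrative only, not a real database): writes append to the leader's replication log, followers replay it, and a new follower bootstraps from a snapshot plus the incremental log entries written after the snapshot was taken.

```python
# A toy leader-follower sketch (illustrative, not a real database).
class Leader:
    def __init__(self):
        self.store = {}
        self.log = []                      # replication log: (seq, key, value)

    def write(self, key, value):
        self.store[key] = value
        self.log.append((len(self.log), key, value))

class Follower:
    def __init__(self):
        self.store = {}
        self.applied = 0                   # position reached in the leader's log

    def catch_up(self, leader):
        # Apply only the incremental entries since the last sync.
        for seq, key, value in leader.log[self.applied:]:
            self.store[key] = value
            self.applied = seq + 1

leader = Leader()
leader.write("balance:alice", 100)

f1 = Follower()
f1.catch_up(leader)                        # existing follower syncs

snapshot, position = dict(f1.store), f1.applied
leader.write("balance:bob", 50)            # writes continue meanwhile

new_follower = Follower()                  # bootstrap from the snapshot...
new_follower.store, new_follower.applied = snapshot, position
new_follower.catch_up(leader)              # ...then replay the incremental log
print(new_follower.store)                  # the new node can now serve reads
```

Recording the snapshot's log position is the key trick: it lets the new follower replay exactly the writes it missed, rather than re-copying the whole data set.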