An Architecture for Fast and General Data Processing on Large Clusters https://bit.ly/440NCnR proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Early cluster computing systems handled only batch processing; this architecture also supports streaming and interactive queries while preserving scalability and fault tolerance. Author/Editor: Matei Zaharia, MIT and Databricks #dataprocessing #largeclusters #clustercomputing #computerarchitecture #streaming #interactive #Queries #Scalability #FaultTolerance ACM, Association for Computing Machinery
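To make the "fast and general" claim concrete, here is a minimal PySpark sketch in which the same DataFrame operators run both as a batch job and as a fault-tolerant streaming query on one engine. The paths and data layout are hypothetical; this is an illustration of the idea, not code from the book.

```python
# Minimal sketch of the "one engine, many workloads" idea.
# Paths are hypothetical; this is an illustration, not code from the book.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-engine-sketch").getOrCreate()

# Batch: aggregate historical events from a directory of JSON files.
batch_events = spark.read.json("/data/events/")            # hypothetical path
batch_counts = batch_events.groupBy("event_type").count()
batch_counts.show()

# Streaming: the same logical query over newly arriving JSON files,
# expressed with the same DataFrame operators and fault-tolerance model.
stream_events = (
    spark.readStream
         .schema(batch_events.schema)                      # reuse the batch schema
         .json("/data/incoming/")                          # hypothetical path
)
stream_counts = stream_events.groupBy("event_type").count()

query = (
    stream_counts.writeStream
                 .outputMode("complete")
                 .format("console")
                 .start()
)
query.awaitTermination(30)    # stop after a short demo window

spark.stop()
```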
-
A beautiful talk on why distributed databases won't scale beyond a few hundred terabytes, and why it's time for an architecture where storage isn't coupled with compute. The pain points it covers: ❗High operational overhead. ❗Log spikes. ❗Multi-tenancy. ❗Data reliability. ❗And, always, the high cost. Full talk here: https://lnkd.in/gtq9nA4Z
-
The Case for Shared Storage In this post, I’ll start off with a brief overview of “shared nothing” vs. “shared storage” architectures in general. This discussion will be a bit abstract and high-level, but the goal is to share with you some of the guiding philosophy that ultimately led to WarpStream’s architecture. We’ll then quickly transition to discussing the trade-offs between the two architectures more specifically in the context of data streaming and WarpStream; this is the WarpStream blog after all! https://lnkd.in/g9jBpQxD
The Case for Shared Storage
warpstream.com
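For intuition only, here is a toy Python sketch of the "shared storage" shape the post describes: stateless writers append immutable batches straight to object storage, and any reader can list and replay them. The bucket name, key layout, and use of boto3 are assumptions for illustration; this is not WarpStream's implementation.

```python
# Toy illustration of a "shared storage" log: stateless writers append batches
# directly to object storage, and any reader can list and replay them.
# Conceptual sketch only; bucket and prefix are hypothetical.
import json
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "example-shared-log"          # hypothetical bucket
PREFIX = "topic-a/"

def append_batch(records):
    """Write a batch of records as one immutable object, keyed by timestamp."""
    key = f"{PREFIX}{time.time_ns()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode())
    return key

def read_all():
    """Replay the log by listing objects in key order and fetching each one."""
    out = []
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        out.extend(json.loads(body))
    return out

append_batch([{"user": 1, "event": "click"}, {"user": 2, "event": "view"}])
print(read_all())
```

Even in this toy, the trade-off versus shared nothing is visible: no node owns a disk, so any node can serve any write or read, at the cost of every operation paying an object-store round trip.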
-
🌟Unity Catalog breaks barriers with seamless interoperability across data formats and compute engines, opening the door to a more flexible and open data architecture. Our latest blog breaks down its impacts and why there’s never been a better time to embrace a more open approach to your infrastructure. https://lnkd.in/g2q-g9iN #DataAnalytics #DataEngineering #DataLakeAnalytics #DataLake #DataLakeHouse
Build a More Open Lakehouse With Unity Catalog
starrocks.io
-
Accelerating Data Reads in Iceberg: Caching and Optimization Strategies

As data volumes grow, efficient data reads become crucial for high-performance applications. In our latest blog, we dive deep into optimizing read performance with PyIceberg. From caching strategies using FSSpec to eliminating network bottlenecks and addressing CPU-bound tasks, we walk through how to achieve significant performance improvements.

Key highlights:
- How to leverage caching with FSSpec for faster reads with PyIceberg
- Tackling CPU bottlenecks with multiprocessing
- Achieving a 10x+ speedup in data reads

📖 Read more: https://lnkd.in/dUcwSwsr by Koen Vossen

#PyIceberg #Iceberg #DWH #FootballAnalytics
Accelerating Data Reads in Iceberg: Caching and Optimization Strategies
https://eyedle.ai
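As a rough, generic illustration of the two techniques the post names (not the blog's actual code), the sketch below wraps a remote filesystem in fsspec's "filecache" so each data file is downloaded only once, then fans the CPU-bound Parquet decoding out across processes. The bucket keys and cache directory are made up, and s3fs must be installed for the "s3" protocol.

```python
# Generic sketch of the two ideas from the post: (1) cache remote reads locally
# with fsspec's "filecache" wrapper, (2) spread CPU-bound decoding across cores.
# Not the blog's exact code; bucket keys and cache path are hypothetical.
from concurrent.futures import ProcessPoolExecutor

import fsspec
import pyarrow.parquet as pq

# 1) Wrap the remote filesystem so each object is downloaded once and reused.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",                 # requires s3fs
    target_options={"anon": False},
    cache_storage="/tmp/iceberg-cache",   # local cache directory (hypothetical)
)

FILES = [
    "example-bucket/warehouse/events/data/part-00000.parquet",  # hypothetical keys
    "example-bucket/warehouse/events/data/part-00001.parquet",
]

def read_one(path):
    """Decode a single Parquet data file; decompression/decoding is CPU-bound."""
    with fs.open(path, "rb") as f:
        return pq.read_table(f).num_rows

if __name__ == "__main__":
    # 2) Parallelise decoding across processes to get past the GIL.
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(read_one, FILES)))
```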
-
Spark changed its memory architecture in 2016 to a “dynamic” unified memory architecture. Why, and what does “dynamic” mean?

Spark's unified memory is divided into two components: one for execution (where joins and aggregations happen) and one for storage (where caching happens).

Initially (< Spark 1.6), the portions for execution and storage memory were fixed. This meant that if your job was execution-heavy and you had space left in the storage component (because there wasn't much to cache), you still couldn't use that storage memory, and vice versa.

This prompted Spark to make the boundary between storage and execution memory movable (>= Spark 1.6), with more priority given to execution memory because, at the end of the day, that is where your joins, aggregations, etc. happen. If execution needs memory, it evicts RDD blocks from storage memory using the LRU algorithm. On the other hand, if storage needs more memory but it's already being used by execution, storage will evict its own blocks based on LRU to make room for new data. Simple, yet beautiful.

Have a look at the image below to understand various scenarios :)

👉 Follow Girish Gowda for more data engineering related materials and information.

Better reach: Vaishnavi MURALIDHAR Shubham Wadekar Asheesh .. Ankita Gulati Ankur Bhattacharya Ankur Ranjan Sumit Mittal Karthik K. Deepak Goyal Sagar Prajapati Shashank Mishra 🇮🇳 Prashant Kumar Pandey Zach Wilson Rajat Gajbhiye

Credit: Afaque Ahmad

Save and reshare ✅ Repost if you find it useful

#dataengineering #apachespark #spark #memorymanagement
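The original post's image isn't reproduced here, so as a rough substitute, below is a minimal PySpark sketch of the two configuration knobs behind this behaviour. spark.memory.fraction and spark.memory.storageFraction are real Spark settings, and the values shown are their documented defaults; the sketch only makes the description concrete and is not from the original post.

```python
# Minimal sketch of the knobs behind Spark's unified memory model (Spark >= 1.6).
# Values shown are the defaults; this just makes the post's description concrete.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("unified-memory-sketch")
    # Fraction of (heap - 300 MB reserved) shared by execution AND storage.
    .config("spark.memory.fraction", "0.6")
    # Portion of that unified region protected from eviction by execution;
    # cached blocks beyond this threshold can be evicted (LRU) when joins,
    # aggregations, sorts, etc. need more room.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)

df = spark.range(1_000_000)
df.cache().count()                        # fills storage memory with cached blocks
df.groupBy(df.id % 10).count().show()     # execution can borrow free storage space

spark.stop()
```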
-
Curious about system design? Check out my latest blog on the CAP theorem! 🚀 Learn how Consistency, Availability, and Partition Tolerance shape robust distributed systems. #SystemDesign #TechInsights #CAPTheorem https://lnkd.in/gfDQqXe4
Understanding the CAP Theorem: A Comprehensive Guide with Real-World Applications
medium.com
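A toy sketch of the theorem's core trade-off (purely illustrative, not from the linked article): a replica cut off from its peer must either stay available and risk returning stale data, or stay consistent and refuse to answer until the partition heals.

```python
# Toy illustration of the CAP trade-off: during a network partition, a replica
# must pick availability (serve possibly-stale data) or consistency (refuse).
class Replica:
    def __init__(self, mode):
        self.mode = mode              # "AP" or "CP"
        self.data = {"x": 1}          # last value replicated before the partition
        self.partitioned = True       # peer is unreachable

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm latest value")
        return self.data[key]         # AP: answer, accepting possible staleness

print(Replica("AP").read("x"))        # -> 1 (maybe stale)
try:
    Replica("CP").read("x")
except RuntimeError as e:
    print(e)                          # -> unavailable: cannot confirm latest value
```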
-
We managed to bring PyIceberg read times down from 50 seconds to 4 seconds by improving how we handle caching and network requests. It was a practical solution that made a noticeable difference in performance. If you're working with Iceberg and want to optimize data reads, you can find the details here 👇
Accelerating Data Reads in Iceberg: Caching and Optimization Strategies
https://eyedle.ai
-
Learn how GPUs can accelerate #ApacheSpark ETL and machine learning workloads by up to 40x at #GTC24. Dive into the benchmarking, architecture, and capabilities of the RAPIDS Accelerator for Apache Spark. Register today > https://nvda.ws/3SUOuGN
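For anyone wanting to see roughly what "enabling the accelerator" looks like, here is a hedged PySpark sketch. The plugin class, config keys, and jar path follow the RAPIDS Accelerator documentation as I recall it, so treat them as assumptions and verify names, jar version, and GPU resource setup against NVIDIA's current docs.

```python
# Hedged sketch of enabling the RAPIDS Accelerator on an existing PySpark job.
# Config keys and plugin class are as I recall from the public docs; the jar
# path/version is a placeholder. Requires a GPU-equipped cluster to actually run.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-sketch")
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12.jar")  # placeholder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # Core Spark 3 GPU scheduling: one GPU per executor, shared across tasks.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# The point of the plugin: unchanged DataFrame/SQL code, GPU-accelerated plans.
df = spark.range(10_000_000).withColumnRenamed("id", "v")
df.groupBy((df.v % 100).alias("bucket")).count().explain()  # look for Gpu* operators

spark.stop()
```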