🚀 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞: 𝐀𝐩𝐚𝐜𝐡𝐞 𝐈𝐜𝐞𝐛𝐞𝐫𝐠 🚀

✅ 𝐖𝐡𝐚𝐭 𝐞𝐱𝐚𝐜𝐭𝐥𝐲 𝐢𝐬 𝐀𝐩𝐚𝐜𝐡𝐞 𝐈𝐜𝐞𝐛𝐞𝐫𝐠?
🔺 It is a table format for massive analytic datasets, created in 2017 at Netflix by Ryan Blue and Daniel Weeks.
🔺 It addresses the performance and consistency problems of the Hive table format.
🔺 It became open source in 2018.

✅ 𝐖𝐡𝐚𝐭 𝐚𝐫𝐞 𝐢𝐭𝐬 𝐤𝐞𝐲 𝐟𝐞𝐚𝐭𝐮𝐫𝐞𝐬?
🔺 𝑺𝒄𝒉𝒆𝒎𝒂 𝑬𝒗𝒐𝒍𝒖𝒕𝒊𝒐𝒏 𝒂𝒏𝒅 𝑽𝒆𝒓𝒔𝒊𝒐𝒏𝒊𝒏𝒈
Iceberg supports schema evolution, allowing changes like adding, dropping, and renaming columns, and updating column types, without rewriting data or breaking existing queries. It also versions every change as a snapshot, enabling rollback to previous table states.
🔺 𝑷𝒂𝒓𝒕𝒊𝒕𝒊𝒐𝒏𝒊𝒏𝒈
Iceberg offers hidden partitioning: partition values are derived from column values through transforms (e.g. days(ts)), so users do not need to know the partition layout or add partition filters by hand; Iceberg prunes partitions automatically from ordinary query predicates.
🔺 𝑨𝒕𝒐𝒎𝒊𝒄𝒊𝒕𝒚 𝒂𝒏𝒅 𝑪𝒐𝒏𝒔𝒊𝒔𝒕𝒆𝒏𝒄𝒚
Iceberg guarantees atomic operations and consistent reads through an atomic commit protocol on table metadata. Updates are all-or-nothing, preventing partial writes and maintaining data integrity.
🔺 𝑫𝒂𝒕𝒂 𝑳𝒂𝒚𝒐𝒖𝒕 𝒂𝒏𝒅 𝑰𝒏𝒅𝒆𝒙𝒊𝒏𝒈
Iceberg optimizes data layout and tracks file-level metadata, such as manifest files and column statistics in a metadata tree, which enhances query performance by pruning unnecessary data reads.
🔺 𝑻𝒊𝒎𝒆 𝑻𝒓𝒂𝒗𝒆𝒍
Iceberg supports time travel, allowing users to query the table as it existed at an earlier snapshot. This makes historical analysis and recovery from accidental changes straightforward.
🔺 𝑰𝒏𝒕𝒆𝒓𝒐𝒑𝒆𝒓𝒂𝒃𝒊𝒍𝒊𝒕𝒚
Iceberg works with multiple processing engines such as Apache Spark, Apache Flink, and Trino, making it easy to integrate into existing data infrastructure.

✅ 𝐑𝐞𝐥𝐢𝐚𝐛𝐢𝐥𝐢𝐭𝐲 & 𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞
Iceberg is designed to handle massive tables, often containing tens of petabytes of data, by:
🔺 𝑬𝒇𝒇𝒊𝒄𝒊𝒆𝒏𝒕 𝑺𝒄𝒂𝒏 𝑷𝒍𝒂𝒏𝒏𝒊𝒏𝒈: Iceberg plans scans rapidly from table metadata alone, without needing a distributed SQL engine to list directories or locate files.
🔺 𝑨𝒅𝒗𝒂𝒏𝒄𝒆𝒅 𝑭𝒊𝒍𝒕𝒆𝒓𝒊𝒏𝒈: It prunes data files using partition values and column-level statistics stored in table metadata, so queries read only the data they need.

#dataengineering #تونس_أفضل
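To make the features above concrete, here is a minimal PySpark sketch, not production code: it assumes Spark 3.3+ with the Iceberg Spark runtime on the classpath, and the catalog, table, and values are hypothetical.

from pyspark.sql import SparkSession

# Local Hadoop catalog named "local"; warehouse path is a placeholder.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Hidden partitioning: partition by a transform of ts, not a separate column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution: a metadata-only change, no data rewrite.
spark.sql("ALTER TABLE local.db.events ADD COLUMN country STRING")

spark.sql("INSERT INTO local.db.events VALUES (1, TIMESTAMP '2024-06-01 10:00:00', 'hello', 'TN')")

# Time travel: every commit creates a snapshot; query the table as of one of them.
snap_id = spark.sql(
    "SELECT snapshot_id FROM local.db.events.snapshots ORDER BY committed_at"
).first()["snapshot_id"]
spark.sql(f"SELECT * FROM local.db.events VERSION AS OF {snap_id}").show()
# Alternatively: SELECT * FROM local.db.events TIMESTAMP AS OF '<earlier timestamp>'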
Houssem Korbi’s Post
-
🌟 Data-Driven Revolution: Polars Leads with Lightning Speed in Latest TPC-H Benchmarks 🌟

The recent update of the TPC-H benchmark results by Polars sets a new standard in data processing. This benchmark, widely used for decision support systems, evaluates database management and data processing systems by simulating complex query execution. Polars has emerged as a frontrunner, showcasing significant optimizations and performance enhancements.

🔍 Why It Matters: In a data-centric world, the ability to process large datasets efficiently is paramount. The TPC-H benchmarks exercise a variety of SQL queries involving joins, filters, and group-by operations, which are foundational for analytics workloads. Polars' performance in these benchmarks is a testament to its ability to handle complex data operations swiftly.

🚀 Key Insights:
🔹 Optimized Performance: Polars has demonstrated superior performance compared to other dataframe libraries such as pandas and PySpark, handling large datasets with more speed and less resource consumption.
🔹 Benchmarking as a Tool for Improvement: Continually updating these benchmarks pushes the envelope on data processing technology, driving innovations that trickle down to better user experiences and application efficiency.
🔹 Implications for Data Professionals: For data scientists and engineers, understanding the strengths and limitations of different processing tools through these benchmarks leads to more informed decisions about the right tool for a given workload.

💼 Implications for Professionals: Data professionals should weigh these benchmark results in their projects, particularly those involving large-scale analytics. Such benchmarks can strongly influence the architecture of data solutions, helping to keep them robust, cost-effective, and scalable.

👉 https://lnkd.in/dkUPCkRN

👥 Join the Conversation:
🔹 How do you see the evolution of data processing tools affecting your industry?
🔹 Are there specific challenges in your data workflows that could benefit from the kind of performance Polars is promising?

#DataScience #Analytics #BigData #TPCH #Polars #Benchmarking
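For a feel of the kind of query TPC-H exercises, here is a small Polars sketch loosely modelled on query 1, with an extra join to orders added purely for illustration. It assumes a recent Polars release and local Parquet copies of the lineitem and orders tables; the file paths are hypothetical and this is not the official benchmark code.

from datetime import date
import polars as pl

lineitem = pl.scan_parquet("lineitem.parquet")  # lazy scan, nothing is read yet
orders = pl.scan_parquet("orders.parquet")

q = (
    lineitem
    .filter(pl.col("l_shipdate") <= date(1998, 9, 2))
    .join(orders, left_on="l_orderkey", right_on="o_orderkey")
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        pl.col("l_quantity").sum().alias("sum_qty"),
        pl.col("l_extendedprice").sum().alias("sum_price"),
        pl.len().alias("count_order"),
    )
    .sort("l_returnflag", "l_linestatus")
)

# The optimizer sees the whole plan (filter, join, group-by) before executing it.
print(q.collect())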
-
𝗔𝗱𝗮𝗽𝘁𝗶𝘃𝗲 𝗤𝘂𝗲𝗿𝘆 𝗘𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻: 𝗠𝗮𝗸𝗶𝗻𝗴 𝗦𝗽𝗮𝗿𝗸 𝗦𝗺𝗮𝗿𝘁𝗲𝗿 𝗮𝘁 𝗥𝘂𝗻𝘁𝗶𝗺𝗲 💡

𝗪𝗵𝗮𝘁 𝗶𝘀 𝗔𝗤𝗘?
Apache Spark's Adaptive Query Execution (AQE) takes performance to the next level by re-optimizing queries while they run. Unlike static query plans, AQE adapts to runtime statistics, addressing common big data problems such as skewed data and inefficient joins.

𝗞𝗲𝘆 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝗼𝗳 𝗔𝗤𝗘
1️⃣ 𝗗𝘆𝗻𝗮𝗺𝗶𝗰 𝗝𝗼𝗶𝗻 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆 𝗦𝘄𝗶𝘁𝗰𝗵𝗶𝗻𝗴: Replans joins (for example, to a broadcast join) based on the actual size of the data at runtime.
2️⃣ 𝗦𝗸𝗲𝘄 𝗝𝗼𝗶𝗻 𝗛𝗮𝗻𝗱𝗹𝗶𝗻𝗴: Splits skewed partitions to keep workloads balanced.
3️⃣ 𝗗𝘆𝗻𝗮𝗺𝗶𝗰 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝗖𝗼𝗮𝗹𝗲𝘀𝗰𝗶𝗻𝗴: Combines small shuffle partitions to reduce shuffle overhead.

𝗘𝘅𝗮𝗺𝗽𝗹𝗲: 𝗝𝗼𝗶𝗻𝗶𝗻𝗴 𝗦𝗮𝗹𝗲𝘀 𝗮𝗻𝗱 𝗥𝗲𝗴𝗶𝗼𝗻𝘀 𝗗𝗮𝘁𝗮
Imagine two DataFrames:
sales: 10 million rows of sales data.
regions: 50 rows of region mappings.

𝗪𝗶𝘁𝗵𝗼𝘂𝘁 𝗔𝗤𝗘:
sales.join(regions, "region_id").count()
Spark shuffles both DataFrames into 200 partitions, even though regions is tiny; the runtime is around 20 seconds.

𝗪𝗶𝘁𝗵 𝗔𝗤𝗘:
sales.join(regions, "region_id").count()
AQE recognizes at runtime that regions is small and switches to a broadcast join, avoiding the unnecessary shuffle and cutting the runtime by roughly 50%.

𝗥𝗲𝗮𝗹-𝗪𝗼𝗿𝗹𝗱 𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀
✅ Faster Queries: AQE reduces runtime by making smarter decisions at execution time.
✅ Cost-Efficient: Uses cluster resources more effectively.
✅ Scalable: Handles complex queries and large datasets gracefully.

🚀 Ready for smarter, faster Spark jobs?

#ApacheSpark #BigData #FriscoAnalytics #Databricks
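A minimal PySpark sketch of the scenario above with AQE switched on explicitly; the file paths and sizes are hypothetical.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # master switch (default on since Spark 3.2)
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
    .getOrCreate()
)

sales = spark.read.parquet("/data/sales")      # ~10M rows (placeholder path)
regions = spark.read.parquet("/data/regions")  # ~50 rows (placeholder path)

# With AQE on, Spark can see at runtime that `regions` is small and
# switch the sort-merge join to a broadcast join, skipping the big shuffle.
result = sales.join(regions, "region_id").count()
print(result)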
-
When I read the announcement for ES|QL, I was kind of "eh, one more after Query DSL, KQL, SQL, EQL, who cares?". Well, last week I was able to solve an issue that I could not solve with the Query DSL, aggregations, and pipeline aggregations, because I was missing a way to act on the result of a pipeline aggregation. My final query was 5 lines of ES|QL, which essentially added the ability to sort the output of my pipeline aggregation, and it was much more readable. I guess I'll be running ES|QL more often in the future for analytics use cases. However, it is still missing the ability to run a classic full-text search. Reeeally? #elasticsearch
ES|QL | Elasticsearch Guide [8.15] | Elastic
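For illustration, here is a rough Python sketch of running an ES|QL query against the _query REST endpoint (the documented ES|QL API). The index, fields, and aggregation are made up, and security/auth is omitted; this is not the author's actual query.

import requests

ESQL = """
FROM web-logs
| STATS total_bytes = SUM(bytes) BY client_ip
| SORT total_bytes DESC
| LIMIT 10
"""

resp = requests.post(
    "http://localhost:9200/_query",  # local dev cluster without TLS/auth
    json={"query": ESQL},
    timeout=30,
)
resp.raise_for_status()

# The response contains "columns" and "values"; print the top rows.
for row in resp.json()["values"]:
    print(row)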
-
🚀 𝗚𝗲𝘁𝘁𝗶𝗻𝗴 𝗦𝘁𝗮𝗿𝘁𝗲𝗱 𝘄𝗶𝘁𝗵 𝗔𝗽𝗮𝗰𝗵𝗲 𝗦𝗽𝗮𝗿𝗸 𝗗𝗮𝘁𝗮𝗙𝗿𝗮𝗺𝗲𝘀 🚀

If you're working with big data and exploring Apache Spark, mastering DataFrames is essential. These structured data abstractions make data processing efficient and intuitive. Here's a quick breakdown:

𝗪𝗼𝗿𝗸𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗗𝗮𝘁𝗮𝗙𝗿𝗮𝗺𝗲𝘀

𝟭. 𝗖𝗿𝗲𝗮𝘁𝗲 𝗮 𝗦𝗽𝗮𝗿𝗸 𝗦𝗲𝘀𝘀𝗶𝗼𝗻: The entry point for all operations using the higher-level APIs.

𝟮. 𝗟𝗼𝗮𝗱 𝗗𝗮𝘁𝗮 𝗶𝗻𝘁𝗼 𝗮 𝗗𝗮𝘁𝗮𝗙𝗿𝗮𝗺𝗲: Easily read data in formats like CSV, JSON, Parquet, or ORC:
df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)

𝟯. 𝗔𝗽𝗽𝗹𝘆 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀: Transform your data by renaming columns, filtering rows, or selecting specific columns:
# Renaming columns
transformed_df = df.withColumnRenamed("old_column", "new_column")
# Filtering rows
filtered_df = df.filter("customer_id = 11599")

𝟰. 𝗪𝗿𝗶𝘁𝗲 𝗗𝗮𝘁𝗮 𝗕𝗮𝗰𝗸 𝘁𝗼 𝗦𝘁𝗼𝗿𝗮𝗴𝗲: Save the processed data in the desired format:
filtered_df.write.parquet("/path/to/output")

𝗣𝗿𝗼 𝗧𝗶𝗽𝘀:
• 𝗦𝗸𝗶𝗽 𝗶𝗻𝗳𝗲𝗿𝗦𝗰𝗵𝗲𝗺𝗮 𝗳𝗼𝗿 𝗟𝗮𝗿𝗴𝗲 𝗙𝗶𝗹𝗲𝘀: Inferring the schema can slow down execution. Instead, define schemas explicitly for better performance and accuracy.
• 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻𝘀: Combine transformations like filtering and column selection in a single step to reduce execution overhead.

💡 𝗪𝗵𝘆 𝗨𝘀𝗲 𝗗𝗮𝘁𝗮𝗙𝗿𝗮𝗺𝗲𝘀?
• 𝗙𝗮𝘀𝘁𝗲𝗿 𝗘𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻: Built-in optimizations (the Catalyst optimizer) make DataFrames more efficient than RDDs.
• 𝗘𝗮𝘀𝗲 𝗼𝗳 𝗨𝘀𝗲: Simplified syntax and transformations streamline complex data workflows.
• 𝗩𝗲𝗿𝘀𝗮𝘁𝗶𝗹𝗶𝘁𝘆: Read and write data in multiple formats, making integration seamless.

#ApacheSpark #DataFrames #BigData #DataEngineering #DataProcessing
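Putting the steps together, here is a small end-to-end sketch that also applies the pro tip of defining the schema explicitly; the file paths and column names (apart from customer_id) are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# Explicit schema instead of inferSchema, per the pro tip above.
schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("order_total", DoubleType(), True),
])

df = spark.read.csv("/path/to/file.csv", header=True, schema=schema)

filtered_df = (
    df.withColumnRenamed("customer_name", "name")    # rename a column
      .filter("customer_id = 11599")                 # same predicate as the post
      .select("customer_id", "name", "order_total")  # keep only needed columns
)

filtered_df.write.mode("overwrite").parquet("/path/to/output")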
-
🚀 A Deep Dive into the Internals of Parquet: The Power Behind Efficient Data Processing 🚀

In the world of big data, the choice of file format is pivotal. Parquet, a columnar storage file format, excels in performance, storage efficiency, and fast querying. Here's a breakdown of what makes Parquet a game-changer:

📦 **Parquet's File Structure**
- **Row Groups:** The basic units of horizontal partitioning, enabling parallel processing for better performance.
- **Column Chunks:** Each row group is divided into column chunks, one per column, which is ideal for analytical workloads.
- **Pages:** The smallest units of data in Parquet:
  - Data Pages: contain the actual values.
  - Dictionary Pages: used for dictionary encoding, reducing file size.
  - Index Pages: metadata for efficient data access.

🧩 **Encoding and Compression Techniques**
- Techniques like Dictionary Encoding, RLE, Bit-Packing, Delta Encoding, and compression codecs (e.g., Snappy, GZIP) optimize storage at the page level.

🛠️ **Schema Evolution & Data Types**
- **Schema Evolution:** Allows schema modifications without rewriting existing data, perfect for dynamic datasets.
- **Complex Data Types:** Supports nested structures (lists, maps, structs), enhancing versatility across various data types.

🔍 **Optimizations: Predicate Pushdown & Statistics**
- **Predicate Pushdown:** Filters applied at the storage level improve query speed by skipping irrelevant data.
- **Statistics:** Metadata such as min/max values per column chunk helps readers decide which data to read.

🌐 **Ecosystem Support**
- Seamless integration with major big data tools:
  - Apache Spark (default format for structured data)
  - Apache Hive (efficient storage for large datasets)
  - Pandas/PyArrow (user-friendly Python interfaces)
  - Presto/Trino (fast querying in data lakes)

Explore how Parquet empowers efficient data processing and enhances big data operations! 🌟 I've attached a sample image below to illustrate the structure.

#Parquet #BigData #DataEngineering #ApacheSpark #DataStorage #DataProcessing
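A small PyArrow sketch of these ideas: dictionary encoding and Snappy compression at write time, row-group statistics, and filter-based reading that can skip row groups. The column names and sizes are made up.

import pyarrow as pa
import pyarrow.parquet as pq

# Toy table: a low-cardinality string column (good for dictionary encoding)
# and a numeric column.
table = pa.table({
    "stock": ["AAPL", "AAPL", "MSFT", "GOOG"] * 250_000,
    "amount": list(range(1_000_000)),
})

# Dictionary encoding + Snappy compression, with explicit row-group sizing.
pq.write_table(
    table,
    "trades.parquet",
    compression="snappy",
    use_dictionary=True,
    row_group_size=100_000,
)

# Row-group/column statistics (min/max) that enable predicate pushdown.
meta = pq.ParquetFile("trades.parquet").metadata
print(meta.row_group(0).column(1).statistics)

# Readers can skip row groups whose statistics rule out the predicate.
filtered = pq.read_table("trades.parquet", filters=[("amount", ">", 900_000)])
print(filtered.num_rows)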
-
📊 Addressing Data Skewness Using the Spark AQE Technique

Data skewness can be a significant bottleneck in data processing pipelines, especially when large datasets are involved. So how can we tackle this challenge effectively? Enter Spark AQE (Adaptive Query Execution), a powerful technique that dynamically adjusts query plans to optimize performance, even in the face of skewed data distributions.

🔍 Understanding Data Skewness: Data skewness occurs when certain key values appear far more frequently than others in a dataset, causing an imbalanced distribution. In the finance sector, imagine a stock trading platform processing transaction data: because some stocks are traded much more heavily than others, the data is skewed.

🚀 The Spark AQE Advantage: Spark AQE adapts execution plans based on runtime statistics, splitting oversized partitions into smaller tasks so that work is spread more evenly across executors, which mitigates the impact of skewed partitions. Let's illustrate this with our finance dataset example:

📈 Scenario: Our finance dataset holds transaction records for various stocks. Due to high trading volumes, the dataset is heavily skewed, with a few stocks accounting for the majority of transactions.

🔧 Traditional Approach: Without AQE, Spark allocates shuffle partitions statically, so the tasks that handle the skewed partitions take far longer than the rest, dragging down overall job performance.

✨ Spark AQE in Action: With AQE enabled, Spark detects skewed partitions at runtime and adjusts the execution plan, redistributing the work so that tasks handle comparable amounts of data. In our finance dataset, even though certain stocks have many more transactions, Spark keeps processing balanced and efficient.

💡 Key Takeaways:
1. Performance Boost: Spark AQE optimizes query execution, improving processing times and resource utilization.
2. Dynamic Adaptation: By adjusting to runtime conditions, AQE keeps processing efficient in the presence of data skewness.
3. Scalability: Whether in finance, healthcare, or retail, Spark AQE scales to handle diverse datasets and workloads.

📚 Learn more about Spark performance tuning in the official SQL Performance Tuning documentation: https://lnkd.in/dmuj7SgQ

📊 Conclusion: In the dynamic world of big data analytics, addressing data skewness is paramount. Spark AQE empowers organizations across sectors to unlock the full potential of their data, even in the face of skewed distributions.

Let's spark a conversation! How has data skewness impacted your analytics projects, and what techniques have you employed to mitigate it?

#DataSkewness #SparkAQE #BigData #Analytics #FinanceSector #DataScience #DataEngineering #Optimization
Performance Tuning
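As a rough sketch of how the skew handling discussed above is configured in PySpark: the table paths and column names are hypothetical, and the factor and byte threshold shown are the documented defaults.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("skew-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition counts as skewed if it is both larger than
    # factor * median partition size AND larger than the byte threshold.
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
    .getOrCreate()
)

trades = spark.read.parquet("/data/trades")      # heavily skewed by ticker
reference = spark.read.parquet("/data/tickers")  # reference data per ticker

# At runtime AQE detects oversized shuffle partitions (e.g. the most-traded
# tickers) and splits them into smaller tasks so executors stay balanced.
joined = trades.join(reference, "ticker")
joined.groupBy("sector").count().show()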
-
As someone who frequently works with data, I often have to understand new data sources or refresh my memory about a dataset's structure. For tabular data, we have SQL tools to enforce the data structure and UML to help us understand it. We are not always so lucky with nested data such as JSON, API responses, or NoSQL documents: neither data discipline nor documentation is guaranteed. As a data consumer, I opted for a "data as its own documentation" approach. I created functions that help navigate the structure of nested data and find the relevant fields as needed. By focusing on the structure, I can quickly scan a dataset while leaving the details aside, and I can use statistics to assess schema discipline and find the most prominent fields. I have cleaned up and refactored these functions for anyone to use. I am sure they can speed up your data exploration phase and help you deliver data with high confidence. https://lnkd.in/eYM9EsSf
GitHub - malokroman/nested-data-helper: Help you find the data nested deep in your data.
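As a flavour of the idea (this is not the repository's code, just a minimal illustration): a recursive helper that counts every key path in a list of JSON-like records, so you can see at a glance which fields are reliably present and which are rare.

from collections import Counter

def collect_key_paths(obj, prefix="", counter=None):
    """Recursively count every key path that appears in a nested JSON-like object."""
    if counter is None:
        counter = Counter()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            counter[path] += 1
            collect_key_paths(value, path, counter)
    elif isinstance(obj, list):
        for item in obj:
            collect_key_paths(item, f"{prefix}[]", counter)
    return counter

records = [
    {"user": {"id": 1, "tags": ["a", "b"]}, "amount": 10},
    {"user": {"id": 2}, "amount": 12, "extra": {"note": "rare field"}},
]

counts = Counter()
for record in records:
    collect_key_paths(record, counter=counts)

# Most frequent paths first: a quick view of schema discipline.
for path, count in counts.most_common():
    print(f"{path}: {count}/{len(records)}")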
-
𝗪𝗵𝗮𝘁 𝗶𝘀 𝗔𝗤𝗘 𝗶𝗻 𝗦𝗽𝗮𝗿𝗸?

𝗞𝗲𝘆 𝗣𝗼𝗶𝗻𝘁𝘀 𝘁𝗼 𝗥𝗲𝗺𝗲𝗺𝗯𝗲𝗿 𝗔𝗯𝗼𝘂𝘁 𝗔𝗤𝗘 (𝗔𝗱𝗮𝗽𝘁𝗶𝘃𝗲 𝗤𝘂𝗲𝗿𝘆 𝗘𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻):
- AQE is a feature introduced in Spark 3.0 that dynamically updates the query plan using runtime metadata.
- It is particularly useful for optimizing stages that involve data shuffling.

𝗛𝗼𝘄 𝗗𝗼𝗲𝘀 𝗔𝗤𝗘 𝗛𝗲𝗹𝗽?

𝗨𝘀𝗲 𝗖𝗮𝘀𝗲 𝟭: 𝗖𝗼𝗮𝗹𝗲𝘀𝗰𝗶𝗻𝗴 𝗨𝗻𝘂𝘀𝗲𝗱/𝗕𝗹𝗮𝗻𝗸 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀
- Imagine you've used a wide transformation, such as `groupBy(dept)`.
- If your DataFrame contains only 5 distinct departments, ideally you should end up with 5 partitions after shuffling. However, the default setting of `spark.sql.shuffle.partitions` is 200, resulting in 200 partitions.
- 𝘼𝙌𝙀 𝙙𝙚𝙩𝙚𝙘𝙩𝙨 𝙩𝙝𝙖𝙩 the other 195 𝙥𝙖𝙧𝙩𝙞𝙩𝙞𝙤𝙣𝙨 𝙖𝙧𝙚 𝙚𝙢𝙥𝙩𝙮 𝙖𝙣𝙙 𝙙𝙮𝙣𝙖𝙢𝙞𝙘𝙖𝙡𝙡𝙮 𝙘𝙤𝙖𝙡𝙚𝙨𝙘𝙚𝙨 𝙩𝙝𝙚𝙢, 𝙘𝙤𝙢𝙗𝙞𝙣𝙞𝙣𝙜 𝙩𝙝𝙚 𝙚𝙢𝙥𝙩𝙮 𝙥𝙖𝙧𝙩𝙞𝙩𝙞𝙤𝙣𝙨.

𝗨𝘀𝗲 𝗖𝗮𝘀𝗲 𝟮: 𝗥𝗲𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴/𝘀𝗽𝗹𝗶𝘁𝘁𝗶𝗻𝗴 𝗟𝗮𝗿𝗴𝗲 𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝘀
- Consider a scenario where one department holds 90% of the total employees after a `groupBy(dept)`, leading to one very large partition (which is not ideal).
- AQE identifies this imbalance and splits the oversized partition to manage skewness effectively.
- 𝗡𝗼𝘁𝗲: AQE splits a partition only when certain thresholds are crossed. Read the following for more details: https://lnkd.in/dTed-_FE

𝗨𝘀𝗲 𝗖𝗮𝘀𝗲 𝟯: 𝗕𝗿𝗼𝗮𝗱𝗰𝗮𝘀𝘁 𝗝𝗼𝗶𝗻 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻
- Suppose you have two DataFrames and, after applying filters, you plan to join them. Before execution, you might not know the size of each DataFrame post-filtering.
- AQE assesses the sizes at runtime and, if one side is smaller than the broadcast threshold you set, dynamically decides to broadcast that table.

🅸🆂🅽'🆃 🆃🅷🅴 🅰🆀🅴 🅵🅴🅰🆃🆄🆁🅴 🅰🅼🅰🆉🅸🅽🅶?

𝗪𝗼𝘂𝗹𝗱 𝘆𝗼𝘂 𝗹𝗶𝗸𝗲 𝘁𝗼 𝗸𝗻𝗼𝘄 𝗵𝗼𝘄 𝘁𝗼 𝗲𝗻𝗮𝗯𝗹𝗲 𝗮𝗻𝗱 𝘂𝘀𝗲 𝗶𝘁? If so, follow me for updates! I'll be sharing that information soon.

𝗣𝗦: This is not a copy-paste post; I'm sharing my understanding in the simplest words possible, so there might be some mistakes. If you spot any, don't hesitate to let me know. #learningInPublic is fun, after all.

Thank you for reading till the end. See you in the next post!

#happyLearning #learning #learningInPublic #spark #AQE #bigdata #dataengineering #distributedComputing
Performance Tuning
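As a quick preview of enabling it, here is a minimal PySpark sketch focused on Use Case 1 (partition coalescing); the path, column name, and advisory size are just placeholders, not recommendations.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-coalesce-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Start from the usual 200 shuffle partitions, then let AQE merge
    # small or empty ones toward this advisory target size.
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
    .getOrCreate()
)

employees = spark.read.parquet("/data/employees")  # only 5 distinct dept values

# Without AQE: 200 shuffle partitions, most of them empty.
# With AQE: empty/small partitions are coalesced after the shuffle.
dept_counts = employees.groupBy("dept").count()
dept_counts.show()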
-
https://lnkd.in/daBNTQw6 - A metadata-driven approach to building Delta Live Tables pipelines has a ton of possible applications:
- You can create whole, complex pipelines for your bronze source tables in a fully automated way, at scale, for numerous source tables (e.g. from CDC files loaded into S3 buckets).
- You can validate and monitor data quality at scale with expectations.
- You can apply proven patterns for loading and CDC processing (streaming Auto Loader with checkpoints) and then aggregate and join (static frames) within a single automated pipeline.
- Adding new rules and tables requires very few changes once the project is driven by config objects.
- You can isolate testable Spark code from "humble objects" (classes and functions that are difficult to test locally) in an elegant way, especially when you choose PySpark over SQL (I do).
- Final clean tables can be exposed to consumers as dynamic views in clean schemas or catalogs (no redundancy or data replication).
- Deployment, monitoring, and alerting can be fully automated with Asset Bundles and your favourite CI/CD tool.
#DataEngineering #MetadataETL #ETLonScale #Databricks #Spark #DeltaLiveTables #ChangeDataCapture #AutomatingLakehouse
Metadata driven framework for Delta Live tables
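A minimal sketch of the metadata-driven pattern (not the framework from the linked article): a config list drives a factory that registers one Auto Loader bronze table per entry, with expectations attached. Names, paths, and rules are hypothetical, and this is meant to run inside a DLT pipeline where `dlt` and `spark` are provided.

import dlt

# Hypothetical config; in practice this could come from JSON, YAML, or a control table.
TABLES = [
    {
        "name": "orders_bronze",
        "source_path": "s3://landing-bucket/orders/",
        "format": "json",
        "expectations": {"valid_order_id": "order_id IS NOT NULL"},
    },
    {
        "name": "customers_bronze",
        "source_path": "s3://landing-bucket/customers/",
        "format": "csv",
        "expectations": {"valid_customer_id": "customer_id IS NOT NULL"},
    },
]

def register_bronze_table(cfg):
    # Factory function so each loop iteration binds its own config.
    @dlt.table(name=cfg["name"], comment=f"Auto-loaded from {cfg['source_path']}")
    @dlt.expect_all_or_drop(cfg["expectations"])
    def _bronze():
        return (
            spark.readStream.format("cloudFiles")            # Auto Loader
            .option("cloudFiles.format", cfg["format"])
            .load(cfg["source_path"])
        )

for cfg in TABLES:
    register_bronze_table(cfg)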