Philippe Noël’s Post

Co-founder, CEO @ ParadeDB | PostgreSQL for Search & Analytics

1mo

It's a new week, it's a new MaterializedView newsletter. If you're into data infra, I *highly* encourage reading this one. It's on DuckDB and data warehouses. As always, Chris Riccomini shines with a clear, thoughtful, and pragmatic analysis.

On Chris' writing, DWHs, and Postgres:

First, I'm new to the data infra space. When I first spoke to Chris 13 months ago, I knew nothing. We wanted to build an Elasticsearch competitor on Postgres, and as Chris pointed out, we were "green." We had everything to learn. We still have everything to learn. Reading MaterializedView has been an incredible source of pragmatic analysis on the data infra space. In a space where cargo culting is prevalent, Chris keeps it real. I attribute a lot of our learning progress to Chris' advice. Seriously, read it.

Building a startup is like finding a needle in a haystack. The world is filled with tools; can you find where there's a gap? Then fill the gap and start pushing against complacent solutions to widen it into a big opportunity. Data infra in particular is a space where every tool does 90% of the same thing, but the 10% that's different matters tremendously. That 10% is "the gap". It's where there's room to innovate, in small, unsexy ways that can grow to be big over time.

We started building an Elastic competitor because, well, Elastic has many flaws. It's great, but it has many flaws. So many investors have told us to build a data warehouse. The analytics market is so big, they say. "Phil, ES is $10B, Snowflake is $50B!" But as Chris points out: DWHs are good. People love them. They're expensive and good. They're expensive *because* they are good. We decided to focus on a "smaller" but more painful problem instead.

We also played a lot with DuckDB. As Chris points out, it's not a DWH. And competing on price is hard. Meanwhile, Postgres is eating the small workloads market because, well, people love it and it's so extensible.

To build a Postgres product, you need it to be 100% Postgres. Not 99%, but truly 100%. We learned this the hard way over the last year. We've kept our focus narrow (a better Elastic), and we are soon releasing fast on-disk analytics in Postgres. The first version will come out with v0.12, this week. And we're building it in a true Postgres way. No shortcuts. It won't be a data warehouse. It won't be DuckDB-based. It's still part of our quest to build a better Elastic competitor. We're calling it fast facets.

To stay up to date, follow our main repository, `paradedb/paradedb`. And to get a truly fresh read on the data infra space, read @materializedview. Chris' writing distills products and market dynamics into their core principles, and those are how teams and products win.

2 Comments

Matt Green

Building Denormalized, the easiest way to work with real-time data

1mo

Well said -- link for the lazy: https://github.com/paradedb/paradedb


Tarik Remila

Startup Investor

1mo

🔥


More Relevant Posts

Crunchy Data

5,821 followers
2mo
Query your Iceberg tables with your Postgres!

The Iceberg data format is already prominent in the big-data space, mainly for accessing data lakes. Recently, it has gained momentum in all facets of the data industry. In the future, we expect Iceberg to be a tool used by application developers and traditional DBAs. (Aside: AWS has a product called “Glacier”. Iceberg and Glacier are entirely different.)

Iceberg: The Basics

Iceberg is an open table format specification. Its purpose is to let you interact with files as if they were databases. The specification defines metadata for organizing the files and the structure within them. Iceberg is designed for large data sizes (i.e. big data) and supports analytical workloads on top of that. Because Iceberg is a specification, any data tool can integrate with Iceberg tables.

How is Iceberg different from other files-on-disk? Iceberg offers several advantages usually associated with traditional databases. These features include:

- querying with smart query scan planning
- schema evolution
- hidden partitioning
- version rollback

Iceberg & Parquet

Iceberg tables are stored on disk or in cloud object storage, mainly S3-compatible storage. The most popular data format is sets of Parquet files. Parquet is a columnar file format with built-in compression, optimized for analytical queries. Storing Parquet files in an open table format provides a crucial benefit: interoperability. This allows many tools and query engines to access the same data. For example, Spark jobs can process large-scale data transformations from the same Iceberg table that Postgres queries for analytics.

**Why is a Postgres company talking about Iceberg?**

Crunchy Data is betting on Iceberg, and we are betting on the interoperability of Crunchy Postgres with Iceberg tables. In July 2024, we launched the ability to query Iceberg tables with Postgres. Companies with data scientists and analysts use SQL to query Iceberg tables from their warehouse. You can use Postgres for the ET (extract-transform) of your Iceberg tables. Since launch, we have released additional Postgres + Iceberg features, and we have more on the roadmap. Iceberg will be a foundation of the data ecosystems of the future.
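The "version rollback" feature listed in the post can be pictured with a toy model. The sketch below is not Iceberg's actual metadata layout or API; the `ToyTable` class, its methods, and the file names are hypothetical and exist only for illustration. It assumes just the general idea that a table format records each committed version as an immutable list of data files, so rolling back means pointing the table at an earlier list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy table format: every commit records an immutable snapshot of file names."""

    # Snapshot 0 is the empty table (hypothetical convention for this sketch).
    snapshots: list = field(default_factory=lambda: [()])
    current: int = 0  # index of the snapshot that readers see

    def append(self, *files):
        """Commit a new snapshot: the current snapshot's files plus the new ones."""
        new_snapshot = self.snapshots[self.current] + tuple(files)
        self.snapshots.append(new_snapshot)
        self.current = len(self.snapshots) - 1
        return self.current

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot; no data file is rewritten."""
        if not 0 <= snapshot_id < len(self.snapshots):
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current = snapshot_id

    def files(self):
        """Return the data files visible in the current snapshot."""
        return self.snapshots[self.current]


table = ToyTable()
first = table.append("part-0001.parquet")
table.append("part-0002.parquet")
table.rollback(first)
print(table.files())
```

Because older snapshots are never modified, the rollback is a metadata-only change in this model, and the later snapshot stays available if it is needed again.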
Craig Kerstiens

Product at Crunchy Data
2mo
In my career there have been a few things that you could tell would change the landscape. Modern frameworks (Rails, Django, Laravel) were that, Git was that, the cloud was that. Iceberg is no doubt in that category. If you're just waking up to it, or if you thought it was that AWS S3 storage thing, then now is a great time to dig in and learn a bit. Or just roll up your sleeves and start working with it.
Martin Goebbels

Over 20 years dedicated to data analytics and related activities, tools change but foundations stay. Good data quality may not make your decisions better ones, but bad data quality will definitely make them worse.
5mo
Read this article by Uli Bethke about the Snowflake Polaris Catalog and Iceberg tables. He has a remarkable ability to get to the point. #snowflake #dataengineering #dataarchitecture

Uli Bethke

Follow me for SQL Data Pipelines, Snowflake, Data Engineering, XML Conversion
5mo Edited

Iceberg Ahead! All you need to know about Snowflake's Polaris Catalog

Read the post to get all the details. Here are the key takeaways.

🔍 Understanding Polaris and Horizon: The Polaris Catalog is not a Data Catalog. The native Data Catalog in Snowflake is called Horizon. The Polaris Catalog is a technical catalog that needs to be understood in the context of Apache Iceberg. Apache Iceberg is an open table format. The original open table format is Hive and the Hive Metastore. Iceberg addresses a lot of the limitations of Hive. Open table formats make interoperability between different compute and data processing engines possible. Store your data in the same format (Iceberg). Process it with different engines, e.g. Snowflake, Dremio, Spark etc. Open table formats address the problem of data sprawl and multiplication of data copies. They also address vendor lock-in to proprietary storage formats.

🌐 Snowflake & Apache Iceberg: Snowflake added support for Apache Iceberg a while ago with the option of using a proprietary Snowflake catalog or an external catalog. But there was a catch. Using the external catalog, Snowflake could not write to Iceberg. Using the Snowflake catalog, external engines could not write to Iceberg, and reads were limited to Spark via a separate SDK. The Polaris Catalog lifts that limitation. Snowflake and external engines can now read from and write to Iceberg. By announcing the Polaris Catalog, Snowflake has made clear that they support open standards in general, and open table formats and the interoperability between different tools that they bring in particular.

🔒 Security Model: You can define the security model and access policies inside Snowflake. The question is whether this also extends to other compute engines, or whether you need to define it separately each time, leading to a multiplication of the metadata.

⚠️ Feature Parity: There is no feature parity between Snowflake native tables and Iceberg tables. A lot of features such as Dynamic Tables, Cloning etc. are not supported (yet?).

💡 Some use cases for Polaris Catalog:
- Data sharing between organizations will be much easier. Imagine data in the Snowflake Data Marketplace accessible to organizations that use Spark, Dremio, Trino etc.
- Offloading workloads to other engines for cost, performance, or skillset reasons, e.g. run ETL on Snowflake and data science on AWS SageMaker.
- You need to support multiple tools and compute engines inside your organization, e.g. one business unit uses AWS Athena, another Snowflake.

🔍 Recommendation: For now, do not go all in on Iceberg. Wait until there is more clarity on the support for all features on both native and Iceberg tables. For now, use Iceberg for the use cases I have outlined and make sure to check if you are impacted by the feature limitations.

Book recommendation: Apache Iceberg Definitive Guide by Alex Merced et al. https://lnkd.in/ebTs5qmB

Iceberg Ahead! All you need to know about Snowflake's Polaris Catalog - Sonra

sonra.io
Aniket Mane

VP, Data Engineering and Enterprise applications
4mo
The Evolution of ThredUp's Data Platform: Laying the Foundation (2010-2015) (Part 1 of 3) 🚀

This is the first post in a three-part series detailing how ThredUp’s data platform evolved from our early startup days through to IPO and beyond. In this post, I'll cover the foundation years, from 2010 to 2015, when we built the initial data platform that would support our growing business.

Our data journey at ThredUp began in 2010 with a solid foundation on AWS MySQL, where we built Looker BI on top of MySQL replicas. Using a star schema (denormalized) modeled in Looker’s LookML, we created what we called "God models" for our core business entities. With no window functions in MySQL, we relied on creative LEFT JOIN and subquery tricks to derive crucial insights like 1st and 2nd orders, NP (Never Purchased), 1TP (Total Purchase) users, and more. This setup worked well for current data, though historical queries were a bit challenging. That was fine, because our analytics were mostly focused on operational reporting on actual transactions that happened (e.g., user sign-ups, orders placed, bags ordered, etc.).

In late 2014, as our data volumes and needs grew 📈, we transitioned to AWS Redshift. This move was a game-changer: Redshift’s distributed, columnar storage and processing power allowed us to keep our star schema design within Looker, simplifying our derived tables. During this period, we also began capturing behavioral analytics data, such as clickstream, which came with large volumes of data, perfect for Redshift’s capabilities. To analyze user patterns and product funnels effectively, we created a backend fact table in Redshift for sessionizing this data. This allowed us to perform detailed session analytics, uncovering valuable insights into user behavior and further enhancing our understanding of customer journeys 🛤️

Additionally, we moved from using Amazon Data Pipelines for each data transfer from MySQL to Redshift, to using Fivetran for binlog replication to Redshift. At that time, Fivetran was very new, but it significantly simplified our data pipelines and made the replication process more efficient.

These years were all about laying the groundwork and setting the stage for what was to come next: a period of rapid expansion and complexity in our data landscape.

🙏 I want to take a moment to thank all the amazing folks who contributed to this era of ThredUp's data platform (2010-2015). Your dedication, leadership and hard work laid the foundation for everything that followed. I’ll be tagging some of you here, and feel free to chime in with your memories and insights from this time! Michael Santhanam Al Ghorai Chris Homer Cameran Hetrick

(Stay tuned for Part 2, where I’ll dive into how our data science needs led us to explore new platforms and tools!)
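The sessionizing step mentioned in the post can be illustrated with a small sketch. The post does not say how ThredUp defined a session, and the real work ran in Redshift rather than Python; the 30-minute inactivity threshold, the `sessionize` function, and the sample events below are all assumptions made for illustration. The idea shown is the common one: sort each user's events by time and start a new session whenever the gap since the previous event exceeds the threshold.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold; the post does not state the rule that was used.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events):
    """Assign a per-user session number to (user_id, timestamp) events.

    Returns a list of (user_id, timestamp, session_number) tuples,
    sorted by user and then by time.
    """
    result = []
    last_seen = {}   # user_id -> timestamp of that user's previous event
    session_no = {}  # user_id -> that user's current session number
    for user_id, ts in sorted(events):
        previous = last_seen.get(user_id)
        # First event for the user, or too long since the last one: new session.
        if previous is None or ts - previous > SESSION_GAP:
            session_no[user_id] = session_no.get(user_id, 0) + 1
        last_seen[user_id] = ts
        result.append((user_id, ts, session_no[user_id]))
    return result


# Hypothetical clickstream events.
clicks = [
    ("u1", datetime(2015, 1, 1, 9, 0)),
    ("u1", datetime(2015, 1, 1, 9, 10)),
    ("u1", datetime(2015, 1, 1, 10, 5)),  # 55 minutes after the previous click
    ("u2", datetime(2015, 1, 1, 9, 5)),
]
for row in sessionize(clicks):
    print(row)
```

Each output row carries a session number, which is the kind of column a session-level fact table could be grouped on for funnel analysis.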

Derick Schmidt

Head of Product: Client Data Platform at Capitec Bank
10mo
Thanks David Yaffe, this is the best explanation that I have read in a long time. “Think of a database as something that keeps (mutable) state. A data warehouse is more like a collection of immutable facts; it’s meant to keep history and is read-optimized.” #dataengineering #datawarehousing

David Yaffe

Co-Founder at Estuary, Previously Co-Founder of Arbor (Acquired by LiveRamp)
10mo

Is the Modern Data Stack dying, or turning inside out?

A funny thing happened a while back. Snowflake added change data capture support. We support it now, because customers asked to extend their basic ELT pipelines by pushing to new destinations such as databases, vector DBs, SaaS, and other compute engines, to process data not just for analytics, but for real-time operations, or for AI model training and execution.

About the same time the modern data stack was taking off, Martin Kleppmann was talking about turning the database inside out, which was a big inspiration for us when we created Estuary Flow.

Think of a database as something that keeps (mutable) state. A data warehouse is more like a collection of immutable facts; it’s meant to keep history and is read-optimized. But what if you focused instead on working with historical data or events, changes, or facts, as they arrive? For a database, that’s a transaction or write-ahead log (WAL). Change data capture exposes that stream from a database, turning the database inside out. It’s possible to store that stream as a new log and keep adding to it forever. Joining streams together to form new ones enables real-time materialized views, and it’s possible to create whatever pre-computed state you want, enabling arbitrary views. Kleppmann talks about replication, secondary indexes, caching, and materialized views all as derived, up-to-date, real-time “inside-out” views of data optimized for specific queries.

This is exactly what’s happening to the modern data stack. It’s starting to go real-time and turning inside out. Materialized views have already been happening, as have caches. Snowflake, Databricks and data lakes, Amazon Athena, Starburst, and others can be used for data processing. They’re not quite real-time, but newer entrants like Materialize can provide a real-time materialized view.

Back in 2014 the Gazette open source project was created to manage streams and batch data together as data with schema, inside out. It eventually became the foundation of Estuary for real-time ETL and CDC. A collection is a durable, append-only cloud store of a stream with exactly-once, transactionally guaranteed delivery, just like a WAL. But you can also create new derived views with state called, you guessed it, derivations. These derivations are created by compute engines using SQL, TypeScript, and (soon) Python. Companies use them to do all kinds of processing for data warehouses, but also for real-time operational analytics, search, or processing data for AI. You can connect to many sources, streaming or batch, and to many targets (a data warehouse, Elastic, MongoDB, or hundreds of others), streaming or batch.

The modern data stack isn’t dead. Like Jurassic Park and other software, it’s ... found a way. It’s evolving and turning inside out, becoming compute engines with state, all wired together as streaming data using something like Estuary, to support real-time analytics and AI use cases.
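The "inside out" idea in the post, where state is derived by replaying a log of changes, can be sketched in a few lines. This is a toy illustration only: the event shape (`op`, `key`, `row`) and the function names are hypothetical and do not represent Estuary's, Snowflake's, or any CDC tool's actual format. It assumes each change event carries the full new row for inserts and updates.

```python
def apply_change(view, change):
    """Apply one change event to a key -> row dictionary (the derived view)."""
    op, key, row = change["op"], change["key"], change.get("row")
    if op in ("insert", "update"):
        view[key] = row
    elif op == "delete":
        view.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return view


def materialize(log):
    """Rebuild current state by replaying the append-only log from the start."""
    view = {}
    for change in log:
        apply_change(view, change)
    return view


# Hypothetical change log, as a CDC stream might expose it.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Lin", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "key": 2},
]

users = materialize(change_log)
print(users)

# A further view derived from the first one: number of users per plan.
plan_counts = {}
for row in users.values():
    plan_counts[row["plan"]] = plan_counts.get(row["plan"], 0) + 1
print(plan_counts)
```

The log is the source of truth here, and both dictionaries are disposable views that can be rebuilt at any time. Calling `apply_change` on each new event as it arrives, instead of replaying from the start, is what would keep such a view up to date continuously.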
Chuck Larrieu Casias

Conduktor - Real-time data management
9mo
This is insightful! I was also inspired by Kleppmann’s “turn the database inside out” idea in my days at Confluent. In most implementations, it does come at a cost: you have to downgrade to eventual consistency, and you now have to write, maintain, and operate code to transform and manage state over an append-only event log.

Materialize takes a different approach, still based on CDC, but letting the developer treat it as a standard database. Instead of developers having to deal with an immutable log as a primitive, that real-time machinery is abstracted away. It presents simply as a standard database that developers interact with using standard SQL, and it happens to keep those query results up to date in real time. Under the hood, it has a global, virtual timeline that serializes all the changes, so the outputs are exactly consistent with the inputs at all points in the timeline.

You don’t have to accept eventual consistency, and you don’t have to rearchitect everything to act on an immutable event log. You can simply do standard CRUD operations on your transactional database, and those inserts, updates, and deletes are reflected in Materialize in near real time.

CampusMonk

3,197 followers
2mo
🚀 **Understanding the Types of Databases** 💾

Databases are the backbone of modern applications, enabling organizations to store, retrieve, and manage data efficiently. Here’s a look at the most common types of databases every developer and data professional should know:

1️⃣ **Relational Databases (RDBMS)**
- Structured data is stored in tables (rows and columns) with predefined schemas.
- Popular examples: **MySQL**, **PostgreSQL**, **Oracle**, **SQL Server**.
- Ideal for structured data with complex relationships and transactional consistency.

2️⃣ **NoSQL Databases**
- Designed to handle unstructured, semi-structured, and large-scale data.
- Categories include document, key-value, wide-column, and graph databases.
- Popular examples: **MongoDB** (document), **Cassandra** (wide-column), **Redis** (key-value), **Neo4j** (graph).
- Great for flexibility, scalability, and high-velocity data.

3️⃣ **In-Memory Databases**
- Store data in system memory (RAM) for faster access and performance.
- Popular examples: **Redis**, **Memcached**.
- Used for caching, real-time data analytics, and session management.

4️⃣ **Object-Oriented Databases**
- Store data as objects, similar to object-oriented programming.
- Suitable for applications requiring complex data relationships and integration with object-oriented languages.
- Popular examples: **db4o**, **ObjectDB**.

5️⃣ **Graph Databases**
- Store data in graph structures with nodes, edges, and properties, representing relationships between entities.
- Popular examples: **Neo4j**, **Amazon Neptune**.
- Best for applications like social networks, fraud detection, and recommendation engines.

6️⃣ **Columnar Databases**
- Store data in columns rather than rows, optimized for read-heavy workloads and analytical queries.
- Popular examples: **Amazon Redshift**, **ClickHouse**, **Google BigQuery**.
- Great for big data applications, distributed data, and high-performance analytics.

7️⃣ **Time-Series Databases**
- Optimized for time-stamped or time-series data, commonly used in IoT, monitoring, and financial systems.
- Popular examples: **InfluxDB**, **TimescaleDB**.
- Best for applications requiring large-scale data with a time component.

8️⃣ **NewSQL Databases**
- Combine the scalability of NoSQL with the consistency and ACID transactions of traditional SQL databases.
- Popular examples: **Google Spanner**, **CockroachDB**.
- Used for distributed applications requiring both horizontal scalability and strong consistency.

Each type of database is suited for specific use cases, and understanding these types is key to choosing the right one for your application!

Follow CampusMonk for more insights into databases and tech! 💡

#Database #RDBMS #NoSQL #SQL #GraphDatabases #BigData #CloudComputing #DataEngineering #DataStorage #TechSkills #DeveloperLife #LinkedInTech #TechInsights
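The difference between row storage and column storage described in the list can be made concrete with a small sketch. This is an in-memory illustration with hypothetical data and a hypothetical `to_columns` helper, not how any particular database lays out bytes on disk. It shows the same three records held once as rows and once as columns, and the same aggregate computed from each.

```python
# Row-oriented layout: one record per entry, all fields together.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 30.0},
    {"order_id": 2, "customer": "bo", "amount": 12.5},
    {"order_id": 3, "customer": "ana", "amount": 7.5},
]


def to_columns(records):
    """Pivot a list of row dictionaries into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: one list per field, values of a field stored together.
columns = to_columns(rows)

# Row layout: the aggregate has to visit every full record.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: the same aggregate reads only the one list it needs.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

An analytical query that touches one field out of many only has to read that field's values in the column layout, which is the reason given in the post for why columnar systems suit read-heavy analytics.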
T Likesh

Data Engineer | Python | SQL | AWS | PySpark | MongoDB
9mo
🚀 Get ready to elevate your data systems to new heights with effective data modeling. Take a quick look at my latest article, MongoDB Data Modeling: Practical Approaches for Real-world Applications. Taking a few minutes to explore something new isn't a bad idea at all. Make sure to check it out and discover more!

Data modeling is the backbone of every successful application, ensuring efficiency, scalability, and cost-effectiveness. In my comprehensive guide, I break down essential MongoDB data modeling concepts and strategies tailored for real-world scenarios. From understanding flexible schema models to optimizing schema design patterns, you'll gain practical insights to craft efficient and scalable data models.

Here's what you'll find in the blog:

🔍 Understanding the fundamentals of data modeling and its importance for every application.
📊 Exploring MongoDB data storage and how data is organized into collections.
💡 Identifying the workload of your application and mapping schema relationships effectively.
🔄 Choosing between embedding and referencing methods for related data.
🛠️ Applying design patterns to optimize your data model for various access patterns.
🔍 Exploring the subset pattern and its benefits for organizing versatile data models.

Whether you're building an e-commerce platform, a social media app, or any other application reliant on data, mastering data modeling is key to success. Don't miss out on this opportunity to level up your skills and build efficient data systems that align with your organization's needs. Check out the full article here and start revolutionizing your data models today!

#DataModeling #MongoDB #DataManagement #dataengineering https://lnkd.in/g4sqsDWu

MongoDB Data Modeling: Practical Approaches for Real-world Applications

medium.com
Boniface Munga

Software Engineer💻 | Python, SQL, Gen AI | Crafting Robust, Scalable and Secure software Solutions
2mo
🚀 Thrilled to announce another milestone on my Data Engineering journey! 🚀

I just completed the "Introduction to Snowflake" course on DataCamp, and it was a deep dive into one of the most powerful Data Cloud solutions, just as Sridhar Ramaswamy describes it 😃. Here’s what I covered in detail:

🔹 Understanding Snowflake’s Unique Architecture
Snowflake stands out for its cloud-first architecture, which decouples compute and storage, offering both scalability and flexibility. I learned about the various architectural layers, including virtual warehouses, and how this design allows Snowflake to process data faster and more efficiently than many traditional data warehouses. It was fascinating to see how it stacks up against its competitors in the cloud data space!

🔹 Snowflake SQL Mastery
This was a major highlight! I started with the basics, covering data definition language (DDL) and data manipulation language (DML) commands. I also learned how to connect to Snowflake, stage data, and structure databases effectively. Next, I got hands-on with Snowflake SQL:
- Exploring joins (including NATURAL and LATERAL JOINs)
- Writing subqueries and leveraging Common Table Expressions (CTEs) for reusable queries
- Using Snowflake’s powerful string, date, and time functions for data transformation and manipulation
- And my personal favourite: working with semi-structured data (like JSON), which is increasingly critical in modern data engineering pipelines.

🔹 Query Optimisation Techniques
Another key takeaway was mastering query optimisation. I learned strategies for early filtering, reducing query runtime, and using Snowflake's query history to analyse and improve performance. This will be a game-changer when working with large datasets in real-world scenarios, as I can now ensure my SQL queries are both fast and efficient.

🔹 Handling Data at Scale
The course went beyond traditional relational database concepts, teaching me how to handle massive volumes of data with ease using Snowflake’s cloud infrastructure. Whether it’s structured or semi-structured data, I now feel confident working with Snowflake in data pipelines and cloud ecosystems.

💡 Takeaways
What really excites me about Snowflake is how it’s designed for the future of data: scalable, cloud-based, and perfect for managing complex datasets across multiple environments. Whether for analytics, reporting, or machine learning workflows, I can now leverage Snowflake's capabilities to deliver fast, reliable results.

🚀 Next Steps
I’m looking forward to deepening my skills with more advanced data engineering tools and techniques, but this course has given me a solid foundation to start building robust, scalable data systems. Shoutout to the amazing instructors and DataCamp for this hands-on learning experience! 🚀

#DataEngineering #Snowflake #CloudData #SQL

Boniface Munga's Statement of Accomplishment | DataCamp

datacamp.com

