I've been very quiet about this for a multitude of reasons, but I think we're finally in a position to start talking about it. About three months ago, my co-founder and I decided we wanted to build an embedded ETL platform to make enterprise exports more efficient and scalable. This is where Portcullis originally started, but over time we realized how different the ClickHouse ecosystem is from other warehouses on the market, from the performance to the tooling. Today, we've decided to pivot away from building just a single tool to building entire petabyte-scale solutions inside the ClickHouse ecosystem. We go feral for bringing clients' data projects to life, so if you're interested in exploring the fastest OLAP warehouse, whether that's a migration or a specific solution, shoot me a DM or visit the webpage: https://lnkd.in/e-jZn7xJ #clickhouse #petabyte #terabytes #data #analytics #snowflake #datawarehouse #cdp #mining #marketing #sales #gtm
🍉 James D. Bohrman’s Post
-
🚀🚀 I spent 5 hours learning how ClickHouse built their internal data warehouse. You might have heard of ClickHouse. You might know that ClickHouse is fast for both real-time and batch analytics. But here’s something you might not know: how the engineers at ClickHouse — the company behind one of the world’s most powerful OLAP systems — built their internal data warehouse. From a 10,000-foot view, here’s the initial tech stack that ClickHouse used to build their data warehouse:
◉ Airflow was used as the scheduler.
◉ AWS S3 served as the intermediate data layer.
◉ Superset was their BI tool and SQL interface.
◉ And, of course, ClickHouse Cloud was used as the database and processing engine.
Here’s the data flow:
◉ They captured 19 data sources into S3 buckets.
◉ The data was then inserted into the raw layer, maintaining the same structure as the source tables.
◉ Data transformations were performed using the ClickHouse engine, orchestrated by the Airflow scheduler.
◉ Transformed data was stored in mart tables, which represented business entities and met the needs of internal stakeholders.
◉ For data consumption, internal users queried the mart tables and created charts and dashboards using Superset.
💪 After a year, they introduced dbt into their stack to centralize data transformation logic. You can find my detailed article here: https://lnkd.in/gUeATcxu ♻️ If you find my work valuable, please repost it so it can reach more people. #dataengineering #dataanalytics #datawarehouse #clickhouse #olap
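To make the flow above concrete, here is a minimal sketch of the pattern (not ClickHouse's actual pipeline code): an Airflow-scheduled Python task that loads an S3 extract into a raw table and then builds a mart table, with both steps executed by the ClickHouse engine itself. The DAG, bucket, and table names are made up, and it assumes the apache-airflow and clickhouse-connect packages.

# A minimal sketch, not ClickHouse's actual pipeline code; names are hypothetical.
from datetime import datetime

import clickhouse_connect
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_and_transform():
    client = clickhouse_connect.get_client(
        host="your-clickhouse-cloud-host", username="default", password="..."
    )
    # Raw layer: same structure as the source, read straight from the S3 bucket.
    client.command("""
        INSERT INTO raw.crm_accounts
        SELECT *
        FROM s3('https://example-bucket.s3.amazonaws.com/crm/accounts/*.parquet', 'Parquet')
    """)
    # Mart layer: the transformation runs inside ClickHouse; Airflow only orchestrates.
    client.command("""
        INSERT INTO mart.account_daily_activity
        SELECT account_id, toDate(event_time) AS day, count() AS events
        FROM raw.crm_accounts
        GROUP BY account_id, day
    """)


with DAG(dag_id="internal_dwh_example", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False):
    PythonOperator(task_id="load_and_transform", python_callable=load_and_transform)

Moving the SQL in task bodies like this into dbt models is essentially the centralization step the post describes them making a year later.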
-
Hello everyone! Time for a quick new post. With limited time on my hands lately, I've decided to share my thoughts in shorter articles that I can refer back to in the future. ClickHouse has long been part of my data routine. I'm quite fond of this tool and its capabilities. This post is about the functionality ClickHouse offers to make it an integral part of your ETL processes or even replace them entirely. https://lnkd.in/eVerGf6h #clickhouse #data #etl #dataengineering
ClickHouse as part of ETL/ELT process
medium.com
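For a sense of what "ClickHouse as the ETL layer" can look like in practice, here is a hedged illustration (not taken from the linked article) of two features that often stand in for a separate pipeline tool: transforming on ingest with a materialized view, and extracting plus loading in a single statement with a table function. The host, credentials, and table names are invented, and it assumes the clickhouse-connect Python client.

# A hedged illustration; host, credentials, and table names are invented.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# "Transform" without an external tool: a materialized view reshapes rows
# as they arrive in the staging table.
client.command("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.orders_by_day
    ENGINE = SummingMergeTree ORDER BY day
    AS SELECT toDate(created_at) AS day, count() AS orders, sum(amount) AS revenue
       FROM staging.orders
       GROUP BY day
""")

# "Extract + load" in one statement: read an operational Postgres table directly,
# which also triggers the materialized view above on the inserted rows.
client.command("""
    INSERT INTO staging.orders
    SELECT * FROM postgresql('pg-host:5432', 'shop', 'orders', 'reader', 'secret')
""")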
-
🚀 What is ClickHouse and Why Does It Stand Out Among Data Warehouses? 🚀 The data world is abuzz with ClickHouse, a standout among data warehouses for its exceptional capabilities. If you deal with massive datasets and crave swift, scalable analytics, ClickHouse is your go-to solution. Here's why ClickHouse shines in the realm of data warehouses:
⚙️ Key Advantages of ClickHouse:
1️⃣ Blazing Fast Queries 🔥: Leveraging columnar storage and vectorized execution, ClickHouse processes billions of rows in mere seconds, delivering prompt insights.
2️⃣ Real-Time & High-Frequency Data 📈: Seamlessly handles streaming data and intricate aggregations, making it perfect for dashboards, app metrics, and monitoring needs.
3️⃣ Cost-Efficient & Scalable 💰: With its distributed, high-compression storage architecture, ClickHouse maintains a balance between low costs and high performance.
4️⃣ Optimized for Complex Queries 🌐: Particularly adept in scenarios with extensive data volumes and frequent aggregations, ClickHouse handles complex queries efficiently.
💡 Why Choose ClickHouse Over a Traditional DWH? While options like Redshift, BigQuery, and Snowflake are powerful, ClickHouse leads on ultra-low latency and real-time analysis at scale. It isn't built for transactional workloads, but for fast, cost-effective OLAP it is hard to beat. 🚀 #DataEngineering #RealTimeAnalytics #ClickHouse
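As a rough illustration of the first two points, here is the kind of dashboard-style aggregation ClickHouse is typically used for: a time-bucketed rollup with percentiles over a wide events table. The table and its columns are hypothetical, and the snippet assumes the clickhouse-connect Python client.

# A minimal sketch; the events table and its columns are hypothetical.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Only the referenced columns are read from disk (columnar storage), and the
# aggregation runs over vectorized blocks of values rather than row by row.
df = client.query_df("""
    SELECT
        toStartOfMinute(event_time) AS minute,
        count()                     AS requests,
        quantile(0.95)(latency_ms)  AS p95_latency
    FROM app_metrics.events
    WHERE event_time >= now() - INTERVAL 1 HOUR
    GROUP BY minute
    ORDER BY minute
""")
print(df.head())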
-
You know what gets me excited about data - performance! Hard to explain; it's like solving a puzzle. I love it. At Matatika, we saved all of our clients who switched from Fivetran a massive 99.4% on their GA4 costs. Awesome. But we also hugely improved performance for everyone moving data from BigQuery to Snowflake! You can read all the details in Reuben's article here: https://hubs.li/Q02XntVl0 Millions of rows an hour to millions a minute! Why does this matter? If your data platform uses row-based pricing, you could be paying far more than necessary. Fivetran charges based on the number of rows processed, which leads to inflated costs. At Matatika, we focus on cost-based pricing, meaning you only pay for the infrastructure you actually use. So with this huge performance improvement - you win and we win! With 99.99% uptime and real-time data insights, Matatika delivers high performance without the hidden fees or added complexity. Ready to stop overpaying for your data platform? #DataCostSavings #ETLPerformance #GA4Savings #MatatikaSuccess
Incredible BigQuery Extract Performance
medium.com
-
Yet another specific example of the performance improvements that Snowflake is making all the time. It's neat to see join decisions made at runtime rather than relying only on compile-time estimates. There are also some great technical descriptions of how Snowflake executes joins in this article. https://okt.to/E9aQU6
Query Acceleration Through Smarter Join Decisions
snowflake.com
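The article covers Snowflake internals, but the core idea can be shown with a toy example that is in no way Snowflake's implementation: defer the join strategy choice until the actual input sizes are known at runtime, instead of committing to a plan based on compile-time cardinality estimates. Everything below is invented purely for illustration.

# A toy illustration only, not Snowflake's implementation.
from typing import Dict, List, Tuple


def hash_join(build: List[Tuple], probe: List[Tuple]) -> List[Tuple]:
    # Classic hash join: build a hash table on one side, probe with the other.
    table: Dict = {}
    for row in build:
        table.setdefault(row[0], []).append(row)
    matches = []
    for row in probe:
        for hit in table.get(row[0], []):
            matches.append(hit + row)
    return matches


def adaptive_join(left: List[Tuple], right: List[Tuple]) -> List[Tuple]:
    # Runtime decision: both inputs are materialized here, so their true sizes
    # are known and the smaller side becomes the build side, instead of trusting
    # a compile-time estimate that may be badly wrong.
    if len(left) <= len(right):
        return hash_join(build=left, probe=right)
    return hash_join(build=right, probe=left)


customers = [(1, "Acme"), (2, "Globex")]
orders = [(1, "2024-01-01"), (2, "2024-01-02"), (1, "2024-01-03")]
print(adaptive_join(customers, orders))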
-
One of our founding engineers (GRANT P.) put out a great post recently on how we run "split queries" at Cotera. Often in analytics you run very similar queries back to back. For example, you might roll up some data by week at first, but then change your mind and want to go by month. Normally, both of these queries have to go all the way back to the data warehouse, despite 90% of the query being identical. This makes for slow and frustrating experiences for data consumers. So we built split queries to make this better. The general idea is that we identify the chunk of the query to reuse and cache it as a Parquet file. The follow-up query then uses the Parquet file with DuckDB in the browser to "finish off" the query, which feels almost instantaneous. It's somewhat similar to Cube's aggregation cache, if you're familiar with that. Check out the post if you're into this kind of thing! https://lnkd.in/d7cibcVM
How ERA brings last mile analytics to any data warehouse via DuckDB | Cotera
cotera.co
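Here is a rough Python stand-in for that idea (Cotera's real implementation runs DuckDB-WASM in the browser): the shared chunk of the query has already been cached as a Parquet file, so the follow-up rollup only has to finish off the aggregation locally. The file and column names are hypothetical, and the snippet assumes the duckdb Python package.

# A rough local stand-in; the cached file and column names are hypothetical.
import duckdb

# First request: weekly rollup built from the cached intermediate result
# instead of going all the way back to the warehouse.
weekly = duckdb.sql("""
    SELECT date_trunc('week', order_date) AS week, sum(revenue) AS revenue
    FROM 'cached_intermediate.parquet'
    GROUP BY week
    ORDER BY week
""").df()

# The user changes their mind and wants months instead: same cached file,
# no warehouse round trip, so the answer comes back almost instantly.
monthly = duckdb.sql("""
    SELECT date_trunc('month', order_date) AS month, sum(revenue) AS revenue
    FROM 'cached_intermediate.parquet'
    GROUP BY month
    ORDER BY month
""").df()
print(weekly.head(), monthly.head())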
-
Super excited to share Gigasheet's new write-back capabilities for Snowflake and Databricks. Gigasheet now not only lets users query and analyze data in real time (from 8+ data stores), it also lets them post clean, prepared data back. Gigasheet helps unify fragmented data, allowing users to join nearly any file into the warehouse (in a governed way), with support for over 40 file types, from CSVs and Excel to JSON and Apache logs, and files of 250GB+. https://lnkd.in/e8eJcfgT
Announcing Write-Back for Databricks and Snowflake with Gigasheet Enterprise
gigasheet.com
-
There is a lot of movement in the #Lakehouse stack these days. It is high time the community agreed on a common open catalog standard that can replace the increasingly dated #hivemetastore. For instance, Databricks (#DeltaLake ecosystem) open sourced #UnityCatalog last week: https://lnkd.in/d48Qvm44. Now, this catalog here by HANSETAG is an open source implementation of the #Iceberg catalog REST API, which comes with many bells and whistles for #enterprise deployments, such as secure credential mapping. These two are not necessarily mutually exclusive alternatives when you take into account the #Databricks acquisition of Tabular (now part of Databricks) and the plans they shared last week for convergence of #DeltaLake and #Iceberg: https://lnkd.in/dSC2zEf7 At the end of the day, people will want a catalog that supports all established table formats and comes with the critical enterprise bells and whistles!
🚀 𝐄𝐱𝐜𝐢𝐭𝐢𝐧𝐠 𝐓𝐢𝐦𝐞𝐬 𝐢𝐧 𝐭𝐡𝐞 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞 𝐖𝐨𝐫𝐥𝐝! 🚀 🆕 It's already Tuesday, and a suspicious absence looms: no new catalog has surfaced this week. But fear not❗ The 🧊 Iceberg Catalog landscape is evolving rapidly, with significant announcements from Snowflake and Databricks. At HANSETAG, we're thrilled to introduce 𝐓𝐈𝐏 - our 🦀 #Rust-native #OSS #Iceberg #REST Catalog designed for superior data quality, governance, and flexibility.
🌟 𝐓𝐈𝐏 𝐩𝐫𝐨𝐯𝐢𝐝𝐞𝐬:
• 𝐌𝐮𝐥𝐭𝐢-𝐭𝐞𝐧𝐚𝐧𝐜𝐲: Manage multiple projects and warehouses effortlessly.
• 𝐂𝐡𝐚𝐧𝐠𝐞 𝐄𝐯𝐞𝐧𝐭𝐬: Stay updated with every table change.
• 𝐂𝐨𝐧𝐭𝐫𝐚𝐜𝐭 𝐕𝐚𝐥𝐢𝐝𝐚𝐭𝐢𝐨𝐧: Ensure seamless data operations.
• 𝐂𝐮𝐬𝐭𝐨𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧: Extend TIP with ease using our built-in and public interfaces.
Explore more: https://lnkd.in/ev5N_WgG #DataLakehouse #DataGovernance #OpenSource #Rust #IcebergCatalog #DataContract #DataMesh #DataProduct #Iceberg
Iceberg Catalog: The TIP of your Lakehouse
medium.com
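For a sense of why a REST-based Iceberg catalog is attractive, here is a hedged sketch of a client discovering and reading tables through such an endpoint; any engine that speaks the Iceberg REST API could do the same against the same catalog. The endpoint, warehouse, and table names are made up, and the snippet assumes the pyiceberg package (TIP is mentioned only as one possible deployment behind the URI).

# A hedged sketch; endpoint, warehouse, and table names are made up.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",  # any Iceberg REST catalog endpoint, e.g. a TIP deployment
        "warehouse": "analytics",
    },
)

print(catalog.list_namespaces())            # discover what the catalog governs
table = catalog.load_table("sales.orders")  # resolve table metadata via the REST API
print(table.scan().to_pandas().head())      # read the data the metadata points to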
-
🔔 Exciting Announcement! In collaboration with dataMinds.be and hosted by Schoenen Torfs, we warmly welcome all data enthusiasts to our next knowledge-sharing event: "The Analytical Landscape at Torfs & Database, the SeQueL: Should I Still Go SQL?" 🚀 Curious about what's in store?
🌟 Session 1: Join Jasper Verhulst, product owner and data platform specialist at Torfs, to learn from Torfs' journey of becoming more data-driven. Explore data sources, architecture evolution, analytical reports, and delve into Torfs’ merchandising engine.
🌟 Session 2: Dive into SQL's relevance with Sander Allert, data platform architect. Discover SQL's evolution from traditional platforms to cloud-based ones like Azure. Explore trends like dbt and applications like Microsoft Fabric. Beware of Python in the garden of Eve! 🐍
🗓️ Tuesday 16th of April
🚩 De Zaat (Temse)
✍️ Don't miss this opportunity to expand your analytical horizons! Register now and secure your spot. #DataAnalytics #SQL #Azure #Torfs #Databricks #Networking
The Analytical Landscape at Torfs & Database, the SeQueL: Should I Still Go SQL?🚀 — Plainsight
plainsight.pro
-
The future of enterprise BI? It's as simple as using a spreadsheet. Check out Gigasheet!