Alex Vesa’s Post

🌐 Co-founder & CTO @CogniSync | Senior AI Engineer | Code Architect | MLOps - Deep diver into complex AI paradigms for over a decade.

7mo

𝘚𝘵𝘳𝘦𝘢𝘮𝘪𝘯𝘨 𝘗𝘪𝘱𝘦𝘭𝘪𝘯𝘦𝘴 - 𝘵𝘩𝘦 𝘤𝘩𝘦𝘳𝘳𝘺 𝘰𝘯 𝘵𝘰𝘱 𝘰𝘧 𝘢𝘯 𝘓𝘓𝘔 𝘱𝘳𝘰𝘫𝘦𝘤𝘵 🍒

LLM projects often deal with a massive, never-ending stream of data: think social media feeds, news updates, or code repositories. 𝐀 𝐬𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 𝐩𝐢𝐩𝐞𝐥𝐢𝐧𝐞 is built to handle this constant flow, so your LLM never has to block on huge data dumps. It processes and embeds information on the fly, keeping your model up-to-date.

Bytewax is organized around a central streaming flow, the "graph" of your pipeline. Think input() -> process() -> output(). 😎 In my case, I ingested posts, articles, and code from RabbitMQ, cleaned them, chunked them, and embedded them for a Qdrant vector DB (feature store).

𝐅𝐥𝐞𝐱𝐢𝐛𝐢𝐥𝐢𝐭𝐲 𝐢𝐬 𝐊𝐞𝐲 🔑
The beauty of Bytewax? It handles diverse data types. We use a dispatcher to ensure posts, articles, and code are processed differently. Pydantic models ensure data validation at each step. 👌

Why the streaming pipeline with Bytewax? 👇
🔻 𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐏𝐨𝐰𝐞𝐫: Built in Rust for lightning speed!
🔻 𝐏𝐲𝐭𝐡𝐨𝐧 𝐏𝐚𝐫𝐚𝐝𝐢𝐬𝐞: Python bindings for all your favorite ML libraries.
🔻 𝐄𝐚𝐬𝐲 𝐁𝐫𝐞𝐞𝐳𝐲 𝐒𝐞𝐭𝐮𝐩: Plug-and-play, perfect for notebooks and projects.
🔻 𝐂𝐨𝐧𝐧𝐞𝐜𝐭𝐨𝐫 𝐂𝐡𝐚𝐦𝐩𝐢𝐨𝐧: 🔌 Out-of-the-box connectors for Kafka and more (or build your own!).

If you're curious to level up your knowledge about streaming pipelines and data engineering 👊 𝗖𝗵𝗲𝗰𝗸 𝗼𝘂𝘁 𝐋𝐞𝐬𝐬𝐨𝐧 𝟒 𝐨𝐟 𝐃𝐞𝐜𝐨𝐝𝐢𝐧𝐠 𝐌𝐋 𝐋𝐋𝐌 𝐓𝐰𝐢𝐧 𝐂𝐨𝐮𝐫𝐬𝐞. It's FREE, and no registration is required ↓↓↓
🔗 𝘓𝘦𝘴𝘴𝘰𝘯 4 - https://lnkd.in/d32d9HUV
🔗 𝘓𝘓𝘔 𝘛𝘸𝘪𝘯 𝘎𝘪𝘵𝘩𝘶𝘣 𝘙𝘦𝘱𝘰𝘴𝘪𝘵𝘰𝘳𝘺 - https://lnkd.in/dtTeZHN7
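To make the input() -> process() -> output() idea and the dispatcher concrete, here is a minimal sketch in plain Python. It is not the course's code and does not call Bytewax, RabbitMQ, Qdrant, or Pydantic: dataclasses stand in for the Pydantic models, a list stands in for the queue, and `embed` is a hypothetical placeholder that hashes text instead of calling a real embedding model. All names (`RawDoc`, `Chunk`, `HANDLERS`, `run`) are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class RawDoc:
    kind: str      # "post", "article", or "code"
    content: str


@dataclass(frozen=True)
class Chunk:
    kind: str
    text: str


def clean_text(text: str) -> str:
    # Collapse all runs of whitespace for prose-like documents.
    return " ".join(text.split())


def clean_code(text: str) -> str:
    # Keep indentation and line breaks; only drop trailing whitespace.
    return "\n".join(line.rstrip() for line in text.strip("\n").splitlines())


def chunk_by_words(text: str, size: int) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def chunk_by_lines(text: str, size: int) -> list[str]:
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


# The dispatcher: each document kind maps to its own (clean, chunk) pair.
HANDLERS: dict[str, tuple[Callable[[str], str], Callable[[str], list[str]]]] = {
    "post": (clean_text, lambda t: chunk_by_words(t, 4)),
    "article": (clean_text, lambda t: chunk_by_words(t, 8)),
    "code": (clean_code, lambda t: chunk_by_lines(t, 2)),
}


def process(docs: Iterable[RawDoc]) -> Iterator[Chunk]:
    # Handle documents one at a time, as a stream would deliver them.
    for doc in docs:
        if doc.kind not in HANDLERS:
            raise ValueError(f"unknown document kind: {doc.kind!r}")
        clean, chunk = HANDLERS[doc.kind]
        for piece in chunk(clean(doc.content)):
            yield Chunk(doc.kind, piece)


def embed(text: str) -> list[float]:
    # Placeholder only: a deterministic 4-number "vector" derived from a hash.
    digest = hashlib.sha256(text.encode("utf-8")).digest()[:4]
    return [b / 255 for b in digest]


def run(docs: Iterable[RawDoc]) -> list[tuple[Chunk, list[float]]]:
    # The "output" step; a real pipeline would write to a vector DB here.
    return [(chunk, embed(chunk.text)) for chunk in process(docs)]
```

Adding a new document type in this sketch means adding one entry to `HANDLERS`, which is the main appeal of the dispatcher approach: the flow itself stays unchanged.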

9 Comments

Alex Vesa

🌐 Co-founder & CTO @CogniSync | Senior AI Engineer | Code Architect | MLOps - Deep diver into complex AI paradigms for over a decade.

7mo

🔗 Join 5.7k+ engineers in the 𝗗𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝗠𝗟 𝗡𝗲𝘄𝘀𝗹𝗲𝘁𝘁𝗲𝗿 for content on production-grade ML. 𝗘𝘃𝗲𝗿𝘆 𝘄𝗲𝗲𝗸: https://decodingml.substack.com

2 Reactions

Paul Iusztin

Senior ML/AI Engineer • MLOps • Founder @ Decoding ML ~ Posts and articles about building production-grade ML/AI systems.

7mo

Writing streaming pipelines has never been easier, thanks to technologies such as Bytewax 🔥🔥

4 Reactions

Alex Razvant

AI/ML Engineer | Founder @NeuralBits | Sharing free expert insights on AI Systems.

7mo

I don’t see myself swapping Bytewax for another tool anytime soon 😅 - it's got everything a dev needs when processing streams.

4 Reactions

Anthony Alcaraz

Senior AI/ML Strategist Startups & VC @AWS - Writing on AI/ML, analysis are my own 👌

7mo

Salah Azekour

1 Reaction


More Relevant Posts

Lazar Gugleta

Data Science & Business Intelligence
4mo Edited
Programming differs from Software Engineering and especially Data Science, but the question is what connects them and what should you strive to be? Read my take on it: https://lnkd.in/dWdpNY9G

Why are Data Scientists afraid to use Test Driven Development?

levelup.gitconnected.com
Brahim Elhoube

Big Data & Cloud Computing Engineering Student @ ENSET | Software engineer @ DUP
5mo
I've always been curious about the behind-the-scenes activities of network traffic on my machine, including protocols, IPs, and active ports. To satisfy this curiosity, I decided to build a simple data pipeline after spending a few months exploring the data engineering field.

𝗧𝗵𝗶𝘀 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝗶𝘀 𝗱𝗲𝘀𝗶𝗴𝗻𝗲𝗱 𝘁𝗼 𝗰𝗮𝗽𝘁𝘂𝗿𝗲, 𝘀𝘁𝗿𝗲𝗮𝗺, 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺, 𝘀𝘁𝗼𝗿𝗲, 𝗮𝗻𝗱 𝘃𝗶𝘀𝘂𝗮𝗹𝗶𝘇𝗲 𝗻𝗲𝘁𝘄𝗼𝗿𝗸 𝘁𝗿𝗮𝗳𝗳𝗶𝗰 𝗱𝗮𝘁𝗮. Using technologies like Python, Kafka, PostgreSQL, and Power BI, my goal is to create an intuitive solution for analyzing and monitoring network traffic on a host machine.

Find more info here: https://lnkd.in/dQRCYWfJ
Check out the project source code here: https://lnkd.in/d8WsQvvP

Network Traffic Visualization Data Pipeline

brahimelhoube.com

10 Comments
🎧 Eric Riddoch

Director of ML Platform @ Pattern
2mo
MLOps question: I'm a dbt noob. Is anyone out there using dbt for data science? Would you be willing to show me your workflow?

I'm thinking through whether it's worth adding another tool to a DS' life, when today they're calling SQL queries in scripts on their own.

Here's my soup of questions lol
- would it help DS communicate which project uses which data?
- would it help an entire team/org be able to track the lineage of queries/tables?
- can it be used as a (poor man's) data catalog?
- do you run it in a DAG tool? What does that look like for you?
- do/should you pay for a vendor? Or is it fine to use with an existing DAG tool?
- are you using it for "data tests" of any kind?
- is it worth integrating with GreatExpectations? What would that workflow look like?
- is there substantial benefit over calling all the sql queries yourself with a python client?
- would you ever NOT want to use dbt?

32 Comments
Farooq Ahmed Siddique

Data Scientist @ Wipro | PhD Scholar @ NIT Rourkela | NLP Engineer | Data Science Trainer
8mo
A Real World ML Problem...

𝗧𝗵𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺 🤔
An ML model can only generate business value once you plug it into a *live data source*. There is no static CSV file, only constantly flowing data that needs to be processed and fed into your ML model. And this is precisely what a *feature pipeline* does.

𝗧𝗵𝗲 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 🧪
Is a program that
1️⃣ 𝗳𝗲𝘁𝗰𝗵𝗲𝘀 𝗱𝗮𝘁𝗮 from a data warehouse, Kafka topic or websocket, among others.
2️⃣ 𝘁𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝘀 this raw data into features
3️⃣ 𝘀𝗲𝗻𝗱𝘀 this feature data to storage (e.g. a Feature Store) so the rest of the system can use it.

Depending on the frequency at which the feature pipeline runs, we can distinguish between 2 types.
- Batch feature pipeline
- Streaming feature pipeline

➡️ 𝗕𝗮𝘁𝗰𝗵 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 🕒
A batch feature pipeline is a program, often written in Python or Spark, that fetches data and generates features on a schedule, for example:
- daily
- hourly
- every 10 minutes

To implement a batch feature pipeline you need
- 𝗰𝗼𝗺𝗽𝘂𝘁𝗶𝗻𝗴, for example, a GitHub VM or an AWS Lambda function.
- 𝗼𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗶𝗼𝗻, to schedule and trigger the execution of the pipeline. Popular options are Apache Airflow and Prefect.

➡️ 𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 ⚡
A streaming pipeline is a program that is 𝗰𝗼𝗻𝘀𝘁𝗮𝗻𝘁𝗹𝘆 ingesting data, e.g. from
→ an external web socket, or
→ a message bus like Kafka,
processing it, and serving it downstream, either to
→ a message bus (e.g. Apache Kafka), or
→ a Feature Store.

Stream-processing can be implemented with
→ Apache Spark Streaming (JVM)
→ Apache Flink (JVM)
→ Bytewax (Python on top of Rust)
→ Pathway (Python on top of Rust)
→ Quix (pure Python)
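The fetch, transform, store steps and the batch/streaming split can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the event shape (`symbol`, `prices`), the two feature names, and the in-memory dict used as a stand-in for a Feature Store are all hypothetical, and no real scheduler or message bus is involved.

```python
from typing import Iterable, Iterator

# Hypothetical in-memory stand-in for a Feature Store: entity key -> features.
FeatureStore = dict[str, dict[str, float]]


def transform(event: dict) -> tuple[str, dict[str, float]]:
    """Turn one raw event into (entity key, feature values)."""
    prices = event["prices"]
    return event["symbol"], {
        "mean_price": sum(prices) / len(prices),
        "price_range": max(prices) - min(prices),
    }


def run_batch(events: list[dict], store: FeatureStore) -> None:
    # Batch: a scheduler (e.g. Airflow or Prefect) would trigger one call
    # per run, and each call processes everything fetched since the last run.
    for event in events:
        key, features = transform(event)
        store[key] = features


def run_streaming(source: Iterable[dict], store: FeatureStore) -> Iterator[str]:
    # Streaming: the program stays up and handles each event as it arrives.
    # `source` could be any iterator that blocks on a socket or message bus.
    for event in source:
        key, features = transform(event)
        store[key] = features
        yield key  # signal downstream that this entity was refreshed
```

Both variants share the same `transform`, so in this sketch the only difference between batch and streaming is when and how often the function is fed data.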
Jules Damji

DevRel, Developer Education & Distributed Computing. x-Hortonworks; x-Databricks; x-Anyscale; O’Reilly Author ✍️ ; currently freelancing …
1w Edited
This blog explores, examines, and addresses a number of challenges faced by data engineers in the industry, especially in the Gas & Energy sector:
👉🏼 How to avoid building bespoke solutions for industry-specific and myriad data sources for collecting data for ingestion and examination.
👉🏼 How to extend PySpark Data Source API for custom Data sources (native and non-native) so you can use your de facto and scalable Apache Spark™ data processing engine--and that in Python--as your data source and sink.
👉🏼 How to extend the new Python Data Source API for your data sources, sinks, batch, and streaming.

📓 Have a read! Check out the docs links below, have a go at it, and let us know.
cc: 💁♀️ Allison Wang & 💁♂️ Craig L.
🔗 https://lnkd.in/gtX_pZRc
🔗 https://lnkd.in/ghjRk-Tn
https://lnkd.in/g5CbWcyE

Simplify Data Ingestion With the New Python Data Source API

databricks.com

1 Comment
VAIBHAW KHEMKA

ML Engineer | IIT Hyderabad | GSoC'21@Robocomp | MLops | LLM | NLP | Big Data | Kubernetes
6mo
ML is powerless without big data

ML Experiment 1: (Big data)
I was building this project using Kafka to ingest data from multiple APIs, send real-time data to the Kafka cluster, create a streaming pipeline to process the data, and store it in a vector DB.

Three Takeaways:
1. Use Kafka for any real-time application
I used an Upstash Kafka cluster to handle streams of real-time events and messages, and it easily manages the data coming from different API threads.
2. Pydantic models for any data validation
Not only in this project: I have seen Pydantic used in every open-source project to handle errors and exceptions and to keep the system consistent.
3. Use Bytewax for creating a streaming pipeline
Recently, I came across Bytewax, a cool streaming pipeline framework in Python used for creating data transformation steps.

Find the code repo: https://buff.ly/3Rf7Czc
1 Comment
GANESH SAWANT

Data Engineer| SQL l Python l DataBricks l Scala l Hadoop l Hive | Spark | PySpark | AWS l EMR | S3 | Redshift | Athena | Glue | Airflow | Paid Mentorship I Bigdata Study Material |Talks about Hiring , BigData
3mo
RDD (Resilient Distributed Dataset)
- Low-level API.
- Requires detailed coding with functions like map, flatMap, filter, reduceByKey.
- Not very developer-friendly and lacks optimizations.

DataFrame
- Introduced in Spark 1.3.
- A higher-level API that simplifies coding.
- Challenges: Does not offer strong typing, so errors appear at runtime instead of compile-time. Limited flexibility (e.g., restricted use of anonymous/lambda functions).
- Converting DataFrames to RDDs is possible when flexibility or safety is needed.
- Conversion from DataFrame to RDD can be costly, and some optimizations are lost.

Dataset
- Introduced in Spark 1.6.
- Provides compile-time safety and more flexibility with lower-level coding.
- Conversion between DataFrame and Dataset is efficient and maintains optimizations.

Let me know if you need any specific details or further explanation on any of these points!
Abhishek Choudhary

Data Infrastructure Engineering in RWE/RWD | Healthtech DhanvantriAI
1mo
I'm developing a Synthetic Data Generation Library designed to support diverse data and ML use cases, including e-commerce, healthcare, patient engagement, lead generation, and customer feedback.

During development, I focused on scaling and optimizing to minimize resource usage. Even with Kubernetes as a deployment endpoint, I aim to control resource allocation tightly by fixing resources and pod numbers.
- For data generation, I'm using the Faker library in Python, which is fast, simple, and supported by a robust community.
- For data writing, I initially relied on Pandas for simplicity but am transitioning to Polars due to its superior performance.
- To handle streaming use cases, I’m adding support for DuckDB.
- Parquet is the default data format, but I'm also supporting Iceberg. Currently, I use Spark on a single node for Iceberg writes and am experimenting with Getdaft to write Iceberg tables.
- For concurrency, I tested various approaches, including Redis queues and Python threading, but ultimately chose Celery as the executor, which moderates load effectively.
- I switched from Redis to Dragonfly for caching, and all API compatibility checks have been successful, though I’m still evaluating performance.

To ensure efficiency, cached data is reused instead of being regenerated, and I’m incorporating randomization across request entities in the data generation process. I'm also using GenAI with strict Pydantic models to maintain smooth execution within the library.

15 Comments
Sergi Gomez

Founder of Saivo | Data & AI
7mo
Good takeaways from our session with the one and only David Jayatillake about Semantic Layers. Thanks Data Action Mentor for having us!

Data Action Mentor

987 followers
8mo

Thanks a million to the hosts of our recent expert session David Jayatillake and Sergi Gomez. The session "Intro to the Semantic Layer - Comparison between cube and dbt metricflow" was a great success! 👏

Here are a few of the takeaways from the session:
💡 A semantic layer is a technology layer that maps complex data into understandable business terms, acting as an intermediary between data and end-users
💡 This layer facilitates improved data quality, operational efficiency, and enhances information security by structuring data access and management
✅ Use semantic layers for unified definitions of metrics and dimensions to ensure a single source of truth.
✅ Leverage semantic layers to eliminate redundant efforts and facilitate self-service by providing a common business language.
❌ Avoid deploying the semantic layer for one-time exploratory analyses; reserve it for frequently used, well-defined and agreed-upon metrics and definitions.
❌ Adhere to the “don’t repeat yourself” principle by conducting complex metric calculations more upstream in the data pipeline to improve efficiency and reduce duplication.
💡 The cube semantic layer targets midmarket, is open-source with flexible API-support, advanced caching and offers Python and JavaScript support but it has some weaknesses when performing period to period comparisons.
💡 The dbt MetricFlow semantic layer targets enterprise, integrates seamlessly with dbt, offers enterprise level support and a sophisticated development experience but the proprietary aspect of its compilation could result in platform lock-in.

Both David and Sergi are veteran semantic layer experts and Mentors at Data Action Mentor.

1 Comment
Joseph M.

Data Engineer, startdataengineering.com | Bringing software engineering best practices to data engineering.
3mo
🚨 Working on a large codebase without any tests can be nerve-wracking. One wrong line of code or a seemingly harmless library update can bring down your entire production pipeline! Data pipelines often start simple, so engineers might skip tests initially. But as complexity increases, the lack of tests can severely slow down your feature delivery speed. It’s especially daunting to introduce testing in a large legacy codebase with little to no existing tests. And in long-running data pipelines, bad code can take hours to identify and fix, leading to frustrated stakeholders!

💡 Imagine confidently pushing changes to production without the fear of breaking your pipeline. Picture a world where you deliver features quickly, keep your stakeholders happy, and empower your team to move fast. By incorporating tests into your pipeline, you can dramatically reduce stress, avoid bugs slipping into production, and maintain a smooth, efficient workflow.

✅ Testing might not catch every possible issue, but it can prevent a significant number of production problems. In this post, we’ll explore the different types of tests and how to effectively test PySpark data pipelines with `pytest`. We’ll walk through creating unit, integration, and end-to-end tests for a simple data pipeline, covering key concepts like fixtures and mocking. By the end, you’ll know how to identify and test critical parts of your data pipeline, giving you the peace of mind to push changes with confidence.

#dataengineering #testing #pyspark #pytest #unittesting #techtips #devops
https://lnkd.in/gCgZCsi9

How to test PySpark code with pytest

startdataengineering.com
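As a rough illustration of the unit-test and mocking ideas mentioned in the post, here is a minimal sketch. It is not taken from the linked article and uses no PySpark: the transformation works on plain lists of dicts, and `add_total` and `run_pipeline` are hypothetical names. The tests are ordinary `test_*` functions with bare asserts, which pytest collects as-is; mocking uses the standard library's `unittest.mock`.

```python
from unittest.mock import Mock


def add_total(rows: list[dict]) -> list[dict]:
    """Pure transformation: easy to unit test because it has no I/O."""
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


def run_pipeline(reader, writer) -> int:
    """Read, transform, write. I/O is injected so tests can replace it."""
    rows = reader()
    result = add_total(rows)
    writer(result)
    return len(result)


def test_add_total():
    # Unit test: check the transformation logic in isolation.
    rows = [{"price": 2.0, "qty": 3}]
    assert add_total(rows) == [{"price": 2.0, "qty": 3, "total": 6.0}]


def test_run_pipeline_with_mocks():
    # Mock the source and sink so no database or cluster is needed.
    reader = Mock(return_value=[{"price": 1.5, "qty": 2}])
    writer = Mock()
    assert run_pipeline(reader, writer) == 1
    reader.assert_called_once_with()
    writer.assert_called_once_with([{"price": 1.5, "qty": 2, "total": 3.0}])
```

Keeping the transformation pure and passing the reader and writer in as arguments is the design choice that makes both tests cheap; the same split would apply if the rows were Spark DataFrames.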

4 Comments
