The key differences in data handling between PySpark and Pandas.
✅ Data Size Handling:
Pandas: Designed for small to medium datasets that fit in a single machine's memory; performance degrades sharply as data approaches available RAM.
PySpark: Built for big data, distributing datasets across multiple machines, which makes it the better fit for large-scale processing (sketch below).
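A minimal sketch of the contrast, assuming a hypothetical sales.csv file: Pandas parses the whole file into local RAM, while PySpark splits it into partitions that can be spread across executors.

```python
import pandas as pd
from pyspark.sql import SparkSession

# Pandas: the entire file is parsed into local RAM at once.
pdf = pd.read_csv("sales.csv")             # hypothetical file
print(pdf.memory_usage(deep=True).sum())   # bytes held on this one machine

# PySpark: the file is split into partitions; each partition can be
# processed on a different executor, so the dataset never has to fit
# in a single machine's memory.
spark = SparkSession.builder.appName("size-demo").getOrCreate()
sdf = spark.read.csv("sales.csv", header=True, inferSchema=True)
print(sdf.rdd.getNumPartitions())          # how many chunks Spark created
```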
✅ Parallelism and Speed:
Pandas: Runs single-threaded on a single machine, which caps its throughput on large datasets.
PySpark: Built on Apache Spark, it distributes work across a cluster and runs tasks in parallel, so large datasets process much faster (sketch below).
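A rough illustration of the parallelism, using a local Spark session with 4 worker threads; the master URL, row count, and partition count here are arbitrary choices for the demo.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[4]").appName("parallel-demo").getOrCreate()

# 10 million rows split across 8 partitions; with local[4] the partitions
# are processed 4 at a time on this machine. On a real cluster the same
# code fans out across executors unchanged.
df = spark.range(10_000_000).repartition(8)
agg = df.groupBy((F.col("id") % 10).alias("bucket")).count()
agg.show()
```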
✅ Ease of Use:
Pandas: User-friendly, with a simple, intuitive API for data manipulation. Ideal for smaller projects and work on a local machine.
PySpark: Needs a Spark cluster (or a local Spark session) to run, and its core DataFrame API is more SQL-like; since Spark 3.2, pyspark.pandas also offers a Pandas-style API for large-scale processing (sketch below).
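A small side-by-side sketch of that API similarity via the Pandas API on Spark (pyspark.pandas, available since Spark 3.2); the toy data is made up.

```python
import pandas as pd
import pyspark.pandas as ps   # Pandas API on Spark (Spark >= 3.2)

data = {"city": ["NY", "NY", "LA"], "sales": [10, 20, 30]}

# Pandas: everything runs locally.
pdf = pd.DataFrame(data)
print(pdf.groupby("city")["sales"].sum())

# pyspark.pandas: same syntax, but the work is executed by Spark,
# so the identical code scales to cluster-sized data.
psdf = ps.DataFrame(data)
print(psdf.groupby("city")["sales"].sum())
```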
✅ Error Handling:
Pandas: Errors are generally easier to understand and debug because everything runs locally and fails at the line that caused it.
PySpark: Lazy evaluation and distributed execution make errors harder to trace: a buggy transformation may only fail later, when an action runs it on the executors (sketch below).
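A sketch of why PySpark debugging is trickier: because transformations are lazy, the buggy UDF below (invented here for illustration) only fails when an action executes it, and the traceback mixes Python and JVM frames. The equivalent .apply in Pandas would raise immediately at the offending line.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("debug-demo").getOrCreate()

# A UDF with a bug: it divides by zero when id == 0.
buggy = F.udf(lambda x: 1.0 / x, DoubleType())

# No error yet: withColumn is a lazy transformation.
df = spark.range(5).withColumn("inv", buggy("id"))

# The failure only surfaces when an action runs the tasks on the
# executors, far from the line where the bug was written.
try:
    df.show()
except Exception as e:
    print(type(e).__name__)
```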
✅ Use Cases:
Pandas: Ideal for exploratory data analysis, prototyping, and smaller datasets.
PySpark: Better suited to large-scale processing, including ETL (Extract, Transform, Load) workflows and production data pipelines (sketch below).
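A compact, hypothetical ETL sketch in that spirit; the S3 paths, column names, and schema are all assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Extract: read raw events (hypothetical path and schema).
raw = spark.read.csv("s3://bucket/raw/events.csv", header=True, inferSchema=True)

# Transform: drop incomplete rows, then aggregate to daily active users.
daily = (
    raw.dropna(subset=["user_id"])
       .withColumn("day", F.to_date("event_time"))
       .groupBy("day")
       .agg(F.countDistinct("user_id").alias("active_users"))
)

# Load: write partitioned Parquet for downstream consumers.
daily.write.mode("overwrite").partitionBy("day").parquet(
    "s3://bucket/curated/daily_active_users"
)
```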
Follow Bikram Sahoo for more such content on Data.
#spark #pandas #data #datascience #dataengineering #ai #ml