Tom Reid's Post

It was good to see AWS getting back to basics somewhat at re:Invent 2024 by introducing some good old-fashioned new and updated data analysis and processing services. After all, that's what originally built the company into what it is today. AI was still a big theme too, of course, but not on the same level as last year. Anyway, from a data analytics/engineering point of view, here are the re:Invent 2024 announcements I was most excited about:
1) AWS Glue 5.0 - the latest version of AWS's main ETL tool
2) Amazon Aurora DSQL - a new Postgres-compatible distributed SQL database
3) S3 Tables - fully managed Apache Iceberg tables on S3
4) S3 Metadata - query objects by key, size, tags, etc., using Athena, Redshift, and Spark
Putting on my AI hat, two enhancements to Bedrock caught my eye too:
1) Bedrock Knowledge Bases support GraphRAG - improved RAG processing
2) Automated Reasoning checks - help prevent LLM hallucinations

More Relevant Posts
-
Isn't that a complete and convenient solution? 🤔 You write your continuously arriving data to Kinesis and S3. Data scientists and ML engineers use AWS Athena to pull data from S3 and then use SageMaker to preprocess it and build models 🤖. Data and business analysts query their data from Redshift and generate quick, effective insights with QuickSight 📊. You practically have it all covered! 🌐 #AWSIntegration
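As a rough illustration of the ingest-and-query half of that stack, here is a minimal boto3 sketch. The stream name, database, table, and result bucket are all hypothetical, and the Kinesis-to-S3 delivery itself would be handled separately (for example by Firehose):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
athena = boto3.client("athena")

# 1) Producers push incoming events to Kinesis (stream name is hypothetical);
#    a delivery stream or consumer then lands them in S3.
kinesis.put_record(
    StreamName="events-stream",
    Data=json.dumps({"user_id": 42, "action": "click"}).encode("utf-8"),
    PartitionKey="42",
)

# 2) Analysts and ML engineers query the landed data with Athena
#    (database, table, and output bucket are hypothetical).
athena.start_query_execution(
    QueryString="SELECT action, count(*) AS n FROM events GROUP BY action",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```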
-
I see a lot of posts like "364 days to become a data engineer" or "Everything you need to know on AWS to get better at data" (which then lists out all 627 things on AWS). It doesn't need to be complicated. Here's the gold stack for a data pipeline.
1. Extract from sources. (Either Mage or [Fivetran, Stitch or Hevo Data], depending on what your sources are.)
2. Data warehouse. (Snowflake. No viable alternative. Hit up Ryan Laurain if you're based in Utah.)
3. Transform. (dbt Core, using either dbt Labs or Y42 to do the orchestration. TBH, though, I'm not a fan of Y42's price per asset. See the sketch after this list.)
4. Visualization. (There are a ton of options, mostly determined by your org size and budget, since this will be the most expensive step.)
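For the Transform step, dbt models are usually plain SQL, but dbt Core (1.3+) also supports Python models. Here is a minimal, hypothetical sketch of one on Snowflake/Snowpark, assuming an upstream staging model called stg_orders with status, customer_id, and amount columns:

```python
# models/customer_totals.py -- a dbt Python model (dbt 1.3+, Snowflake/Snowpark)
import snowflake.snowpark.functions as F


def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref returns a Snowpark DataFrame for the upstream (hypothetical) model
    orders = dbt.ref("stg_orders")

    # Keep completed orders and roll them up per customer
    return (orders
            .filter(F.col("status") == "completed")
            .group_by("customer_id")
            .agg(F.sum("amount").alias("lifetime_value")))
```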
-
Tired of spending hours on data cleansing and preparation? Our latest video, by Einstein A. Millán Jr., shows you how to streamline the process using AWS Glue DataBrew's codeless transformations. Whether you're a data analyst, a data scientist, or anyone handling large datasets, this tool will save you time, money, and a lot of hassle! In this hands-on demo, we cover:
🔸 Simplifying data cleansing with pre-built transformations
🔸 Creating reusable data recipes in AWS Glue DataBrew
🔸 Preparing data for analytics and machine learning, all without writing code!
🎥 Watch now: https://lnkd.in/dGBsaDgw
Don't miss out on this game-changing tool for your data pipeline. Let's make your data work smarter, not harder! 💡
#AWS #DataBrew #DataScience #DataAnalytics #TechInnovation #DataPreparation
Cleansing Data with AWS DataBrew | Online Tech Talk
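If you later want to automate a recipe job you built in the console, it can also be kicked off programmatically. A minimal boto3 sketch, with a hypothetical job name:

```python
import boto3

databrew = boto3.client("databrew")

# Start an existing DataBrew recipe job (the job name is hypothetical)
run = databrew.start_job_run(Name="clean-customer-data-job")
print("Started run:", run["RunId"])

# Check the status of that run
status = databrew.describe_job_run(Name="clean-customer-data-job", RunId=run["RunId"])
print("State:", status["State"])
```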
-
Unleashing the Power of Regular Expressions in AWS Athena! 💡 Are you a data engineer working with Athena and S3? Ever wondered how to efficiently clean, extract, or validate textual data in your SQL queries? 🤔 In my latest article on The Data Architect's Notebook, I explore Regular Expression Functions in AWS Athena with practical examples tailored for AWS data engineers. 🔍✨
✅ What's inside?
- Key regex functions like regexp_like, regexp_extract, and regexp_replace.
- Real-world use cases, including validating email formats, parsing URLs, and cleaning log data.
- Tips to optimize your regex queries for performance and precision.
- Advanced techniques for tokenization and anomaly detection.
💼 Whether you're cleaning messy datasets or extracting insights from raw logs, mastering regex can make your workflows faster and more effective.
🔗 Read the full article here: https://lnkd.in/gq58Nh_w
Let's level up our Athena skills together! 💪 Share your favorite regex tips or challenges in the comments. I'd love to hear how you use regex in your projects. 👇
#AWS #Athena #DataEngineering #RegularExpressions #SQL #TheDataArchitectsNotebook #CloudComputing
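A minimal sketch of those three functions in action, submitted to Athena with boto3; the database, table, columns, and result bucket are all hypothetical:

```python
import boto3

athena = boto3.client("athena")

# regexp_like, regexp_extract, and regexp_replace in one query
# (database, table, columns, and result bucket are hypothetical)
query = r"""
SELECT
    email,
    regexp_like(email, '^[\w.+-]+@[\w-]+\.[\w.]+$')              AS is_valid_email,
    regexp_extract(url, 'https?://([^/]+)', 1)                   AS url_host,
    regexp_replace(log_line, '\d{1,3}(\.\d{1,3}){3}', 'x.x.x.x') AS masked_log_line
FROM web_logs
LIMIT 100
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query started:", response["QueryExecutionId"])
```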
-
💡 dbt vs. Delta Live Tables: Deciding Your Data Transformation Strategy! 💻
In the era of cloud data processing, ELT has taken center stage, but what about the "T" in ELT? Let's explore two modern options:
🔹 dbt (Data Build Tool):
- Purpose: Focuses solely on data transformations using SQL, making ELT more accessible.
- Features: SQL++, Jinja templating, autogenerated docs, and support for multiple environments.
- Pros: SQL familiarity, flexibility, and integration with various platforms.
- Cons: SaaS model, SQL-centric approach.
🔹 Delta Live Tables (DLT):
- Purpose: Introduced by Databricks, DLT combines Delta, Live, and Tables, offering SQL- and Python-based transformations.
- Features: Python support, built-in data quality checks, and seamless integration with Databricks workflows.
- Pros: Python flexibility, data quality control, and reduced SaaS footprint.
- Cons: Vendor lock-in to Databricks; relatively new compared to dbt.
Choosing between dbt and DLT depends on your stack, technical talent, and platform preferences. If you're on Databricks, DLT might be your go-to for seamless integration, while dbt shines for its versatility. Ultimately, it's about empowering data professionals to achieve more with their skill set! 💬
#DataEngineering #DataTransformation #dbt #DeltaLake #Databricks #ELT #DLT
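To make the DLT side concrete, here is a minimal sketch of a Python pipeline with one of its built-in data quality checks; the table names and S3 path are hypothetical, and the dlt module and spark session are provided by the Databricks pipeline runtime:

```python
# A minimal Delta Live Tables sketch in Python. This runs inside a Databricks
# DLT pipeline, where the `dlt` module and the `spark` session are supplied
# by the runtime; table names and the S3 path are hypothetical.
import dlt
from pyspark.sql import functions as F


@dlt.table(comment="Raw events loaded from cloud storage")
def raw_events():
    return spark.read.format("json").load("s3://my-bucket/raw/events/")


@dlt.table(comment="Cleaned events with a basic data quality check")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def clean_events():
    return dlt.read("raw_events").withColumn("ingested_at", F.current_timestamp())
```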
-
The dirty little secret most data scientists tend not to address is that most of their work ends up siloed in some dark corner of an unmarked GitHub repository, never to be seen again. The reality is that most data science projects are hard and usually involve a team of professionals, all of whom could fall under the umbrella of data scientist.
Recently I had a similar experience. I was handed a project to optimize an existing suite of ML models that were not performing very well. After months of hyperparameter tuning, adjusting for imbalances in the data sets, algorithmic feature selection, and even some feature engineering, we were able to get a decent enough F-score. Eventually we were left with a trained XGBoost model in pickled format. Then came the million-dollar question: what to do with it?
Enter MLflow. MLflow is an MLOps framework that enables data scientists not only to train their models, but also to log performance metrics, hyperparameters, and model artifacts like the aforementioned pickle files. It also provides a suite of tools via a tracking UI that allows for model versioning (a fancy way of saying keeping track of multiple models). More importantly, it allows for easy deployment of these models to platforms like AWS SageMaker and Microsoft Azure. Once deployed, stakeholders can make POST requests to these REST endpoints for inference. Using MLflow (in conjunction with SageMaker) for the first time gave me a much better view of the data science lifecycle.
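For anyone who hasn't used it, the tracking side looks roughly like this. A minimal sketch with synthetic data and hypothetical parameter values, logging hyperparameters, a metric, and the trained XGBoost model as a run artifact:

```python
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real training set
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="xgb-baseline"):
    params = {"max_depth": 4, "n_estimators": 200, "learning_rate": 0.1}
    mlflow.log_params(params)

    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train)

    # Log the evaluation metric for this run
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))

    # Log the trained model as a versioned artifact instead of a loose pickle file
    mlflow.xgboost.log_model(model, "model")
```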
-
Just finished Data Engineering with AWS Part 2! Check it out: https://lnkd.in/dNx7g7pF #dataengineering #amazonwebservices
A continuation of Data Engineering with AWS Part 1, which was a hands-on course on data ingestion and storage, Part 2 covers data cataloguing, processing, and visualisation. I built an end-to-end, event-driven data pipeline (triggered by the arrival of a file in S3) using the AWS Kinesis family, AWS Lambda, and S3, with real-time analysis of data in motion using SQL.
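As a rough idea of what the event-driven glue can look like, here is a minimal Lambda handler sketch that reacts to an S3 ObjectCreated event and forwards a reference to a Kinesis stream; the stream name is hypothetical:

```python
import json
import boto3

kinesis = boto3.client("kinesis")


def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event; forwards the object reference to Kinesis."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        kinesis.put_record(
            StreamName="my-ingest-stream",  # hypothetical stream name
            Data=json.dumps({"bucket": bucket, "key": key}).encode("utf-8"),
            PartitionKey=key,
        )
    return {"statusCode": 200}
```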
-
🇧🇷 Video in Portuguese below! I'm currently pursuing a post-graduate degree in Machine Learning Engineering, and in this second phase of the program, we are diving deep into AWS services for cloud-based data solutions. ☁️ One of our key projects is building a data pipeline for ETL. In this video, I show how we're leveraging AWS Glue to streamline the transformation process, all without writing any code! Instead, we're using the visual ETL interface, which lets us design workflows and perform data transformations in an intuitive, drag-and-drop environment. #machinelearning
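Behind the scenes, Glue's visual editor generates a PySpark script along these lines. This is only an illustrative sketch; the catalog database, table, and bucket names are hypothetical:

```python
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the source dataset from the Glue Data Catalog (names are hypothetical)
source = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_csv")

# Rename/cast columns, as a visual "Change Schema" node would
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "long"),
              ("amount", "string", "amount", "double")])

# Write the transformed data back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")
```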
-
An efficient data ingestion pipeline consumes data from the point of origin, cleans it up, and writes it to a destination – allowing you to gain insights and make timely and informed decisions. This white paper compares three different approaches that can build a solid data ingestion pipeline, using combinations of Apache Spark, Amazon EMR, Databricks, and Databricks Notebook. Click on the link below and download the paper to learn how you can maximize the potential of your data ingestion pipeline. https://lnkd.in/d4tva7f3 Content Contributor: Srabani Malla #DataIngestion #DataIngestionPipeline #Databricks #ApacheSpark #Spark #EMR #AmazonEMR #DatabricksNotebook #DataPipeline #ManagedSpark #WhitePaper #GSPANN #BI #BusinessIntelligence #DataValue
Maximize the Value of Your Data: Managed Spark with Databricks vs. Spark with EMR vs. Databricks Notebook
-
🚀 Discovering Databricks Autoloader: A Game-Changer for Data Ingestion! 🚀
I'm excited to share my recent experience with Databricks Autoloader, a feature that's revolutionized my approach to data ingestion. Here's why it stands out:
🔹 **Scalable:** Manages large data volumes effortlessly.
🔹 **Schema Management:** Automatically adapts to schema changes.
🔹 **Incremental Processing:** Processes only new data efficiently.
🔹 **File Notification:** Integrates with Azure Event Grid or AWS SNS/SQS for timely updates.
🔹 **Checkpointing:** Ensures exactly-once processing, so no data is missed.
**How I Used It:**
I configured Autoloader to read data from an S3 bucket and write it to a Delta Lake table. It was both simple and efficient!
```python
from pyspark.sql.types import StructType

# Define the expected schema of the incoming JSON files
schema = StructType([...])

# Incrementally read new files from S3 with Auto Loader
# (the `spark` session is provided by the Databricks runtime)
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema(schema)
      .load("s3://your-bucket/path"))

# Stream the data into a Delta table, checkpointing progress so each file is processed exactly once
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/path/to/checkpoint")
   .start("/path/to/delta-table"))
```
**Benefits I Experienced:**
🔹 **Efficiency:** Processes only new data, saving time and resources.
🔹 **Flexibility:** Supports various file formats, making it versatile.
🔹 **Reliability:** Built-in checkpointing ensures no data loss.
🔹 **User-Friendly:** Easy to configure with automatic schema management.
Databricks Autoloader has truly streamlined my data ingestion process.
#DataEngineering #Databricks #DataIngestion #DeltaLake
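For the file-notification mode mentioned above, switching Auto Loader from directory listing to notifications is a single extra option. A sketch, assuming the workspace has permissions to create the SQS/SNS (or Event Grid) resources and reusing the schema from the block above:

```python
# Same read as above, but using file notifications instead of directory listing
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")  # SQS/SNS on AWS, Event Grid on Azure
      .schema(schema)
      .load("s3://your-bucket/path"))
```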