Tom Reid’s Post

AWS just made auditing S3 object changes a whole lot easier with the S3 Metadata announcement at re:Invent 2024. Currently in preview, once metadata is set up you can use other AWS data analysis services, such as Redshift and Spark on EMR, to query when S3 objects were added, deleted, or updated. I just wrote up a detailed post on how to set it up on the Level Up Coding blog. The link to the post is in the first comment. #AWS #S3 #metadata #s3_auditing
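For a feel of what querying the metadata looks like before reading the full post, here is a minimal PySpark sketch. The catalog, namespace, and table names are assumptions for illustration, and the column names follow the general shape of the preview's journal table (key, record_type, record_timestamp); check the docs for your actual identifiers.

```python
# Minimal sketch of querying an S3 Metadata table with Spark on EMR.
# Assumes Spark is already configured with the Iceberg/S3 Tables catalog;
# the catalog, namespace, and table names below are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-metadata-audit").getOrCreate()

# record_type distinguishes creates, metadata updates, and deletes.
recent_changes = spark.sql("""
    SELECT key, record_type, record_timestamp, size
    FROM s3tablescatalog.aws_s3_metadata.my_bucket_metadata
    WHERE record_timestamp > current_timestamp() - INTERVAL 7 DAYS
    ORDER BY record_timestamp DESC
""")
recent_changes.show(truncate=False)
```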
More Relevant Posts
-
Why should we use object storage? 🤔

What is it? Object storage, also known as object-based storage, is a computer data storage architecture designed to handle large amounts of unstructured data.

And why use it?
1. Custom metadata and searchability
2. Resiliency
3. Archive management
4. Automated information lifecycle management
5. Unlimited scalability
6. Cost-effectiveness
7. Convenience

Two of the best ways to use object storage are AWS S3 and MinIO. #minio #s3 #aws #object_storage
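To make point 1 concrete, here is a minimal boto3 sketch of attaching and reading back custom metadata on an S3 object; the bucket and key names are placeholders.

```python
# Sketch: custom metadata on an S3 object with boto3.
# Bucket/key names are placeholders; metadata keys are user-defined and
# travel on the wire prefixed with "x-amz-meta-".
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-bucket",
    Key="reports/2024/q4.csv",
    Body=b"col_a,col_b\n1,2\n",
    Metadata={"department": "finance", "retention": "7y"},
)

head = s3.head_object(Bucket="example-bucket", Key="reports/2024/q4.csv")
print(head["Metadata"])  # {'department': 'finance', 'retention': '7y'}
```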
-
🔧 Optimizing AWS DMS S3 Targets for CDC Workflows with Hudi Streamer / Glue Jobs 🔧

If you're using AWS DMS to capture Change Data Capture (CDC) events from databases and deliver them into your data lake, having the right S3 target settings is crucial. Whether you're using Hudi Streamer to handle deletes and updates or leveraging AWS Glue to build a medallion architecture, the following configurations can help streamline your process (see the sketch after this list):

Why these settings matter:
• Date partitioning: organizing CDC events by year/MM/DD in your S3 bucket allows easy reprocessing if needed, enhancing data management and recovery.
• IncludeOpForFullLoad: ensures the Op column is available during full loads, making your CDC logic simpler and more effective.
• Optimized data format: setting DataFormat=parquet and CompressionType=GZIP reduces file size and storage costs while maintaining query performance.
• File size management: the CdcMinFileSize setting keeps output files at 64 MB or larger, which is crucial for efficient querying and for avoiding the small-files problem.

Pro tip: implement a 30-day or 60-day lifecycle policy on your raw data zone to automatically archive or delete older files, keeping your storage optimized and cost-effective.

By configuring these settings, you can maximize the efficiency of your CDC ingestion process and ensure your data lake runs smoothly! 🚀

#AWS #DMS #DataEngineering #BigData #Cloud #DataLakes #S3 #ApacheHudi #GlueJobs #CDC Apache Hudi
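As a rough illustration, here is what those settings could look like when creating an S3 target endpoint with boto3. The role ARN and bucket name are placeholders, and values should be checked against the DMS docs (note that CdcMinFileSize is expressed in kilobytes).

```python
# Sketch: DMS S3 target endpoint with the settings discussed above.
# ARN and bucket are placeholders; verify enum values against the DMS docs.
import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="cdc-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
        "BucketName": "my-raw-zone",
        "DataFormat": "parquet",
        "CompressionType": "gzip",       # GZIP compression for parquet output
        "IncludeOpForFullLoad": True,    # keep the Op column on full load
        "DatePartitionEnabled": True,    # year/MM/DD folders for easy replay
        "DatePartitionSequence": "YYYYMMDD",
        "DatePartitionDelimiter": "SLASH",
        "CdcMinFileSize": 64000,         # ~64 MB minimum files (value in KB)
    },
)
```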
-
Completed the AWS Edge Storage, Data Transfer, and File Transfer Services Getting Started course. #learningneverstops #aws #storage #data #transfer #keeplearning
-
❄ Here's how to automate your data loading from AWS S3 to Snowflake. 👇🏻

In my latest video, I explain the concept of Snowpipe. You will learn:
✅ What Snowpipe is
✅ How to set up event notifications in S3
✅ Different possibilities for usage
✅ Concerns to be cautious about

🔍 And of course, a step-by-step configuration of Snowpipe from scratch.

Find the video link here: https://lnkd.in/dtqWch8B

Mention someone who could benefit from this 👇🏻 #Snowflake #SnowflakeNinja #dataengineering
Automate your data loading from AWS S3 to Snowflake | Snowpipe | #SnowflakeNinja
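If you prefer reading code to watching video, here is a hedged sketch of the core DDL such a setup runs, issued via the Snowflake Python connector. All object names are made up, the stage is assumed to already point at your S3 bucket, and the S3 event notification still has to be wired to the pipe's SQS ARN (surfaced by DESC PIPE).

```python
# Sketch: creating an auto-ingest Snowpipe from Python.
# All object names are placeholders; assumes the stage already exists.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="etl_wh", database="raw", schema="landing",
)
cur = conn.cursor()

# AUTO_INGEST = TRUE makes the pipe listen for S3 event notifications.
cur.execute("""
    CREATE PIPE IF NOT EXISTS orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO orders_raw
    FROM @s3_landing_stage/orders/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# DESC PIPE exposes the notification_channel (an SQS ARN) that the
# S3 bucket's event notification must target.
cur.execute("DESC PIPE orders_pipe")
print(cur.fetchall())
```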
-
Leveraging AWS Lambda for Real-Time ETL Workflows

AWS Lambda has been a game-changer for building serverless ETL workflows. By using Lambda to process events in real time, we reduced data flow latency by 20% without worrying about managing servers. It’s perfect for scalable, cost-efficient data processing. #AWS #Lambda #ETL #Serverless #DataEngineering
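A minimal sketch of the pattern, assuming S3 put events as the trigger; the bucket names and the transform step are entirely illustrative, not the workflow described above.

```python
# Sketch: S3-triggered Lambda performing a small transform-and-land ETL step.
# Target bucket and the filtering logic are illustrative placeholders.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys arrive URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(raw)

        # Illustrative transform: keep only completed orders.
        cleaned = [r for r in rows if r.get("status") == "completed"]

        s3.put_object(
            Bucket="my-curated-zone",  # placeholder target bucket
            Key=f"cleaned/{key}",
            Body=json.dumps(cleaned).encode("utf-8"),
        )
```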
-
The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables
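In practice this surfaces as table optimizers on the catalog. A hedged boto3 sketch, assuming the account ID, database, table, and role ARN are placeholders and that the storage-optimization types shown match your needs:

```python
# Sketch: enabling Glue Data Catalog storage optimization on an Iceberg table.
# Names/ARNs are placeholders. 'retention' and 'orphan_file_deletion' are the
# storage-optimization types that complement the existing 'compaction' type.
import boto3

glue = boto3.client("glue")

for optimizer_type in ("retention", "orphan_file_deletion"):
    glue.create_table_optimizer(
        CatalogId="123456789012",
        DatabaseName="lakehouse",
        TableName="orders_iceberg",
        Type=optimizer_type,
        TableOptimizerConfiguration={
            "roleArn": "arn:aws:iam::123456789012:role/glue-optimizer-role",
            "enabled": True,
        },
    )
```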
-
Exciting insights from AWS re:Invent 2024 on serverless data processing with AWS Lambda and Apache Kafka! 🌐 This hands-on session demonstrated how to seamlessly process data from Apache Kafka streams using AWS Lambda, leveraging its native support as an event source for real-time streaming analytics. Key takeaways included optimizing data processing workloads with Amazon MSK, managing self-managed Kafka clusters, and best practices for error handling and automatic retries. It was a must-attend session for anyone in data engineering or serverless architecture! ✨ zeb💡💻 #AWSreInvent #Serverless #AWSLambda #Kafka #DataProcessing #EventDriven #AWS #MSK #AppSync #zeb
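For anyone who missed the session, the core of the pattern is small. Here is a hedged sketch of a Lambda handler consuming an MSK event batch; the processing step is a placeholder, and record values arrive base64-encoded.

```python
# Sketch: Lambda handler for an Amazon MSK event source mapping.
# MSK delivers records grouped by "topic-partition"; values are base64-encoded.
import base64
import json


def handler(event, context):
    for topic_partition, records in event["records"].items():
        for record in records:
            payload = json.loads(base64.b64decode(record["value"]))
            # Placeholder processing step; replace with real logic.
            print(topic_partition, record["offset"], payload)

    # Raising an exception here instead would trigger Lambda's automatic
    # retry behavior for the failed batch.
    return {"statusCode": 200}
```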
-
🎉Excited to share that I've officially earned my Apache Airflow certification! Achieving this milestone has significantly enhanced my expertise in workflow orchestration and data pipeline management. A special thanks to Marc Lamberti for his exceptional learning material, which played a crucial role in helping me achieve this goal 🚀. #ApacheAirflow #DataEngineering #Astronomer #Certified #ProfessionalGrowth #Azure #Aws #CloudComputing
-
Discover how Apache XTable enables efficient conversion between open table formats in AWS data lakes. Learn about its integration with the AWS Glue Data Catalog for seamless background conversions, eliminating data duplication and reducing costs. https://lnkd.in/eAWyA34u
Apache XTable: Seamless Conversion Between Data Lake Table Formats on AWS
https://datadomain.blog
Data Engineer
6d
https://medium.com/gitconnected/auditing-s3-object-changes-made-easy-86c340816635