Tom Reid’s Post

AWS just made auditing S3 object changes a whole lot easier with the S3 Metadata announcement at re:Invent 2024. Currently in preview, once metadata is set up you can use other AWS data analysis services, such as Redshift and Spark on EMR, to query when S3 objects were added, deleted, or updated. I just wrote up a detailed post on how to set it up on the Level Up Coding blog. The link to the post is in the first comment. #AWS #S3 #metadata #s3_auditing
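For a feel of what querying the metadata looks like before reading the full post, here is a minimal PySpark sketch. The catalog, namespace, and table names are assumptions for illustration, and the column names follow the general shape of the preview's journal table (key, record_type, record_timestamp); check the docs for your actual identifiers.

```python
# Minimal sketch of querying an S3 Metadata table with Spark on EMR.
# Assumes Spark is already configured with the Iceberg/S3 Tables catalog;
# the catalog, namespace, and table names below are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-metadata-audit").getOrCreate()

# record_type distinguishes creates, metadata updates, and deletes.
recent_changes = spark.sql("""
    SELECT key, record_type, record_timestamp, size
    FROM s3tablescatalog.aws_s3_metadata.my_bucket_metadata
    WHERE record_timestamp > current_timestamp() - INTERVAL 7 DAYS
    ORDER BY record_timestamp DESC
""")
recent_changes.show(truncate=False)
```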
More Relevant Posts
-
Why should we use object storage? 🤔

What is it? Object storage, also known as object-based storage, is a computer data storage architecture designed to handle large amounts of unstructured data.

And why use it?
1. Custom metadata and searchability
2. Resiliency
3. Archive management
4. Automated information lifecycle management
5. Unlimited scalability
6. Cost-effectiveness
7. Convenience

Two of the best ways to use object storage are AWS S3 and MinIO. #minio #s3 #aws #object_storage
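To make point 1 concrete, here is a minimal boto3 sketch of attaching and reading back custom metadata on an S3 object; the bucket and key names are placeholders.

```python
# Sketch: custom metadata on an S3 object with boto3.
# Bucket/key names are placeholders; metadata keys are user-defined and
# travel on the wire prefixed with "x-amz-meta-".
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="example-bucket",
    Key="reports/2024/q4.csv",
    Body=b"col_a,col_b\n1,2\n",
    Metadata={"department": "finance", "retention": "7y"},
)

head = s3.head_object(Bucket="example-bucket", Key="reports/2024/q4.csv")
print(head["Metadata"])  # {'department': 'finance', 'retention': '7y'}
```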
-
🔧 Optimizing AWS DMS S3 Targets for CDC Workflows with Hudi Streamer / Glue Jobs 🔧

If you're using AWS DMS to capture Change Data Capture (CDC) events from databases and deliver them into your data lake, having the right S3 target settings is crucial. Whether you're using Hudi Streamer to handle deletes and updates or leveraging AWS Glue to build a medallion architecture, the following configurations can help streamline your process (see the sketch after this list):

Why these settings matter:
• Date partitioning: organizing CDC events by year/MM/DD in your S3 bucket allows easy reprocessing if needed, enhancing data management and recovery.
• IncludeOpForFullLoad: ensures the Op column is available during full loads, making your CDC logic simpler and more effective.
• Optimized data format: setting DataFormat=parquet and CompressionType=GZIP reduces file size and storage costs while maintaining query performance.
• File size management: the CdcMinFileSize setting keeps output files at 64 MB or larger, which is crucial for efficient querying and for avoiding the small-files problem.

Pro tip: implement a 30-day or 60-day lifecycle policy on your raw data zone to automatically archive or delete older files, keeping your storage optimized and cost-effective.

By configuring these settings, you can maximize the efficiency of your CDC ingestion process and ensure your data lake runs smoothly! 🚀

#AWS #DMS #DataEngineering #BigData #Cloud #DataLakes #S3 #ApacheHudi #GlueJobs #CDC Apache Hudi
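As a rough illustration, here is what those settings could look like when creating an S3 target endpoint with boto3. The role ARN and bucket name are placeholders, and values should be checked against the DMS docs (note that CdcMinFileSize is expressed in kilobytes).

```python
# Sketch: DMS S3 target endpoint with the settings discussed above.
# ARN and bucket are placeholders; verify enum values against the DMS docs.
import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="cdc-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
        "BucketName": "my-raw-zone",
        "DataFormat": "parquet",
        "CompressionType": "gzip",       # GZIP compression for parquet output
        "IncludeOpForFullLoad": True,    # keep the Op column on full load
        "DatePartitionEnabled": True,    # year/MM/DD folders for easy replay
        "DatePartitionSequence": "YYYYMMDD",
        "DatePartitionDelimiter": "SLASH",
        "CdcMinFileSize": 64000,         # ~64 MB minimum files (value in KB)
    },
)
```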
-
Completed the AWS Edge Storage, Data Transfer, and File Transfer Services Getting Started course. #learningneverstops #aws #storage #data #transfer #keeplearning
-
❄ Here's how to automate your data loading from AWS S3 to Snowflake. 👇🏻

In my latest video, I explain the concept of Snowpipe. You will learn:
✅ What Snowpipe is
✅ How to set up event notifications in S3
✅ Different possibilities for usage
✅ Concerns to be cautious about

🔍 And of course, a step-by-step configuration of Snowpipe from scratch.

Find the video link here: https://lnkd.in/dtqWch8B

Mention someone who could benefit from this 👇🏻 #Snowflake #SnowflakeNinja #dataengineering
Automate your data loading from AWS S3 to Snowflake | Snowpipe | #SnowflakeNinja
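If you prefer reading code to watching video, here is a hedged sketch of the core DDL such a setup runs, issued via the Snowflake Python connector. All object names are made up, the stage is assumed to already point at your S3 bucket, and the S3 event notification still has to be wired to the pipe's SQS ARN (surfaced by DESC PIPE).

```python
# Sketch: creating an auto-ingest Snowpipe from Python.
# All object names are placeholders; assumes the stage already exists.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="etl_wh", database="raw", schema="landing",
)
cur = conn.cursor()

# AUTO_INGEST = TRUE makes the pipe listen for S3 event notifications.
cur.execute("""
    CREATE PIPE IF NOT EXISTS orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO orders_raw
    FROM @s3_landing_stage/orders/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# DESC PIPE exposes the notification_channel (an SQS ARN) that the
# S3 bucket's event notification must target.
cur.execute("DESC PIPE orders_pipe")
print(cur.fetchall())
```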
-
Leveraging AWS Lambda for Real-Time ETL Workflows

AWS Lambda has been a game-changer for building serverless ETL workflows. By using Lambda to process events in real time, we reduced data flow latency by 20% without worrying about managing servers. It’s perfect for scalable, cost-efficient data processing. #AWS #Lambda #ETL #Serverless #DataEngineering
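A minimal sketch of the pattern, assuming S3 put events as the trigger; the bucket names and the transform step are entirely illustrative, not the workflow described above.

```python
# Sketch: S3-triggered Lambda performing a small transform-and-land ETL step.
# Target bucket and the filtering logic are illustrative placeholders.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys arrive URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(raw)

        # Illustrative transform: keep only completed orders.
        cleaned = [r for r in rows if r.get("status") == "completed"]

        s3.put_object(
            Bucket="my-curated-zone",  # placeholder target bucket
            Key=f"cleaned/{key}",
            Body=json.dumps(cleaned).encode("utf-8"),
        )
```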
-
The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables
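In practice this surfaces as table optimizers on the catalog. A hedged boto3 sketch, assuming the account ID, database, table, and role ARN are placeholders and that the storage-optimization types shown match your needs:

```python
# Sketch: enabling Glue Data Catalog storage optimization on an Iceberg table.
# Names/ARNs are placeholders. 'retention' and 'orphan_file_deletion' are the
# storage-optimization types that complement the existing 'compaction' type.
import boto3

glue = boto3.client("glue")

for optimizer_type in ("retention", "orphan_file_deletion"):
    glue.create_table_optimizer(
        CatalogId="123456789012",
        DatabaseName="lakehouse",
        TableName="orders_iceberg",
        Type=optimizer_type,
        TableOptimizerConfiguration={
            "roleArn": "arn:aws:iam::123456789012:role/glue-optimizer-role",
            "enabled": True,
        },
    )
```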
-
Exciting insights from AWS re:Invent 2024 on serverless data processing with AWS Lambda and Apache Kafka! 🌐 This hands-on session demonstrated how to seamlessly process data from Apache Kafka streams using AWS Lambda, leveraging its native support as an event source for real-time streaming analytics. Key takeaways included optimizing data processing workloads with Amazon MSK, managing self-managed Kafka clusters, and best practices for error handling and automatic retries. It was a must-attend session for anyone in data engineering or serverless architecture! ✨ zeb💡💻 #AWSreInvent #Serverless #AWSLambda #Kafka #DataProcessing #EventDriven #AWS #MSK #AppSync #zeb
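For anyone who missed the session, the core of the pattern is small. Here is a hedged sketch of a Lambda handler consuming an MSK event batch; the processing step is a placeholder, and record values arrive base64-encoded.

```python
# Sketch: Lambda handler for an Amazon MSK event source mapping.
# MSK delivers records grouped by "topic-partition"; values are base64-encoded.
import base64
import json


def handler(event, context):
    for topic_partition, records in event["records"].items():
        for record in records:
            payload = json.loads(base64.b64decode(record["value"]))
            # Placeholder processing step; replace with real logic.
            print(topic_partition, record["offset"], payload)

    # Raising an exception here instead would trigger Lambda's automatic
    # retry behavior for the failed batch.
    return {"statusCode": 200}
```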
-
🎉Excited to share that I've officially earned my Apache Airflow certification! Achieving this milestone has significantly enhanced my expertise in workflow orchestration and data pipeline management. A special thanks to Marc Lamberti for his exceptional learning material, which played a crucial role in helping me achieve this goal 🚀. #ApacheAirflow #DataEngineering #Astronomer #Certified #ProfessionalGrowth #Azure #Aws #CloudComputing
-
Discover how Apache XTable enables efficient conversion between open table formats in AWS data lakes. Learn about its integration with the AWS Glue Data Catalog for seamless background conversions, eliminating data duplication and reducing costs. https://lnkd.in/eAWyA34u
Apache XTable: Seamless Conversion Between Data Lake Table Formats on AWS
https://datadomain.blog
Data Engineer
6d
https://medium.com/gitconnected/auditing-s3-object-changes-made-easy-86c340816635