🚀 Day 36/100 - Data Engineering Journey

Hey 👋 !! Today, while I was working on a data pipeline, I found myself grappling with an important question: what are some of the best practices and considerations when extracting data from a URL using HTTP connectors in Azure?

For data engineers, fetching data from online sources is a common task, but ensuring efficiency, reliability, and data integrity in this process requires careful consideration of various factors. Here are some key best practices and things to keep in mind when working on such a task:

1. Use Established HTTP Connectors: When utilizing HTTP connectors in Azure or any other platform, leverage the connectors or libraries the platform provides. These often come with built-in functionality for handling authentication, rate limiting, and error handling.

2. Authentication and Authorization: Ensure that you properly authenticate with the target URL if authentication is required. This may involve using API keys, OAuth tokens, or other authentication mechanisms provided by the data source.

3. Respect Rate Limits: Many APIs impose rate limits to prevent abuse and ensure fair usage. Implementing exponential backoff strategies can help gracefully handle rate limit errors.

4. Data Validation: Perform thorough validation of the fetched data to ensure its integrity and accuracy. Check for missing or invalid values, unexpected data formats, and other anomalies that may affect downstream processes.

5. Error Handling: Implement robust error handling mechanisms to gracefully handle network errors, server errors, and other exceptions that may occur during data extraction. Log error messages and relevant metadata for debugging and troubleshooting purposes.

6. Monitoring and Alerting: Set up monitoring and alerting systems to track the performance and health of your data extraction process. Monitor metrics such as latency, throughput, and error rates, and configure alerts to notify you of any anomalies or issues.

7. Compliance and Legal Considerations: Ensure that your data extraction activities comply with the terms of service and legal requirements of the data source. Respect copyright laws, intellectual property rights, and any other relevant regulations.

By following these best practices and considerations, we can ensure that our data extraction process is efficient, reliable, and compliant with the guidelines and policies of the data sources we interact with. (A minimal sketch of points 2-5 in Python follows below.)

Stay tuned for more insights and advancements as we continue our 100 Days of Data Engineering journey!

[ Picture source and an engaging read: https://lnkd.in/gCApqW8B ]

#DataEngineering #Azure #DataExtraction #HTTPConnectors #BestPractices #Day36 #100daysofdataengineering
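Here is that sketch of points 2-5 in Python with the `requests` library: authenticated fetch, exponential backoff on 429s, basic error handling, and a simple validation check. The endpoint, header, and key handling are illustrative, not any specific Azure connector API.

```python
import time
import requests

API_URL = "https://api.example.com/data"  # hypothetical endpoint
API_KEY = "..."  # load from a secret store (e.g. Azure Key Vault), never hardcode

def fetch_with_backoff(url: str, max_retries: int = 5) -> list:
    """Fetch JSON from an authenticated endpoint, backing off on rate limits."""
    headers = {"Authorization": f"Bearer {API_KEY}"}  # point 2: authentication
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, headers=headers, timeout=30)
        except requests.ConnectionError:
            time.sleep(2 ** attempt)  # point 5: tolerate transient network errors
            continue
        if resp.status_code == 429:  # point 3: respect rate limits
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()  # surface other client/server errors
        return resp.json()
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")

records = fetch_with_backoff(API_URL)
# Point 4: validate before handing data downstream
if not isinstance(records, list) or not records:
    raise ValueError("Unexpected or empty payload; refusing to load downstream")
```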
-
Ready to tackle your data’s schema problems? Here’s a step-by-step guide to Expectations that catch schema issues before they can cause trouble in your pipelines ➡️ https://hubs.li/Q02Y0bW10 #GXCloud #GXCore #dataquality #dataengineering
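For a flavor of what a schema Expectation looks like, here is a minimal sketch following the GX Core quickstart pattern; the exact API varies across GX versions, so treat the method and class names as illustrative rather than definitive.

```python
import great_expectations as gx
import pandas as pd

context = gx.get_context()

# A tiny in-memory table standing in for real pipeline data
df = pd.DataFrame({"id": [1, 2], "amount": [9.99, 12.50]})
batch = context.data_sources.pandas_default.read_dataframe(df)

# Fail fast if the schema drifts: columns must match, in order
expectation = gx.expectations.ExpectTableColumnsToMatchOrderedList(
    column_list=["id", "amount"]
)
result = batch.validate(expectation)
print(result.success)  # False as soon as a column is renamed, dropped, or moved
```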
-
Data engineering stuff - when interacting with external APIs, it doesn't always make sense to retry. There are apparently 63 different HTTP response codes, but only 7 of those warrant a retry, and even that's probably too many.

In this particular case, we were retrying 403s, which is "Forbidden". But with a retry of the whole job, it went away. We didn't pay much attention to it until today.

Turns out that Azure had expanded its IP address ranges, and in about 1 in 1,000 cases we got an IP address that was not on the allow-list. The full job retry got a different instance and IP, though, so it worked.

Worth retrying:
408 Request Timeout
425 Too Early
429 Too Many Requests
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
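A minimal sketch of this policy in Python, using the retry support built into urllib3/requests (the endpoint is hypothetical): only the codes above are retried, with exponential backoff, so a 403 fails fast instead of burning retries.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

RETRYABLE_STATUSES = [408, 425, 429, 500, 502, 503, 504]

retry = Retry(
    total=5,
    backoff_factor=1,  # sleeps grow exponentially between attempts
    status_forcelist=RETRYABLE_STATUSES,  # retry ONLY these codes
    allowed_methods=["GET"],  # retry idempotent requests only
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

resp = session.get("https://api.example.com/data")  # hypothetical endpoint
resp.raise_for_status()  # a 403 surfaces immediately instead of being retried
```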
-
We knew we could make things faster. But not 1,000 times faster...

***

When the team observed that sharing large notebooks and team workspaces in Evernote was frustratingly slow, they challenged themselves to make it faster than ever. The first step was to investigate the existing service to understand the underlying inefficiencies. Senior engineer at Evernote, Mattia Gentil, explains:

“A few things were contributing to an overall slow experience:
• First, the legacy service was breaking down each shared notebook or workspace into individual entities (such as notes and tasks) and fetching them one by one during the sharing process.
• Then, for each of those entities, the service was performing two SQL queries: one to a database, and one to a secondary microservice to fetch additional metadata.
• Finally, the service relied on a custom tree-like data structure to compute the new permissions resulting from the share.

All of this—pulling individual entities, multiple SQL queries, an external microservice, and complex data structures—resulted in a laggy and unreliable customer experience. We were faced with a choice: We could work on incremental improvements. Or we could scrap the whole service and start from scratch.”

In the end, the team decided to go all-in and rewrite the whole service. This was the more difficult path, but it had the potential to create the greatest value for our customers.

“This was a tricky data engineering problem on many fronts,” Mattia continues. “We had to repackage the queries to go from six to two. We had to find a way to make a single request to our database, no matter the size of the notebook. And we had to introduce HashMaps to replace the custom data structures, integrating a new microservice into the main sharing service.”

Everyone expected these changes to make the sharing experience smoother and faster, but it wasn’t until Mattia ran the first tests on the new service that he realized just how much speed they had unlocked.

“When I saw the sharing speed was a thousand times faster in some cases, I couldn't believe it. I re-ran the tests at least five times before I was convinced it wasn’t some kind of fluke. From there, the improvements continued to pile up: Sharing times for smaller notebooks also got noticeably faster, and a common cause of timeout errors was eliminated completely, improving overall reliability. It was fantastic to see all our hard work pay off in such a clear and immediate way for our customers.”

Huge bravo to the team for choosing the more challenging option and successfully executing such meaningful improvements. 👏
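As a generic illustration of the batching idea described above (not Evernote's actual code), here is the difference between querying entities one by one and issuing a single batched query whose results are indexed in a dict, Python's equivalent of a HashMap:

```python
import sqlite3

# Toy table standing in for the notes store
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO notes VALUES (?, ?)",
                 [(1, "roadmap"), (2, "meeting"), (3, "ideas")])

note_ids = [1, 2, 3]

# Legacy pattern: one round trip per entity -- O(n) queries
slow = [conn.execute("SELECT id, title FROM notes WHERE id = ?",
                     (nid,)).fetchone() for nid in note_ids]

# Rewritten pattern: one query regardless of notebook size,
# with rows indexed by id for constant-time metadata lookups
placeholders = ",".join("?" for _ in note_ids)
rows = conn.execute(
    f"SELECT id, title FROM notes WHERE id IN ({placeholders})", note_ids
).fetchall()
by_id = {row[0]: row for row in rows}

print(by_id[2])  # (2, 'meeting')
```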
-
Data integrity is the cornerstone of reliable Pub/Sub messaging. Ably’s architecture is built to ensure exactly-once delivery, strict message ordering, and resilience to failures—all while maintaining sub-50ms latency globally.

Our latest engineering blog from senior engineer Zak Knill dives into the architectural internals that make it possible, like:
- Primary and secondary message persistence for durability
- Idempotent publishing to eliminate duplicates
- Global message replication across regions for fault tolerance

Discover how Ably’s Pub/Sub architecture guarantees data integrity at scale. 👉 https://hubs.la/Q02Z2K5x0
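As a conceptual sketch of idempotent publishing (not Ably's SDK or internals): the publisher attaches a stable message id that stays fixed across retries, so the receiving side can drop redeliveries instead of processing duplicates.

```python
import uuid

seen_ids: set[str] = set()  # in practice: a bounded, replicated dedup window

def deliver(msg: dict) -> None:
    """Stand-in for the receiving side of the channel."""
    if msg["id"] in seen_ids:
        return  # duplicate caused by a publish retry -- drop it
    seen_ids.add(msg["id"])
    print("delivered:", msg["data"])

def publish_with_retry(payload: str, attempts: int = 3) -> None:
    msg = {"id": str(uuid.uuid4()), "data": payload}  # id fixed across retries
    for _ in range(attempts):
        try:
            deliver(msg)  # stand-in for the network send
            return
        except ConnectionError:
            continue  # safe to retry: same id, so no duplicate delivery

publish_with_retry("order-created")
publish_with_retry("order-created")  # new message, new id -- delivered again
```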
-
Datadog Log Workspaces enables teams to seamlessly analyze log data from any number of sources using SQL and natural language querying, as well as flexible data transformations, joins, and visualizations. Learn more: https://lnkd.in/e6k5SpEe Datadog #log #analysis #monitoring
-
📢 Has anybody else noticed that this week, Datadog rolled out not one but two products focused on data operations? Here's what they do and why they matter, with insights from Bram Elfrink, our Tech Lead at DataChef.

𝐃𝐚𝐭𝐚 𝐉𝐨𝐛𝐬 𝐌𝐨𝐧𝐢𝐭𝐨𝐫𝐢𝐧𝐠 (June 24, 2024)
Helps data teams monitor Spark and Databricks jobs with:
🟣 Real-time Alerts for job failures and latency spikes.
🟣 Detailed Trace Views to resolve issues quickly.
🟣 Cost Optimization by analyzing resource utilization.

Bram says: “Spark has a large footprint in the data engineering world. We use it heavily for our batch and stream processing pipelines. Being able to detect latency spikes before our end-users experience a negative impact is a game changer.”

𝐃𝐚𝐭𝐚𝐝𝐨𝐠 𝐀𝐩𝐩 𝐁𝐮𝐢𝐥𝐝𝐞𝐫 (June 25, 2024)
Enables teams to create self-service apps within their monitoring stacks, featuring:
🟣 Custom App Creation with UI components and templates.
🟣 Enhanced Incident Response to reduce response times.
🟣 Wide Integration with over 550 tools and platforms.

Bram says: “Datadog App Builder makes it possible to act on insights directly within Datadog. This allows us to focus on the matter at hand instead of constantly switching contexts.”

Follow for more Friday updates on the data world, and if you need help with these tools, let us know! 🔢🧑‍🍳

#DataChef #OneKitchenOneTeam #DataOps #Datadog
-
🗣 Hey #FluentFam check out our latest blog: Avoiding Data Loss and Backpressure with #FluentBit, by Sharad Regoti. 🛠️ For DevOps pros and new developers alike, this guide is packed with essential tips for smarter data management. 📑 https://hubs.li/Q02jVbD_0
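As a language-agnostic illustration of the backpressure idea the blog covers (this is not Fluent Bit code; Fluent Bit manages this with settings such as mem_buf_limit and filesystem storage): a bounded buffer forces a fast producer to pause when a slow consumer falls behind, instead of growing without limit and eventually dropping data.

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=50)  # bounded buffer, akin to an input's memory limit

def consumer() -> None:
    while True:
        buf.get()
        time.sleep(0.01)  # simulate a slow downstream output
        buf.task_done()

threading.Thread(target=consumer, daemon=True).start()

for i in range(200):
    buf.put(f"log line {i}")  # blocks when full: backpressure on the producer

buf.join()  # wait for the consumer to drain the buffer
print("all lines flushed without unbounded buffering or data loss")
```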
-
Data Observability 🌐

According to a Datadog survey, 80% of organizations now view data observability as crucial for informed decision-making. Data engineers must adopt advanced observability tools and practices to effectively manage and resolve data issues.

👉 Read the full article here: https://lnkd.in/guMk254d

Migesa: Grow Digitally
Visit: https://migesa.com

#Nearshore #SoftwareEngineering #Data #DevOps #CloudWorld #Microsoft #AzureSolutions #SoftwareDevelopment #DataEngineering #CloudComputing #Migesa #TechInnovation
-
How would you re-architect the use case below for a global logistics workflow?

Overview
Nenran is a data pipeline that serves as a critical input for various optimization models within the logistics domain. This article presents the current challenges faced by Nenran and proposes a re-architecture to align it with the 1C framework.

Pain Points and Reasons for Re-architecture

Technical Standards and Manual Work:
> Hardcoded S3 folder names: This limits scalability and makes parallel execution cumbersome (see the sketch after this post).
> Manual code execution: Relies on individual Data Engineers, creating bottlenecks and potential errors.
> Time-consuming data retrieval: Finding complete datasets for past runs is inefficient.

Dependency on Manual Files and Datanet:
> Manual files: Introduce potential errors and inconsistencies.
> Datanet limitations: Query performance issues and lack of control over data sources.

Closed Ecosystem:
> S3 storage: Hinders data exploration and accessibility.
> Historical assumptions: May not align with current requirements or other tools.

Comment "Solution" below and I will send you the customised solution to these pain points!

#BigData #Architecture #DataPipeline #ReArchitecture #GlobalLogistics #Workflow #OptimizationModels #DataEngineering #DataGovernance #DataSources #DataQuality #S3 #Datanet #Redshift #CloudNative #Technologies #WorkflowEngine #DataLake #Warehouse
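For the first pain point, a hypothetical sketch of swapping a hardcoded S3 folder for a run-scoped, parameterized prefix (the bucket and dataset names are made up): each run writes to its own partition, so parallel runs don't collide and past runs are trivially discoverable.

```python
from datetime import datetime, timezone

BUCKET = "s3://nenran-data"  # illustrative bucket, not the real one

def output_prefix(dataset: str, run_id: str) -> str:
    """Build a unique, discoverable S3 prefix for one pipeline run."""
    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return f"{BUCKET}/{dataset}/run_date={run_date}/run_id={run_id}/"

# Parallel runs land in separate prefixes instead of clobbering one folder
print(output_prefix("shipments", "a1b2c3"))
print(output_prefix("shipments", "d4e5f6"))
```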
-
Transform & enrich your logs on-premise with the new integrations and out-of-the-box processing features in Datadog Observability Pipelines Worker 2.1. #Datadog #logs #observability #integration #dataprocessing