opensource.google.com

Menu

Fluent Bit WriteAPI Connector: Lowering the barrier to streaming data

Wednesday, August 21, 2024

Automating ingestion processes is crucial for modern businesses that handle vast amounts of data daily. In today's fast-paced digital landscape, the ability to seamlessly collect, process, and analyze data can make the difference between staying ahead of the competition and falling behind. To simplify ingestion, tools such as Fluent Bit enable customers to route data between pluggable sources and sinks without needing to write a single line of code. Instead, data routing is managed via a config file. The Fluent Bit WriteAPI Connector is a pluggable sink built on top of the BigQuery Storage Write API that enables organizations to rapidly develop a data ingestion pipeline.


What are the BigQuery Storage Write API and Fluent Bit?

The BigQuery Storage Write API is a high-performance data-ingestion API for BigQuery. It leverages both batching and streaming methods to ingest records into BigQuery in real-time. The WriteAPI offers features such as ability to scale and provides exactly-once delivery to guarantee that data is not duplicated. Using the Write API directly typically requires technical expertise, as users must navigate one of the client SDKs. This can create a high barrier to entry for some customers to stream data into BigQuery.

Fluent Bit is a widely-used open-source observability agent known for its lightweight design, speed, and flexibility. It operates by collecting logs, traces and metrics through various inputs such as local or network files, filtering and buffering them, and then routing them to designated outputs. Fluent Bit's high-performance parsing capabilities allow for data to be processed according to user specifications. The output component is a configurable plugin that directs data to different destinations, such as various tables in BigQuery. There can be multiple WriteAPI outputs and each output can be independently configured to use a specific write mode, enabling seamless data streaming into BigQuery based on tag/match pairs.


Why Use the Fluent Bit WriteAPI Connector?

Our solution to the technical challenges posed by using the WriteAPI is the Fluent Bit WriteAPI Connector. This connector automates the data ingestion process, eliminating the need for customers to write any code. The entire pipeline is managed through a single configuration file, making it easy to use. The flow of data is depicted in the diagram below.

Fluent Bit Flow Diagram

Example Use Case

Say we wish to monitor a log file containing JSON data, and we would like to ingest this data into a BigQuery table that has a single column titled “Text” of type String. A line from the log file looks like this:

{"Text": "Hello, World"}

Setup Process

    1. Setting Up Fluent Bit: The first step is to install and configure Fluent Bit. Once installed, Fluent Bit must be configured to collect data from your desired sources. This involves defining inputs, such as log files or system metrics, that Fluent Bit will monitor. This is explained below.
    2. Cloning the Google Git Repository: Next, clone the Google Git Repository that contains the Fluent Bit WriteAPI Connector. This repository includes all the necessary files to set up the connector, along with an example configuration file to help you get started. Let’s say the git repo is cloned at /usr/local/fluentbit-bigquery-writeapi-sink. Edit the file in the git repo named plugins.conf to provide the full path to the writeapi plugin. For example, the contents of the file can now look like this: 
    [PLUGINS]
      Path    /usr/local/fluentbit-bigquery-writeapi-sink/out_writeapi.so 
    3. Setting Up BigQuery Tables: Ensure that your BigQuery tables are set up and ready to receive data. This might involve creating new tables or configuring existing ones to match the data schema you intend to use. For example, create the BigQuery table with a schema containing the column Text of type STRING. Let’s say the table is created at myProject.myDataset.myTable.
Destination table schema
click to enlarge

    4. Prepare the input file: We will be reading data from a log file at /usr/local/logfile.log. Let’s start with an empty log file. Create the log file as follows: 
    touch /usr/local/logfile.log
    5. Configuring the Plugin: The most critical step is setting up the configuration file for the Fluent Bit WriteAPI Connector. This singular file controls the entire data pipeline, from input collection to data filtering and routing. The configuration file is straightforward and highly intuitive. It allows you to define various parameters, such as input sources, data filters, and output destinations. Create a configuration file in, say /usr/local, and call it demo.conf. See details on how to format a configuration file. It looks like this:
      Sample Config File

This routes the data from /usr/local/logfile.log to the BigQuery table at myProject.myDataset.myTable. There are additional configurable fields that control the stream, such as chunking, asynchronous response queue, and also the type of stream. These fields let you control how your data is streamed.

To run the pipeline, use the command:

fluent-bit -c /usr/local/demo.conf

As the log file is updated new lines will automatically appear in the BigQuery table. For example, to populate the log file you can run the following command:

echo "{\"Text\": \"Hello, world\"}" >> /usr/local/logfile.log

Note that the default flush interval in Fluent Bit is 1 minute, so it might take a minute before the log file is flushed. The BigQuery table will now be updated as follows:

Populated BigQuery table
click to enlarge

Key Features

The connector supports a wide variety of features including multi-instancing, dynamic scaling, exactly-once delivery, and automatic retry.

    1. Multi-Instancing

    • The multi-instancing feature of the Fluent Bit WriteAPI Connector is designed to offer flexibility in routing data. Specifically, users can configure the connector to handle multiple data inputs and outputs in various combinations. This feature also supports more complex configurations, such as multiple inputs feeding into multiple outputs, allowing data to be aggregated or distributed as needed. An input connector is labeled with a tag field. In our example, this has value log1. Data is routed to an output connector based on the value of its match field. In our example, this also has value log1, meaning there is a 1-to-1 correspondence between the input and output connector. The match field is a regex so it can be used to connect with multiple inputs. For example, if this was set to * then data from all inputs would flow to this output.

    2. Dynamic Scaling

    • Handling large volumes of data efficiently is crucial for modern pipelines. The dynamic scaling feature addresses the issue of potential overloads in the Write API. As data is streamed into BigQuery, there may be times when the API queue becomes full—by default, it can hold up to 1000 pending responses. When this limit is reached, no new data can be appended until some of the pending responses are processed, which can create back pressure in the system. To manage this, the connector automatically scales up its capacity by creating an additional network connection when it detects that the number of pending responses has reached the threshold.

    3. Exactly-Once

    • The "exactly-once" feature ensures that each piece of data is sent and recorded in BigQuery exactly once. This feature ensures no data is duplicated. If the connector encounters an intermittent issue while sending a specific piece of data, it will synchronously retry sending it until it is successful. This ensures data is delivered correctly.

    4. Retry Functionality

    • The retry functionality allows the connector to handle temporary failures gracefully. The retry mechanism is configurable, meaning users can set how many times the system should attempt to resend the data before giving up. By default, the connector will retry sending failed data up to four times. In the default stream mode, if a row of data fails to send, it is retried while other rows continue to be processed. However, in the "exactly once" mode, the retry process is synchronous, meaning the system will wait for the failed row to be successfully sent before moving on to subsequent rows.

    5. Error Handling

    • Error handling in the connector is designed to catch and manage issues that may arise during data transmission. The connector will continue processing incoming data even if earlier data had a failure. Any permanent issues that are encountered are logged to the console.

Conclusion

The ability to efficiently collect, process, and analyze data is a critical factor for business success. The Fluent Bit WriteAPI Connector stands out as a powerful solution that simplifies and automates the data ingestion process, bridging the gap between Fluent Bit's versatile data collection capabilities and Google BigQuery's robust analytics platform.

By eliminating the need for complex coding and manual data management, the Fluent Bit WriteAPI Connector lowers the barrier to entry for businesses of all sizes. Whether you're a small startup or a large enterprise, this tool allows you to effortlessly set up and manage your data pipelines with a single configuration file. Its features like multi-instancing, dynamic scaling, exactly-once delivery, and error handling ensure that your data is ingested accurately, reliably, and in real-time.

The straightforward setup process, combined with the flexibility and scalability of the connector, make it a valuable asset for any organization looking to harness the power of their data. By automating the ingestion process, businesses can focus on what truly matters: deriving actionable insights from their data to drive growth and innovation.

By Tanishqa Puhan, BigQuery WriteAPI

2023 Open Source Contributions: A Year in Review

Tuesday, August 13, 2024


At Alphabet, open source remains a critical component of our business and internal systems. We depend on thousands of upstream projects and communities to run our infrastructure, products, and services. Within the Open Source Programs Office (OSPO), we continue to focus on investing in the sustainability of open source communities and expanding access to open source opportunities for contributors around the world. As participants in this global ecosystem, our goal with this report is to provide transparency and to report our work within and around open source communities.

In 2023 roughly 10% of Alphabet’s full-time workforce actively contributed to open source projects. This percentage has remained roughly consistent over the last five years, indicating that our open source contribution has remained proportional to the size of Alphabet over time. Over the last 5 years, Google has released more than 7,000 open source elements, representing a mix of new projects, features, libraries, SDKs, datasets, sample code, and more.


Most open source projects we contribute to are outside of Alphabet

In 2023, employees from Alphabet interacted with more than 70,000 public repositories on GitHub. Over the last five years, more than 70% of the non-personal GitHub repositories receiving Alphabet contributions were outside of Google-managed organizations. Our top external projects (by number of unique contributors at Alphabet) include both Google-initiated projects such as Kubernetes, Apache Beam, and gRPC as well as community-led projects such as LLVM, Envoy, and web-platform-tests.

In addition to Alphabet employees supporting external projects, in 2023 Alphabet-led projects received contributions from more than 180,000 non-Alphabet employees (unique GitHub accounts not affiliated with Alphabet).


Open source remains vital to industry collaboration and innovation

As the technology industry turns to focus on novel AI and machine learning technologies, open source communities have continued to serve as a shared resource and avenue for collaboration on new frameworks and emerging standards. In addition to launching new projects such as Project Open Se Cura (an open-source framework to accelerate the development of secure, scalable, transparent and efficient AI systems), we also collaborated with AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Anyscale, Apple, Arm, Cerebras, Graphcore, Hugging Face, Intel, Meta, NVIDIA, and SiFive to release OpenXLA to the public for use and contribution. OpenXLA is an open source ML compiler enabling developers to train and serve highly-optimized models from all leading ML frameworks on all major ML hardware. In addition to technology development, Google’s OSPO has been supporting the OSI's Open Source AI definition initiative, which aims to clearly define 'Open Source AI' by the end of 2024.


Investing in the next generation of open source contributors

As a longstanding consumer and contributor to open source projects, we believe it is vital to continue funding both established communities as well as invest in the next generation of contributors to ensure the sustainability of open source ecosystems. In 2023, OSPO provided $2.4M in sponsorships and membership fees to more than 60 open source projects and organizations. Note that this value only represents OSPO's financial contribution; other teams across Alphabet also directly fund open source work. In addition, we continue to support our longstanding programs:

  • In its 19th year, Google Summer of Code (GSoC) enabled more than 900 individuals to contribute to 168 organizations. Over the lifetime of this program, more than 20,000 individuals from 116 countries have contributed to more than 1,000 open source organizations across the globe.
  • In its fifth year, Google Season of Docs provided direct grants to 13 open source projects to improve open source project documentation. Each organization also created a case study to help other open source projects learn from their experience.
A map of the world with highlighting every country that has had Google Summer of Code participants

Securing our shared supply chain remains a priority

We continue to invest in improving the security posture of open source projects and ecosystems. Since launching in 2016, Google's free OSS-Fuzz code testing service has helped discover and get over 10000 vulnerabilities and 34,000 bugs fixed across more than 1200 projects. In 2023, we added features, expanded our OSS-Fuzz Rewards Program, and continued our support for academic fuzzing research. In 2023, we also applied the generative power of LLMs to improve fuzz testing. In addition to this project we’ve been:

  • Helping more projects adopt security best practices as well as identify and remediate vulnerabilities: Over the last year, the upstream team has proposed security improvements to more than 181 critical open source projects including widely-used projects such as NumPy, etcd, XGBoost, Ruby, TypeScript, LLVM, curl, Docker, and more. In addition to this work, GOSST continues to support OSV-Scanner to help projects find existing vulnerabilities in their dependencies, and enable comprehensive detection and remediation by providing commit-level vulnerability detail for over 30,000 existing CVE records from the NVD.

Our open source work will continue to grow and evolve to support the changing needs of our communities. Thank you to our colleagues and community members who continue to dedicate personal and professional time supporting the open source ecosystem. Follow our work at opensource.google.


Appendix: About this data

This report features metrics provided by many teams and programs across Alphabet. In regards to the code and code-adjacent activities data, we wanted to share more details about the derivation of those metrics.

  • Data sources: These data represent the activities of Alphabet employees on public repositories hosted on GitHub and our internal production Git service Git-on-Borg. These sources represent a subset of open source activity currently tracked by Google OSPO.
      • GitHub: We continue to use GitHub Archive as the primary source for GitHub data, which is available as a public dataset on BigQuery. Alphabet activity within GitHub is identified by self-registered accounts, which we estimate underreports actual activity.
      • Git-on-Borg: This is a Google managed git service which hosts some of our larger, long running open source projects such as Android and Chromium. While we continue to develop on this platform, most of our open source activity has moved to GitHub to increase exposure and encourage community growth.
  • Driven by humans: We have created many automated bots and systems that can propose changes on various hosting platforms. We have intentionally filtered these data to focus on human-initiated activities.
  • Business and personal: Activity on GitHub reflects a mixture of Alphabet projects, third-party projects, experimental efforts, and personal projects. Our metrics report on all of the above unless otherwise specified.
  • Alphabet contributors: Please note that unless additional detail is specified, activity counts attributed to Alphabet open source contributors will include our full-time employees as well as our extended Alphabet community (temps, vendors, contractors, and interns). In 2023, full time employees at Alphabet represented more than 95% of our open source contributors.
  • GitHub Accounts: For counts of GitHub accounts not affiliated with Alphabet, we cannot assume that one account is equivalent to one person, as multiple accounts could be tied to one individual or bot account.
  • *Active counts: Where possible, we will show ‘active users’ defined by logged activity (excluding ‘WatchEvent’) within a specified timeframe (a month, year, etc.) and ‘active repositories’ and ‘active projects’ as those that have enough activity to meet our internal active-project criteria and have not been archived.

By Sophia Vargas – Analyst and Researcher, OSPO

Introducing the Pigweed SDK: A modern embedded development suite

Thursday, August 8, 2024

Back in 2020, Google announced Pigweed, an open-source collection of embedded libraries to enable a faster and more reliable development experience for 32-bit microcontrollers. Since then, Pigweed’s extensive collection of middleware libraries has continuously evolved and now includes RTOS abstractions and a powerful RPC interface. These components have shipped in millions of devices, including Google’s own Pixel suite of devices, Nest thermostats, DeepMind robots, as well as satellites and autonomous aerial drones.

Today, we introduce the first developer preview of the Pigweed SDK, making it even easier to leverage Pigweed’s libraries to develop, debug, test, and deploy embedded C++ applications. Using the included sample applications and comprehensive tutorial, you can easily get started prototyping simple programs and build up to more complex applications that leverage advanced Pigweed functionalities. Pigweed’s modern and modular approach makes it easy to design applications with significantly reduced debugging and maintenance overhead, thus making it a perfect choice for medium to large product teams.

We are also thrilled to contribute to the Raspberry Pi Pico 2 and RP2350 launch, providing official support in Pigweed for RP2350 and its predecessor, the RP2040. Building on the success of the Pico 1 and RP2040, the Pico 2 introduces the RP2350 microcontroller, bringing more performance and an exciting set of new capabilities in a much lower power profile. We’ve worked closely with the Raspberry Pi team to not only provide a great experience on Pigweed, but also upstreamed a new Bazel-based build system for Raspberry Pi’s own Pico SDK.

Raspberry Pi Pico 2 (RP2350) with Enviro+ pack hat.
Raspberry Pi Pico 2 (RP2350) with Enviro+ pack hat.

What's in the SDK

The Pigweed SDK aims to be the best way to develop for the Pico family of devices. The SDK includes the Sense showcase project, which demonstrates a lot of our vision for the future of sustainable, robust, and rapid embedded system development, such as:

  • Hermetic building, flashing, testing, and cross-platform toolchain integration through Bazel.
  • Fully open-source Clang/LLVM toolchain for embedded that includes a compiler, linker, and C/C++ libraries with modern performance, features, and standards compliance
  • Efficient and robust device communication over RPC
  • An interactive REPL for viewing device logs and sending commands via command-line and web interfaces
  • Visual Studio Code integration with full C++ code intelligence
  • GitHub Actions support for continuous building and testing
  • Access to pico-sdk APIs when you need to drop down to hardware-specific functionality
Moving image of the Pigweed CLI console engaging with the device through interactive Remote Procedure Calls (RPCs).
Utilize the Pigweed CLI console to communicate with your device through interactive Remote Procedure Calls (RPCs).

By building your project with the Pigweed SDK (using the Sense showcase as your guide), you can start on readily available hardware like the Pico 1 or 2 today. Then when you’re ready to start prototyping your own custom hardware, you can target your Pigweed SDK project to your custom hardware without the need for a major rewrite.

Try Sense now

Bazel for embedded

Pigweed is all-in on Bazel for embedded development. We believe Bazel has great potential to improve the productivity (and mental wellbeing) of embedded development teams. We made the "all-in" decision last September and the Raspberry Pi collaboration was a great motivator to really flesh out our Bazel strategy:

  • We contributed to an entirely new Bazel-based build for the Pico SDK to make it easy for the RP2 ecosystem to use Bazel and demonstrate how Bazel takes care of complex toolchains and dependencies for you.
  • The new Sense showcase demonstrates Bazel-based building, testing, and flashing.
  • Our GitHub Actions guide shows you how to build and test your Bazel-based repo when pull requests are opened, updated, or merged.

Head over to Bazel's launch blog post to learn more about the benefits of Bazel for embedded.


Clang/LLVM for embedded

Pigweed SDK fully leverages the modern Clang/LLVM toolchain. We are especially excited to include LLVM libc, a fully compliant libc implementation that can easily be decomposed and scaled down for smaller systems. The team spent many months developing and contributing patches to the upstream project. Their collaboration with teams across Google and the upstream LLVM team was instrumental in making this new version of libc available for embedded use cases.

The sample applications, Pigweed modules, host-side unit tests, and showcase examples already use Clang, LLD, LLVM libc and libc++. Thus, developers can take advantage of Clang’s diagnostics and large tooling ecosystem, LLD’s fast linking times, and modern C and C++ standard library implementations which support features such as Thread Safety Analysis and Hardening.


IDE integration

With full Visual Studio Code support through pw_ide, you can build and test the Sense showcase from the comfort of a modern IDE and extend the IDE integration to meet your needs. Full target-aware code intelligence makes the experience smooth even for complicated embedded products. Automatic linting, formatting, and code quality analysis integrations are coming soon.


Parallel on-device testing with PicoPico

As you would expect from a team with the mission to make embedded development more sustainable, robust, and rapid for large teams, we are of course obsessed with testing. We have hundreds of on-device unit tests running all the time on Picos. The existing options were a bit slow so we whipped up PicoPico in a week (literally) to make it easier to run all these tests in parallel.

A “PicoPico” node
One “PicoPico” node for running parallel on-device tests

RP2 support

The goal behind our extensive catalog of modules is to make it easy to fully leverage C++ in your embedded system codebases. We aim to provide sensible, reusable, hardware-agnostic abstractions that you can build entire systems on top of. Most of our modules work with any hardware, and we have RP2 drivers for I2C, SPI, GPIO, exception handling, and chrono. When Pigweed's modules don't meet your needs, you can still fallback to using pico-sdk APIs directly.


Get started

Clone our Sense showcase repo and follow along with our tutorial. The showcase is a suitable starting point for learning what a fully featured embedded system built on top of the Pigweed SDK looks like.


What’s next

The Pigweed team will continue to have regular and on-going preview releases, with new features, bug fixes, and improvements based on your feedback. The team is working on a comms stack for all your end-to-end networking needs, factory-at-your-desk scripting, and much, much more. Stay tuned on the Pigweed blog for updates!


Learn more

Questions? Feedback? Talk to us on Discord or email us at [email protected].

We have a Pigweed Live session scheduled on August 26th, 13:00 PST where the Pigweed team will talk more about the Pigweed SDK and answer any questions you have. Join [email protected] to get an invite to the meetings.


Acknowledgements

We are profoundly grateful for our passionate community of customers, partners, and contributors. We honor all the time and energy you've given us over the years. Thank you!

By Amit Uttamchandani – Product Manager, Keir Mierle – Software Engineer, and the Pigweed team.

.