opensource.google.com

Posts from 2024

Keys to a resilient Open Source future

Wednesday, September 18, 2024


In today’s world, free and open source software is a critical component of almost every software solution. Research shows that 70% of modern software relies on open source components, with over 97% of applications leveraging open source code. Unsurprisingly, 90% of companies are using or applying open source code in some form.


These statistics highlight the importance of open source software in modern technology and software development. At the same time, they demonstrate that as its relevance grows, so do the challenges associated with keeping it safe. At Open Source Summit EU, we discussed these challenges and how open source security could be improved. Let’s begin by breaking down the landscape of open source.

The open source ecosystem is fragmented, with diverse languages, build systems, and testing pipelines, making it difficult to maintain consistent security standards. This fragmentation forces developers to juggle multiple roles, such as managing security vulnerabilities, often without adequate tools and support. As a result, inconsistencies and security gaps arise, leaving open source projects vulnerable to attacks. Creating consistent security practices across the board is key: standardization helps minimize vulnerabilities while streamlining the development process.

Google’s SLSA (Supply Chain Levels for Software Artifacts) framework and OSV (Open Source Vulnerabilities) schemas are prime examples of how de facto standardization can transform open source security. SLSA has united several companies to create a standard that enables developers to improve their supply chain security posture, helping prevent attacks like those experienced by SolarWinds and Codecov.

The OSV schema has also been successful, with more than 20 language ecosystems adopting it. This schema allows vulnerabilities to be exported in a precise, machine-readable format, making them easier to manage and address. Thanks to its standardized format, over 150,000 vulnerabilities in open source software have been aggregated and made accessible to anyone in the world via a single API call.
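
As an illustration of that single API call, here is a minimal sketch that queries the OSV API for known vulnerabilities affecting one package version (the package and version are only an example; any ecosystem covered by OSV works the same way):

import json
import urllib.request

# Ask osv.dev which known vulnerabilities affect an example package version.
query = {
    "package": {"name": "jinja2", "ecosystem": "PyPI"},
    "version": "2.4.1",
}
request = urllib.request.Request(
    "https://api.osv.dev/v1/query",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)

# Each entry follows the OSV schema; print the advisory IDs and summaries.
for vuln in result.get("vulns", []):
    print(vuln["id"], vuln.get("summary", ""))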

However, many tasks remain manual, making them time-consuming and more prone to human error. Developers must integrate multiple tools at different stages of the software development cycle. The future of open source security lies in a fully integrated platform: a suite that brings together best-in-industry tools and solutions and provides simple hooks for continuous operation in the CI/CD system. Automation is crucial.


The key to revolutionizing open source security is AI, as it can automate manual and error-prone tasks, and reduce the burden on developers.

Google has already started leveraging AI in open source security by successfully using it to write and improve fuzzer unit tests. Google's OSS-Fuzz has been a game changer with a 40% increase in code coverage for over 160 projects. Since its inception, it has identified over 12,000 vulnerabilities with a 90% fix rate. Its effectiveness is due to its close integration with the developer’s workflow, testing the latest commits and helping to fix regressions quickly.

While AI remains an area of active research, and it has not yet solved all security challenges, Google is eager to collaborate with the community to push the boundaries of what AI can achieve in open source security.

Google's approach to open source security is now focused on long-term thinking and scalable solutions. To make a meaningful difference at scale, it is focusing on three key aspects:

  • Simplifying and applying security best practices consistently: Common, usable standards are key to reducing vulnerabilities and maintaining a secure ecosystem.
  • Developing an intelligent and integrated platform: A seamless, integrated platform that automates security tasks and naturally integrates into the developer workflow.
  • Leveraging AI to accelerate and enhance security: Reducing the workload on developers and catching vulnerabilities that might go undetected.

By maintaining this focus and continuing to collaborate with the community, Google and the open source ecosystem can ensure that FOSS remains a secure, reliable foundation for the software solutions of tomorrow.

By Abhishek Arya – Principal Engineer, Open Source and Supply Chain Security

Google Open Sources Smart Buildings Simulator and Dataset to Accelerate Sustainable Innovation

Tuesday, September 17, 2024

In our ongoing commitment to sustainability and technological advancement, Google is excited to announce a significant step forward in the realm of smart buildings. Today, we are open-sourcing two invaluable resources:

    1. TensorFlow Smart Buildings Simulator: A powerful tool designed to train reinforcement learning agents to optimize energy consumption and minimize carbon emissions in buildings.

    2. Smart Buildings Dataset: A comprehensive collection of six years of telemetry data from three Google buildings, providing real-world insights for developing and validating optimal control solutions.


Empowering the Future of Smart Buildings

Buildings account for a substantial portion of global energy consumption and greenhouse gas emissions. As we strive to create a more sustainable future, optimizing the energy efficiency of buildings is paramount. Artificial intelligence and machine learning offer promising solutions, and Google is dedicated to accelerating progress in this field.

The TensorFlow Smart Buildings Simulator provides researchers and developers with a realistic and customizable environment to train reinforcement learning agents. These agents can learn to make intelligent decisions about heating, cooling, ventilation, and lighting systems, balancing occupant comfort with energy efficiency and carbon reduction goals. By open-sourcing this simulator, we aim to empower the community to develop innovative control strategies that can be applied to real-world buildings.

Complementing the simulator, the Smart Buildings Dataset offers a wealth of real-world data collected from three Google buildings over six years. This dataset encompasses a wide range of telemetry, including temperature, humidity, occupancy, lighting levels, and energy consumption. By making this data available, we hope to enable researchers to develop data-driven models, validate their simulations, and gain deeper insights into the complex dynamics of building systems.
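
As a sketch of the kind of exploratory analysis such telemetry enables (the file name and column names below are purely illustrative placeholders, not the dataset's actual schema, which is documented alongside the data):

import pandas as pd

# Hypothetical file and column names, for illustration only.
telemetry = pd.read_csv("building_telemetry.csv", parse_dates=["timestamp"])

# Hourly averages of two example signals, e.g. as a sanity check before
# fitting a data-driven model or validating a simulator against real traces.
hourly = (
    telemetry.set_index("timestamp")[["zone_temperature", "energy_kwh"]]
    .resample("1H")
    .mean()
)
print(hourly.describe())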


Collaboration for a Sustainable Future

We believe that open collaboration is key to driving innovation and progress in the smart buildings domain. By open-sourcing these resources, Google aims to foster a vibrant ecosystem of researchers, academics, and industry professionals working together to enhance sustainability and advance the field of smart buildings.

We envision universities leveraging these resources to conduct cutting-edge research, develop new algorithms, and train the next generation of engineers. Industry partners can utilize the simulator and dataset to test and validate their solutions, accelerate development cycles, and bring more efficient and sustainable products to market.


Google's Commitment to Sustainability

This open-source initiative aligns with Google's broader commitment to sustainability. We have set ambitious goals to operate on 24/7 carbon-free energy by 2030 and achieve net-zero emissions across all our operations and value chain by 2040. By sharing our tools and data, we hope to contribute to a global effort to reduce the environmental impact of buildings and create a more sustainable future for all.


Get Involved

We invite researchers, developers, and industry professionals to explore these open-source resources and join us in our mission to build a more sustainable world. Together, we can harness the power of AI and data to transform the way we design, operate, and interact with buildings, creating a future where energy efficiency, occupant comfort, and environmental responsibility go hand in hand.

Let's collaborate, innovate, and build a brighter future for smart buildings!

By John Sipple – Google Core Enterprise Machine Learning Team

Empowering etcd Reliability: New Downgrade Support in Version 3.6

Thursday, September 12, 2024


In the world of distributed systems, reliability is paramount. etcd, a widely used key-value store often critical to infrastructure, has made strides in enhancing this aspect. While etcd's reliability has been robust thanks to the Raft consensus protocol, the same couldn't be said for upgrades/downgrades – until now.


The Challenge of etcd Downgrades

Historically, downgrading etcd has been a complex and unsupported process. There was no way to safely downgrade etcd data after it had been touched by a newer version. Upgrades, while reliable, weren't easily reversible, often requiring external tools and backups. This lack of flexibility posed a significant challenge for users who encountered issues after upgrading.


Enter etcd 3.6: A New Era of Downgrade Support

etcd 3.6 introduces a groundbreaking solution: built-in downgrade support. This innovation not only simplifies the upgrade and downgrade processes but also significantly enhances etcd's reliability.

How Does It Work?

  • Storage Versioning: A new storage version (SV) is persisted within the etcd data file. This version indicates compatibility, ensuring safe upgrades and downgrades.
  • Schema Evolution: A comprehensive schema tracks all fields in the data file and acts as a source of truth about which version a particular field was introduced in, allowing etcd to understand and manipulate data across versions.
  • etcdutl migrate: A dedicated command-line tool, etcdutl migrate, streamlines the skip-level upgrade and downgrade process, eliminating the need for complex manual steps.

Benefits for Users

The introduction of downgrade support in etcd 3.6 offers a range of benefits for users:

  • Improved Reliability: Upgrades can be safely reverted, reducing the risk of data loss or operational disruption.
  • Simplified Management: The upgrade and downgrade processes are streamlined, reducing the complexity of managing etcd clusters.
  • Increased Flexibility: Users have greater flexibility in managing their etcd environments, allowing them to experiment with new versions and roll back if necessary.

Under the Hood: Technical Details

To achieve downgrade support, etcd 3.6 implements a strict storage versioning policy. This means that etcd data is versioned: etcd will no longer load data generated by a version higher than its own and must rely on the cluster downgrade process instead. This ensures that the DB and WAL files never contain information that could be incorrectly interpreted.

During the downgrade process, new fields from the higher version are cleaned up in the DB files. The etcd protocol version is lowered to allow older versions to join. New features, RPCs, and fields are not used, preventing older members from interpreting replicated logs differently. This also means that entries added to the WAL file must be compatible with lower versions. When a WAL snapshot happens, all older incompatible entries are applied, so they no longer need to be read and the storage version can be downgraded.

The etcdutl migrate command-line tool simplifies the etcd data upgrade and downgrade process in scenarios spanning two or more minor versions by validating the WAL log's compatibility with the target version, executing any necessary schema changes on the DB file, and updating the storage version.

Implementation Milestones

The rollout of downgrade support is planned in three milestones:

  • Snapshot Storage Versions: Storage versioning is implemented for snapshots.
  • Version Annotations: etcd code is annotated with versions, and a schema is created for the data file.
  • Full Downgrade Support: Downgrades can be fully implemented using the established storage versioning and schema.

We are currently working on finishing the third milestone.


Looking Ahead

etcd 3.6 marks a significant step forward in the reliability and manageability of etcd clusters. The introduction of downgrade support empowers users with greater flexibility and control over their etcd environments. As etcd continues to evolve, we can expect further enhancements to the upgrade and downgrade processes, further solidifying its position as a critical component in modern distributed systems.

By Siyuan Zhang – Software Engineer

Kubernetes 1.31 is now available on GKE, just one week after Open Source Release!

Wednesday, August 28, 2024


Kubernetes 1.31 is now available in the Google Kubernetes Engine (GKE) Rapid Channel, just one week after the OSS release! For more information about the content of Kubernetes 1.31, read the official Kubernetes 1.31 Release Notes and the specific GKE 1.31 Release Notes.

This release consists of 45 enhancements. Of those enhancements, 11 have graduated to Stable, 22 are entering Beta, and 12 are new in Alpha.


Kubernetes 1.31: Key Features


Field Selectors for Custom Resources

Kubernetes 1.31 makes it possible to use field selectors with custom resources. JSONPath expressions may now be added to the spec.versions[].selectableFields field in CustomResourceDefinitions to declare which fields may be used by field selectors. For example, if a custom resource has a spec.environment field, and the field is included in the selectableFields of the CustomResourceDefinition, then it is possible to filter by environment using a field selector like spec.environment=production. The filtering is performed on the server and can be used for both list and watch requests.
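
As a rough sketch of what such a server-side filtered list looks like at the REST level, the fieldSelector query parameter carries the selector; the API server address, token, CRD group, and resource name below are hypothetical:

import requests

# Hypothetical cluster endpoint, credentials, and CRD coordinates.
API_SERVER = "https://my-cluster.example.com:6443"
TOKEN = "..."  # a bearer token for a suitably authorized service account

response = requests.get(
    f"{API_SERVER}/apis/example.com/v1/namespaces/default/widgets",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"fieldSelector": "spec.environment=production"},
    verify="/path/to/ca.crt",  # cluster CA bundle
)
response.raise_for_status()
# Only objects whose declared selectable field matches are returned.
for item in response.json().get("items", []):
    print(item["metadata"]["name"])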


SPDY / Websockets migration

Kubernetes exposes an HTTP/REST interface, but a small subset of these HTTP/REST calls are upgraded to streaming connections. For example, both kubectl exec and kubectl port-forward use streaming connections. But the streaming protocol Kubernetes originally used (SPDY) has been deprecated for eight years. Users may notice this if they use a proxy or gateway in front of their cluster. If the proxy or gateway does not support the old, deprecated SPDY streaming protocol, then these streaming kubectl calls will not work. With this release, we have modernized the protocol for the streaming connections from SPDY to WebSockets. Proxies and gateways will now interact better with Kubernetes clusters.


Consistent Reads

Kubernetes 1.31 introduces a significant performance and reliability boost with the beta release of "Consistent Reads from Cache." This feature leverages etcd's progress notifications to allow Kubernetes to intelligently serve consistent reads directly from its watch cache, improving performance particularly for requests using label or field selectors that return only a small subset of a larger resource. For example, when a Kubelet requests a list of pods scheduled on its node, this feature can significantly reduce the overhead associated with filtering the entire list of pods in the cluster. Additionally, serving reads from the cache leads to more predictable request costs, enhancing overall cluster reliability.


Traffic Distribution for Services

The .spec.trafficDistribution field provides another way to influence traffic routing within a Kubernetes Service. While traffic policies focus on strict semantic guarantees, traffic distribution allows you to express preferences (such as routing to topologically closer endpoints). This can help optimize for performance, cost, or reliability.


Multiple Service CIDRs

Service IP ranges are defined during cluster creation and cannot be modified during the cluster lifetime. GKE also allocates the Service IP space from the VPC. When dealing with IP exhaustion problems, cluster admins needed to expand the assigned Service CIDR range. This new beta feature in Kubernetes 1.31 allows users to dynamically add Service CIDR ranges with zero downtime.


Acknowledgements

As always, we want to thank all the Googlers who provide their time, passion, talent, and leadership to keep making Kubernetes the best container orchestration platform. For the features mentioned in this blog, we would especially like to mention Googlers Joe Betz, Jordan Liggitt, Sean Sullivan, Tim Hockin, Antonio Ojea, Marek Siarkowicz, Wojciech Tyczynski, Rob Scott, and Gaurav Ghildiyal.

By Federico Bongiovanni – Google Kubernetes Engine

Fluent Bit WriteAPI Connector: Lowering the barrier to streaming data

Wednesday, August 21, 2024

Automating ingestion processes is crucial for modern businesses that handle vast amounts of data daily. In today's fast-paced digital landscape, the ability to seamlessly collect, process, and analyze data can make the difference between staying ahead of the competition and falling behind. To simplify ingestion, tools such as Fluent Bit enable customers to route data between pluggable sources and sinks without needing to write a single line of code. Instead, data routing is managed via a config file. The Fluent Bit WriteAPI Connector is a pluggable sink built on top of the BigQuery Storage Write API that enables organizations to rapidly develop a data ingestion pipeline.


What are the BigQuery Storage Write API and Fluent Bit?

The BigQuery Storage Write API is a high-performance data-ingestion API for BigQuery. It leverages both batching and streaming methods to ingest records into BigQuery in real time. The Write API offers features such as the ability to scale and provides exactly-once delivery to guarantee that data is not duplicated. Using the Write API directly typically requires technical expertise, as users must navigate one of the client SDKs. This can create a high barrier to entry for some customers to stream data into BigQuery.

Fluent Bit is a widely-used open-source observability agent known for its lightweight design, speed, and flexibility. It operates by collecting logs, traces and metrics through various inputs such as local or network files, filtering and buffering them, and then routing them to designated outputs. Fluent Bit's high-performance parsing capabilities allow for data to be processed according to user specifications. The output component is a configurable plugin that directs data to different destinations, such as various tables in BigQuery. There can be multiple WriteAPI outputs and each output can be independently configured to use a specific write mode, enabling seamless data streaming into BigQuery based on tag/match pairs.


Why Use the Fluent Bit WriteAPI Connector?

Our solution to the technical challenges posed by using the WriteAPI is the Fluent Bit WriteAPI Connector. This connector automates the data ingestion process, eliminating the need for customers to write any code. The entire pipeline is managed through a single configuration file, making it easy to use. The flow of data is depicted in the diagram below.

Fluent Bit Flow Diagram

Example Use Case

Say we wish to monitor a log file containing JSON data, and we would like to ingest this data into a BigQuery table that has a single column titled “Text” of type String. A line from the log file looks like this:

{"Text": "Hello, World"}

Setup Process

    1. Setting Up Fluent Bit: The first step is to install and configure Fluent Bit. Once installed, Fluent Bit must be configured to collect data from your desired sources. This involves defining inputs, such as log files or system metrics, that Fluent Bit will monitor. This is explained below.
    2. Cloning the Google Git Repository: Next, clone the Google Git Repository that contains the Fluent Bit WriteAPI Connector. This repository includes all the necessary files to set up the connector, along with an example configuration file to help you get started. Let’s say the git repo is cloned at /usr/local/fluentbit-bigquery-writeapi-sink. Edit the file in the git repo named plugins.conf to provide the full path to the writeapi plugin. For example, the contents of the file can now look like this: 
    [PLUGINS]
      Path    /usr/local/fluentbit-bigquery-writeapi-sink/out_writeapi.so 
    3. Setting Up BigQuery Tables: Ensure that your BigQuery tables are set up and ready to receive data. This might involve creating new tables or configuring existing ones to match the data schema you intend to use. For example, create the BigQuery table with a schema containing the column Text of type STRING. Let’s say the table is created at myProject.myDataset.myTable.
Destination table schema

    4. Prepare the input file: We will be reading data from a log file at /usr/local/logfile.log. Let’s start with an empty log file. Create the log file as follows: 
    touch /usr/local/logfile.log
    5. Configuring the Plugin: The most critical step is setting up the configuration file for the Fluent Bit WriteAPI Connector. This singular file controls the entire data pipeline, from input collection to data filtering and routing. The configuration file is straightforward and highly intuitive. It allows you to define various parameters, such as input sources, data filters, and output destinations. Create a configuration file in, say /usr/local, and call it demo.conf. See details on how to format a configuration file. It looks like this:
      Sample Config File

This routes the data from /usr/local/logfile.log to the BigQuery table at myProject.myDataset.myTable. There are additional configurable fields that control the stream, such as chunking, asynchronous response queue, and also the type of stream. These fields let you control how your data is streamed.

To run the pipeline, use the command:

fluent-bit -c /usr/local/demo.conf

As the log file is updated, new lines will automatically appear in the BigQuery table. For example, to populate the log file you can run the following command:

echo "{\"Text\": \"Hello, world\"}" >> /usr/local/logfile.log

Note that the default flush interval in Fluent Bit is 1 minute, so it might take a minute before the log file is flushed. The BigQuery table will now be updated as follows:

Populated BigQuery table
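
To double-check the ingested rows programmatically, a quick query through the BigQuery client library also works (a sketch; substitute your own project, dataset, and table names):

from google.cloud import bigquery

# Assumes application default credentials and the example table created above.
client = bigquery.Client(project="myProject")
rows = client.query("SELECT Text FROM `myProject.myDataset.myTable` LIMIT 10").result()
for row in rows:
    print(row.Text)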

Key Features

The connector supports a wide variety of features including multi-instancing, dynamic scaling, exactly-once delivery, and automatic retry.

    1. Multi-Instancing

    • The multi-instancing feature of the Fluent Bit WriteAPI Connector is designed to offer flexibility in routing data. Specifically, users can configure the connector to handle multiple data inputs and outputs in various combinations. This feature also supports more complex configurations, such as multiple inputs feeding into multiple outputs, allowing data to be aggregated or distributed as needed. An input connector is labeled with a tag field. In our example, this has value log1. Data is routed to an output connector based on the value of its match field. In our example, this also has value log1, meaning there is a 1-to-1 correspondence between the input and output connector. The match field is a regex so it can be used to connect with multiple inputs. For example, if this was set to * then data from all inputs would flow to this output.

    2. Dynamic Scaling

    • Handling large volumes of data efficiently is crucial for modern pipelines. The dynamic scaling feature addresses the issue of potential overloads in the Write API. As data is streamed into BigQuery, there may be times when the API queue becomes full—by default, it can hold up to 1000 pending responses. When this limit is reached, no new data can be appended until some of the pending responses are processed, which can create back pressure in the system. To manage this, the connector automatically scales up its capacity by creating an additional network connection when it detects that the number of pending responses has reached the threshold.

    3. Exactly-Once

    • The "exactly-once" feature ensures that each piece of data is sent and recorded in BigQuery exactly once. This feature ensures no data is duplicated. If the connector encounters an intermittent issue while sending a specific piece of data, it will synchronously retry sending it until it is successful. This ensures data is delivered correctly.

    4. Retry Functionality

    • The retry functionality allows the connector to handle temporary failures gracefully. The retry mechanism is configurable, meaning users can set how many times the system should attempt to resend the data before giving up. By default, the connector will retry sending failed data up to four times. In the default stream mode, if a row of data fails to send, it is retried while other rows continue to be processed. However, in the "exactly once" mode, the retry process is synchronous, meaning the system will wait for the failed row to be successfully sent before moving on to subsequent rows.

    5. Error Handling

    • Error handling in the connector is designed to catch and manage issues that may arise during data transmission. The connector will continue processing incoming data even if earlier data had a failure. Any permanent issues that are encountered are logged to the console.

Conclusion

The ability to efficiently collect, process, and analyze data is a critical factor for business success. The Fluent Bit WriteAPI Connector stands out as a powerful solution that simplifies and automates the data ingestion process, bridging the gap between Fluent Bit's versatile data collection capabilities and Google BigQuery's robust analytics platform.

By eliminating the need for complex coding and manual data management, the Fluent Bit WriteAPI Connector lowers the barrier to entry for businesses of all sizes. Whether you're a small startup or a large enterprise, this tool allows you to effortlessly set up and manage your data pipelines with a single configuration file. Its features like multi-instancing, dynamic scaling, exactly-once delivery, and error handling ensure that your data is ingested accurately, reliably, and in real-time.

The straightforward setup process, combined with the flexibility and scalability of the connector, make it a valuable asset for any organization looking to harness the power of their data. By automating the ingestion process, businesses can focus on what truly matters: deriving actionable insights from their data to drive growth and innovation.

By Tanishqa Puhan, BigQuery WriteAPI

2023 Open Source Contributions: A Year in Review

Tuesday, August 13, 2024


At Alphabet, open source remains a critical component of our business and internal systems. We depend on thousands of upstream projects and communities to run our infrastructure, products, and services. Within the Open Source Programs Office (OSPO), we continue to focus on investing in the sustainability of open source communities and expanding access to open source opportunities for contributors around the world. As participants in this global ecosystem, our goal with this report is to provide transparency and to report our work within and around open source communities.

In 2023 roughly 10% of Alphabet’s full-time workforce actively contributed to open source projects. This percentage has remained roughly consistent over the last five years, indicating that our open source contribution has remained proportional to the size of Alphabet over time. Over the last 5 years, Google has released more than 7,000 open source elements, representing a mix of new projects, features, libraries, SDKs, datasets, sample code, and more.


Most open source projects we contribute to are outside of Alphabet

In 2023, employees from Alphabet interacted with more than 70,000 public repositories on GitHub. Over the last five years, more than 70% of the non-personal GitHub repositories receiving Alphabet contributions were outside of Google-managed organizations. Our top external projects (by number of unique contributors at Alphabet) include both Google-initiated projects such as Kubernetes, Apache Beam, and gRPC as well as community-led projects such as LLVM, Envoy, and web-platform-tests.

In addition to Alphabet employees supporting external projects, in 2023 Alphabet-led projects received contributions from more than 180,000 external contributors (unique GitHub accounts not affiliated with Alphabet).


Open source remains vital to industry collaboration and innovation

As the technology industry turns to focus on novel AI and machine learning technologies, open source communities have continued to serve as a shared resource and avenue for collaboration on new frameworks and emerging standards. In addition to launching new projects such as Project Open Se Cura (an open-source framework to accelerate the development of secure, scalable, transparent and efficient AI systems), we also collaborated with AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Anyscale, Apple, Arm, Cerebras, Graphcore, Hugging Face, Intel, Meta, NVIDIA, and SiFive to release OpenXLA to the public for use and contribution. OpenXLA is an open source ML compiler enabling developers to train and serve highly-optimized models from all leading ML frameworks on all major ML hardware. In addition to technology development, Google’s OSPO has been supporting the OSI's Open Source AI definition initiative, which aims to clearly define 'Open Source AI' by the end of 2024.


Investing in the next generation of open source contributors

As a longstanding consumer of and contributor to open source projects, we believe it is vital to continue funding established communities as well as investing in the next generation of contributors to ensure the sustainability of open source ecosystems. In 2023, OSPO provided $2.4M in sponsorships and membership fees to more than 60 open source projects and organizations. Note that this value only represents OSPO's financial contribution; other teams across Alphabet also directly fund open source work. In addition, we continue to support our longstanding programs:

  • In its 19th year, Google Summer of Code (GSoC) enabled more than 900 individuals to contribute to 168 organizations. Over the lifetime of this program, more than 20,000 individuals from 116 countries have contributed to more than 1,000 open source organizations across the globe.
  • In its fifth year, Google Season of Docs provided direct grants to 13 open source projects to improve open source project documentation. Each organization also created a case study to help other open source projects learn from their experience.
A map of the world highlighting every country that has had Google Summer of Code participants

Securing our shared supply chain remains a priority

We continue to invest in improving the security posture of open source projects and ecosystems. Since launching in 2016, Google's free OSS-Fuzz code testing service has helped discover and fix over 10,000 vulnerabilities and 34,000 bugs across more than 1,200 projects. In 2023, we added features, expanded our OSS-Fuzz Rewards Program, and continued our support for academic fuzzing research. We also applied the generative power of LLMs to improve fuzz testing. In addition to this project we’ve been:

  • Helping more projects adopt security best practices as well as identify and remediate vulnerabilities: Over the last year, the upstream team has proposed security improvements to more than 181 critical open source projects including widely-used projects such as NumPy, etcd, XGBoost, Ruby, TypeScript, LLVM, curl, Docker, and more. In addition to this work, GOSST continues to support OSV-Scanner to help projects find existing vulnerabilities in their dependencies, and enable comprehensive detection and remediation by providing commit-level vulnerability detail for over 30,000 existing CVE records from the NVD.

Our open source work will continue to grow and evolve to support the changing needs of our communities. Thank you to our colleagues and community members who continue to dedicate personal and professional time supporting the open source ecosystem. Follow our work at opensource.google.


Appendix: About this data

This report features metrics provided by many teams and programs across Alphabet. In regards to the code and code-adjacent activities data, we wanted to share more details about the derivation of those metrics.

  • Data sources: These data represent the activities of Alphabet employees on public repositories hosted on GitHub and our internal production Git service Git-on-Borg. These sources represent a subset of open source activity currently tracked by Google OSPO.
      • GitHub: We continue to use GitHub Archive as the primary source for GitHub data, which is available as a public dataset on BigQuery. Alphabet activity within GitHub is identified by self-registered accounts, which we estimate underreports actual activity.
      • Git-on-Borg: This is a Google managed git service which hosts some of our larger, long running open source projects such as Android and Chromium. While we continue to develop on this platform, most of our open source activity has moved to GitHub to increase exposure and encourage community growth.
  • Driven by humans: We have created many automated bots and systems that can propose changes on various hosting platforms. We have intentionally filtered these data to focus on human-initiated activities.
  • Business and personal: Activity on GitHub reflects a mixture of Alphabet projects, third-party projects, experimental efforts, and personal projects. Our metrics report on all of the above unless otherwise specified.
  • Alphabet contributors: Please note that unless additional detail is specified, activity counts attributed to Alphabet open source contributors will include our full-time employees as well as our extended Alphabet community (temps, vendors, contractors, and interns). In 2023, full time employees at Alphabet represented more than 95% of our open source contributors.
  • GitHub Accounts: For counts of GitHub accounts not affiliated with Alphabet, we cannot assume that one account is equivalent to one person, as multiple accounts could be tied to one individual or bot account.
  • Active counts: Where possible, we will show ‘active users’ defined by logged activity (excluding ‘WatchEvent’) within a specified timeframe (a month, year, etc.) and ‘active repositories’ and ‘active projects’ as those that have enough activity to meet our internal active-project criteria and have not been archived.

By Sophia Vargas – Analyst and Researcher, OSPO

Introducing the Pigweed SDK: A modern embedded development suite

Thursday, August 8, 2024

Back in 2020, Google announced Pigweed, an open-source collection of embedded libraries to enable a faster and more reliable development experience for 32-bit microcontrollers. Since then, Pigweed’s extensive collection of middleware libraries has continuously evolved and now includes RTOS abstractions and a powerful RPC interface. These components have shipped in millions of devices, including Google’s own Pixel suite of devices, Nest thermostats, DeepMind robots, as well as satellites and autonomous aerial drones.

Today, we introduce the first developer preview of the Pigweed SDK, making it even easier to leverage Pigweed’s libraries to develop, debug, test, and deploy embedded C++ applications. Using the included sample applications and comprehensive tutorial, you can easily get started prototyping simple programs and build up to more complex applications that leverage advanced Pigweed functionalities. Pigweed’s modern and modular approach makes it easy to design applications with significantly reduced debugging and maintenance overhead, thus making it a perfect choice for medium to large product teams.

We are also thrilled to contribute to the Raspberry Pi Pico 2 and RP2350 launch, providing official support in Pigweed for RP2350 and its predecessor, the RP2040. Building on the success of the Pico 1 and RP2040, the Pico 2 introduces the RP2350 microcontroller, bringing more performance and an exciting set of new capabilities in a much lower power profile. We’ve worked closely with the Raspberry Pi team to not only provide a great experience on Pigweed, but also upstreamed a new Bazel-based build system for Raspberry Pi’s own Pico SDK.

Raspberry Pi Pico 2 (RP2350) with Enviro+ pack hat.

What's in the SDK

The Pigweed SDK aims to be the best way to develop for the Pico family of devices. The SDK includes the Sense showcase project, which demonstrates a lot of our vision for the future of sustainable, robust, and rapid embedded system development, such as:

  • Hermetic building, flashing, testing, and cross-platform toolchain integration through Bazel.
  • Fully open-source Clang/LLVM toolchain for embedded that includes a compiler, linker, and C/C++ libraries with modern performance, features, and standards compliance
  • Efficient and robust device communication over RPC
  • An interactive REPL for viewing device logs and sending commands via command-line and web interfaces
  • Visual Studio Code integration with full C++ code intelligence
  • GitHub Actions support for continuous building and testing
  • Access to pico-sdk APIs when you need to drop down to hardware-specific functionality
Utilize the Pigweed CLI console to communicate with your device through interactive Remote Procedure Calls (RPCs).

By building your project with the Pigweed SDK (using the Sense showcase as your guide), you can start on readily available hardware like the Pico 1 or 2 today. Then when you’re ready to start prototyping your own custom hardware, you can target your Pigweed SDK project to your custom hardware without the need for a major rewrite.

Try Sense now

Bazel for embedded

Pigweed is all-in on Bazel for embedded development. We believe Bazel has great potential to improve the productivity (and mental wellbeing) of embedded development teams. We made the "all-in" decision last September and the Raspberry Pi collaboration was a great motivator to really flesh out our Bazel strategy:

  • We contributed to an entirely new Bazel-based build for the Pico SDK to make it easy for the RP2 ecosystem to use Bazel and demonstrate how Bazel takes care of complex toolchains and dependencies for you.
  • The new Sense showcase demonstrates Bazel-based building, testing, and flashing.
  • Our GitHub Actions guide shows you how to build and test your Bazel-based repo when pull requests are opened, updated, or merged.

Head over to Bazel's launch blog post to learn more about the benefits of Bazel for embedded.


Clang/LLVM for embedded

Pigweed SDK fully leverages the modern Clang/LLVM toolchain. We are especially excited to include LLVM libc, a fully compliant libc implementation that can easily be decomposed and scaled down for smaller systems. The team spent many months developing and contributing patches to the upstream project. Their collaboration with teams across Google and the upstream LLVM team was instrumental in making this new version of libc available for embedded use cases.

The sample applications, Pigweed modules, host-side unit tests, and showcase examples already use Clang, LLD, LLVM libc and libc++. Thus, developers can take advantage of Clang’s diagnostics and large tooling ecosystem, LLD’s fast linking times, and modern C and C++ standard library implementations which support features such as Thread Safety Analysis and Hardening.


IDE integration

With full Visual Studio Code support through pw_ide, you can build and test the Sense showcase from the comfort of a modern IDE and extend the IDE integration to meet your needs. Full target-aware code intelligence makes the experience smooth even for complicated embedded products. Automatic linting, formatting, and code quality analysis integrations are coming soon.


Parallel on-device testing with PicoPico

As you would expect from a team with the mission to make embedded development more sustainable, robust, and rapid for large teams, we are of course obsessed with testing. We have hundreds of on-device unit tests running all the time on Picos. The existing options were a bit slow so we whipped up PicoPico in a week (literally) to make it easier to run all these tests in parallel.

One “PicoPico” node for running parallel on-device tests

RP2 support

The goal behind our extensive catalog of modules is to make it easy to fully leverage C++ in your embedded system codebases. We aim to provide sensible, reusable, hardware-agnostic abstractions that you can build entire systems on top of. Most of our modules work with any hardware, and we have RP2 drivers for I2C, SPI, GPIO, exception handling, and chrono. When Pigweed's modules don't meet your needs, you can still fallback to using pico-sdk APIs directly.


Get started

Clone our Sense showcase repo and follow along with our tutorial. The showcase is a suitable starting point for learning what a fully featured embedded system built on top of the Pigweed SDK looks like.


What’s next

The Pigweed team will continue to have regular, ongoing preview releases, with new features, bug fixes, and improvements based on your feedback. The team is working on a comms stack for all your end-to-end networking needs, factory-at-your-desk scripting, and much, much more. Stay tuned on the Pigweed blog for updates!


Learn more

Questions? Feedback? Talk to us on Discord or email us at [email protected].

We have a Pigweed Live session scheduled on August 26th, 13:00 PST where the Pigweed team will talk more about the Pigweed SDK and answer any questions you have. Join [email protected] to get an invite to the meetings.


Acknowledgements

We are profoundly grateful for our passionate community of customers, partners, and contributors. We honor all the time and energy you've given us over the years. Thank you!

By Amit Uttamchandani – Product Manager, Keir Mierle – Software Engineer, and the Pigweed team.

Google CQL: From Clinical Measurements to Action

Wednesday, July 31, 2024


Today, many institutions are building custom solutions for understanding their medical data, as well as tools for acting on that data. A major pain point with the current approach is that these tools can be error prone and lack built-in medical context and representations of medical data structures. Enter Clinical Quality Language (CQL), a portable, computable, and open HL7 language specification for expressing computable clinical logic over healthcare data. We believe that CQL has the power to radically improve the future of data driven workflows in healthcare. Over the past year at Google Health, our team has been hard at work building foundational tools for healthcare data analytics. Today we’re announcing the release of an experimental open source toolkit for Clinical Quality Language execution.

The Google CQL engine is an experimental open source toolkit that includes a CQL execution engine built from scratch in Go. We built this engine with horizontal scalability, ease of use, and high test coverage in mind. We wanted to make it easy to experiment with our engine, so we’ve included an easy-to-use CLI, REPL, and a two-click-setup web playground! The toolkit is still a work in progress and we very much welcome input, contributions, and ideas from the community.


Why CQL

CQL represents a major shift away from the precedent of distributing clinical logic as free text guidelines which each institution implements in custom and often error prone ways. Now, CQL allows clinical logic to be written once, distributed, and run anywhere in a single framework. Major standards bodies like Medicare, NCQA, and the World Health Organization (WHO) have already started to adopt and distribute clinical measures in CQL! (Check out these antenatal care measures from the WHO as an example). We believe that CQL lowers the burden to writing, sharing, and computing complex clinical content.

CQL supports multiple common healthcare data models (such as FHIR and QDM) and is designed with common clinical concepts, tasks, and nested data structures in mind. For example, consider this comparison:

A side by side comparison of FHIR SQL (BigQuery) to CQL.
This logic extracts CHD encounters with statins prescribed during the visit.

The FHIR SQL requires more boilerplate, unnesting, and custom value set handling. The CQL is clearly more readable, more concise, and easier to understand than the SQL implementation for this example.

If you’d like to see a more in depth CQL example with an explanation, see Appendix A.

As the healthcare industry has matured, so have the representations of Clinical Quality Measures. Previously, clinical quality mandates were provided as free-text guidelines, leaving each medical institution to implement them on its own, which was error prone and repetitive across the industry. Today, institutions like the WHO, CMS, and NCQA are increasingly writing clinical measures in CQL.

Transition to standards based Clinical Quality Measures diagram

Examples like the WHO Antenatal Care Guidelines project exemplify the shift to openly distributed and executable measures. We believe that computable and shareable measures like these WHO SMART Guidelines are the future for expressing and sharing medical knowledge.


Our CQL Toolkit

We would love others excited about this work to check out our experimental CQL tools at https://github.com/google/cql. We continue to be very interested in welcoming external contributors, so we strongly encourage you to check out the repository, give it a try, and consider helping with any open issues. If you're not sure where to start, reach out to us! We’d also like to hear from others about what they’re working on and how the Google CQL engine may fit into their toolchain; feel free to reach out at [email protected] or open an issue on the repository.

If you want to learn more about CQL see https://github.com/cqframework/clinical_quality_language and https://cql.hl7.org/index.html.


Appendix A: Simplified Diabetes CQL Example

library ExampleCQLLibrary version '1.2.3'
using FHIR version '4.0.1'

valueset Diabetes: 'diabetes-valueset-url' version '1.0'
valueset GlucoseLevels: 'glucose-levels-valueset-url' version '1.0'

context Patient

define PatientMeetsAgeRequirement: AgeInYearsAt(Now()) < 20

define HasDiabetes:
       exists ([Condition: Diabetes] chd where chd.onset before Now())

define LatestGlucoseReading:
       Last([Observation: GlucoseLevels] bp sort by effective)

define LatestGlucoseAbove200: LatestGlucoseReading.value > 200

define Denominator: PatientMeetsAgeRequirement and HasDiabetes

define Numerator: Denominator and LatestGlucoseAbove200

In this example, for a given patient record, the code selects individuals under 20 whose most recent glucose reading was above 200. Although this is a simple example, it is made simple because CQL provides a solid foundation on which to define and act on medical information and concepts.

By Evan Gordon and Suyash Kumar – Software Engineers 
Health AI Team: Ryan Brush, Kai Bailey, Ed Nanale, Chris Grenz

DAGify: Accelerate Your Journey from Control-M to Apache Airflow

Friday, July 26, 2024


In the dynamic world of data engineering and workflow orchestration, organizations are increasingly migrating from legacy enterprise schedulers like Control-M to the open-source powerhouse, Apache Airflow. However, this transition often involves a complex and time-consuming process of converting existing job definitions. DAGify emerges as a beacon of efficiency in this scenario, offering an open-source solution to automate the conversion of Control-M XML files into Airflow's native DAG format.

DAGify isn't just a simple conversion tool; it's a migration accelerator, designed to significantly reduce the manual effort and potential errors associated with transitioning to Airflow. While it might not provide a perfect 1:1 migration in every case, its primary goal is to expedite the process, allowing developers to focus on optimizing their workflows in the new environment.


Introduction

Control-M has served as a reliable workhorse for many organizations, but its proprietary nature and limitations can become roadblocks in today's cloud-centric and agile data landscape. Apache Airflow, with its flexibility, scalability, and thriving community, presents a compelling alternative. However, the migration journey can be daunting, especially when dealing with intricate Control-M job definitions.

DAGify steps in to bridge this gap, offering an intuitive and extensible solution. By automating the conversion process, it empowers organizations to embrace Airflow's capabilities without the burden of manual translation. This translates to faster migrations, reduced errors, and a smoother transition overall.


Technical Details

Under the hood, DAGify employs a template-driven approach, making it adaptable to various Control-M configurations and Airflow requirements. It parses Control-M XML files, extracting crucial information about jobs, dependencies, and schedules. This data is then intelligently mapped to Airflow's operators, tasks, and dependencies, preserving the essence of the original workflow. While still under active development, DAGify already supports key Control-M features like job and dependency mapping. The project roadmap includes further enhancements, such as handling custom calendars and expanding support for other enterprise schedulers.


Template-driven conversion

DAGify employs a flexible template system that empowers you to define the mapping between Control-M jobs and Airflow operators. These user-defined YAML templates specify how Control-M attributes translate into Airflow operator parameters. For instance, the control-m-command-to-airflow-ssh template maps Control-M's "Command" task type to Airflow's SSHOperator, outlining how attributes like JOBNAME and CMDLINE are incorporated into the generated DAG.

The template's structure field utilizes Jinja2 templating to dynamically construct the Airflow operator code, seamlessly integrating Control-M job attributes.

Example:

A Control-M task like:

<JOB 
  APPLICATION="my_application" 
  SUB_APPLICATION="my_sub_application" 
  JOBNAME="job_1" 
  DESCRIPTION="job_1_reports"  
  TASKTYPE="Command" 
  CMDLINE="./hello_world.sh" 
  PARENT_FOLDER="my_folder">
  <OUTCOND NAME="job_1_completed" ODATE="ODAT" SIGN="+" />
</JOB>

is converted to an Airflow operator using the control-m-command-to-airflow-ssh-gce template:

job_1 = SSHOperator(
    task_id="x_job_1",
    command="./hello_world.sh",
    dag=dag,
)

The repository includes several pre-defined templates for common Control-M task types. The config.yaml file at the project's root allows you to customize which templates are applied during the conversion process.
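
To illustrate the idea, the minimal Jinja2 sketch below renders an operator definition from job attributes; the template text and placeholder names (job_name, cmdline) are made up for illustration and are not DAGify's actual template variables:

from jinja2 import Template

# A toy "structure" template in the spirit of the YAML templates described
# above; DAGify's real templates ship with the repository.
structure = Template(
    '{{ job_name }} = SSHOperator(\n'
    '    task_id="x_{{ job_name }}",\n'
    '    command="{{ cmdline }}",\n'
    '    dag=dag,\n'
    ')\n'
)
print(structure.render(job_name="job_1", cmdline="./hello_world.sh"))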


Leveraging Google Cloud Composer

For organizations seeking a fully managed Airflow experience, Google Cloud Composer provides a compelling solution. It eliminates the complexities of managing Airflow infrastructure, allowing you to focus on building and orchestrating your data pipelines. DAGify seamlessly integrates with Google Cloud Composer, making it even easier to migrate your Control-M workflows to a cloud-native environment.


Try it yourself

Eager to experience the power of DAGify? It's readily available as an open-source project on GitHub: https://github.com/GoogleCloudPlatform/dagify. The repository provides detailed instructions on setting up and running DAGify locally or within a Docker container.

Key steps to get started:
  1. Clone the repository: git clone https://github.com/GoogleCloudPlatform/dagify.git
  2. Install dependencies: make clean (This sets up a virtual environment and installs required packages)
  3. Run DAGify: python3 DAGify.py --source-path=[YOUR-SOURCE-XML-FILE]

Remember, DAGify is an ongoing project, and community contributions are welcome! If you encounter any issues or have feature requests, feel free to open an issue on GitHub.


Conclusion

DAGify represents a significant leap forward in simplifying enterprise scheduler migrations to Apache Airflow. By automating the conversion process and seamlessly integrating with Google Cloud Composer, it empowers organizations to embrace the benefits of Airflow more rapidly and efficiently. Whether you're a seasoned Airflow developer or just starting your migration journey, DAGify is a valuable tool to explore.

Remember:

  • Thorough testing is crucial: Always test your converted DAGs in a staging environment before deploying them to production.
  • Leverage Airflow's ecosystem: Explore the vast array of Airflow plugins and integrations to further enhance your workflows.
  • Stay engaged with the community: Keep an eye on DAGify's development and contribute to its growth if you can!

Happy migrating!

By Konrad Schieban and Tim Hiatt – Google Cloud


Acknowledgments

Thank you to the following team members who made this solution possible: Shreya Prabhu, Harish S, Slava Guzanov and Joanna Rajaseharan from Google Cloud.

Google Blocks is now Open Source

Tuesday, July 16, 2024

In 2017, we shared Google Blocks with the world as a simple, easy and fun way to create 3D objects and scenes, using the new wave of VR headsets of the day.

We were thrilled to see the surprising, inventive and beautiful assets you all put together with Google Blocks, and continue to be impressed by the enthusiasm of the community.



We now wish to share the code behind Google Blocks, allowing for novel and rich experiences to emerge from the creativity and passion of open source contributors such as the Icosa Foundation, who have already been doing wonderful work with Tilt Brush, which we open-sourced in 2021.


"We're thrilled to see Blocks join Tilt Brush in being released to the community, allowing another fantastic tool to grow and evolve. We can't wait to take the app to the next level as we have done with Open Brush." 
– Mike Nisbet, Icosa Foundation

What’s Included

The open source archive of the Blocks code can be found at: https://github.com/googlevr/blocks

Please note that Google Blocks is not an actively developed product, and no pull requests will be accepted. You can use, distribute, and modify the Blocks code in accordance with the Apache 2.0 License under which it is released.

The currently published version of Google Blocks will remain available in digital stores for users with supported VR headsets. If you're interested in creating your own Blocks experience, please review the build guide and visit our github repo to access the source code.

Thank you all for coming on this journey with us so far, we can’t wait to see where you take Blocks from here.

By Ian MacGillivray – Software Engineer, on behalf of the Google Blocks team.

Bounds Checking Flexible Array Members

Tuesday, July 9, 2024

Buffer overflows are the cause of many security issues, and are a persistent thorn in programmers' sides. C is particularly susceptible to them. The advent of sanitizers mitigates some security issues by automatically inserting bounds checking, but they're not able to do so in all situations—in particular for flexible array members, because their size is known only at runtime.

The size of a flexible array member is typically opaque to the compiler. The alloc_size attribute on malloc() may be used for bounds checking flexible array members within the same function as the allocation. But the attribute's information isn't carried with the allocated object, making it impossible to perform bounds checking elsewhere.

To mitigate this drawback, Clang and GCC are introducing[1] the counted_by attribute for flexible array members.


Specifying a flexible array member's element count

The number of elements allocated for a flexible array member is frequently stored in another field within the same structure. The counted_by attribute, applied to the flexible array member, explicitly references the field that stores the number of elements; the sanitizer enabled by -fsanitize=array-bounds uses that information. The attribute creates an implicit relationship between the flexible array member and the count field, enabling the array bounds sanitizer to verify flexible array operations.

There are some rules to follow when using this feature. For this structure:

struct foo {
	/* ... */
	size_t count; /* Number of elements in array */
	int array[] __attribute__((counted_by(count)));
};
  • The count field must be within the same non-anonymous, enclosing struct as the flexible array member.
  • The count field must be set before any array access.
  • The array field must have at least count number of elements available at all times.
  • The count field may change, but must never be larger than the number of elements originally allocated.

An example allocation of the above structure:

struct foo *foo_alloc(size_t count) {
  struct foo *ptr = NULL;
  /* Flexible array members are not counted by sizeof(struct foo), so the
     allocation must explicitly make room for count trailing elements. */
  size_t size = MAX(sizeof(struct foo),
                    offsetof(struct foo, array[0]) +
                        count * sizeof(ptr->array[0]));

  ptr = calloc(1, size);
  ptr->count = count; /* Set the count before any array access. */
  return ptr;
}

Uses for fortification

Fortification (enabled by the _FORTIFY_SOURCE macro) is an ongoing project to make the Linux kernel more secure. Its main focus is preventing buffer overflows on memory and string operations.

Fortification uses the __builtin_object_size() and __builtin_dynamic_object_size() builtins to try to determine if input passed into a function is valid (i.e. "safe"). A call to __builtin_dynamic_object_size() generally isn't able to take the size of a flexible array member into account. But with the counted_by attribute, we're able to calculate the size and improve safety.


Uses in the Linux kernel

The counted_by attribute is already in use in the Linux kernel, and will be instrumental in catching issues like integer overflows, which led to a heap buffer overflow. We want to expand its use to more flexible array members, and enforce its use in the future.


Conclusion

The counted_by attribute helps address a long-standing fortification road block where the memory bounds of a flexible array member couldn't be determined by the compiler, thus making Linux, and other hardened applications, less exploitable.

[1] In Clang v18.0.0 and GCC v15.0.0.

By Bill Wendling, Staff Software Engineer
