Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote/correctness: Bazel doesn't track system libraries. #4558

Open
buchgr opened this issue Feb 1, 2018 · 23 comments
Open

remote/correctness: Bazel doesn't track system libraries. #4558

buchgr opened this issue Feb 1, 2018 · 23 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: process

Comments

@buchgr
Copy link
Contributor

buchgr commented Feb 1, 2018

Bazel currently does not track tools outside a workspace. This can be a problem if, for example, an action uses a compiler from /usr/bin/. Then, two users with different
compilers installed will wrongly share cache hits because the outputs are different but they have the same action hash.

@buchgr buchgr added the P2 We'll consider working on this in future. (Assignee optional) label Feb 1, 2018
@BenTheElder
Copy link

Is there a workaround for this currently?
Can we just do something like --action_env=COMPILER=<compiler version>?

@buchgr
Copy link
Contributor Author

buchgr commented Feb 3, 2018

@BenTheElder Yep exactly. It's not just compilers though it's any tools that your build uses from your system that you did not explicitly specify really. So you might want to include some more info than just your compiler.

@edbaunton
Copy link
Contributor

Would the use of toolchains help with this?

@ulfjack
Copy link
Contributor

ulfjack commented Feb 5, 2018

Our naive proposal is to allow each toolchain to add a toolchain identifier into the cache key(s). The risk is that a bug (or a design flaw) in the toolchain might cause incorrect caching, but, on the other hand, it leaves the door open to toolchains that want to cache across platforms. E.g., a Java 7 compiler on Mac and Linux outputs essentially the same files, so caching those might be safe even if the binaries are not bit identical.

@BenTheElder
Copy link

I forgot to follow up here: setting --action_env doesn't seem to be sufficient since not all actions consume this (?) I'm currently working around this by just making sure different caches are used for different toolchain versions.

@buchgr
Copy link
Contributor Author

buchgr commented Feb 12, 2018

@BenTheElder It’s my understanding that this would be a bug in Bazel, correct
@ulfjack? Do you have a reproducer?

Something that works for sure with every action is —experimental_remote_platform_override. It takes as a value the text representation of the Platform message defined in the remote_execution.proto. It’s effectively just a key value pair.

@BenTheElder
Copy link

I don't have a great reproducer currently, in https://github.com/kubernetes/test-infra I modify the bazelrc to point at a local copy of bazel-remote-cache from within a debian+bazel docker container and then I build again from another container after upgrading the GCC version to gcc-7.

When doing this I noticed stale headers from gcc-4.9 being included as part of compiling https://github.com/google/protobuf as a dependency despite setting an --action_env=CC_VERSION=<exact version of the gcc package> each time so I ran the bulid with -s and didn't see the env being passed to the failed compilation step.

@buchgr where is --experimental_remote_platform_override documented?

@ulfjack
Copy link
Contributor

ulfjack commented Feb 13, 2018

See #3320. :-/

@buchgr
Copy link
Contributor Author

buchgr commented Feb 13, 2018

@BenTheElder ohh I apologize for the misinformation - it was my understanding that --action_env is supported by all actions. Please use the platform override then - I shall document it on our website.

Essentially, you can pass it the following protobuf:

--experimental_remote_platform_override='properties:{name:"key1" value:"value1" name:"key2" value:"value2"}'

This is passed through to the computation of the action key for remote caching.

@lisendong
Copy link

when I build bazel from source ( bazel-0.10.0), I met the system env var problem:

ERROR: /home/mpi/tensorflow/downloads/bazel-0.10.0/src/main/protobuf/BUILD:101:1: error executing shell command: 'cp 'bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.jar' 'bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.srcjar'' failed (Exit 127): bash failed: error executing command
  (cd /tmp/bazel_vAIW1oPT/out/execroot/io_bazel && \
  exec env - \
  /bin/bash -c 'cp '\''bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.jar'\'' '\''bazel-out/k8-opt/bin/src/main/protobuf/command_server_java_grpc_srcs.srcjar'\''')
/bin/bash: cp: command not found
Target //src:bazel failed to build
INFO: Elapsed time: 55.301s, Critical Path: 39.40s
FAILED: Build did NOT complete successfully

because in my system, "cp" command is under ~/bin/ but not /bin (and I do not have root privileges)
how could I set the $PATH enviromnt?

I have tried "--action_env=PATH" but it does not work

@gkossakowski
Copy link

E.g., a Java 7 compiler on Mac and Linux outputs essentially the same files, so caching those might be safe even if the binaries are not bit identical.
@ulfjack do you have plans for supporting caching of jdk-derived (e.g. javac compiled) outputs across platforms? is it possible to override the toolchain even in a hacky way today?

@ulfjack
Copy link
Contributor

ulfjack commented Feb 22, 2018

The current design does not make it easy to do cross-platform caching of platform-independent outputs (e.g., Java bytecode); we're currently adding in the full command-line and env variables, which will often differ (e.g., Windows command line contains backward slashes vs. Linux uses forward slashes).

Right now, if all the input files are platform-independent, and the command-lines / env variables happen to be identical (e.g., this could happen for Linux / MacOS with the right flags), then the current cache will return a cache hit - that's actually a correctness issue right now, since we can't guarantee that the outputs are actually platform independent, i.e., this would even happen for an action involving gcc if the command line, env variables, and inputs happen to be identical. I believe that we actually have a bug report for that somewhere.

Our rough plan for solving this is to add a toolchain identifier into the cache key to prevent such cache hits (this isn't a fully baked proposal yet), and I expect that we'll allow users to override those / provide identifiers that happen to be identical across platforms (handwave).

Getting cache hits across Windows and Linux / MacOS platforms is a more complex topic that we haven't really started to look into yet. We'll need to figure something out about paths for sure, but there are more problems. In particular, what immediately comes to mind is that line endings make source files not bit-for-bit identical so we can't use a straightforward hash, or we have to require users to use a consistent line ending convention across multiple platforms. Another problem is case sensitivity of file systems (or lack thereof), and the differences in command-line flags (e.g., Windows usually uses /flag, whereas Linux usually uses -flag or --flag).

@gkossakowski
Copy link

@ulfjack thanks for the detailed response! This ticket gave me an idea how we could actually share outputs across platforms. I opened a new issue #4714 to discuss it. I'd appreciate the input.

@buchgr buchgr changed the title Bazel doesn't track system libraries. remote/correctness: Bazel doesn't track system libraries. Mar 14, 2018
@BenTheElder
Copy link

BenTheElder commented May 31, 2018

Following back up on this after a while, FWIW we've been working around this for kubernetes for a while now by hashing the toolchains ourselves (since we run in debian container(s) we can do this pretty easily) and using this to key our cache.

So far this has worked well enough as a stopgap.

@buchgr buchgr added type: process team-Remote-Exec Issues and PRs for the Execution (Remote) team and removed category: http caching labels Jan 16, 2019
@drigz
Copy link
Contributor

drigz commented Feb 28, 2019

@buchgr
--experimental_remote_platform_override is gone in 0.23.0. --remote_default_platform_properties seems to be the successor, but It's not clear if it will work in the presence of platform autodetection. Can you recommend a new workaround?

For now I will try `--remote_http_cache=https://${cache_host}/${extra_cache_key}", inspired by @BenTheElder's link.

@HackAttack
Copy link
Contributor

It’s not obvious to me that this issue is specific to the remote cache; wouldn’t it affect the local cache as well? If not, why not?

@buchgr
Copy link
Contributor Author

buchgr commented May 8, 2019

@HackAttack correct. however, the local cache affects only a single user while the remote cache can do quite a bit of harm.

@GabrielGhe
Copy link

I forgot to follow up here: setting --action_env doesn't seem to be sufficient since not all actions consume this (?) I'm currently working around this by just making sure different caches are used for different toolchain versions.

Could you elaborate on this? Which actions do not consume this and which do?

@BenTheElder
Copy link

Could you elaborate on this? Which actions do not consume this and which do?

To be honest after more than a year I don't really remember and other details have changed pretty significantly since then. I no longer have much activity related to this.

We've been reasonably happy with the distinct cache locations, FWIW. I think recent efforts are focused on using GCP RBE which uses the build container image as a key IIRC.

@buchgr buchgr removed their assignment Jan 9, 2020
mostynb added a commit to mostynb/bazel-remote that referenced this issue Aug 25, 2020
In some client setups, untracked local files can be used by an action
without being included in the Action message, which causes action cache
collisions: bazelbuild/bazel#4558

Ideally this should be fixed on the client side (either in the client,
or in the build configuration), but it is not always easy to do in
practice.

As a workaround, this patch adds a setting to mangle ActionCache keys
with the instance name provided by the client (if it is not empty), to
produce a new ActionCache key. Clients are then able to specify a
different instance name whenever a change is made that could affect
these untracked inputs. The instance name value could be something like
the hash of the compiler version. This allows multiple ActionCache items
to exist in the cache, without requiring a change to the on-disk storage
format.

This feature is disabled by default, since it would cause cache
invalidations for existing users.

Fixes buchgr#15.
mostynb added a commit to buchgr/bazel-remote that referenced this issue Aug 28, 2020
In some client setups, untracked local files can be used by an action
without being included in the Action message, which causes action cache
collisions: bazelbuild/bazel#4558

Ideally this should be fixed on the client side (either in the client,
or in the build configuration), but it is not always easy to do in
practice.

As a workaround, this patch adds a setting to mangle ActionCache keys
with the instance name provided by the client (if it is not empty), to
produce a new ActionCache key. Clients are then able to specify a
different instance name whenever a change is made that could affect
these untracked inputs. The instance name value could be something like
the hash of the compiler version. This allows multiple ActionCache items
to exist in the cache, without requiring a change to the on-disk storage
format.

This feature is disabled by default, since it would cause cache
invalidations for existing users.

Fixes #15.
@Flamefire
Copy link
Contributor

Could you elaborate on this? Which actions do not consume this and which do?

I'm recently seeing action_env being consumed by some cc_library actions and not by others. See #12059

SanjayVas added a commit to world-federation-of-advertisers/actions that referenced this issue Mar 11, 2021
This allows for manually invalidating prior cache results when there are incompatible changes that Bazel doesn't handle. For example, changing the C++ compiler version.

See bazelbuild/bazel#4558
SanjayVas added a commit to world-federation-of-advertisers/actions that referenced this issue Mar 11, 2021
This allows for manually invalidating prior cache results when there are incompatible changes that Bazel doesn't handle. For example, changing the C++ compiler version.

See bazelbuild/bazel#4558
@cameron-martin
Copy link
Contributor

Since bazel can consume anything from the system by default, it would seem natural to add some kind of extra input to all build actions that represents the complete state of the system, for example the hash of a docker container or the commit sha of the scripts that provision the system. This is similar to what was suggested above, with using --action_env, but instead of choosing specific properties such as compiler version you pass a hash of the whole system.

However, this falls down when using remote execution because this is specified by the machine invoking bazel rather than the remote execution service. The machine invoking bazel has no way of knowing which environment the remote execution service will use to execute the action.

@aaronmondal
Copy link

The solution we're using in rules_ll is to wrap the entire build environment in nix and generate remote execution toolchains from that. We then replicate the RBE environment locally and can seamlessly switch between RBE and local builds and reuse the same cache.

Upsides:

  • Lets us share caches between remote executors and local dev environments, since the remote execution container is built using the exact same tools as in the dev environment.
  • It's really easy to update the toolchain containers - it's basically just a nix flake update and the latest toolchains are used (currently clang 16).
  • Fully reproducible. Even across different systems as long as the CPU architecture is x86_64 and we're running on linux. Both the RBE container and the local devenv are binary-identical when built on e.g. a custom Gentoo and Ubuntu running in WSL2.
  • Local builds don't require containerization, i.e. they build at native speed without overhead.

Downsides:

  • Easy to use, but hard to modify. One needs to know their way around not only Bazel and remote execution, but also nix and the nix cc toolchains to customize behavior.
  • Since the RBE images are completely custom, a container registry becomes a hard requirement. In our case we're using a local k8s cluster, but running a regular docker registry should also work.
  • The RBE images are a bit larger than the "standard" bazel RBE images (~2.8 GB uncompressed, ~1.3 compressed in the registry).
  • We don't have a good way to version the RBE images yet. Nix flakes don't let you customize "input parameters" for purity reasons. This makes it a bit tricky to tag a toolchain image as a non-latest version.
  • If you need some non-bazel dependency from nixpkgs you'll need to modify the toolchain container and rebuild all toolchains. This is technically WAI, but there is probably a more elegant solution to this.

Implementation:

@xkszltl
Copy link

xkszltl commented Jan 26, 2024

Does this only apply to the toolchain, or including external libs as well?
By external libs I mean:

  • C++ headers
  • -lxxx in linkopts
  • Lib file imported in whatever trackable way to bazel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: process
Projects
None yet
Development

No branches or pull requests