Optimize the size of a statically linked binary and library #10740

Closed
2 tasks
alexcrichton opened this issue Nov 30, 2013 · 7 comments
Labels
A-codegen Area: Code generation A-linkage Area: linking into static, shared libraries and binaries E-hard Call for participation: Hard difficulty. Experience needed to fix: A lot.

Comments

@alexcrichton
Member

Once #10528 lands, we'll be able to create static libraries and static binaries. While this is very useful, the resulting binaries are massive. There are a few opportunities for improvement that I can see here:

  • Static executables and static libraries contain the metadata sections of their dependent libraries. These are certainly not needed, and these sections should be removed (or possibly this is a good argument for putting the metadata in a separate file?). This would in theory be solved with objcopy -R, but objcopy doesn't exist by default on OSX, and the objcopy I found ended up producing a corrupted executable that didn't run.
  • We don't necessarily want to pull in all of libstd. There are likely vast portions of libstd which are not used by a crate and which could all be removed. This involves eliminating unused functions and data. C/C++ toolchains solve this with -ffunction-sections and -fdata-sections, which place each function and static global in its own section. The linker is then passed --gc-sections and removes every section that is unreferenced.

Both of these optimizations are a little dubious, and this is why I chose the default output of libraries to be dynamic libraries for the compiler. These optimizations can benefit the size of an executable, but I've seen the compile time of fn main() {} increase by 5-10x when implementing these optimizations (even in the common no-opt compile case).

Additionally, these optimizations are going to be difficult to implement across platforms. Most of what I've described is linux-specific. There is a -dead_strip option on the OSX linker, but that's the only relevant size optimization flag I can find. I have not checked to see what the mingw linker provides in terms of size optimizations.
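As a rough illustration of the section-based pruning described above, here is a minimal sketch in present-day Rust syntax (not the 2013 dialect). The function names are hypothetical. Treat the two functions as standing in for code that lives in a statically linked library such as libstd; in a single-crate binary like this one, the compiler itself may already drop the unused function before the linker ever sees it.

```rust
// Hypothetical stand-ins for library code; neither name comes from libstd.

// Reachable from `main`, so whatever section holds this function must be kept.
#[inline(never)]
fn used_entry(x: u32) -> u32 {
    x.wrapping_mul(2)
}

// Never referenced. If this lived in a static library and had its own
// section, a section-collecting linker would be free to discard it.
#[allow(dead_code)]
#[inline(never)]
fn unused_helper(x: u32) -> u32 {
    x.wrapping_add(40)
}

fn main() {
    println!("{}", used_entry(21));
}
```

The point of per-function sections is granularity: without them, the linker can only keep or drop a whole object file's text at once, so one referenced function keeps all of its neighbours alive.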

Empirical data

All of the data here is collected from a 32-bit ubuntu VM, but I imagine the numbers are very similar on other platforms. The program in question is simply fn main() {}.

  • Statically linked executable - 6.9MB
  • Removing metadata - 2.7MB
  • -ffunction-sections + --gc-sections - 1.6MB
  • -ffunction-sections + --gc-sections + #[no_uv] - 730K

Note that --gc-sections always removes the metadata. I'm unsure of whether --gc-sections corrupts our exception-handling sections.

From this, the "most optimized normal case" that I can get to is 1.6MB, which is still very large. As a comparison, the "hello world" go executable is 400K. A no_uv 730K executable is pretty reasonable, so it could just be that having M:N/uv means that you're pulling in larger portions of libstd. I believe that this size of 1.6MB means that further investigation is warranted to figure out where all this size is coming from.
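To put the measurements above side by side, the following sketch computes how many times smaller each configuration is than the 6.9MB baseline. The sizes are copied from the list above; treating 730K as 0.73 MB (decimal units) is an assumption, and the helper name `shrink_factor` is made up for illustration.

```rust
// Sizes reported in the issue, in megabytes.
// Assumption: 730K is treated as 0.73 MB (decimal units).
const MEASUREMENTS: [(&str, f64); 4] = [
    ("statically linked executable", 6.9),
    ("metadata removed", 2.7),
    ("-ffunction-sections + --gc-sections", 1.6),
    ("-ffunction-sections + --gc-sections + #[no_uv]", 0.73),
];

/// How many times smaller `size` is than `baseline`.
fn shrink_factor(baseline: f64, size: f64) -> f64 {
    baseline / size
}

fn main() {
    let baseline = MEASUREMENTS[0].1;
    for (label, size) in MEASUREMENTS.iter() {
        println!(
            "{}: {} MB, {:.1}x smaller than baseline",
            label,
            size,
            shrink_factor(baseline, *size)
        );
    }
}
```

Under these numbers, section garbage collection shrinks the binary a little over fourfold, and dropping uv on top of that takes the total to more than ninefold.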

Nominating for discussion. I don't think that this should block 1.0, but this is certainly a concern that we should prioritize.

@alexcrichton
Member Author

I have investigated how to implement -ffunction-sections and -fdata-sections, and to do this we need to call llvm's TargetMachine::setFunctionSections(true) and TargetMachine::setDataSections(true), so from an implementation perspective these two optimizations are very easy to implement. I'm much more worried about the impact they have on compile times.

bors added a commit that referenced this issue Dec 3, 2013
This registers new snapshots after the landing of #10528, and then goes on to tweak the build process to build a monolithic `rustc` binary for use in future snapshots. This mainly involved dropping the dynamic dependency on `librustllvm`, so that's now built as a static library (with a dynamically generated rust file listing LLVM dependencies).

This currently doesn't actually make the snapshot any smaller (24MB => 23MB), but I noticed that the executable has 11MB of metadata so once progress is made on #10740 we should have a much smaller snapshot.

There's not really a super-compelling reason to distribute just a binary because we have all the infrastructure for dealing with a directory structure, but to me it seems "more correct" that a snapshot compiler is just a `rustc` binary.
@comex
Contributor

comex commented Jan 2, 2014

This affects me; I'm not sure how much 500K (see below) matters in the long run for my use cases (including a kernel module), but it's very far from paying only for what I use. On OS X, the base size of a statically linked binary is:

% rustc -O a.rs; du -sh a
2.6M    a

--link-args -dead_strip brings it down:

% rustc -O a.rs --link-args -dead_strip; du -sh a 
1.2M    a

Using the libnative example at https://gist.github.com/anonymous/8162357 helps:

% rustc -O a.rs --link-args -dead_strip; du -sh a    
568K    a

and -Z lto brings it down a bit more:

% rustc -O a.rs --link-args -dead_strip -Z lto; du -sh a
472K    a

But this is still way too high for my liking. Most of the file is text:

% size /tmp/a
__TEXT  __DATA  __OBJC  others      dec         hex
323584  28672   0       4295098368  4295450624  100076000

A lot of the functions seem to come from unused methods in traits like IoFactory, so it would be very nice (perhaps difficult?) to have some way to prune them from vtables.
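A minimal sketch of why vtables keep unused methods alive, written in present-day Rust syntax. The trait `Factory` and type `Native` are hypothetical stand-ins for something like IoFactory and its implementation; they are not the actual libstd definitions.

```rust
// Hypothetical trait standing in for something like `IoFactory`.
trait Factory {
    fn open_file(&self) -> &'static str;
    fn open_socket(&self) -> &'static str;
}

struct Native;

impl Factory for Native {
    fn open_file(&self) -> &'static str {
        "file"
    }

    // Never called anywhere below, but the vtable for `Native as Factory`
    // holds a pointer to it, so it still counts as referenced code.
    fn open_socket(&self) -> &'static str {
        "socket"
    }
}

// Dynamic dispatch: the callee is looked up through the vtable at run time.
fn use_factory(f: &dyn Factory) -> &'static str {
    f.open_file()
}

fn main() {
    println!("{}", use_factory(&Native));
}
```

Unless the optimizer manages to devirtualize the call and drop the vtable entirely, anything `open_socket` calls is pulled into the binary as well, which is how one trait object can drag in a large slice of I/O code.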

With rust-core I get a 12K binary. That's a fairly trivial result in some sense, since it just means that without any pretty printing for failure, everything has been optimized away, but the bare minimum for that pretty printing is much closer to 12K than 472K. I could just use rust-core for vaguely embedded stuff, but I don't think this is a good solution in the long run.

@alexcrichton
Member Author

Your best prospect for a small statically linked binary is to use LTO (as you found out), but you should not be running with -dead_strip, because it can corrupt libraries and object files (for example by discarding metadata) and is not guaranteed to always work with the objects that rust generates.

You are correct that the vtables are the major cause of bloat right now. As a result, all I/O code is pulled into all binaries even if they don't use it (if they're statically linked). This is a consequence of our architectural decisions for I/O, and I don't foresee it changing soon.

If you care about using the standard library and having small binaries (not kernel modules), then I would highly recommend dynamic linking as an option. Dynamic linking is optimized for exactly this use case (one library implementation shared among many binaries).

If you care about size because you're making a kernel module, then these numbers are all irrelevant. You cannot use libnative or libgreen with a kernel module because they're both implemented on top of libc, which is not available in the kernel. There would be a separate library (libkernel if you write it) which would implement the relevant functionality that libstd expects to have (tasks, stdio, etc). Using LTO works exactly as well as you would expect it to, and a libkernel that didn't have large vtables would optimize to a very small module.

And finally, the embedded context is the same as the kernel context. If you're writing an embedded kernel, you cannot use libnative or even rust-core. You are forced to write your own implementation of various components. If you're worried about generating embedded binaries that are large, then I recommend that you use dynamic linking instead.

@thestinger
Contributor

You can use rust-core for an embedded kernel, it doesn't have any required dependencies. There's just not much available yet with allocators blocked on fixing destructors.

@comex
Contributor

comex commented Jan 3, 2014

@alexcrichton Good point regarding libkernel.

It seems like -dead_strip not working properly should be considered a bug, to be fixed by defining whatever metadata needs to be kept in libraries as an exported symbol. Thus unnecessary data could still be removed in statically linked executables, without having to deal with the overhead of LTO.

I think I am going to try to implement stripping dead virtual methods in a somewhat hacky way in LLVM to see how much it helps in practice.

On a somewhat related note, based on quickly skimming the resulting binary in IDA, I'm somewhat suspicious that part of the problem may be that rustc just generates more verbose code for idiomatic Rust than you see in idiomatic C, e.g. doing a lot of copying things around between the stack and registers, though I could be completely wrong. Is there any easy way to get the equivalent of -Oz/-Os in rustc?

@alexcrichton
Member Author

I do not foresee officially supporting flags like -dead_strip or --gc-sections, because this implies a close reliance on the system linker, which we wish to avoid. We would have to instruct the linker to retain certain symbols, which is more of a pain than a gain on most platforms. I do not consider this a bug. Our LTO infrastructure will always preserve symbols correctly for a rust program, and this is the recommended method of cutting down program sizes.

Rust generates a very large amount of IR, but that does not mean that an optimized rust binary is slower or larger than C. Rust has always been on-par with C/C++ code (that I have examined). The only downside is that O0 is like 30x slower than C++ O0 (due to our large amount of codegen). Thankfully LLVM is a pretty good optimizer.

@comex
Contributor

comex commented Jan 3, 2014

I do not understand what you mean about dead_strip. Supporting dead_strip properly on OS X requires two things:

  • Ensure that anything that is in a magic section is also referenced from an exported symbol. I think this is a matter of just making rust_metadata exported.
  • Ensure that MH_SUBSECTIONS_VIA_SYMBOLS is set for all objects that do not have multiple symbols which must be kept together. This is already true because LLVM always sets that flag.

I don't think that making one symbol public is more of a pain than a gain; LTO is nice but it is rather slow. I guess this could be a different issue report, though.
