# How To Split A Fuzzer-Generated Input Into Several Parts This document describes several recommended strategies for splitting a single fuzzer-generated input into several parts (sub-inputs). # Examples Splitting a fuzzer-generated input into several independent parts (sub-inputs) is required very often. Some examples: * Fuzzing a regular expression library requires * The regular expression (RE) * Flags for RE compilation and matching * A string to search the RE in * Fuzzing an audio/video format decoder often requires * Decoding flags * Several frames * Fuzzing a XSLT or CSS library requires * The stylesheet input * The XML/HTML input * Fuzzing a font-rendering library requires * The font file * The text to render * The rendering flags * Fuzzing a database library may require * The query text * The database state TODO: more examples? # Common Data Format When trying to split the fuzzer-generated input into several parts, the first question one needs to ask is whether the input format is common, i.e. is it used or processed by other libraries, APIs, of fuzz targets. If the data format is common (e.g. a widely used media format or network packet format) then it is highly desirable for a fuzz target to consume exactly this data format, and not some custom modification. This way it will be easier to procure a seed corpus for this fuzz target and to use the generated corpus to test/fuzz other targets. ## Multiple Options If the data format may be processed by a fuzz target in a small number of different ways, it is often the best approach to split the fuzz target into several ones, each processing the input in exactly one way. Make sure to [cross-pollinate](https://github.com/google/fuzzing/blob/master/docs/glossary.md#cross-pollination) the corpora between these targets. OSS-Fuzz does that automatically. ## Embedding / Comments When a fuzz target for a common data format requires some flags, options, or additional auxiliary sub-input(s), it is sometimes possible to embed the extra input inside a custom section or a comment of the main data format. Examples: * PNG allows custom "chunks", and so a fuzz target for a PNG decoder can hide the flags used during PNG processing in a separate PNG chunk, e.g. `fUZz` ([example](https://github.com/google/oss-fuzz/blob/master/projects/libpng-proto/libpng_transforms_fuzzer.cc)). * When fuzzing C/C++/Java/JavaScript inputs one may hide a sub-input in a single-line `//` comment. TODO: example? ## Hash When only one small fixed-size sub-input is required (such as flags / options), the fuzz target may compute a hash function on the full input and use it as the flag bits. This option is very easy to implement, but it's applicability is limited to relatively simple cases. The major problem is that a small local mutation of the input leads to a large change in the sub-input, which often makes fuzzing less efficient. Try this approach if the flags are individual bits and the input type allows some bit flips in the inputs (e.g. a plain text). TODO: example. # Custom Serialization Format If you **do not intend to share the corpus** with any other API or fuzz targets, then a custom serialization format might be a good option for a multi-input fuzz target. ## First / Last Bytes When only one fixed-size sub-input is required (such as flags / options), it is possible to treat the first (or last) `K` bytes of the input as sub-input, and the rest of the bytes as the main input. Just remember to copy the main input into a separate heap buffer of `Size - K` bytes, so that buffer under/overflows on the main input are detected. TODO: example. ## Magic separator Choose a 4-byte (or 8-byte) magic constant that will serve as a separator between the inputs. In the fuzz target, split the input using this separator. Use `memmem` to find the separator in the input -- `memmem` is known to be friendly to fuzzing engines, at least to libFuzzer. Example (see full code [here](https://github.com/llvm-mirror/compiler-rt/blob/6cd423889971c0d97801a9f3b9b5afb91ae9c137/test/fuzzer/MagicSeparatorTest.cpp)): ```cpp // Splits [data,data+size) into a vector of strings using a "magic" Separator. std::vector> SplitInput(const uint8_t *Data, size_t Size, const uint8_t *Separator, size_t SeparatorSize) { ... } extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) { const uint8_t Separator[] = {0xDE, 0xAD, 0xBE, 0xEF}; auto Inputs = SplitInput(Data, Size, Separator, sizeof(Separator)); // Use Inputs.size(), Inputs[0], Inputs[1], ... } ``` It is relatively easy for a modern fuzzing engine to discover the separator, but nevertheless we recommend to provide several seed inputs with the desired number of separators. ## Fuzzed Data Provider [FuzzedDataProvider] (*FDP*) is a single-header C++ library that is helpful for splitting a fuzz input into multiple parts of various types. It is a part of LLVM and can be included via `#include ` directive. If your compiler doesn't have this header (in case it's an older Clang version or some other compiler), you can copy the header from [here] and add it to your project manually. It should just work, as the header doesn't depend on LLVM. An advantage and disadvantage of using this library is that the input splitting happens dynamically, i.e. you don't need to define any structure of the input. This might be very helpful in certain cases, but would also make the corpus to be no longer in a particular format. For example, if you fuzz an image parser and split the fuzz input into several parts, the corpus elements will no longer be valid image files, and you won't be able to simply add image files to your corpus. ### Main concepts * [FuzzedDataProvider] is a class whose constructor accepts `const uint8_t*, size_t` arguments. Usually, you would call it in the beginning of your `LLVMFuzzerTestOneInput` and pass the `data, size` parameters provided by the fuzzing engine. * Once an FDP object is constructed using the fuzz input, you can consume the data from the input by calling the FDP methods listed below. * If there is not enough data left to consume, FDP will consume all the remaining bytes. For example, if you call `ConsumeBytes(10)` when there are only `4` bytes left in the fuzz input, FDP will return a vector of length `4`. * If there is no data left, FDP will return the default value for the requested type or an empty container (when consuming a sequence of bytes). * If you consume data from FDP in a loop, make sure to check the value returned by `remaining_bytes()` between loop iterations. * Do not use the methods that return `std::string` unless your API requires a string object or a C-style string with a trailing null byte. This is a common mistake that hides off-by-one buffer overflows from AddressSanitizer. ### Methods for extracting individual values * `ConsumeBool`, `ConsumeIntegral`, `ConsumeIntegralInRange` methods are helpful for extracting a single boolean or integer value (the exact type is defined by a template parameter), e.g. some flag for the target API, or a number of iterations for a loop, or length of a part of the fuzz input. * `ConsumeProbability`, `ConsumeFloatingPoint`, `ConsumeFloatingPointInRange` methods are very similar to the ones mentioned above. The difference is that these methods return a floating point value. * `ConsumeEnum` and `PickValueInArray` methods are handy when the fuzz input needs to be selected from a predefined set of values, such as an enum or an array. These methods are using the last bytes of the fuzz input for deriving the requested values. This allows to use valid / test files as a seed corpus in some cases. ### Methods for extracting sequences of bytes Many of these methods have a length argument. You can always know how many bytes are left inside the provider object by calling `remaining_bytes()` method on it. * `ConsumeBytes` and `ConsumeBytesWithTerminator` methods return a `std::vector` of the requested size. These methods are helpful when you know how long a certain part of the fuzz input should be. Use `.data()` and `.size()` methods of the resulting object if your API works with raw memory arguments. * `ConsumeBytesAsString` method returns a `std::string` of the requested length. This is useful when you need a null-terminated C-string. Calling `c_str()` on the resulting object is the best way to obtain it. * `ConsumeRandomLengthString` method returns a `std::string` as well, but its length is derived from the fuzz input and typically is hard to predict, though always deterministic. The caller can provide the max length argument. * `ConsumeRemainingBytes` and `ConsumeRemainingBytesAsString` methods return `std::vector` and `std::string` objects respectively, initialized with all the bytes from the fuzz input that left unused. * `ConsumeData` method copies the requested number of bytes from the fuzz input to the given pointer (`void *destination`). The method is useful when you need to fill an existing buffer or object (e.g. a struct) with fuzzing data. For more information about the methods, their arguments and implementation details, please refer to the [FuzzedDataProvider] source code. Every method has a detailed comment in that file, and the implementation is relatively small. ### Examples of fuzz targets using `FuzzedDataProvider` * [net_verify_name_match_fuzzer] splits the fuzz input into two parts. * [net_http2_frame_decoder_fuzzer] reads data in small chunks in a loop in order to emulate a sequence of frames coming from the network connection. * [net_crl_set_fuzzer] initialized multiple parameters and uses the rest of the fuzz input for the main argument (i.e. data to be parsed / processed). Note that using [Protobufs](#Protobufs) based fuzzing might be more efficient for such a target. * [net_parse_cookie_line_fuzzer] is a slightly more sophisticated fuzz target that emulates different actions with different parameters initialized with the fuzz input. [FuzzedDataProvider]: https://github.com/llvm/llvm-project/blob/main/compiler-rt/include/fuzzer/FuzzedDataProvider.h [here]: https://raw.githubusercontent.com/llvm/llvm-project/main/compiler-rt/include/fuzzer/FuzzedDataProvider.h [net_crl_set_fuzzer]: https://cs.chromium.org/chromium/src/net/cert/crl_set_fuzzer.cc?rcl=0be62a8d95f7fa1455fce1a76f0fa5b8484d0c8c [net_http2_frame_decoder_fuzzer]: https://cs.chromium.org/chromium/src/net/spdy/fuzzing/http2_frame_decoder_fuzzer.cc?rcl=0be62a8d95f7fa1455fce1a76f0fa5b8484d0c8c [net_parse_cookie_line_fuzzer]: https://cs.chromium.org/chromium/src/net/cookies/parse_cookie_line_fuzzer.cc?rcl=0be62a8d95f7fa1455fce1a76f0fa5b8484d0c8c [net_verify_name_match_fuzzer]: https://cs.chromium.org/chromium/src/net/cert/internal/verify_name_match_fuzzer.cc?rcl=0be62a8d95f7fa1455fce1a76f0fa5b8484d0c8c ## Type-length-value A custom [Type-length-value](https://en.wikipedia.org/wiki/Type-length-value), or TLV, may sound like a good solution. However, we typically **do not recommend using a custom TLV** to split your fuzzer-generated input for the following reasons: * This is more test-only code for you to maintain, and easy to get wrong * Typical mutations performed by fuzzing engines, such as inserting a byte, will break the TLV structure too often, making fuzzing less efficient However, a TLV input combined with a custom mutator might be a good option. See [Structure-Aware Fuzzing](structure-aware-fuzzing.md). ## Protobufs Yet another option is to use one of the general-purpose serialization formats, such as Protobufs, in combination with a custom mutator. See [Structure-Aware Fuzzing](structure-aware-fuzzing.md).