Quantitative Finance Examples in ISO C++ Parallelism

This project contains two sample applications relevant to quantitative finance. These examples are intended to demonstrate how ISO C++ can be used to write parallel code that is portable to CPUs and GPUs. This code is intended solely for illustration and should not be used in production environments.

Example Code

Black-Scholes Model

In the first example, we consider a large universe of options, generated by sampling option strikes, maturities, and volatilities on a 3D grid. The Black-Scholes valuation of calls and puts with these parameters is perfectly parallelizable, so it scales well to multicore CPUs or GPUs. A CPU baseline code is provided and parallelized with OpenMP. The Black-Scholes calculation is contained within the BlackScholesBody function so that it can be reused between the baseline code and the code using C++ Parallel Algorithms.
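
For reference, a minimal Black-Scholes valuation of a call and a put might look like the sketch below. The function name, parameter order, and the use of std::erfc for the normal CDF are illustrative assumptions and do not necessarily match the BlackScholesBody function in this repository.

#include <cmath>
#include <numbers>

// Illustrative single-option Black-Scholes pricing; a stand-in for the
// repository's BlackScholesBody, not its actual implementation.
inline void BlackScholesSketch(float S, float K, float T, float r, float v,
                               float &callResult, float &putResult)
{
    // Standard normal CDF expressed via the complementary error function.
    auto cnd = [](float d) {
        return 0.5f * std::erfc(-d / std::numbers::sqrt2_v<float>);
    };

    const float sqrtT = std::sqrt(T);
    const float d1 = (std::log(S / K) + (r + 0.5f * v * v) * T) / (v * sqrtT);
    const float d2 = d1 - v * sqrtT;
    const float expRT = std::exp(-r * T);

    callResult = S * cnd(d1) - K * expRT * cnd(d2);
    putResult  = K * expRT * cnd(-d2) - S * cnd(-d1);
}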

Figure: Options grid. Sampling grid for our options universe.

Baseline Code

The baseline loops over options are contained in BlackScholesCPU. To expose all available parallelism via OpenMP, a collapse(2) clause is used. Different OpenMP compilers may prefer other parallelization schemes for these loops, but that is outside the scope of this example.
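
A hedged sketch of such a baseline is shown below; the function signature, array layout, and loop bounds are placeholders rather than the actual BlackScholesCPU code, and the per-option pricing call reuses the illustrative BlackScholesSketch from above.

// Hypothetical OpenMP baseline over a 2D slice of the options grid.
void BlackScholesOpenMPSketch(float *call, float *put,
                              const float *S, const float *K,
                              const float *T, const float *V,
                              float r, int numStrikes, int numMaturities)
{
    // collapse(2) merges both loops into a single iteration space so that
    // OpenMP can distribute all (i, j) pairs across the available threads.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < numStrikes; ++i)
        for (int j = 0; j < numMaturities; ++j)
        {
            const int idx = i * numMaturities + j;
            // Illustrative pricing function defined in the sketch above.
            BlackScholesSketch(S[idx], K[idx], T[idx], r, V[idx],
                               call[idx], put[idx]);
        }
}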

C++ Parallel Algorithms Code

The parallel algorithms version of the code appears in BlackScholesStdPar and has a number of small changes to utilize modern C++ features.

C++ span

Note that rather than taking raw pointers for the input and output arrays, more modern span objects are used. These objects provide non-owning views into the data, meaning that they do not control the lifetime of the data but rather provide a well-defined view into it. They also carry metadata about the data structure. For instance, rather than passing the number of options as a parameter to the function, it can be inferred from the data structure, as seen on line 56.
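
As a minimal illustration (the function and names below are hypothetical, not taken from the repository), a span carries its own length, so the element count never needs to be passed separately:

#include <cstdio>
#include <span>
#include <vector>

// A span is a non-owning view over existing storage and knows its size.
void PrintSummary(std::span<const float> callResult,
                  std::span<const float> putResult)
{
    std::printf("%zu options, first call = %f, first put = %f\n",
                callResult.size(), callResult[0], putResult[0]);
}

int main()
{
    std::vector<float> calls(1000, 1.0f), puts(1000, 2.0f);
    PrintSummary(calls, puts); // vectors convert implicitly to spans
}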

Iota Views

While the original code iterates from 0 to optN, the updated code uses an iota view over these bounds. Conceptually, an iota is a list of all iteration indices, except the indices are not actually stored in memory but rather calculated on the fly. This is a convenient way to express the same iteration space as the original loops when using C++ parallel algorithms.
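
As a minimal illustration (optN is just a placeholder value here):

#include <cstdio>
#include <ranges>

int main()
{
    const int optN = 5;
    // Conceptually the list {0, 1, 2, 3, 4}, but no array is allocated;
    // each index is produced on the fly as the view is traversed.
    auto indices = std::views::iota(0, optN);
    for (int opt : indices)
        std::printf("pricing option %d\n", opt);
}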

The for_each Algorithm and par_unseq Execution Policy

Finally, the feature that replaces serial for loops with something that can be run in parallel is the std::for_each algorithm. As the name implies, this algorithm will apply the same operation for each index in the iota view created above. The final argument to for_each in the example code is a lambda function, inside which the original BlackScholesBody function is called. Each time the lambda function is called, it is passed a single parameter, an integer value from the iota, which serves the same purpose as a loop index in the original code.

Since C++17, standard algorithms accept an optional first argument, an execution policy. This instructs the compiler about how the code may be executed. For instance, std::execution::seq instructs the compiler that the code must be run in sequential order, like a for loop. Using std::execution::par instructs the compiler that the code may be parallelized, but within each thread executing the algorithm the iterations must occur in order. This example code uses std::execution::par_unseq, which instructs the compiler that not only can the iterations be executed in parallel, but their order does not matter, making it ideal for GPU offloading or threading with SIMD vectorization. Using a parallel execution policy does not guarantee parallel execution, but it asserts that it is safe (and desirable) to execute the code in parallel. When using the nvc++ compiler, code built using par or par_unseq will be parallelized and offloaded when run on any GPU with a compute capability of at least 7.0.
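
Putting these pieces together, a hedged sketch of the overall pattern might look like the following; the function name and parameters are illustrative, and the call to BlackScholesSketch from the earlier sketch stands in for the repository's BlackScholesBody.

#include <algorithm>
#include <cstddef>
#include <execution>
#include <ranges>
#include <span>

// Illustrative parallel pricing loop in the style described above; not the
// exact BlackScholesStdPar code.
void BlackScholesStdParSketch(std::span<float> call, std::span<float> put,
                              std::span<const float> S, std::span<const float> K,
                              std::span<const float> T, std::span<const float> V,
                              float r)
{
    auto indices = std::views::iota(std::size_t{0}, call.size());

    // par_unseq: iterations may run in parallel and in any order, allowing
    // nvc++ to offload this loop to a GPU when built with -stdpar=gpu.
    std::for_each(std::execution::par_unseq,
                  indices.begin(), indices.end(),
                  [=](std::size_t opt)
                  {
                      BlackScholesSketch(S[opt], K[opt], T[opt], r, V[opt],
                                         call[opt], put[opt]);
                  });
}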

Profit and Loss Calculation Example

Figure: Theta-Gamma Tradeoff. Depending on the size of the underlying move over a time step, the gain from gamma may counteract, surpass, or succumb to losses due to theta.
Figure: Theta-Gamma PNL. Total PNL along a path is the sum of daily PNLs.
Figure: Gamma-Theta PNL Surface. Average PNL across all paths for a given option to a given time horizon.

This example simulates a number of options along one or more random paths to a given horizon, calculating the mean profit or loss across all paths. We generate some number of random paths using the built-in C++ random number generation capabilities and then parallelize the simulation in two different ways. Unlike the earlier example, we start directly with a parallel C++ code, based loosely on that example and reusing much of the same Black-Scholes code.
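
As a rough illustration of the standard random number facilities (the generator, distribution, and parameters below are assumptions, not the repository's actual choices):

#include <random>
#include <vector>

// Hypothetical path generation: one vector of daily underlying moves per path.
std::vector<std::vector<double>> GeneratePaths(int numPaths, int numSteps,
                                               double dailyVol)
{
    std::mt19937 gen(42);                                 // fixed seed for reproducibility
    std::normal_distribution<double> dist(0.0, dailyVol); // daily move of the underlying

    std::vector<std::vector<double>> paths(numPaths, std::vector<double>(numSteps));
    for (auto &path : paths)
        for (auto &step : path)
            step = dist(gen);
    return paths;
}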

Exposing More Parallelism

This second example has no serial baseline code; it is written from the start using the same approach outlined in the example above. Profit & Loss is calculated for all options along a random path. Since parallelizing over options was a successful strategy above, that is the starting point in this code as well. In the baseline code, each path is walked in order, calculating profits & losses for all options at that point along the path. This approach yields code that parallelizes well across CPU cores and obtains a speed-up on the GPU while maintaining a single code base for both. This approach is found in the calculate_pnl_paths_sequential function.
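
A structural sketch of this first approach is shown below, assuming a single flattened path and a placeholder per-step PNL update; the names are illustrative and the real logic lives in calculate_pnl_paths_sequential.

#include <algorithm>
#include <cstddef>
#include <execution>
#include <ranges>
#include <span>

// Walk the path in order on the host and launch a parallel algorithm over
// all options at every step. Names and the PNL update are placeholders.
void PnLPathsSequentialSketch(std::span<const float> pathMoves, // one path's steps
                              std::span<float> pnl)             // one PNL per option
{
    auto options = std::views::iota(std::size_t{0}, pnl.size());

    for (float move : pathMoves) // sequential walk along the path
    {
        // One parallel launch, and one synchronization, per path step.
        std::for_each(std::execution::par_unseq,
                      options.begin(), options.end(),
                      [=](std::size_t opt)
                      {
                          pnl[opt] += move; // placeholder for the real theta/gamma PNL
                      });
    }
}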

Because C++ parallel algorithms are synchronous with the CPU, this initial approach incurs the overhead of launching the function on the GPU and waiting for the results at every point along every path. Removing this overhead is key to obtaining higher performance. Since each path is fully independent of every other path, it makes sense to parallelize across paths in addition to across options. This requires some restructuring to pull off. The refactored code is found in calculate_pnl_paths_parallel. The loop over paths is folded in with the loop over options, but each path is walked sequentially within the parallel algorithm using a transform_reduce algorithm.

Since both paths and options run in parallel, there is now the potential for a race condition if two paths attempt to update the PNL value for a particular option at the same time. To prevent this, an atomic_ref is used at line 211 and fetch_add at line 247.
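
A structural sketch of the refactored pattern, with illustrative names and a placeholder per-step PNL calculation (the real code is in calculate_pnl_paths_parallel):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <execution>
#include <functional>
#include <numeric>
#include <ranges>
#include <span>

// Fold paths and options into one parallel iteration space; walk each path
// sequentially with transform_reduce and accumulate into the shared PNL
// array with an atomic_ref. Names and the PNL math are placeholders.
void PnLPathsParallelSketch(std::span<const float> pathMoves, // all paths, flattened
                            std::span<float> pnl,             // one PNL per option
                            std::size_t numPaths, std::size_t numSteps)
{
    const std::size_t numOptions = pnl.size();
    auto work = std::views::iota(std::size_t{0}, numPaths * numOptions);

    std::for_each(std::execution::par_unseq, work.begin(), work.end(),
                  [=](std::size_t idx)
                  {
                      const std::size_t path = idx / numOptions;
                      const std::size_t opt  = idx % numOptions;
                      auto steps = pathMoves.subspan(path * numSteps, numSteps);

                      // Sum the per-step PNL along this path sequentially.
                      const float pathPnl = std::transform_reduce(
                          steps.begin(), steps.end(), 0.0f, std::plus<>{},
                          [](float move) { return move; /* placeholder step PNL */ });

                      // Multiple paths update the same option concurrently,
                      // so the accumulation must be atomic.
                      std::atomic_ref<float> ref(pnl[opt]);
                      ref.fetch_add(pathPnl, std::memory_order_relaxed);
                  });
}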

When run on both a multicore CPU and a GPU, the new approach achieves higher performance than the original, with the GPU benefiting more significantly due to the reduced number of kernel launches and synchronization points.

Build Instructions

Software Requirements

This software has been tested with NVIDIA HPC SDK 23.1 and newer, with GCC 12 or newer also installed. The code should work with any C++ compiler that supports the specific features used within, but this has not been tested. Additionally, the code has been tested with the NVIDIA HPC SDK container using the provided Dockerfile.

Building

A Makefile has been provided for building with nvc++ for multicore CPU and GPU targets. The default make target will build the following executables, or each can be built with its own target.

  • BlackScholes_cpu - Black-Scholes example built for multicore CPU
  • BlackScholes_gpu - Black-Scholes example built for an NVIDIA GPU
  • pnl_cpu - The Profit & Loss example built for multicore CPU
  • pnl_gpu - The Profit & Loss example built for an NVIDIA GPU

Optionally, the provided Dockerfile can be used to build the executables. The resulting executables will be built in /work and will appear in the container $PATH by default.

Fix for missing header error

When building this code with nvc++, it is necessary that the HPC SDK has been configured to use a recent version of the GNU Compiler Collection (GCC). You may see an error like the following if your installation is misconfigured.

nvc++  -fast -mp -std=c++20 -I. -mp -stdpar=gpu -gpu=ccall -c -o BlackScholes_main_gpu.o BlackScholes_main.cpp
"BlackScholes_main.cpp", line 26: catastrophic error: cannot open source file "span"
  #include <span>
                 ^

1 catastrophic error detected in the compilation of "BlackScholes_main.cpp".
Compilation terminated.

If this error occurs, you must specify the correct GCC version on the compilation line. To do this for the examples, simply change the first line of the Makefile as follows:

CXX=nvc++ --gcc-toolchain=$(which g++-12)
