Intel Math Kernel Library
User's Guide
August 2008
Document revision history:

-001                    Original issue. Documents Intel Math Kernel Library (Intel MKL) 9.0 gold release.

-002 (January 2007)     Documents Intel MKL 9.1 beta release. Getting Started, LINPACK and MP LINPACK Benchmarks chapters and Support for Third-Party and Removed Interfaces appendix added. Existing chapters extended. Document restructured. List of examples added.

-003 (June 2007)         Documents Intel MKL 9.1 gold release. Existing chapters extended. Document restructured. More aspects of the ILP64 interface discussed. Section "Configuring the Eclipse* IDE CDT to Link with Intel MKL" added to chapter 3. Cluster content organized into a separate chapter 9, "Working with Intel Math Kernel Library Cluster Software," and restructured; appropriate links added.

-004 (September 2007)   Documents Intel MKL 10.0 beta release. Layered design model described in chapter 3 and the content of the entire book adjusted to the model. Automation of setting environment variables at startup described in chapter 4. New Intel MKL threading controls described in chapter 6. The User's Guide for Intel MKL merged with the one for Intel MKL Cluster Edition to reflect consolidation of the respective products.

-005 (October 2007)     Documents Intel MKL 10.0 gold release. Configuring Eclipse CDT 4.0 to link with Intel MKL described in chapter 3. Intel Compatibility OpenMP* run-time compiler library (libiomp) described.

-006 (May 2008)         Documents Intel MKL 10.1 beta release. Information on dummy libraries in Table "High-level directory structure" further detailed. Information on the Intel MKL configuration file removed. Section "Accessing Man Pages" added to chapter 3. Section "Support for Boost uBLAS Matrix-Matrix Multiplication" added to chapter 7. Chapter "Getting Assistance for Programming in the Eclipse* IDE" added.

-007 (August 2008)      Documents Intel MKL 10.1 gold release. Linking examples for the IA-32 architecture and section "Linking with Computational Libraries" added to chapter 5. Integration of DSS/PARDISO into the layered structure documented. Two Fortran code examples added.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site. Intel processor numbers are not a measure of performance. 
Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Core Inside, FlashFile, i960, InstantIP, Intel, Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, IPLink, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. * Other names and brands may be claimed as the property of others. Copyright 2006 - 2008, Intel Corporation. All rights reserved.
Contents
Chapter 1 Overview
Technical Support ....................................................................... 1-1
About This Document .................................................................. 1-1
    Purpose................................................................................. 1-2
    Audience ............................................................................... 1-2
    Document Organization ........................................................... 1-2
    Term and Notational Conventions .............................................. 1-3
Chapter 2
Getting Started
Checking Your Installation............................................................ 2-1
Obtaining Version Information ...................................................... 2-2
Compiler Support ....................................................................... 2-2
Using Intel MKL Code Examples .................................................... 2-2
Before You Begin Using Intel MKL ................................................. 2-3
Chapter 3
Contents of the Documentation Directory................................. 3-20
Accessing Man Pages ............................................................ 3-21
Chapter 4
Chapter 5
Chapter 6
        Variable .............................................................................. 6-4
    Changing the Number of Threads at Run Time............................ 6-5
    Using Additional Threading Control ........................................... 6-8
Tips and Techniques to Improve Performance ................................ 6-13
    Coding Techniques................................................................. 6-13
    Hardware Configuration Tips ................................................... 6-15
    Managing Multi-core Performance ............................................ 6-15
    Operating on Denormals......................................................... 6-17
    FFT Optimized Radices ........................................................... 6-17
Using Intel MKL Memory Management ....................................... 6-17
    Redefining Memory Functions.................................................. 6-18
Chapter 7
Chapter 8

Coding Tips
Aligning Data for Numerical Stability ............................................. 8-1

Chapter 9
Appendix A Intel Math Kernel Library Language Interfaces Support

Appendix B Support for Third-Party Interfaces
GMP* Functions ......................................................................... B-1 FFTW Interface Support .............................................................. B-1
Table 3-7 Contents of the doc directory ....................................... 3-20
Table 5-1 Quick comparison of Intel MKL linkage models ................ 5-2
Table 5-2 Interface layer library for linking with the Absoft* compilers .. 5-11
Table 5-3 Selecting the Threading Layer ...................................... 5-12
Table 5-4 Computational libraries to link, by function domain ........... 5-13
Table 6-1 How to avoid conflicts in the execution environment for your threading model ... 6-4
Table 6-2 Intel MKL environment variables for threading controls ...... 6-9
Table 6-3 Interpretation of MKL_DOMAIN_NUM_THREADS values....... 6-12
Table 7-1 Interface libraries and modules ..................................... 7-1
Table 11-1 Contents of the LINPACK Benchmark ........................... 11-2
Table 11-2 Contents of the MP LINPACK Benchmark ...................... 11-5
List of Examples
Example 6-1 Changing the number of processors for threading........ 6-5
Example 6-2 Setting the number of threads to one ........................ 6-9
Example 6-3 Setting an affinity mask by operating system means using an Intel compiler ... 6-16
Example 6-4 Redefining memory functions .................................. 6-19
Example 7-1 Calling a complex BLAS Level 1 function from C .......... 7-8
Example 7-2 Calling a complex BLAS Level 1 function from C++...... 7-9
Example 7-3 Using CBLAS interface instead of calling BLAS directly from C ... 7-10
Example 8-1 Aligning addresses at 16-byte boundaries .................. 8-2
List of Figures
Figure 5-1 Linking with Layered Intel Math Kernel Library ............. 5-5
Figure 7-1 Column-major order vs. row-major order ...................... 7-5
Figure 10-1 Intel Math Kernel Library Help in the Eclipse* IDE ........ 10-2
Figure 10-2 Hits to the Intel Web Site in the Eclipse* IDE Help search . 10-3
Figure 10-3 Infopop Window with an Intel MKL function description ... 10-4
Figure 10-4 F1 Help in the Eclipse* IDE ...................................... 10-5
Figure 10-5 F1 Help Search in the Eclipse* IDE CDT ..................... 10-6
Figure 10-6 Code/Content Assist ............................................... 10-7
Figure 10-7 Customizing Code/Content Assist ............................. 10-9
Overview
Intel Math Kernel Library (Intel MKL) offers highly optimized, thread-safe math routines for science, engineering, and financial applications that require maximum performance.
Technical Support
Intel provides a support web site that contains a rich repository of self-help information, including getting-started tips, known product issues, product errata, license information, user forums, and more. Visit the Intel MKL support website at http://www.intel.com/software/products/support/.
This guide should be used in conjunction with the latest version of the Intel Math Kernel Library for Linux* Release Notes for reference on how to use the library in your application.
Purpose
Intel Math Kernel Library for Linux* User's Guide is intended to assist you in mastering the usage of Intel MKL on Linux. In particular, it:
- Describes post-installation steps to help you start using the library
- Shows you how to configure the library with your development environment
- Acquaints you with the library structure
- Explains how to link your application to the library and provides simple usage scenarios
- Describes how to code, compile, and run your application with Intel MKL for Linux.
Audience
The guide is intended for Linux programmers with beginner to advanced experience in software development.
Document Organization
The document contains the following chapters and appendices:

Chapter 1   Overview. Introduces the concept of the Intel MKL usage information; describes the document's purpose and organization as well as explains notational conventions.

Chapter 2   Getting Started. Describes post-installation steps and gives information needed to start using Intel MKL after its installation.

Chapter 3   Intel Math Kernel Library Structure. Discusses the structure of the Intel MKL directory after installation at different levels of detail, as well as the library versions and parts.

Chapter 4   Configuring Your Development Environment. Explains how to configure Intel MKL with your development environment.

Chapter 5   Linking Your Application with Intel Math Kernel Library. Compares static and dynamic linking models; describes the general link line syntax to be used for linking with Intel MKL libraries;
explains which libraries should be linked with your application for your particular platform and function domain; discusses how to build custom dynamic libraries.

Chapter 6   Managing Performance and Memory. Discusses Intel MKL threading; shows coding techniques and gives hardware configuration tips for improving performance of the library; explains features of the Intel MKL memory management and, in particular, shows how to replace the memory functions that the library uses by default with your own ones.

Chapter 7   Language-specific Usage Options. Discusses mixed-language programming and the use of language-specific interfaces.

Chapter 8   Coding Tips. Presents coding tips that may be helpful to your specific needs.

Chapter 9   Working with Intel Math Kernel Library Cluster Software. Discusses usage of ScaLAPACK and Cluster FFTs; in particular, describes linking of your application with these function domains, including C- and Fortran-specific linking examples; gives information on the supported MPI.

Chapter 10  Getting Assistance for Programming in the Eclipse* IDE. Discusses Intel MKL features that software engineers can benefit from when working in the Eclipse* IDE.

Chapter 11  LINPACK and MP LINPACK Benchmarks. Describes the Intel Optimized LINPACK Benchmark for Linux* and the Intel Optimized MP LINPACK Benchmark for Clusters.

Appendix A  Intel Math Kernel Library Language Interfaces Support. Summarizes information on language interfaces that Intel MKL provides for each function domain, including the respective header files.

Appendix B  Support for Third-Party Interfaces. Describes in brief some interfaces that Intel MKL supports.
Table 1-1  Notational conventions

Italic                    Used for emphasis; also indicates document names in body text, for example: see Intel MKL Reference Manual
Monospace lowercase       Indicates filenames, directory names, and pathnames, for example: libmkl_core.a, /opt/intel/mkl/10.1.0.004
Monospace lowercase
mixed with uppercase      Indicates commands and command-line options
UPPERCASE MONOSPACE
Monospace italic
<mkl_directory>           The main directory to which Intel MKL is installed. Substitute it with the specific pathname in the configuring, linking, and building instructions.
Getting Started
This chapter helps you start using the Intel Math Kernel Library (Intel MKL) for the Linux* OS by giving you some basic information and describing post-installation steps.
/opt/intel/mkl/RR.r.y.xxx, where RR.r is the version number, y is the release-update number, and xxx is the package number, for example, /opt/intel/mkl/10.1.0.004

<Intel Compiler Pro directory>/mkl, where <Intel Compiler Pro directory> is the installation directory for the Intel C++ Compiler Professional Edition or the Intel Fortran Compiler Professional Edition, for example, /opt/intel/Compiler/11.0.015/mkl.
2. If you choose to keep multiple versions of Intel MKL installed on your system, update build scripts so that they point to the desired version.
3. Check that the following six files are placed in the tools/environment directory:
Compiler Support
Intel supports Intel MKL for use only with compilers identified in the Release Notes. However, the library has been successfully used with other compilers as well. When using the CBLAS interface, the header file mkl.h will simplify program development, since it specifies enumerated values as well as prototypes for all the functions. The header determines if the program is being compiled with a C++ compiler and, if so, the included file will be correct for use with C++ compilation. Starting with Intel MKL 9.1, full support is provided for the GNU gfortran* compiler, which differs from the Intel Fortran Compiler in calling conventions for functions that return complex data. Absoft* Fortran compilers are supported as well. For usage specifics of the Absoft compilers, see Linking with the Absoft* compilers in chapter 5.
The examples are grouped in subdirectories mainly by Intel MKL function domains and programming languages. For instance, subdirectory examples/spblas contains Sparse BLAS examples, and subdirectory examples/vmlc contains VML examples in C. Source code of the examples is in the next level sources subdirectory. To compile, build, and run the examples, use the makefile provided. For information on how to use it, refer to the makefile header. See also:
Mathematical problem
Identify all Intel MKL function domains that the problems you are solving require:

- BLAS
- Sparse BLAS
- LAPACK
- PBLAS
- ScaLAPACK
- Sparse Solver routines
- Vector Mathematical Library functions
- Vector Statistical Library functions
- Fourier Transform functions (FFT)
- Cluster FFT
- Trigonometric Transform routines
- Poisson, Laplace, and Helmholtz Solver routines
- Optimization (Trust-Region) Solver routines
- GMP* arithmetic functions

Reason. The function domain you intend to use narrows the search in the Reference Manual for the specific routines you need. Additionally, the link line that you use to link your application with Intel MKL cluster software depends on the function domains you intend to employ (see Working with Intel Math Kernel Library Cluster Software). Coding tips may also depend on the function domain (see Tips and Techniques to Improve Performance).
Table 2-1
Programming language
Threading model

Select among the following options how you are going to thread your application:
- Your application is already threaded
- You may want to use the Intel threading capability, that is, the Compatibility OpenMP* run-time library (libiomp) or the Legacy OpenMP* run-time library (libguide), or a threading capability provided by a third-party compiler
- You do not want to thread your application.

Reason. By default, the OpenMP* software sets the number of threads that Intel MKL uses. If you need a different number, you have to set it yourself using one of the available mechanisms. For more information, and especially how to avoid conflicts in the threaded execution environment, see Using Intel MKL Parallelism. Additionally, the compiler that you use to thread your application determines which threading library you should link with your application (see Linking Examples).

Linking model

Decide which linking model is appropriate for linking your application with Intel MKL libraries:
- Static
- Dynamic

Reason. For information on the benefits of each linking model, link command syntax and examples, link libraries, and other linking topics, such as how to save disk space by creating a custom dynamic library, see Linking Your Application with Intel Math Kernel Library.

MPI used

Reason. To link your application with ScaLAPACK and/or Cluster FFT, the libraries corresponding to your particular MPI should be included in the link line (see Working with Intel Math Kernel Library Cluster Software).
This chapter discusses the structure of the Intel Math Kernel Library (Intel MKL), including the Intel MKL directory structure, as well as the library versions and parts. Starting with version 10.0, Intel MKL employs a layered model to streamline the library structure, reduce its size, and add usage flexibility. See also: Layered Model Concept.
Table 3-1
Directory
<mkl directory> <mkl directory>/benchmarks/linpack <mkl directory>/benchmarks/mp_linpack <mkl directory>/doc <mkl directory>/examples <mkl directory>/include <mkl directory>/interfaces/blas95 <mkl directory>/interfaces/ fftw2x_cdft
<mkl directory>/interfaces/fftw2xc <mkl directory>/interfaces/fftw2xf <mkl directory>/interfaces/fftw3xc <mkl directory>/interfaces/fftw3xf <mkl directory>/interfaces/lapack95 <mkl directory>/lib/32 <mkl directory>/lib/64 <mkl directory>/lib/em64t <mkl directory>/man/man3 <mkl directory>/tests <mkl directory>/tools/builder <mkl directory>/tools/environment <mkl directory>/tools/plugins/ com.intel.mkl.help <mkl directory>/tools/support
Interfaces. On Linux systems based on the IA-64 architecture, the Intel Fortran Compiler returns complex values differently than GNU and some other compilers. Rather than duplicate the library for these differences, separate interface libraries are provided to support compiler differences while constraining the size of the library. Similarly, LP64 can be supported on top of ILP64 through an interface. Moreover, interface libraries are provided to support legacy supercomputers where single precision means 64-bit arithmetic.
Threading. For efficiency reasons, Intel MKL employs function-level threading throughout the library rather than loop-level threading. Consequently, all threading can be constrained to a relatively small set of functions and collected into a library. All references to compiler-specific run-time libraries are generated in these functions. By compiling them with different compilers and providing a threading library layer, Intel MKL can work in programs threaded with Intel compilers and other supported threading compilers. A non-threaded library version can also be obtained by turning off threading when compiling the threading library layer, because all threading is provided through OpenMP* technology.
Computation. For any given processor family (processors based on the IA-32, IA-64, or Intel 64 architecture), a single computational library is used for all interfaces and threading layers because there is no parallelism in the computational layer.
Run-time library (RTL). The last layer provides RTL support. Not all RTLs are delivered with Intel MKL. The only RTLs provided, except those that are relevant to the Intel MKL cluster software, are Intel compiler based RTLs: the Intel Compatibility OpenMP* run-time compiler library (libiomp) and the Intel Legacy OpenMP* run-time compiler library (libguide). To thread using third-party threading compilers, you can employ the Threading layer libraries or use the compatibility library in the appropriate circumstances.
Layers
There are four essential parts of the library:
1. Interface layer
2. Threading layer
3. Computational layer
4. Compiler Support RTL layer (RTL layer, for brevity).
Interface Layer. This layer provides matching between compiled code of your application and the threading and/or computational parts of the library. This layer provides:
- An LP64 interface to Intel MKL ILP64 software (see Support for ILP64 Programming for details)
- A means of dealing with the way different compilers return function values
- A means of mapping between single-precision names and double-precision names in applications that employ ILP64, such as Cray-style naming.
Threading Layer. This layer provides a way for threaded Intel MKL to share supported compiler threading. The layer also provides a sequential version of the library. What was previously internal to the library is now essentially exposed in the threading layer. This layer is compiled for different environments (threaded or sequential) and compilers (Intel, GNU, and so on).
Computational Layer. This is the heart of Intel MKL and has only one variant for any processor/operating system family, such as 32-bit Intel processors on a 32-bit operating system. The computational layer accommodates multiple architectures through identification of the architecture or architectural feature and chooses the appropriate binary code at execution. Intel MKL may be thought of as one large computational layer that is unaffected by different computational environments. Because the computational layer has no RTL requirements, RTLs refer not to it but to one of the layers above it: the Interface layer or the Threading layer. The most likely case is matching the Threading layer with the RTL layer.

RTL Layer. This layer has run-time library support functions. For example, libiomp and libguide are RTLs providing threading support for the OpenMP* threading in Intel MKL.
See also the Linking Examples section in chapter 5.
Concept
The ILP64 interface is provided for the following two reasons:
- To support huge data arrays (with more than 2 billion elements)
- To enable compiling your Fortran code with the -i8 compiler option.
The Intel Fortran Compiler supports the -i8 option for changing the behavior of the INTEGER type. By default the standard INTEGER type is 4-byte. The -i8 option makes the compiler treat INTEGER constants, variables, and function and subroutine parameters as 8-byte.

The ILP64 binary interface uses 8-byte integers for function parameters that define array sizes, indices, strides, and so on. At the language level, that is, in the *.f90 and *.fi files located in the Intel MKL include directory, such parameters are declared as INTEGER.

To bind your Fortran code with the ILP64 interface, you must compile your code with the -i8 compiler option. Conversely, if your code is compiled with -i8, you can bind it only with the ILP64 interface, because the LP64 binary interface requires the INTEGER type to be 4-byte. Note that some Intel MKL functions and subroutines have scalar or array parameters of type INTEGER*4 or INTEGER(KIND=4), which are always 4-byte regardless of whether the code is compiled with the -i8 option.

For C/C++, Intel MKL provides the MKL_INT type as a counterpart of the Fortran INTEGER type. MKL_INT is a macro defined as the standard C/C++ type int by default. However, if the MKL_ILP64 macro is defined for the code compilation, MKL_INT is defined as a 64-bit integer type. To define the MKL_ILP64 macro, you may call the compiler with the -DMKL_ILP64 command-line option.

Intel MKL also defines the MKL_LONG type for maintaining the ILP64 interface in the specific case of the FFT interface for C/C++. MKL_LONG is defined as the standard C/C++ type long by default; if the MKL_ILP64 macro is defined for the code compilation, MKL_LONG is defined as a 64-bit integer type.
NOTE. The type int is 32-bit for the Intel C++ compiler, as well as for most modern C/C++ compilers. The type long is 32- or 64-bit for the Intel C++ and compatible compilers, depending on the particular OS.
In the Intel MKL interface for C or C++, that is, in the *.h header files located in the Intel MKL include directory, function parameters such as array sizes, indices, and strides are declared as MKL_INT. The FFT interface for C/C++ is a special case: the header file mkl_dfti.h uses the MKL_LONG type for both explicit and implicit parameters of the interface functions. Specifically, the type of the explicit parameter dimension of the function DftiCreateDescriptor() is MKL_LONG, and the type of the implicit parameter length is MKL_LONG for a one-dimensional transform and MKL_LONG[] (that is, an array of numbers of type MKL_LONG) for a multi-dimensional transform.
To bind your C/C++ code with the ILP64 interface, you must provide the -DMKL_ILP64 command-line option to the compiler to force MKL_INT and MKL_LONG to be 64-bit. Conversely, if your code is compiled with the -DMKL_ILP64 option, you can bind it only with the ILP64 interface, because the LP64 binary interface requires MKL_INT to be 32-bit and MKL_LONG to be the standard long type. Note that certain Intel MKL functions have parameters explicitly declared as int or int[]. Such integers are always 32-bit regardless of whether the code is compiled with the -DMKL_ILP64 option. Table 3-2 summarizes how the Intel MKL ILP64 concept is implemented:
Table 3-2

                                                              Fortran      C or C++
The same include directory for ILP64 and LP64 interfaces      <mkl directory>/include
Type used for parameters that are always 32-bit               INTEGER*4    int
Type used for parameters that are 64-bit integers for
the ILP64 interface and 32-bit integers for LP64              INTEGER      MKL_INT
Type used for all integer parameters of the FFT functions                  MKL_LONG
Command-line option to control compiling for ILP64            -i8          -DMKL_ILP64
To compile for the ILP64 interface, use the appropriate option, for example:

Fortran:
ifort -i8 -I<mkl directory>/include

C or C++:
icc -DMKL_ILP64 -I<mkl directory>/include
To compile for the LP64 interface, just omit the -i8 or -DMKL_ILP64 option. Note that linking an application compiled with the -i8 or -DMKL_ILP64 option against the LP64 libraries may result in unpredictable consequences and erroneous output.
Table 3-3

Whether compiling for the ILP64 or the LP64 interface, use:
- INTEGER (without specifying KIND) for Fortran
- MKL_INT for C/C++
- MKL_LONG for the parameters of the C/C++ FFT interface.
This way you make your code universal for both the ILP64 and LP64 interfaces. Alternatively, for the integer parameters that must be 64-bit in ILP64, you may use other types that are always 64-bit, for example, with Intel compilers, the following types:

Note that code written this way will not work for the LP64 interface. Table 3-4 summarizes usage of the integer types.
Table 3-4  Integer types

                                            Fortran                           C or C++
32-bit integers                             INTEGER*4 or INTEGER(KIND=4)      int
Universal integers                          INTEGER without specifying KIND   MKL_INT
Universal integers for the FFT interface    INTEGER without specifying KIND   MKL_LONG
Limitations
Note that not all components support the ILP64 feature. Table 3-5 shows which function domains support the ILP64 interface.
Table 3-5

Function domain
- BLAS
- Sparse BLAS
- LAPACK
- PBLAS
- ScaLAPACK
- VML
- VSL
- DSS/PARDISO* solvers
- ISS solvers
- Optimization (Trust-Region) solvers
- FFT
- FFTW
- Cluster FFT
- PDE support: Trigonometric Transforms
- PDE support: Poisson Solvers
- GMP
- BLAS 95
- LAPACK 95
Table 3-6
Directory/file
lib/32

Static Libraries

Interface layer
libmkl_gf.a               Interface library for the GNU Fortran compiler
libmkl_intel.a            Interface library for the Intel compiler

Threading layer
libmkl_gnu_thread.a       Parallel drivers library supporting the GNU compiler
libmkl_intel_thread.a     Parallel drivers library supporting the Intel compiler
libmkl_pgi_thread.a       Parallel drivers library supporting the PGI compiler
libmkl_sequential.a       Sequential drivers library
Computational layer
lib/32/libmkl_cdft_core.a
Cluster version of FFTs Kernel library for IA-32 architecture Dummy library. Contains references to Intel MKL libraries lib/32/libmkl_intel.a, lib/32/libmkl_intel_thread.a, and lib/32/libmkl_core.a. Dummy library. Contains references to Intel MKL libraries lib/32/libmkl_intel.a, lib/32/libmkl_intel_thread.a, and lib/32/libmkl_core.a. Dummy library. Contains a reference to
libmkl_lapack.a
lib/32/libmkl_scalapack_core.a
ScaLAPACK routines Iterative Sparse Solver, Trust Region Solver, and GMP routines Sequential version of Iterative Sparse Solver, Trust Region Solver, and GMP routines
Intel Legacy OpenMP* run-time library for static linking Intel Compatibility OpenMP* run-time library for static linking BLACS routines supporting the following MPICH versions:
BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2 A soft link to lib/32/libmkl_blacs_intelmpi.a BLACS routines supporting OpenMPI.
libmkl_gf.so libmkl_intel.so
Threading layer
libmkl.so
libmkl_p4p.so
ILP64 interface library for GNU Fortran compiler LP64 interface library for GNU Fortran compiler ILP64 interface library for Intel compiler LP64 interface library for Intel compiler SP2DP interface library for Intel compiler
Parallel drivers library supporting GNU compiler Parallel drivers library supporting Intel compiler Parallel drivers library supporting PGI compiler Sequential drivers library
lib/em64t/libmkl_cdft_core.a.
Cluster version of FFTs Kernel library for Intel 64 architecture Dummy library. Contains references to Intel MKL libraries lib/em64t/libmkl_intel_lp64.a, lib/em64t/libmkl_intel_thread.a, and lib/em64t/libmkl_core.a. Dummy library. Contains references to Intel MKL libraries lib/em64t/libmkl_intel_lp64.a, lib/em64t/libmkl_intel_thread.a, and lib/em64t/libmkl_core.a. Dummy library. Contains a reference to
libmkl_lapack.a
libmkl_scalapack.a
libmkl_scalapack_ilp64.a
libmkl_scalapack_lp64.a
libmkl_solver.a
libmkl_solver_ilp64.a
libmkl_solver_ilp64_sequential.a
libmkl_solver_lp64.a
libmkl_solver_lp64_sequential.a
lib/em64t/libmkl_scalapack_lp64.a.
ScaLAPACK routines library supporting ILP64 interface ScaLAPACK routines library supporting LP64 interface A dummy library. Contains a reference to
lib/em64t/libmkl_solver_lp64.a.
Iterative Sparse Solver and GMP routine library supporting ILP64 interface Sequential version of Iterative Sparse Solver and Trust Region Solver routine library supporting ILP64 interface Iterative Sparse Solver, Trust Region Solver, and GMP routine library supporting LP64 interface Sequential version of Iterative Sparse Solver, Trust Region Solver, and GMP routine library supporting LP64 interface
ILP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2 LP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2 A soft link to
lib/em64t/libmkl_blacs_intelmpi_ilp64.a
A soft link to
lib/em64t/libmkl_blacs_intelmpi_lp64.a
LP64 version of BLACS routines supporting the following MPICH versions:
ILP64 version of BLACS routines supporting OpenMPI. LP64 version of BLACS routines supporting OpenMPI. ILP64 version of BLACS routines supporting SGI MPT. LP64 version of BLACS routines supporting SGI MPT.
ILP64 interface library for GNU Fortran compiler LP64 interface library for GNU Fortran compiler ILP64 interface library for Intel compiler LP64 interface library for Intel compiler SP2DP interface library for Intel compiler
libmkl.so
libmkl_core.so
libmkl_def.so
libmkl_mc.so
libmkl_mc3.so
libmkl_lapack.so
libmkl_scalapack_ilp64.so
libmkl_scalapack_lp64.so
libmkl_vml_def.so
libmkl_vml_mc.so
libmkl_vml_mc3.so
libmkl_vml_p4n.so
libmkl_vml_mc2.so
RTL layer
libguide.so libiomp5.so
ILP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2 LP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2
Libraries for IA-64 architecture
ILP64 interface library for Intel compiler LP64 interface library for Intel compiler SP2DP interface library for Intel compiler ILP64 interface library for GNU Fortran compiler LP64 interface library for GNU Fortran compiler
Parallel drivers library supporting Intel compiler Parallel drivers library supporting GNU compiler Sequential drivers library Dummy library. Contains a reference to
lib/64/libmkl_cdft_core.a
Cluster version of FFTs Kernel library for IA-64 architecture Dummy library. Contains references to Intel MKL libraries lib/64/libmkl_intel_lp64.a, lib/64/libmkl_intel_thread.a, and lib/64/libmkl_core.a. Dummy library. Contains references to Intel MKL libraries lib/64/libmkl_intel_lp64.a, lib/64/libmkl_intel_thread.a, and
libmkl_lapack.a
lib/64/libmkl_scalapack_lp64.a.
ScaLAPACK routines library supporting ILP64 interface ScaLAPACK routines library supporting LP64 interface
Intel Legacy OpenMP* run-time library for static linking Intel Compatibility OpenMP* run-time library for static linking ILP64 version of BLACS routines supporting the following MPICH versions:
ILP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2 LP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2 A soft link to
lib/64/libmkl_blacs_intelmpi_ilp64.a
A soft link to
lib/64/libmkl_blacs_intelmpi_lp64.a
LP64 version of BLACS routines supporting the following MPICH versions:
ILP64 version of BLACS routines supporting OpenMPI. LP64 version of BLACS routines supporting OpenMPI. ILP64 version of BLACS routines supporting SGI MPT.
libmkl_blacs_sgimpt_lp64.a
Dynamic Libraries Interface layer
ILP64 interface library for GNU Fortran compiler LP64 interface library for GNU Fortran compiler ILP64 interface library for Intel compiler LP64 interface library for Intel compiler SP2DP interface library for Intel compiler
Parallel drivers library supporting GNU compiler Parallel drivers library supporting Intel compiler Sequential drivers library Dummy library. Contains references to Intel MKL libraries lib/64/libmkl_intel_lp64.so, lib/64/libmkl_intel_thread.so, and lib/64/libmkl_core.so. Library dispatcher for dynamic load of processor-specific kernel library Kernel library for IA-64 architecture LAPACK and DSS/PARDISO routines and drivers ScaLAPACK routines library supporting ILP64 interface ScaLAPACK routines library supporting LP64 interface VML kernel for IA-64 architecture Intel Legacy OpenMP* run-time library for dynamic linking Intel Compatibility OpenMP* run-time library for dynamic linking
libmkl.so
ILP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2
libmkl_blacs_intelmpi_lp64.so
LP64 version of BLACS routines supporting Intel MPI 2.0 and 3.0, and MPICH2
interfaces
1. Additionally, a number of interface libraries may be generated by running the respective makefiles in the interfaces directory (see Using Language-Specific Interfaces with Intel MKL in chapter 7).
Dummy Libraries
Layered libraries give you more flexibility in choosing the appropriate combination of libraries, but they are not backward compatible with earlier link lines by library names. Dummy libraries are provided for backward compatibility with earlier versions of Intel MKL, which did not use layered libraries. Dummy libraries do not contain any functionality, only dependencies on a set of layered libraries. When placed in a link line, a dummy library enables you to omit the layered libraries it depends on, because they are linked automatically. Dummy libraries depend on the following layered libraries (the default principle):
Interface: Intel, LP64
Threading: Intel compiled
Computational: the computational library.
So, if you employ the above interface and use the OpenMP* threading provided by the Intel compiler, you do not need to change your link lines.
File name
Install.txt mkl_documentation.htm
Table 3-7
mklEULA.txt mklman.pdf mklman90_j.pdf mklsupport.txt Readme.txt Release_Notes.htm Release_Notes.txt userguide.pdf vmlnotes.htm vslnotes.pdf ./tables
This chapter explains how to configure your development environment for use with the Intel Math Kernel Library (Intel MKL).
Intel MKL provides the scripts mklvars32, mklvarsem64t, and mklvars64, each in two flavors (.sh and .csh), in the tools/environment directory. These scripts set the environment variables INCLUDE, LD_LIBRARY_PATH, MANPATH, LIBRARY_PATH, CPATH, and FPATH in the user shell. Section Automating the Process explains how to automate setting these variables at startup. For information on how to set up environment variables for threading, see Setting the Number of Threads Using OpenMP* Environment Variable in chapter 6.
or mklvars64.
If you have super-user permissions, you can add the same commands to the system-wide files /etc/profile (for bash and sh) or /etc/csh.login (for csh). Before uninstalling Intel MKL, remove the above commands from all profile files where the script execution was added, to avoid problems when logging in.
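For example, assuming a bash shell and an em64t installation, the following line appended to ~/.bash_profile runs the script at every login (the path placeholder follows the convention used elsewhere in this guide):

```shell
# Substitute the actual Intel MKL installation directory for <mkl_directory>.
. <mkl_directory>/tools/environment/mklvarsem64t.sh
```

For csh, the corresponding .csh flavor of the script would be sourced from ~/.cshrc or /etc/csh.login instead.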
TIP. After linking your CDT with Intel MKL, you can benefit from the Eclipse-provided code assist feature. See Code/Context Assist description in Eclipse Help.
1. If the tool-chain/compiler integration supports include path options, go to the Includes tab of the C/C++ General > Paths and Symbols property page and set the Intel MKL include path, that is, <mkl directory>/include.
2. If the tool-chain/compiler integration supports library path options, go to the Library Paths tab of the C/C++ General > Paths and Symbols property page and set the Intel MKL library path, depending upon the target architecture, for example, <mkl directory>/lib/em64t.
3. For a particular build, go to the Tool Settings tab of the C/C++ Build > Settings property page and specify the names of the Intel MKL libraries to link with your application, for example, mkl_solver_lp64 and mkl_core (as compilers typically require library names rather than library file names, the "lib" prefix and "a" extension are omitted). To learn how to choose the libraries, see Selecting Libraries to Link in chapter 5. The name of the particular setting where libraries are specified depends upon the compiler integration.
Note that the compiler/linker will automatically pick up the include and library path settings only if automatic makefile generation is turned on; otherwise, you will have to specify the include and library paths directly in the makefile to be used.
Note that with the Standard Make, the above settings are needed for the CDT internal functionality only. The compiler/linker will not automatically pick up these settings, and you will still have to specify them directly in the makefile.
For Managed Make projects, you can specify settings for a particular build. To do this:
1. Go to the Tool Settings tab of the C/C++ Build property page. All the settings you need to specify are on this page. Names of the particular settings depend upon the compiler integration and therefore are not given below.
2. If the compiler integration supports include path options, set the Intel MKL include path, that is, <mkl_directory>/include.
3. If the compiler integration supports library path options, set a path to the Intel MKL libraries, depending upon the target architecture, for example, <mkl directory>/lib/em64t.
4. Specify the names of the Intel MKL libraries to link with your application, for example, mkl_lapack and mkl_ia32 (as compilers typically require library names rather than library file names, the "lib" prefix and "a" extension are omitted). To learn how to choose the libraries, see Selecting Libraries to Link in chapter 5.
This chapter discusses linking your applications with Intel Math Kernel Library (Intel MKL) for the Linux* OS. The chapter compares static and dynamic linking models; describes the general link line syntax to be used for linking with Intel MKL libraries; features information in a tabular form on the libraries that should be linked with your application for your particular platform and function domain; provides linking examples. Building custom shared objects is also discussed.
Static Linking
Static linking resolves all symbolic references at link time. The behavior of statically built executables is predictable, because there are no run-time dependencies. The main disadvantage is that relinking new versions of the library into your application may be error-prone and time-consuming, because you have to relink the entire application. Moreover, static linking results in large executables and uses memory less efficiently: if several executables are linked with the same library, each of them must load it into memory independently. This matters most for executables whose data size is small and comparable with the size of the executable itself.
Dynamic Linking
Dynamic linking postpones the resolution of some undefined symbolic references until run time. Dynamically built executables contain those symbols along with a list of libraries that provide definitions of the symbols. When the executable is loaded, the final linking is done before the application runs. If several dynamically built executables reference the same
library, it is loaded into memory only once and the executables share it, thereby saving memory. Dynamic linking enables you to separately update the libraries on which applications depend and does not require relinking the applications. The development advantages of dynamic linking are achieved at some cost to performance, because every unresolved symbol has to be looked up in a dedicated table and resolved at run time.
Table 5-1
Feature
software to link in more than one copy of the library. This causes performance problems (too many threads) and may cause correctness problems if more than one copy is initialized. You are advised to link with libiomp and libguide dynamically even if other libraries are linked statically.
<ld> myprog.o <mkl directory>/lib/32/libmkl_solver.a <mkl directory>/lib/32/libmkl_intel.a <mkl directory>/lib/32/libmkl_intel_thread.a <mkl directory>/lib/32/libmkl_core.a <mkl directory>/lib/32/libiomp5.so -lpthread
where <ld> is a linker and myprog.o is the user's object file. The appropriate Intel MKL libraries are listed first and are followed by the system library libpthread. Alternatively, in the link line you can list library names (with absolute or relative paths, if needed) preceded by -L<path>, which indicates where to search for binaries, and -I<include>, which indicates where to search for header files. The discussion of linking with Intel MKL libraries uses these options.
To link with Intel MKL libraries, specify paths and libraries in the link line as shown below.
NOTE. The syntax below is provided for dynamic linking. For static linking, replace each library name preceded with "-l" with the path to the library file, for example, replace -lmkl_core with $MKLPATH/libmkl_core.a, where $MKLPATH is the appropriate user-defined environment variable. See specific examples in the Linking Examples section.
<files to link> -L<MKL path> -I<MKL include> [-lmkl_lapack95] [-lmkl_blas95] [cluster components]
[{-lmkl_{intel, intel_ilp64, intel_lp64, intel_sp2dp, gf, gf_ilp64, gf_lp64}] [-lmkl_{intel_thread, gnu_thread, pgi_thread, sequential}] [{-lmkl_solver, -lmkl_solver_lp64, -lmkl_solver_ilp64}] {{[-lmkl_lapack] -lmkl_{ia32, em64t, ipf}}, -lmkl_core}} [{-liomp5, -lguide}] [-lpthread] [-lm]
See Selecting Libraries to Link for details of this syntax and specific recommendations on which libraries to link depending on your Intel MKL usage scenario. See also Fortran 95 Interfaces and Wrappers to LAPACK and BLAS in chapter 7 for information on the libraries that you should build prior to linking, and Working with Intel Math Kernel Library Cluster Software in chapter 9 for information on linking with cluster components.
To link with Intel MKL, you can choose either the layered model or the legacy model, which is backward compatible at the link-line level (except for cluster components). The syntax above incorporates both models. For the layered model, choose one library from the Interface layer, one library from the Threading layer, the Computational layer library (no choice here), and add run-time libraries. For the legacy model, you need not change the link line with respect to the one used with Intel MKL 9.x (see the Dummy Libraries section in chapter 3 for details). Figure 5-1 compares linking for Intel MKL version 10.0 or higher, which uses layers, with linking for Intel MKL 9.x.
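As an illustration, the two link lines below request equivalent functionality on IA-32 with the Intel compiler; myprog.o is a hypothetical object file, and the legacy line relies on the dummy libraries described in chapter 3:

```shell
# Legacy model: dummy libraries pull in the layered libraries automatically.
ifort myprog.o -L$MKLPATH -lmkl_lapack -lmkl -lguide -lpthread

# Layered model: interface, threading, and computational layers named explicitly.
ifort myprog.o -L$MKLPATH -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
```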
Figure 5-1
When employing the layered model for static linking, the cluster component, Interface layer, Threading layer, and Computational layer libraries must be enclosed in grouping symbols (for example, -Wl,--start-group $MKLPATH/libmkl_cdft_core.a $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group).
The order of listing libraries in the link line is essential, except for the libraries enclosed in the grouping symbols.
Layered, ILP64: libmkl_intel_ilp64.a libmkl_intel_thread.a libmkl_core.a

DSS/PARDISO*, dynamic case:
Legacy: libmkl_lapack.so libmkl_em64t.so
Layered, LP64: libmkl_intel_lp64.so libmkl_intel_thread.so libmkl_lapack.so libmkl_core.so
Layered, ILP64: libmkl_intel_ilp64.so libmkl_intel_thread.so libmkl_lapack.so libmkl_core.so
When linking (see Link Command Syntax and Linking Examples), note the following: The Iterative Sparse Solver and Trust Region Solver routine library currently does not comply with the layered model, so it is unchanged internally with respect to Intel MKL 9.x. However, to support the LP64/ILP64 interfaces, two libraries were introduced in the unified structure: libmkl_solver_lp64.a for the LP64 interface and libmkl_solver_ilp64.a for the ILP64 interface. For backward link-line compatibility, libmkl_solver.a has become a dummy library. As in previous releases, only a static version of the solver library is available. To link with the Iterative Sparse Solver and Trust Region Solver routine library using the layered model, include libmkl_solver_lp64.a or libmkl_solver_ilp64.a in the link line, depending upon the interface you need.
NOTE. In Intel MKL 10.1 Gold, the DSS/PARDISO* solver functionality was excluded from the libmkl_solver*.a libraries and integrated into the Intel MKL layered structure. So, to use DSS/PARDISO*, it is no longer necessary to link with libmkl_solver*.a, but the former link line still works. Note that both static and dynamic libraries are now available for DSS/PARDISO*.
The libmkl_lapack95.a and libmkl_blas95.a libraries contain the LAPACK95 and BLAS95 interfaces, respectively. They are not included in the original distribution and should be built before the interfaces are used. (See Fortran 95 Interfaces and Wrappers to LAPACK and BLAS and Compiler-dependent Functions and Fortran 90 Modules in chapter 7 for details on building the libraries and on why source code is distributed in this case.)
To use the Intel MKL FFT, Trigonometric Transform, or Poisson, Laplace, and Helmholtz Solver routines, link in the math support system library by adding "-lm" to the link line. In products for Linux, it is also necessary to link in the pthread library by adding -lpthread. The pthread library is native to Linux, and libguide makes use of it to support multi-threading. Any time libguide is required, add -lpthread at the end of your link line (link order is important).
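For instance, a dynamic link line for a hypothetical Fortran program that calls FFT routines might end with the math and pthread libraries, in this order:

```shell
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE \
    -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm
```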
See also: Linking Examples Linking with Interface Libraries Linking with Threading Libraries Linking with Computational Libraries
Linking Examples
This section provides specific linking examples that use Intel compilers on systems based on the IA-32 and Intel 64 architectures. In these examples, the <MKL path> and <MKL include> placeholders are replaced with the user-defined environment variables $MKLPATH and $MKLINCLUDE, respectively. See also examples on linking with ScaLAPACK and Cluster FFT in chapter 9. For more linking examples, see the Intel MKL support website at http://www.intel.com/support/performancetools/libraries/mkl/.
See Fortran 95 Interfaces and Wrappers to LAPACK and BLAS in chapter 7 for information on how to build Fortran 95 LAPACK and BLAS interface libraries.
5. Static linking of user's code myprog.f, Fortran 95 LAPACK interface1, and parallel Intel MKL:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_lapack95 -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
6. Static linking of user's code myprog.f, Fortran 95 BLAS interface1, and parallel Intel MKL:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_blas95 -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
7. Static linking of user's code myprog.f, parallel version of an iterative sparse solver, and parallel Intel MKL:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_solver -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
8. Static linking of user's code myprog.f, sequential version of an iterative sparse solver, and sequential Intel MKL:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_solver_sequential -Wl,--start-group $MKLPATH/libmkl_intel.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread
1. Static linking of user's code myprog.f and parallel Intel MKL supporting LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
3. Static linking of user's code myprog.f and sequential version of Intel MKL supporting LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread
5. Static linking of user's code myprog.f and parallel Intel MKL supporting ILP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -Wl,--start-group $MKLPATH/libmkl_intel_ilp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
7. Static linking of user's code myprog.f, Fortran 95 LAPACK interface1, and parallel Intel MKL supporting LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_lapack95 -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
8. Static linking of user's code myprog.f, Fortran 95 BLAS interface1, and parallel Intel MKL supporting LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_blas95 -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
9. Static linking of user's code myprog.f, parallel version of an iterative sparse solver, and parallel Intel MKL supporting LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_solver_lp64 -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_intel_thread.a $MKLPATH/libmkl_core.a -Wl,--end-group -liomp5 -lpthread
10. Static linking of user's code myprog.f, sequential version of an iterative sparse solver, and sequential Intel MKL supporting LP64 interface:
ifort myprog.f -L$MKLPATH -I$MKLINCLUDE -lmkl_solver_lp64_sequential -Wl,--start-group $MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a -Wl,--end-group -lpthread
1. See Fortran 95 Interfaces and Wrappers to LAPACK and BLAS in chapter 7 for information on how to build Fortran 95 LAPACK and BLAS interface libraries.
Table 5-2
Architecture
IA-32 Intel 64 Intel 64
ILP64 LP64
Threading Layer. Starting with version 10.0, Intel MKL is structured as layers. One of those layers is the Threading Layer. Because of the internal structure of the library, all of the threading represents a small amount of code. This code is compiled by different compilers (Intel, GNU, and PGI compilers on Linux*), and the appropriate layer is linked in with the threaded application.

RTL Layer. The second relevant component is the Compiler Support RTL Layer. Prior to Intel MKL 10.0, this layer included only the Intel Legacy OpenMP* run-time compiler library libguide. Now you also have the choice of the Intel Compatibility OpenMP* run-time compiler library libiomp. The Compatibility library libiomp is an extension of libguide that provides support for one additional threading compiler on Linux (GNU). That is, a program threaded with a GNU compiler can safely be linked with Intel MKL and libiomp and execute efficiently and effectively. So, you are encouraged to use libiomp rather than libguide.

Table 5-3 shows different scenarios, depending on the threading compiler used, and the possibilities for each scenario to choose the Threading layer and RTL layer when using the current version of Intel MKL (static cases only):
Table 5-3
Compiler Intel PGI
libiomp5.so or libguide.so
PGI supplied Use of libmkl_
sequential.a
removes threading from Intel MKL calls.
No No No Yes
libiomp5.so or libguide.so
PGI supplied None
libiomp5.so or
GNU OpenMP run-time library None
libiomp5 offers
superior scaling performance.
Yes No Yes No
libiomp5.so or libguide.so
None
libiomp5.so or libguide.so
Table 5-4
Function domain/ Interface BLAS Sparse BLAS BLAS95 Interface
libmkl_core.so libmkl_core.so
n/a1
libmkl_core.so libmkl_core.so
n/a1
CBLAS LAPACK
LAPACK95 Interface
Iterative Sparse Solvers, Trust Region Solver, and GMP routines Iterative Sparse Solvers, Trust Region Solver, and GMP routines, LP64 interface Iterative Sparse Solvers, Trust Region Solver, and GMP routines, ILP64 interface
n/a1
n/a1
libmkl_solver_lp64.a
or
libmkl_solver_ilp64.a
or
n/a1
Function domain/ Interface Direct Sparse Solver/ PARDISO* Solver Vector Math Library Vector Statistical Library Fourier Transform Functions Trigonometric Transform Functions Poisson Library ScaLAPACK2
libmkl_core.a
libmkl_core.a
libmkl_core.a libmkl_core.a
libmkl_core.a libmkl_core.a
libmkl_core.a
libmkl_core.so
libmkl_core.a
libmkl_core.so
libmkl_core.a
libmkl_core.so
libmkl_core.a
libmkl_core.so
libmkl_core.a
See below
libmkl_core.so
See below
n/a1
n/a1
n/a1
n/a1
2. Add also the library with BLACS routines corresponding to the used MPI. For details, see Linking with ScaLAPACK and Cluster FFTs in chapter 9.
Notes on Linking
Updating LD_LIBRARY_PATH
When using the Intel MKL shared libraries, do not forget to update the shared library search path, that is, the system variable LD_LIBRARY_PATH, to include the location of the libraries. For example, if the Intel MKL libraries are in the <mkl_directory>/lib/32 directory, the following command line can be used (assuming a bash shell):
export LD_LIBRARY_PATH=<mkl_directory>/lib/32:$LD_LIBRARY_PATH
If you use dynamic linking (libiomp5.so) of the threading library (recommended), make sure the LD_LIBRARY_PATH is defined so that exactly this version of libiomp is found and used at run time.
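One way to check which libiomp5 the loader will actually pick up is to inspect the dynamic dependencies of the executable (myprog is a stand-in for your own program; <mkl_directory> is a placeholder, as elsewhere in this guide):

```shell
# Put the intended library directory first on the search path, then verify.
export LD_LIBRARY_PATH=<mkl_directory>/lib/em64t:$LD_LIBRARY_PATH
ldd ./myprog | grep libiomp5    # shows the full path of the libiomp5.so to be loaded
```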
export = user_list
specifies the full name of the file that contains the list of entry-point functions to be included in the shared object. This file is used to create a definition file and then the export table. The default name is user_list (no extension).
name = mkl_custom
specifies the name of the created library. By default, the library mkl_custom.so is built.
xerbla = user_xerbla.o
specifies the name of the object file that contains the user's error handler. This error handler will be added to the library and used instead of the Intel MKL error handler xerbla. By default, this parameter is not specified and the native Intel MKL xerbla is used. Note that if the user's error handler has the same name as the Intel MKL handler, the name of the user's handler must be upper-case, that is, XERBLA.o.
None of the parameters is mandatory. In the simplest case, the command line could be
make ia32
and the values of the remaining parameters will default. As a result, the mkl_custom.so library for processors using the IA-32 architecture will be created, the functions list will be taken from the user_list file, and the native Intel MKL error handler xerbla will be used.
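As a sketch of the export file format (the function names below are only examples), the list of entry points is a plain text file with one name per line:

```shell
# Create a hypothetical export list; each line names one entry point
# to be included in the custom shared object.
cat > user_list <<EOF
dgemm
dgetrf
ddot
EOF
```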
Another example for a more complex case is as follows:
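A sketch of such an invocation, combining all three parameters (the file names are hypothetical), might be:

```shell
# Build a custom library named mkl_small.so for IA-32, exporting only the
# functions listed in my_func_list.txt and using a user-supplied XERBLA.
make ia32 export=my_func_list.txt name=mkl_small xerbla=my_xerbla.o
```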
This chapter features different ways to obtain the best performance with the Intel Math Kernel Library (Intel MKL): primarily, it discusses threading (see Using Intel MKL Parallelism), then shows coding techniques and gives hardware configuration tips for improving performance. The chapter also discusses the Intel MKL memory management and shows how to redefine memory functions that the library uses by default.
*tptrs, *tbtrs
Orthogonal factorization, computational routines:
All FFTs (except 1D transformations when DFTI_NUMBER_OF_TRANSFORMS=1 and sizes are not powers of two).
NOTE. For power-of-two data in 1D FFTs, Intel MKL provides parallelism for all three supported architectures. For the Intel 64 architecture, the parallelism is provided for double complex out-of-place FFTs only.
Being designed for multi-threaded programming, Intel MKL is thread-safe, which means that all Intel MKL functions1 work correctly during simultaneous execution by multiple threads. In particular, any chunk of threaded Intel MKL code provides access of multiple threads to the same shared data, while permitting only one thread at any given time to access a shared piece of data. Because of this thread-safety, you can call Intel MKL from multiple threads without worrying about the function instances interfering with each other. The library uses OpenMP* threading software, which responds to the environment variable OMP_NUM_THREADS that sets the number of threads to use. Note that there are different means to set the number of threads. In Intel MKL releases earlier than 10.0, you could use the environment variable OMP_NUM_THREADS (see Setting the Number of Threads Using OpenMP* Environment Variable for details) or the equivalent OpenMP run-time function calls (detailed in section Changing the Number of Threads at Run Time). Starting with version 10.0, Intel MKL also offers variables that are independent of OpenMP, such as MKL_NUM_THREADS, and equivalent Intel MKL functions for thread management (see Using Additional Threading Control for details). The Intel MKL variables are always inspected first, then the OpenMP variables are examined, and if neither is used, the OpenMP software chooses the default number of threads. This is a change from Intel MKL versions 9.x and earlier, which used a default value of one, because the Intel compiler OpenMP software now sets the default number of threads equal to the number of processors in your system.
NOTE. In Intel MKL 10.1, the OpenMP* software determines the default number of threads. The default number of threads is equal to the number of logical processors in your system for Intel OpenMP* libraries.
To achieve higher performance, set the number of threads to the number of real processors or physical cores. Do this by any of the available means, which are summarized in section Techniques to Set the Number of Threads.
When choosing the appropriate technique, take into account the following rules:
- If you employ only the OpenMP techniques (OMP_NUM_THREADS and omp_set_num_threads()), as was the case with earlier Intel MKL versions, the library will still respond to them.
- The Intel MKL threading controls take precedence over the OpenMP techniques.
- A function call takes precedence over any environment variable. The exception is the OpenMP subroutine omp_set_num_threads(), which does not have precedence over Intel MKL environment variables, such as MKL_NUM_THREADS.
- The environment variables cannot be used to change run-time behavior in the course of the run, because they are read only once.
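For example, with both controls set as below (the values are illustrative), Intel MKL would use four threads, because the Intel MKL-specific variable is inspected first:

```shell
export MKL_NUM_THREADS=4    # Intel MKL-specific control, inspected first
export OMP_NUM_THREADS=8    # OpenMP control, consulted only if no Intel MKL control is set
./myprog                    # hypothetical executable linked with Intel MKL
```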
Here are several cases with recommendations depending on the threading model you employ:
Table 6-1	How to avoid conflicts in the execution environment for your threading model

Threading model: You thread the program using OS threads (pthreads on the Linux* OS).
Discussion: If more than one thread calls the library, and the function being called is threaded, it may be important that you turn off Intel MKL threading. Set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).

Threading model: You thread the program using OpenMP directives and/or pragmas and compile the program using a compiler other than a compiler from Intel.
Discussion: This is more problematic, because setting OMP_NUM_THREADS in the environment affects both the compiler's threading library and libiomp (libguide). In this case, try to choose the Threading layer library that matches the layered Intel MKL with the OpenMP compiler you employ (see Linking Examples on how to do this). If this is impossible, the sequential version of Intel MKL can be used as the Threading layer. To do this, link with the appropriate Threading layer library: libmkl_sequential.a or libmkl_sequential.so (see the High-level Directory Structure section in chapter 3).

Threading model: There are multiple programs running on a multiple-cpu system, as in the case of a parallelized program running using MPI for communication, in which each processor is treated as a node.
Discussion: The threading software will see multiple processors on the system even though each processor has a separate MPI process running on it. In this case, set the number of threads to one by any of the available means (see Techniques to Set the Number of Threads).
To avoid correctness and performance problems, you are also strongly encouraged to link dynamically with the Intel Compatibility OpenMP run-time library libiomp or the Intel Legacy OpenMP run-time library libguide.
// ******* C language *******
#include "omp.h"
#include "mkl.h"
#include <stdio.h>
#define SIZE 1000
void main(int args, char *argv[]){
    double *a, *b, *c;
    a = new double [SIZE*SIZE];
    b = new double [SIZE*SIZE];
    c = new double [SIZE*SIZE];
    double alpha=1, beta=1;
    int m=SIZE, n=SIZE, k=SIZE, lda=SIZE, ldb=SIZE, ldc=SIZE, i=0, j=0;
    char transa='n', transb='n';
    for( i=0; i<SIZE; i++){
        for( j=0; j<SIZE; j++){
            a[i*SIZE+j]= (double)(i+j);
            b[i*SIZE+j]= (double)(i*j);
            c[i*SIZE+j]= (double)0;
        }
    }
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    printf("row\ta\tc\n");
    for ( i=0;i<10;i++){
        printf("%d:\t%f\t%f\n", i, a[i*SIZE], c[i*SIZE]);
    }
    omp_set_num_threads(1);
    for( i=0; i<SIZE; i++){
        for( j=0; j<SIZE; j++){
            a[i*SIZE+j]= (double)(i+j);
            b[i*SIZE+j]= (double)(i*j);
            c[i*SIZE+j]= (double)0;
        }
    }
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    printf("row\ta\tc\n");
    for ( i=0;i<10;i++){
        printf("%d:\t%f\t%f\n", i, a[i*SIZE], c[i*SIZE]);
    }
    omp_set_num_threads(2);
    for( i=0; i<SIZE; i++){
        for( j=0; j<SIZE; j++){
            a[i*SIZE+j]= (double)(i+j);
            b[i*SIZE+j]= (double)(i*j);
            c[i*SIZE+j]= (double)0;
        }
    }
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    printf("row\ta\tc\n");
    for ( i=0;i<10;i++){
        printf("%d:\t%f\t%f\n", i, a[i*SIZE], c[i*SIZE]);
    }
    delete [] a;
    delete [] b;
    delete [] c;
}

// ******* Fortran language *******
PROGRAM DGEMM_DIFF_THREADS
      ALLOC_SIZE = 8*N*N
      A_PTR = MKL_MALLOC(ALLOC_SIZE,128)
      B_PTR = MKL_MALLOC(ALLOC_SIZE,128)
      C_PTR = MKL_MALLOC(ALLOC_SIZE,128)
      ALPHA = 1.1
      BETA = -1.2
      DO I=1,N
        DO J=1,N
          A(I,J) = I+J
          B(I,J) = I*J
          C(I,J) = 0.0
        END DO
      END DO
      CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)
      print *,'Row A C'
      DO i=1,10
        write(*,'(I4,F20.8,F20.8)') I, A(1,I), C(1,I)
      END DO
      CALL OMP_SET_NUM_THREADS(1)
      DO I=1,N
        DO J=1,N
          A(I,J) = I+J
          B(I,J) = I*J
          C(I,J) = 0.0
        END DO
      END DO
      CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)
      print *,'Row A C'
      DO i=1,10
        write(*,'(I4,F20.8,F20.8)') I, A(1,I), C(1,I)
      END DO
      CALL OMP_SET_NUM_THREADS(2)
      DO I=1,N
        DO J=1,N
          A(I,J) = I+J
          B(I,J) = I*J
          C(I,J) = 0.0
        END DO
      END DO
      CALL DGEMM('N','N',N,N,N,ALPHA,A,N,B,N,BETA,C,N)
      print *,'Row A C'
      DO i=1,10
        write(*,'(I4,F20.8,F20.8)') I, A(1,I), C(1,I)
      END DO
      STOP
      END
NOTE. Intel MKL does not always have a choice on the number of threads for certain reasons, such as system resources.
Employing Intel MKL threading controls in your application is optional. If you do not use them, the library will behave essentially the same way as Intel MKL 9.1 with respect to threading, with the possible exception of a different default number of threads. See Note on FFT Usage for the usage differences.
Table 6-2 lists the Intel MKL environment variables for threading control, their equivalent functions, and OMP counterparts:

Table 6-2  Intel MKL environment variables for threading control

Environment Variable        Service Function              Comment
MKL_NUM_THREADS             mkl_set_num_threads           Suggests the number of threads to use.
                                                          OMP counterpart: OMP_NUM_THREADS.
MKL_DOMAIN_NUM_THREADS      mkl_domain_set_num_threads    Suggests the number of threads for a
                                                          particular function domain.
MKL_DYNAMIC                 mkl_set_dynamic               Enables Intel MKL to dynamically change
                                                          the number of threads.
                                                          OMP counterpart: OMP_DYNAMIC.
NOTE. The functions take precedence over the respective environment variables. In particular, if you want Intel MKL to use a given number of threads in your application and do not want users of your application to change this via environment variables, set the number of threads by a call to mkl_set_num_threads(), which will have full precedence over any environment variables being set.
The example below illustrates the use of the Intel MKL function mkl_set_num_threads() to mimic the Intel MKL 9.x default behavior, that is, running on one thread.

Example 6-2  Setting the number of threads to one
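A minimal sketch of such an example follows. It assumes an Intel MKL installation providing mkl.h; the routine called afterwards is arbitrary and shown only as a placeholder.

```c
// Sketch: suggest one thread to Intel MKL, mimicking the MKL 9.x default.
#include "mkl.h"

int main(void) {
    mkl_set_num_threads(1);  // takes precedence over MKL_NUM_THREADS
    // ... calls to threaded Intel MKL routines, e.g. cblas_dgemm(...), go here
    return 0;
}
```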
MKL_DYNAMIC
The default value of MKL_DYNAMIC is TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.
MKL_DYNAMIC being TRUE means that Intel MKL will always try to pick what it considers the best number of threads, up to the maximum specified by the user. MKL_DYNAMIC being FALSE means that Intel MKL will normally try not to deviate from the number of threads the user requested.

Note, however, that setting MKL_DYNAMIC=FALSE does not ensure that Intel MKL will use the number of threads that you request, mainly because the library may have no choice on this number for such reasons as system resources. Moreover, the library may examine the problem and pick a number of threads different from the suggested value. For example, if you attempt a size-1 matrix-matrix multiply across 8 threads, the library may instead choose to use only one thread because using 8 threads in this case is impractical.

Note also that if Intel MKL is called from a parallel region, it will use only one thread by default. If you want the library to use nested parallelism, and the code within the parallel region is compiled with the same OpenMP compiler that Intel MKL uses, you may experiment with setting MKL_DYNAMIC to FALSE and manually increasing the number of threads. In general, set MKL_DYNAMIC to FALSE only under circumstances that Intel MKL is unable to detect, for example, when nested parallelism is desired and the library is called from an already parallel section.
MKL_DYNAMIC being TRUE, in particular, provides for optimal choice of the number of threads in the following cases:
If the requested number of threads exceeds the number of physical cores (perhaps because of hyper-threading), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.
If you are able to detect the presence of MPI, but cannot determine if it has been called in a thread-safe mode (it is impossible to detect this with MPICH 1.2.x, for instance), and MKL_DYNAMIC has not been changed from its default value of TRUE, Intel MKL will run one thread.
MKL_DOMAIN_NUM_THREADS
MKL_DOMAIN_NUM_THREADS accepts a string value <MKL-env-string>, which must have
the following format:
<MKL-env-string> ::= <MKL-domain-env-string> { <delimiter> <MKL-domain-env-string> }
<delimiter> ::= [ <space-symbol>* ] ( <space-symbol> | <comma-symbol> | <semicolon-symbol> | <colon-symbol> ) [ <space-symbol>* ]
<MKL-domain-env-string> ::= <MKL-domain-env-name> <uses> <number-of-threads>
<MKL-domain-env-name> ::= MKL_ALL | MKL_BLAS | MKL_FFT | MKL_VML
<uses> ::= [ <space-symbol>* ] ( <space-symbol> | <equality-sign> | <comma-symbol> ) [ <space-symbol>* ]
<number-of-threads> ::= <positive-number>
<positive-number> ::= <decimal-positive-number> | <octal-number> | <hexadecimal-number>
In the syntax above, MKL_BLAS indicates the BLAS function domain, MKL_FFT indicates non-cluster FFTs, and MKL_VML indicates the Vector Mathematics Library. For example,
MKL_ALL 2 : MKL_BLAS 1 : MKL_FFT 4
MKL_ALL=2 : MKL_BLAS=1 : MKL_FFT=4
MKL_ALL=2, MKL_BLAS=1, MKL_FFT=4
MKL_ALL=2; MKL_BLAS=1; MKL_FFT=4
MKL_ALL = 2 MKL_BLAS 1 , MKL_FFT 4
Table 6-3  Interpreting the value of MKL_DOMAIN_NUM_THREADS

Value of MKL_DOMAIN_NUM_THREADS   Interpretation
MKL_ALL=4                         Intel MKL is suggested to try four threads for all function domains.
NOTE. The domain-specific settings take precedence over the overall ones. For example, the "MKL_BLAS=4" value of MKL_DOMAIN_NUM_THREADS suggests trying 4 threads for BLAS, regardless of a later setting of MKL_NUM_THREADS, and a function call "mkl_domain_set_num_threads ( 4, MKL_BLAS );" suggests the same, regardless of later calls to mkl_set_num_threads(). However, note that a function call with the input "MKL_ALL", such as "mkl_domain_set_num_threads (4, MKL_ALL);", is equivalent to "mkl_set_num_threads(4)" and thus will be overwritten by later calls to mkl_set_num_threads(). Similarly, the environment setting of MKL_DOMAIN_NUM_THREADS with "MKL_ALL=4" will be overwritten by MKL_NUM_THREADS=2.
Whereas the MKL_DOMAIN_NUM_THREADS environment variable enables you to set several variables at once, for example, "MKL_BLAS=4,MKL_FFT=2", the corresponding function does not take string syntax. So, to do the same with function calls, you may need to make several calls, which in this example are as follows:
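A sketch of those calls, assuming the mkl_domain_set_num_threads() interface listed in Table 6-2 (arguments: thread count, then domain constant):

```c
// Sketch: function-call equivalent of MKL_DOMAIN_NUM_THREADS="MKL_BLAS=4,MKL_FFT=2"
mkl_domain_set_num_threads(4, MKL_BLAS);
mkl_domain_set_num_threads(2, MKL_FFT);
```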
For example,
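One way to set these controls in a bash-like shell (the particular values are illustrative only):

```shell
# Illustrative threading controls; adjust the values to your application.
export MKL_NUM_THREADS=2
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1,MKL_FFT=4"
export MKL_DYNAMIC=FALSE
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS MKL_DYNAMIC=$MKL_DYNAMIC"
```

Remember that the equivalent service functions, once called, take precedence over these variables.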
Coding Techniques
To obtain the best performance with Intel MKL, ensure the following data alignment in your source code:
- arrays are aligned at 16-byte boundaries
- leading dimension values (n*element_size) of two-dimensional arrays are divisible by 16
- for two-dimensional arrays, leading dimension values divisible by 2048 are avoided.
call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info),
where a is an array of dimension lda-by-n, which requires at least N*N elements, instead of
call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info),
where ap is an array of dimension N*(N+1)/2.
FFT functions
There are additional conditions which improve performance of the FFT functions.
Applications based on IA-32 or Intel 64 architecture. The addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by cache line size, which equals
- 32 bytes for the Pentium III processor
- 64 bytes for the Pentium 4 processor
- 128 bytes for a processor using Intel 64 architecture.
Leading dimension values, in bytes (n*element_size), of two-dimensional arrays should not be a power of two.
Suppose:
C code presented in Example 6-3 solves the problem. The code example calls the system function sched_setaffinity to bind the threads to the cores on different sockets. After that the Intel MKL FFT function is called. Compile your application with the Intel compiler using the following command:
icc test_application.c -openmp

where test_application.c is the filename for the application.
Build the application and run it in 2 threads:
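A sketch of this step, assuming a bash shell and the default executable name a.out:

```shell
# Sketch: request two threads, then launch the application.
export OMP_NUM_THREADS=2
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS ./a.out"   # command shown rather than executed here
```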
Example 6-3 Setting an affinity mask by operating system means using an Intel compiler
#include <stdio.h>
#define __USE_GNU
// Set affinity mask
#include <sched.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    int NCPUs = sysconf(_SC_NPROCESSORS_CONF);
    printf("Using thread affinity on %i NCPUs\n", NCPUs);
    #pragma omp parallel default(shared)
    {
        cpu_set_t new_mask;
        cpu_set_t was_mask;
        int tid = omp_get_thread_num();
        CPU_ZERO(&new_mask);
        // 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
        CPU_SET(tid==0 ? 0 : 2, &new_mask);
        if (sched_getaffinity(0, sizeof(was_mask), &was_mask) == -1) {
            printf("Error: sched_getaffinity(%d, sizeof(was_mask), &was_mask)\n", tid);
        }
        if (sched_setaffinity(0, sizeof(new_mask), &new_mask) == -1) {
            printf("Error: sched_setaffinity(%d, sizeof(new_mask), &new_mask)\n", tid);
        }
        printf("tid=%d new_mask=%08X was_mask=%08X\n", tid,
               *(unsigned int*)(&new_mask), *(unsigned int*)(&was_mask));
    }
    // Call Intel MKL FFT function
    return 0;
}
See the Linux Programmer's Manual (in man pages format) for particulars of the sched_setaffinity function used in the above example.
Operating on Denormals
If an Intel MKL function operates on denormals, that is, non-zero numbers that are smaller than the smallest possible non-zero number supported by a given floating-point format, or produces denormals during the computation (for instance, if the incoming data is too close to the underflow threshold), you may experience a considerable performance drop. The CPU state may be set so that floating-point operations on denormals invoke the exception handler, which slows down the application.

To resolve the issue, before compiling the main program, turn on the -ftz option if you are using the Intel compiler or any other compiler that can control this feature. In this case, denormals are treated as zeros at the processor level and the exception handler is not invoked. Note, however, that setting this option slightly impacts accuracy. Another way to bring the performance back to normal is proper scaling of the input data to avoid numbers near the underflow threshold.
To get the amount of memory allocated by the memory management software, call the MKL_MemStat() function. If at some point your program needs to free memory, it may do so with a call to MKL_FreeBuffers(). If another call is made to a library function that needs a memory buffer, the memory manager will again allocate the buffers, and they will again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report it as a memory leak.

You can release memory in your program through the use of a function made available in Intel MKL, or you can force memory releasing after each call by setting an environment variable. The memory management software is turned on by default. To disable the software using the environment variable, set MKL_DISABLE_FAST_MM to any value, which will cause memory to be allocated and freed from call to call. Disabling this feature will negatively impact performance of routines such as the level 3 BLAS, especially for small problem sizes.

Using one of these methods to release memory will not necessarily stop programs from reporting memory leaks and, in fact, may increase the number of such reports if you make multiple calls to the library, thereby requiring new allocations with each call. Memory not released by one of the methods described will be released by the system when the program ends.
Memory renaming
In general, if users employ their own memory management functions instead of the similar system functions (malloc, free, calloc, and realloc), the memory actually gets managed by two independent memory management packages, which may cause memory issues. To prevent such issues, the memory renaming feature was introduced in certain Intel libraries, in particular in Intel MKL. This feature enables users to redefine the memory management functions.

Redefining is possible because Intel MKL actually uses pointers to memory functions (i_malloc, i_free, i_calloc, i_realloc) rather than the functions themselves. These pointers initially hold the addresses of the respective system memory management functions (malloc, free, calloc, realloc) and are visible at the application level, so the pointer values can be redefined programmatically. Once a user has redirected these pointers to their own respective memory management functions, the memory will be managed with the user-defined functions rather than the system ones. As only one (user-defined) memory management package is then in operation, the issues are avoided.
Intel MKL memory management by default uses standard C run-time memory functions to allocate or free memory. These functions can be replaced using memory renaming.
#include "i_malloc.h"
. . .
i_malloc  = my_malloc;
i_calloc  = my_calloc;
i_realloc = my_realloc;
i_free    = my_free;
. . .
// Now you may call Intel MKL functions
Intel Math Kernel Library (Intel MKL) provides support mainly for Fortran and C/C++ programming. However, not all function domains support both Fortran and C interfaces (see Table A-1 in Appendix A). For example, LAPACK has no C interface. Still, you can call functions from these domains in C using mixed-language programming. Moreover, even if you want to use LAPACK or BLAS, which mainly support Fortran, in a Fortran 95 environment, additional effort is initially required to build the language-specific interface libraries and modules, which are delivered as source code.

This chapter focuses on mixed-language programming and the use of language-specific interfaces. It expands upon the use of Intel MKL in C language environments for function domains that mainly support Fortran, and it explains the usage of language-specific interfaces, in particular, the Fortran 95 interfaces to LAPACK and BLAS. In this connection, compiler-dependent functions are discussed to explain why Fortran 90 modules are supplied as source code. A separate section guides you through the process of running examples that invoke Intel MKL functions from Java.
Table 7-1  Interface libraries and modules

File name
libfftw2xf_intel.a
libfftw2xf_gnu.a
libfftw3xc_intel.a
libfftw3xc_gnu.a
libfftw3xf_intel.a
libfftw3xf_gnu.a
libfftw2x_cdft_SINGLE.a
libfftw2x_cdft_DOUBLE.a
mkl95_blas.mod
mkl95_lapack.mod
mkl95_precision.mod
Section Fortran 95 Interfaces and Wrappers to LAPACK and BLAS shows by example how these libraries and modules are generated.
As a result, the required library and the respective .mod file will be built and installed in the standard directory of the release. The .mod files can also be built from the interface source files using the compiler command
Where the dependencies do arise, supporting RTL is shipped with Intel MKL. The only examples of such RTLs, except those that are relevant to the Intel MKL cluster software, are libiomp and libguide, the libraries for OpenMP* code compiled with an Intel compiler; libiomp and libguide support the threaded code in Intel MKL. In other cases where RTL dependencies might arise, the functions are delivered as source code, and it is the responsibility of the user to compile the code with whatever compiler is employed. In particular, Fortran 90 modules result in compiler-specific code generation requiring RTL support, so Intel MKL delivers these modules as source code.
WARNING. Avoid calling BLAS95/LAPACK95 from C/C++. Such calls require skills in manipulating the descriptor of a deferred-shape array, which is the Fortran 90 type. Moreover, BLAS95/LAPACK95 routines contain links to a Fortran RTL.
LAPACK
As LAPACK routines are Fortran-style, when calling them from C-language programs, make sure that you follow the Fortran-style calling conventions:
- Pass variables by address as opposed to passing by value. Function calls in Example 7-2 and Example 7-3 illustrate this.
- Store your data Fortran-style, that is, in column-major rather than row-major order.
With row-major order, adopted in C, the last array index changes most quickly and the first one changes most slowly when traversing the memory segment where the array is stored. With Fortran-style column-major order, the last index changes most slowly whereas the first one changes most quickly (as illustrated by Figure 7-1 for a 2D array). Figure 7-1 Column-major order vs. row-major order
For example, if a two-dimensional matrix A of size m x n is stored densely in a one-dimensional array B, you can access a matrix element like this:
A[i][j] = B[i*n+j] in C (i=0, ... , m-1, j=0, ... , n-1)
A(i,j) = B((j-1)*m+i) in Fortran (i=1, ... , m, j=1, ... , n).
When calling LAPACK routines from C, also keep in mind that LAPACK routine names can be either upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent: dgetrf, DGETRF, dgetrf_, DGETRF_.
BLAS
BLAS routines are Fortran-style routines. If you call BLAS routines from a C-language program, you must follow the Fortran-style calling conventions:
- Pass variables by address as opposed to passing by value.
- Store data Fortran-style, that is, in column-major rather than row-major order.
Refer to the LAPACK section for details of these conventions. See Example 7-2 on how to call BLAS routines from C.
When calling BLAS routines from C, also keep in mind that BLAS routine names can be either upper-case or lower-case, with or without the trailing underscore. For example, the following names are equivalent: dgemm, DGEMM, dgemm_, DGEMM_.
CBLAS
An alternative for calling BLAS routines from a C-language program is to use the CBLAS interface. CBLAS is a C-style interface to the BLAS routines. You can call CBLAS routines using regular C-style calls. When using the CBLAS interface, the header file mkl.h will simplify the program development as it specifies enumerated values as well as prototypes of all the functions. The header determines if the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation. Example 7-3 illustrates the use of CBLAS interface.
When compiling as C++, you can redefine the Intel MKL complex types on the compiler command line, for example:

-DMKL_Complex8=std::complex<float> -DMKL_Complex16=std::complex<double>.
Calling BLAS Functions That Return the Complex Values in C/C++ Code
You must be careful when handling a call from C to a BLAS function that returns complex values. The problem arises because these are Fortran functions and complex return values are handled quite differently in C and Fortran. However, in addition to normal function calls, Fortran enables calling functions as though they were subroutines, which provides a mechanism for returning the complex value correctly when the function is called from a C program. When a Fortran function is called as a subroutine, the return value shows up as the first parameter in the calling sequence. This feature can be exploited by the C programmer. The following calls show how this works.

Normal Fortran function call:
    result = cdotc( n, x, 1, y, 1 )
A call to the function as a subroutine:
    call cdotc( result, n, x, 1, y, 1 )
A call to the function from C (notice that the hidden parameter gets exposed):
    cdotc( &result, &n, x, &one, y, &one );
NOTE. Intel MKL has both upper-case and lower-case entry points in BLAS, with trailing underscore or not. So, all these names are acceptable: cdotc, CDOTC, cdotc_, CDOTC_.
Using the above example, you can call from C, and thus, from C++, several level 1 BLAS functions that return complex values. However, it is still easier to use the CBLAS interface. For instance, you can call the same function using the CBLAS interface as follows:
cblas_cdotc_sub( n, x, 1, y, 1, &result );
NOTE. The complex value comes back expressly in this case. The following example illustrates a call from a C program to the complex BLAS Level 1 function zdotc(). This function computes the dot product of two double-precision complex vectors. In this example, the complex dot product is returned in the structure c.
Example 7-1  Calling a complex BLAS Level 1 function from C

#include <stdio.h>
#include "mkl.h"
#define N 5
void main() {
    int n, inca = 1, incb = 1, i;
    typedef struct{ double re; double im; } complex16;
    complex16 a[N], b[N], c;
    void zdotc();
    n = N;
    for( i = 0; i < n; i++ ){
        a[i].re = (double)i;
        a[i].im = (double)i * 2.0;
        b[i].re = (double)(n - i);
        b[i].im = (double)i * 2.0;
    }
    zdotc( &c, &n, a, &inca, b, &incb );
    printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
}
Below is the C++ implementation: Example 7-2 Calling a complex BLAS Level 1 function from C++
#include <complex>
#include <iostream>
#define MKL_Complex16 std::complex<double>
#include "mkl.h"
#define N 5
int main() {
    int n, inca = 1, incb = 1, i;
    std::complex<double> a[N], b[N], c;
    n = N;
    for( i = 0; i < n; i++ ){
        a[i] = std::complex<double>(i, i*2.0);
        b[i] = std::complex<double>(n-i, i*2.0);
    }
    zdotc( &c, &n, a, &inca, b, &incb );
    std::cout << "The complex dot product is: " << c << std::endl;
    return 0;
}
The implementation below uses CBLAS: Example 7-3 Using CBLAS interface instead of calling BLAS directly from C
#include <stdio.h>
#include "mkl.h"
typedef struct{ double re; double im; } complex16;
extern "C" void cblas_zdotc_sub ( const int, const complex16 *, const int,
                                  const complex16 *, const int, const complex16* );
#define N 5
void main() {
    int n, inca = 1, incb = 1, i;
    complex16 a[N], b[N], c;
    n = N;
    for( i = 0; i < n; i++ ){
        a[i].re = (double)i;
        a[i].im = (double)i * 2.0;
        b[i].re = (double)(n - i);
        b[i].im = (double)i * 2.0;
    }
    cblas_zdotc_sub( n, a, inca, b, incb, &c );
    printf( "The complex dot product is: ( %6.2f, %6.2f)\n", c.re, c.im );
}
matrices. The library uses an expression template technique for passing expressions as function arguments, which enables evaluating vector and matrix expressions in one pass without temporary matrices. uBLAS distinguishes two modes:
- Debug (safe) mode, the default. Type and conformance checking is performed.
- Release (fast) mode. Turned on with the NDEBUG preprocessor symbol.
The documentation for the latest version of Boost uBLAS is available at www.boost.org/doc/libs/1_35_0/libs/numeric/ublas/doc/index.htm. Intel MKL provides overloaded prod() functions for substituting uBLAS dense matrix-matrix multiplication with Intel MKL gemm calls. Though these functions break uBLAS expression templates and introduce temporary matrices, the performance advantage can be considerable for matrix sizes that are not too small (roughly, over 50). You do not need to change your source code to use the functions. To call them:
- Include the header file mkl_boost_ublas_matrix_prod.hpp in your code
- Add appropriate Intel MKL libraries to the link line (refer to the Linking Your Application with Intel Math Kernel Library chapter for details).
prod( m1, m2 )
prod( trans(m1), m2 )
prod( trans(conj(m1)), m2 )
prod( conj(trans(m1)), m2 )
prod( m1, trans(m2) )
prod( trans(m1), trans(m2) )
prod( trans(conj(m1)), trans(m2) )
prod( conj(trans(m1)), trans(m2) )
prod( m1, trans(conj(m2)) )
prod( trans(m1), trans(conj(m2)) )
prod( trans(conj(m1)), trans(conj(m2)) )
prod( conj(trans(m1)), trans(conj(m2)) )
prod( m1, conj(trans(m2)) )
prod( trans(m1), conj(trans(m2)) )
prod( trans(conj(m1)), conj(trans(m2)) )
<mkl_directory>/examples/ublas/source/sylvester.cpp file illustrates usage of the Intel MKL uBLAS header file for solving a special case of the Sylvester equation.
To run the Intel MKL ublas examples, specify the BOOST_ROOT parameter in the make command, for instance,
<mkl directory>/examples/java .
The examples are provided for the following MKL functions:
- the ?gemm, ?gemv, and ?dot families from CBLAS
- the complete set of non-cluster FFT functions
- ESSL-like functions for 1-dimensional convolution and correlation
- VSL Random Number Generators (RNG), except user-defined ones and file subroutines
- VML functions, except GetErrorCallBack, SetErrorCallBack, and ClearErrorCallBack.
<mkl directory>/examples/java/examples .
The examples are written in Java. They demonstrate usage of the MKL functions with the following variety of data:
- 1- and 2-dimensional data sequences
- real and complex types of the data
- single and double precision.

The examples also:
- demonstrate the use of huge arrays (>2 billion elements)
- demonstrate processing of arrays in native memory
- check correctness of function parameters
- demonstrate performance optimizations.
To bind with Intel MKL, the examples use the Java Native Interface (JNI* developer framework). The JNI documentation to start with is available from http://java.sun.com/j2se/1.5.0/docs/guide/jni/index.html. The Java example set includes JNI wrappers which perform the binding. The wrappers do not depend on the examples and may be used in your Java applications. The wrappers for CBLAS, FFT, VML, VSL RNG, and ESSL-like convolution and correlation functions do not depend on each other.

To build the wrappers, just run the examples (see the Running the examples section for details). The makefile builds the wrapper binaries and the examples, invoked after that, double-check if the wrappers are built correctly. As a result of running the examples, the following directories will be created in <mkl directory>/examples/java:
The directories docs, include, classes, and bin will contain the wrapper binaries and documentation; the directory _results will contain the testing results. For a Java programmer, the wrappers look like the following Java classes:
com.intel.mkl.VSL
Documentation for the particular wrapper and example classes will be generated from the Java sources during building and running the examples. To browse the documentation, start from the index file in the docs directory which will be created by the build script:
<mkl directory>/examples/java/docs/index.html .
The Java wrappers for CBLAS, VML, VSL RNG, and FFT establish the interface that directly corresponds to the underlying native functions and you can refer to the Intel MKL Reference Manual for their functionality and parameters. Interfaces for the ESSL-like functions are described in the generated documentation for the com.intel.mkl.ESSL class. Each wrapper consists of the interface part for Java and JNI stub written in C. You can find the sources in the following directory:
<mkl directory>/examples/java/wrappers .
Both Java and C parts of the wrapper for CBLAS and VML demonstrate the straightforward approach, which you may easily employ to cover additional CBLAS functions. The wrapper for FFT is more complicated because of supporting the lifecycle for FFT descriptor objects. To compute a single Fourier transform, an application needs to call the FFT software several times with the same copy of the native FFT descriptor. The wrapper provides the handler class to hold the native descriptor while virtual machine runs Java bytecode. The wrapper for VSL RNG is similar to the one for FFT. The wrapper provides the handler class to hold the native descriptor of the stream state. The wrapper for the convolution and correlation functions mitigates the same difficulty of the VSL interface, which assumes similar lifecycle for "task descriptors". The wrapper utilizes the ESSL-like interface for those functions, which is simpler for the case of 1-dimensional data. The JNI stub additionally enwraps the MKL functions into the ESSL-like wrappers written in C and so "packs" the lifecycle of a task descriptor into a single call to the native method. The wrappers meet the JNI Specification versions 1.1 and 5.0 and should work with virtually every modern implementation of Java. The examples and the Java part of the wrappers are written for the Java language described in The Java Language Specification (First Edition) and extended with the feature of "inner classes" (this refers to late 1990s). This level of language version is supported by all versions of Sun JDK* developer toolkit and compatible implementations starting from version 1.1.5, that is, by all modern versions of Java.
The level of C language is "Standard C" (that is, C89) with additional assumptions about integer and floating-point data types required by the Intel MKL interfaces and the JNI header files. That is, the native float and double data types are required to be the same as JNI jfloat and jdouble data types, respectively, and the native int is required to be 4-byte long.
NOTE. The implementation from the Sun Microsystems Corporation supports only processors using IA-32 and Intel 64 architectures. The implementation from BEA Systems supports Intel Itanium 2 processors as well. Also note that the Java run-time environment* (JRE*) system, which may be pre-installed on your computer, is not enough. You need the JDK* developer toolkit that supports the following set of tools: java javac javah javadoc
To make these tools available for the examples makefile, set up the JAVA_HOME environment variable and add the JDK binaries directory to the system PATH, for example:
unset JDK_HOME
To start the examples, use the makefile found in the Intel MKL Java examples directory:
Known limitations
There are three kinds of limitations:
- functionality
- performance
- known bugs.
Functionality. It is possible that some MKL functions will not work if called from the Java environment via a wrapper, like those provided with the Intel MKL Java examples. Only the specific CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions listed in the Intel MKL Java examples section were tested with the Java environment. So, you may use the Java wrappers for these CBLAS, FFT, VML, VSL RNG, and convolution/correlation functions in your Java applications.
Performance. The functions from Intel MKL should work faster than similar functions written in pure Java. However, note that performance was not the main goal for these wrappers; the intent was to give code examples. So, an Intel MKL function called from a Java application will probably work slower than the same function called from a program written in C/C++ or Fortran.
Known bugs. There are a number of known bugs in Intel MKL (identified in the Release Notes), and there are incompatibilities between different versions of JDK. The examples and wrappers include workarounds for these problems to make the examples work anyway. Source code of the examples and wrappers includes comments that describe the workarounds.
Coding Tips
This chapter continues the discussion of programming with Intel Math Kernel Library (Intel MKL). Whereas chapter 7 focuses on general language-specific programming options, this one presents coding tips that may help meet certain specific needs. Currently, the only tip given advises how to achieve numerical stability. You can find other coding tips, relevant to performance and memory management, in chapter 6.
Unlike the first two conditions, which are under the user's control, the alignment of arrays, by default, is not. For instance, arrays dynamically allocated using malloc are aligned at 8-byte, but not necessarily at 16-byte, boundaries. If you need numerically stable output, use MKL_malloc() to get a properly aligned workspace:
// ******* C language *******
...
#include <stdlib.h>
...
void *darray;
int workspace;
...
// Allocate workspace aligned on a 16-byte boundary
darray = MKL_malloc( sizeof(double)*workspace, 16 );
...
// call the program using Intel MKL
mkl_app( darray );
...
// Free workspace
MKL_free( darray );

! ******* Fortran language *******
...
double precision darray
pointer (p_wrk,darray(1))
integer workspace
...
! Allocate workspace aligned on a 16-byte boundary
p_wrk = mkl_malloc( 8*workspace, 16 )
...
! call the program using Intel MKL
call mkl_app( darray )
...
! Free workspace
call mkl_free(p_wrk)
This chapter discusses the usage of Intel MKL ScaLAPACK and Cluster FFTs. It mainly describes how to link your application with these function domains, includes C- and Fortran-specific linking examples, and gives information on the supported MPI implementations. See Table 3-7 in chapter 3 for the detailed Intel MKL directory structure. For information on the available documentation and the doc directory, see Table 3-7 in the same chapter. For information on the MP LINPACK Benchmark for Clusters, see section Intel Optimized MP LINPACK Benchmark for Clusters in chapter 11.

Intel MKL ScaLAPACK and Cluster FFTs support MPICH-1.2.x and Intel MPI. To link a program that calls ScaLAPACK, you first need to know how to link an MPI application. Typically, this involves using MPI scripts mpicc or mpif77 (C or FORTRAN 77 scripts) that use the correct MPI header files, and so on. If, for instance, you are using MPICH installed in /opt/mpich, then typically /opt/mpich/bin/mpicc and /opt/mpich/bin/mpif77 are the compiler scripts, and /opt/mpich/lib/libmpich.a is the library used for that installation.
<<MPI> linker script> <files to link> -L<MKL path> [-Wl,--start-group] <MKL Cluster Library> <BLACS> <MKL Core Libraries> [-Wl,--end-group]
where
<MPI> is one of several MPI implementations (MPICH, Intel MPI 2.x, Intel MPI 3.x)
<BLACS> is one of the BLACS libraries for the appropriate architecture, which are listed in Table 3-6; for example, for IA-32 architecture, it is one of -lmkl_blacs, -lmkl_blacs_intelmpi, and -lmkl_blacs_openmpi;
<MKL Cluster Library> is one of the ScaLAPACK or Cluster FFT libraries for the appropriate architecture, which are listed in Table 3-6; for example, for IA-32 architecture, it is one of -lmkl_scalapack_core and -lmkl_cdft_core;
<MKL Core Libraries> is <MKL LAPACK & MKL kernel libraries> for ScaLAPACK, and <MKL kernel libraries> for Cluster FFTs;
<MKL kernel libraries> are the processor-optimized kernels, threading library, and system library for threading support, linked as described at the beginning of section Link Command Syntax in chapter 5;
<MKL LAPACK & kernel libraries> are the LAPACK library and <MKL kernel libraries>;
the grouping symbols -Wl,--start-group and -Wl,--end-group are required in case of static linking.

For example, if you are using Intel MPI 3.x, wish to statically use the LP64 interface with ScaLAPACK, and want only one MPI process per core (and thus do not employ threading), provide the following linker options:
-L$MKLPATH -I$MKLINCLUDE -Wl,--start-group \
  $MKLPATH/libmkl_scalapack_lp64.a \
  $MKLPATH/libmkl_blacs_intelmpi_lp64.a \
  $MKLPATH/libmkl_intel_lp64.a \
  $MKLPATH/libmkl_sequential.a \
  $MKLPATH/libmkl_core.a \
  -static_mpi \
  -Wl,--end-group -lpthread -lm
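The correspondence between this command and the general template can be made explicit by assembling the same options from shell variables. This is only an illustration; the MKLPATH value below is a placeholder for your actual installation path:

```shell
# Placeholder MKL library path -- adjust for your installation.
MKLPATH=/opt/intel/mkl/lib/em64t
MKL_CLUSTER_LIB="$MKLPATH/libmkl_scalapack_lp64.a"     # <MKL Cluster Library>
BLACS_LIB="$MKLPATH/libmkl_blacs_intelmpi_lp64.a"      # <BLACS>
MKL_CORE_LIBS="$MKLPATH/libmkl_intel_lp64.a $MKLPATH/libmkl_sequential.a $MKLPATH/libmkl_core.a"
# The grouping symbols are required for static linking:
LINK_OPTS="-Wl,--start-group $MKL_CLUSTER_LIB $BLACS_LIB $MKL_CORE_LIBS -Wl,--end-group -lpthread -lm"
echo "$LINK_OPTS"
```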
For more examples, see Examples for Linking with ScaLAPACK and Cluster FFT. Note that the <<MPI> linker script> and the <BLACS> library should correspond to the MPI version. For instance, for Intel MPI 2.x, the <Intel MPI 2.x linker script> and the libmkl_blacs_intelmpi libraries are used. To link with Intel MPI 3.0 or 3.1, use the corresponding linker script and the libmkl_blacs_intelmpi libraries as well, as in the example above.
For information on linking with Intel MKL libraries, see Chapter 5 Linking Your Application with Intel Math Kernel Library.
By default, the value is the number of CPUs according to the OS. Be cautious to avoid over-prescribing the number of threads, which may occur, for instance, when the number of MPI ranks per node and the number of threads per node are both greater than one. The best way to set an environment variable, such as OMP_NUM_THREADS, is in the login environment. Remember that mpirun starts a fresh default shell on all of the nodes, so changing this value on the head node and then doing the run (which works on an SMP system) will not effectively change the variable as far as your program is concerned. In .bashrc, you could add a line at the top, which looks like this:
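For instance, the line could look like this (the value 4 is purely illustrative; choose the number of threads per MPI rank that suits your nodes):

```shell
# Example thread count only -- set to the threads per MPI rank for your cluster.
export OMP_NUM_THREADS=4
```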
ScaLAPACK Tests
To build the NetLib ScaLAPACK tests, for the IA-32 architecture, add libmkl_scalapack_core.a to your link command; for the IA-64 and Intel 64 architectures, add libmkl_scalapack_lp64.a or libmkl_scalapack_ilp64.a, depending upon the desired interface.
To link with ScaLAPACK for a cluster of systems based on the IA-32 architecture, use the following libraries:

-lmkl_scalapack_core \
-lmkl_blacs_intelmpi \
-lmkl_intel \
-lmkl_intel_thread \
-lmkl_lapack \
-lmkl_core \
-liomp5 -lpthread
To link with Cluster FFT for a cluster of systems based on the IA-32 architecture, use the following libraries:

/opt/mpich/bin/mpicc <user files to link> \
  $MKLPATH/libmkl_cdft_core.a \
  $MKLPATH/libmkl_blacs_intelmpi.a \
  $MKLPATH/libmkl_intel.a \
  $MKLPATH/libmkl_intel_thread.a \
  $MKLPATH/libmkl_core.a \
  -liomp5 -lpthread
To link with ScaLAPACK for a cluster of systems based on the IA-64 architecture, use the following libraries:

-lmkl_scalapack_lp64 \
-lmkl_blacs_intelmpi_lp64 \
-lmkl_intel_lp64 \
-lmkl_intel_thread \
-lmkl_lapack \
-lmkl_core \
-liomp5 -lpthread

To link with Cluster FFT for a cluster of systems based on the IA-64 architecture, use the following libraries:
/opt/intel/mpi/3.0/bin/mpiifort <user files to link> \
  $MKLPATH/libmkl_cdft_core.a \
  $MKLPATH/libmkl_blacs_intelmpi_ilp64.a \
  $MKLPATH/libmkl_intel_ilp64.a \
  $MKLPATH/libmkl_intel_thread.a \
  $MKLPATH/libmkl_core.a \
  -liomp5 -lpthread
A binary linked with ScaLAPACK runs in the same way as any other MPI application (for information, refer to the documentation that comes with your MPI implementation). For instance, the script mpirun is used in the case of MPICH2 and OpenMPI, and the number of MPI processes is set by -np. In the case of MPICH 2.0 and all Intel MPIs, you should start the daemon before running an application; the execution is driven by the script mpiexec. For further linking examples, see the Intel MKL support website at http://www.intel.com/support/performancetools/libraries/mkl/.
10
This chapter discusses features of the Intel Math Kernel Library (Intel MKL) which software engineers can benefit from when working in the Eclipse* IDE:

Viewing the Intel MKL documentation in the Eclipse IDE
Searching the Intel Web site from within Eclipse
Context-sensitive Help (infopop windows and F1 Help)
Code/Content Assist

The first three features are provided through the Intel MKL plugin for Eclipse Help (see Table 3-1 in chapter 3 for the plugin location after installation). To use the plugin, place it into the plugins folder of your Eclipse directory. The last feature is native to the Eclipse CDT.
The Intel MKL Help Index is also available in Eclipse, and the Reference Manual is included in the Eclipse Help search.
Figure 10-1
Infopop window
An infopop window is a popup description of a C function. To obtain the description of an Intel MKL function whose name is typed in the editor, place the cursor over the function name.

Figure 10-3 Infopop Window with an Intel MKL Function Description
F1 Help
F1 Help displays the list of documentation topics relevant to a keyword. To get F1 Help for an Intel MKL function whose name is typed in the editor window:
1. Place the cursor over the function name.
2. Press F1. This causes two lists to display:
The list of links to the relevant topics in the product documentation displays in the Related Topics page under See also. The Intel MKL Help Index establishes the relevance (see Figure 10-4). Typically, one link displays in this list for each function.
The list of search results for the function name displays in the Related Topics page under Dynamic Help (see Figure 10-5).
3. Click a needed link to open the Help topic.

Figure 10-4 F1 Help in the Eclipse* IDE
Figure 10-5
Code/Content Assist

To be prompted for the completion of the name of an Intel MKL function or a named constant in the code window:
1. Type the first few characters of the name in your code line.
2. Press Ctrl + SPACEBAR. The prompt info appears in a popup.
Figure 10-6
To insert an element if it is the only item in the list when Content Assist is invoked, check Insert single proposals automatically.
To display proposals in alphabetical order, rather than by relevance, check Present proposals in alphabetical order.
To change the amount of time Content Assist is permitted to parse proposals, type the value in the Content Assist parsing timeout text box.
To enable alternative triggers for Content Assist, check the appropriate check boxes under Auto activation.
To change the delay before Content Assist is automatically invoked for the triggers (see above), type the new delay in the delay text box under Auto activation.
To change the background color of the Content Assist dialog box, click the color palette button next to Background for completion proposals.
To change the foreground color of the Content Assist dialog box, click the color palette button next to Foreground for completion proposals.
5. Click OK.
Figure 10-7
11
This chapter describes the Intel Optimized LINPACK Benchmark for the Linux* OS and Intel Optimized MP LINPACK Benchmark for Clusters.
Contents
The Intel Optimized LINPACK Benchmark for Linux* contains the following files, located in the ./benchmarks/linpack/ subdirectory in the Intel MKL directory (see Table 3-1):
Table 11-1
./benchmarks/linpack/
linpack_itanium
linpack_xeon32
linpack_xeon64
runme_itanium
runme_xeon32
To run the software for other problem sizes, please refer to the extended help included with the program. Extended help can be viewed by running the program executable with the "-e" option, for example:

./linpack_xeon64 -e
If the system has less memory than the above sample data inputs require, you may have to edit or create your own data input files, as directed in the extended help. Each sample script, in particular, uses the OMP_NUM_THREADS environment variable to set the number of processors it is targeting. To optimize performance on a different number of physical processors, change that line appropriately. If you run the Intel Optimized LINPACK Benchmark without setting the number of threads, it will default to the number of cores according to the OS. You can find the settings for this environment variable in the runme_* sample scripts. If the settings do not already match the situation for your machine, edit the script.
Known Limitations
The following limitations are known for the Intel Optimized LINPACK Benchmark for Linux*:

Intel Optimized LINPACK Benchmark is threaded to effectively use multiple processors. So, in multi-processor systems, best performance will be obtained with Hyper-Threading technology turned off, which ensures that the operating system assigns threads to physical processors only.

If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.
NOTE. If you wish to use a different version of MPI, you can do so by using the MP LINPACK source provided.
The package includes software developed at the University of Tennessee, Knoxville, Innovative Computing Laboratories, and neither the University nor ICL endorses or promotes this product. Although HPL 1.0a is redistributable under certain conditions, this particular package is subject to the Intel MKL license. Intel MKL 10.0 Update 3 introduced new functionality into MP LINPACK, called the hybrid build, while continuing to support the older version. The term hybrid refers to special optimizations added to take advantage of mixed OpenMP*/MPI parallelism. If you want to use one MPI process per node and to achieve further parallelism via OpenMP, use the hybrid build. If you want to rely exclusively on MPI for parallelism and use one MPI process per core, use the non-hybrid build. In addition to supplying certain hybrid prebuilt binaries, Intel MKL supplies certain hybrid prebuilt libraries to take advantage of the additional OpenMP optimizations.
Note that the non-hybrid version may be used in a hybrid mode, but it would be missing some of the optimizations added to the hybrid version. Non-hybrid builds are the default. In many cases, the use of the hybrid mode is required for system reasons, but if there is a choice, the non-hybrid code may be faster, although that may change in future releases. To use the non-hybrid code in a hybrid mode, use the threaded MPI and Intel MKL, link with a thread-safe MPI, and call the MPI_Init_thread() function to indicate that MPI needs to be thread-safe.
Contents
The Intel Optimized MP LINPACK Benchmark for Clusters includes the HPL 1.0a distribution in its entirety as well as the modifications, delivered in the files listed in Table 11-2 and located in the ./benchmarks/mp_linpack/ subdirectory in the Intel MKL directory (see Table 3-1):
Table 11-2
./benchmarks/mp_linpack/
testing/ptest/HPL_pdtest.c
src/blas/HPL_dgemm.c
src/grid/HPL_grid_init.c
src/pgesv/HPL_pdgesvK2.c
include/hpl_misc.h and hpl_pgesv.h
src/pgesv/HPL_pdgesv0.c
testing/ptest/HPL.dat
Make.ia32
Make.em64t
Make.ipf
HPL.dat
Table 11-2 (continued) ./benchmarks/mp_linpack/

The next six files are prebuilt executables, readily available for simple performance testing.

bin_intel/ia32/xhpl_ia32             (New) Prebuilt binary for IA-32 architecture, Linux, and Intel MPI 3.0.
bin_intel/em64t/xhpl_em64t           (New) Prebuilt binary for Intel 64 architecture, Linux, and Intel MPI 3.0.
bin_intel/ipf/xhpl_ipf               (New) Prebuilt binary for IA-64 architecture, Linux, and Intel MPI 3.0.
bin_intel/ia32/xhpl_hybrid_ia32      (New) Prebuilt hybrid binary for IA-32 architecture, Linux, and Intel MPI 3.0.
bin_intel/em64t/xhpl_hybrid_em64t    (New) Prebuilt hybrid binary for Intel 64 architecture, Linux, and Intel MPI 3.0.
bin_intel/ipf/xhpl_hybrid_ipf        (New) Prebuilt hybrid binary for IA-64 architecture, Linux, and Intel MPI 3.0.
lib_hybrid/32/libhpl_hybrid.a        (New) Prebuilt library with the hybrid version of MP LINPACK for IA-32 architecture.
lib_hybrid/em64t/libhpl_hybrid.a     (New) Prebuilt library with the hybrid version of MP LINPACK for Intel 64 architecture.
lib_hybrid/64/libhpl_hybrid.a        (New) Prebuilt library with the hybrid version of MP LINPACK for IA-64 architecture.
nodeperf.c                           (New) Sample utility that tests the DGEMM speed across the cluster.
Building MP LINPACK
There are a few included sample architecture makefiles. It is recommended that you edit them to fit your specific configuration. In particular:

Set TOPdir to the directory in which MP LINPACK is being built.
You may set MPI variables, that is, MPdir, MPinc, and MPlib.
Specify the location of Intel MKL and of the files to be used (LAdir, LAinc, LAlib).
Adjust compiler and compiler/linker options.
Specify the version of MP LINPACK you are going to build (hybrid or non-hybrid) by setting the version parameter for the make, for example:

make arch=em64t version=hybrid install
For some sample cases, like Linux systems based on Intel 64 architecture, the makefiles contain values that are likely to be common. However, you are required to be familiar with building HPL and to pick appropriate values for these variables.
New Features
The toolset is basically identical to the HPL 1.0a distribution. There are a few changes that are optionally compiled in and disabled until you specifically request them. These new features are:
ASYOUGO: Provides non-intrusive performance information while runs proceed. There are only a few outputs, and this information does not impact performance. This is especially useful because many runs can go for hours without any information.
ASYOUGO2: Provides slightly intrusive additional performance information because it intercepts every DGEMM.
ASYOUGO2_DISPLAY: Displays the performance of all the significant DGEMMs inside the run.
ENDEARLY: Displays a few performance hints and then terminates the run early.
FASTSWAP: Inserts the LAPACK-optimized DLASWP into HPL's code. This may yield a benefit for the Itanium 2 processor. You can experiment with this to determine the best results.
HYBRID: Establishes the Hybrid OpenMP/MPI mode of MP LINPACK, providing the possibility to use threaded Intel MKL and prebuilt MP LINPACK hybrid libraries.
WARNING. Use this option only with an Intel compiler and the Intel MPI library version 3.1 or higher. It is also recommended that you use compiler version 10.0 or higher.
Benchmarking a Cluster
To benchmark a cluster, follow the sequence of steps below (some of them are optional). Pay special attention to the iterative steps 3 and 4. They make up a loop that searches for the HPL parameters (specified in HPL.dat) that yield the top performance of your cluster.
1. Get HPL installed and functional on all the nodes.
2. You may run nodeperf.c (included in the distribution) to see the performance of DGEMM on all the nodes. Compile nodeperf.c with your MPI and Intel MKL. For example:

mpicc -O3 nodeperf.c <mkl_directory>/lib/em64t/libmkl_em64t.a <mkl_directory>/lib/em64t/libguide.a -lpthread -o nodeperf

Launching nodeperf on all the nodes is especially helpful in a very large cluster. Indeed, there may be a stray job on a certain node, for example, 738, which is running 5% slower than the rest. MP LINPACK will then run as slowly as the slowest node. In this case, nodeperf enables quick identification of the potential problem spot without lots of small MP LINPACK runs around the cluster in search of the bad node. It is common that after a bunch of HPL runs, there may be zombie processes, and nodeperf facilitates finding the slow nodes. It goes through all the nodes, one at a time, and reports the performance of DGEMM followed by some host identifier. Therefore, the higher the penultimate number, the faster that node was performing.
3. Edit HPL.dat to fit your cluster needs. Read through the HPL documentation for ideas on this. However, you should try on at least 4 nodes.
4. Make an HPL run, using compile options such as ASYOUGO, ASYOUGO2, or ENDEARLY to aid in your search. (These options enable you to gain insight into the performance sooner than HPL would normally give this insight.) When doing so, follow these recommendations:

Use the MP LINPACK patched version of HPL to save time in the search. Using a patched version of HPL should not hinder your performance. That is why features that could be performance-intrusive are compile-optional in MP LINPACK: if you do not use any of the new options explained in section Options to reduce search time, these changes are disabled. The primary purpose of the additions is to assist you in finding solutions.

HPL requires a long time to search over many different parameters. In MP LINPACK, the goal is to get the best possible number. Given that the input is not fixed, there is a large parameter space you must search over. In fact, an exhaustive search of all possible inputs is improbably large even for a powerful cluster. This patched version of HPL optionally prints information on performance as it proceeds, or even terminates early, depending on your needs.
Save time by compiling with -DENDEARLY -DASYOUGO2 (described in section Options to reduce search time) and using a negative threshold. (Do not use a negative threshold on the final run that you intend to submit as a Top500 entry!) You can set the threshold in line 13 of the HPL 1.0a input file HPL.dat.
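The threshold edit can be scripted. The sketch below fabricates a 13-line stand-in for HPL.dat (the real file layout is defined by the HPL distribution; the values here are made up) and then negates the value on line 13:

```shell
# Build a stand-in input file: 12 dummy lines, then a line-13 threshold.
printf 'line%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > HPL.dat.example
echo '16.0 threshold' >> HPL.dat.example
# Negate the threshold on line 13 (GNU sed in-place edit):
sed -i '13s/^16.0/-16.0/' HPL.dat.example
sed -n '13p' HPL.dat.example   # prints: -16.0 threshold
```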
If you are going to run a problem to completion, do it with -DASYOUGO (see section Options to reduce search time).
5. Using the quick performance feedback, return to step 3 and iterate until you are sure that the performance is as good as possible.
Options to reduce search time

If you want the old HPL back, simply do not define these options and recompile from scratch (try "make arch=<arch> clean_arch_all").
-DASYOUGO: Gives performance data as the run proceeds. The performance always starts off higher and then drops because this actually happens in LU decomposition. The ASYOUGO performance estimate is usually an overestimate (because LU slows down as it goes), but it gets more accurate as the problem proceeds. The greater the lookahead step, the less accurate the first number may be. ASYOUGO tries to estimate where one is in the LU decomposition that MP LINPACK performs, and this is always an overestimate as compared to ASYOUGO2, which measures actually achieved DGEMM performance. Note that the ASYOUGO output is a subset of the information that ASYOUGO2 provides. So, refer to the description of the -DASYOUGO2 option below for the details of the output.

-DENDEARLY: Terminates the problem after a few steps, so that you can set up 10 or 20 HPL runs without monitoring them, see how they all do, and then only run the fastest ones to completion. -DENDEARLY assumes -DASYOUGO. You do not need to define both, although it does not hurt. Because the problem terminates early, it is recommended to set the "threshold" parameter in HPL.dat to a negative number when testing ENDEARLY. There is no point in doing a residual check if the problem ended early. It also sometimes gives a better picture to compile with -DASYOUGO2 when using -DENDEARLY.
You need to know the specifics of -DENDEARLY:
-DENDEARLY stops the problem after a few iterations of DGEMM on the block size (the bigger the block size, the further it gets). It prints only 5 or 6 "updates", whereas -DASYOUGO prints about 46 outputs before the problem completes.
Performance for -DASYOUGO and -DENDEARLY always starts off at one speed, slowly increases, and then slows down toward the end (because that is what LU does). -DENDEARLY is likely to terminate before it starts to slow down.
-DENDEARLY terminates the problem early with an HPL Error exit. It means that you need to ignore the missing residual results, which are wrong, as the problem never completed. However, you can get an idea what the initial performance was, and if it looks good, then run the problem to completion without -DENDEARLY. To avoid the error check, you can set HPL's threshold parameter in HPL.dat to a negative number.
Though -DENDEARLY terminates early, HPL treats the problem as completed and computes the Gflop rating as though the problem ran to completion. Ignore this erroneously high rating. The bigger the problem, the more closely the last update that -DENDEARLY returns will approximate what happens when the problem runs to completion. -DENDEARLY is a poor approximation for small problems. It is for this reason that you are advised to use ENDEARLY in conjunction with ASYOUGO2, because ASYOUGO2 reports actual DGEMM performance, which can be a closer approximation for problems just starting.
The best known compile options for the Itanium 2 processor are obtained with the Intel compiler.
-DASYOUGO2: Gives detailed single-node DGEMM performance information. It captures all DGEMM calls (if you use Fortran BLAS) and records their data. Because of this, the routine has a marginal intrusive overhead. Unlike -DASYOUGO, which is quite non-intrusive, -DASYOUGO2 interrupts every DGEMM call to monitor its performance. You should be aware of this overhead, although for big problems it is certainly less than 1/10th of a percent. The ASYOUGO2 output adds 3 intrusive numbers (DT, DF, and DMF) to the 3 non-intrusive numbers also found in ASYOUGO and ENDEARLY output, so it suffices to describe these numbers here.
The updates are printed at the following fractions of the problem: 0.04, 0.045, 0.05, 0.055, 0.06, 0.065, 0.07, 0.075, 0.080, 0.085, 0.09, 0.095, 0.10, ..., 0.195, 0.295, 0.395, ..., 0.895. However, in the sample the problem size is so small and the block size so big by comparison that as soon as the value for 0.045 was printed, the run was already through the 0.08 fraction of the columns. On a really big problem, the fractional number will be more accurate. It never prints more than the 46 numbers above. So, smaller problems will have fewer than 46 updates, and the biggest problems will have precisely 46 updates.

The Mflops is an estimate based on 1280 columns of LU being completed. However, with lookahead steps, sometimes that work is not actually completed when the output is made. Nevertheless, this is a good estimate for comparing identical runs. The 3 numbers in parentheses are intrusive ASYOUGO2 add-ins. DT is the total time processor 0 has spent in DGEMM. DF is the number of billion operations that have been performed in DGEMM by one processor. Hence, the performance of processor 0 (in Gflops) in DGEMM is always DF/DT. Using the number of DGEMM flops as a basis instead of the number of LU flops, you get a lower bound on the performance of the run by looking at DMF, which can be compared to Mflops above. (It uses the global LU time, but the DGEMM flops are computed under the assumption that the problem is evenly distributed amongst the nodes, as only HPL's node (0,0) returns any output.)

Note that when using the above performance monitoring tools to compare different HPL.dat inputs, you should be aware that the pattern of performance drop-off that LU experiences is sensitive to some of the inputs. For instance, when you try very small problems, the performance drop-off from the initial values to the end values is very rapid. The larger the problem, the smaller the drop-off, and it is probably safe to use the first few performance values to estimate the difference between a problem size of 700000 and 701000, for instance. Another factor that influences the performance drop-off is the grid dimensions (P and Q). For big problems, the performance tends to fall off less over the first few steps when P and Q are roughly equal in value.
You can make use of a large number of parameters, such as broadcast types, and change them so that the final performance is determined very closely by the first few steps. Using these tools will greatly increase the amount of data you can test.
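The DF/DT relationship described above can be checked with a one-line computation. The values below are invented for the sketch, not taken from a real run:

```shell
# Invented sample values: DT = seconds processor 0 spent in DGEMM,
# DF = billions of DGEMM operations performed by that processor.
DT=9.76
DF=34.51
# Performance of processor 0 in DGEMM is DF/DT (Gflops):
GFLOPS=$(awk -v df="$DF" -v dt="$DT" 'BEGIN { printf "%.2f", df/dt }')
echo "$GFLOPS"   # prints: 3.54
```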
A
Table A-1 shows the language interfaces that Intel Math Kernel Library (Intel MKL) provides for each function domain, and Table A-2 lists the respective header files. However, Intel MKL routines can be called from other languages using mixed-language programming. For example, see section Mixed-language programming with Intel MKL in chapter 7 on how to call Fortran routines from C/C++.

Table A-1 Intel MKL language interfaces support
Columns of Table A-1: Function Domain; FORTRAN 77 interface; Fortran 90/95 interface; C/C++ interface via CBLAS. The function domains covered are:

Basic Linear Algebra Subprograms (BLAS)
BLAS-like extension transposition routines
Sparse BLAS Level 1
Sparse BLAS Level 2 and 3
LAPACK routines for solving systems of linear equations
LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations
Auxiliary and utility LAPACK routines
Parallel Basic Linear Algebra Subprograms (PBLAS)
ScaLAPACK routines
DSS/PARDISO* solvers
Other Direct and Iterative Sparse Solver routines
Vector Mathematical Library (VML) functions
Vector Statistical Library (VSL) functions
Fourier Transform functions (FFT)
Cluster FFT functions
Table A-1 Intel MKL language interfaces support (continued)

Trigonometric Transform routines
Fast Poisson, Laplace, and Helmholtz Solver (Poisson Library) routines
Optimization (Trust-Region) Solver routines
GMP* arithmetic functions
Service routines (including memory allocation)
* Supported using a mixed language programming call. See Table A-2 for the respective header file.
Table A-2 lists the available header files for all Intel MKL function domains.

Table A-2 Intel MKL include files

Columns of Table A-2: Function domain; Include files (Fortran; C or C++). The function domains covered are: all function domains, BLAS routines, BLAS-like extension transposition routines, CBLAS interface to BLAS, Sparse BLAS routines, LAPACK routines, ScaLAPACK routines, and all Sparse Solver routines. The Fortran include files listed are:

mkl_spblas.fi
mkl_lapack.f90
mkl_lapack.fi
mkl_solver.f90
mkl_pardiso.f77
mkl_pardiso.f90
mkl_dss.f77
mkl_dss.f90
mkl_rci.fi
mkl_vml.f77
mkl_vml.fi
Table A-2 Intel MKL include files (continued)

The remaining function domains are: Vector Statistical Functions, Fourier Transform Functions, Cluster Fourier Transform Functions, Partial Differential Equations Support Routines (Fortran include files mkl_trig_transforms.f90 and mkl_poisson.f90), GMP interface, Service routines, Memory allocation routines, and MKL examples interface.
This appendix describes in brief certain interfaces that Intel Math Kernel Library (Intel MKL) supports.
The Intel MKL implementation of GMP* arithmetic functions includes arbitrary-precision arithmetic operations on integer numbers. The interfaces of these functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library. For specifications of these functions, see http://www.intel.com/software/products/mkl/docs/gnump/WebHelp/. If you currently use the GMP* library, you need to modify the INCLUDE statements in your programs to reference mkl_gmp.h.
Index
A
Absoft compiler, linking with, 5-11 affinity mask, 6-16 aligning data, 8-2 audience, 1-2 Compatibility OpenMP run-time compiler library, 5-11 compiler support, 2-2 compiler support RTL layer, 3-4 compiler, Absoft, linking with, 5-11 compiler-dependent function, 7-3 computational layer, 3-4 configuration file, for OOC DSS/PARDISO, 4-4 configuring development environment, 4-1 Eclipse CDT, 4-2 content assist, see code assist context-sensitive Help, for Intel MKL in Eclipse CDT, 10-4 custom shared object, 5-15 building, 5-15 specifying list of functions, 5-16 specifying makefile parameters, 5-16
B
benchmark, 11-1 BLAS calling routines from C, 7-5 Fortran-95 interfaces to, 7-2
C
C, calling LAPACK, BLAS, CBLAS from, 7-4 calling BLAS functions in C, 7-7 complex BLAS Level 1 function from C, 7-8 complex BLAS Level 1 function from C++, 7-9 Fortran-style routines from C, 7-4 CBLAS, 7-6 CBLAS, code example, 7-10 Cluster FFT, linking with, 9-1 cluster software, 9-1 linking examples, 9-3 linking syntax, 9-1 code assist, with Intel MKL, in Eclipse CDT, 10-6 coding data alignment, 8-1 mixed-language calls, 7-7 techniques to improve performance, 6-13
D
data alignment, 8-2 denormal, performance, 6-17 development environment, configuring, 4-1 directory structure documentation, 3-20 high-level, 3-1 in-detail, 3-10 documentation, 3-20 for Intel MKL, viewing in Eclipse IDE, 10-1 dummy library, 3-20 dynamic linking, 5-1
E
Eclipse CDT code/content assist, with Intel MKL, 10-6 configuring, 4-2 searching the Intel Web site, 10-3 Eclipse CDT, Intel MKL Help, 10-1, 0-1 context-sensitive, 10-4 environment variables, setting, 4-1 examples code, 2-2 linking, general, 5-8 ScaLAPACK, Cluster FFT, linking with, 9-3
L
language interfaces support, A-1 Fortran-95 interfaces, 7-2 language-specific interfaces, 7-1 LAPACK calling routines from C, 7-4 Fortran-95 interfaces to, 7-2 packed routines performance, 6-14 layer compiler support RTL, 3-4 computational, 3-4 interface, 3-3 RTL, 3-3 threading, 3-4 layered model, 3-2 Legacy OpenMP run-time compiler library, 5-11 library run-time, Compatibility OpenMP, 5-11 run-time, Legacy OpenMP, 5-11 library structure, 3-1 link command examples, 5-8 syntax, 5-3 link libraries computational, 5-13 for Intel 64 architecture, 5-13 interface, for the Absoft compilers, 5-11 threading, 5-12 linkage models, comparison, 5-2 linking, 5-1 dynamic, 5-1 layered model, 5-4 legacy model, 5-4 recommendations, 5-2 static, 5-1 with Cluster FFT, 9-1 with ScaLAPACK, 9-1 LINPACK benchmark, 11-1
F
FFT functions, data alignment, 6-14 FFT interface MKL_LONG type, 3-6 optimized radices, 6-17 threading tip, 6-13 FFTW interface support, B-1 Fortran-95, interfaces to LAPACK and BLAS, 7-2
G
GMP arithmetic functions, B-1 GNU Multiple Precision Arithmetic Library, B-1
H
Help, for Intel MKL in Eclipse CDT, 10-1 HT Technology, see Hyper-Threading technology hybrid, version, of MP LINPACK, 11-4 Hyper-Threading Technology, configuration tip, 6-15
I
ILP64 programming, support for, 3-5 installation, checking, 2-1 interface layer, 3-3
J
Java examples, 7-12
M
memory functions, redefining, 6-18
memory management, 6-17 memory renaming, 6-18 mixed-language programming, 7-4 module, Fortran-95, 7-4 MP LINPACK benchmark, 11-4 hybrid version, 11-4 multi-core performance, 6-15
S
ScaLAPACK, linking with, 9-1 sequential version of the library, 3-4 stability, numerical, 8-1 static linking, 5-1 support, technical, 1-1 syntax linking, cluster software, 9-1 linking, general, 5-3
N
notational conventions, 1-3 number of threads changing at run time, 6-5 Intel MKL choice, particular cases, 6-10 setting for cluster, 9-2 setting with OpenMP environment variable, 6-4 techniques to set, 6-3 numerical stability, 8-1
T
technical support, 1-1 thread safety, of Intel MKL, 6-2 threading avoiding conflicts, 6-4 environment variables and functions, 6-8 Intel MKL behavior, particular cases, 6-10 Intel MKL controls, 6-8 see also number of threads threading layer, 3-4
O
OpenMP Compatibility run-time compiler library, 5-11 Legacy run-time compiler library, 5-11 OpenMP, run-time compiler library, 5-11
U
uBLAS, matrix-matrix multiplication, substitution with Intel MKL functions, 7-10 unstable output, numerically, getting rid of, 8-1 usage information, 1-1
P
parallel performance, 6-4 parallelism, 6-1 PARDISO OOC, configuration file, 4-4 performance, 6-1 coding techniques to gain, 6-13 hardware tips to gain, 6-15 multi-core, 6-15 of LAPACK packed routines, 6-14 with denormals, 6-17
R
RTL, 7-3 RTL layer, 3-3 run-time library, 7-3 Compatibility OpenMP, 5-11 Legacy OpenMP, 5-11