Intel® oneAPI Math Kernel Library (oneMKL) Release Notes

ID 765830
Updated 3/23/2024
Version 2024.1
Public

Where to Find the Release

Intel® oneAPI Math Kernel Library

2024.1

System Requirements  Bug Fix Log

New Features and Optimizations

  • Intel® Optimized High Performance Conjugate Gradient Benchmark

    • Features 
      • Introduced the HPCG benchmark for Intel® GPUs, optimized for clusters of nodes each with one or more Intel® Data Center GPU Max Series GPUs attached. 
  • BLAS

    • Features 
      • Introduced Conditional Numerical Reproducibility support for level-3 routines on Intel® Data Center GPU Max Series.  
      • Introduced 32-bit SYCL* APIs for all BLAS group batch routines with integer SYCL* USM pointer or SYCL* buffer inputs.
    • Optimizations
      • Improved performance for numerous level-2 APIs on Intel® Data Center GPU Max Series. 
      • Improved performance for complex double precision and TF32 level 3 routines on Intel® Data Center GPU Max Series.
  • Sparse BLAS

    • Features
      • Introduced sparse::trsm and sparse::optimize_trsm SYCL* APIs with support for CSR format sparse triangular solves with multiple dense right-hand sides in row-major or column-major layout.
      • Introduced new sparse::trsv SYCL* API with support for fused alpha scaling of the right-hand side in the sparse triangular solve.
      • Introduced the sparse::set_coo_matrix SYCL* API, which loads sparse matrix data in coordinate (COO) format into a sparse::matrix_handle_t object on CPU and GPU devices.
      • Extended support of SYCL* APIs using a sparse::matrix_handle_t object with COO format data:
        • CPU: sparse::omatcopy, sparse::gemv, sparse::trmv, sparse::gemvdot, sparse::trsv, sparse::trsm and sparse::gemm APIs
        • GPU: sparse::omatcopy, sparse::gemv APIs
      • Introduced a new C example demonstrating sparse format conversions using Inspector Executor Sparse BLAS APIs. The example is located at $MKLROOT/share/doc/mkl/examples.
    • Optimizations
      • Improved support and performance for sparse::gemv using complex data with the CSR format with all non-transpose/transpose/conjugate-transpose operations. 
  • LAPACK 

    • Features
      • Introduced new routines and integrated bug fixes from Netlib LAPACK 3.11.0. New functionality includes level-3 BLAS solvers for triangular systems (?latrs3) and triangular Sylvester equations (?trsyl3) and a new algorithm for solving least square problems (?gelst). oneMKL LAPACK functionality is now aligned with Netlib LAPACK 3.11.0.
      • Introduced SYCL* USM APIs to compute batched group least squares solutions for general matrices (lapack::gels_batch).
      • Introduced SYCL* and C/Fortran APIs to compute approximate singular value decompositions of a batch of matrices (lapack::gesvda_batch), and enabled C/Fortran OpenMP* offload support.
    • Optimizations
      • Improved performance for batched group LU inverse (lapack::getri_batch) on Intel® GPUs for SYCL* APIs, especially for a smaller number of larger matrices.
      • Improved performance for double precision divide-and-conquer eigensolver (lapack::syevd) and generalized eigensolver (lapack::sygvd) on Intel® Data Center GPU Max Series as well as for C and Fortran OpenMP* offload (dsyevd, dsygvd).
      • Improved performance of QR factorization (lapack::geqrf) on Intel® Data Center GPU Max Series for SYCL* USM APIs as well as for C and Fortran OpenMP* offload (?geqrf).
  • DFT 

    • Features
      • Introduced new configuration parameters FWD_STRIDES and BWD_STRIDES for the DFT SYCL* API.
    • Optimizations
      • Improved FFT performance on Intel® Data Center GPU Max Series for 1D complex FFT of large power of two size and batched 2D complex FFT of medium to large power of two size.
      • Improved FFT performance on Intel® Data Center GPU Max Series for 3D real FFT of medium power of two size or small odd size.
  • Vector Math

    • Features
      • Introduced oneMKL VM support for a subset of Bessel functions for the C API and SYCL* API: I0(), I1(), J0(), J1(), Jn(), Y0(), Y1(), Yn().
    • Optimizations
      • Improved performance for logb() and nextafter() on Intel® GPUs.
  • Vector Statistics

    • Features
      • Introduced VERBOSE mode support for RNG C/Fortran API.
      • Introduced sub-stream based parallelization mode for SYCL API of mrg32k3a engine.
  • Library Engineering

    • Advance Notice
      • Starting with oneMKL 2025.0, a user-supplied "mkl_progress" function will no longer override the default "mkl_progress" function automatically; the "mkl_set_progress" function must be used to specify any overrides.
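
The fused alpha scaling listed for sparse::trsv in the Sparse BLAS features above can be pictured as solving L * y = alpha * x in a single pass over the CSR data. The following is a minimal pure-Python sketch of that idea for a lower triangular matrix with an explicitly stored diagonal. The function name csr_lower_trsv and its argument order are illustrative assumptions; this is not the oneMKL API and makes no claim about how the library implements the solve.

```python
def csr_lower_trsv(alpha, row_ptr, col_ind, values, x):
    """Solve L * y = alpha * x for y (hypothetical helper, not a oneMKL call).

    L is lower triangular, stored in CSR with 0-based indices and an
    explicitly stored, non-zero diagonal.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = alpha * x[i]  # the scaling is folded into the solve itself
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_ind[k]
            if j < i:
                acc -= values[k] * y[j]  # forward substitution
            elif j == i:
                diag = values[k]
        if diag is None or diag == 0:
            raise ZeroDivisionError(f"missing or zero diagonal in row {i}")
        y[i] = acc / diag
    return y


# L = [[2, 0, 0],
#      [1, 4, 0],
#      [0, 3, 5]]
row_ptr = [0, 1, 3, 5]
col_ind = [0, 0, 1, 1, 2]
values = [2.0, 1.0, 4.0, 3.0, 5.0]
x = [2.0, 9.0, 21.0]

y_unscaled = csr_lower_trsv(1.0, row_ptr, col_ind, values, x)
y_scaled = csr_lower_trsv(2.0, row_ptr, col_ind, values, x)
```

Passing alpha as 1 reproduces an unscaled solve, which is the substitution suggested for code that relied on a trsv variant without the scaling parameter.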

Known Issues and Limitations

  • OpenMP* offload of Fortran group batch routines to Intel® GPU on Windows* may produce incorrect results with the OpenMP* 5.1 “dispatch” construct. Use the “target variant dispatch” construct instead.
  • Certain sizes/configurations of int8 GEMMs may return incorrect results on Intel® Data Center GPU Max series when B is transposed (column-major) or A is transposed (row-major).
  • oneMKL DFT SYCL* APIs using SYCL* buffer for data input do not support SYCL* sub-buffer inputs for a range of large power of two sizes [2²¹,2²⁶] 1D complex FFT.
  • Double precision FFTs whose sizes are multiples of very large primes may produce incorrect results on CPU.
  • oneMKL FFTs of sizes with a large prime factor (larger than 1024) may fail on Intel® Data Center GPU Max Series.
  • On Intel® Iris® Xe MAX Graphics, {c,s}getrfnp_batch functions may hang or have a segmentation fault. As a workaround, use the {c,s}getrfnp_batch_strided functions instead.
  • Some Sparse BLAS SYCL* examples (sparse_gemm_col_major/sparse_gemm_row_major) are known to fail with oneMKL on Windows* when run in Debug mode.  Please use Release mode for this functionality on Windows.
  • Using a lower triangular matrix for sparse matrix-vector multiplication with 1-based indexing and OpenMP* Offload in C with mkl_sparse_optimize and mkl_sparse_?_mv can sporadically provide incorrect output with the Level Zero backend and OpenMP* 5.1 version on Intel® Data Center GPU Max Series. As a workaround, use OpenCL* backend or Level Zero & OpenMP* version <= 5.0.
  • The deprecated sparse::release_matrix_handle API without a sycl::queue input may fail to wait for previously enqueued commands to complete on an in-order queue if the sycl::event corresponding to the queue's last command is not provided as a dependency to the release API, or if the queue is not synchronized before the release API call.
  • C and Fortran offload examples may crash after completing their computations. The issue is known to affect a subset of Intel® GPUs, including the Intel® Data Center GPU Flex Series but not the Intel® Data Center GPU Max Series. To work around it, switch the offload plugin to OpenCL* by setting ONEAPI_DEVICE_SELECTOR=opencl:gpu. This behavior does not affect the accuracy or performance of oneMKL functions and will be fixed in 2024.2.
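
The ONEAPI_DEVICE_SELECTOR workaround in the last item above can be applied per process rather than globally. The sketch below builds a child-process environment with the variable set to opencl:gpu. The helper name opencl_offload_env and the binary name in the commented-out launch line are hypothetical and only serve as placeholders.

```python
import os


def opencl_offload_env(base_env=None):
    """Return a copy of the environment with the OpenCL GPU backend selected.

    Hypothetical helper: the caller's environment is copied, not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ONEAPI_DEVICE_SELECTOR"] = "opencl:gpu"
    return env


env = opencl_offload_env({"PATH": "/usr/bin"})

# Launching an offload example with this environment might look like this
# (the binary name is a placeholder, so the call is left commented out):
# import subprocess
# subprocess.run(["./my_offload_example"], env=env, check=True)
```

Scoping the variable to one child process keeps other applications in the same shell on their default backend.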

Deprecation/Removal

  • The INPUT_STRIDES and OUTPUT_STRIDES configuration parameters are deprecated for the oneMKL SYCL DFT APIs and will be removed in the oneMKL 2026.0 release. Please use the FWD_STRIDES and BWD_STRIDES configuration parameters instead.
  • The random number generation save_state/load_state APIs that take std::string as the second parameter have been deprecated and will be removed in the oneMKL 2026.0 release. Please use the save_state/load_state APIs that take const std::uint8_t* as the second parameter instead.
  • The sparse triangular solve sparse::trsv SYCL API without an “alpha” scaling parameter has been deprecated.  Please use the new sparse triangular solve with “alpha” as 1 or other value if desired.
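
For readers migrating from INPUT_STRIDES/OUTPUT_STRIDES, it may help to think of FWD_STRIDES and BWD_STRIDES as describing the forward-domain and backward-domain data layouts, rather than the input and output of one particular transform direction. The sketch below assumes the usual oneMKL DFT stride convention, in which a stride vector for a d-dimensional transform has d+1 entries and the first entry is an offset. The helper names and the packed row-major defaults shown for an out-of-place 2D real transform are illustrative assumptions, not library calls.

```python
def element_offset(strides, index):
    """Linear offset of a multi-index: strides[0] + sum(strides[k + 1] * index[k])."""
    if len(strides) != len(index) + 1:
        raise ValueError("expected one stride per dimension plus a leading offset")
    return strides[0] + sum(s * i for s, i in zip(strides[1:], index))


def packed_real_2d_strides(n1, n2):
    """Packed row-major strides for an out-of-place n1 x n2 real transform.

    Forward domain:  n1 x n2 real elements.
    Backward domain: n1 x (n2 // 2 + 1) complex elements (conjugate-even half),
    with strides counted in complex elements.
    """
    fwd_strides = [0, n2, 1]
    bwd_strides = [0, n2 // 2 + 1, 1]
    return fwd_strides, bwd_strides


fwd, bwd = packed_real_2d_strides(4, 6)
fwd_offset = element_offset(fwd, (2, 3))  # element (2, 3) of the real data
bwd_offset = element_offset(bwd, (2, 3))  # element (2, 3) of the complex data
```

Under this reading, the same pair of stride vectors serves both the forward and the backward transform, since each vector is tied to a domain and not to a direction.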

Notes

For the 2024.1 release, the Third Party Programs file has been included as a section in this product’s release notes rather than as a separate text file.

Third Party Programs File

 

2024.0

System Requirements  Bug Fix Log

What’s new? 

  • Integrates Vector Math optimizations into Random Number Generators for high performance computer simulations, statistical sampling, and other areas on x86 CPUs and Intel GPUs. 
  • Supports Vector Math for FP16 datatype on Intel® GPUs 
  • Delivers high-performance benchmarks HPL and HPL-AI optimized for Intel® Xeon® CPU Max Series and Intel® Data Center GPU Max Series 

Directory Layout

Directory layout is improved across all products to streamline installation and setup. 

The Unified Directory Layout is implemented in 2024.0. If you have multiple toolkit versions installed, the Unified layout ensures that your development environment contains the correct component versions for each installed version of the toolkit. 

The directory layout used before 2024.0, the Component Directory Layout, is still supported on new and existing installations. 

For detailed information about the Unified layout, including how to initialize the environment and the advantages of the Unified layout, refer to "Use the setvars and oneapi-vars Scripts with Linux" and "Use the setvars and oneapi-vars Scripts with Windows".

New Features and Optimizations

  • BLAS

    • Features 
      • Scalar parameters (alpha, beta) to BLAS USM APIs may now be passed by pointer or by value.  
      • Added complex_3m acceleration for GEMM (including batched variants) on Intel® Data Center GPU Max Series. 
      • Added strided versions of gemm3m_batch C and Fortran APIs, including OpenMP* offload support. 
      • Added {cblas_}gemm_f16f16f32 C APIs. These are the half-precision (MKL_F16) analogues of the previously introduced gemm_bf16bf16f32 APIs for bfloat16 (MKL_BF16).
    • Optimizations
      • Enhanced HGEMM performance for small matrices on CPUs. 
      • Improved general performance of GEMV and several BLAS level-1 routines on Intel® Data Center GPU Max Series. 
  • Sparse BLAS

    • Features
      • Inspector Executor Sparse BLAS C APIs now include mkl_sparse_<xyz>_64() APIs using MKL_INT64 for all integers in lp64 and ilp64 modes. 
      • Added std::complex<float> and std::complex<double> support for all existing sparse BLAS SYCL* APIs.
      • Added support for oneapi::mkl::transpose::conjtrans operation to sparse::gemv and sparse::omatcopy SYCL* APIs. 
      • Added support for oneapi::mkl::transpose::{trans, conjtrans} operation on the sparse matrix in sparse::gemm SYCL* API. 
    • Optimizations
      • Improved performance for sparse::gemv/trmv with matrices with high variability in the number of non-zeros per row.
      • Improved sparse::matmat performance for key workloads. 
  • LAPACK 

    • Features
      • Introduced SYCL* APIs to compute LU factorization without pivoting (lapack::getrfnp); added support for OpenMP* offloading in C and Fortran (mkl_?getrfnp).
      • Introduced SYCL* APIs to compute batched matrix inverse of a group of general matrices (lapack::geinv_batch). 
      • Added argument checking for lapack::gerqf, lapack::hetrf, lapack::orgbr, lapack::orgtr, lapack::ormrq, lapack::ormtr, lapack::sytrf, lapack::ungbr, lapack::ungtr, lapack::unmrq, lapack::unmtr, and their scratchpad size functions. 
    • Optimizations
      • Improved performance of QR factorization (lapack::geqrf) on Intel® Data Center GPU Max Series for SYCL* USM APIs as well as for C and Fortran OpenMP* offloading.
      • Improved performance of orthogonal/unitary matrix multiplication (lapack::ormqr/lapack::unmqr) on Intel® GPUs for SYCL* APIs and C and Fortran OpenMP* offloading.
      • Improved performance of batched strided LU inverse (lapack::getri_batch) on Intel® GPUs for SYCL* APIs, especially for a smaller number of larger matrices. 
  • DFT 

    • Features 
      • Enabled FFTs larger than 4 GiB (up to 64GiB of data) on Intel® Data Center GPU Max Series. 
    • Optimizations 
      • Improved double precision FFT performance on Intel® Data Center GPU Max Series for any FFT with at least one dimension divisible by a prime number in the range [11,61]. 
      • Improved 1D complex FFT performance on Intel® Data Center GPU Max Series for power of two sizes in the range [2²¹, 2²⁵].  
  • Vector Math

    • Features
      • Added support for OpenMP* 5.1 offloading in C.
      • Added SYCL*–OpenMP* interoperability support for OpenMP* offloading.
      • Aligned Status and Mode between the Classic and Offloading versions of VM.
      • Added J0/J1 Bessel functions of the 1st kind (orders 0 and 1) for real arguments on GPUs.
      • Added Y0/Y1 Bessel functions of the 2nd kind (orders 0 and 1) for real arguments on GPUs.
      • Added I0/I1 modified Bessel functions of the 1st kind (orders 0 and 1) for real arguments on GPUs.
    • Optimizations
      • HA versions of cexp, cln, csqrt were added in native precision for GPUs. 
      • Native FP16 cos/exp/exp10/ln/log10/log2/sin were added for GPUs. 
      • The FP16 host API performance on GPU was improved by up to 30%. 
  • Vector Statistics 

    • Features
      • Enabled Verbose mode support for RNG SYCL* Host API. 
    • Optimizations
      • Optimized mrg32k3a and philox4x32x10 RNG SYCL* Device API performance on Intel® Data Center GPU Max Series.
  • Sparse Solvers 

    • Features 
      • Improved accuracy of generalized eigenvalues calculated using mkl_sparse_?_gv for symmetric matrix types. 
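
As background for the complex_3m acceleration listed under BLAS above: the textbook "3M" scheme forms a complex matrix product from three real matrix products instead of the usual four. The pure-Python sketch below illustrates that scheme on small matrices. It is an assumption that this matches the general idea behind the feature; it is not oneMKL code, and all function names are illustrative.

```python
def matmul(a, b):
    """Plain triple-loop matrix product (works for real or complex entries)."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def gemm_3m(a, b):
    """Complex product C = A * B using three real matrix products."""
    ar = [[z.real for z in row] for row in a]
    ai = [[z.imag for z in row] for row in a]
    br = [[z.real for z in row] for row in b]
    bi = [[z.imag for z in row] for row in b]

    t1 = matmul(ar, br)                    # Ar * Br
    t2 = matmul(ai, bi)                    # Ai * Bi
    t3 = matmul(add(ar, ai), add(br, bi))  # (Ar + Ai) * (Br + Bi)

    cr = sub(t1, t2)                       # real part:      Ar*Br - Ai*Bi
    ci = sub(sub(t3, t1), t2)              # imaginary part: Ar*Bi + Ai*Br
    return [[complex(r, i) for r, i in zip(rr, ri)]
            for rr, ri in zip(cr, ci)]


a = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j]]
b = [[2 - 1j, 1 + 1j], [1 + 0j, -1 + 2j]]
c_3m = gemm_3m(a, b)
c_ref = matmul(a, b)  # conventional complex arithmetic for comparison
```

The saving in multiplications comes at the cost of extra additions and somewhat different rounding behavior in the imaginary part, which is why such schemes are typically offered as an opt-in mode.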

 

Library Engineering

  • The following domain specific SYCL* libraries are now made available in addition to the combined mkl_sycl library: 
    • libmkl_sycl_blas.so 
    • libmkl_sycl_lapack.so (depends on libmkl_sycl_blas.so) 
    • libmkl_sycl_sparse.so (depends on libmkl_sycl_blas.so) 
    • libmkl_sycl_vm.so 
    • libmkl_sycl_rng.so 
    • libmkl_sycl_stats.so 
    • libmkl_sycl_data_fitting.so 

      MKLConfig.cmake also provides corresponding targets to link domain specific SYCL* libraries via MKL::MKL_SYCL::<domain> 
  • Dropped all SSSE3 and AVX optimizations 
  • With the removal of classic compiler support, all references to this compiler have been replaced with icx. 
  • MKLConfig.cmake now rejects operation when the oneMKL version found in the environment variable MKLROOT differs from the version found by CMake. 
  • Removed find_package_handle_standard_args() in MKLConfig.cmake, as it incorrectly set MKL_FOUND. 
  • MKLConfig.cmake: Removed the oneMKL path from the implicit include directories so that the oneMKL include directory path is always explicitly defined, whether or not it is present in the user’s CPATH environment variable. This resolves an issue when cmake is called from different environments. Note that this change applies to C and C++ only; according to the CMake 3.14+ documentation, the implicit include directory variable is not used for Fortran.
  • Removed __cdecl, its related macros, and *_win.h files. 

 

Fixed issues: 

  • Fixed an issue where oneMKL DFT SYCL* APIs could fail to compute correct results for 2D and 3D real FFTs when using a user-allocated SYCL* buffer workspace and the OpenCL* runtime.
  • Improved BLAS support for host USM pointers. 
  • Fixed SYMM/TRSM accuracy issues.  
  • Fixed SGEMM/DGEMM/SYRK failures and memory leaks. 
  • Fixed Fortran OpenMP* issues when complex-precision division is used on Windows on Intel® Iris® Xe Max and Intel® Arc™ A-Series GPUs with static linking. 

Known Issues and Limitations

  • The getri_batch_usm and getri_oop_batch_usm LAPACK examples that are located at ${MKLROOT}/examples/dpcpp/lapack may fail on Intel® Iris® Xe MAX Graphics on Windows* in debug_mode. 
  • On Intel® Iris® Xe MAX Graphics, {c,s}getrfnp_batch functions may hang or have a segmentation fault. As a workaround, use the {c,s}getrfnp_batch_strided functions instead. 
  • OpenMP* offload of Fortran LAPACK functions cpotrf, cpotri, cpotrs, ctrtri, spotrf, spotri, spotrs, strtri to GPU under Windows* in static linking mode may crash. As a workaround, use dynamic linking mode. 
  • oneMKL DFT SYCL* APIs using SYCL* buffer for data input do not support SYCL* sub-buffer inputs for a range of large power of two sizes [2²¹,2²⁶] 1D complex FFT. 
  • Double precision FFTs whose sizes are multiples of very large primes may produce incorrect results on CPU.
  • 2D and 3D FFTs might hang on Intel® Data Center GPU Max Series when GPU debugging is enabled. As a workaround, set the environment variables NEOReadDebugKeys=1 and EnableRecoverablePageFaults=0, or disable GPU debugging by writing 0 to the files /sys/class/drm/card*/prelim_enable_eu_debug
  • The mrg32k3a random number engine may fail on Intel® Arc™ A-Series Graphics GPUs on Windows* when the /Od option is enabled.
  • Random number generator Device APIs that use Vector Math Device APIs underneath do not work on Intel® GPUs without native double precision support due to Vector Math restrictions.
  • Some Sparse BLAS SYCL* examples (sparse_gemm_col_major/sparse_gemm_row_major) are known to fail with oneMKL 2024.0 on Windows* when run in Debug mode.  Please use Release mode linking to use this particular functionality. 
  • Use the prebuilt oneMKL 2024.0 HPCG binaries with the oneAPI 2024.0 compiler runtime for the best performance. Compiling HPCG from sources with the current icpx compiler may result in slightly lower performance than when compiling it with compilers from earlier oneAPI releases. 
  • oneapi::mkl::sparse::trsv() sycl::buffer APIs may crash with a segmentation fault when any of the CSR matrix data or the x or y vectors are sub-buffers of a sycl::buffer.
  • Asynchronous execution of mkl_sparse_optimize() for mkl_sparse_x_mv() using OpenMP* offloading in C can sporadically hang on Intel® Data Center GPU Max Series. As a workaround, use synchronous offloading for mkl_sparse_optimize(). 
  • Strided and group batched non-pivoting LU (getrfnp_batch) for complex precisions provides incorrect values on Intel® Data Center GPU Max Series with certain drivers. 
  • oneMKL SYCL DLL could leak memory after unloading on Windows. The problem can be avoided by adding mkl_free_buffer before unloading the DLL.  
  • The Intel® oneMKL NuGet packages intelmkl.static.cluster.win-x64 and intelmkl.devel.cluster.win-x64 cannot be added to a .Net Standard 2.0 or higher project because a dependent package (intelmpi.devel.win-x64) is not compatible with the 2.0 standard. An updated intelmpi.devel.win-x64 package will be published to address the compatibility with the 2.0 standard.

Known Issues and Limitations for Intel® GPU Driver Version 20231219 

The limitations in this section do not apply to the execution of Intel® oneMKL on CPUs. 

  • The LAPACK batch strided least squares solver (oneapi::mkl::lapack::gels_batch, ?gels_batch_strided with OpenMP* offload) may return incorrect results on all Intel® GPUs. As a workaround, the previous GPU driver version 20231031 can be used. A list of supported GPUs of that version can be found in the driver 20231031 release notes. 
  • oneMKL double precision FFT may fail or crash on the integrated GPUs of Intel® Core Ultra processors for driver version 20231219. The issue will be fixed in future releases of the driver.
  • oneMKL RNG Sobol Host API and Stats routines may throw an exception in case of execution on any Intel® GPU device.  As a workaround, the previous GPU driver version 20231031 can be used. A list of supported GPUs of that version can be found in the driver 20231031 release notes. 

Deprecation/Removal 

  • Graph domain APIs have been removed in the oneMKL 2024.0 release. 
  • Intel® oneAPI Math Kernel Library (oneMKL) for macOS was deprecated in release 2023.0 and is discontinued as of Intel® oneMKL release 2024.0.

 

Previous oneAPI Releases

2023

Release Notes, System Requirements and Bug Fix Log

2022

Release Notes, System Requirements and Bug Fix Log

2021

Release Notes, System Requirements and Bug Fix Log

2017-2020

Release Notes, System Requirements and Bug Fix Log

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.