3. Programmer’s Guide

3.1. Library Source Code Organization

The rocWMMA code is split into four major parts:

  • The library directory contains all source code for the library.

  • The samples directory contains real-world use-cases of the rocWMMA API.

  • The test directory contains all validation, performance and unit tests of the rocWMMA API.

  • Infrastructure covers the build, documentation, continuous integration and code formatting tooling.

3.1.1. The library directory

3.1.1.1. library/include/rocwmma/

Contains C++ include files for the rocWMMA API. These files also contain Doxygen comments that document the API.
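As an illustration of what a Doxygen-documented declaration looks like, here is a minimal sketch. The function `mean_value` is hypothetical and is not part of rocWMMA; only the comment markup (`@brief`, `@param`, `@pre`, `@return`) is the point of the example.

```cpp
#include <cassert>
#include <vector>

/**
 * @brief Computes the arithmetic mean of a buffer.
 *
 * @note Hypothetical helper used only to illustrate Doxygen markup;
 *       it is not part of the rocWMMA API.
 *
 * @param data Values to average.
 * @pre  data is not empty.
 * @return Sum of all elements divided by the element count.
 */
inline float mean_value(const std::vector<float>& data)
{
    assert(!data.empty());

    float sum = 0.0f;
    for(float value : data)
    {
        sum += value;
    }
    return sum / static_cast<float>(data.size());
}
```

Doxygen extracts comment blocks of this form, and Breathe and Sphinx can then render them into the API reference pages.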

3.1.1.2. library/include/internal

Internal include files for:

  • Type support

  • Input / output configuration, shapes and traits

  • Layout

  • Mapping utility

  • Cross-lane operation utility

  • Vector blend utility

  • Packing and unpacking

  • Conversion and broadcasting

  • Load and store

  • Matrix multiply-accumulate

  • Cooperative load and store

  • Threadblock synchronization

  • Utility code

3.1.2. The samples directory

3.1.2.1. samples/hipRTC_gemm.cpp

Sample code that calls a simple GEMM algorithm demonstration from within the hipRTC environment, without LDS memory usage or transposition.

3.1.2.2. samples/simple_sgemv.cpp

Sample code that demonstrates a simple matrix multiply-accumulate with a vector (GEMV), without LDS memory usage or transposition, for single-precision floating-point types.

3.1.2.3. samples/simple_dgemv.cpp

Sample code that demonstrates a simple matrix multiply-accumulate with a vector (GEMV), without LDS memory usage or transposition, for double-precision floating-point types.
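To make the operation concrete, the following is a minimal host-side sketch of a matrix-vector multiply-accumulate. It assumes the conventional BLAS-style form y = alpha * (A * x) + beta * y with a row-major A; the function name `reference_gemv` is hypothetical, and the code is a plain CPU reference, not the GPU kernel used by the samples.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for y = alpha * (A * x) + beta * y.
// A is m x n and stored row-major; x has n elements; y has m elements.
// Hypothetical helper for illustration; not part of the rocWMMA API.
template <typename T>
void reference_gemv(std::size_t           m,
                    std::size_t           n,
                    T                     alpha,
                    const std::vector<T>& a,
                    const std::vector<T>& x,
                    T                     beta,
                    std::vector<T>&       y)
{
    assert(a.size() == m * n && x.size() == n && y.size() == m);

    for(std::size_t i = 0; i < m; ++i)
    {
        // Dot product of row i of A with x.
        T acc = T(0);
        for(std::size_t j = 0; j < n; ++j)
        {
            acc += a[i * n + j] * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}
```

Instantiating the template with `float` or `double` mirrors the single-precision and double-precision split of the two GEMV samples.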

3.1.2.4. samples/simple_sgemm.cpp

Sample code that demonstrates a simple GEMM algorithm, without LDS memory usage or transposition, for single-precision floating-point types.

3.1.2.5. samples/simple_dgemm.cpp

Sample code that demonstrates a simple GEMM algorithm, without LDS memory usage or transposition, for double-precision floating-point types.

3.1.2.6. samples/simple_hgemm.cpp

Sample code that demonstrates a simple GEMM algorithm, without LDS memory usage or transposition, for half-precision floating-point types.
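For readers unfamiliar with GEMM, here is a minimal host-side sketch of the operation the simple GEMM samples above demonstrate. It assumes the usual form D = alpha * (A * B) + beta * C with all matrices row-major and no transposes; `reference_gemm` is a hypothetical name, and this CPU loop nest only illustrates the arithmetic, not the rocWMMA implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for D = alpha * (A * B) + beta * C.
// A is m x k, B is k x n, C and D are m x n; all row-major, no transposes.
// Hypothetical helper for illustration; not part of the rocWMMA API.
template <typename T>
std::vector<T> reference_gemm(std::size_t           m,
                              std::size_t           n,
                              std::size_t           k,
                              T                     alpha,
                              const std::vector<T>& a,
                              const std::vector<T>& b,
                              T                     beta,
                              const std::vector<T>& c)
{
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);

    std::vector<T> d(m * n);
    for(std::size_t i = 0; i < m; ++i)
    {
        for(std::size_t j = 0; j < n; ++j)
        {
            // Accumulate the dot product of row i of A and column j of B.
            T acc = T(0);
            for(std::size_t l = 0; l < k; ++l)
            {
                acc += a[i * k + l] * b[l * n + j];
            }
            d[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
    return d;
}
```

A reference of this kind is also a common way to validate GPU results element by element, within a tolerance suited to the data type.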

3.1.2.7. samples/perf_sgemm.cpp

Sample code that demonstrates the best-performing multi-block GEMM algorithm, using LDS memory, macro tile collaboration, data reuse and an optimized pipeline, for single-precision floating-point types.

3.1.2.8. samples/perf_dgemm.cpp

Sample code that demonstrates the best-performing multi-block GEMM algorithm, using LDS memory, macro tile collaboration, data reuse and an optimized pipeline, for double-precision floating-point types.

3.1.2.9. samples/perf_hgemm.cpp

Sample code that demonstrates the best-performing multi-block GEMM algorithm, using LDS memory, macro tile collaboration, data reuse and an optimized pipeline, for half-precision floating-point types.
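The idea of tiling for data reuse can be sketched on the host with a cache-blocked matrix multiply. This is only a CPU analogy under stated assumptions: the function `tiled_gemm_accumulate` and its `tile` parameter are hypothetical, and the loop structure does not reproduce the LDS staging, wave collaboration or pipelining of the actual perf samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C += A * B one tile at a time (cache blocking).
// A is m x k, B is k x n, C is m x n; all row-major.
// Each tile of A and B is reused for a whole tile of C before moving on,
// which is the same reuse idea that macro tiles exploit on the GPU.
// Hypothetical helper for illustration; not part of the rocWMMA API.
template <typename T>
void tiled_gemm_accumulate(std::size_t           m,
                           std::size_t           n,
                           std::size_t           k,
                           std::size_t           tile,
                           const std::vector<T>& a,
                           const std::vector<T>& b,
                           std::vector<T>&       c)
{
    assert(tile > 0);
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);

    for(std::size_t i0 = 0; i0 < m; i0 += tile)
    {
        for(std::size_t j0 = 0; j0 < n; j0 += tile)
        {
            for(std::size_t l0 = 0; l0 < k; l0 += tile)
            {
                // Clamp tile bounds so partial edge tiles are handled.
                const std::size_t i1 = std::min(i0 + tile, m);
                const std::size_t j1 = std::min(j0 + tile, n);
                const std::size_t l1 = std::min(l0 + tile, k);

                for(std::size_t i = i0; i < i1; ++i)
                {
                    for(std::size_t l = l0; l < l1; ++l)
                    {
                        const T a_il = a[i * k + l];
                        for(std::size_t j = j0; j < j1; ++j)
                        {
                            c[i * n + j] += a_il * b[l * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The result is the same as the untiled triple loop for exactly representable values; only the order in which partial products are accumulated changes, which is what improves data reuse.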

3.1.2.10. samples/simple_dlrm.cpp

Sample code that demonstrates a simple Deep Learning Recommendation Model (DLRM) for machine learning.

3.1.2.11. samples/common.hpp

Common code shared by all of the rocWMMA sample files above.

3.1.3. The test directory

3.1.3.1. test/bin

Script to generate benchmark plots from the gtest output dumps of the rocWMMA benchmark tests.

3.1.3.2. test/dlrm

Test code for various strategies of the DLRM application. These tests validate DLRM functions that use the rocWMMA API.

3.1.3.3. test/gemm

Test code for various strategies of the GEMM application. These tests validate and benchmark GEMM functions that use the rocWMMA API.

3.1.3.4. test/unit

Test code for the basic functional units of the rocWMMA library.

3.1.4. Infrastructure

  • CMake is used to build and package rocWMMA. CMakeLists.txt files are located throughout the source tree.

  • Doxygen/Breathe/Sphinx/ReadTheDocs are used to produce the documentation. Content for the documentation comes from:

    • Doxygen comments in the include files in the directory library/include

    • Files in the directory docs/source

  • Jenkins is used to automate Continuous Integration testing.

  • clang-format is used to format C++ code.