
Implement block multiply matrix mpi

Even though sparse linear algebra allows huge matrices to be represented very efficiently, it typically does not provide competitive performance compared to its dense counterparts when sparsity is below 95%. This is due to irregular computation and scattered memory accesses. In fact, many of the linear algebra applications that benefit from sparsity have over 99% sparsity in their matrices. To overcome this limitation, the NVIDIA Ampere architecture introduces the concept of fine-grained structured sparsity, which doubles the throughput of dense-matrix multiplies by skipping the computation of zero values in a 2:4 pattern. Recently, NVIDIA introduced the cuSPARSELt library to fully exploit these third-generation Sparse Tensor Core capabilities.

The primary alternative to fine-grained sparsity is to organize matrix entries/network weights in groups, such as vectors or blocks. This coarse-grained sparsity allows regular access patterns and locality, making the computation amenable to GPUs. In deep learning, block sparse matrix multiplication has been successfully adopted to reduce the complexity of the standard self-attention mechanism, as in Sparse Transformer models and extensions of them such as Longformer.

cuSPARSE Block-SpMM: Efficient, block-wise SpMM

Starting with cuSPARSE 11.4.0, the CUDA Toolkit provides a new high-performance block sparse matrix multiplication routine that exploits NVIDIA GPU dense Tensor Cores for the nonzero sub-matrices and significantly outperforms dense computations on Volta and newer architecture GPUs. Figure 1 shows the general matrix multiplication (GEMM) operation using the block sparse format.
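To make the Block-SpMM routine concrete, the sketch below shows one way to call it through the cuSPARSE generic API using the Blocked-ELL storage format it operates on. The 32x32 dimensions, the 16x16 block-diagonal sparsity pattern, and the all-ones data are invented here purely for illustration; the code is a minimal sketch that assumes cuSPARSE 11.4.0 or later, an FP16-capable GPU, and compilation with nvcc, not a tuned implementation.

```cuda
// Minimal Block-SpMM sketch: C = A * B, with A stored in Blocked-ELL format.
// Sizes, sparsity pattern, and data are made up for illustration only.
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <vector>

#define CHECK_CUDA(f)     { cudaError_t e = (f); if (e != cudaSuccess) {      \
    printf("CUDA error %d at line %d\n", (int)e, __LINE__); return 1; } }
#define CHECK_CUSPARSE(f) { cusparseStatus_t s = (f); if (s != CUSPARSE_STATUS_SUCCESS) { \
    printf("cuSPARSE error %d at line %d\n", (int)s, __LINE__); return 1; } }

int main() {
    const int64_t m = 32, k = 32, n = 32;  // C (m x n) = A (m x k) * B (k x n)
    const int64_t blockSize = 16;          // edge of each dense nonzero block
    const int64_t ellCols   = 16;          // width of the compressed block storage

    // One nonzero block per block-row: block-row 0 keeps block-column 0,
    // block-row 1 keeps block-column 1 (a block-diagonal pattern).
    std::vector<int>    hColInd = {0, 1};  // (m/blockSize) * (ellCols/blockSize) entries
    std::vector<__half> hValues(m * ellCols, __float2half(1.0f));
    std::vector<__half> hB(k * n, __float2half(1.0f));
    std::vector<__half> hC(m * n, __float2half(0.0f));

    int *dColInd; __half *dValues, *dB, *dC;
    CHECK_CUDA(cudaMalloc(&dColInd, hColInd.size() * sizeof(int)));
    CHECK_CUDA(cudaMalloc(&dValues, hValues.size() * sizeof(__half)));
    CHECK_CUDA(cudaMalloc(&dB, hB.size() * sizeof(__half)));
    CHECK_CUDA(cudaMalloc(&dC, hC.size() * sizeof(__half)));
    CHECK_CUDA(cudaMemcpy(dColInd, hColInd.data(), hColInd.size() * sizeof(int),    cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dValues, hValues.data(), hValues.size() * sizeof(__half), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dB, hB.data(), hB.size() * sizeof(__half), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dC, hC.data(), hC.size() * sizeof(__half), cudaMemcpyHostToDevice));

    cusparseHandle_t handle;
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;
    CHECK_CUSPARSE(cusparseCreate(&handle));
    // Sparse A in Blocked-ELL format, FP16 values, 0-based block column indices.
    CHECK_CUSPARSE(cusparseCreateBlockedEll(&matA, m, k, blockSize, ellCols,
                                            dColInd, dValues, CUSPARSE_INDEX_32I,
                                            CUSPARSE_INDEX_BASE_ZERO, CUDA_R_16F));
    // Dense B and C in row-major order, leading dimension = n.
    CHECK_CUSPARSE(cusparseCreateDnMat(&matB, k, n, n, dB, CUDA_R_16F, CUSPARSE_ORDER_ROW));
    CHECK_CUSPARSE(cusparseCreateDnMat(&matC, m, n, n, dC, CUDA_R_16F, CUSPARSE_ORDER_ROW));

    // FP16 inputs accumulated in FP32 (computeType = CUDA_R_32F).
    float alpha = 1.0f, beta = 0.0f;
    size_t bufSize = 0; void *dBuf = nullptr;
    CHECK_CUSPARSE(cusparseSpMM_bufferSize(handle,
        CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
        &alpha, matA, matB, &beta, matC, CUDA_R_32F,
        CUSPARSE_SPMM_BLOCKED_ELL_ALG1, &bufSize));
    CHECK_CUDA(cudaMalloc(&dBuf, bufSize));
    CHECK_CUSPARSE(cusparseSpMM(handle,
        CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
        &alpha, matA, matB, &beta, matC, CUDA_R_32F,
        CUSPARSE_SPMM_BLOCKED_ELL_ALG1, dBuf));

    CHECK_CUDA(cudaMemcpy(hC.data(), dC, hC.size() * sizeof(__half), cudaMemcpyDeviceToHost));
    printf("C[0] = %f\n", __half2float(hC[0]));  // expect 16.0 with this pattern

    CHECK_CUSPARSE(cusparseDestroySpMat(matA));
    CHECK_CUSPARSE(cusparseDestroyDnMat(matB));
    CHECK_CUSPARSE(cusparseDestroyDnMat(matC));
    CHECK_CUSPARSE(cusparseDestroy(handle));
    cudaFree(dBuf); cudaFree(dColInd); cudaFree(dValues); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

With this block-diagonal pattern, every row of A contains sixteen ones, so each entry of C should come back as 16.0. A real workload would instead build the block column-index and value arrays from its actual sparsity pattern and choose a block size aligned with the Tensor Core tile shapes, which is where the performance of the block-wise routine comes from.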













