The codes in this example are migrated from FlashMLA, it implements an efficient MLA decoding kernel for Hopper GPU.
python setup.py install
python tests/test_flash_mla.py