From: Colfax Research
FlashAttention-3 for Inference: INT8 Quantization and Query Head Packing for MQA/GQA (External)
https://research.colfax-intl.com/flashattention-3-for-inference-int8-quantization-and-query-head-packing-for-mqa-gqa-external/
Tagged with: benchmarks, deep learning, publications
In this blog post, presented on the Character.AI research blog, we explain two techniques that are important for using FlashAttention-3 for inference: in-kernel pre-processing of tensors via warp specialization, and query head packing for MQA/GQA.
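The query head packing idea can be illustrated outside the kernel. Below is a minimal PyTorch sketch, assuming the common GQA convention that consecutive query heads share one KV head; all names and shapes here are illustrative, and FlashAttention-3 performs this packing inside the kernel rather than via explicit tensor reshapes:

```python
import torch

# Illustrative GQA decode shapes (hypothetical, not taken from the post):
batch, seqlen_q, n_q_heads, head_dim = 2, 1, 32, 128
n_kv_heads = 8
group = n_q_heads // n_kv_heads      # 4 query heads share each KV head

q = torch.randn(batch, seqlen_q, n_q_heads, head_dim)

# Pack each group of query heads into the query sequence dimension, so a
# single attention "head" against one KV head processes seqlen_q * group
# query rows instead of seqlen_q. At decode time (seqlen_q == 1) this
# turns tiny per-head tiles into larger, better-utilized ones.
q_packed = (
    q.view(batch, seqlen_q, n_kv_heads, group, head_dim)
     .transpose(2, 3)   # (batch, seqlen_q, group, n_kv_heads, head_dim)
     .reshape(batch, seqlen_q * group, n_kv_heads, head_dim)
)

# Attention now runs with n_kv_heads heads and no K/V duplication; the
# output is unpacked with the inverse reshape.
out_packed = q_packed   # stand-in for the attention output
out = (
    out_packed.view(batch, seqlen_q, group, n_kv_heads, head_dim)
              .transpose(2, 3)
              .reshape(batch, seqlen_q, n_q_heads, head_dim)
)
```

Since every query head in a group attends to the same K and V, and softmax is taken independently per query row, the packed layout is mathematically identical to the unpacked one; only the tiling of the underlying GEMMs changes.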