Quantized Attention achieves 2-5x and 3-11x speedups over FlashAttention and xformers, respectively, without losing end-to-end accuracy across language, image, and video models.
Updated Aug 5, 2025 - CUDA
A lightweight Bun + Express template that connects to the Testune AI API and streams chat responses in real time using Server-Sent Events (SSE)
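The streaming side of such a template follows the standard SSE wire format: each message is sent as one or more `data:` lines terminated by a blank line. Below is a minimal sketch of an SSE frame encoder; the `sseFrame` helper and the `token` event name are illustrative, not part of the Testune AI API.

```typescript
// Minimal sketch of a Server-Sent Events (SSE) frame encoder, as used when
// streaming chat tokens to the browser. Field names ("event:", "data:")
// follow the SSE wire format; the actual upstream API is not shown here.
function sseFrame(data: string, event?: string): string {
  const lines: string[] = [];
  if (event) lines.push(`event: ${event}`);
  // Multi-line payloads get one "data:" field per line, per the SSE spec.
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  // A blank line terminates the frame.
  return lines.join("\n") + "\n\n";
}

// Example: res.write(sseFrame("Hello", "token")) inside an Express handler
// whose response has Content-Type: text/event-stream.
```

In a handler, you would set `Content-Type: text/event-stream` and `res.write()` one frame per chunk as tokens arrive from the upstream API.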