ELSA: exact attention without Tensor Cores, reopening room for optimization on FP32 and edge GPUs

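This Threads post introduces ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers. The official arXiv ID is 2604.23798, the venue is CVPR Findings 2026, and the authors are from the Advanced Computer Vision Laboratory at National Yang Ming Chiao Tung University.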

ELSA's positioning is clear:

It is an exact attention kernel that needs no Tensor Cores and no retraining of the foundation model. As a drop-in replacement, it fills the gap FlashAttention leaves on FP32 and constrained hardware.

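Why does this matter?

FlashAttention-2 / 3 is very strong, but its fast path depends on Ampere / Hopper Tensor Core instructions such as HMMA / GMMA. For some hardware and workloads this becomes a constraint:

Edge devices such as Jetson TX2 lack the corresponding Tensor Core capability.
AMD GPUs and older server GPUs cannot use the NVIDIA Tensor Core path directly.
High-resolution imaging, medical imaging, hyperspectral remote sensing, and scientific computing often require FP32 precision.
FlashAttention's FP32 fallback may drop to a less optimized SIMD / math path, and the speedup disappears.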

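ELSA's core idea: rewrite online softmax attention as a prefix scan.

Conventional online softmax maintains a running maximum sequentially, so each step depends on the previous one and the parallel depth is O(n). ELSA writes the softmax state as a monoid triple:

(m, S, W)

and defines a merge operator that combines the states of any two adjacent blocks exactly. Each block can then be computed in parallel first and merged afterwards with a reduction tree / prefix scan, which lowers the parallel depth from O(n) to O(log n).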
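To make the merge step concrete, here is a minimal pure-Python sketch for a single query row. It assumes the usual reading of the triple (m is the running maximum of the scores, S the sum of exponentials relative to m, W the exponentially weighted sum of value vectors); the post does not spell out the components, and the names `leaf`, `merge`, `tree_reduce`, `attention_row`, and `reference_row` are hypothetical. It illustrates the algebra only and is not the ELSA kernel.

```python
import math

def leaf(score, value):
    """State for one key: max = score, normalizer = exp(0) = 1, weighted sum = value."""
    return (score, 1.0, list(value))

def merge(a, b):
    """Combine two softmax states (m, S, W); rescale both sides to the shared max."""
    m_a, s_a, w_a = a
    m_b, s_b, w_b = b
    m = max(m_a, m_b)
    c_a = math.exp(m_a - m)  # rescale factor for the left state
    c_b = math.exp(m_b - m)  # rescale factor for the right state
    s = s_a * c_a + s_b * c_b
    w = [x * c_a + y * c_b for x, y in zip(w_a, w_b)]
    return (m, s, w)

def tree_reduce(states):
    """Pairwise reduction over a non-empty list; about log2(n) levels deep."""
    while len(states) > 1:
        nxt = [merge(states[i], states[i + 1])
               for i in range(0, len(states) - 1, 2)]
        if len(states) % 2:  # odd element carries over to the next level
            nxt.append(states[-1])
        states = nxt
    return states[0]

def attention_row(scores, values):
    """Softmax-weighted average of values, computed via the merge tree."""
    _, s, w = tree_reduce([leaf(sc, v) for sc, v in zip(scores, values)])
    return [x / s for x in w]

def reference_row(scores, values):
    """Plain softmax attention for one query row, for comparison."""
    mx = max(scores)
    e = [math.exp(sc - mx) for sc in scores]
    z = sum(e)
    dim = len(values[0])
    return [sum(e[j] * values[j][k] for j in range(len(scores))) / z
            for k in range(dim)]
```

Because `merge` is associative, the grouping of the merges does not change the result in real arithmetic, which is what allows a tree or scan to replace the left-to-right loop.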

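Three key points from the official arXiv abstract:

Preserves exact softmax semantics

In real arithmetic it preserves exact softmax semantics, and the FP32 relative error bound is O(u log n), where u is the unit roundoff. This compares favorably with the O(nu) error that sequential depth produces.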
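The gap between the two bounds mirrors the classical worst-case results for floating-point summation, shown here as an analogy rather than as the paper's proof. The block assumes positive terms x_i (as the exponentials in a softmax normalizer are), s_n as their exact sum, and u = 2^{-24} for FP32.

```latex
% Worst-case relative error of summing n positive terms x_1, ..., x_n
% (s_n exact sum, \hat{s}_n computed sum, u the unit roundoff).
\[
\text{sequential: }\;
\frac{|\hat{s}_n - s_n|}{s_n} \le \gamma_{n-1},
\qquad
\text{pairwise tree: }\;
\frac{|\hat{s}_n - s_n|}{s_n} \le \gamma_{\lceil \log_2 n \rceil},
\qquad
\gamma_k = \frac{k u}{1 - k u} \approx k u .
\]
% Example with n = 16384 and u = 2^{-24} \approx 5.96 \times 10^{-8}:
%   sequential bound  \approx 16383\,u \approx 9.8 \times 10^{-4}
%   tree bound        \approx 14\,u    \approx 8.3 \times 10^{-7}
```

These are worst-case bounds; typical observed errors are smaller in both cases, but the dependence on depth (n versus log n) is the point.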

O(n) extra memory + O(log n) parallel depth

It casts the online softmax update as a prefix scan over an associative monoid, so no O(n²) score matrix is needed and the extra memory is O(n).
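A rough back-of-the-envelope comparison shows why avoiding the score matrix matters in FP32. The sketch below assumes one (m, S, W) state per query row with a head dimension of 64; that layout and both helper names are illustrative assumptions, not figures from the paper.

```python
def score_matrix_bytes(n_tokens, bytes_per_elem=4):
    """Memory for a full n x n attention score matrix (one head, FP32 by default)."""
    return n_tokens * n_tokens * bytes_per_elem

def scan_state_bytes(n_tokens, head_dim, bytes_per_elem=4):
    """Memory for one (m, S, W) state per query row: 2 scalars + head_dim values."""
    return n_tokens * (head_dim + 2) * bytes_per_elem

if __name__ == "__main__":
    n = 16 * 1024  # 16K tokens
    print(score_matrix_bytes(n) / 2**20, "MiB for the score matrix")
    print(scan_state_bytes(n, 64) / 2**20, "MiB for per-row scan state")
```

Under these assumptions, at 16K tokens the score matrix takes 1 GiB per head while the per-row state takes about 4 MiB, and the gap widens quadratically as n grows.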

Tensor-Core independent

Implementations are provided in Triton and CUDA C++. It can be deployed as a drop-in replacement, with no retraining and no changes to the weights.

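The Threads post adds detail on the implementation:

Multiple tiles are loaded into an SRAM block at the same time.
Within a block, a Hillis–Steele scan is used.
Across blocks, a Blelloch sweep / reduction integrates the cross-block state.
No Tensor Cores are required.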
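The two-level structure described above can be sketched in plain Python with a generic associative operator. This is a simplified illustration: `hillis_steele_scan` and `blocked_scan` are hypothetical names, the per-step updates that a GPU would run in parallel are written as a list comprehension, and the cross-block step is a plain sequential carry standing in for the Blelloch sweep.

```python
import operator
from itertools import accumulate

def hillis_steele_scan(xs, op):
    """Inclusive scan in about log2(n) steps; updates within a step are independent."""
    cur = list(xs)
    offset = 1
    while offset < len(cur):
        # Each element at index >= offset combines with the one `offset` to its left.
        # Reading from the old list and building a new one mimics double buffering.
        cur = [cur[i] if i < offset else op(cur[i - offset], cur[i])
               for i in range(len(cur))]
        offset *= 2
    return cur

def blocked_scan(xs, op, block_size):
    """Scan each block independently, then fold in the prefix of all earlier blocks."""
    blocks = [xs[i:i + block_size] for i in range(0, len(xs), block_size)]
    local = [hillis_steele_scan(b, op) for b in blocks]  # independent per block
    out = []
    carry = None  # inclusive prefix of everything before the current block
    for scanned in local:
        if carry is None:
            out.extend(scanned)
        else:
            out.extend(op(carry, x) for x in scanned)
        carry = out[-1]
    return out

if __name__ == "__main__":
    data = list(range(1, 11))
    print(blocked_scan(data, operator.add, 4))
    print(list(accumulate(data)))  # reference result
```

Any associative operator can be passed as `op`, so the same skeleton would apply to a softmax-state merge in place of addition.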

Official performance highlights:

A100 FP32 benchmarks, 1K–16K tokens: 1.3–3.5× faster than memory-efficient SDPA.
BERT FP32: 1.97–2.27×.
Jetson TX2: about 1.5–1.6× faster than the Math kernel.
LLaMA-13B offloading: throughput up 17.8–20.2% at 32K+ tokens.
The project page reports that for CLIP ViT-L/14 in strict FP32, full image encoder latency drops by 3.7%, with 1.46–2.15× at the attention module level; the full-pipeline gain is diluted by non-attention compute.

GitHub repo: ming053l/ELSA

Information at the time of lookup:

Description: [CVPR 2026 Findings] ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
The code / README provides Triton / CUDA kernels, a PyTorch module, timm ViT / Swin patches, ElsaViT, ElsaSwinTransformerV2, and HuggingFace LLaMA patch examples.
The GitHub API reports the license as NOASSERTION / Other, so check the license details before adopting it.

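What this suggests for AI infra:

Efficient attention should not serve only the latest data-center GPUs

If attention optimizations only work on newer cards such as A100 / H100, then edge AI, medical, scientific computing, and legacy hardware deployments are left out. ELSA's value lies in reopening optimization headroom on that hardware.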

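FP32 workloads still exist

Many people doing LLM inference are used to FP16 / BF16 / INT4, but medical imaging, scientific data, and remote sensing often need FP32 precision. ELSA's optimization of exact FP32 attention is therefore especially relevant to these verticals.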

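Algorithmic reformulation can be more effective than piling on hardware

ELSA does not invent new hardware. It rewrites the sequential dependency in softmax attention as a mergeable monoid scan. This is elegant systems research: it starts from the mathematical structure and unlocks GPU parallelism again.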

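Assessment for BigIntTech / local AI:

ELSA is not necessarily something we would put into production right away, but it belongs on the AI infra radar. In particular, if we later work on:

edge device inference
medical / imaging AI
high-resolution vision transformers
long-context local models
optimization for older GPUs / non-Tensor-Core hardware

then ELSA should be benchmarked alongside FlashAttention, xFormers, SDPA, Lucebox, and similar options.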

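My conclusion:

ELSA's value is not the "another 3.5× faster" number. It is that ELSA points to a path for attention optimization that differs from the Tensor Core route: using only the basic instructions every GPU has, and an exact prefix-scan reformulation, it recovers performance on FP32 and edge hardware.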

References:

Threads post: https://www.threads.com/@ming0531___/post/DX0zySZiSVp
arXiv: ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers, 2604.23798
GitHub: ming053l/ELSA
Project page: https://ming053l.github.io/ELSA_projectpage/
