<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: ow5</title><link>https://news.ycombinator.com/user?id=ow5</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Tue, 05 May 2026 08:35:23 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=ow5" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by ow5 in "Lossless LLM compression for efficient GPU inference via dynamic-length float"]]></title><description><![CDATA[
<p>Hi! One of the contributors to the paper here. We have kernels, not yet released, that can shave decoding latency by >20%.</p><p>Also, when we ran streaming experiments with the current kernels, our inference was a median of ~1.3x slower.</p>
]]></description><pubDate>Fri, 25 Apr 2025 20:39:10 +0000</pubDate><link>https://news.ycombinator.com/item?id=43798291</link><dc:creator>ow5</dc:creator><comments>https://news.ycombinator.com/item?id=43798291</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=43798291</guid></item></channel></rss>