vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
https://arxiv.org/abs/2405.04437

From the abstract: "PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation -- a phenomenon that crippled the batch size (and consequently throughput) in prior systems. …"

Introduction

When serving large language models (LLMs), ..
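
The abstract's core idea, handing out fixed-size KV cache blocks on demand instead of reserving a max-length slab per request, can be illustrated with a small sketch. Everything below (KVBlockAllocator, Sequence, block_size, and so on) is an illustrative assumption of mine, not vLLM's or vAttention's actual API; it shows only the bookkeeping, not the attention kernel.

```python
# Minimal sketch of on-demand, block-granular KV cache allocation
# (the PagedAttention idea the abstract refers to). Illustrative only.

class KVBlockAllocator:
    """Hands out fixed-size KV cache blocks from a shared free pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size             # tokens per block
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks the (possibly non-contiguous) blocks backing one request."""

    def __init__(self, allocator: KVBlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []         # logical -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grow the cache one block at a time, only when the current block
        # fills up: this is the on-demand allocation that avoids reserving
        # max_seq_len worth of memory up front for every request.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


if __name__ == "__main__":
    alloc = KVBlockAllocator(num_blocks=8, block_size=16)
    seq = Sequence(alloc)
    for _ in range(40):                          # 40 tokens -> ceil(40/16) = 3 blocks
        seq.append_token()
    print(seq.block_table)                       # e.g. [7, 6, 5]
```

A real system stores the key/value tensors inside these blocks and indexes them through the block table at attention time; the sketch keeps only the allocation logic, since that is where fragmentation is avoided, and it is this non-contiguous virtual layout that vAttention argues against.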