Global Multi-Level KV Cache¶
Background¶
In the decoding phase of large language models (LLMs), frequent access to historical KV cache due to autoregressive generation creates a bottleneck in memory bandwidth. As model sizes and context windows expand (e.g., 128K Tokens consuming over 40GB of memory), the pressure on single-device memory increases dramatically. Existing solutions (such as vLLM) exhibit significant limitations in long-context scenarios: prefill time surges, severe memory bandwidth contention during decoding, and the need for excessive resource reservation to meet SLO requirements (TTFT < 2s, TBT < 100ms). This often results in GPU utilization below 40% and difficulties in leveraging cross-server resources. To address this, we propose a distributed global multi-level KV cache management system, adopting a memory-compute integrated architecture to break through single-machine resource constraints.
Feature Introduction¶
The xLLM Global KV Cache feature is primarily implemented through the following three modules: - etcd: For cluster service registration, load information synchronization, and global cache state management. - xLLM Service: For scheduling requests and managing all compute instances. - xLLM: The compute instances handling requests.
The overall architecture is shown in the diagram below: 
Usage Example¶
Preparation¶
Install Dependencies¶
- xLLM: Refer to Installation && Compilation
- xLLM Service: Refer to PD disaggregation
Usage Instructions¶
-
etcd Startup Configuration:
-
xLLM Service Startup Configuration:
-
xLLM Startup Configuration: Add the following gflag parameters when starting xLLM: