Disaggregated PD¶
Background¶
LLM online inference services typically need to meet two performance metrics: TTFT and TPOT. Traditional Contiguous Batching scheduling strategies mix Prefill and Decode requests during scheduling, causing Prefill and Decode phases to compete for computational resources. This prevents maximized utilization of computing resources and impacts performance metrics. To resolve this conflict, the Prefill and Decode phases are split to run on independent computational resources, enabling parallel execution. This simultaneously reduces TTFT and TPOT while improving throughput.
Introduction¶
The xLLM PD Separation feature is primarily implemented through the following three modules:
- etcd: Stores metadata such as instance information.
- xLLM Service: Schedules requests and manages all computing instances.
- xLLM: Handles request computation instances.
The overall architecture is shown below: 
Usage¶
Preparation¶
Install Dependencies¶
- xLLM: Refer to Installation && Compilation
- xLLM Service: Refer to PD disaggregation
Obtain Environment Information¶
Deploying Disaggregated PD Service requires obtaining the Device IP of the machine to create communication resources. Execute the command cat /etc/hccn.conf | grep address on the current AI Server to get the Device IP, for example:
address_xx represents the Device IP. Start Disaggregated PD Service¶
- Start etcd
- Start xLLM Service
- Start xLLM
-
Taking Qwen2-7B as an example
- Start Prefill Instance
/path/to/xllm --model=Qwen2-7B-Instruct \ --port=8010 \ --devices="npu:0" \ --master_node_addr="127.0.0.1:18888" \ --enable_prefix_cache=false \ --enable_chunked_prefill=false \ --enable_disagg_pd=true \ --instance_role=PREFILL \ --etcd_addr="127.0.0.1:12389" \ --transfer_listen_port=26000 \ --disagg_pd_port=7777 \ --node_rank=0 \ --nnodes=1 -
Start Decode Instance
Important notes:/path/to/xllm --model=Qwen2-7B-Instruct \ --port=8020 \ --devices="npu:1" \ --master_node_addr="127.0.0.1:18898" \ --enable_prefix_cache=false \ --enable_chunked_prefill=false \ --enable_disagg_pd=true \ --instance_role=DECODE \ --etcd_addr="127.0.0.1:12389" \ --transfer_listen_port=26100 \ --disagg_pd_port=7787 \ --node_rank=0 \ --nnodes=1 -
PD disaggregation requires reading the
/etc/hccn.conffile. Make sure this file on the physical machine is mapped into the container. -
etcd_addrmust match theetcd_addrofxllm_service
- Start Prefill Instance
Notice¶
With the chunked-prefill PD scheduler, the Prefill instance supports prefix cache. When enabled, the scheduler matches existing prefix-cache blocks before calculating the current chunk budget, so cached prompt blocks are not recomputed:
Decode instance: prefix cache is supported (recommended)¶
Starting from xLLM v0.x, the Decode instance supports --enable_prefix_cache=true. When a request carries best_of > 1 (best-of-N sampling), the Decode instance must enable prefix cache, otherwise the request is rejected. With prefix cache enabled, expanded candidate sequences reuse the first sequence's prompt KV via the prefix cache, avoiding duplicate compute and memory.
best_of_n support in disaggregated PD¶
- Requirements: Decode instance must enable
--enable_prefix_cache=true. Any request withbest_of > 1while the Decode instance hasenable_prefix_cache=falseis rejected. - Flow: The Prefill instance only prefills the first sequence and ships its KV to the Decode instance. On the Decode side, after the first token is received, the prompt KV is registered into the prefix cache so that the expanded
best_of-1candidate sequences hit it via prefix match. - MLU platform: PD + best_of_n is not supported yet (gated by
normalize_mlu). best_of != n: Streaming is forcibly disabled (same as the non-PD behavior).