Skip to content

PD disaggregation

xllm supports PD disaggregation deployment, which requires integration with our other open-source library xllm service.

xLLM Service Dependencies

First, download and install xllm service, similar to installing and compiling xllm:

git clone https://github.com/jd-opensource/xllm-service
cd xllm_service
git submodule init
git submodule update

etcd Installation

xllm_service compilation and operation depend on etcd.Use the installation script provided by etcd for installation. The default installation path provided by the script is /tmp/etcd-download-test/etcd. You can either manually modify the installation path in the script or manually migrate after running the script:

mv /tmp/etcd-download-test/etcd /path/to/your/etcd

xLLM Service Compilation

Apply patch:

sh prepare.sh
Then execute the compilation:
mkdir -p build
cd build
cmake ..
make -j 8
cd ..

Potential Errors

You may encounter installation errors related to boost-locale and boost-interprocess: vcpkg-src/packages/boost-locale_x64-linux/include: No such file or directory, /vcpkg-src/packages/boost-interprocess_x64-linux/include: No such file or directory Reinstall these packages using vcpkg:

/path/to/vcpkg remove boost-locale boost-interprocess
/path/to/vcpkg install boost-locale:x64-linux
/path/to/vcpkg install boost-interprocess:x64-linux

PD Disaggregation Execution

Start etcd:

./etcd-download-test/etcd --listen-peer-urls 'http://localhost:2390'  --listen-client-urls 'http://localhost:2389' --advertise-client-urls 'http://localhost:2391'

Start xllm service:

ENABLE_DECODE_RESPONSE_TO_SERVICE=true ./xllm_master_serving --etcd_addr="127.0.0.1:12389" --http_server_port 28888 --rpc_server_port 28889 --tokenizer_path=/path/to/tokenizer_config_dir/

Taking Qwen2-7B as an example:

  • Start Prefill Instance
    ./xllm --model=path/to/Qwen2-7B-Instruct \
           --port=8010 \
           --devices="npu:0" \
           --master_node_addr="127.0.0.1:18888" \
           --enable_prefix_cache=false \
           --enable_chunked_prefill=false \
           --enable_disagg_pd=true \
           --instance_role=PREFILL \
           --etcd_addr=127.0.0.1:12389 \
           --device_ip=xx.xx.xx.xx \ # Replace with actual Device IP 
           --transfer_listen_port=26000 \
           --disagg_pd_port=7777 \
           --node_rank=0 \
           --nnodes=1
    
  • Start Decode Instance
    ./xllm --model=path/to/Qwen2-7B-Instruct \
           --port=8020 \
           --devices="npu:1" \
           --master_node_addr="127.0.0.1:18898" \
           --enable_prefix_cache=false \
           --enable_chunked_prefill=false \
           --enable_disagg_pd=true \
           --instance_role=DECODE \
           --etcd_addr=127.0.0.1:12389 \
           --device_ip=xx.xx.xx.xx \ # Replace with actual Device IP 
           --transfer_listen_port=26100 \
           --disagg_pd_port=7787 \
           --node_rank=0 \
           --nnodes=1
    

Important notes:

  • For PD disaggregation when specifying NPU Device, the corresponding device_ip is required. This is different for each device. You can see this by executing the following command on the physical machine outside the container environment. The value after address_{i}= displayed is the device_ip corresponding to NPU {i}.

    sudo cat /etc/hccn.conf
    

  • etcd_addr must match the etcd_addr of xllm_service

The test command is similar to above. Note that the PORT in curl http://localhost:{PORT}/v1/chat/completions ... should be the port of the http_server_port of xLLM service.