Skip to content

Service Startup Parameters

xLLM uses gflags to manage service startup parameters. The specific parameter meanings are as follows:

Common Parameters

Parameter Name Data Type Default Value Other Values Description Notes
master_node_addr string "127.0.0.1:19888" ip:port The listening address of the master node's rpc server Details
host string "" The machine IP where the current device is located The host IP used by the current device for communication. An rpc server is started on each device for multi-device communication.
port int32 8010 Any available port Used in conjunction with the host parameter. The combination is used for rpc communication between devices.
model string "" Path to the model.
devices string "npu:0" Specifies the NPU devices used by the current process.
nnodes int32 1 The total number of devices used by the current service.
node_rank int32 0 0 ~ (total devices - 1) The rank id of each device.
max_memory_utilization double 0.8 Between 0-1 The maximum proportion of device memory available for model weights and KV Cache combined.
max_tokens_per_batch int32 10240 The maximum number of tokens that can be computed per step.
max_seqs_per_batch int32 1024 The maximum number of sequences that can be computed per step.
enable_chunked_prefill bool true false Whether to enable chunked prefill.
enable_prefill_sp bool false true Whether to enable prefill-only sequence parallel. enable_chunked_prefill=true is supported only for prefill-only batches (PREFILL / CHUNKED_PREFILL); MIXED and DECODE batches do not run with sequence parallel.
enable_schedule_overlap bool false true Whether to enable asynchronous scheduling. Details
enable_prefix_cache bool true false Whether to enable prefix cache (not supported by DeepSeek currently).
communication_backend string "hccl" "lccl" The backend used for communication operations.
block_size int32 128 The block size for KV Cache storage.
task string "generate" "embed", "mm_embed" Service type: generation, embedding, or multimodal embedding.
max_cache_size int64 0 The usable KV Cache size in bytes.
kv_cache_dtype string "auto" "int8" KV Cache data type. "auto" aligns with model dtype (no quantization), "int8" enables INT8 quantization to save ~50% memory. MLU backend only.
Parameter Name Type Default Value Other Values Description Notes
dp_size int32 1 Power of 2 The dp scale size for the Attention part.
ep_size int32 1 Power of 2 The ep scale size for the MoE part.
expert_parallel_degree int32 0 1,2 Parameter related to ep parallelism. Defaults to 0 when ep is not used, and to 1 when ep is enabled. Can be set to 2 when ep_size equals the total number of devices (uses all2all communication).
Parameter Name Type Default Value Other Values Description Notes
enable_disagg_pd bool false true Whether to enable P-D separation. Details
disagg_pd_port int32 7777 Any available port Configuration when P-D separation is enabled. Corresponds to the listening port number of the pd separation rpc server started on each card.
instance_role string DEFAULT PREFILL, DECODE, MIX Defaults to DEFAULT. Must be configured as PREFILL, DECODE, or MIX when P-D separation is enabled.
kv_cache_transfer_mode string "PUSH" "PULL" The mode for transferring KV Cache in P-D separation. PUSH mode: Prefill transmits layer by layer to Decode; PULL mode: Decode pulls the KV Cache from Prefill in one go.
transfer_listen_port int32 26000 Any available port Configuration when P-D separation is enabled. Corresponds to the listening port for KV Cache Transfer on each card.
Parameter Name Type Default Value Other Values Description Notes
draft_model string "" Path to the MTP model. Details
draft_devices string "npu:0" Same format as devices, e.g. npu:0 or npu:0,npu:1 Should be set consistently with the devices parameter.
num_speculative_tokens int32 0 Any integer, suggestion 1 or 2 The number of tokens output by the MTP model per step.
Parameter Name Type Default Value Other Values Description Notes
enable_graph bool false true Whether to enable graph execution mode to optimize decode phase performance. Only applied during decode phase and does not take effect during prefill phase. Supports ACL Graph (NPU), and MLU Graph. Details
enable_graph_mode_decode_no_padding bool false true Builds decode graphs with the actual num_tokens instead of the padded shape.
enable_prefill_piecewise_graph bool false true Whether to enable piecewise graph for prefill phase. Attention runs eagerly while other ops are captured into CUDA graphs.
max_tokens_for_graph_mode int32 2048 Any integer greater than or equal to 0 Maximum number of tokens for graph execution. If 0, no limit is applied.

Parameters for Use with xLLM-service

Parameter Name Type Default Value Other Values Description Notes
etcd_addr string "" ip:port The listening address of the etcd's rpc server.
enable_service_routing bool false true Whether the request from the xllm service, use this when enable the xllm service.

Other Parameters

Parameter Name Type Default Value Other Values Description Notes
max_concurrent_requests int32 200 Any integer greater than or equal to 0 For rate limiting, restricts the total number of requests being processed in the instance. Set to 0 for no limit.
model_id string "" Model name, not a path.
num_request_handling_threads int32 4 Any integer greater than 0 The thread pool size for handling input requests.
prefill_scheduling_memory_usage_threshold double 0.95 Value between 0-1 When kv cache usage reaches this threshold, scheduling of prefill requests is paused.
num_response_handling_threads int32 4 Any integer greater than 0 The thread pool size for handling outputs.
rank_tablefile string "" Configuration file for creating the communication domain. Required for multi-node scenarios.