The machine IP where the current device is located
The host IP used by the current device for communication. An RPC server is started on each device for multi-device communication.
| Parameter | Type | Default | Optional values | Description |
| --- | --- | --- | --- | --- |
| port | int32 | 8010 | Any available port | Used together with the host parameter. The host and port pair is used for RPC communication between devices. |
| model | string | "" |  | Path to the model. |
| devices | string | "npu:0" |  | Specifies the NPU devices used by the current process. |
| nnodes | int32 | 1 |  | The total number of devices used by the current service. |
| node_rank | int32 | 0 | 0 to (total devices - 1) | The rank ID of each device. |
| max_memory_utilization | double | 0.8 | Between 0 and 1 | The maximum proportion of device memory available for model weights and KV Cache combined. |
| max_tokens_per_batch | int32 | 10240 |  | The maximum number of tokens that can be computed per step. |
| max_seqs_per_batch | int32 | 1024 |  | The maximum number of sequences that can be computed per step. |
| enable_chunked_prefill | bool | true | false | Whether to enable chunked prefill. |
| enable_prefill_sp | bool | false | true | Whether to enable prefill-only sequence parallel. When enable_chunked_prefill=true, sequence parallel applies only to prefill-only batches (PREFILL / CHUNKED_PREFILL); MIXED and DECODE batches do not run with sequence parallel. |
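
The documented ranges for these launch parameters can be checked before a process is started. The following is a minimal sketch, not part of the project: `LaunchConfig`, `validate`, `to_flags`, and `memory_budget_gib` are hypothetical names, the `--name=value` flag syntax is an assumption, and whether the bounds of `max_memory_utilization` are inclusive is also assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchConfig:
    """Hypothetical container mirroring the documented parameters and defaults."""
    model: str = ""
    port: int = 8010
    devices: str = "npu:0"
    nnodes: int = 1
    node_rank: int = 0
    max_memory_utilization: float = 0.8
    max_tokens_per_batch: int = 10240
    max_seqs_per_batch: int = 1024
    enable_chunked_prefill: bool = True
    enable_prefill_sp: bool = False

    def validate(self) -> List[str]:
        """Return a list of problems; the list is empty when all checks pass."""
        problems = []
        # node_rank is documented as 0 to (total devices - 1).
        if not 0 <= self.node_rank <= self.nnodes - 1:
            problems.append("node_rank must be in 0 .. nnodes - 1")
        # Documented as "between 0 and 1"; inclusive bounds are an assumption.
        if not 0.0 <= self.max_memory_utilization <= 1.0:
            problems.append("max_memory_utilization must be between 0 and 1")
        # Any available port: at minimum it must be a valid TCP port number.
        if not 1 <= self.port <= 65535:
            problems.append("port must be in 1 .. 65535")
        return problems

    def to_flags(self) -> List[str]:
        """Render every field as --name=value (assumed flag syntax)."""
        flags = []
        for name, value in vars(self).items():
            if isinstance(value, bool):
                value = str(value).lower()  # True -> "true"
            flags.append(f"--{name}={value}")
        return flags


def memory_budget_gib(device_memory_gib: float, utilization: float) -> float:
    """Memory available for model weights plus KV Cache under the cap."""
    return device_memory_gib * utilization


if __name__ == "__main__":
    cfg = LaunchConfig(model="/path/to/model", nnodes=2, node_rank=1)
    print(cfg.validate())
    print(" ".join(cfg.to_flags()))
    # With the default cap of 0.8, a 64 GiB device leaves about 51.2 GiB
    # for weights and KV Cache combined.
    print(memory_budget_gib(64.0, cfg.max_memory_utilization))
```

Returning a list of problems instead of raising on the first one lets a launcher script report every misconfigured parameter in a single pass.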
Parameter for expert parallelism (EP). Defaults to 0 when EP is not used and to 1 when EP is enabled. Can be set to 2 when ep_size equals the total number of devices (uses all2all communication).
Used when P-D separation is enabled. The listening port of the P-D separation RPC server started on each card.
| Parameter | Type | Default | Optional values | Description |
| --- | --- | --- | --- | --- |
| instance_role | string | DEFAULT | PREFILL, DECODE, MIX | Defaults to DEFAULT. Must be configured as PREFILL, DECODE, or MIX when P-D separation is enabled. |
| kv_cache_transfer_mode | string | "PUSH" | "PULL" | The mode for transferring KV Cache in P-D separation. PUSH mode: Prefill transmits layer by layer to Decode; PULL mode: Decode pulls the KV Cache from Prefill in one go. |
| transfer_listen_port | int32 | 26000 | Any available port | Configuration when P-D separation is enabled. Corresponds to the listening port for KV Cache Transfer on each card. |
Whether to enable graph execution mode to optimize decode-phase performance. It applies only during the decode phase and has no effect during the prefill phase. Supports ACL Graph (NPU) and MLU Graph.
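
The P-D separation parameters (instance_role, kv_cache_transfer_mode, transfer_listen_port) carry constraints that are easy to get wrong when configuring separate Prefill and Decode instances. The sketch below illustrates those constraints and the difference between the two transfer modes. It is a toy model under stated assumptions: `check_pd_settings`, `transfer_steps`, and the `pd_enabled` argument are hypothetical names, and the transfer "steps" are plain strings, not real network operations.

```python
from typing import List

VALID_ROLES = {"DEFAULT", "PREFILL", "DECODE", "MIX"}
PD_ROLES = {"PREFILL", "DECODE", "MIX"}  # roles required under P-D separation
VALID_TRANSFER_MODES = {"PUSH", "PULL"}


def check_pd_settings(pd_enabled: bool,
                      instance_role: str = "DEFAULT",
                      kv_cache_transfer_mode: str = "PUSH",
                      transfer_listen_port: int = 26000) -> List[str]:
    """Return a list of problems; pd_enabled is a hypothetical switch."""
    problems = []
    if instance_role not in VALID_ROLES:
        problems.append(f"unknown instance_role: {instance_role}")
    elif pd_enabled and instance_role not in PD_ROLES:
        problems.append("instance_role must be PREFILL, DECODE, or MIX "
                        "when P-D separation is enabled")
    if kv_cache_transfer_mode not in VALID_TRANSFER_MODES:
        problems.append(f"unknown kv_cache_transfer_mode: {kv_cache_transfer_mode}")
    if not 1 <= transfer_listen_port <= 65535:
        problems.append("transfer_listen_port must be in 1 .. 65535")
    return problems


def transfer_steps(mode: str, num_layers: int) -> List[str]:
    """Toy model of the KV Cache transfers for one request."""
    if mode == "PUSH":
        # Prefill sends the KV Cache to Decode layer by layer.
        return [f"prefill pushes layer {i} to decode" for i in range(num_layers)]
    if mode == "PULL":
        # Decode fetches the KV Cache of all layers in a single transfer.
        return [f"decode pulls layers 0..{num_layers - 1} from prefill"]
    raise ValueError(f"unknown kv_cache_transfer_mode: {mode}")


if __name__ == "__main__":
    print(check_pd_settings(True, instance_role="PREFILL"))
    print(check_pd_settings(True, instance_role="DEFAULT"))
    for step in transfer_steps("PUSH", 3):
        print(step)
    for step in transfer_steps("PULL", 3):
        print(step)
```

In this model PUSH produces one transfer per layer while PULL produces a single transfer, which matches the description of the two modes above; the actual scheduling and overlap with computation are outside the scope of the sketch.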