Skip to content

MTP Speculative Inference

Background

MTP (Multi-Token Prediction) is an innovative inference acceleration technique that addresses efficiency bottlenecks in large language model generation. By incorporating specialized pre-training designs, MTP provides efficient draft token prediction capabilities during inference, significantly improving generation speed. Its core value lies in balancing inference efficiency with output quality, offering an optimal solution for long-sequence generation problems in LLMs, ultimately optimizing inference performance.

Key Features

MTP offers the following core acceleration capabilities:

  • Efficient Draft Generation: Uses a lightweight MTP architecture to rapidly generate draft tokens that serve as input for the main model's verification, dramatically reducing computation overhead compared to traditional autoregressive generation.

  • Batch Verification Mechanism: The main model can simultaneously verify multiple MTP-generated draft tokens in batch, rather than processing them sequentially, significantly boosting inference speed.

  • High Sampling Accuracy: MTP solves the critical pain point of low token acceptance rates in post-training draft modules (like Eagle and Medusa). By optimizing draft generation during pre-training, MTP produces tokens with higher accuracy, reducing the verification burden on the main model.

  • Reduced Inference Latency: By pre-generating multiple potential subsequent tokens, MTP effectively decreases cumulative latency during long-text generation, creating a smoother user experience.

  • Optimized Resource Consumption: Compared to other inference acceleration techniques, MTP maintains acceleration effects while requiring fewer additional computational resources, making it suitable for deployment in resource-constrained environments.

MTP technology provides a novel efficiency optimization solution for LLM inference, particularly well-suited for real-time applications requiring rapid responses, representing an important direction in language model inference optimization.

Model Support

Currently only supports Deepseek's MTP architecture. Support for other models will be added in future updates.

Usage Example

Export Model

./tools/export_deepseek_mtp.py --input-dir /path/to/DeepSeek-V3 --output-dir /path/to/DeepSeek-V3-mtp
Input model reference:Deepseek-V3

Launch Script

MODEL_PATH="/models/DeepSeek-V3"
DRAFT_MODEL_PATH="/models/DeepSeek-V3-MTP"
MASTER_NODE_ADDR="127.0.0.1:42123"
START_PORT=13222
START_DEVICE=0
LOG_DIR="log"
NNODES=16

for (( i=0; i<$NNODES; i++ ))
do
  PORT=$((START_PORT + i))
  DEVICE=$((START_DEVICE + i))
  LOG_FILE="$LOG_DIR/node_$i.log"
  nohup ./xllm \
    --model $MODEL_PATH \
    --devices="npu:$DEVICE" \
    --port $PORT \
    --master_node_addr=$MASTER_NODE_ADDR \
    --nnodes=$NNODES \
    --draft_model $DRAFT_MODEL_PATH \
    --draft_devices="npu:$DEVICE" \
    --num_speculative_tokens 1 \
    --max_memory_utilization=0.90 \
    --max_tokens_per_batch=10000 \
    --max_seqs_per_batch=256 \
    --enable_mla=true \
    --block_size=128 \
    --ep_size=1 \
    --dp_size=1 \
    --enable_prefix_cache=false \
    --enable_chunked_prefill=false \
    --node_rank=$i > $LOG_FILE 2>&1 &
  sleep 0.5
done

Performance Data

Based on ShareGPT dataset with input length=2500, output length=1500, total requests=80.

method Concurrency Mean TPOT(ms) Mean TTFT(ms) Output Tokens/s Total Tokens/s
baseline 1 40.61 141.80 24.20 65.77
mtp 1 28.33 142.35 35.19 95.52
baseline 2 42.69 178.59 45.16 122.74
mtp 2 29.81 187.97 64.75 175.78
baseline 4 46.18 172.34 79.83 216.96
mtp 4 33.54 194.22 111.18 301.81
baseline 8 53.16 181.49 110.68 300.81
mtp 8 40.99 203.37 154.46 419.34
baseline 16 68.50 213.89 143.81 390.84
mtp 16 57.04 254.99 201.89 548.04
baseline 20 74.72 228.80 154.77 420.65
mtp 20 61.73 264.34 206.24 559.84
baseline 40 119.68 559.32 180.22 489.80
mtp 40 105.70 544.54 252.91 686.74
baseline 80 180.89 2996.21 192.09 522.06
mtp 80 152.19 2163.72 278.07 755.12