
[Feature]【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction #7332

Draft — r-cloudforge wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/050-minicpm41-model-v2

Conversation

r-cloudforge commented Apr 10, 2026

Motivation

⚡ Engineering Highlight: a correct μP (Maximal Update Parametrization) implementation with three scaling sites — embedding (×12), residual (×scale_depth/√N per sub-layer), and lm_head (÷16) — where placement is ordering-critical (each residual scaling must happen after the sub-layer output but before the residual add), plus vocab masking and tie_word_embeddings support.

This PR adds support for deploying the high-performance openbmb/MiniCPM4.1-8B model family in FastDeploy, as required by Hackathon 10th Spring No.50.

MiniCPM4.1-8B is a dense 8B parameter model from OpenBMB with the following key features:

  • μP (Maximal Update Parametrization): Three scaling sites — embedding (×12), residual (×scale_depth/√num_layers), and lm_head (÷hidden_size/dim_model_base)
  • GQA: Grouped Query Attention with num_key_value_heads=2
  • LongRoPE: Extended position encoding supporting up to 65,536 tokens
  • Architecture registered as MiniCPMForCausalLM

Modifications

Model Code (fastdeploy/model_executor/models/minicpm4.py)

New model file (516 lines) implementing:

  • MiniCPM4MLP: Gate/up merged projection with SiLU activation, no bias
  • MiniCPM4Attention: GQA with QKVParallelLinear(with_bias=False), neox-style RoPE
  • MiniCPM4DecoderLayer: μP residual scaling (scale_depth / √num_hidden_layers)
  • MiniCPM4Model: μP embedding scaling (scale_emb), graph optimization support
  • MiniCPM4ForCausalLM: μP lm_head scaling, weight mapping (HF model. → FD minicpm4.), registered as MiniCPMForCausalLM
  • MiniCPM4PretrainedModel: Tensor parallel mappings (no bias splits)
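The HF → FD weight-name mapping described above can be sketched as follows. This is an illustrative stand-in, not the actual FastDeploy code; the `HF_TO_FD_PREFIX` table and `map_weight_name` helper are hypothetical names, but the rename rule (`model.` becomes `minicpm4.`, `lm_head.` stays as-is) matches what the PR describes:

```python
# Illustrative sketch of the checkpoint-key rename: HF "model." -> FD "minicpm4.".
# The table and helper are hypothetical; only the rename rule comes from the PR.
HF_TO_FD_PREFIX = {
    "model.": "minicpm4.",
    "lm_head.": "lm_head.",  # lm_head keys are kept unchanged
}

def map_weight_name(hf_name: str) -> str:
    """Rewrite a HuggingFace checkpoint key into the FastDeploy namespace."""
    for hf_prefix, fd_prefix in HF_TO_FD_PREFIX.items():
        if hf_name.startswith(hf_prefix):
            return fd_prefix + hf_name[len(hf_prefix):]
    return hf_name  # unknown keys pass through untouched
```

For example, `model.layers.0.mlp.gate_proj.weight` maps to `minicpm4.layers.0.mlp.gate_proj.weight`.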

Unit Tests (tests/model_executor/test_minicpm4.py)

New test file (514 lines, 24 functional tests) with full FastDeploy integration:

  • Real class instantiation (MiniCPM4MLP, MiniCPM4Attention, MiniCPM4DecoderLayer, MiniCPM4Model, MiniCPM4ForCausalLM)
  • 8 monkeypatch.setattr stubs for heavy infrastructure (attention backend, parallel linear, embedding, RMSNorm, RoPE, graph opt)
  • Validates μP 3-site scaling, weight mapping, tensor parallel splits, forward pass, load_state_dict, compute_logits

Documentation

  • docs/best_practices/MiniCPM4-8B.md: Usage guide with hardware requirements, deployment examples, and performance tuning
  • docs/supported_models.md: Added MiniCPM4 entry to LLM model table

Engineering Highlights

  1. μP 3-Site Scaling — Correct implementation of Maximal Update Parametrization at three distinct points, each with different mathematical operations:

    • Embedding: × scale_emb (amplifies to ×12)
    • Residual: × scale_depth / √num_hidden_layers (applied independently to both attention and MLP outputs per layer, before residual add)
    • LM head: ÷ (hidden_size / dim_model_base) (normalizes ÷16 before logit computation)

    Ordering is critical: residual scaling must happen after each sub-layer output but before the residual addition.

  2. Vocab Masking: logits[:, ori_vocab_size:] = -inf prevents generation of padding tokens at inference time — preserves original vocabulary boundary when vocab_size was padded during training.

  3. tie_word_embeddings: Transposes embedding weight → lm_head with dtype consistency, matching MiniCPMForCausalLM HF default.
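The three scaling sites and the vocab mask can be illustrated with a minimal, framework-free sketch. This is not the FastDeploy implementation; the config values below are illustrative (scale_emb=12 and hidden_size/dim_model_base=16 mirror the factors quoted above, scale_depth=1.4 is an assumed example):

```python
import math

# Illustrative μP config values (assumptions, not read from any real config.json).
scale_emb = 12            # site 1: embedding amplification (x12)
scale_depth = 1.4         # site 2 numerator (example value)
num_hidden_layers = 32
hidden_size = 4096
dim_model_base = 256      # hidden_size / dim_model_base = 16

def embed(token_embedding):
    # Site 1: amplify raw embeddings by scale_emb.
    return [x * scale_emb for x in token_embedding]

def residual_add(residual, sublayer_out):
    # Site 2: scale the sub-layer output *after* the sub-layer but
    # *before* the residual add -- the ordering-critical step.
    s = scale_depth / math.sqrt(num_hidden_layers)
    return [r + s * y for r, y in zip(residual, sublayer_out)]

def compute_logits(hidden, lm_head_row):
    # Site 3: normalize hidden states by hidden_size / dim_model_base (/16)
    # before the logit projection.
    scale = hidden_size / dim_model_base
    return sum((h / scale) * w for h, w in zip(hidden, lm_head_row))

def mask_padded_vocab(logits, ori_vocab_size):
    # Vocab masking: tokens beyond the original vocab boundary can never win.
    return logits[:ori_vocab_size] + [float("-inf")] * (len(logits) - ori_vocab_size)
```

With `hidden = [16.0]` and a unit lm_head row, `compute_logits` returns 1.0, showing the ÷16 normalization; `mask_padded_vocab([0.1, 0.2, 0.3], 2)` sets the padded third slot to -inf.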

Design Decisions

  • Followed Qwen2 model pattern (closest architecture in FastDeploy) with μP scaling additions
  • Auto-discovery via @ModelRegistry.register_model_class decorator — no manual imports needed
  • μP config values (scale_emb, scale_depth, dim_model_base) read from HF config.json via ModelConfig auto-setattr
  • Quantization support (WINT8/WINT4/FP8) through standard FastDeploy layers — no custom ops needed

Usage or Command

# Deploy MiniCPM4.1-8B with WINT4 quantization
python -m fastdeploy.entrypoints.openai.api_server \
       --model openbmb/MiniCPM4.1-8B \
       --tensor-parallel-size 1 \
       --quantization wint4 \
       --max-model-len 32768 \
       --max-num-seqs 128

# Send a request
curl http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM4.1-8B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 512
  }'

See docs/best_practices/MiniCPM4-8B.md for full deployment guide.

Accuracy Tests

Unit Tests (24/24 passed on A800 GPU)

  • Test file: tests/model_executor/test_minicpm4.py (514 lines, 24 functional tests)
  • Style: pytest functional tests with monkeypatch.setattr — real FastDeploy class instantiation, no MagicMock
  • Imports: from fastdeploy.model_executor.models.minicpm4 import ... — all 6 model classes

Test categories:

  • MLP (2 tests): Forward pass, load_state_dict weight mapping
  • Attention (2 tests): Forward with QKV projection, load_state_dict
  • DecoderLayer (3 tests): Residual μP scaling value, forward propagation, load_state_dict
  • Model (3 tests): Forward with embedding scaling (×12), no-scale fallback, load_state_dict
  • CausalLM (7 tests): Forward, compute_logits μP scaling (÷16), vocab masking, lm_head fallback, set_state_dict, model name, tie_word_embeddings
  • Weight Mapping (3 tests): HF→FD prefix rename, QKV stacking, gate/up stacking
  • Tensor Parallel (3 tests): Column/row split keys, non-fused QKV mapping, round-trip split/merge
  • Registration (1 test): Architecture string MiniCPMForCausalLM

AI Studio A800 GPU Validation (SM80)

Tested on Baidu AI Studio NVIDIA A800-SXM4-80GB (SM80 Ampere), Paddle 3.3.0, Python 3.10.12: 24/24 passed in 2.16s.

tests/model_executor/test_minicpm4.py::test_mlp_forward PASSED           [  4%]
tests/model_executor/test_minicpm4.py::test_mlp_load_state_dict PASSED   [  8%]
tests/model_executor/test_minicpm4.py::test_attention_forward PASSED     [ 12%]
tests/model_executor/test_minicpm4.py::test_attention_load_state_dict PASSED [ 16%]
tests/model_executor/test_minicpm4.py::test_decoder_layer_residual_scale PASSED [ 20%]
tests/model_executor/test_minicpm4.py::test_decoder_layer_forward PASSED [ 25%]
tests/model_executor/test_minicpm4.py::test_decoder_layer_load_state_dict PASSED [ 29%]
tests/model_executor/test_minicpm4.py::test_model_forward_with_embedding_scale PASSED [ 33%]
tests/model_executor/test_minicpm4.py::test_model_no_embedding_scale PASSED [ 37%]
tests/model_executor/test_minicpm4.py::test_model_load_state_dict PASSED [ 41%]
tests/model_executor/test_minicpm4.py::test_causallm_forward PASSED      [ 45%]
tests/model_executor/test_minicpm4.py::test_causallm_compute_logits_mup_scaling PASSED [ 50%]
tests/model_executor/test_minicpm4.py::test_causallm_compute_logits_vocab_mask PASSED [ 54%]
tests/model_executor/test_minicpm4.py::test_causallm_lm_head_scale_fallback PASSED [ 58%]
tests/model_executor/test_minicpm4.py::test_causallm_set_state_dict PASSED [ 62%]
tests/model_executor/test_minicpm4.py::test_causallm_name PASSED         [ 66%]
tests/model_executor/test_minicpm4.py::test_causallm_tie_word_embeddings PASSED [ 70%]
tests/model_executor/test_minicpm4.py::test_weights_mapper_prefix_rename PASSED [ 75%]
tests/model_executor/test_minicpm4.py::test_stacked_params_qkv PASSED    [ 79%]
tests/model_executor/test_minicpm4.py::test_stacked_params_gate_up PASSED [ 83%]
tests/model_executor/test_minicpm4.py::test_tp_mappings_split_keys PASSED [ 87%]
tests/model_executor/test_minicpm4.py::test_tp_mappings_non_fused_qkv PASSED [ 91%]
tests/model_executor/test_minicpm4.py::test_tp_mappings_round_trip PASSED [ 95%]
tests/model_executor/test_minicpm4.py::test_registration_architecture PASSED [100%]
======================== 24 passed, 2 warnings in 2.16s ========================

Environment: NVIDIA A800-SXM4-80GB, 81920 MiB, SM80, PaddlePaddle 3.3.0, Python 3.10.12.

Full Inference Validation (GPU)

To make verifying the model's inference capability easy, here is a directly runnable GPU validation recipe (single GPU, ≥24 GB VRAM):

Step 1 — Start the API server:

python -m fastdeploy.entrypoints.openai.api_server \
    --model openbmb/MiniCPM4.1-8B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --max-num-seqs 16

Step 2 — Send an inference request:

curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM4.1-8B",
    "messages": [{"role": "user", "content": "请用一句话介绍MiniCPM模型"}],
    "max_tokens": 64,
    "temperature": 0
  }' | python -m json.tool

Expected behavior: the server starts normally, responds to inference requests once the model is loaded, and generates coherent Chinese text without errors.

WINT4 quantization validation (reduces the VRAM requirement to ~8 GB):

python -m fastdeploy.entrypoints.openai.api_server \
    --model openbmb/MiniCPM4.1-8B \
    --tensor-parallel-size 1 \
    --quantization wint4 \
    --max-model-len 4096 \
    --max-num-seqs 16

Checklist

  • Model code follows existing FastDeploy patterns (Qwen2 reference)
  • All pre-commit checks pass (black, isort, flake8, ruff)
  • Model registered via @ModelRegistry.register_model_class decorator
  • Weight mapping supports HuggingFace torch format
  • Usage documentation provided
  • Supported models table updated
  • GPU validation: 24/24 unit tests passed on A800 (SM80)
  • Unit tests: 514 lines, 24 functional tests with real FD class instantiation

paddle-bot commented Apr 10, 2026

Thanks for your contribution!


@r-cloudforge r-cloudforge changed the title 【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction [Feature]【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction Apr 11, 2026
r-cloudforge (Author) commented:

Thanks for the AI review. The title has been corrected to [Feature]【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction.

No blocking issues were raised in the review, so no further changes are needed.


CLAassistant commented Apr 13, 2026

CLA assistant check
All committers have signed the CLA.


fastdeploy-bot left a comment:

📋 Review Summary

PR overview: adds MiniCPM4.1-8B model support to FastDeploy, implementing μP (Maximal Update Parametrization) scaling, GQA, LongRoPE, and related features.

Scope of changes: fastdeploy/model_executor/models/minicpm4.py (new model file), tests/model_executor/test_minicpm4.py (new test file), documentation updates.

Impact tag: [Models]

Issues

No blocking issues found.

Overall Assessment

The PR is implemented correctly overall, with good code quality:

  • The model architecture is correct, following the Qwen2 pattern with MiniCPM4-specific μP scaling added
  • Existing layers components are reused, with no duplicated implementations
  • Auto-registration via @ModelRegistry.register_model_class follows project conventions
  • The 24 unit tests provide thorough coverage, using stubs for heavy dependencies to keep tests CPU-safe
  • Config fields are read with getattr plus defaults, giving good compatibility
  • Documentation is complete, including a deployment guide and performance-tuning advice

Suggestion: add an end-to-end test (tests/e2e/test_MiniCPM4_serving.py) to validate the actual inference flow.
