
[Feature]【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction #7332

Draft — r-cloudforge wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/050-minicpm41-model-v2

Conversation

r-cloudforge commented Apr 10, 2026

Motivation

⚡ Engineering Highlight: a correct μP (Maximal Update Parametrization) implementation with three scaling sites — embedding (×12), residual (×scale_depth/√N per sub-layer), and lm_head (÷16) — where placement is ordering-critical (each residual scaling must happen after the sub-layer output but before the residual add), plus vocab masking and tie_word_embeddings support.

This PR adds support for deploying the high-performance openbmb/MiniCPM4.1-8B model family in FastDeploy, as required by Hackathon 10th Spring No.50.

MiniCPM4.1-8B is a dense 8B parameter model from OpenBMB with the following key features:

  • μP (Maximal Update Parametrization): Three scaling sites — embedding (×12), residual (×scale_depth/√num_layers), and lm_head (÷hidden_size/dim_model_base)
  • GQA: Grouped Query Attention with num_key_value_heads=2
  • LongRoPE: Extended position encoding supporting up to 65,536 tokens
  • Architecture registered as MiniCPMForCausalLM

Modifications

Model Code (fastdeploy/model_executor/models/minicpm4.py)

New model file (516 lines) implementing:

  • MiniCPM4MLP: Gate/up merged projection with SiLU activation, no bias
  • MiniCPM4Attention: GQA with QKVParallelLinear(with_bias=False), neox-style RoPE
  • MiniCPM4DecoderLayer: μP residual scaling (scale_depth / √num_hidden_layers)
  • MiniCPM4Model: μP embedding scaling (scale_emb), graph optimization support
  • MiniCPM4ForCausalLM: μP lm_head scaling, weight mapping (HF model. → FD minicpm4.), registered as MiniCPMForCausalLM
  • MiniCPM4PretrainedModel: Tensor parallel mappings (no bias splits)
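The HF → FD weight-name mapping described above can be sketched as follows. This is an illustrative stand-in, not the actual FastDeploy code; the `HF_TO_FD_PREFIX` table and `map_weight_name` helper are hypothetical names, but the rename rule (`model.` becomes `minicpm4.`, `lm_head.` stays as-is) matches what the PR describes:

```python
# Illustrative sketch of the checkpoint-key rename: HF "model." -> FD "minicpm4.".
# The table and helper are hypothetical; only the rename rule comes from the PR.
HF_TO_FD_PREFIX = {
    "model.": "minicpm4.",
    "lm_head.": "lm_head.",  # lm_head keys are kept unchanged
}

def map_weight_name(hf_name: str) -> str:
    """Rewrite a HuggingFace checkpoint key into the FastDeploy namespace."""
    for hf_prefix, fd_prefix in HF_TO_FD_PREFIX.items():
        if hf_name.startswith(hf_prefix):
            return fd_prefix + hf_name[len(hf_prefix):]
    return hf_name  # unknown keys pass through untouched
```

For example, `model.layers.0.mlp.gate_proj.weight` maps to `minicpm4.layers.0.mlp.gate_proj.weight`.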

Unit Tests (tests/model_executor/test_minicpm4.py)

New test file (514 lines, 24 functional tests) with full FastDeploy integration:

  • Real class instantiation (MiniCPM4MLP, MiniCPM4Attention, MiniCPM4DecoderLayer, MiniCPM4Model, MiniCPM4ForCausalLM)
  • 8 monkeypatch.setattr stubs for heavy infrastructure (attention backend, parallel linear, embedding, RMSNorm, RoPE, graph opt)
  • Validates μP 3-site scaling, weight mapping, tensor parallel splits, forward pass, load_state_dict, compute_logits

Documentation

  • docs/best_practices/MiniCPM4-8B.md: Usage guide with hardware requirements, deployment examples, and performance tuning
  • docs/supported_models.md: Added MiniCPM4 entry to LLM model table

Engineering Highlights

  1. μP 3-Site Scaling — Correct implementation of Maximal Update Parametrization at three distinct points, each with different mathematical operations:

    • Embedding: × scale_emb (amplifies to ×12)
    • Residual: × scale_depth / √num_hidden_layers (applied independently to both attention and MLP outputs per layer, before residual add)
    • LM head: ÷ (hidden_size / dim_model_base) (normalizes ÷16 before logit computation)

    Ordering is critical: residual scaling must happen after each sub-layer output but before the residual addition.

  2. Vocab Masking: logits[:, ori_vocab_size:] = -inf prevents generation of padding tokens at inference time — preserves original vocabulary boundary when vocab_size was padded during training.

  3. tie_word_embeddings: Transposes embedding weight → lm_head with dtype consistency, matching MiniCPMForCausalLM HF default.
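The three scaling sites and the vocab mask can be illustrated with a minimal, framework-free sketch. This is not the FastDeploy implementation; the config values below are illustrative (scale_emb=12 and hidden_size/dim_model_base=16 mirror the factors quoted above, scale_depth=1.4 is an assumed example):

```python
import math

# Illustrative μP config values (assumptions, not read from any real config.json).
scale_emb = 12            # site 1: embedding amplification (x12)
scale_depth = 1.4         # site 2 numerator (example value)
num_hidden_layers = 32
hidden_size = 4096
dim_model_base = 256      # hidden_size / dim_model_base = 16

def embed(token_embedding):
    # Site 1: amplify raw embeddings by scale_emb.
    return [x * scale_emb for x in token_embedding]

def residual_add(residual, sublayer_out):
    # Site 2: scale the sub-layer output *after* the sub-layer but
    # *before* the residual add -- the ordering-critical step.
    s = scale_depth / math.sqrt(num_hidden_layers)
    return [r + s * y for r, y in zip(residual, sublayer_out)]

def compute_logits(hidden, lm_head_row):
    # Site 3: normalize hidden states by hidden_size / dim_model_base (/16)
    # before the logit projection.
    scale = hidden_size / dim_model_base
    return sum((h / scale) * w for h, w in zip(hidden, lm_head_row))

def mask_padded_vocab(logits, ori_vocab_size):
    # Vocab masking: tokens beyond the original vocab boundary can never win.
    return logits[:ori_vocab_size] + [float("-inf")] * (len(logits) - ori_vocab_size)
```

With `hidden = [16.0]` and a unit lm_head row, `compute_logits` returns 1.0, showing the ÷16 normalization; `mask_padded_vocab([0.1, 0.2, 0.3], 2)` sets the padded third slot to -inf.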

Design Decisions

  • Followed Qwen2 model pattern (closest architecture in FastDeploy) with μP scaling additions
  • Auto-discovery via @ModelRegistry.register_model_class decorator — no manual imports needed
  • μP config values (scale_emb, scale_depth, dim_model_base) read from HF config.json via ModelConfig auto-setattr
  • Quantization support (WINT8/WINT4/FP8) through standard FastDeploy layers — no custom ops needed

Usage or Command

# Deploy MiniCPM4.1-8B with WINT4 quantization
python -m fastdeploy.entrypoints.openai.api_server \
       --model openbmb/MiniCPM4.1-8B \
       --tensor-parallel-size 1 \
       --quantization wint4 \
       --max-model-len 32768 \
       --max-num-seqs 128

# Send a request
curl http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM4.1-8B",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 512
  }'

See docs/best_practices/MiniCPM4-8B.md for full deployment guide.

Accuracy Tests

Unit Tests (24/24 passed on A800 GPU)

  • Test file: tests/model_executor/test_minicpm4.py (514 lines, 24 functional tests)
  • Style: pytest functional tests with monkeypatch.setattr — real FastDeploy class instantiation, no MagicMock
  • Imports: from fastdeploy.model_executor.models.minicpm4 import ... — all 6 model classes

Test categories:

  • MLP (2 tests): Forward pass, load_state_dict weight mapping
  • Attention (2 tests): Forward with QKV projection, load_state_dict
  • DecoderLayer (3 tests): Residual μP scaling value, forward propagation, load_state_dict
  • Model (3 tests): Forward with embedding scaling (×12), no-scale fallback, load_state_dict
  • CausalLM (7 tests): Forward, compute_logits μP scaling (÷16), vocab masking, lm_head fallback, set_state_dict, model name, tie_word_embeddings
  • Weight Mapping (3 tests): HF→FD prefix rename, QKV stacking, gate/up stacking
  • Tensor Parallel (3 tests): Column/row split keys, non-fused QKV mapping, round-trip split/merge
  • Registration (1 test): Architecture string MiniCPMForCausalLM

AI Studio A800 GPU Validation (SM80)

Tested on Baidu AI Studio NVIDIA A800-SXM4-80GB (SM80 Ampere), Paddle 3.3.0, Python 3.10.12: 24/24 passed in 2.16s.

tests/model_executor/test_minicpm4.py::test_mlp_forward PASSED           [  4%]
tests/model_executor/test_minicpm4.py::test_mlp_load_state_dict PASSED   [  8%]
tests/model_executor/test_minicpm4.py::test_attention_forward PASSED     [ 12%]
tests/model_executor/test_minicpm4.py::test_attention_load_state_dict PASSED [ 16%]
tests/model_executor/test_minicpm4.py::test_decoder_layer_residual_scale PASSED [ 20%]
tests/model_executor/test_minicpm4.py::test_decoder_layer_forward PASSED [ 25%]
tests/model_executor/test_minicpm4.py::test_decoder_layer_load_state_dict PASSED [ 29%]
tests/model_executor/test_minicpm4.py::test_model_forward_with_embedding_scale PASSED [ 33%]
tests/model_executor/test_minicpm4.py::test_model_no_embedding_scale PASSED [ 37%]
tests/model_executor/test_minicpm4.py::test_model_load_state_dict PASSED [ 41%]
tests/model_executor/test_minicpm4.py::test_causallm_forward PASSED      [ 45%]
tests/model_executor/test_minicpm4.py::test_causallm_compute_logits_mup_scaling PASSED [ 50%]
tests/model_executor/test_minicpm4.py::test_causallm_compute_logits_vocab_mask PASSED [ 54%]
tests/model_executor/test_minicpm4.py::test_causallm_lm_head_scale_fallback PASSED [ 58%]
tests/model_executor/test_minicpm4.py::test_causallm_set_state_dict PASSED [ 62%]
tests/model_executor/test_minicpm4.py::test_causallm_name PASSED         [ 66%]
tests/model_executor/test_minicpm4.py::test_causallm_tie_word_embeddings PASSED [ 70%]
tests/model_executor/test_minicpm4.py::test_weights_mapper_prefix_rename PASSED [ 75%]
tests/model_executor/test_minicpm4.py::test_stacked_params_qkv PASSED    [ 79%]
tests/model_executor/test_minicpm4.py::test_stacked_params_gate_up PASSED [ 83%]
tests/model_executor/test_minicpm4.py::test_tp_mappings_split_keys PASSED [ 87%]
tests/model_executor/test_minicpm4.py::test_tp_mappings_non_fused_qkv PASSED [ 91%]
tests/model_executor/test_minicpm4.py::test_tp_mappings_round_trip PASSED [ 95%]
tests/model_executor/test_minicpm4.py::test_registration_architecture PASSED [100%]
======================== 24 passed, 2 warnings in 2.16s ========================

Environment: NVIDIA A800-SXM4-80GB, 81920 MiB, SM80, PaddlePaddle 3.3.0, Python 3.10.12.

Full Inference Validation (GPU)

To make verifying the model's inference capability easy, here is a directly runnable GPU validation recipe (single GPU, ≥24 GB VRAM):

Step 1 — Start the API server:

python -m fastdeploy.entrypoints.openai.api_server \
    --model openbmb/MiniCPM4.1-8B \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --max-num-seqs 16

Step 2 — Send an inference request:

curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM4.1-8B",
    "messages": [{"role": "user", "content": "请用一句话介绍MiniCPM模型"}],
    "max_tokens": 64,
    "temperature": 0
  }' | python -m json.tool

Expected behavior: the server starts normally, responds to inference requests once the model is loaded, and generates coherent Chinese text without errors.

WINT4 quantization validation (reduces the VRAM requirement to ~8 GB):

python -m fastdeploy.entrypoints.openai.api_server \
    --model openbmb/MiniCPM4.1-8B \
    --tensor-parallel-size 1 \
    --quantization wint4 \
    --max-model-len 4096 \
    --max-num-seqs 16

Checklist

  • Model code follows existing FastDeploy patterns (Qwen2 reference)
  • All pre-commit checks pass (black, isort, flake8, ruff)
  • Model registered via @ModelRegistry.register_model_class decorator
  • Weight mapping supports HuggingFace torch format
  • Usage documentation provided
  • Supported models table updated
  • GPU validation: 24/24 unit tests passed on A800 (SM80)
  • Unit tests: 514 lines, 24 functional tests with real FD class instantiation

paddle-bot commented Apr 10, 2026

Thanks for your contribution!


@r-cloudforge r-cloudforge changed the title 【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction [Feature]【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction Apr 11, 2026
r-cloudforge (Author) commented:

Thanks for the AI review. The title has been corrected to [Feature]【Hackathon 10th Spring No.50】MiniCPM4.1-8B model reproduction.

No blocking issues were raised in the review, so no further changes are needed.


CLAassistant commented Apr 13, 2026

CLA assistant check
All committers have signed the CLA.


fastdeploy-bot left a comment:

📋 Review Summary

PR overview: adds MiniCPM4.1-8B model support to FastDeploy, implementing μP (Maximal Update Parametrization) scaling, GQA, LongRoPE, and related features.

Scope of changes: fastdeploy/model_executor/models/minicpm4.py (new model file), tests/model_executor/test_minicpm4.py (new test file), documentation updates.

Impact tag: [Models]

Issues

No blocking issues found.

Overall Assessment

The PR is implemented correctly overall, with good code quality:

  • The model architecture is correct, following the Qwen2 pattern with MiniCPM4-specific μP scaling added
  • Existing layers components are reused, with no duplicated implementations
  • Auto-registration via @ModelRegistry.register_model_class follows project conventions
  • The 24 unit tests provide thorough coverage, using stubs for heavy dependencies to keep tests CPU-safe
  • Config fields are read with getattr plus defaults, giving good compatibility
  • Documentation is complete, including a deployment guide and performance-tuning advice

Suggestion: add an end-to-end test (tests/e2e/test_MiniCPM4_serving.py) to validate the actual inference flow.
