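Preface: an earlier post covered running translategemma 4B on an iOS phone.

I wanted to build a phone translation app of my own that runs on local models, as a fallback when there is no network. With help from Codex and Gemini, it actually came together.

Source code: https://github.com/zhangrr/ios-translate

The app ships with two models:

opus-mt-tiny-zh-en-ct2-int8, taken from Hugging Face, is only 19M.

Oddly, there was no matching small en-zh model, so I trained one myself. It comes to about 20M as well:

opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8

The detailed training process follows.

Full training process for opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8

This document walks through, from scratch, the whole training chain behind the final model in outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8/. Every term, parameter, and step is explained so the result can be reproduced.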

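Overview: what knowledge distillation is
Base (teacher) model: Helsinki-NLP/opus-mt-en-zh
The Marian architecture in detail
Training data: three layers of sources
Custom vocabulary (joint 32k SentencePiece)
Student model configs: from 44M down to 20M
Core principles of distillation training
Full training chain: 8 stages
CTranslate2 conversion and int8 quantization
CT2 inference flow in detail
The pad-embedding zeroing problem (CT2 alignment)
Key parameter summary tables
Full list of reproduction commands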

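1. Overview: what knowledge distillation is

Knowledge distillation (KD) is a model compression technique. The core idea is to have a small model (the student) learn the output distribution of a large model (the teacher), rather than learning only from human-labeled reference answers (ground truth / hard labels).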

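Input: "Hello world."
         │
         ├─→ Teacher (opus-mt-en-zh, ~77M params) ─→ "你好世界。" (high quality)
         │                                          produces logits (a probability distribution)
         │
         └─→ Student (tiny, ~20M params) ─→ imitates the teacher's logits
                                            and learns to produce translations close to the teacher's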

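Why KD rather than plain training?

Method	Training target	Effect
Direct training (CE only)	Student learns the ground-truth labels	Student only sees the information in the reference answer
Knowledge distillation (CE + KL)	Student learns the ground truth and the teacher's output distribution	The teacher's output carries "soft label" information (for example, which tokens have similar probability), which is richer than hard labels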

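In this project the distillation loss is:

loss = alpha_ce * CE(student, labels) + (1 - alpha_ce) * KL(softmax(teacher/T), softmax(student/T)) * T^2

CE: cross-entropy loss (student prediction vs. ground-truth labels)
KL: KL divergence between the teacher's and the student's temperature-softened distributions; teacher and student here stand for their logits
T (temperature): controls how much the distributions are softened. At T=2 the distribution is smoother, so the student can pick up finer differences between candidate tokens. The T^2 factor keeps the gradient scale of the KL term comparable across temperatures
alpha_ce: weight of the CE loss. With alpha_ce=0.5, CE and KL each contribute half; with alpha_ce=1.0 only CE is used (no distillation)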
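
As a rough illustration of the loss above, here is a minimal pure-Python sketch for a single token position. It is not the code from distill_en_zh.py (which presumably works on batched tensors); the function names and toy logits are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by the given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, label, alpha_ce=0.5, temperature=2.0):
    """alpha_ce * CE + (1 - alpha_ce) * KL(teacher || student) * T^2 for one position."""
    # Hard-label cross-entropy uses the unsoftened student distribution.
    ce = -math.log(softmax(student_logits)[label])
    if alpha_ce >= 1.0:
        return ce  # pure CE: the teacher is not consulted at all
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return alpha_ce * ce + (1 - alpha_ce) * kl * temperature ** 2

# Toy example: the student disagrees with the teacher about the second token.
print(distill_loss([2.0, 0.5, -1.0], [0.0, 3.0, -1.0], label=0))
```

When student and teacher logits are identical the KL term is zero and the loss reduces to alpha_ce times the cross-entropy.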

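Why do the later stages use alpha_ce=1.0 (pure CE)? Once the vocabulary shrinks to 32k, the student's vocabulary no longer matches the original teacher's 65k vocabulary, so token-level KL distillation against that teacher is not possible. By then the 512d student has already absorbed knowledge through the teacher-initialized stages, and pure CE proved sufficient. Stages 5 to 7 do pass a same-vocabulary 512d model as --teacher-model, but their commands still set --alpha-ce 1.0, so the KL term is switched off there too.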

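2. Base (teacher) model: Helsinki-NLP/opus-mt-en-zh

2.1 What opus-mt-en-zh is

Helsinki-NLP/opus-mt-en-zh is an open-source English→Chinese translation model on the Hugging Face Hub, trained by the OPUS project team at the University of Helsinki. It is the starting point for every student model in this project.

2.2 How to get it

# Download and load directly from Python (set HTTP_PROXY if you need a proxy)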
from transformers import MarianMTModel, MarianTokenizer
teacher = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

The model is cached under ~/.cache/huggingface/. If it has already been downloaded, distill_en_zh.py loads it straight from the cache.

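2.3 Full teacher configuration

Parameter	Value	Meaning
`model_type`	`marian`	Marian architecture (the lightweight seq2seq used by the OPUS-MT family)
`d_model`	512	Hidden size (embedding dimension)
`encoder_layers`	6	Number of encoder layers
`decoder_layers`	6	Number of decoder layers
`encoder_attention_heads`	8	Attention heads per encoder layer
`decoder_attention_heads`	8	Attention heads per decoder layer
`encoder_ffn_dim`	2048	Encoder feed-forward hidden size
`decoder_ffn_dim`	2048	Decoder feed-forward hidden size
`vocab_size`	65001	Vocabulary size (65000 pieces + 1 `<pad>`)
`pad_token_id`	65000	id of the padding token (placed last in the vocabulary)
`eos_token_id`	0	End-of-sentence token (`</s>`)
`decoder_start_token_id`	65000	Decoder start token (Marian uses `<pad>` as the start token)
`activation_function`	`swish`	Activation function
`dropout`	0.1	Dropout rate
`tie_word_embeddings`	true	Input embedding and output LM head share weights
`share_encoder_decoder_embeddings`	true	Encoder and decoder share the token embedding
`scale_embedding`	true	Embeddings are scaled by sqrt(d_model)
`max_position_embeddings`	512	Maximum position-encoding length

Parameter count: about 77M

Embedding layer: 65001 × 512 ≈ 33.3M (the largest single block, because the vocabulary is large)
Encoder layers: 6 × ~3.15M ≈ 18.9M
Decoder layers: 6 × ~4.2M ≈ 25.2M (each decoder layer also has cross-attention)
Other (layer norms, biases, etc.): small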

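2.4 Local teacher files (in the HF cache)

After download, the cache directory contains:

File	Meaning
`pytorch_model.bin` / `model.safetensors`	Model weights (~300MB at fp32)
`config.json`	MarianConfig (all the parameters listed above)
`source.spm`	SentencePiece tokenizer model for the source language (English)
`target.spm`	SentencePiece tokenizer model for the target language (Chinese)
`vocab.json`	Vocabulary mapping (piece → id)
`tokenizer_config.json`	Tokenizer config (model_max_length, etc.)
`generation_config.json`	Generation parameters (num_beams, max_length, etc.)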

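2.5 What the teacher is used for in this project

Tokenizer: the early students (vocab=65k) reuse the teacher's tokenizer (source.spm / target.spm), so input and output vocabularies match; later stages switch to the custom joint 32k tokenizer from section 5
Soft labels: during KD training the teacher produces logits for each input, which serve as the student's soft targets
Weight initialization (optional): --init-from-teacher copies teacher weights into the student wherever the tensor shapes line up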

3. The Marian architecture in detail

Marian is a Transformer-based seq2seq model with the following structure:

                    ┌─────────────────────────┐
                    │       MarianMTModel     │
                    ├─────────────────────────┤
                    │                         │
  input tokens ──→  │  Encoder                │
  [1, 8, 23, ...]   │  ┌─────────────┐        │  →  hidden states
  (subword units)   │  │ Embedding   │  32001 │  →  [batch, seq_len, d_model]
                    │  │   (shared)  │ × d_m  │
                    │  └──────┬──────┘        │
                    │         │               │
                    │  ┌──────▼──────┐        │
                    │  │ Encoder     │        │
                    │  │ Layers × N  │        │
                    │  │ (Self-Attn  │        │
                    │  │  + FFN)     │        │
                    │  └─────────────┘        │
                    │                         │
                    │  Decoder                │
                    │  ┌─────────────┐        │
                    │  │ Decoder     │        │  →  logits [batch, seq_len, vocab]
                    │  │ Layers × M  │        │  →  argmax → translation output
                    │  │ (Self-Attn  │        │
                    │  │  + Cross-Attn│       │
                    │  │  + FFN)     │        │
                    │  └─────────────┘        │
                    │                         │
                    └─────────────────────────┘

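How each dimension affects the parameter count:

The parameter count is dominated by these parts:

Embedding layer (the largest block): vocab_size × d_model
- Teacher: 65001 × 512 ≈ 33.3M
- Final model: 32001 × 320 ≈ 10.2M
Each encoder/decoder layer:
- Self-attention: 4 × d_model × d_model (Q/K/V projections + output projection)
- FFN: 2 × d_model × ffn_dim (two linear layers)
- LayerNorm: 2 × d_model per normalization layer (weight + bias)
- Encoder layer: with d_model=320, ffn_dim=1280 → 4 × 320² + 2 × 320 × 1280 ≈ 1.23M per layer
- Decoder layer: adds cross-attention (another 4 × 320²) → about 1.64M per layer
LM head (output layer): d_model × vocab_size (shared with the embedding when tie_word_embeddings=true)

Final model (d_model=320, enc=5, dec=2, vocab=32001):

Embedding: 32001 × 320 ≈ 10.2M
Encoder ×5: about 6.2M
Decoder ×2: about 3.3M
LM head (shared): 0 (reuses the embedding)
Total: about 19.7M, i.e. roughly 20M parameters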
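
The breakdown above can be checked with a small estimator. This is a sketch under stated assumptions: one shared embedding table tied to the LM head, biases on every linear layer, a vocabulary-sized output bias, and no learned positional table. The function name is made up for this example and is not part of the project.

```python
def marian_param_estimate(vocab_size, d_model, ffn_dim, enc_layers, dec_layers):
    """Rough parameter count for a Marian-style encoder-decoder."""
    embedding = vocab_size * d_model             # shared by encoder, decoder and LM head
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, output projections with biases
    ffn = 2 * d_model * ffn_dim + ffn_dim + d_model
    layer_norm = 2 * d_model                     # weight + bias
    enc_layer = attention + ffn + 2 * layer_norm
    dec_layer = 2 * attention + ffn + 3 * layer_norm  # self-attention + cross-attention
    output_bias = vocab_size
    return embedding + enc_layers * enc_layer + dec_layers * dec_layer + output_bias

student = marian_param_estimate(32001, 320, 1280, enc_layers=5, dec_layers=2)
teacher = marian_param_estimate(65001, 512, 2048, enc_layers=6, dec_layers=6)
print(f"student ≈ {student / 1e6:.1f}M, teacher ≈ {teacher / 1e6:.1f}M")
```

At 4 bytes per fp32 parameter, the student estimate also lines up with the ~76MB model.safetensors reported later for the 320d model.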

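4. Training data: three layers of sources

The final model's training data is stacked from three layers:

4.1 Layer 1: OPUS-100 general parallel corpus (the base)

Property	Value
Dataset name	`opus100` (Hugging Face `datasets` library)
Language pair	en-zh (English→Chinese)
Training set size	1,000,000 sentence pairs
Validation set size	2,000 sentence pairs
Data format	HF Dataset; each row holds `{"translation": {"en": "...", "zh": "..."}}`
Source	The OPUS corpus collection (https://opus.nlpl.eu/), merging several sub-corpora such as Tatoeba, WikiMatrix, and OpenSubtitles

How to load it:

from datasets import load_dataset
ds = load_dataset("opus100", "en-zh")
# ds["train"] → 1,000,000 rows
# ds["validation"] → 2,000 rows

distill_en_zh.py loads it via the --dataset opus100 --dataset-config en-zh arguments.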

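4.2 Layer 2: Coffee/Latte supplementary data (scenario boost)

Problem: OPUS-100 contains almost no coffee vocabulary such as latte, cappuccino, or espresso. A small model generalizes very poorly to such rare words, and its translations of them are unpredictable.

Approach: use the teacher CT2 model to translate a batch of short coffee sentences into Chinese, treat those translations as ground truth, then oversample them (repeat N times) into the training set.

Property	Value
Generation script	`scripts/make_extra_coffee_data.py`
Built-in English sentences	28 (latte, coffee, cappuccino, espresso, etc.)
Source of Chinese targets	Teacher CT2 model (`outputs/teacher-en-zh-ct2-int8-full`), generated with beam=4
Repeat count	100
Output file	`data/extra.coffee.r100.csv`
Rows	2,800 (28 × 100)

Generation command: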

uv run -s scripts/make_extra_coffee_data.py \
  --output-csv data/extra.coffee.r100.csv \
  --repeat 100 \
  --device cuda \
  --compute-type int8_float16

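Why generate with the teacher instead of writing translations by hand? The goal is stylistic consistency between the supplementary pairs and the rest of the training data: the teacher was itself trained on OPUS data, so its output matches that style. Hand-written translations could drift from the OPUS-100 style.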

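4.3 Layer 3: Money/amount synthetic data (number fidelity boost)

Problem: the small model translates 19.99 as 1999, dropping the decimal point. This is a common failure in seq2seq translation: the tokens of a number are not treated as one unit.

Approach: generate the English sentences and their Chinese translations purely from rule-based templates (not from the teacher, because the teacher can get amounts wrong too). The amount string is copied unchanged into the Chinese template, so 19.99 is identical in source and target.

Property	Value
Generation script	`scripts/make_extra_money_data.py`
Method	Pure template synthesis (no teacher generation)
Templates	15 templates (The total is {amount} USD., Subtotal/Tax/Tip, etc.)
Amount pool	A fixed list of amounts (19.99, 9.99, 0.99, etc.) plus randomly generated ones
Chinese target	The amount string is filled straight into a Chinese template ("总额为{amount}美元。")
repeat=20 output	`data/extra.money.synth.r20.csv` (40,000 rows)
repeat=50 output	`data/extra.money.synth.r50.csv` (100,000 rows)

The final model uses r20 (40,000 rows); the lower repeat count avoids skewing the model too far toward amounts.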
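
To make the template idea concrete, here is a small sketch of how such pairs could be synthesized. It is not the contents of scripts/make_extra_money_data.py: only the first template pair and the fixed amounts come from the text above, while the second template, the 30% mixing ratio, and all function names are assumptions for illustration.

```python
import random
import re

# First pair is quoted in the article; the second is a hypothetical extra template.
TEMPLATES = [
    ("The total is {amount} USD.", "总额为{amount}美元。"),
    ("The price is ${amount}.", "价格是{amount}美元。"),
]
FIXED_AMOUNTS = ["19.99", "9.99", "0.99"]

def random_amount(rng):
    """Random amount with exactly two decimals, e.g. '407.05'."""
    return f"{rng.randint(0, 999)}.{rng.randint(0, 99):02d}"

def make_money_rows(num_examples, repeat, seed=0):
    """Build (english, chinese) pairs; the amount string is copied verbatim to both sides."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_examples):
        amount = rng.choice(FIXED_AMOUNTS) if rng.random() < 0.3 else random_amount(rng)
        en, zh = rng.choice(TEMPLATES)
        rows.append((en.format(amount=amount), zh.format(amount=amount)))
    return rows * repeat  # oversample by plain repetition

def amount_preserved(en, zh):
    """True when the decimal amount found in the English side appears unchanged in the Chinese side."""
    match = re.search(r"\d+\.\d{2}", en)
    return match is not None and match.group() in zh

rows = make_money_rows(num_examples=2000, repeat=20)
print(len(rows), rows[0])
```

Because the target side is built by string substitution rather than by a model, the decimal point cannot be lost in the training labels.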

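4.4 How the training data is stacked

OPUS-100 train (at most 50,000 rows, controlled by --max-train-samples)
    + extra.coffee.r100.csv (2,800 rows)
    + extra.money.synth.r20.csv (40,000 rows)
    = 92,800 training rows in total

distill_en_zh.py appends these files to the training set automatically via the --extra-train-csv argument (which can be passed multiple times).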
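
The row totals translate directly into optimizer steps. The helper below is a back-of-the-envelope sketch that assumes a single GPU and that a partial final batch still counts as one step; its name is made up for this example.

```python
import math

def optimizer_steps(num_rows, per_device_batch=4, grad_accum=8, epochs=1):
    """Optimizer steps for a run: rows / (batch size × accumulation steps), rounded up."""
    effective_batch = per_device_batch * grad_accum
    return epochs * math.ceil(num_rows / effective_batch)

final_mix = 50_000 + 2_800 + 40_000      # OPUS-100 subset + coffee + money (r20)
full_mix = 1_000_000 + 2_800 + 40_000    # full OPUS-100 + coffee + travel (r20)
print(optimizer_steps(final_mix))  # 2900
print(optimizer_steps(full_mix))   # 32588
```

Both numbers agree with the max_steps values reported for the final fine-tune and for the full 512d joint-32k run.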

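4.5 Manual corrections

Problem: the Tatoeba data contains Classical Chinese, Cantonese, and unnatural translations. For example:

This is a dog. → 犬也。 (Classical Chinese)
This is a cat. → 是狗 (wrong: the Chinese says "is a dog")

Correction workflow:

Use scripts/audit_parallel_data.py to flag suspicious rows (Chinese translation too short, or no Chinese characters at all)
Fill in data/manual_corrections.csv by hand (the correction table)
Use scripts/apply_corrections.py to apply the corrections to the original CSV in bulk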
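
The two audit heuristics mentioned above (translation too short, no Chinese characters) might look roughly like this. This is an illustrative sketch, not the actual logic of scripts/audit_parallel_data.py; the threshold of one Chinese character per English word is an assumption and will over-flag on real data, which is acceptable for a manual review queue.

```python
import re

# Basic CJK Unified Ideographs block; punctuation such as "。" is deliberately excluded.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

def is_suspicious(source, target, min_chars_per_word=1.0):
    """Flag a pair when the target has no Chinese characters or looks too short for the source."""
    cjk_count = len(CJK_CHAR.findall(target))
    if cjk_count == 0:
        return True
    word_count = len(source.split())
    return cjk_count < min_chars_per_word * word_count

pairs = [
    ("This is a dog.", "犬也。"),
    ("This is a dog.", "这是一条狗。"),
    ("Hello world.", "Hello world."),
]
for en, zh in pairs:
    print(en, "|", zh, "|", "REVIEW" if is_suspicious(en, zh) else "ok")
```

Flagged rows still need a human decision; the heuristic only narrows down what to read.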

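Contents of the correction table:

source	target
This is a dog.	这是一条狗。
This is a cat.	这是一只猫。
It’s not a cat. It’s a dog.	它不是猫。它是一条狗。
…	…

For OPUS-100 training, the corrected samples can be oversampled with --extra-train-csv data/manual_corrections.csv --extra-train-repeat 200, so that the model reliably learns the right translations.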

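5. Custom vocabulary (joint 32k SentencePiece)

5.1 Why shrink the vocabulary

The teacher's vocabulary has 65001 entries (65000 pieces + 1 <pad>), and:

Embedding parameters = 65001 × 512 ≈ 33.3M, the largest single block of parameters
Many pieces are low-frequency and contribute little to translation quality

After shrinking to 32001:

Embedding parameters at d_model=512 = 32001 × 512 ≈ 16.4M; at the final d_model=320 they drop to 32001 × 320 ≈ 10.2M
The CT2 int8 model.bin shrinks from ~60MB to ~44MB at 512d, and to ~20MB once d_model is also reduced to 320

5.2 What a joint vocabulary is

What is a joint vocabulary?

Traditional setup: one tokenizer for English and one for Chinese (two independent SentencePiece models)
Joint vocabulary: English and Chinese text are mixed to train a single SentencePiece model, and the encoder and decoder share one vocabulary

Traditional: source.spm (English) + target.spm (Chinese) → two separate vocabularies
Joint: source.spm == target.spm (the same file) → one vocabulary shared by English and Chinese

Benefits:

The encoder and decoder can share one embedding table (share_encoder_decoder_embeddings=true), which halves the embedding parameters compared with keeping two tables
The vocabulary is unified, so no vocabulary-mapping step is needed

5.3 Training the joint 32k tokenizer

Script: scripts/build_joint_tokenizer.py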

uv run -s scripts/build_joint_tokenizer.py \
  --output-dir tokenizers/opus100_joint32k \
  --max-samples 200000 \
  --extra-csv data/extra.coffee.r100.csv \
  --extra-csv data/extra.travel.r20.csv \
  --vocab-size 32000 \
  --model-type unigram

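Parameter	Value	Meaning
`--dataset`	`opus100`	Data source
`--dataset-config`	`en-zh`	Language pair
`--max-samples`	200000	Sample 200k sentence pairs from OPUS-100 for tokenizer training
`--extra-csv`	repeatable	Extra corpora (coffee + travel) are mixed in so their words make it into the vocabulary
`--vocab-size`	32000	SentencePiece vocabulary size (excluding `<pad>`)
`--model-type`	`unigram`	SentencePiece algorithm (unigram works fairly well for Chinese)
`--character-coverage`	0.9995	Character coverage (Chinese needs a value close to 1.0)

SentencePiece training parameters in detail:

eos_id=0: </s> gets id 0
unk_id=1: <unk> gets id 1
bos_id=-1: no <bos> (Marian does not need one)
pad_id=-1: SentencePiece assigns no pad id (it is added by hand afterwards)

Output files (tokenizers/opus100_joint32k/):

File	Meaning
`spm.model`	SentencePiece binary model (32000 subword pieces)
`source.spm`	Copy of spm.model (for MarianTokenizer compatibility)
`target.spm`	Copy of spm.model (same reason)
`vocab.json`	Vocabulary mapping: piece → id (32000 + 1 `<pad>` = 32001)
`tokenizer_config.json`	`{"model_max_length": 512}`
`corpus.joint.txt`	The corpus text used to train the tokenizer

Adding <pad>: SentencePiece is trained with pad_id=-1 (no pad assigned). After training, the script appends <pad> to vocab.json with id = SentencePiece vocabulary size = 32000, so the final vocabulary size is 32001.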
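
The <pad> bookkeeping is easy to get off by one, so here is a tiny sketch of the id layout. It uses placeholder piece names rather than a real SentencePiece model, and build_vocab is a hypothetical helper, not a function from scripts/build_joint_tokenizer.py.

```python
def build_vocab(pieces, pad_token="<pad>"):
    """Map pieces to ids in order, then append the pad token as the last id."""
    vocab = {piece: idx for idx, piece in enumerate(pieces)}
    if pad_token in vocab:
        raise ValueError("pad token already present in the SentencePiece pieces")
    vocab[pad_token] = len(pieces)
    return vocab

# Placeholder pieces standing in for the 32000 entries of spm.model.
pieces = ["</s>", "<unk>"] + [f"▁piece{i}" for i in range(31998)]
vocab = build_vocab(pieces)
print(vocab["</s>"], vocab["<unk>"], vocab["<pad>"], len(vocab))  # 0 1 32000 32001
```

The resulting pad id is what pad_token_id and decoder_start_token_id point at in the student config.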

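5.4 Travel supplementary data (numbers, dates, names, places)

Alongside tokenizer training, scripts/make_extra_travel_data.py generates a batch of augmentation data containing person names, place names, dates, flight numbers, and similar items: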

uv run -s scripts/make_extra_travel_data.py \
  --output-csv data/extra.travel.r20.csv \
  --num-examples 2000 \
  --repeat 20 \
  --device cuda \
  --compute-type int8_float16

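It produces 40,000 rows (2,000 examples × 20), including sentences such as:

Tom is going to Beijing on 2026-04-20.
My flight AA123 departs at 14:30 on Apr 20, 2026.
The price is $89.99.

What this data is for:

During tokenizer training, it makes sure proper nouns such as person and place names are segmented sensibly
As supplementary training data, it teaches the model to translate sentences containing numbers and dates

6. Student model configs: from 44M down to 20M

6.1 Config generation script

scripts/make_student_config.py takes the teacher config as the baseline and changes only the structural parameters: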

uv run -s scripts/make_student_config.py \
  --teacher-model outputs/opus-mt-small512d-opus100-joint32k \
  --output student_config_320d_enc5_dec2_joint32k.json \
  --d-model 320 \
  --encoder-layers 5 \
  --decoder-layers 2 \
  --attention-heads 8 \
  --ffn-dim 1280 \
  --vocab-size 32001

6.2 Comparison of student configs across generations

Config	Stage 1 (512d OPUS-100)	Stage 4 (512d Joint32k)	Stages 5 and 7 (320d Joint32k, final)
`d_model`	512	512	320
`encoder_layers`	6	6	5
`decoder_layers`	2	2	2
`encoder_attention_heads`	8	8	8
`decoder_attention_heads`	8	8	8
`encoder_ffn_dim`	2048	2048	1280
`decoder_ffn_dim`	2048	2048	1280
`vocab_size`	65001	32001	32001
`pad_token_id`	65000	32000	32000
Parameters	~60M	~44M	~20M
CT2 int8 size	~60MB	~44MB	~20MB

6.3 Final student config file (student_config_320d_enc5_dec2_joint32k.json)

See student_config_320d_enc5_dec2_joint32k.json for all fields. The key ones:

{
  "model_type": "marian",
  "d_model": 320,
  "encoder_layers": 5,
  "decoder_layers": 2,
  "encoder_attention_heads": 8,
  "decoder_attention_heads": 8,
  "encoder_ffn_dim": 1280,
  "decoder_ffn_dim": 1280,
  "vocab_size": 32001,
  "decoder_vocab_size": 32001,
  "pad_token_id": 32000,
  "decoder_start_token_id": 32000,
  "eos_token_id": 0,
  "activation_function": "swish",
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "share_encoder_decoder_embeddings": true,
  "scale_embedding": true,
  "max_position_embeddings": 512
}

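7. Core principles of distillation training

7.1 Loss function

The DistillSeq2SeqTrainer class in distill_en_zh.py overrides compute_loss:

# alpha_ce >= 1.0 → pure CE (no distillation)
loss = alpha_ce * CE(student_logits, labels)

# alpha_ce < 1.0 → CE + KL distillation
loss = alpha_ce * CE(student, labels)
     + (1 - alpha_ce) * KL(softmax(teacher/T), softmax(student/T)) * T^2

What each parameter means:

Parameter	Meaning	Typical value
`alpha_ce`	Weight of the CE loss; at 1.0 only CE remains	0.5 (distillation) / 1.0 (pure CE)
`temperature` (T)	Temperature used to soften the distributions; the larger T, the smoother the distribution	2.0
`teacher_dtype`	Teacher model precision (fp16 saves GPU memory)	"fp16"

An example of what temperature does:

Raw logits: [10, 2, -1, -3]

T=1: softmax → [0.9996, 0.0003, 0.0000, 0.0000] (very peaked, almost identical to a hard label)
T=2: softmax(logits/2) → [0.977, 0.018, 0.004, 0.001] (smoother; the second token now has visible probability)
T=4: softmax(logits/4) → [0.808, 0.109, 0.052, 0.031] (smoother still, with more information in the tail)

T=2 is an empirical sweet spot: not so sharp that the soft labels carry little beyond the hard label, and not so flat that the main signal is diluted.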
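
The effect of temperature on the example logits [10, 2, -1, -3] can be reproduced with a few lines of standard-library Python. The helper name is an assumption for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10, 2, -1, -3]
for t in (1, 2, 4):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

Higher temperatures move probability mass from the top token into the tail, which is the part of the teacher's distribution a hard label cannot express.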

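7.2 Keeping the teacher frozen

# Teacher: eval mode + no gradients + fp16
teacher.eval()
teacher.requires_grad_(False)
teacher.half()  # fp16

# No gradients are needed for the teacher's forward pass
with torch.no_grad():
    teacher_outputs = teacher(input_ids=..., attention_mask=..., decoder_input_ids=...)

The teacher's parameters are never updated; it only runs forward passes to produce logits.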

7.3 Tokenizer flow

English source: "give me a cup of latte"
    │
    ▼ SentencePiece encode (source.spm)
tokens: ["▁give", "▁me", "▁a", "▁cup", "▁of", "▁lat", "te"]
    │
    ▼ append </s> (eos)
tokens: ["▁give", "▁me", "▁a", "▁cup", "▁of", "▁lat", "te", "</s>"]
    │
    ▼ convert to ids
input_ids: [123, 456, 78, ...]
    │
    ▼ feed into the model
encoder → decoder → logits [seq_len, vocab_size]
    │
    ▼ argmax / beam search
output_ids: [789, 234, ...]
    │
    ▼ SentencePiece decode (target.spm)
Chinese: "给我一杯拿铁"

7.4 Seq2SeqTrainer evaluation flow

Evaluation uses predict_with_generate=True:

The model generates translations (greedy decoding when --num-beams is 1, beam search otherwise)
sacrebleu.corpus_bleu() computes the BLEU score
BLEU is the main translation-quality metric here (higher is better)

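8. Full training chain: 8 stages

The final model opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8 came out of several rounds of iteration. The full chain follows in chronological order; each stage's output is either the starting point or a reference for the next.

Stage overview

Teacher (opus-mt-en-zh, ~77M)
    │
    ▼
[Stage 1] Full OPUS-100 training → opus-mt-small512d-opus100 (512d, vocab=65k, ~60M)
    │
    ▼
[Stage 2] Coffee supplementary fine-tune → opus-mt-small512d-opus100-ft-mix-coffee
    │
    ▼
[Stage 3] Joint 32k tokenizer training + smoke training → opus-mt-small512d-opus100-joint32k-smoke
    │
    ▼
[Stage 4] Full OPUS-100 with Joint32k → opus-mt-small512d-opus100-joint32k (~44M → CT2 44MB)
    │
    ▼
[Stage 5] 320d architecture trained from random initialization → opus-mt-small320d-opus100-joint32k (~20M)
    │
    ▼
[Stage 6] Money-only short fine-tune (fixes decimals but hurts coffee) → opus-mt-small320d-opus100-joint32k-ft-decimal
    │
    ▼
[Stage 7] Joint money + coffee short fine-tune (recommended) → opus-mt-small320d-opus100-joint32k-ft-money-coffee
    │
    ▼
[Stage 8] CTranslate2 int8 quantization → opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8 (~20MB)

Stage 1: full OPUS-100 training (512d, vocab=65k)

Goal: train a 512-dimensional student on the large corpus, to serve as the "intermediate teacher" for all later iterations.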

uv run -s distill_en_zh.py \
  --student-config student_config_512d_dec2.json \
  --init-from-teacher \
  --dataset opus100 \
  --dataset-config en-zh \
  --output-dir outputs/opus-mt-small512d-opus100 \
  --max-eval-samples 2000 \
  --preprocess-num-proc 8 \
  --num-train-epochs 1 \
  --learning-rate 3e-5 \
  --warmup-steps 2000 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 \
  --eval-steps 5000 \
  --save-steps 5000 \
  --logging-steps 50 \
  --fp16 \
  --teacher-dtype fp16 \
  --temperature 2 \
  --alpha-ce 0.5 \
  --num-beams 1

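Parameter	Value	Meaning
`--init-from-teacher`	-	Initialize weights from the teacher (`d_model=512` and the other dimensions match, so tensors can be copied exactly)
`--dataset opus100`	-	Use the HF opus100 dataset (1M sentence pairs)
`--num-train-epochs`	1	One full epoch over the data
`--learning-rate`	3e-5	A small learning rate (teacher-initialized weights do not need a large one)
`--alpha-ce`	0.5	CE and KL distillation each weigh half (same vocabulary as the teacher, so KD is possible)
`--temperature`	2	Distillation temperature
`--per-device-train-batch-size`	4	Batch size of 4 per GPU
`--gradient-accumulation-steps`	8	Accumulate gradients over 8 steps, for an effective batch size of 4 × 8 = 32
`--warmup-steps`	2000	Linear learning-rate warmup over the first 2000 steps
`--fp16`	-	Mixed-precision training (saves GPU memory)
`--num-beams`	1	Greedy decoding during evaluation (fast)

Result: eval_bleu ≈ 14.37 (2000-sample validation set)

Stage 2: Coffee supplementary fine-tune (512d, vocab=65k)

Goal: starting from the stage 1 model, mix in the short coffee sentences and fine-tune to fix the translation of words like latte.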

uv run -s distill_en_zh.py \
  --student-model outputs/opus-mt-small512d-opus100 \
  --dataset opus100 \
  --dataset-config en-zh \
  --extra-train-csv data/extra.coffee.r100.csv \
  --output-dir outputs/opus-mt-small512d-opus100-ft-mix-coffee \
  --max-train-samples 100000 \
  --max-eval-samples 2000 \
  --preprocess-num-proc 8 \
  --num-train-epochs 1 \
  --learning-rate 1e-5 \
  --warmup-steps 0 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 \
  --eval-steps 1000 \
  --save-steps 2000 \
  --logging-steps 50 \
  --fp16 \
  --alpha-ce 1.0 \
  --num-beams 1

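Parameter	Value	Meaning
`--student-model`	Stage 1 output	Load weights from the stage 1 model (not random initialization)
`--max-train-samples`	100000	Take only 100k OPUS-100 rows as an anchor (to prevent catastrophic forgetting)
`--learning-rate`	1e-5	Small learning rate for fine-tuning
`--warmup-steps`	0	No warmup needed (fine-tuning)
`--alpha-ce`	1.0	Pure CE (no distillation, since the point is to learn the ground truth of the coffee sentences)

Result: eval_bleu ≈ 14.62; "give me a cup of latte" → 给我一杯拿铁

Stage 3: joint 32k tokenizer + smoke training

3a. Train the joint 32k tokenizer (see section 5)
3b. Generate the student config (vocab=32001)
3c. Smoke training (50k samples, to validate the pipeline)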

uv run -s distill_en_zh.py \
  --student-config student_config_512d_dec2_joint32k.json \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --init-from-teacher \
  --init-from-teacher-decoder-layer-map 0,5 \
  --dataset opus100 \
  --dataset-config en-zh \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.travel.r20.csv \
  --output-dir outputs/opus-mt-small512d-opus100-joint32k-smoke \
  --max-train-samples 50000 \
  --max-eval-samples 2000 \
  --preprocess-num-proc 8 \
  --num-train-epochs 1 \
  --learning-rate 3e-4 \
  --warmup-steps 200 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 \
  --eval-steps 1000 \
  --save-steps 1000 \
  --logging-steps 50 \
  --fp16 \
  --alpha-ce 1.0 \
  --num-beams 1

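New parameters explained:

Parameter	Value	Meaning
`--tokenizer-dir`	`tokenizers/opus100_joint32k`	Use the custom joint 32k tokenizer (instead of the teacher's)
`--init-from-teacher-decoder-layer-map`	`0,5`	Decoder layer mapping: student decoder layer 0 ← teacher decoder layer 0, student layer 1 ← teacher layer 5 ("layer skipping": the first and last layers are taken as the most informative)
`--learning-rate`	3e-4	The vocabulary changed, so the weights need a larger learning rate to adapt

Initialization output: copied_exact=150, copied_mapped=52, skipped_shape_mismatch=5

150 parameter tensors copied exactly (encoder layers, etc.)
52 parameter tensors copied through the layer map (decoder layers)
5 parameter tensors skipped because of shape mismatch (most likely the vocabulary-sized tensors such as the token embeddings and output bias, whose vocabulary dimension is 32001 in the student but 65001 in the teacher)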
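
One way to picture the three counters is as a plan computed from tensor names and shapes alone. The sketch below is an assumption about how such an initializer could classify tensors, not the actual code behind --init-from-teacher; the key names mimic Hugging Face Marian state-dict keys, and the toy shape dictionaries are made up.

```python
def plan_teacher_init(student_shapes, teacher_shapes, decoder_layer_map):
    """Classify every student tensor as an exact copy, a layer-mapped copy, or skipped."""
    plan = {"copied_exact": [], "copied_mapped": [], "skipped_shape_mismatch": []}
    prefix = "model.decoder.layers."
    for name, shape in student_shapes.items():
        source, kind = name, "copied_exact"
        if name.startswith(prefix):
            # Redirect student decoder layer i to the teacher layer chosen by the map.
            layer_idx, rest = name[len(prefix):].split(".", 1)
            source = f"{prefix}{decoder_layer_map[int(layer_idx)]}.{rest}"
            kind = "copied_mapped"
        if teacher_shapes.get(source) == shape:
            plan[kind].append((name, source))
        else:
            plan["skipped_shape_mismatch"].append(name)
    return plan

teacher_shapes = {
    "model.shared.weight": (65001, 512),
    "model.encoder.layers.0.fc1.weight": (2048, 512),
    "model.decoder.layers.0.fc1.weight": (2048, 512),
    "model.decoder.layers.5.fc1.weight": (2048, 512),
}
student_shapes = {
    "model.shared.weight": (32001, 512),  # new 32k vocabulary: shape differs
    "model.encoder.layers.0.fc1.weight": (2048, 512),
    "model.decoder.layers.0.fc1.weight": (2048, 512),
    "model.decoder.layers.1.fc1.weight": (2048, 512),
}
plan = plan_teacher_init(student_shapes, teacher_shapes, decoder_layer_map=[0, 5])
print({kind: len(items) for kind, items in plan.items()})
```

In this toy setup only the vocabulary-sized embedding is skipped, while both student decoder layers receive weights from teacher layers 0 and 5.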

Result: eval_bleu ≈ 2.79 (smoke run only; quality is low and full training is needed)

Stage 4: full OPUS-100 with joint 32k (512d)

uv run -s distill_en_zh.py \
  --student-model outputs/opus-mt-small512d-opus100-joint32k-smoke \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --dataset opus100 \
  --dataset-config en-zh \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.travel.r20.csv \
  --output-dir outputs/opus-mt-small512d-opus100-joint32k \
  --max-eval-samples 2000 \
  --preprocess-num-proc 8 \
  --num-train-epochs 1 \
  --learning-rate 3e-4 \
  --warmup-steps 2000 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 \
  --eval-steps 10000 \
  --save-steps 5000 \
  --logging-steps 100 \
  --fp16 \
  --alpha-ce 1.0 \
  --num-beams 1

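Differences from the smoke run:

--student-model points at the smoke output (used as initialization, rather than starting from random weights)
No --max-train-samples (the full 1M rows plus the augmentation sets, about 1.04M rows)
--warmup-steps raised to 2000

Results:

eval_bleu ≈ 21.55
eval_loss ≈ 2.859
max_steps = 32588
model.safetensors ≈ 167MB

Stage 5: full training of the 320d architecture (from random initialization)

This is the key step that brings the model size down to ~20MB.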

uv run -s distill_en_zh.py \
  --teacher-model outputs/opus-mt-small512d-opus100-joint32k \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --student-config student_config_320d_enc5_dec2_joint32k.json \
  --dataset opus100 \
  --dataset-config en-zh \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.travel.r20.csv \
  --output-dir outputs/opus-mt-small320d-opus100-joint32k \
  --max-eval-samples 2000 \
  --preprocess-num-proc 8 \
  --num-train-epochs 1 \
  --learning-rate 3e-4 \
  --warmup-steps 2000 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 \
  --eval-steps 10000 \
  --save-steps 5000 \
  --logging-steps 100 \
  --fp16 \
  --alpha-ce 1.0 \
  --num-beams 1

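Key changes:

Parameter	Value	Meaning
`--teacher-model`	`outputs/opus-mt-small512d-opus100-joint32k`	The teacher switches from the original HF model to the stage 4 output (512d + Joint32k), which shares the student's 32k vocabulary
`--student-config`	`student_config_320d_enc5_dec2_joint32k.json`	New config with d_model=320
No `--init-from-teacher`	-	Start from random initialization (d_model=320 does not match the teacher's 512, so teacher weights cannot be copied)

Why can't --init-from-teacher be used? It requires d_model, encoder_attention_heads, encoder_ffn_dim, and the other dimensions to match the teacher exactly. The student has d_model=320 and the teacher d_model=512, so the weight matrices have different shapes (320×320 vs 512×512) and cannot be copied directly.

Results:

eval_bleu ≈ 17.36 (lower than the 512d model's 21.55, as expected with fewer parameters)
eval_loss ≈ 3.278
model.safetensors ≈ 76MB

Stage 6: money-only short fine-tune (fixes decimals, but hurts coffee)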

uv run -s distill_en_zh.py \
  --teacher-model outputs/opus-mt-small512d-opus100-joint32k \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --student-model outputs/opus-mt-small320d-opus100-joint32k \
  --dataset opus100 \
  --dataset-config en-zh \
  --max-train-samples 20000 \
  --max-eval-samples 500 \
  --extra-train-csv data/extra.money.synth.r50.csv \
  --output-dir outputs/opus-mt-small320d-opus100-joint32k-ft-decimal \
  --preprocess-num-proc 8 \
  --num-train-epochs 1 \
  --learning-rate 2e-4 \
  --warmup-steps 200 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 \
  --eval-steps 1000 \
  --save-steps 1000 \
  --logging-steps 50 \
  --fp16 \
  --alpha-ce 1.0 \
  --num-beams 1

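Parameter	Value	Meaning
`--student-model`	Stage 5 output	Load the stage 5 weights and continue fine-tuning
`--max-train-samples`	20000	Take only 20k OPUS-100 rows as an anchor
`--extra-train-csv`	`extra.money.synth.r50.csv`	100k rows of money augmentation data
`--learning-rate`	2e-4	Fine-tuning learning rate

Problem: amounts are fixed (19.99 → 19.99), but the latte sentence degrades to 给我来杯茶 ("give me a cup of tea"). Cause: the 100k money rows outweigh the 20k OPUS-100 rows five to one, so the model is pulled off course and forgets what it knew about coffee.

Stage 7: joint money + coffee short fine-tune (final HF model)

This is the final HF-format model: it fixes amounts while keeping the coffee scenario intact.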

uv run -s distill_en_zh.py \
  --teacher-model outputs/opus-mt-small512d-opus100-joint32k \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --student-model outputs/opus-mt-small320d-opus100-joint32k \
  --dataset opus100 \
  --dataset-config en-zh \
  --max-train-samples 50000 \
  --max-eval-samples 500 \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.money.synth.r20.csv \
  --output-dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee \
  --preprocess-num-proc 8 \
  --num-train-epochs 1 \
  --learning-rate 1e-4 \
  --warmup-steps 200 \
  --per-device-train-batch-size 4 \
  --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 \
  --eval-steps 1000 \
  --save-steps 1000 \
  --logging-steps 50 \
  --fp16 \
  --alpha-ce 1.0 \
  --num-beams 1

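Parameter	Value	Meaning
`--student-model`	Stage 5 output	Load weights directly from stage 5 (320d, full training), not from stage 6
`--max-train-samples`	50000	50k OPUS-100 rows as an anchor
`--extra-train-csv` ×2	coffee + money	Mix in both kinds of augmentation data at once
`--learning-rate`	1e-4	A smaller learning rate than stage 6 (gentler fine-tuning)

Training data composition:

OPUS-100: 50,000 rows
Coffee supplement: 2,800 rows (28 × 100)
Money synthetic: 40,000 rows (2,000 × 20)
Total: 92,800 rows

Results:

eval_bleu ≈ 16.34 (measured on 500 validation samples, so not directly comparable with stage 5's 2000-sample score)
eval_loss ≈ 3.301
max_steps = 2900

Stage 8: CTranslate2 int8 quantization (final artifact)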

rm -rf outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8

uv run -m ctranslate2.converters.transformers \
  --model outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee \
  --quantization int8 \
  --copy_files source.spm target.spm vocab.json tokenizer_config.json generation_config.json \
  --output_dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8

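Result: model.bin ≈ 20MB

9. CTranslate2 conversion and int8 quantization

9.1 What CTranslate2 is

CTranslate2 is a C++ library optimized for inference that supports efficient deployment of Transformer models. It converts an HF-format model (model.safetensors plus the JSON config files) into its own format (model.bin).

9.2 Why use CT2

Aspect	HF Transformers	CTranslate2
Inference speed	Slower (Python + PyTorch)	Fast (C++, heavily optimized)
Memory footprint	Large (loads all of PyTorch)	Small (loads only the model weights)
Quantization support	Limited	Native int8 / int8_float16
Deployment	Needs a Python environment	Needs only the CT2 library (C++ / Python bindings)

9.3 The conversion in detail

Input (HF-format directory):

outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee/
├── model.safetensors      ← model weights (76MB)
├── config.json            ← MarianConfig
├── generation_config.json ← generation parameters
├── source.spm             ← SentencePiece tokenizer
├── target.spm             ← SentencePiece tokenizer
├── vocab.json             ← vocabulary mapping
├── tokenizer_config.json  ← tokenizer config
└── run_args.json          ← record of the training arguments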

Conversion command:

python -m ctranslate2.converters.transformers \
  --model <HF_DIR> \
  --quantization int8 \
  --copy_files source.spm target.spm vocab.json tokenizer_config.json generation_config.json \
  --output_dir <CT2_DIR>

What the conversion does:

Reads config.json and identifies the model type as marian
Loads all weights from model.safetensors
Applies symmetric int8 quantization to each weight matrix:

# Quantization formula (CTranslate2 keeps one scale per row of the matrix)
int8_weight = round(float32_weight / scale)  # scale = max(abs(row)) / 127
# Storage: int8_weight (1 byte per parameter) + scale (4 bytes per row)

Generates model.bin (the CT2 binary format)
Generates config.json (in CT2's format)
Generates shared_vocabulary.json (the vocabulary)
Copies the files listed in --copy_files (spm, vocab.json, etc.)

Output (CT2-format directory):

outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8/
├── model.bin                 ← quantized model weights (~20MB)
├── config.json               ← CT2 config
├── shared_vocabulary.json    ← shared vocabulary
├── source.spm                ← SentencePiece tokenizer (copied)
├── target.spm                ← SentencePiece tokenizer (copied)
├── vocab.json                ← vocabulary mapping (copied)
├── tokenizer_config.json     ← tokenizer config (copied)
└── generation_config.json    ← generation parameters (copied)

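9.4 int8 quantization in detail

How symmetric int8 quantization works:

Each weight matrix is quantized independently, with one scale per row:

scale = max(abs(W_row)) / 127
W_int8 = round(W_float32 / scale)  # range [-127, 127]

At inference time:

# Conceptually, the float weights are recovered as
W_float_approx = W_int8 * scale

Parameter count vs. file size:

After int8 quantization each parameter takes about 1 byte
20M parameters ≈ 20MB (plus a small overhead for the scales)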
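
Here is a small standard-library sketch of symmetric int8 quantization and its round-trip error. It assumes one scale per row (per output channel), which is my understanding of how CTranslate2 stores int8 weights; using a single scale per matrix works the same way with one maximum taken over the whole matrix. The function names and the toy matrix are made up for this example.

```python
def quantize_rows(matrix):
    """Symmetric int8 quantization with one scale per row."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(w) for w in row) or 1.0  # avoid dividing by zero on all-zero rows
        scale = amax / 127.0
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales

def dequantize_rows(quantized, scales):
    """Recover approximate float weights: W ≈ W_int8 * scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]

weights = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
q, scales = quantize_rows(weights)
approx = dequantize_rows(q, scales)
max_err = max(abs(a - w) for ra, rw in zip(approx, weights) for a, w in zip(ra, rw))
print(q, max_err)
```

The reconstruction error of each weight is bounded by half a quantization step of its row, which is why rows with small values benefit from having their own scale.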

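Quantization comparison:

Model	safetensors	CT2 int8 model.bin	BLEU loss
512d Joint32k	167MB	44MB	Minimal
320d Joint32k	76MB	20MB	Minimal

9.5 The `--copy_files` argument

--copy_files source.spm target.spm vocab.json tokenizer_config.json generation_config.json

The converter does not generate these files itself, so they have to be copied over from the HF directory:

source.spm / target.spm: needed for tokenization at CT2 inference time
vocab.json: vocabulary mapping (some tools need it)
tokenizer_config.json / generation_config.json: configuration

Note: do not copy config.json. The converter writes its own CT2-format config.json, and copying the HF one would overwrite it.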

10. CT2 inference flow in detail

10.1 Inference script

scripts/translate_ct2.py is an interactive translation tool.

10.2 Inference flow

User input: "give me a cup of latte"
    │
    ▼ SentencePiece.encode (source.spm)
tokens: ["▁give", "▁me", "▁a", "▁cup", "▁of", "▁lat", "te", "</s>"]
    │
    ▼ CTranslate2.Translator.translate_batch (beam_size=4)
Inside CT2:
  1. The encoder turns the tokens into hidden states
  2. The decoder runs beam search (beam=4):
     - keeps 4 candidate sequences
     - at each step expands every candidate and keeps the top 4
  3. Outputs the best sequence
    │
    ▼ filter special tokens
drop </s>, <pad>, >>zh<<, etc.
    │
    ▼ SentencePiece.decode (target.spm)
Chinese: "给我一杯拿铁"

10.3 Usage

Interactive:

uv run -s scripts/translate_ct2.py
# type at the EN> prompt: give me a cup of latte
# printed at the ZH> prompt: 给我一杯拿铁

Single-sentence mode:

uv run -s scripts/translate_ct2.py \
  --model-dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8 \
  --device cuda \
  --compute-type int8_float16 \
  --beam-size 4 \
  --text "give me a cup of latte"

Parameter	Meaning
`--device`	`cuda` (GPU) / `cpu` / `auto` (auto-detect)
`--compute-type`	`int8_float16` (int8 weights + fp16 compute; fast with good quality)
`--beam-size`	Beam search width. Larger is better but slower; 4 is recommended
`--max-decoding-length`	Maximum decoding length (default 64)

10.4 Translation quality of the final model

English input	Chinese output	Status
give me a cup of latte	给我一杯拿铁	✅
give me a cup of caffe	给我一杯咖啡	✅
The total is 19.99 USD.	总额为19.99美元。	✅
The total is $19.99.	总额为19.99美元。	✅
Subtotal: $1,234.56. Tax: $1.60. Total: $19.99.	小计:1,234.56美元。税费:1.60美元。总计:19.99美元。	✅
This is a dog	这是狗狗	⚠️ ("狗狗" is not the most natural wording)
Tom is going to Beijing on 2026-04-20.	汤姆将于2026-04-20号去北京	✅

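11. The pad-embedding zeroing problem (CT2 alignment)

11.1 Symptom

The same student model produces reasonable output in HF with generate(num_beams=4), but quality drops noticeably after conversion to CT2.

11.2 Root cause

Marian's special behavior:

Marian has no <bos> token; the decoder start token is <pad>
During decoding, the first decoder input is the embedding vector of <pad>

CT2's optimization:

The CT2 converter removes <pad> from the vocabulary (it only serves as the start signal at inference time)
CT2 uses a start_from_zero_embedding strategy: decoding starts from an all-zero vector

The conflict:

The teacher's (Helsinki-NLP) <pad> embedding is already 0, so CT2's assumption holds
The student, trained from random initialization, ends up with a non-zero <pad> embedding
CT2 starts decoding from a zero vector while HF starts from the actual <pad> embedding → the two behave differently

11.3 Fix: ZeroPadEmbeddingCallback

Implemented in distill_en_zh.py: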

class ZeroPadEmbeddingCallback(TrainerCallback):
    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def _zero_pad(self, model):
        embeddings = model.get_input_embeddings()
        weight = embeddings.weight
        with torch.no_grad():
            weight[self.pad_token_id].zero_()

    def on_train_begin(self, args, state, control, **kwargs):
        self._zero_pad(kwargs.get("model"))

    def on_step_end(self, args, state, control, **kwargs):
        self._zero_pad(kwargs.get("model"))

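How it works:

Before training starts: zero the <pad> embedding row
After every training step: zero the <pad> embedding row again (so gradient updates cannot make it drift)

The training script also zeroes it at initialization:

# Zero the row right after the student is created
pad_id = student.config.pad_token_id
with torch.no_grad():
    student.get_input_embeddings().weight[pad_id].zero_()

11.4 How to verify

After training, check that the <pad> embedding is 0: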

uv run - <<'PY'
from transformers import MarianMTModel
m = MarianMTModel.from_pretrained("outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee")
pad = m.config.pad_token_id
print(m.model.shared.weight[pad].norm().item())  # should print 0.0
PY

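12. Key parameter summary tables

12.1 Distillation training parameters

Parameter	Stage 5 (320d full)	Stage 7 (final short fine-tune)	Meaning
`--teacher-model`	`outputs/opus-mt-small512d-opus100-joint32k`	Same	Teacher model path
`--tokenizer-dir`	`tokenizers/opus100_joint32k`	Same	Custom tokenizer
`--student-model`	(none, random initialization)	`outputs/opus-mt-small320d-opus100-joint32k`	Student model path (loads existing weights)
`--dataset`	`opus100`	`opus100`	Dataset name
`--max-train-samples`	None (full data)	50000	Cap on training samples
`--extra-train-csv`	coffee + travel	coffee + money	Supplementary data
`--num-train-epochs`	1	1	Number of epochs
`--learning-rate`	3e-4	1e-4	Learning rate (larger for full training, smaller for fine-tuning)
`--warmup-steps`	2000	200	Warmup steps
`--per-device-train-batch-size`	4	4	Batch size per GPU
`--gradient-accumulation-steps`	8	8	Gradient accumulation steps
Effective batch size (derived, not a flag)	32	32	batch size × accumulation steps
`--alpha-ce`	1.0	1.0	CE weight (pure CE; the KL term is off)
`--temperature`	2.0	2.0	Distillation temperature (has no effect while alpha-ce is 1.0)
`--fp16`	Yes	Yes	Mixed-precision training
`--num-beams`	1	1	Beam width during evaluation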

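12.2 Model architecture parameters

Parameter	Value	Meaning
`d_model`	320	Hidden / embedding dimension
`encoder_layers`	5	Number of encoder layers
`decoder_layers`	2	Number of decoder layers
`encoder_attention_heads`	8	Encoder attention heads
`decoder_attention_heads`	8	Decoder attention heads
`encoder_ffn_dim`	1280	Encoder feed-forward size
`decoder_ffn_dim`	1280	Decoder feed-forward size
`vocab_size`	32001	Vocabulary size
`pad_token_id`	32000	Padding token id
Parameters	~20M	Total parameter count
CT2 int8 size	~20MB	File size after quantization

12.3 Dataset statistics

Data source	Rows	Origin
OPUS-100 train	1,000,000	HF datasets
OPUS-100 validation	2,000	HF datasets
extra.coffee.r100.csv	2,800	Generated by the teacher CT2 model, 28 sentences × 100
extra.money.synth.r20.csv	40,000	Rule-based template synthesis, 2,000 sentences × 20
extra.travel.r20.csv	40,000	Generated by the teacher CT2 model, 2,000 sentences × 20
manual_corrections.csv	6	Hand-written correction table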

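13. Full list of reproduction commands

The complete command sequence for reproducing the final model from scratch follows (run in order).

13.0 Environment setup

# Proxy settings (if needed; replace the address with your own proxy)
export HTTP_PROXY=http://10.8.2.26:10808
export HTTPS_PROXY=http://10.8.2.26:10808
export ALL_PROXY=http://10.8.2.26:10808

# Check the Python version
python --version  # should be >= 3.10

# Check the GPU
nvidia-smi -L  # confirm an NVIDIA GPU is present

# Install dependencies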
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu124
uv pip install --python .venv/bin/python transformers datasets==2.21.0 sentencepiece sacremoses accelerate evaluate sacrebleu ctranslate2

13.1 Train the joint 32k tokenizer

export HTTP_PROXY=http://10.8.2.26:10808
export HTTPS_PROXY=http://10.8.2.26:10808
export ALL_PROXY=http://10.8.2.26:10808

rm -rf tokenizers/opus100_joint32k

# Generate the supplementary data first (tokenizer training uses it)
uv run -s scripts/make_extra_coffee_data.py \
  --output-csv data/extra.coffee.r100.csv --repeat 100 --device cuda --compute-type int8_float16

uv run -s scripts/make_extra_travel_data.py \
  --output-csv data/extra.travel.r20.csv --num-examples 2000 --repeat 20 --device cuda --compute-type int8_float16

# Train the tokenizer
uv run -s scripts/build_joint_tokenizer.py \
  --output-dir tokenizers/opus100_joint32k \
  --max-samples 200000 \
  --extra-csv data/extra.coffee.r100.csv \
  --extra-csv data/extra.travel.r20.csv \
  --vocab-size 32000 \
  --model-type unigram

13.2 Train the intermediate teacher (512d Joint32k, full OPUS-100)

This step takes a long time to train (how long depends on the GPU).

# Generate the student config first
uv run -s scripts/make_student_config.py \
  --teacher-model Helsinki-NLP/opus-mt-en-zh \
  --output student_config_512d_dec2_joint32k.json \
  --d-model 512 --encoder-layers 6 --decoder-layers 2 \
  --attention-heads 8 --ffn-dim 2048 --vocab-size 32001

rm -rf outputs/opus-mt-small512d-opus100-joint32k-smoke

# Smoke training (validates the pipeline)
uv run -s distill_en_zh.py \
  --student-config student_config_512d_dec2_joint32k.json \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --init-from-teacher \
  --init-from-teacher-decoder-layer-map 0,5 \
  --dataset opus100 --dataset-config en-zh \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.travel.r20.csv \
  --output-dir outputs/opus-mt-small512d-opus100-joint32k-smoke \
  --max-train-samples 50000 --max-eval-samples 2000 \
  --preprocess-num-proc 8 --num-train-epochs 1 \
  --learning-rate 3e-4 --warmup-steps 200 \
  --per-device-train-batch-size 4 --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 --eval-steps 1000 --save-steps 1000 \
  --logging-steps 50 --fp16 --alpha-ce 1.0 --num-beams 1
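
The smoke run above pairs `--per-device-train-batch-size 4` with `--gradient-accumulation-steps 8`. A minimal sketch of what that means for the effective batch size and the approximate number of optimizer steps is below. It assumes a single GPU and counts only the 50,000 capped OPUS-100 samples (rows from the extra CSVs are ignored, so the real count will be somewhat higher). The helper name `optimizer_steps` is made up for illustration and is not part of the training script.

```python
import math


def optimizer_steps(num_samples, per_device_batch, grad_accum, num_devices=1, epochs=1):
    """Return (effective batch size, approximate optimizer steps).

    Hypothetical helper: a back-of-the-envelope estimate, not the exact
    step count a trainer would report.
    """
    # Samples consumed per optimizer update.
    effective_batch = per_device_batch * grad_accum * num_devices
    # Round up: a final partial batch still triggers an update in this estimate.
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Values taken from the smoke command; single GPU is an assumption.
effective_batch, steps = optimizer_steps(
    num_samples=50_000, per_device_batch=4, grad_accum=8
)
print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {steps}")
```

With roughly 1,500 updates in total, `--warmup-steps 200` covers a bit over a tenth of the smoke run, and `--eval-steps 1000` fires only once.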

# Full training
rm -rf outputs/opus-mt-small512d-opus100-joint32k

uv run -s distill_en_zh.py \
  --student-model outputs/opus-mt-small512d-opus100-joint32k-smoke \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --dataset opus100 --dataset-config en-zh \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.travel.r20.csv \
  --output-dir outputs/opus-mt-small512d-opus100-joint32k \
  --max-eval-samples 2000 --preprocess-num-proc 8 \
  --num-train-epochs 1 --learning-rate 3e-4 --warmup-steps 2000 \
  --per-device-train-batch-size 4 --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 --eval-steps 10000 --save-steps 5000 \
  --logging-steps 100 --fp16 --alpha-ce 1.0 --num-beams 1
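
The full run uses `--learning-rate 3e-4` with `--warmup-steps 2000`. The sketch below shows only the linear warmup part of such a schedule. What happens after warmup (decay, constant, etc.) depends on the scheduler the training script configures, which is not shown here, so holding the rate constant afterwards is purely an assumption for illustration. `warmup_lr` is a hypothetical name.

```python
def warmup_lr(step, peak_lr=3e-4, warmup_steps=2000):
    """Learning rate at a given optimizer step under linear warmup.

    Assumption: after warmup the rate is held at peak_lr. A real
    scheduler will usually decay it instead.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 up to peak_lr over warmup_steps updates.
        return peak_lr * step / warmup_steps
    return peak_lr


for step in (0, 500, 1000, 2000, 5000):
    print(step, warmup_lr(step))
```

Starting near zero keeps the first updates small, while the optimizer statistics are still noisy.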

13.3 Train the Final 320d Student Model (full data)

# Generate the 320d config
uv run -s scripts/make_student_config.py \
  --teacher-model outputs/opus-mt-small512d-opus100-joint32k \
  --output student_config_320d_enc5_dec2_joint32k.json \
  --d-model 320 --encoder-layers 5 --decoder-layers 2 \
  --attention-heads 8 --ffn-dim 1280 --vocab-size 32001

# Full training (from random initialization)
rm -rf outputs/opus-mt-small320d-opus100-joint32k

uv run -s distill_en_zh.py \
  --teacher-model outputs/opus-mt-small512d-opus100-joint32k \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --student-config student_config_320d_enc5_dec2_joint32k.json \
  --dataset opus100 --dataset-config en-zh \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.travel.r20.csv \
  --output-dir outputs/opus-mt-small320d-opus100-joint32k \
  --max-eval-samples 2000 --preprocess-num-proc 8 \
  --num-train-epochs 1 --learning-rate 3e-4 --warmup-steps 2000 \
  --per-device-train-batch-size 4 --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 --eval-steps 10000 --save-steps 5000 \
  --logging-steps 100 --fp16 --alpha-ce 1.0 --num-beams 1

13.4 Generate Synthetic Money Data + Joint Short Fine-tuning

# Generate the money data
uv run -s scripts/make_extra_money_data.py \
  --output-csv data/extra.money.synth.r20.csv \
  --num-examples 2000 --repeat 20 --seed 42

# Joint short fine-tuning (money + coffee)
rm -rf outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee

uv run -s distill_en_zh.py \
  --teacher-model outputs/opus-mt-small512d-opus100-joint32k \
  --tokenizer-dir tokenizers/opus100_joint32k \
  --student-model outputs/opus-mt-small320d-opus100-joint32k \
  --dataset opus100 --dataset-config en-zh \
  --max-train-samples 50000 --max-eval-samples 500 \
  --extra-train-csv data/extra.coffee.r100.csv \
  --extra-train-csv data/extra.money.synth.r20.csv \
  --output-dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee \
  --preprocess-num-proc 8 --num-train-epochs 1 \
  --learning-rate 1e-4 --warmup-steps 200 \
  --per-device-train-batch-size 4 --gradient-accumulation-steps 8 \
  --per-device-eval-batch-size 4 --eval-steps 1000 --save-steps 1000 \
  --logging-steps 50 --fp16 --alpha-ce 1.0 --num-beams 1

13.5 Verify That the Pad Embedding Is Zero

uv run - <<'PY'
from transformers import MarianMTModel
m = MarianMTModel.from_pretrained("outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee")
pad = m.config.pad_token_id
norm = m.model.shared.weight[pad].norm().item()
print(f"Pad embedding norm: {norm}")  # should be 0.0
assert norm == 0.0, "Pad embedding is not zero!"
PY

13.6 CTranslate2 int8 Conversion

rm -rf outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8

uv run -m ctranslate2.converters.transformers \
  --model outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee \
  --quantization int8 \
  --copy_files source.spm target.spm vocab.json tokenizer_config.json generation_config.json \
  --output_dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8

# Check the size
du -h outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8/model.bin
# should show ~20MB
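
As a rough sanity check on the ~20MB figure, here is the arithmetic behind it. It assumes about 20M parameters (the approximate student size quoted earlier in this article) and one byte per int8 weight. It ignores quantization scales, any tensors the converter leaves unquantized, and file-format overhead, so the real `model.bin` will not match exactly.

```python
def estimate_size_mb(num_params, bytes_per_weight):
    """Very rough model size in MB: parameter count times bytes per weight."""
    return num_params * bytes_per_weight / 1e6


# Assumption: ~20M parameters, the approximate student size.
params = 20_000_000

fp32_mb = estimate_size_mb(params, 4)  # 4 bytes per FP32 weight
int8_mb = estimate_size_mb(params, 1)  # 1 byte per int8 weight

print(f"FP32: ~{fp32_mb:.0f} MB")
print(f"int8: ~{int8_mb:.0f} MB ({int8_mb / fp32_mb:.0%} of FP32)")
```

If `du -h` reports something far from this ballpark, the conversion probably did not quantize as expected.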

13.7 Test Translation Quality

uv run -s scripts/translate_ct2.py \
  --model-dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8 \
  --device cuda --compute-type int8_float16 --beam-size 4 \
  --text "give me a cup of latte"

uv run -s scripts/translate_ct2.py \
  --model-dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8 \
  --device cuda --compute-type int8_float16 --beam-size 4 \
  --text "The total is 19.99 USD."

uv run -s scripts/translate_ct2.py \
  --model-dir outputs/opus-mt-small320d-opus100-joint32k-ft-money-coffee-ct2-int8 \
  --device cuda --compute-type int8_float16 --beam-size 4 \
  --text "Tom is going to Beijing on 2026-04-20."

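Appendix: Glossary

Term	Explanation
Marian	The seq2seq Transformer architecture used by the OPUS project; lightweight and efficient
SentencePiece	Google's open-source subword tokenizer; supports BPE and Unigram
Subword Tokenization	Splits words into smaller units (subwords) to solve the out-of-vocabulary (OOV) problem
CTranslate2 (CT2)	A C++ library optimized for Transformer inference
int8 quantization	Compresses FP32 weights to 8-bit integers, cutting the size to roughly 1/4
Knowledge Distillation (KD)	Has a small model learn the output distribution of a large model
Cross-Entropy (CE)	The difference between the predicted distribution and the true distribution
KL Divergence	The difference between two probability distributions
Temperature (T)	A parameter that softens a probability distribution
BLEU	A translation quality metric (0-100, higher is better)
Beam Search	Keeps multiple candidate sequences during decoding and picks the best one
Gradient Accumulation	Accumulates gradients over several batches before updating, simulating a larger batch
FP16 mixed precision	Trains with half-precision floats to save GPU memory and speed up training
Warmup	Linearly increases the learning rate early in training to avoid gradient explosion at the start
Layer Map	The mapping between student layers and teacher layers (used for weight initialization)
Catastrophic forgetting	During fine-tuning, the model forgets what it learned before
ZeroPadEmbedding	Forces the embedding vector for `<pad>` to zero
OPUS-100	A large-scale parallel corpus covering 100 languages
Tatoeba	An open-source multilingual sentence translation dataset
HF (Hugging Face)	An open-source ML platform that provides the transformers/datasets libraries and model hosting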
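
To make the CE / KL / Temperature entries concrete, here is a toy, pure-Python version of the distillation loss from section 1, computed for a single token over a 3-word vocabulary. The logits are invented for illustration, and the function names are hypothetical. The real training script works on batched tensors, so treat this as a sketch of the formula, not of the actual implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """CE between the student's prediction and a hard label."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(student_logits, teacher_logits, label, alpha_ce=0.5, temperature=2.0):
    """alpha_ce * CE + (1 - alpha_ce) * KL(teacher_T || student_T) * T^2."""
    ce = cross_entropy(student_logits, label)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = kl_divergence(p_teacher, p_student)
    return alpha_ce * ce + (1.0 - alpha_ce) * kl * temperature ** 2


# Made-up logits: the teacher thinks words 0 and 1 are both plausible.
teacher = [4.0, 3.5, 0.5]
student = [2.0, 1.0, 0.0]

mixed = distill_loss(student, teacher, label=0, alpha_ce=0.5, temperature=2.0)
ce_only = distill_loss(student, teacher, label=0, alpha_ce=1.0)
print(f"alpha_ce=0.5: {mixed:.4f}")
print(f"alpha_ce=1.0: {ce_only:.4f}")
```

With `alpha_ce=1.0` the KL term drops out entirely, which matches the `--alpha-ce 1.0` setting used in the commands in section 13.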
