使用ms-swift训练模型-个人可运行的示例-25年04月

Written with StackEdit.

环境：魔搭社区modelscope，notebooksGPU环境

大纲：

swifft是什么
加载、训练、输出的示例
自定义损失函数

疑惑：
4. 为什么要用template?
5. 能否dataset也用swift加载。swift的dataset逻辑与Dataset类有什么不同？

swift是什么

swift有很多种意思，例如：
它可以表示一种Apple开发的编程语言；
它可以作为形容词表示“快速的，迅速的”；
……

在这里，swift是一个第三方 Python 库，通常是用于对大模型进行高效微调和推理。可以通过下面的命令安装：

pip install ms-swift
```    
根据[swift官方文档](https://swift.readthedocs.io/en/latest/GetStarted/Quick-start.html)，可以得知：
`ms-swift` 是由阿里巴巴的 ModelScope 社区开发和维护的 Python 框架，全称为 **SWIFT（Scalable lightWeight Infrastructure for Fine-Tuning）**。它专为大语言模型（LLM）和多模态大模型的微调、推理、评估、量化和部署而设计，支持超过 450 个文本模型和 150 个多模态模型，包括 Qwen、InternLM、GLM、Baichuan、Yi、LLaMA、Mistral 等主流模型。

---
### 加载、微调、输出的实例

前言：
如果你是一位做深度学习的研究者，有过项目经验，你可以直接阅读官方github项目的示例代码：
[10-minute self-cognition SFT](https://github.com/modelscope/ms-swift/blob/main/examples/notebook/qwen2_5-self-cognition/self-cognition-sft.ipynb)。（直接询问AI助手很容易因为"swift"本身的混淆产生幻觉，补充官方文档的信息即可自行利用）

简单概括，下面的过程是“使用swift的LoRA微调方法微调Qwen2.5-3B-Instruct，并尝试添加自定义损失函数”。

#### 加载模型和分词器
```python
from swift.llm import (
    get_model_tokenizer, 
    get_template,
    # load_dataset  # 由于编程顺序原因，个人加载数据集采用的是Dataset模块，没用swift
)
# from swift import Swift, LoRAConfig # 这两可以直接从swift import，但后面的get_template就不行了
from swift.tuners import Swift, LoRAConfig
import torch

model_id = 'Qwen/Qwen2.5-3B-Instruct'
# 加载模型和分词器
model, tokenizer = get_model_tokenizer(
    model_id_or_path=model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    model_kwargs={"device_map": "auto"}
)

LoRA配置

# 配置LoRA参数

lora_config = LoRAConfig(

r=16, # 推荐较小的rank值（8-32）

lora_alpha=32,

target_modules=[

"q_proj", # 查询投影

"k_proj", # 键投影

"v_proj", # 值投影

"o_proj", # 输出投影

"gate_proj", # 门控投影(MLP)

"up_proj", # 上投影(MLP)

"down_proj"  # 下投影(MLP)

],

lora_dropout=0.05,

bias="none"  # 不训练bias参数

)

  

# 若报错，ValueError: Target modules {'dense_4h_to_h', 'query_key_value', 'dense_h_to_4h', 'dense'} not found in the base model. Please check the target modules and try again.
# 这个意思是LoRAConfig中的target_modules含有模型不具有的层。target_modules是要被微调的部分，可以通过"print(model)"查看模型结构后进行更正。

# 用Swift载入lora配置，并将模型装在到gpu上

model = Swift.prepare_model(model, lora_config)

if  torch.cuda.is_available():

model = model.to('cuda')

数据集加载部分

这部分略过，请结合具体数据集自行进行处理后载入。
个人实现的大致思路为：

# 这段代码不公开，只提取思路主干做参考，必须补足逻辑后才能正常运行
from datasets import load_dataset
# 加载数据集
dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

# 定义预处理函数
def preprocess_function(examples):
    dialogues = []
    for messages in examples["messages"]:
        dialogue = ""
        for msg in messages:
			# ....... # 对dialogue+="构建后的内容"
        dialogues.append(dialogue.strip())

    inputs = tokenizer(dialogues, truncation=True, padding="max_length", max_length=256)
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

# 应用预处理
dataset = dataset.map(preprocess_function, batched=True, remove_columns=["messages"])

# 拆分训练集和验证集
train_dataset, eval_dataset = dataset.train_test_split(test_size=0.1).values()

训练

from swift.llm import get_template
# 5. 设置模板
template = get_template(
    model.model_meta.template, tokenizer,
    max_length=512
)
template.set_mode('train')  # 设置为训练模式

from swift import Trainer, TrainingArguments
# help(get_template)
# !pip show ms-swift
torch.set_grad_enabled(True)
# 设置训练参数
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    optim="adamw_torch",
    learning_rate=3.7e-5,
    save_steps=10_000,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    fp16=False,
    bf16=torch.cuda.is_bf16_supported(),  # 仅当GPU支持时启用
)
# 注意，必须运行下面这句代码，才能开启梯度传播！默认是不开启的，直接运行会报错
model.enable_input_require_grads()

# 如果自定义了损失类CustomLoss，可以启用"compute_loss_func=CustomLoss(),"一行
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # compute_loss_func=CustomLoss(),
    template=template, 
)

# 开始微调
trainer.train()

# 保存微调后的模型
model.save_pretrained("./output")

推理

# 基本的推理方法
def inference(model, tokenizer, input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_length=256) # max_length限制输出长度
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

# 示例推理
input_text = "输入，毕竟它是3B模型，不用期待它有满血版ds质量的输出。"
response = inference(model, tokenizer, input_text)
print(response)

补充：自定义损失函数

贴出一个使用交叉熵损失（如果没有指明损失函数，默认采用的就是交叉熵损失）的基本实现。

class CustomLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.cross_entropy_loss = nn.CrossEntropyLoss()
        
    def forward(self, outputs, labels, **kwargs):
        """
        参数说明:
        - outputs: 模型输出（需包含logits和生成的文本）
        - labels: 真实标签
        - kwargs: Trainer可能传入的其他参数（如num_items_in_batch），如果不添加**kwargs并且还未对特定参数进行单独处理，会报传入未知参数的错误
        - attention_mark是掩码，当你处于让数据长度相同等目的，对数据进行padding时，用于记录哪些位是填充前的有效数据
        """
        # 从outputs中提取必要信息
        logits = outputs['logits'] if isinstance(outputs, dict) else outputs
        mask = outputs.get('attention_mask', None)        
        
        # 带mask的交叉熵计算
        if mask is not None:
            # 确保mask与labels形状一致
            assert mask.shape == labels.shape, f"Mask shape {mask.shape} != labels shape {labels.shape}"
            
            # 计算每个位置的loss（未求平均）
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                labels.view(-1),
                reduction='none'
            )
            
            # 应用mask
            valid_tokens = mask.view(-1).sum()
            ce_loss = (loss * mask.view(-1)).sum() / (valid_tokens + 1e-8)
        else:
            ce_loss = self.cross_entropy_loss(
                logits.view(-1, logits.size(-1)),
                labels.view(-1)
            )

		return ce_loss

Qijia Huang

https://mys109hqj.github.io/2025/04/25/250425ms-swift-notes/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Qijia Huang !

study

自然语言微调技术(NLFT)-25年04月-克隆运行实践

2025-04-26 major

study

250105QuizArena

2025-01-05 major

study