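Fine-tune FLAN-T5 XL/XXL using DeepSpeed and Hugging Face 🤗 Transformers

The paper Scaling Instruction-Finetuned Language Models introduced FLAN-T5, an enhanced version of T5. FLAN-T5 was fine-tuned on a large and varied collection of tasks, so, put simply, it is a T5 model that is better across the board. At the same parameter count, FLAN-T5 outperforms T5 by double-digit improvements. Google has open-sourced 5 FLAN-T5 checkpoints on Hugging Face, ranging from 80 million to 11 billion parameters.

In a previous blog post, we learned how to fine-tune FLAN-T5 for chat and dialogue summarization, using the Base version (250M parameters) of the model. In this post, we look at how to scale the training from Base to XL (3 billion parameters) or XXL (11 billion parameters).

This means we will learn how to use model parallelism, multiple GPUs, and DeepSpeed ZeRO to fine-tune FLAN-T5 XL and XXL.

In addition to the tutorial, we ran a series of experiments whose results can help you choose the right hardware setup. You can find the details in the Results and Experiments section.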

# install git lfs for pushing artifacts
!sudo apt install git-lfs
# install torch with the correct cuda version, check nvcc --version
!pip install torch --extra-index-url https://download.pytorch.org/whl/cu116 --upgrade
# install Hugging Face Libraries
!pip install "transformers==4.26.0" "datasets==2.9.0" "accelerate==0.16.0" "evaluate==0.4.0" --upgrade
# install deepspeed and ninja for jit compilations of kernels
!pip install "deepspeed==0.8.0" ninja --upgrade
# install additional dependencies needed for training
!pip install rouge-score nltk py7zr tensorboard

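Process the dataset

As in the post on fine-tuning FLAN-T5 for chat and dialogue summarization, we first need to prepare a dataset for fine-tuning. In this post, we fine-tune FLAN-T5-XXL on the CNN Dailymail dataset. We won't go into detail on how the dataset is generated; if you want the detailed steps, see the previous post, Fine Tune FLAN-T5.

We define some parameters that are used throughout this example; feel free to adjust them to your needs.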

# experiment config
model_id = "google/flan-t5-xxl" # Hugging Face model id
dataset_id = "cnn_dailymail" # Hugging Face dataset id
dataset_config = "3.0.0" # dataset version
save_dataset_path = "data" # local path for the processed dataset
text_column = "article" # column holding the input text
summary_column = "highlights" # column holding the output text
# custom instruction prompt template
prompt_template = f"Summarize the following news article:\n{{input}}\nSummary:\n"
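As a quick illustration of how the template is filled in, the sketch below formats a made-up article. Because prompt_template is an f-string, the doubled braces {{input}} collapse to a literal {input} placeholder, which str.format then replaces. The sample article text is an assumption used only for illustration.

```python
# Same template as in the experiment config.
prompt_template = f"Summarize the following news article:\n{{input}}\nSummary:\n"

# Hypothetical article text, used only to show the substitution.
sample_article = "The city council approved a new bike lane on Main Street."

prompt = prompt_template.format(input=sample_article)
print(prompt)
```

The fixed parts before and after the article are what the prompt length calculation later in this section accounts for.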

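Unlike in Fine Tune FLAN-T5, this time we separate preprocessing from training, which lets us run the preprocessing on a non-GPU instance. We first preprocess (i.e. tokenize) the dataset and save it to disk; the training script then loads the preprocessed dataset from disk.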

from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

# Load dataset from the hub
dataset = load_dataset(dataset_id, name=dataset_config)
# Load the tokenizer for model_id (FLAN-T5 XXL)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Train dataset size: 287113
# Test dataset size: 11490

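We defined a prompt_template in our config, which we use to construct an instruction prompt that improves the model's performance. The prompt_template has a fixed start and end, with the document in the middle. This means we need to make sure that the fixed template parts plus the document do not exceed the model's maximum sequence length. We therefore compute the maximum document length the model supports, and later use it to pad or truncate the document in the template.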

prompt_length = len(tokenizer(prompt_template.format(input=""))["input_ids"])
max_sample_length = tokenizer.model_max_length - prompt_length
print(f"Prompt length: {prompt_length}")
print(f"Max input length: {max_sample_length}")

# Prompt length: 12
# Max input length: 500
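The budget arithmetic can be sketched without loading the real tokenizer. The example below assumes a model maximum of 512 tokens (which is what the printed values 12 and 500 imply) and uses plain lists of integers as stand-ins for token ids. fit_document is a hypothetical helper, not part of the tutorial's code.

```python
MODEL_MAX_LENGTH = 512  # assumption: implied by 12 prompt tokens + 500 document tokens
PROMPT_LENGTH = 12      # prompt length reported in the output

def fit_document(doc_token_ids, max_doc_length):
    """Truncate a list of token ids so that prompt + document fits the model."""
    return doc_token_ids[:max_doc_length]

max_sample_length = MODEL_MAX_LENGTH - PROMPT_LENGTH

# Stand-in token ids for a long and a short document.
long_doc = list(range(800))
short_doc = list(range(120))

print(max_sample_length)                                 # 500
print(len(fit_document(long_doc, max_sample_length)))    # 500
print(len(fit_document(short_doc, max_sample_length)))   # 120
```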

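We now know that the maximum input document length the model supports is 500 tokens. Besides the input, we also need the maximum target sequence length, which we get by iterating over the summaries in the dataset. (The code takes a few minutes to run.)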

from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
max_source_length = min(max_source_length, max_sample_length)
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use 95th percentile as max target length
max_target_length = int(np.percentile(target_lengths, 95))
print(f"Max target length: {max_target_length}")

# Max source length: 500
# Max target length: 129
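To see why the 95th percentile is used rather than the maximum, consider a toy list of summary lengths with a single outlier. The sketch below is pure Python and assumes the linear-interpolation rule that np.percentile uses by default; the lengths are invented for illustration.

```python
import math

def percentile(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

# Invented target lengths: most summaries are short, one is an outlier.
target_lengths = [40, 45, 50, 52, 55, 58, 60, 62, 65, 70,
                  72, 75, 78, 80, 82, 85, 88, 90, 95, 400]

max_target_length = int(percentile(target_lengths, 95))
print(max(target_lengths))   # 400
print(max_target_length)     # 110
```

Padding every label to the maximum would spend most of each sequence on padding because of one long sample; with the percentile cut-off, only the few longest summaries get truncated.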

Now everything is ready to process the dataset.

import os

def preprocess_function(sample, padding="max_length"):
    # create prompted input
    inputs = [prompt_template.format(input=item) for item in sample[text_column]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=tokenizer.model_max_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# process dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset["train"].features))

# save dataset to disk
tokenized_dataset["train"].save_to_disk(os.path.join(save_dataset_path, "train"))
tokenized_dataset["test"].save_to_disk(os.path.join(save_dataset_path, "eval"))
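The label masking step in preprocess_function can be seen in isolation on a toy batch. The sketch assumes a pad token id of 0 (the value T5 tokenizers use) and invented label ids; -100 is the index that PyTorch's cross-entropy loss ignores by default, so masked positions do not contribute to the loss.

```python
PAD_TOKEN_ID = 0      # assumption: pad id of the T5 tokenizer
IGNORE_INDEX = -100   # positions with this label are skipped by the loss

def mask_padding(label_batch, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [(tok if tok != pad_token_id else IGNORE_INDEX) for tok in label]
        for label in label_batch
    ]

# Two invented label sequences, already padded to length 6.
labels = [
    [71, 1525, 19, 1, 0, 0],
    [37, 1, 0, 0, 0, 0],
]

masked = mask_padding(labels)
print(masked)
# [[71, 1525, 19, 1, -100, -100], [37, 1, -100, -100, -100, -100]]
```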

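Fine-tune the model using deepspeed

Everything is ready, so we can now start training the model. As mentioned earlier, we use the Hugging Face Trainer with its DeepSpeed integration, so we need to create a deepspeed_config.json. The DeepSpeed config defines, among other things, which ZeRO strategy to use and whether to use mixed-precision training. The Hugging Face Trainer lets deepspeed_config.json inherit values from the TrainingArguments to avoid duplicate settings; see the documentation for more information.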

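We created 4 sets of deepspeed config files for the experiments, covering CPU offloading and mixed precision:

ds_flan_t5_z3_config.json
ds_flan_t5_z3_config_bf16.json
ds_flan_t5_z3_offload.json
ds_flan_t5_z3_offload_bf16.json

Choose the one that fits your environment. For example, if you run on NVIDIA V100s, you cannot use the bf16 configs, because the V100 does not support the bfloat16 data type.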
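One way to make this choice explicit is a small helper that maps the GPU generation to a config file. The sketch below assumes that bfloat16 needs an NVIDIA GPU with compute capability 8.0 or higher (Ampere or newer; the V100 is 7.0, the A100 8.0, the A10G 8.6). pick_config is a hypothetical helper; in a real setup the capability could be read with torch.cuda.get_device_capability().

```python
def pick_config(compute_capability_major: int, offload: bool) -> str:
    """Return one of the four config file names used in this post."""
    # Assumption: bf16 requires compute capability >= 8.0 (Ampere or newer).
    use_bf16 = compute_capability_major >= 8
    name = "ds_flan_t5_z3_offload" if offload else "ds_flan_t5_z3_config"
    if use_bf16:
        name += "_bf16"
    return name + ".json"

print(pick_config(7, offload=True))    # V100: ds_flan_t5_z3_offload.json
print(pick_config(8, offload=False))   # A100: ds_flan_t5_z3_config_bf16.json
```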

❝ When fine-tuning T5 models, we cannot use fp16, because it leads to overflow issues; see issues #4586 and #10830 and pull request #10956.
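The overflow problem comes from fp16's narrow exponent range: its largest finite value is 65504, whereas bfloat16 keeps the 8-bit exponent of fp32 and gives up mantissa bits instead. The sketch below illustrates this with the standard library only. It emulates bfloat16 by truncating an fp32 bit pattern to its top 16 bits, which is a simplification (real hardware typically rounds rather than truncates).

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; raises OverflowError if too large."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the fp32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

value = 70000.0  # larger than the fp16 maximum of 65504

try:
    to_fp16(value)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True

print(fp16_overflowed)   # True
print(to_bf16(value))    # 69632.0
```

The bfloat16 result is less precise but stays finite, which is why bf16 training works for T5 where fp16 does not.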

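As mentioned at the beginning, we use a p4d.24xlarge AWS EC2 instance, which has 8 NVIDIA A100 GPUs with 40GB of memory each. This means we can use bf16, which roughly halves the model's memory footprint and lets us train efficiently without offloading.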
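A back-of-the-envelope calculation shows where the saving comes from. The sketch assumes 11 billion parameters for FLAN-T5 XXL, 4 bytes per fp32 value and 2 bytes per bf16 value, and counts weights only; gradients, optimizer states and activations add to this and are ignored here.

```python
GIB = 1024 ** 3

def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / GIB

xxl_params = 11e9  # FLAN-T5 XXL, as stated in the introduction

fp32_gib = weight_memory_gib(xxl_params, 4)
bf16_gib = weight_memory_gib(xxl_params, 2)

print(f"fp32 weights: {fp32_gib:.1f} GiB")   # about 41.0 GiB
print(f"bf16 weights: {bf16_gib:.1f} GiB")   # about 20.5 GiB
# With ZeRO stage 3 the weights are partitioned across the GPUs.
print(f"bf16 weights per GPU on 8 GPUs: {bf16_gib / 8:.1f} GiB")
```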

We will use ds_flan_t5_z3_config_bf16.json. If you don't want to use the auto values, check the documentation.

{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
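The contents of the other three config files are not shown in this post. As a rough sketch of how such variants might differ, the snippet below derives them from one base dictionary: the bf16 variants carry a "bf16" block, and the offload variants add "offload_optimizer" and "offload_param" entries targeting the CPU, which are the DeepSpeed ZeRO options for CPU offloading. make_config and the exact values are assumptions, not the author's files, and the base dictionary is abbreviated.

```python
import copy
import json

# Abbreviated base config; the full file shown in this post has more keys.
BASE_CONFIG = {
    "zero_optimization": {"stage": 3, "overlap_comm": True, "contiguous_gradients": True},
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

def make_config(offload: bool, bf16: bool) -> dict:
    """Build a hypothetical ZeRO-3 config variant."""
    config = copy.deepcopy(BASE_CONFIG)
    if bf16:
        config["bf16"] = {"enabled": "auto"}
    if offload:
        zero = config["zero_optimization"]
        # Assumption: offload both optimizer states and parameters to CPU memory.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return config

print(json.dumps(make_config(offload=True, bf16=True), indent=2))
```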

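Now it is the training script's turn. Based on Fine Tune FLAN-T5, we prepared a training script, run_seq2seq_deepspeed.py, which lets us configure deepspeed and other hyperparameters, including the model id (google/flan-t5-xxl).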
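The script itself is not reproduced here. As an illustration of what its command-line interface might look like, the sketch below parses the flags that appear in the launch command in this post. The types and defaults are assumptions, and build_parser is a hypothetical name.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI covering the flags passed to run_seq2seq_deepspeed.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="google/flan-t5-xl")
    parser.add_argument("--dataset_path", type=str, default="data")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=8)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=8)
    parser.add_argument("--generation_max_length", type=int, default=129)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--deepspeed", type=str, default=None)
    # The deepspeed launcher also passes --local_rank to each worker process.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser

# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--model_id", "google/flan-t5-xxl",
    "--deepspeed", "configs/ds_flan_t5_z3_config_bf16.json",
])
print(args.model_id, args.epochs, args.lr)
```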

We launch training with the deepspeed launcher, passing the number of GPUs, the deepspeed config, and the other hyperparameters such as the model id.

!deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py \
    --model_id $model_id \
    --dataset_path $save_dataset_path \
    --epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --generation_max_length $max_target_length \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_config_bf16.json

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py --model_id google/flan-t5-xxl --dataset_path data --epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --generation_max_length 129 --lr 1e-4 --deepspeed configs/ds_flan_t5_z3_config_bf16.json

DeepSpeed first loads the model onto the CPU, then splits it across the 8 A100 GPUs and starts training. Training on the CNN Dailymail dataset takes roughly 10 hours and costs about $322.
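To put these numbers in perspective, the sketch below works out the global batch size, the number of optimizer steps, and the hourly rate implied by the reported cost. It assumes gradient accumulation of 1 (the config leaves it on auto) and takes the sample count, batch size, epochs, duration and cost from this post.

```python
import math

train_samples = 287113        # size of the training split
num_gpus = 8
per_device_batch_size = 8
gradient_accumulation = 1     # assumption: not stated in the post
epochs = 3

global_batch_size = num_gpus * per_device_batch_size * gradient_accumulation
steps_per_epoch = math.ceil(train_samples / global_batch_size)
total_steps = steps_per_epoch * epochs

duration_hours = 10
total_cost_usd = 322
implied_hourly_rate = total_cost_usd / duration_hours

print(global_batch_size)     # 64
print(steps_per_epoch)       # 4487
print(total_steps)           # 13461
print(implied_hourly_rate)   # 32.2
```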

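Results and Experiments

To get a better picture of the hardware requirements and the cost of training these models, we ran a series of experiments with FLAN-T5 XL and XXL.

The table below lists the experiments and their settings in detail.

Dataset: "cnn_dailymail"

Number of training examples: 287113

Number of validation examples: 13368

Hyperparameters:

epochs: 3

learning rate: 1e-4

Hardware setups:

4x V100 16GB: p3.8xlarge

4x A10G 24GB: g5.24xlarge

8x V100 16GB: p3.16xlarge

8x A100 40GB: p4d.24xlarge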

| Model | DeepSpeed offload | Hardware | Batch size per GPU | Precision | Duration | Cost |
|---|---|---|---|---|---|---|
| FLAN-T5-XL (3B) | No | 4x V100 16GB | OOM | fp32 | - | - |
| FLAN-T5-XL (3B) | No | 8x V100 16GB | 1 | fp32 | 105h | ~$2570 |
| FLAN-T5-XL (3B) | No | 8x A100 40GB | 72 | bf16 | 2.5h | ~$81 |
| FLAN-T5-XL (3B) | Yes | 4x V100 16GB | 8 | fp32 | 69h | ~$828 |
| FLAN-T5-XL (3B) | Yes | 8x V100 16GB | 8 | fp32 | 32h | ~$768 |
| FLAN-T5-XXL (11B) | No | 8x A100 40GB | 8 | bf16 | 10h | ~$322 |
| FLAN-T5-XXL (11B) | Yes | 4x V100 16GB | OOM | fp32 | - | - |
| FLAN-T5-XXL (11B) | Yes | 8x V100 16GB | OOM | fp32 | - | - |
| FLAN-T5-XXL (11B) | Yes | 4x A10G 24GB | 24 | bf16 | 90h | ~$732 |
| FLAN-T5-XXL (11B) | Yes | 8x A100 40GB | 48 | bf16 | 19h | ~$613 |

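We can see that bf16 has a significant advantage over fp32: FLAN-T5-XXL fits on 4x A10G (24GB) but not on 8x V100 16GB.

Our experiments also show that when the model fits on the GPUs without offloading and with a batch size greater than 4, training is about 2x faster and more cost-effective than offloading the model and scaling up the batch size.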
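The roughly 2x figure can be checked against the two FLAN-T5-XXL runs on 8x A100 40GB in the table. The sketch below uses only the durations and costs reported there.

```python
# FLAN-T5-XXL on 8x A100 40GB, values taken from the results table.
runs = {
    "no_offload": {"batch_size": 8, "hours": 10, "cost_usd": 322},
    "offload": {"batch_size": 48, "hours": 19, "cost_usd": 613},
}

speedup = runs["offload"]["hours"] / runs["no_offload"]["hours"]
cost_ratio = runs["offload"]["cost_usd"] / runs["no_offload"]["cost_usd"]

print(f"speedup without offloading: {speedup:.1f}x")          # 1.9x
print(f"cost ratio offload / no offload: {cost_ratio:.1f}x")  # 1.9x
```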

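Original English post: https://www.philschmid.de/fine-tune-flan-t5-deepspeed
Original author: Philipp Schmid
Translator: Matrix Yao (姚伟峰), deep learning engineer at Intel, working on applying transformer-family models to data of various modalities and on training and inference of large-scale models.
Proofreading and typesetting: zhongdongy (阿东)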

