扩展说明: 指令微调 Llama 2

这篇博客是一篇来自 Meta AI，关于指令微调 Llama 2 的扩展说明。旨在聚焦构建指令数据集，有了它，我们则可以使用自己的指令来微调 Llama 2 基础模型。

目标是构建一个能够基于输入内容来生成指令的模型。这么做背后的逻辑是，模型如此就可以由其他人生成自己的指令数据集。这在当想开发私人个性化定制模型，如发送推特、写邮件等，时很方便。这也意味着你可以通过你的邮件来生成一个指令数据集，然后用它来训练一个模型来为你写邮件。

好，那我们来开始吧？我们将进行:

定义应用场景细节并创建指令的提示词模板
构建指令数据集
使用 trl 与 SFTTrainer 指令微调 Llama 2
测试模型、进行推理

1. 定义应用场景细节并创建指令的提示词模板

在描述应用场景前，我们要更好的理解一下究竟什么是指令。

指令是一段文本或提供给大语言模型，类似 Llama，GPT-4 或 Claude，使用的提示词，用来指导它去生成回复。指令可以让人们做到把控对话，约束模型输出更自然、实用的输出，并使这些结果能够对齐用户的目的。制作清晰的、整洁的指令则是生成高质量对话的关键。

指令的例子如下表所示。

能力示例指令头脑风暴提供一系列新口味的冰淇淋的创意。分类根据剧情概要，将这些电影归类为喜剧、戏剧或恐怖片。确定性问答用一个单词回答“法国的首都是哪里？”生成用罗伯特·弗罗斯特的风格写一首关于大自然和季节变化的诗。信息提取从这篇短文中提取主要人物的名字。开放性问答为什么树叶在秋天会变色？用科学的理由解释一下。摘要用 2-3 句话概括一下这篇关于可再生能源最新进展的文章。

如开头所述，我们想要微调模型，以便根据输入 (或输出) 生成指令。我们希望将其用作创建合成数据集的方法，以赋予 LLM 和代理个性化能力。

把这个想法转换成一个基础的提示模板，按照 Alpaca 格式.

### Instruction: Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
Dear [boss name],

I‘m writing to request next week, August 1st through August 4th,
off as paid time off.

I have some personal matters to attend to that week that require
me to be out of the office. I wanted to give you as much advance
notice as possible so you can plan accordingly while I am away.

Please let me know if you need any additional information from me
or have any concerns with me taking next week off. I appreciate you
considering this request.

Thank you, [Your name]

### Response:
Write an email to my boss that I need next week 08/01 – 08/04 off.

2. 创建指令数据集

在定义了我们的应用场景和提示模板后，我们需要创建自己的指令数据集。创建高质量的指令数据集是获得良好模型性能的关键。研究表明，“对齐，越少越好” 表明，创建高质量、低数量 (大约 1000 个样本) 的数据集可以达到与低质量、高数量的数据集相同的性能。

创建指令数据集有几种方法，包括:

使用现有数据集并将其转换为指令数据集，例如 FLAN
使用现有的 LLM 创建合成指令数据集，例如 Alpaca
人力创建指令数据集，例如 Dolly。

每种方法都有其优缺点，这取决于预算、时间和质量要求。例如，使用现有数据集是最简单的，但可能不适合您的特定用例，而使用人力可能是最准确的，但必然耗时、昂贵。也可以结合几种不同方法来创建指令数据集，如 Orca: Progressive Learning from Complex Explanation Traces of GPT-4.。

为了简单起见，我们将使用 **Dolly**，这是一个开源的指令跟踪记录数据集，由数千名 Databricks 员工在 InstructGPT paper 中描述的几个行为类别中生成，包括头脑风暴、分类、确定性回答、生成、信息提取、开放性回答和摘要。

开始编程吧，首先，我们来安装依赖项。

!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" --upgrade

我们使用 🤗 Datasets library 的 load_dataset() 方法加载 databricks/databricks-dolly-15k 数据集。

from datasets import load_dataset from random import randrange

# 从hub加载数据集
dataset = load_dataset(“databricks/databricks-dolly-15k”, split=“train”)

print(f”dataset size: {len(dataset)}“)
print(dataset[randrange(len(dataset))])
# dataset size: 15011

为了指导我们的模型，我们需要将我们的结构化示例转换为通过指令描述的任务集合。我们定义一个 formatting_function ，它接受一个样本并返回一个符合格式指令的字符串。

def format_instruction(sample): return f"""### Instruction: Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{sample[‘response’]}

### Response:
{sample[‘instruction’]}
“””

我们来在一个随机的例子上测试一下我们的结构化函数。

from random import randrange

print(format_instruction(dataset[randrange(len(dataset))]))

3. 使用 `trl` 和`SFTTrainer` 指令微调 Llama 2

我们将使用最近在由 Tim Dettmers 等人的发表的论文“QLoRA: Quantization-aware Low-Rank Adapter Tuning for Language Generation”中介绍的方法。QLoRA 是一种新的技术，用于在微调期间减少大型语言模型的内存占用，且并不会降低性能。QLoRA 的 TL;DR; 是这样工作的:

将预训练模型量化为 4bit 位并冻结它。
附加轻量化的、可训练的适配器层。(LoRA)
在使用冻结的量化模型基于文本内容进行微调时，仅微调适配器层参数。

如果您想了解有关 QLoRA 及其工作原理的更多信息，我建议您阅读 Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA 博客文章。

Flash Attention (快速注意力)

Flash Attention 是一种经过重新排序的注意力计算方法，它利用经典技术 (排列、重计算) 来显著加快速度，将序列长度的内存使用量从二次降低到线性。它基于论文“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”。

TL;DR; 将训练加速了 3 倍。在这儿获得更多信息 FlashAttention。Flash Attention 目前仅支持 Ampere (A10, A40, A100, …) & Hopper (H100, …) GPU。你可以检查一下你的 GPU 是否支持，并用下面的命令来安装它:

注意: 如果您的机器的内存小于 96GB，而 CPU 核心数足够多，请减少 MAX_JOBS 的数量。在我们使用的 g5.2xlarge 上，我们使用了 4 。

python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'" pip install ninja packaging MAX_JOBS=4 pip install flash-attn --no-build-isolation

_安装 flash attention 是会需要一些时间 (10-45 分钟)_。

该示例支持对所有 Llama 检查点使用 Flash Attention，但默认是未启用的。要开启 Flash Attention，请取消代码块中这段的注释， # COMMENT IN TO USE FLASH ATTENTION 。

import torch from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

use_flash_attention = False

# COMMENT IN TO USE FLASH ATTENTION
# replace attention with flash attention
# if torch.cuda.get_device_capability()[0] >= 8:
#     from utils.llama_patch import replace_attn_with_flash_attn
#     print(“Using flash attention”)
#     replace_attn_with_flash_attn()
#     use_flash_attention = True

# Hugging Face 模型id
model_id = “NousResearch/Llama-2-7b-hf” # non-gated
# model_id = “meta-llama/Llama-2-7b-hf” # gated

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type=“nf4”,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 加载模型与分词器
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map=“auto”)
model.config.pretraining_tp = 1

# 通过对比doc中的字符串，验证模型是在使用flash attention
if use_flash_attention:
from utils.llama_patch import forward
assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, “Model is not using flash attention”

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = “right”

SFTTrainer 支持与 peft 的本地集成，这使得高效地指令微调LLM变得非常容易。我们只需要创建 LoRAConfig 并将其提供给训练器。

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# 基于 QLoRA 论文来配置 LoRA
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias=“none”,
        task_type=“CAUSAL_LM”,
)

# 为训练准备好模型
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

在开始训练之前，我们需要定义自己想要的超参数 (TrainingArguments)。

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir=“llama-7-int4-dolly”,
    num_train_epochs=3,
    per_device_train_batch_size=6 if use_flash_attention else 4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim=“paged_adamw_32bit”,
    logging_steps=10,
    save_strategy=“epoch”,
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type=“constant”,
    disable_tqdm=True # 当配置的参数都正确后可以关闭tqdm
)

我们现在有了用来训练模型 SFTTrainer 所需要准备的每一个模块。

from trl import SFTTrainer

max_seq_length = 2048 # 数据集的最大长度序列

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,
    args=args,
)

通过调用 Trainer 实例上的 train() 方法来训练我们的模型。

# 训练 trainer.train() # tqdm关闭后将不显示进度条信息

# 保存模型
trainer.save_model()

不使用 Flash Attention 的训练过程在 g5.2xlarge 上花费了 03:08:00。实例的成本为 1,212$/h ，总成本为 3.7$ 。

使用 Flash Attention 的训练过程在 g5.2xlarge 上花费了 02:08:00。实例的成本为 1,212$/h ，总成本为 2.6$ 。

使用 Flash Attention 的结果令人满意，速度提高了 1.5 倍，成本降低了 30%。

4. 测试模型、进行推理

在训练完成后，我们想要运行和测试模型。我们会使用 peft 和 transformers 将 LoRA 适配器加载到模型中。

if use_flash_attention: # 停止 flash attention from utils.llama_patch import unplace_flash_attn_with_attn unplace_flash_attn_with_attn() import torch from peft import AutoPeftModelForCausalLM from transformers import AutoTokenizer

args.output_dir = “llama-7-int4-dolly”

# 加载基础LLM模型与分词器
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)

我们来再次用随机样本加载一次数据集，试着来生成一条指令。

from datasets import load_dataset from random import randrange

# 从hub加载数据集并得到一个样本
dataset = load_dataset(“databricks/databricks-dolly-15k”, split=“train”)
sample = dataset[randrange(len(dataset))]

prompt = f”””### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{sample[‘response’]}

### Response:
“””

input_ids = tokenizer(prompt, return_tensors=“pt”, truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)

print(f”Prompt:n{sample[‘response’]}n”)
print(f”Generated instruction:n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}“)
print(f”Ground truth:n{sample[‘instruction’]}“)

太好了！我们的模型可以工作了！如果想要加速我们的模型，我们可以使用 Text Generation Inference 部署它。因此我们需要将我们适配器的参数合并到基础模型中去。

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
args.output_dir,
low_cpu_mem_usage=True,
)

# 合并 LoRA 与 base model
merged_model = model.merge_and_unload()

# 保存合并后的模型
merged_model.save_pretrained(“merged_model”,safe_serialization=True)
tokenizer.save_pretrained(“merged_model”)

# push合并的模型到hub上
# merged_model.push_to_hub(“user/repo”)
# tokenizer.push_to_hub(“user/repo”)

原文作者: Philschmid

原文链接: https://www.philschmid.de/instruction-tune-llama-2

译者: Xu Haoran

2024 年 2 月
一	二	三	四	五	六	日
	1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29

ufabet มีเกมให้เลือกเล่นมากมาย: เกมเดิมพันหลากหลาย ครบทุกค่ายดัง

tornado crypto mixer Discover the power of privacy with TornadoCash! Learn how this decentralized mixer ensures your transactions remain confidential.

ดูบอลสด Very well presented. Every quote was awesome and thanks for sharing the content. Keep sharing and keep motivating others.

ดูบอลสด Pretty! This has been a really wonderful post. Many thanks for providing these details.

ดูบอลสด Hi there to all, for the reason that I am genuinely keen of reading this website’s post to be updated on a regular basis. It carries pleasant stuff.

Obrazy Sztuka Nowoczesna Thank you for this wonderful contribution to the topic. Your ability to explain complex ideas simply is admirable.

ufabet Hi there to all, for the reason that I am genuinely keen of reading this website’s post to be updated on a regular basis. It carries pleasant stuff.

ufabet You’re so awesome! I don’t believe I have read a single thing like that before. So great to find someone with some original thoughts on this topic. Really.. thank you for starting this up. This website is something that is needed on the internet, someone with a little originality!

ufabet Very well presented. Every quote was awesome and thanks for sharing the content. Keep sharing and keep motivating others.

扩展说明: 指令微调 Llama 2

1. 定义应用场景细节并创建指令的提示词模板

2. 创建指令数据集

3. 使用 `trl` 和`SFTTrainer` 指令微调 Llama 2

Flash Attention (快速注意力)

4. 测试模型、进行推理

小说创作

清库存！DeepSeek突然补全R1技术报告，训练路径首次详细公开

训具身模型遇到的很多问题，在数据采集时就已经注定了丨鹿明联席CTO丁琰分享

「北京版幻方」冷不丁开源SOTA代码大模型！一张3090就能跑，40B参数掀翻Opus-4.5和GPT-5.2

开源“裸考”真实世界，国产具身智能基座模型拿下全球第二！

让欧美老外彻底“真香”，这家中国割草机器人品牌正在定义一个行业新标准

英特尔CES奇袭老黄大本营！英伟达显卡刚涨价，最强酷睿量产出货

海信CES发布全新一代RGB-Mini LED，全球首创玲珑4芯真彩背光

港科大教授实测AI眼镜“作弊”：30分钟碾压95%的学生，把传统教学评估体系整破防了

1956-2026：人类与机器智能的七十年对话

文心AIGC

小说创作

清库存！DeepSeek突然补全R1技术报告，训练路径首次详细公开

训具身模型遇到的很多问题，在数据采集时就已经注定了丨鹿明联席CTO丁琰分享

「北京版幻方」冷不丁开源SOTA代码大模型！一张3090就能跑，40B参数掀翻Opus-4.5和GPT-5.2

开源“裸考”真实世界，国产具身智能基座模型拿下全球第二！

让欧美老外彻底“真香”，这家中国割草机器人品牌正在定义一个行业新标准

英特尔CES奇袭老黄大本营！英伟达显卡刚涨价，最强酷睿量产出货

海信CES发布全新一代RGB-Mini LED，全球首创玲珑4芯真彩背光

港科大教授实测AI眼镜“作弊”：30分钟碾压95%的学生，把传统教学评估体系整破防了

1956-2026：人类与机器智能的七十年对话

扩展说明: 指令微调 Llama 2

1. 定义应用场景细节并创建指令的提示词模板

2. 创建指令数据集

3. 使用 trl 和SFTTrainer 指令微调 Llama 2

Flash Attention (快速注意力)

4. 测试模型、进行推理

文心AIGC

3. 使用 `trl` 和`SFTTrainer` 指令微调 Llama 2