
While following https://www.datacamp.com/tutorial/fine-tuning-llama-2 to fine-tune Llama 2, I found that the code from that article would not run on an Apple M3 Max. After some experimentation I got it running; the process is recorded below.

Related code: https://github.com/singleye/Llama2-finetune

1. Preparation

pip install accelerate peft bitsandbytes transformers==4.38.1 trl

Notes:

* transformers must not be version 4.38.2, otherwise you will hit the following error on M3:

  RuntimeError: User specified an unsupported autocast device_type 'mps'

* bitsandbytes cannot be used on M3
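A quick sanity check of the environment before going any further (a minimal sketch; it only prints the installed transformers version and whether PyTorch's MPS backend is usable):

import torch
import transformers

print(transformers.__version__)            # should print 4.38.1
print(torch.backends.mps.is_available())   # True on Apple Silicon with a recent PyTorch

With that in place, these are the imports used throughout the rest of the post: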
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

2. Model configuration

Downloading models directly from the HuggingFace website is very slow from within China, so a mirror site can be used instead.

Set the HF_ENDPOINT environment variable:

export HF_ENDPOINT=https://hf-mirror.com
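The same endpoint can also be set from Python, as long as it happens before any Hugging Face module is imported (a minimal sketch, equivalent to the export above):

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must be set before importing huggingface_hub / transformers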

Download the model:

huggingface-cli download --resume-download NousResearch/Llama-2-7b-chat-hf --local-dir Llama-2-7b-chat-hf

Download the dataset:

huggingface-cli download --repo-type dataset --resume-download mlabonne/guanaco-llama2-1k --local-dir guanaco-llama2-1k
Point the code at the local copies (os.path.expanduser is used so that the ~ in the path resolves correctly):

base_dir = os.path.expanduser('~/Llama2-finetuning')

# Model from local directory
base_model = base_dir + "/Llama-2-7b-chat-hf"

# Dataset from local directory
guanaco_dataset = base_dir + "/guanaco-llama2-1k"

# Fine-tuned model
new_model = "llama-2-7b-chat-guanaco"

3. Load the dataset

dataset = load_dataset(guanaco_dataset, split="train")
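guanaco-llama2-1k is a small dataset (1,000 rows) whose single text column already contains prompts in the Llama 2 [INST] chat format; this is the column referenced later by dataset_text_field="text". A quick way to inspect it:

print(dataset)                    # features: ['text'], num_rows: 1000
print(dataset[0]["text"][:200])   # beginning of the first training example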

4. QLoRA 4-bit quantization configuration (skip on M3)

Paper: “QLoRA: Efficient Finetuning of Quantized LLMs”

compute_dtype = getattr(torch, "float16")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

5. Load the model

Note: because BitsAndBytesConfig cannot be used on Apple Silicon (M3), the code below detects the platform and handles each case accordingly. Since quantization is not available there, fine-tuning on Apple Silicon (M3) needs considerably more memory; in this example roughly 75 GB was used.
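It is worth confirming that the machine actually has that much unified memory before starting. A minimal check, assuming psutil is installed (it is not one of the packages installed above; pip install psutil):

import psutil

mem = psutil.virtual_memory()
print(f"total: {mem.total / 1024**3:.0f} GB, available: {mem.available / 1024**3:.0f} GB")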

if torch.backends.mps.is_available():
    # Apple Silicon: load the full-precision weights (no bitsandbytes quantization)
    print("Using 'mps' (Apple Silicon)")
    active_device = torch.device('mps')
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=base_model,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        device_map=active_device
    )
elif torch.cuda.is_available():
    # NVIDIA GPU: load with the 4-bit QLoRA quantization config from above
    print("Using GPU")
    active_device = torch.device('cuda')
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=quant_config,
        device_map=active_device
    )
else:
    # CPU fallback
    print("Using CPU")
    active_device = torch.device('cpu')
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=quant_config,
        device_map=active_device
    )

model.config.use_cache = False
model.config.pretraining_tp = 1
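Once the model is loaded, a quick check of where the weights ended up and in which dtype confirms which branch was taken:

print(model.dtype)                          # weight dtype actually used
print(next(model.parameters()).device)      # should report mps / cuda / cpu
print(f"{model.num_parameters() / 1e9:.2f}B parameters")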

6. Load the model's tokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

7. Configure the PEFT parameters

peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
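For a sense of scale: with r=64 the LoRA adapter is tiny compared to the 7B base model. The arithmetic below assumes peft's default target modules for Llama (the q_proj and v_proj attention projections; the tutorial does not set target_modules explicitly, so this is an assumption) and the Llama-2-7b shape of 32 layers with hidden size 4096:

# Rough LoRA parameter count (assumed defaults: q_proj and v_proj only)
hidden_size = 4096
num_layers = 32
r = 64
targets_per_layer = 2                                                 # q_proj, v_proj (assumed)
lora_params = num_layers * targets_per_layer * 2 * r * hidden_size   # A (r x d) plus B (d x r) per target
print(f"~{lora_params / 1e6:.0f}M trainable LoRA parameters")        # ~34M, well under 1% of 7B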

8. Configure the training arguments

training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    weight_decay=0.001,
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    max_steps=-1,
    save_steps=25,
    logging_steps=25,
    logging_dir="./logs",
    group_by_length=True,
    fp16=False,
    report_to="tensorboard",
    adam_beta2=0.999,
    do_train=True
)
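With the 1,000-row dataset, a per-device batch size of 4 and no gradient accumulation, one epoch is 250 optimizer steps, so save_steps=25 and logging_steps=25 produce a checkpoint and a log entry every tenth of the epoch. Because report_to="tensorboard" and logging_dir="./logs", training progress can be followed with tensorboard --logdir ./logs.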

9. Fine-tune the model

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)
trainer.train()

10. Save the fine-tuned model

trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)
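Note that this saves only the LoRA adapter weights and the tokenizer, not a full standalone checkpoint. If a merged model is needed, the usual peft route is to reload the base model, attach the adapter and fold it in with merge_and_unload (a sketch; it assumes enough memory to hold the full-precision base model):

from peft import PeftModel

# Reload the base weights in full precision (quantized weights cannot be merged directly)
base = AutoModelForCausalLM.from_pretrained(base_model, low_cpu_mem_usage=True)
merged = PeftModel.from_pretrained(base, new_model)   # attach the saved adapter
merged = merged.merge_and_unload()                    # fold the LoRA deltas into the base weights
merged.save_pretrained(new_model + "-merged")
tokenizer.save_pretrained(new_model + "-merged")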

11. Use the model

logging.set_verbosity(logging.CRITICAL)

prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])