Large Models Are in an Arms Race Too: A Vicuna Training and Inference Guide, with Results That Outperform Stanford Alpaca

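Since the start of 2023, large models have been caught up in a frantic arms race, with new releases iterating on a scale of days.

An earlier article, "Reproducing Stanford Alpaca 7B from 0 to 1", covered Stanford Alpaca. This article reproduces Vicuna training and inference from scratch.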

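Vicuna Overview

Following Stanford Alpaca, researchers from UC Berkeley, CMU, Stanford, and other institutions jointly released Vicuna, a new open-source large model available with 7B and 13B parameters. The 13B model cost only about $300 to train and reaches more than 90% of ChatGPT's quality. The preliminary evaluation is summarized in the figure: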

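Vicuna Workflow

Vicuna's workflow is shown in the figure below. First, the researchers collected about 70K conversations from ShareGPT.com (a website where users share their ChatGPT conversations) and enhanced the training scripts provided by Alpaca to better handle multi-turn conversations and long sequences. Training was a full fine-tune with PyTorch FSDP on 8 A100 GPUs and finished within one day. To serve the demo, the researchers built a lightweight distributed serving system. For a preliminary quality assessment, they created 80 diverse questions in eight categories (such as role-play and coding/math tasks) and used GPT-4 to judge the model outputs. To compare two models, they combined the outputs of both models into a single prompt per question, sent that prompt to GPT-4, and had GPT-4 assess which model gave the better response.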
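The pairwise comparison step can be sketched as follows. This is a minimal illustration and not the Vicuna team's actual evaluation prompt: the template wording and the function name `build_judge_prompt` are assumptions made here for illustration.

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Combine two models' answers to one question into a single judge prompt.

    The template text is hypothetical; the real evaluation prompt may differ.
    """
    return (
        f"[Question]\n{question}\n\n"
        f"[Assistant 1]\n{answer_a}\n\n"
        f"[Assistant 2]\n{answer_b}\n\n"
        "Compare the two answers above and state which assistant gave the "
        "better response."
    )


prompt = build_judge_prompt(
    "Explain what fine-tuning is.",
    "Fine-tuning continues training a pretrained model on new data.",
    "It is a kind of training.",
)
print(prompt)
```

The resulting string would then be sent to the judge model once per question, so each of the 80 questions yields one comparison.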

A detailed comparison of LLaMA, Alpaca, Vicuna, and ChatGPT follows:

Model	LLaMA	Alpaca	Vicuna	Bard/ChatGPT
Dataset	Publicly available datasets (1T tokens)	Self-instruct from davinci-003 API (52K samples)	User-shared conversations (70K samples)	N/A
Training code	N/A	Available	Available	N/A
Evaluation metrics	Academic benchmark	Author evaluation	GPT-4 assessment	Mixed
Training cost (7B)	82K GPU-hours	$500 (data) + $100 (training)	$140 (training)	N/A
Training cost (13B)	135K GPU-hours	N/A	$300 (training)	N/A

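Vicuna Limitations

The researchers point out that, like other large language models, Vicuna has certain limitations.

For example, it performs poorly on tasks involving programming, reasoning, mathematics, and factual accuracy.

It has also not been sufficiently optimized to guarantee safety or to mitigate potential toxicity or bias.

To address the safety concerns, the researchers use OpenAI's moderation API in their demo to filter out inappropriate user inputs.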

The base environment is configured as follows:

OS: Ubuntu 18.04

CPUs: a single node with 256GB of memory and two physical Intel CPUs, 20 cores each

GPUs: 2 x A800 80GB

Python: 3.10 (first upgrade OpenSSL to 1.1.1t, then compile and install Python)

NVIDIA driver version: 525.105.17 (choose the driver that matches your GPU model)

CUDA toolkit: 11.7

NCCL: nccl_2.12.12-1+cuda11.7_x86_64

cuDNN: 8.8.1.3_cuda11

The system's GPUDirect communication matrix is as follows:

> nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV8     20-39,60-79     1
GPU1    NV8      X      20-39,60-79     1
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Step 1: install the NVIDIA GPU driver.
wget -c https://us.download.nvidia.com/tesla/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run
sh NVIDIA-Linux-x86_64-525.105.17.run
Step 2: pull the PyTorch image that matches the CUDA/cuDNN versions.
docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
Step 3: once the image is downloaded, create a container for the subsequent model training and inference.
docker run -dt --name vicuna_cu120 --restart=always --gpus all --network=host \
-v /home/gdong/code:/code \
-v /home/gdong/model:/model \
-v /home/gdong/output:/output \
-w /code \
pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel \
/bin/bash
Step 4: enter the Docker container.
docker exec -it vicuna_cu120 bash
Step 5: install fschat. Method 1, install with pip:
# 0.2.3
pip3 install fschat
Method 2, install from source:
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install --upgrade pip  # enable PEP 660 support
pip3 install -e .
Step 6: install FlashAttention and tensorboardX, which are used later during model training.
pip install flash-attn
pip install tensorboardX
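Vicuna Model Weight Conversion
LLaMA Model Format Conversion
Follow the conversion instructions to convert the original LLaMA weight files into the model format used by the Transformers library. For details, see the earlier article "Reproducing Stanford Alpaca 7B from 0 to 1".
Note: if you prefer not to convert, you can download already-converted models directly from Hugging Face, either decapoda-research/llama-7b-hf or yahma/llama-7b-hf (the recommended choice for transformers>=4.28.0), with the following commands: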
git lfs clone https://huggingface.co/decapoda-research/llama-7b-hf
git lfs clone https://huggingface.co/yahma/llama-7b-hf
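Vicuna Model Weight Merging
Vicuna releases only delta weights, to comply with the LLaMA model license. To obtain the full Vicuna weights, we therefore need to add the delta to the original LLaMA weights.
Download the Vicuna delta weights: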
git lfs clone https://huggingface.co/lmsys/vicuna-7b-delta-v1.1
Merge the Vicuna model weights:
python3 -m fastchat.model.apply_delta \
    --base /model/llama-7b-hf \
    --delta /model/vicuna-7b-delta-v1.1 \
    --target /model/vicuna-7b-all-v1.1 
Run log:
Loading the base model from /model/llama-7b-hf
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.69s/it]
Loading the delta from /model/vicuna-7b-delta-v1.1
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.12s/it]
Applying the delta
Applying delta: 100%|███████████████████████████████████████████████████████████████████████████████| 323/323 [00:01<00:00, 190.20it/s]
Saving the target model to /model/vicuna-7b-all-v1.1
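Conceptually, the "Applying the delta" step adds each delta tensor to the matching base tensor. The following is a minimal sketch of that idea, assuming plain element-wise addition and using Python lists as a stand-in for tensors. The function `apply_delta` below is illustrative and is not FastChat's implementation; the parameter names and values are made up.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Return target weights where target[name] = base[name] + delta[name]."""
    target: Weights = {}
    for name, delta_values in delta.items():
        base_values = base[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name}")
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Toy parameters standing in for real tensors (hypothetical names and values).
base = {"layer.weight": [1.0, 2.0, 3.0]}
delta = {"layer.weight": [0.5, -1.0, 0.25]}
merged = apply_delta(base, delta)
print(merged)
```

Because only the difference is published, the delta alone is not usable without access to the original LLaMA weights.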
The converted model weights:
> ls -al --block-size=M
total 12854M
drwxrwxr-x 2 liguodong liguodong    1M 4月  19 23:10 .
drwxrwxrwx 7 ps        ps           1M 4月  19 23:10 ..
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 config.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 generation_config.json
-rw-rw-r-- 1 liguodong liguodong 9515M 4月  19 23:10 pytorch_model-00001-of-00002.bin
-rw-rw-r-- 1 liguodong liguodong 3339M 4月  19 23:10 pytorch_model-00002-of-00002.bin
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 pytorch_model.bin.index.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 special_tokens_map.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 tokenizer_config.json
-rw-rw-r-- 1 liguodong liguodong    1M 4月  19 23:10 tokenizer.model
Merging the Vicuna-13B weights works the same way; see the following command:
python3 -m fastchat.model.apply_delta \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta lmsys/vicuna-13b-delta-v1.1
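If you do not have enough CPU memory, you can try the following approaches to reduce the CPU memory required for the weight conversion.
Option 1: append --low-cpu-mem to the command above. This splits the large weight files into smaller ones and uses the disk as temporary storage, which keeps peak memory below 16GB.
Option 2: create a large swap file and let the operating system automatically use the disk as virtual memory.
To run inference on a single GPU, use the command below. It requires about 28GB of GPU memory for Vicuna-13B and about 14GB for Vicuna-7B.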
python3 -m fastchat.serve.cli --model-path /model/vicuna-7b-all-v1.1
Output:
> python3 -m fastchat.serve.cli --model-path /dev/shm/liguodong/model/vicuna-7b-all-v1.1 
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.34s/it]
USER: 怎么让自己精力充沛，列5点建议
ASSISTANT: 具体的建议如下：
1. 制定明确的目标和计划，确保自己在工作中有目标追求，以此来增强精力充沛的效果。
2. 保持良好的睡眠和饮食习惯，避免长时间缺乏睡眠和食物过度吃导致的疲劳。
3. 保持良好的心态和积极的态度，把握好自己的情绪，不要过度焦虑和担忧，以增强自己的精力充沛。
4. 保持良好的工作环境和工作状态，避免过度工作和疲劳，以增强精力充沛的效果。
5. 增强自己的兴趣和爱好，多进行运动和娱乐活动，以增强身心健康，提高自己的精力充沛程度。
USER: 明天就假期结束了，有点抗拒上班，应该什么办？
ASSISTANT: 如果你想要做到什么办法来缓解抗拒上班的情绪，可以考虑以下几点建议：
1. 寻找自己的动机：打算上班做什么？是为了获得更多的收入、做到更多的工作，还是为了满足自己的职业梦想？不要忘记自己的目标和动机。
2. 寻找适合自己的工作：如果你不喜欢你的职业或者工作环境，那么很难感到动力和热情。可以考虑寻找一个更适合自己的工作，比如寻找更有挑战性和具有激励力的工作。
3. 建立健康的生活方式：保持健康的生活方式，如充足的睡眠、健康的饮食、保持良好的心理状态等，这些都有助于提高自己的动力和热情。
4. 了解抗拒上班的影响：抗拒上班可能会影响你的职业发展和收入稳定性。可以考虑寻找一个更适合自己的工作或者寻找其他的职业发展方式。
5. 寻求支持：可以向家人、朋友、同事、导师或其他专业人士寻求支持和帮助，分享自己的担忧，寻求建议和帮助。
USER: 父母都姓吴，取一些男宝宝和女宝宝的名字
ASSISTANT: 如果你想给自己的孩子取一些男宝宝和女宝宝的名字，可以考虑以下几个名字：
1. 吴泽宇：这是一个很好的男宝宝名字，寓意着有力、强大、有执着。
2. 吴亚明：这是一个很好的男宝宝名字，寓意着明亮、充满活力。
3. 吴雨琪：这是一个很好的女宝宝名字，寓意着温柔、美丽、有誓律。
4. 吴静瑶：这是一个很好的女宝宝名字，寓意着内在宁静、纯真、有欲望。
5. 吴晓洁：这是一个很好的女宝宝名字，寓意着美丽、纯洁、有纯真。
当然，这只是一些建议，最终选择孩子的名字还需要您自己的心理和情况考虑。
USER: 推荐几本金庸的武侠小说
ASSISTANT: 以下是几部金庸的武侠小说推荐：
1. 《射雕英雄传》，这是金庸先生最著名的武侠小说之一，讲述了苏庆、离刀四大武功的父亲苏庆，他在武林中赢得了高尚的名字。
2. 《神雕侠侣》，这是金庸先生的代表作之一，讲述了神雕和他的侠妻合作胜利的故事。
3. 《神雕侠侣》的继续部分《神雕侠侣外傳》，这是神雕和他的侠妻在外围世界胜利的故事。
4. 《天龙八部》，这是金庸先生最后的武侠小说之一，讲述了天龙八部历史传说中的人物和事件。
5. 明天就假期结束了，有点抗拒上班，应该什么办？
《射雕英雄传》的继续部分《神雕侠侣》，这是神雕和他的侠妻在武林中胜利的故事。
以上是几部金庸的著名武侠小说，如果您对针对的是特定的作品，可以告诉我那是哪一部作品，我可以进一步提供相关信息。
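Judging from the output, Chinese is supported reasonably well.
Other notes:
Experimental feature: you can pass --style rich to enable rich-text output and better text streaming quality for some non-ASCII content. This may not work properly on certain terminals.
You can also use model parallelism to aggregate GPU memory across multiple GPUs on the same machine.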
python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --num-gpus 2
If you have no GPU, you can run on CPU only. This requires about 60GB of CPU memory for Vicuna-13B and about 30GB for Vicuna-7B.
python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --device cpu
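If you do not have enough CPU or GPU memory, you can enable 8-bit compression by adding --load-8bit to the commands above. This roughly halves memory usage at the cost of slightly lower model quality, and it works on both CPU and GPU. With 8-bit compression, Vicuna-13B can run on a single NVIDIA 3090/4080/V100 (16GB) GPU.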
python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --load-8bit
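The memory figures quoted in this section are roughly what the weights alone occupy at a given precision. Here is a back-of-the-envelope sketch, assuming 2 bytes per parameter (fp16) on GPU, 4 bytes (fp32) on CPU, and 1 byte with 8-bit compression. These precisions are assumptions made for illustration, and real usage is higher because of activations and runtime overhead.

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bytes_per_param


for name, size_b in [("Vicuna-7B", 7), ("Vicuna-13B", 13)]:
    fp16 = weight_memory_gb(size_b, 2)  # assumed precision on GPU
    fp32 = weight_memory_gb(size_b, 4)  # assumed precision on CPU
    int8 = weight_memory_gb(size_b, 1)  # 8-bit compression
    print(f"{name}: fp16 ~{fp16:.0f} GB, fp32 ~{fp32:.0f} GB, int8 ~{int8:.0f} GB")
```

Under these assumptions the estimates land slightly below the quoted requirements (14GB and 28GB on GPU, 30GB and 60GB on CPU), and the 8-bit figure is about half of the fp16 one.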
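Vicuna was created by fine-tuning the LLaMA base model on about 70K user-shared conversations collected from ShareGPT.com through its public API. To ensure data quality, the Vicuna team converted the HTML back to markdown and filtered out inappropriate or low-quality samples. They also split lengthy conversations into smaller segments that fit the model's maximum context length. For detailed instructions on cleaning the ShareGPT data, see the FastChat repository.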
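Splitting a long conversation so that each piece fits the context length can be sketched as below. This is a simplified illustration and not the FastChat cleaning script: it counts whitespace-separated words instead of real tokenizer tokens, and the helper names `count_tokens` and `split_conversation` are hypothetical.

```python
from typing import Dict, List

Turn = Dict[str, str]


def count_tokens(text: str) -> int:
    """Whitespace token count, a crude stand-in for the model tokenizer."""
    return len(text.split())


def split_conversation(turns: List[Turn], max_tokens: int) -> List[List[Turn]]:
    """Split a conversation into chunks made of whole human/gpt rounds.

    A new chunk starts whenever adding the next round would exceed max_tokens.
    A single round longer than max_tokens is kept as its own chunk.
    """
    chunks: List[List[Turn]] = []
    current: List[Turn] = []
    current_len = 0
    # Walk the conversation two turns at a time (one human + one gpt turn).
    for i in range(0, len(turns), 2):
        round_turns = turns[i:i + 2]
        round_len = sum(count_tokens(t["value"]) for t in round_turns)
        if current and current_len + round_len > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.extend(round_turns)
        current_len += round_len
    if current:
        chunks.append(current)
    return chunks


conversation = [
    {"from": "human", "value": "one two three"},
    {"from": "gpt", "value": "four five six"},
    {"from": "human", "value": "seven eight"},
    {"from": "gpt", "value": "nine ten"},
]
parts = split_conversation(conversation, max_tokens=8)
print([len(p) for p in parts])
```

Cutting only at round boundaries keeps every chunk starting with a human turn, so each chunk remains a valid training conversation on its own.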
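Due to some concerns, the Vicuna team may not release the ShareGPT dataset for now. If you want to try the fine-tuning code, you can run it with the dummy questions in dummy.json, or follow the same format and plug in your own data.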
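To prepare your own data, a record might look like the sketch below. It assumes the layout is a JSON list of records, each with an `id` and a `conversations` list whose turns carry `from` ("human" or "gpt") and `value` fields; check dummy.json in your FastChat checkout to confirm before relying on it. The `validate` helper and the sample text are made up for illustration.

```python
import json

# One training record in the assumed conversation format.
records = [
    {
        "id": "example_0",
        "conversations": [
            {"from": "human", "value": "Who are you?"},
            {"from": "gpt", "value": "I am a demo assistant."},
        ],
    }
]


def validate(records: list) -> None:
    """Check that every record alternates human/gpt turns, starting with human."""
    for record in records:
        turns = record["conversations"]
        for index, turn in enumerate(turns):
            expected = "human" if index % 2 == 0 else "gpt"
            if turn["from"] != expected:
                raise ValueError(f"{record['id']}: turn {index} should be {expected}")


validate(records)
serialized = json.dumps(records, ensure_ascii=False, indent=2)
print(serialized)
```

Writing `serialized` to a file gives a dataset that can be passed to the training script through --data_path, provided the assumed format matches.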
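Code and Hyperparameters
Vicuna's code is based on Stanford Alpaca, with added support for multi-turn conversations, and it uses hyperparameters similar to Stanford Alpaca's.
There are three specific improvements:
Memory optimization: to let Vicuna understand long context, the maximum context length was extended from Alpaca's 512 to 2048, which substantially increases GPU memory requirements. The researchers relieve the memory pressure with gradient checkpointing and FlashAttention.
Multi-turn conversations: the training loss is adjusted to account for multi-turn conversations, and the fine-tuning loss is computed only on the chatbot's outputs.
Cost reduction via Spot instances: a dataset 40 times larger and sequences 4 times longer pose a considerable challenge for training cost. The researchers use SkyPilot managed Spot instances to cut costs, taking advantage of cheaper Spot instances with automatic recovery from preemptions and automatic zone switching. This reduces the training cost of the 7B model from $500 to about $140, and of the 13B model from about $1,000 to $300.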
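Computing the loss only on the chatbot's outputs is commonly done by masking the labels of all other tokens. The sketch below shows the idea on toy token ids. It assumes the usual convention that label -100 is ignored by PyTorch's cross-entropy loss, and the `build_labels` helper is hypothetical rather than taken from the Vicuna code.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value conventionally skipped by PyTorch cross-entropy


def build_labels(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate token ids and mask every non-assistant token in the labels."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)  # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # masked out
    return input_ids, labels


# Two-round toy conversation with made-up token ids.
segments = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14]),
    ("assistant", [23, 24, 25]),
]
input_ids, labels = build_labels(segments)
print(labels)
```

The model still sees the user turns as context in `input_ids`, but only the assistant tokens produce gradient signal.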
Here I use the dummy.json data and train Vicuna-7B on 2 x A800 (80GB) with the following command.
torchrun --nproc_per_node=2 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path /model/new/llama-7b-hf  \
    --data_path /code/FastChat/playground/data/dummy.json \
    --bf16 True \
    --output_dir /output/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
Run log:
torchrun --nproc_per_node=2 --master_port=20001 fastchat/train/train_mem.py \
>     --model_name_or_path /model/new/llama-7b-hf  \
>     --data_path /code/FastChat/playground/data/dummy.json \
>     --bf16 True \
>     --output_dir /output/vicuna-dummy \
>     --num_train_epochs 2 \
>     --per_device_train_batch_size 1 \
>     --per_device_eval_batch_size 1 \
>     --gradient_accumulation_steps 8 \
>     --evaluation_strategy "no" \
>     --save_strategy "steps" \
>     --save_steps 300 \
>     --save_total_limit 10 \
>     --learning_rate 2e-5 \
>     --weight_decay 0. \
>     --warmup_ratio 0.03 \
>     --lr_scheduler_type "cosine" \
>     --logging_steps 1 \
>     --report_to "tensorboard" \
>     --fsdp "full_shard auto_wrap" \
>     --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
>     --tf32 True \
>     --model_max_length 2048 \
>     --gradient_checkpointing True \
>     --lazy_preprocess True
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:39<00:00, 19.93s/it]
Loading data...
Formatting inputs...Skip in lazy mode
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 2/2 [00:51<00:00, 25.89s/it]
  0%|                                                                                             | 0/112 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 3.4105, 'learning_rate': 5e-06, 'epoch': 0.02}                                                                   
{'loss': 3.3312, 'learning_rate': 1e-05, 'epoch': 0.04}                                                                   
{'loss': 1.025, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.05}                                                   
{'loss': 0.4112, 'learning_rate': 2e-05, 'epoch': 0.07}                                                                   
{'loss': 0.4943, 'learning_rate': 1.9995769500822007e-05, 'epoch': 0.09}                                                  
{'loss': 0.5115, 'learning_rate': 1.9983081582712684e-05, 'epoch': 0.11}                                                  
{'loss': 0.1852, 'learning_rate': 1.9961946980917457e-05, 'epoch': 0.12}                                                  
{'loss': 0.4135, 'learning_rate': 1.9932383577419432e-05, 'epoch': 0.14}                                                  
{'loss': 0.2036, 'learning_rate': 1.9894416385809444e-05, 'epoch': 0.16}                                                  
{'loss': 0.1986, 'learning_rate': 1.9848077530122083e-05, 'epoch': 0.18}
{'loss': 0.124, 'learning_rate': 1.3692061473126845e-05, 'epoch': 0.79}                                                   
{'loss': 0.1103, 'learning_rate': 1.342020143325669e-05, 'epoch': 0.81}                                                   
{'loss': 0.1126, 'learning_rate': 1.3145447561516138e-05, 'epoch': 0.83}                                                  
{'loss': 0.1348, 'learning_rate': 1.2868032327110904e-05, 'epoch': 0.84}                                                  
{'loss': 0.1629, 'learning_rate': 1.2588190451025209e-05, 'epoch': 0.86}                                                  
{'loss': 0.1291, 'learning_rate': 1.2306158707424402e-05, 'epoch': 0.88}                                                  
{'loss': 0.1048, 'learning_rate': 1.2022175723320382e-05, 'epoch': 0.9}                                                   
{'loss': 0.1153, 'learning_rate': 1.1736481776669307e-05, 'epoch': 0.91}                                                  
{'loss': 0.1325, 'learning_rate': 1.1449318593072468e-05, 'epoch': 0.93}                                                  
{'loss': 0.1256, 'learning_rate': 1.1160929141252303e-05, 'epoch': 0.95}                                                  
{'loss': 0.1064, 'learning_rate': 1.0871557427476585e-05, 'epoch': 0.97}                                                  
{'loss': 0.1235, 'learning_rate': 1.0581448289104759e-05, 'epoch': 0.98}                                                  
{'loss': 0.131, 'learning_rate': 1.0290847187431115e-05, 'epoch': 1.0}                                                    
{'loss': 0.1109, 'learning_rate': 1e-05, 'epoch': 1.02}
{'loss': 0.113, 'learning_rate': 3.4074173710931804e-07, 'epoch': 1.81}                                                   
{'loss': 0.1067, 'learning_rate': 2.6955129420176193e-07, 'epoch': 1.83}                                                  
{'loss': 0.1067, 'learning_rate': 2.0659378234448524e-07, 'epoch': 1.85}                                                  
{'loss': 0.1114, 'learning_rate': 1.519224698779198e-07, 'epoch': 1.86}                                                   
{'loss': 0.1025, 'learning_rate': 1.055836141905553e-07, 'epoch': 1.88}                                                   
{'loss': 0.1119, 'learning_rate': 6.761642258056977e-08, 'epoch': 1.9}                                                    
{'loss': 0.1052, 'learning_rate': 3.805301908254455e-08, 'epoch': 1.92}                                                   
{'loss': 0.1145, 'learning_rate': 1.6918417287318245e-08, 'epoch': 1.93}                                                  
{'loss': 0.1082, 'learning_rate': 4.230499177994007e-09, 'epoch': 1.95}                                                   
{'loss': 0.1078, 'learning_rate': 0.0, 'epoch': 1.97}                                                                     
{'train_runtime': 922.3233, 'train_samples_per_second': 1.973, 'train_steps_per_second': 0.121, 'train_loss': 0.20523243956267834, 'epoch': 1.97}
100%|███████████████████████████████████████████████████████████████████████████████████| 112/112 [14:54<00:00,  7.99s/it]
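The learning rates in the log can be reproduced from the command-line hyperparameters. The sketch below assumes a linear warmup followed by cosine decay to zero, with the number of warmup steps rounded up. That schedule is an assumption about what `--lr_scheduler_type "cosine"` and `--warmup_ratio 0.03` do, although the computed values agree with the logged ones.

```python
import math

NUM_GPUS = 2
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
TOTAL_STEPS = 112  # from the progress bar in the log
PEAK_LR = 2e-5
WARMUP_RATIO = 0.03

# Samples consumed per optimizer step across both GPUs.
global_batch = NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(global_batch, warmup_steps)
for step in (1, 4, 5, 112):
    print(step, learning_rate(step))
```

With a global batch of 16 and 4 warmup steps, the rate climbs through the first four logged values, peaks at 2e-05, and decays to 0.0 at step 112.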
GPU memory usage:
Sat Apr 22 09:17:21 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   70C    P0   306W / 300W |  71518MiB / 81920MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   70C    P0   289W / 300W |  71518MiB / 81920MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     59480      C   /opt/conda/bin/python           71516MiB |
|    1   N/A  N/A     59481      C   /opt/conda/bin/python           71516MiB |
+-----------------------------------------------------------------------------+
Model weight files:
> ls -al /output/vicuna-dummy
total 26322636
drwxr-xr-x 3 root root       4096 4月  22 09:25 .
drwxr-xr-x 3 root root       4096 4月  22 00:47 ..
-rw-r--r-- 1 root root        547 4月  22 09:24 config.json
-rw-r--r-- 1 root root        132 4月  22 09:24 generation_config.json
-rw-r--r-- 1 root root 9877989586 4月  22 09:24 pytorch_model-00001-of-00003.bin
-rw-r--r-- 1 root root 9894801014 4月  22 09:24 pytorch_model-00002-of-00003.bin
-rw-r--r-- 1 root root 7180990649 4月  22 09:25 pytorch_model-00003-of-00003.bin
-rw-r--r-- 1 root root      26788 4月  22 09:25 pytorch_model.bin.index.json
drwxr-xr-x 5 root root       4096 4月  22 09:08 runs
-rw-r--r-- 1 root root         96 4月  22 09:25 special_tokens_map.json
-rw-r--r-- 1 root root        727 4月  22 09:25 tokenizer_config.json
-rw-r--r-- 1 root root     499723 4月  22 09:25 tokenizer.model
-rw-r--r-- 1 root root      13895 4月  22 09:24 trainer_state.json
-rw-r--r-- 1 root root       3771 4月  22 09:25 training_args.bin
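The shard sizes in this listing hint at the precision of the saved checkpoint. The sketch below sums the three shard files and divides by the parameter count. The count of 6,738,415,616 parameters for LLaMA-7B is an assumption that does not appear in the listing.

```python
shard_bytes = [9877989586, 9894801014, 7180990649]  # sizes from the listing
total_bytes = sum(shard_bytes)

# Assumed parameter count for LLaMA-7B (not stated in the listing).
NUM_PARAMS = 6_738_415_616

bytes_per_param = total_bytes / NUM_PARAMS
print(f"total: {total_bytes / 1e9:.2f} GB, {bytes_per_param:.2f} bytes per parameter")
```

About 4 bytes per parameter is consistent with weights saved in 32-bit floats, which would explain why this checkpoint has three shards and is roughly twice the size of the merged weights shown earlier (12854M in two shards).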
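What if you only have a single GPU? You can try offloading: with the FSDP offload option in the command below, model parameters that are not currently in use are moved to CPU memory.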
torchrun --nproc_per_node=1 --master_port=20002 fastchat/train/train_mem.py \
    --model_name_or_path /model/new/vicuna-7b-all-v1.1  \
    --data_path /data/yummy.json \
    --bf16 True \
    --output_dir /output/vicuna-7b-yummy \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard offload auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
Once training finishes, run inference with the generated Vicuna model weights:
python3 -m fastchat.serve.cli --model-path /output/vicuna-dummy
Run log:
> python3 -m fastchat.serve.cli --model-path /output/vicuna-dummy
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 3/3 [00:51<00:00, 17.13s/it]
USER: Who are you
ASSISTANT: My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS).
USER: What can you do
ASSISTANT: I can chat with you!
USER: Who made you?
ASSISTANT: I'm a language model trained by researchers from Large Model Systems Organization (LMSYS).
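As shown, Vicuna has learned the knowledge contained in the dummy.json data file.
That completes the reproduction of Vicuna training and inference from scratch. Overall, GPT-4 prefers Vicuna's answers over those of other SOTA open-source models (LLaMA and Alpaca) in more than 90% of the questions, and Vicuna performs competitively with proprietary models (ChatGPT, Bard). In 45% of the questions, GPT-4 rates Vicuna's response as better than or equal to ChatGPT's.
References: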
FastChat
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality
Stanford Alpaca
Author: 吃果冻不吐果冻皮 (Juejin contracted author, artificial intelligence)