相关文章推荐
情感对话大模型研发测评

情感对话大模型研发测评

简介

目前基于情感对话数据集(12967样本)分别在BELLE-LLaMA-13B和BELLE-LLaMA-EXT-13B基础上全参微调了两版模型,DMAI-Health-LLaMA-13B, DMAI-Health-LLaMA-EXT-13B

大模型基本能力分成10个维度:code, open qa, brainstorming,classification,math,generation,summarization,rewrite,closed qa,extract。

大模型专业能力目前只配置2个维度:generation-情感咨询师角色扮演, open qa-心理问答咨询

实验目标

测试微调后的模型在基本能力上是否有训练遗忘现象。

测试微调后的模型在专业情感数据集上是否有明显提升。

评测方法

gpt3.5-turbo进行自动化评测,详情如下:
自动化评测主要借助chatgpt来实现,参考belle的评测方法,具体内容如下:

Use case Prompt
Math 你是一个数学老师,给定一道数学问题,你需要判断学生答案和标准答案是否一致。如果学生的答案结果和标准答案结果一致,则得 1分,如果不一致,则直接得0分。请按照"得分:"这样的形式输出学生分数。 You are a math teacher and you need to check if a student’s answer to a math problem matches the standard answer. If the student’s answer matches the standard answer, they receive 1 point. If not, they receive 0 points. Please output the student’s score in the format of "Score:".
Code 你是一个计算机科学老师,给定一道编程问题,你需要判断学生答案是否能够顺利执行并取得满足题目要求的结果。如果可以,则得 1分,不可以则得0分。你可以参考标准答案中的代码。请按照"得分:"这样的形式输出学生分数。 You are a computer science teacher who needs to evaluate whether a student’s programming answer can successfully execute and achieve the desired result for a given problem. If it can, the student gets 1 point, otherwise they get 0 points. You can refer to the code in the standard answer. Please output the student’s score in the format of "score:".
COT 你是一个逻辑学家,给定一个问题,你需要判断模型回答是否在符合常识、逻辑的前提下,很好的回答了这个问题。如果模型回答符 合逻辑,则模型回答得1分,如果模型回答不符合逻辑,则得0分。你可以参考标准回答中的内容。请按照"得分 :"这样的形式输出分数。 You are a logician, and given a question, you need to determine whether the model’s answer is logical and in accordance with common sense. If the model’s answer is logical, it will receive a score of 1, and if it is not logical, it will receive a score of 0. You can refer to the content of the standard answer. Please output the score in the format of "Score:".
Classification 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求 分类结果越准确,分数越高。 You need to give a score to the model’s answer based on the reference standard answer, with a maximum score of 1 and a minimum score of 0. Please output the score in the format of "Score:". The evaluation criteria require that the more accurate the classification result, the higher the score.
Extraction 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求 需要保证抽取出来的结果来自文本,并且符合问题的要求。 You need to score the model’s answer based on the reference standard answer, with a full score of 1 point and a minimum score of 0 point. Please output the score in the format of "Score:". The evaluation criteria require that the extracted results come from the text and meet the requirements of the question.
Open QA 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求 回答的结果越接近正确答案分数越高。 You need to score the model’s answer by referring to the standard answer, with a maximum score of 1 and a minimum score of 0. Please output the score in the format of "Score: ". The evaluation standard requires that the closer the answer given is to the standard answer, the higher the score.
Closed QA 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要 求回答的结果准确,且回答结果来自问题里面提供的信息。 You need to score the model’s answer by referencing the standard answer. The full score is 1 point, and the lowest score is 0 point. Please output the score in the format of "Score:". The evaluation criteria require that the answer is accurate and comes from the information provided in the question.
Generation 假设你是一个作家,你需要研究评价标准来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。 评价标准要求生成的结果语句通顺,内容主题符合要求。 Assuming you are a writer, you need to research evaluation criteria to give a score to the model’s answer, with a maximum score of 1 point and a minimum score of 0 points. Please output the score in the format of "Score:". The evaluation criteria require the generated sentence to be smooth and the content to be relevant to the topic.
Brainstorming 你需要研究评价标准来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求要求 回答的内容对于问题有帮助,并且是真实没有恶意的。 You need to study the evaluation criteria to give a score to the model’s answer, with a maximum score of 1 point and a minimum score of 0 points. Please output the score in the format of "Score:". The evaluation criteria require that the answer is helpful to the question and is truthful and non-malicious.
Rewrite 假设你是一个作家,你需要研究评价标准来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。 评价标准要求重写过后的句子保持原有的意思,并且重写过后的句子越通顺分数越高。 Assuming that you are a writer, you need to research the evaluation criteria to give a score for the model’s answer, with a maximum score of 1 point and a minimum score of 0 points. Please output the score in the format of "Score:". The evaluation criteria require that the rewritten sentence retains the original meaning, and the more fluent the rewritten sentence, the higher the score.
Translation 假设你是一个语言学家,你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形 式输出分数。评价标准要求翻译过后的句子保持原有的意思,并且翻译过后的句子越通顺分数越高。 Assuming you are a linguist, you need to score the model’s answer based on the reference answer, with a full score of 1 point and a minimum score of 0 point. Please output the score in the form of "Score:". Th
Summarization 假设你是一个作家,你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出 分数。评价标准要求生成的摘要内容能包含输入文本信息的重点. Assuming you are a writer, you need to score the model’s answer by referring to the standard answer, with a full score of 1 point and a minimum score of 0 points. Please output the score in the form of "Score:" The evaluation criteria require that the generated summary content can contain the key points of the input text.

实验结果

1. 基本能力实验结果

类别 样本数 chatglm-6b BELLE-LLaMA-13B BELLE-LLaMA-EXT-13B DMAI-Health-LLaMA-13B DMAI-Health-LLaMA-EXT-13B
code 38 0.60 0.73 0.56 0.64 0.52
open qa 285 0.50 0.49 0.42 0.46 0.42
brainstorming 179 0.91 0.90 0.80 0.89 0.85
classification 65 0.59 0.74 0.58 0.69 0.61
math 75 0.60 0.31 0.40 0.27 0.33
generation 98 0.94 0.94 0.75 0.96 0.91
summarization 40 0.75 0.71 0.43 0.69 0.44
rewrite 131 0.87 0.89 0.75 0.86 0.82
closed qa 52 0.58 0.59 0.43 0.60 0.39
extract 37 0.66 0.58 0.41 0.49 0.22

如图所示:

2. 专业能力实验结果

类别 样本数 chatglm-6b BELLE-LLaMA-13B BELLE-LLaMA-EXT-13B DMAI-Health-LLaMA-13B DMAI-Health-LLaMA-EXT-13B
generation-情感咨询师角色扮演 1315 0.97 0.99 0.91 0.97 0.98
open qa-心理问答咨询 20 0.47 0.38 0.27 0.35 0.25

如图所示:

评测效果展示

详情查看下面excel的样例评测效果:

链接: pan.baidu.com/s/1pOTgzn 提取码: nb9c

初步结论

1 BELLE-LLaMA-EXT-13B扩充词表后的相比BELLE-LLaMA-13B看来并没有明显提升,而且还普遍下降

2 如 SuperCLUE指标评测 展示的那样,belle math能力欠佳

3 训练后的模型除了extract之外,其它基本能力大致与训练前相当

4 现在评测还是一问一答形式,后续侧重多轮对话的评测切合实际

5 chatgpt自动化评测大体靠谱的,后续可能需要数据组协助,跟人工做一个对比实验。

6 本次训练用的语料不多(1万多条),可能样本遗忘没有那么严重,后续数据扩充之后再做一次对比测试。

7 大模型本身的角色扮演能力较强

8 心理问答语料较少,这个后续要多爬虫一些,特别中老年心理相关的语料,而且现在语料都是翻译过来的,而且训练数据较少(1万多条),效果欠佳,这个后续做重点处理

9 现在评测还是一问一答形式,跟真实情感多轮对话有一定差距,后续进一步侧重多轮情感对话的评测

10 chatgpt自动化评测大体靠谱,后续可能需要数据组协助,跟人工做一个对比实验

编辑于 2023-05-19 10:10 ・IP 属地广东

文章被以下专栏收录

 
推荐文章