情感对话大模型研发测评


简介
目前基于情感对话数据集(12967样本)分别在BELLE-LLaMA-13B和BELLE-LLaMA-EXT-13B基础上全参微调了两版模型,DMAI-Health-LLaMA-13B, DMAI-Health-LLaMA-EXT-13B
大模型基本能力分成10个维度:code, open qa, brainstorming,classification,math,generation,summarization,rewrite,closed qa,extract。
大模型专业能力目前只配置2个维度:generation-情感咨询师角色扮演, open qa-心理问答咨询
实验目标
测试微调后的模型在基本能力上是否有训练遗忘现象。
测试微调后的模型在专业情感数据集上是否有明显提升。
评测方法
gpt3.5-turbo进行自动化评测,详情如下:
自动化评测主要借助chatgpt来实现,参考belle的评测方法,具体内容如下:
Use case | Prompt |
---|---|
Math | 你是一个数学老师,给定一道数学问题,你需要判断学生答案和标准答案是否一致。如果学生的答案结果和标准答案结果一致,则得 1分,如果不一致,则直接得0分。请按照"得分:"这样的形式输出学生分数。 You are a math teacher and you need to check if a student’s answer to a math problem matches the standard answer. If the student’s answer matches the standard answer, they receive 1 point. If not, they receive 0 points. Please output the student’s score in the format of "Score:". |
Code | 你是一个计算机科学老师,给定一道编程问题,你需要判断学生答案是否能够顺利执行并取得满足题目要求的结果。如果可以,则得 1分,不可以则得0分。你可以参考标准答案中的代码。请按照"得分:"这样的形式输出学生分数。 You are a computer science teacher who needs to evaluate whether a student’s programming answer can successfully execute and achieve the desired result for a given problem. If it can, the student gets 1 point, otherwise they get 0 points. You can refer to the code in the standard answer. Please output the student’s score in the format of "score:". |
COT | 你是一个逻辑学家,给定一个问题,你需要判断模型回答是否在符合常识、逻辑的前提下,很好的回答了这个问题。如果模型回答符 合逻辑,则模型回答得1分,如果模型回答不符合逻辑,则得0分。你可以参考标准回答中的内容。请按照"得分 :"这样的形式输出分数。 You are a logician, and given a question, you need to determine whether the model’s answer is logical and in accordance with common sense. If the model’s answer is logical, it will receive a score of 1, and if it is not logical, it will receive a score of 0. You can refer to the content of the standard answer. Please output the score in the format of "Score:". |
Classification | 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求 分类结果越准确,分数越高。 You need to give a score to the model’s answer based on the reference standard answer, with a maximum score of 1 and a minimum score of 0. Please output the score in the format of "Score:". The evaluation criteria require that the more accurate the classification result, the higher the score. |
Extraction | 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求 需要保证抽取出来的结果来自文本,并且符合问题的要求。 You need to score the model’s answer based on the reference standard answer, with a full score of 1 point and a minimum score of 0 point. Please output the score in the format of "Score:". The evaluation criteria require that the extracted results come from the text and meet the requirements of the question. |
Open QA | 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求 回答的结果越接近正确答案分数越高。 You need to score the model’s answer by referring to the standard answer, with a maximum score of 1 and a minimum score of 0. Please output the score in the format of "Score: ". The evaluation standard requires that the closer the answer given is to the standard answer, the higher the score. |
Closed QA | 你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要 求回答的结果准确,且回答结果来自问题里面提供的信息。 You need to score the model’s answer by referencing the standard answer. The full score is 1 point, and the lowest score is 0 point. Please output the score in the format of "Score:". The evaluation criteria require that the answer is accurate and comes from the information provided in the question. |
Generation | 假设你是一个作家,你需要研究评价标准来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。 评价标准要求生成的结果语句通顺,内容主题符合要求。 Assuming you are a writer, you need to research evaluation criteria to give a score to the model’s answer, with a maximum score of 1 point and a minimum score of 0 points. Please output the score in the format of "Score:". The evaluation criteria require the generated sentence to be smooth and the content to be relevant to the topic. |
Brainstorming | 你需要研究评价标准来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。评价标准要求要求 回答的内容对于问题有帮助,并且是真实没有恶意的。 You need to study the evaluation criteria to give a score to the model’s answer, with a maximum score of 1 point and a minimum score of 0 points. Please output the score in the format of "Score:". The evaluation criteria require that the answer is helpful to the question and is truthful and non-malicious. |
Rewrite | 假设你是一个作家,你需要研究评价标准来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出分数。 评价标准要求重写过后的句子保持原有的意思,并且重写过后的句子越通顺分数越高。 Assuming that you are a writer, you need to research the evaluation criteria to give a score for the model’s answer, with a maximum score of 1 point and a minimum score of 0 points. Please output the score in the format of "Score:". The evaluation criteria require that the rewritten sentence retains the original meaning, and the more fluent the rewritten sentence, the higher the score. |
Translation | 假设你是一个语言学家,你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形 式输出分数。评价标准要求翻译过后的句子保持原有的意思,并且翻译过后的句子越通顺分数越高。 Assuming you are a linguist, you need to score the model’s answer based on the reference answer, with a full score of 1 point and a minimum score of 0 point. Please output the score in the form of "Score:". Th |
Summarization | 假设你是一个作家,你需要通过参考标准答案,来对模型的答案给出分数,满分为1分,最低分为0分。请按照"得分:"这样的形式输出 分数。评价标准要求生成的摘要内容能包含输入文本信息的重点. Assuming you are a writer, you need to score the model’s answer by referring to the standard answer, with a full score of 1 point and a minimum score of 0 points. Please output the score in the form of "Score:" The evaluation criteria require that the generated summary content can contain the key points of the input text. |
实验结果
1. 基本能力实验结果
类别 | 样本数 | chatglm-6b | BELLE-LLaMA-13B | BELLE-LLaMA-EXT-13B | DMAI-Health-LLaMA-13B | DMAI-Health-LLaMA-EXT-13B |
code | 38 | 0.60 | 0.73 | 0.56 | 0.64 | 0.52 |
open qa | 285 | 0.50 | 0.49 | 0.42 | 0.46 | 0.42 |
brainstorming | 179 | 0.91 | 0.90 | 0.80 | 0.89 | 0.85 |
classification | 65 | 0.59 | 0.74 | 0.58 | 0.69 | 0.61 |
math | 75 | 0.60 | 0.31 | 0.40 | 0.27 | 0.33 |
generation | 98 | 0.94 | 0.94 | 0.75 | 0.96 | 0.91 |
summarization | 40 | 0.75 | 0.71 | 0.43 | 0.69 | 0.44 |
rewrite | 131 | 0.87 | 0.89 | 0.75 | 0.86 | 0.82 |
closed qa | 52 | 0.58 | 0.59 | 0.43 | 0.60 | 0.39 |
extract | 37 | 0.66 | 0.58 | 0.41 | 0.49 | 0.22 |
如图所示:

2. 专业能力实验结果
类别 | 样本数 | chatglm-6b | BELLE-LLaMA-13B | BELLE-LLaMA-EXT-13B | DMAI-Health-LLaMA-13B | DMAI-Health-LLaMA-EXT-13B |
generation-情感咨询师角色扮演 | 1315 | 0.97 | 0.99 | 0.91 | 0.97 | 0.98 |
open qa-心理问答咨询 | 20 | 0.47 | 0.38 | 0.27 | 0.35 | 0.25 |
如图所示:

评测效果展示
详情查看下面excel的样例评测效果:
链接: https:// pan.baidu.com/s/1pOTgzn qkV_T7iyXr9yRhvw?pwd=nb9c 提取码: nb9c
初步结论
1 BELLE-LLaMA-EXT-13B扩充词表后的相比BELLE-LLaMA-13B看来并没有明显提升,而且还普遍下降
2 如 SuperCLUE指标评测 展示的那样,belle math能力欠佳
3 训练后的模型除了extract之外,其它基本能力大致与训练前相当
4 现在评测还是一问一答形式,后续侧重多轮对话的评测切合实际
5 chatgpt自动化评测大体靠谱的,后续可能需要数据组协助,跟人工做一个对比实验。
6 本次训练用的语料不多(1万多条),可能样本遗忘没有那么严重,后续数据扩充之后再做一次对比测试。
7 大模型本身的角色扮演能力较强
8 心理问答语料较少,这个后续要多爬虫一些,特别中老年心理相关的语料,而且现在语料都是翻译过来的,而且训练数据较少(1万多条),效果欠佳,这个后续做重点处理
9 现在评测还是一问一答形式,跟真实情感多轮对话有一定差距,后续进一步侧重多轮情感对话的评测
10 chatgpt自动化评测大体靠谱,后续可能需要数据组协助,跟人工做一个对比实验