jeinlee1991 / chinese-llm-benchmark

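Chinese LLM capability benchmark: currently covers 115 large models, including commercial models such as chatgpt, gpt4o, Baidu Wenxin Yiyan (ERNIE Bot), Alibaba Tongyi Qianwen, iFlytek Spark, SenseTime senseChat, and minimax, as well as open-source models such as Baichuan, qwen2, glm4, yi, InternLM2, and llama3, evaluated across multiple capability dimensions. It provides not only a capability score leaderboard but also the raw outputs of every model!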

chinese-llm-benchmark's Introduction

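CLiB: Chinese LLM Capability Benchmark (continuously updated)

Currently covers 115 large models, including commercial models such as chatgpt, gpt4, Baidu Wenxin Yiyan (ERNIE Bot), Alibaba Tongyi Qianwen, iFlytek Spark, SenseTime senseChat, and minimax, as well as open-source models such as Baichuan, qwen2, glm4, openbuddy, AquilaChat, InternLM2, and llama3.1.
The models come from major Chinese and international companies, LLM startups, and university and research institutions.
Supports multi-dimensional capability evaluation, including classification, information extraction, reading comprehension, data analysis, Chinese encoding efficiency, and Chinese instruction following.
It provides not only a capability score leaderboard but also the raw outputs of every model, so anyone interested can do their own scoring and build their own ranking!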

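🔄Recent updates
⚓TODO
📝Basic model information
📊Leaderboards
- Overall capability leaderboard
  - Open-source models under 10B
  - Open-source models from 10B to 20B
  - Open-source models above 20B
- Classification leaderboard
- Information extraction leaderboard
- Reading comprehension leaderboard
- Data analysis leaderboard
- Chinese encoding efficiency leaderboard
- Chinese instruction following leaderboard
- Basic math (arithmetic) leaderboard
🌐Scores by capability
⚖️Raw evaluation data
Why build this leaderboard?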

TODO

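Add more models to the evaluation: Claude, gemini, and so on
Add the license of each open-source model and note whether commercial use is allowed
Introduce more evaluation dimensions: math, coding, open-domain question answering, multi-turn dialogue, brainstorming, translation, and more
Break the evaluation dimensions down further; for example, information extraction can be split into time entity extraction, address entity extraction, and so on
Be open to all sources: integrate other evaluation leaderboards and expand into domain-specific leaderboards (for example, education and healthcare)
Add more evaluation data so that the scores become increasingly convincing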

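Basic model information

Price unit: yuan / 1M tokens, that is, yuan per million tokens

Category	Model	Price / Download	Organization
Commercial	qwen-long	Input: 0.5 yuan, output: 2 yuan	Alibaba
Commercial	qwen-turbo	Input: 2 yuan, output: 6 yuan	Alibaba
Commercial	qwen-plus	Input: 4 yuan, output: 12 yuan	Alibaba
Commercial	qwen-max	Input: 40 yuan, output: 120 yuan	Alibaba
Commercial	gpt-4o	Input: 36.2 yuan, output: 108.6 yuan	openAI
Commercial	gpt-4-turbo	Input: 72.4 yuan, output: 217.2 yuan	openAI
Commercial	gpt-4	Input: 217.2 yuan, output: 434.4 yuan	openAI
Commercial	gpt-3.5-turbo	Input: 3.6 yuan, output: 10.9 yuan	openAI
...	...	...	...
Open-source	glm-4-9b-chat	Download from ModelScope	Zhipu AI
Open-source	Phi-3-mini-128k-instruct	Download from ModelScope	Microsoft
Open-source	internlm2-chat-20b	Download from ModelScope	Shanghai AI Laboratory
...	...	...	...

For details, see the Chinese LLM resource roundup (commercial and open-source)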
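Because prices are quoted per million tokens, the cost of a single request can be estimated with simple arithmetic. The sketch below is only an illustration: the function name `estimate_cost_yuan` and the token counts are assumptions, while the prices are the qwen-turbo figures from the table above.

```python
def estimate_cost_yuan(input_tokens: int, output_tokens: int,
                       input_price: float, output_price: float) -> float:
    """Estimate the cost of one request in yuan.

    Prices are given in yuan per 1M tokens, as in the table above.
    """
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# qwen-turbo: input 2 yuan, output 6 yuan per 1M tokens (from the table).
# The token counts are made up for illustration.
cost = estimate_cost_yuan(input_tokens=1_500, output_tokens=500,
                          input_price=2, output_price=6)
print(f"{cost:.4f} yuan")
```

Output tokens are several times more expensive than input tokens for every commercial model listed, so long generations tend to dominate the bill.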

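📊 Leaderboards

1. Overall capability leaderboard

The overall capability score is the average of six scores: classification, information extraction, reading comprehension, data analysis, instruction following, and arithmetic.
Detailed data: total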
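The averaging rule can be checked against any row of the leaderboards. The following is a minimal sketch: the dictionary keys and the function name `overall_score` are illustrative, and rounding to one decimal place is an assumption based on how totals are displayed in the tables. The numbers are the gpt-4o scores reported in this README.

```python
from statistics import mean

# The six dimensions that enter the overall score (key names are illustrative).
DIMENSIONS = (
    "classification",
    "information_extraction",
    "reading_comprehension",
    "data_analysis",
    "instruction_following",
    "arithmetic",
)


def overall_score(scores: dict) -> float:
    """Average the six dimension scores and round to one decimal place."""
    return round(mean(scores[d] for d in DIMENSIONS), 1)


# gpt-4o scores as listed in this README.
gpt_4o = {
    "classification": 93,
    "information_extraction": 96.3,
    "reading_comprehension": 98.0,
    "data_analysis": 100.0,
    "instruction_following": 83,
    "arithmetic": 95.7,
}
print(overall_score(gpt_4o))  # 94.3
```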

1.1 Commercial model leaderboard

(1) Commercial models with an output price of 100 yuan or more

Model	Price (output)	Classification	Information extraction	Reading comprehension	Data analysis	Instruction following	Arithmetic	Total	Rank
gpt-4o	109 yuan	93	96.3	98.0	100.0	83	95.7	94.3	1
gpt4	434 yuan	94	94.0	99.3	97.0	84	86.5	92.5	2
gpt-4-turbo	217 yuan	91	90.0	94.0	96.0	83	96.5	91.8	3
Baidu Wenxin 4.0	120 yuan	88	89.0	94.7	94.0	79	100.0	90.8	4
iFlytek 4.0Ultra	100 yuan	88	84.4	96.0	92.7	80	94.3	89.2	5
Zhipu GLM4	100 yuan	86	90.0	97.3	96.0	77	78.0	87.4	6
Baichuan4	100 yuan	86	94.1	93.3	95.3	75	78.2	87.0	7
Alibaba qwen-max	120 yuan	86	82.0	95.3	91.3	77	79.8	85.2	8

(2) Commercial models with an output price of 10~100 yuan

Model	Price (output)	Classification	Information extraction	Reading comprehension	Data analysis	Instruction following	Arithmetic	Total	Rank
Baidu ERNIE-3.5-8K	12 yuan	89	89.6	96.7	97.3	72	100.0	90.8	1
Baidu ERNIE-4.0-Turbo	60 yuan	90	88.9	96.0	93.3	78	97.7	90.7	2
minimax-abab6.5-chat	30 yuan	89	87.0	89.3	95.3	76	90.3	87.8	3
Baichuan3-Turbo	12 yuan	88	86.7	94.7	90.7	75	89.2	87.4	4
yi-large	20 yuan	85	91.0	90.0	92.7	77	88.3	87.3	5
minimax-abab6.5s-chat	10 yuan	87	88.0	88.7	88.0	80	91.7	87.2	6
iFlytek Spark v3.5 (spark-max)	30 yuan	87	92.0	89.3	87.3	74	93.5	87.2	7
Zhipu GLM-4-AirX	10 yuan	89	91.9	92.7	88.0	83	74.2	86.5	8
Alibaba qwen-plus	12 yuan	88	89.6	90.0	84.0	73	93.0	86.3	9
yi-large-turbo	12 yuan	82	90.0	88.7	86.7	78	87.8	85.5	10
iFlytek Spark v3 (spark-pro)	30 yuan	87	82.0	88.0	86.0	74	94.0	85.2	11
gpt-3.5-turbo	11 yuan	81	83.0	92.7	91.3	77	80.0	84.2	12
Moonshot AI moonshot-v1-8k	12 yuan	92	85.0	84.0	89.3	72	79.3	83.6	13
SenseTime SenseChat-v4	12 yuan	89	78.5	88.0	86.7	71	72.2	80.9	14
minimax-abab5.5-chat	15 yuan	83	79.0	86.7	72.7	76	39.7	72.8	15

(3) Commercial models with an output price below 10 yuan

Model	Price (output)	Classification	Information extraction	Reading comprehension	Data analysis	Instruction following	Arithmetic	Total	Rank
gpt-4o-mini	4.3 yuan	90	93.3	89.3	100.0	83	92.7	91.4	1
deepseek-chat-v2	2 yuan	93	88.0	94.0	96.0	76	96.7	90.6	2
Doubao-pro-32k	2 yuan	86	88.1	96.7	86.7	85	98.2	90.1	3
yi-medium	2.5 yuan	86	93.0	89.3	94.0	76	89.2	87.9	4
Zhipu GLM-4-Air	1 yuan	89	91.9	92.7	88.0	83	74.5	86.5	5
Alibaba qwen-long	2 yuan	89	85.9	90.0	86.7	75	83.3	85.0	6
Zhipu GLM-4-Flash	0.1 yuan	89	80.0	86.0	82.0	79	75.5	81.9	7
yi-spark	1 yuan	82	88.9	88.0	76.0	72	83.3	81.7	8
Baidu ERNIE-Speed-8K	0 yuan	88	88.1	88.0	89.3	68	68.7	81.7	9
qwen-turbo	6 yuan	83	85.2	88.0	76.0	66	81.3	79.9	10
SenseTime SenseChat-Turbo	5 yuan	81	77.8	76.7	86.0	72	78.5	78.7	11
Doubao-lite-32k	0.6 yuan	77	86.7	88.7	64.7	62	87.2	77.7	12
minimax-abab5.5s-chat	5 yuan	58	57.0	70.7	56.0	49	57.0	58.0	13

Bad cases for flagship commercial models: gpt-4o | moonshot-v1-8k | deepseek-chat-v2 | yi-large | more
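The three commercial leaderboards are split by output price. A small sketch of that bucketing is shown below; the function name `price_tier` is hypothetical, and the boundary handling (100 yuan falls in the top tier, 10 yuan in the middle tier) is inferred from where the 100-yuan and 10-yuan models appear in the tables above rather than stated by the README.

```python
def price_tier(output_price_yuan: float) -> str:
    """Return the leaderboard tier for an output price in yuan per 1M tokens."""
    if output_price_yuan >= 100:
        return "100 yuan or more"
    if output_price_yuan >= 10:
        return "10~100 yuan"
    return "below 10 yuan"


# Output prices taken from the leaderboards above.
for model, price in [("gpt-4o", 109), ("minimax-abab6.5s-chat", 10), ("qwen-turbo", 6)]:
    print(model, "->", price_tier(price))
```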

1.2 Open-source model leaderboard

(1) Open-source models under 10B

Category	Model	Classification	Information extraction	Reading comprehension	Data analysis	Instruction following	Arithmetic	Total	Rank
Open-source	gemma-2-9b-it	85	82.2	88.7	87.3	81	89.3	85.6	1
Open-source	glm-4-9b-chat	90	82.2	90.0	82.0	79	76.5	83.3	2
Open-source	Qwen2-7B-Instruct	89	83.7	86.7	75.3	77	81.3	82.2	3
Open-source	Llama-3-8B-Instruct	86	74.0	80.0	90.0	63	89.5	80.4	4
Open-source	internlm2_5-7b-chat	86	84.4	90.0	83.3	79	59.8	80.4	5
Open-source	Yi-1.5-9B-Chat	82	83.0	84.7	80.0	72	73.8	79.2	6
Open-source	Llama-3.1-8B-Instruct	63	85.2	82.0	84.0	69	90.5	79.0	7
Open-source	openbuddy-llama3-8b	78	86.0	81.3	79.0	70	63.2	76.2	8
Open-source	Qwen1.5-7B-Chat	80	76.0	76.0	70.7	67	71.2	73.5	9
Open-source	internlm2-chat-7b	86	81.0	72.7	82.7	64	42.8	71.5	10
Open-source	Phi-3-mini-128k-instruct	74	63.0	65.3	73.0	75	71.3	70.3	11
Open-source	Baichuan2-7B-Chat	88	76.0	83.3	69.0	68	34.8	69.8	12
Open-source	Google gemma-7b-it	72	79.0	74.0	76.0	66	38.5	67.6	13
Open-source	MiniCPM-2B-dpo	79	77.0	74.0	66.0	55	52.7	67.3	14
Open-source	Qwen1.5-4B-Chat	75	65.0	79.3	63.0	56	53.0	65.2	15
Open-source	qwen2-1.5b-instruct	73	74.1	68.0	50.7	54	55.7	62.6	16
Open-source	internlm2-chat-1_8b	69	60.7	63.3	46.0	45	39.7	54.0	17
Open-source	Google gemma-2b-it	56	60.0	60.0	43.3	61	26.3	51.1	18
Open-source	Qwen1.5-1.8B-Chat	57	58.0	52.7	48.0	46	26.7	48.1	19
Open-source	qwen2-0.5b-instruct	49	53.3	62.0	36.7	48	35.5	47.4	20
Open-source	Qwen1.5-0.5B-Chat	44	40.0	60.0	34.7	42	17.2	39.6	21

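(2) Open-source models from 10B to 20B

Category	Model	Classification	Information extraction	Reading comprehension	Data analysis	Instruction following	Arithmetic	Total	Rank
Open-source	Qwen1.5-14B-Chat	89	79.0	90.7	90.7	66	77.5	82.2	1
Open-source	internlm2-chat-20b	93	80.0	86.0	88.0	68	63.3	79.7	2
Open-source	Baichuan2-13B-Chat	83	83.0	74.7	77.0	75	54.8	74.6	3
Open-source	DeepSeek-V2-Lite-Chat	81	76.3	81.3	73.3	69	61.2	73.7	4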

(3) Open-source models above 30B

Category	Model	Classification	Information extraction	Reading comprehension	Data analysis	Instruction following	Arithmetic	Total	Rank
Open-source	Qwen2-72B-Instruct	87	91.1	94.7	90.0	86	94.2	90.5	1
Open-source	Llama-3.1-70B-Instruct	87	88.9	92.0	90.7	79	94.8	88.7	2
Open-source	Llama-3-70B-Instruct	88	87.0	96.0	95.0	70	90.8	87.8	3
Open-source	qwen2-57b-a14b-instruct	85	88.1	89.3	87.3	77	89.2	86.0	4
Open-source	Qwen1.5-32B-Chat	91	86.0	92.7	87.3	72	86.8	86.0	5
Open-source	Qwen1.5-72B-Chat	89	84.0	88.0	87.3	70	84.8	83.8	6
Open-source	deepseek-llm-67b-chat	87	81.0	86.7	92.0	74	72.7	82.2	7
Open-source	Yi-1.5-34B-Chat	90	83.0	82.7	83.3	74	79.0	82.0	8
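2. Classification leaderboard

Example evaluation sample:

Classify the following words by part of speech.
dog, chase, run, adult, happy, tree

☛ See classification bad cases. Detailed data: classification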

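3. Information extraction leaderboard

Example evaluation sample:

"China CITIC Bank 300 million yuan, Bank of Communications an increase of about 270 million yuan, China Everbright Bank about 100 million yuan."
Extract all organization names from the text above

☛ See information extraction bad cases. Detailed data: extract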

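4. Reading comprehension leaderboard

Reading comprehension is a composite ability that tests how well a model understands given information. Depending on the type of information provided, it can be subdivided into article QA, table QA, dialogue QA, and so on.
Example evaluation sample:

Dentist: All right, let's take a look at your teeth. Based on your description and our examination results, you may have some gum disease that is irritating the nerves of your teeth and causing the sensitivity. In addition, these black spots may be cavities.
Patient: Oh, really? What should I do then?
Dentist: Don't worry, we can draw up a treatment plan for you. We need to treat the gum disease first, then remove the decay and fill the cavities. Throughout the process we will make sure you are comfortable and use advanced techniques and materials to achieve the best result.
Patient: OK, thank you, doctor. So when can I start treatment?
Dentist: Let's schedule an appointment for you. Your treatment will start in two days. Until then, please keep brushing, use dental floss, and avoid overly sweet or acidic food and drinks.
Patient: OK, I will. Thank you again, doctor.
Dentist: You're welcome. We will do our best to help you get your teeth healthy again.
Based on the dialogue above, answer: what dental problems were found during the patient's examination?

☛ See reading comprehension bad cases. Detailed data: mrc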

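5. Data analysis leaderboard

Specifically tests an LLM's ability to understand and analyze tables, a capability commonly used in data analysis.
Example evaluation sample:

Name,Age,Gender,Nationality,Height (cm),Weight (kg),Education
Zhang San,28,male,**,180,70,bachelor's
Lisa,33,female,USA,165,58,master's
Paulo,41,male,Brazil,175,80,doctorate
Miyuki,25,female,Japan,160,50,junior college
Ahmed,30,male,Egypt,175,68,bachelor's
Maria,29,female,Mexico,170,65,master's
Antonio,36,male,Spain,182,75,doctorate
Based on this table, answer: which country is the person with the lowest education level from?

☛ See data analysis bad cases. Detailed data: tableqa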
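To make concrete what a model has to do on the sample above, here is a sketch that derives the reference answer programmatically. It is not part of the benchmark: the column names are English renderings of the sample's header, and the ordering of education levels in `EDUCATION_RANK` is an assumption (junior college < bachelor's < master's < doctorate).

```python
import csv
import io

# The sample table from above, rendered in English; "**" is kept as it appears.
TABLE = """name,age,gender,nationality,height_cm,weight_kg,education
Zhang San,28,male,**,180,70,bachelor's
Lisa,33,female,USA,165,58,master's
Paulo,41,male,Brazil,175,80,doctorate
Miyuki,25,female,Japan,160,50,junior college
Ahmed,30,male,Egypt,175,68,bachelor's
Maria,29,female,Mexico,170,65,master's
Antonio,36,male,Spain,182,75,doctorate
"""

# Assumed ordering of education levels, lowest first.
EDUCATION_RANK = {"junior college": 0, "bachelor's": 1, "master's": 2, "doctorate": 3}

rows = list(csv.DictReader(io.StringIO(TABLE)))
lowest = min(rows, key=lambda row: EDUCATION_RANK[row["education"]])
print(lowest["name"], "-", lowest["nationality"])
```

The question needs both a lookup (the education column) and background knowledge (how the levels are ordered), which is what distinguishes it from plain text retrieval.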

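6. Chinese encoding efficiency leaderboard

Not yet counted toward the overall capability score. This dimension tests how efficiently an LLM encodes Chinese characters: for models of the same size, the higher the encoding efficiency, the faster the inference, in nearly direct proportion. Chinese encoding efficiency is the average number of Chinese characters that each generated token decodes to (an LLM generates one token at a time, which is then decoded into visible characters such as Chinese characters, English letters, or punctuation). For example, the Chinese encoding efficiencies of baichuan2 and llama2 are 1.67 and 0.61 respectively, which means that at the same model size baichuan2 runs about 2.7 times as fast as llama2 (1.67/0.61).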
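The definition above reduces to characters per token, and the speed comparison to a ratio of two efficiencies. A minimal sketch follows; the function names are illustrative, and the token count in the first example is a made-up figure rather than the output of any real tokenizer.

```python
def encoding_efficiency(text: str, num_tokens: int) -> float:
    """Average number of characters represented by one token."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return len(text) / num_tokens


def relative_speed(efficiency_a: float, efficiency_b: float) -> float:
    """Approximate speed ratio of model A over model B at equal model size."""
    return efficiency_a / efficiency_b


# Hypothetical: a 7-character Chinese string encoded into 4 tokens.
print(encoding_efficiency("大模型评测榜单", 4))  # 1.75

# Figures quoted in the text: baichuan2 = 1.67, llama2 = 0.61.
print(round(relative_speed(1.67, 0.61), 1))  # 2.7
```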

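7. Chinese instruction following leaderboard

Based on Google's IFEval, translated and adapted to Chinese, with a curated set of 25 instruction types across 9 categories.
Detailed data: IFEval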

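8. Basic math (arithmetic) leaderboard

Tests the arithmetic part of an LLM's basic math ability. The test questions are addition and subtraction of integers up to 1000, and addition, subtraction, multiplication, and division of floating-point numbers with at most 2 significant digits. Examples: 166 + 215 + 53 = ?, 0.97 + 0.4 / 4.51 = ?

☛ See arithmetic bad cases
Detailed data: math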
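Reference answers for questions of this kind can be computed mechanically. The sketch below evaluates the two example expressions with a small, restricted parser built on Python's `ast` module. The helper names `evaluate` and `is_correct` and the tolerance are assumptions for illustration; the README does not say how answers are actually graded.

```python
import ast
import operator

# Only the four basic binary operators are allowed.
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Evaluate an expression made of non-negative numbers and + - * /."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval"))


def is_correct(expression: str, answer: float, tolerance: float = 1e-2) -> bool:
    """Compare a model's numeric answer with the reference (assumed tolerance)."""
    return abs(evaluate(expression) - answer) <= tolerance


print(evaluate("166 + 215 + 53"))     # 434
print(evaluate("0.97 + 0.4 / 4.51"))  # roughly 1.0587
```

Walking the syntax tree instead of calling `eval` keeps the checker from executing arbitrary text that a model might return.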

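🌐Scores by capability

Scoring method: each model is scored along several dimensions, and each dimension has its own evaluation set containing a number of questions. Each question receives 1~5 points according to the quality of the model's reply; the points for all questions in the set are summed and normalized to a 100-point scale, which gives the final score.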
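The README does not spell out the normalization formula. One plausible reading, sketched below, divides the summed points by the maximum attainable total. Both the function name `normalize_score` and the formula itself are assumptions. Under this reading a model that earns at least 1 point on every question cannot score below 20, so the repository may well map points to the 100-point scale differently.

```python
from typing import Sequence


def normalize_score(points: Sequence[int], max_points: int = 5) -> float:
    """Map per-question points to a 100-point scale (assumed formula)."""
    if not points:
        raise ValueError("points must not be empty")
    return 100 * sum(points) / (max_points * len(points))


# Hypothetical per-question points for a three-question evaluation set.
print(normalize_score([5, 4, 3]))  # 80.0
```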

Category	Model	Classification	Information extraction	Reading comprehension	Data analysis	Instruction following	Arithmetic	Overall
Commercial	gpt-4o	93	96.3	98.0	100.0	83	95.7	94.3
Commercial	gpt4	94	94.0	99.3	97.0	84	86.5	92.5
Commercial	gpt-4-turbo	91	90.0	94.0	96.0	83	96.5	91.8
Commercial	gpt-4o-mini	90	93.3	89.3	100.0	83	92.7	91.4
Commercial	Baidu Wenxin 4.0	88	89.0	94.7	94.0	79	100.0	90.8
Commercial	Baidu ERNIE-3.5-8K	89	89.6	96.7	97.3	72	100.0	90.8
Commercial	Baidu ERNIE-4.0-Turbo	90	88.9	96.0	93.3	78	97.7	90.7
Commercial	deepseek-chat-v2	93	88.0	94.0	96.0	76	96.7	90.6
Open-source	Qwen2-72B-Instruct	87	91.1	94.7	90.0	86	94.2	90.5
Commercial	Doubao-pro-32k	86	88.1	96.7	86.7	85	98.2	90.1
Commercial	iFlytek 4.0Ultra	88	84.4	96.0	92.7	80	94.3	89.2
Open-source	Llama-3.1-70B-Instruct	87	88.9	92.0	90.7	79	94.8	88.7
Commercial	yi-medium	86	93.0	89.3	94.0	76	89.2	87.9
Commercial	minimax-abab6.5-chat	89	87.0	89.3	95.3	76	90.3	87.8
Open-source	Llama-3-70B-Instruct	88	87.0	96.0	95.0	70	90.8	87.8
Commercial	Baichuan3-Turbo	88	86.7	94.7	90.7	75	89.2	87.4
Commercial	Zhipu GLM4	86	90.0	97.3	96.0	77	78.0	87.4
Commercial	yi-large	85	91.0	90.0	92.7	77	88.3	87.3
Commercial	minimax-abab6.5s-chat	87	88.0	88.7	88.0	80	91.7	87.2
Commercial	iFlytek Spark v3.5 (spark-max)	87	92.0	89.3	87.3	74	93.5	87.2
Commercial	Baichuan4	86	94.1	93.3	95.3	75	78.2	87.0
Commercial	Zhipu GLM-4-Air	89	91.9	92.7	88.0	83	74.5	86.5
Commercial	Zhipu GLM-4-AirX	89	91.9	92.7	88.0	83	74.2	86.5
Commercial	Alibaba qwen-plus	88	89.6	90.0	84.0	73	93.0	86.3
Open-source	qwen2-57b-a14b-instruct	85	88.1	89.3	87.3	77	89.2	86.0
Open-source	Qwen1.5-32B-Chat	91	86.0	92.7	87.3	72	86.8	86.0
Open-source	gemma-2-9b-it	85	82.2	88.7	87.3	81	89.3	85.6
Commercial	yi-large-turbo	82	90.0	88.7	86.7	78	87.8	85.5
Commercial	Alibaba qwen-max	86	82.0	95.3	91.3	77	79.8	85.2
Commercial	iFlytek Spark v3 (spark-pro)	87	82.0	88.0	86.0	74	94.0	85.2
Commercial	Alibaba qwen-long	89	85.9	90.0	86.7	75	83.3	85.0
Commercial	gpt-3.5-turbo	81	83.0	92.7	91.3	77	80.0	84.2
Open-source	Qwen1.5-72B-Chat	89	84.0	88.0	87.3	70	84.8	83.8
Commercial	Moonshot AI moonshot-v1-8k	92	85.0	84.0	89.3	72	79.3	83.6
Open-source	glm-4-9b-chat	90	82.2	90.0	82.0	79	76.5	83.3
Open-source	deepseek-llm-67b-chat	87	81.0	86.7	92.0	74	72.7	82.2
Open-source	Qwen2-7B-Instruct	89	83.7	86.7	75.3	77	81.3	82.2
Open-source	Qwen1.5-14B-Chat	89	79.0	90.7	90.7	66	77.5	82.2
Open-source	Yi-1.5-34B-Chat	90	83.0	82.7	83.3	74	79.0	82.0
Commercial	Zhipu GLM-4-Flash	89	80.0	86.0	82.0	79	75.5	81.9
Commercial	yi-spark	82	88.9	88.0	76.0	72	83.3	81.7
Commercial	Baidu ERNIE-Speed-8K	88	88.1	88.0	89.3	68	68.7	81.7
Commercial	SenseTime SenseChat-v4	89	78.5	88.0	86.7	71	72.2	80.9
Open-source	Llama-3-8B-Instruct	86	74.0	80.0	90.0	63	89.5	80.4
Open-source	internlm2_5-7b-chat	86	84.4	90.0	83.3	79	59.8	80.4
Commercial	qwen-turbo	83	85.2	88.0	76.0	66	81.3	79.9
Open-source	internlm2-chat-20b	93	80.0	86.0	88.0	68	63.3	79.7
Open-source	Yi-1.5-9B-Chat	82	83.0	84.7	80.0	72	73.8	79.2
Open-source	Llama-3.1-8B-Instruct	63	85.2	82.0	84.0	69	90.5	79.0
Commercial	SenseTime SenseChat-Turbo	81	77.8	76.7	86.0	72	78.5	78.7
Commercial	Doubao-lite-32k	77	86.7	88.7	64.7	62	87.2	77.7
Open-source	openbuddy-llama3-8b	78	86.0	81.3	79.0	70	63.2	76.2
Open-source	Baichuan2-13B-Chat	83	83.0	74.7	77.0	75	54.8	74.6
Open-source	DeepSeek-V2-Lite-Chat	81	76.3	81.3	73.3	69	61.2	73.7
Open-source	Qwen1.5-7B-Chat	80	76.0	76.0	70.7	67	71.2	73.5
Commercial	minimax-abab5.5-chat	83	79.0	86.7	72.7	76	39.7	72.8
Open-source	internlm2-chat-7b	86	81.0	72.7	82.7	64	42.8	71.5
Open-source	Phi-3-mini-128k-instruct	74	63.0	65.3	73.0	75	71.3	70.3
Open-source	Baichuan2-7B-Chat	88	76.0	83.3	69.0	68	34.8	69.8
Open-source	Google gemma-7b-it	72	79.0	74.0	76.0	66	38.5	67.6
Open-source	MiniCPM-2B-dpo	79	77.0	74.0	66.0	55	52.7	67.3
Open-source	Qwen1.5-4B-Chat	75	65.0	79.3	63.0	56	53.0	65.2
Open-source	qwen2-1.5b-instruct	73	74.1	68.0	50.7	54	55.7	62.6
Commercial	minimax-abab5.5s-chat	58	57.0	70.7	56.0	49	57.0	58.0
Open-source	internlm2-chat-1_8b	69	60.7	63.3	46.0	45	39.7	54.0
Open-source	Google gemma-2b-it	56	60.0	60.0	43.3	61	26.3	51.1
Open-source	Qwen1.5-1.8B-Chat	57	58.0	52.7	48.0	46	26.7	48.1
Open-source	qwen2-0.5b-instruct	49	53.3	62.0	36.7	48	35.5	47.4
Open-source	Qwen1.5-0.5B-Chat	44	40.0	60.0	34.7	42	17.2	39.6

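⚖️Raw evaluation data

Includes the evaluation sets for each dimension as well as the models' outputs; see the eval directory of this project

Why build this leaderboard?

Large models are flourishing everywhere, but their quality is uneven. Much media coverage exaggerates, dwells on the positives while dodging the weaknesses, and easily misleads the public. Some companies, for PR purposes, also oversell their own models, claiming at every turn to have "reached chatgpt level" or to be "number one in China". As the saying goes, "laymen watch the spectacle, experts watch the craft": the industry badly needs a breath of fresh air, to set aside the hype, settle down to refine cutting-edge technology, and let real technical strength speak for itself. That requires an open, impartial, and fair LLM evaluation system that lays out the strengths and weaknesses of each model one by one. With it, everyone can grasp the current state of development and the gap to the best international technology, and can see more clearly where future effort should go, without being swept along by waves of capital and public opinion.
For industry, especially companies without the ability to develop large models themselves, understanding the technical limits of LLMs and choosing a model efficiently and purposefully has become particularly important. An open, impartial, and fair LLM evaluation system can provide exactly that support, avoiding reinvented wheels, avoiding unnecessary disputes caused by differing technology stacks, and avoiding people talking past each other.
For LLM developers, including people interested in LLM technology and academics who value practice, comparing the results of different models reflects the effectiveness of the technical routes and methods behind them, which makes for a very useful reference. Models learning from and borrowing from one another helps everyone steer clear of unnecessary pitfalls and avoid the waste of repeating experiments, and contributes to the healthy, efficient development of the whole LLM ecosystem.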

LLM selection and evaluation discussion group

chinese-llm-benchmark's Issues

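Could you give a detailed explanation of each capability?

Especially the classification capability: it leaves me completely puzzled, and I don't understand what this capability actually is.

Please add RWKV models to the evaluation

Could you add an evaluation metric for Function Call (tool calling) capability?

As the title says, could you add an evaluation metric for Function Call (tool calling) capability?
References:
https://gorilla.cs.berkeley.edu/leaderboard.html#api-explorer
https://github.com/codefuse-ai/codefuse-devops-eval/blob/main/resources/tool_learning_evalution.md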
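Re-test of the new Wenxin Yiyan version

Thanks for the hard work. According to the release notes at https://yiyan.baidu.com/version , it is now at version 2.5.1. Could you re-test it?

Great evaluation. May the test data be reposted?

Why is bing not included?

I can provide a test site, a mirror I built myself: ai.cnvvg.com. No VPN, no ads, and no account required; the project owner is welcome to try it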

Could you evaluate StepFun's model series?

Text models
- step-1-8k (8k context size)
- step-1-32k
- step-1-128k
- step-1-256k
- step-2-16k
- step-1-flash (8k context size)
Multimodal models
- step-1v-8k (8k context size)
- step-1v-32k

Pricing:

Model	Billing unit	Input price	Output price
step-1-8k	1M tokens	5 yuan	20 yuan
step-1-32k	1M tokens	15 yuan	70 yuan
step-1-128k	1M tokens	40 yuan	200 yuan
step-1-256k	1M tokens	95 yuan	300 yuan
step-2-16k	1M tokens	38 yuan	120 yuan
step-1-flash	1M tokens	1 yuan	4 yuan
step-1v-8k	1M tokens	5 yuan	20 yuan
step-1v-32k	1M tokens	15 yuan	70 yuan

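The leaderboard for open-source models under 10B is unreliable

Hello, I think there is a problem with your ranking of open-source models under 10B: baichuan2-7b-chat is not stronger than chatglm3-6b.
I use chatglm3-6b-32k on a single 4090; deployed offline, inference takes 12GB of GPU memory, and the fastest single inference takes about 1 second.
I downloaded Baichuan2-7B-Chat from modelscope. Loading the model with AutoModel.from_pretrained(model_dir, trust_remote_code=True, device_map="auto").eval() runs out of GPU memory. Loading with AutoModel.from_pretrained(model_dir, trust_remote_code=True, device_map="auto").half().cuda().eval() works, and inference is noticeably faster than the former, with almost no visible latency. But its answers are noticeably less accurate than the former's.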

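A prime number is an integer such as 2, 3, 5, 7, 11 that cannot be divided evenly by any integer other than 1 and itself. Please output all prime numbers below 100. Of course! Here are all the prime numbers below 100: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47.
2024-04-28 01:04:42.687074 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> A narcissistic number is a three-digit number for which the sum of the cubes of its digits equals the number itself. Please output all narcissistic numbers
2024-04-28 01:04:51.827511 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
A narcissistic number is a three-digit number for which the sum of the cubes of its digits equals the number itself. Please output all narcissistic numbers Of course. First we need to know what a narcissistic number is: it is an N-digit number (1 <= N <= 9) for which the sum of the N-th powers of its digits equals the number itself; for example: 2^3+4^3, 7^5, etc. are narcissistic numbers; whereas 6^8 is not, because (6 ^ 3) × (6)^0≠6^{8}. Now let's find all the three-digit numbers that satisfy the condition!
2024-04-28 01:04:53.808959 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> Who wrote "Lush grass on the plain, withering and flourishing once each year; wildfire cannot burn it all, and the spring breeze brings it back to life"
2024-04-28 01:05:06.036772 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Who wrote "Lush grass on the plain, withering and flourishing once each year; wildfire cannot burn it all, and the spring breeze brings it back to life" These are lines from "Farewell on the Ancient Grassland" (赋得古原草送别) by the Tang dynasty poet Bai Juyi. The poem depicts the natural scenery of the grassland and the cycle of life, expressing the author's praise of vitality and his reluctance at parting.
2024-04-28 01:05:06.939209 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> In ancient times, without matches or lighters, how did people make fire
2024-04-28 01:05:15.458515 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
In ancient times, without matches or lighters, how did people make fire Ancient people had many ways of making fire. One common way was to use friction to produce sparks that ignite wood, grass, and other combustibles. This method usually requires rubbing two hard objects (such as wooden sticks or stones) against each other quickly until sparks appear between them. The sparks are then placed on flammable material to start the burning process. Another method was to use an existing flame for ignition; for example, obtaining a fire source by lighting dry twigs or scorching a rope woven from grass, and then transferring it to other materials to burn. There were also special tools and techniques that helped ancient people make fire more effectively: for example, a stone crossbow could shoot flaming arrows; and tools such as the "fire striker" could produce sparks by scraping a metal plate, among other methods of obtaining fire <|im_end|
2024-04-28 01:05:18.786021 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> Please retell The Peacock Flies Southeast
2024-04-28 01:05:27.877953 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Please retell The Peacock Flies Southeast "The Peacock Flies Southeast" is a famous poem among the ancient Han yuefu poems, also known as "The Peacock Flies Southeast" or "An Old Poem Written for the Wife of Jiao Zhongqing". The poem tells the tragic love story of Jiao Zhongqing and Liu Lanzhi, and the resilience and loyalty they showed in the face of family and social pressure. The poem depicts beautiful natural scenery and a longing for a good life, and also reflects issues such as the customs, moral values, and interpersonal relationships of the society of the time. I hope this helps!
2024-04-28 01:05:29.777027 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> docker
2024-04-28 01:05:41.943137 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
dockerI'm glad to help you! What can I do for you? If it is related to Docker, please tell me your problem and details so that I may better assist you.
2024-04-28 01:05:42.766591 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> docker import
2024-04-28 01:05:52.707592 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
docker importSorry, I'm not familiar with the command "import docker" in this context; can you provide more information?
2024-04-28 01:05:53.308849 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> You are now a forensic analysis expert and have obtained a set of WeChat chat data, shown below. Based on this chat record, determine whether bribery took place and give your reasoning steps. A (a company representative) and B (a government official) are chatting on WeChat. The chat record is as follows: A: Brother B, have you been busy lately? It's been a while since we were in touch, and I'd like to take you out for a meal to catch up. B: Haha, Little Brother A, I really have been quite busy lately. But there's always time for a meal. Tell me, when? A: That's great, how about next Wednesday evening? There's a newly opened restaurant with a nice atmosphere and great food. B: Sure, next Wednesday it is. But come to think of it, you're not looking for me just for a meal this time, are you? A: Brother B, you're sharp; there is indeed a small matter I'd like your advice on. Our company recently wants to bid on a project, and I heard you are in charge of it, so... B: Oh, I see. As for the project, I can help, but as you know, competition is fierce these days, so there has to be some token of appreciation. A: Rest assured, Brother B, our company always does things by the rules. We have already prepared fine liquor and cigarettes; all that's left is to meet you. B: Little Brother A, I do know you well. But with scrutiny so tight these days, we have to be careful. A: I understand; I won't put you in a difficult position, Brother B. We can talk in private, and I guarantee no trace will be left. B: Good, then it's settled. Let's talk in detail when we meet next Wednesday.
2024-04-28 01:06:37.530461 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
You are now a forensic analysis expert and have obtained a set of WeChat chat data, shown below. Based on this chat record, determine whether bribery took place and give your reasoning steps. A (a company representative) and B (a government official) are chatting on WeChat. The chat record is as follows: A: Brother B, have you been busy lately? It's been a while since we were in touch, and I'd like to take you out for a meal to catch up. B: Haha, Little Brother A, I really have been quite busy lately. But there's always time for a meal. Tell me, when? A: That's great, how about next Wednesday evening? There's a newly opened restaurant with a nice atmosphere and great food. B: Sure, next Wednesday it is. But come to think of it, you're not looking for me just for a meal this time, are you? A: Brother B, you're sharp; there is indeed a small matter I'd like your advice on. Our company recently wants to bid on a project, and I heard you are in charge of it, so... B: Oh, I see. As for the project, I can help, but as you know, competition is fierce these days, so there has to be some token of appreciation. A: Rest assured, Brother B, our company always does things by the rules. We have already prepared fine liquor and cigarettes; all that's left is to meet you. B: Little Brother A, I do know you well. But with scrutiny so tight these days, we have to be careful. A: I understand; I won't put you in a difficult position, Brother B. We can talk in private, and I guarantee no trace will be left. B: Good, then it's settled. Let's talk in detail when we meet next Wednesday. First, it must be made clear that the "project" mentioned in this record may involve bribery; second, one needs to understand the relevant legal provisions and the methods of evidence collection; finally, one should analyze this information to infer whether bribery exists, to what degree, and other such specifics
2024-04-28 01:06:38.523799 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> Count from 1 to 10000
2024-04-28 01:07:29.050190 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Count from 1 to 10000 OK, I will help you count from one number (for example: 5) up to another number. Please tell me what the starting and ending numbers are?
2024-04-28 01:07:29.715960 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> 1 10000
2024-04-28 01:07:40.519512 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
1 10000I'm glad to help you! What can I do for you?
2024-04-28 01:07:40.910713 end predict
Hello, I am Qwen1.5-MoE-A2.7B >>> The first three chapters of Water Margin
2024-04-28 01:07:53.732802 begin predict
Both max_new_tokens (=2048) and max_length(=1024) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
The first three chapters of Water Margin Of course, I'd be happy to help you. Please tell me what specifically you would like to know about the first three chapters of "Water Margin"?
2024-04-28 01:07:54.255101 end predict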

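Why did Hua Tuo speak so highly of the Bencao Gangmu (Compendium of Materia Medica)?
Why was Beethoven so surprised when he saw Chopin's nocturnes?
Wenxin and Spark will lead you straight into the ditch.

If you ask: why did Huang Gai beat Zhou Yu?
Wenxin and Spark simply assume you are asking: why did Zhou Yu beat Huang Gai.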
