tjunlp-lab / awesome-llms-evaluation-papers

The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.

awesome-llms-evaluation-papers's Issues

Add "How Important are Good Method Names in Neural Code Generation? A Model Robustness Perspective." in Robustness Evaluation

Hi,

Please note our recent paper on this topic, "How Important are Good Method Names in Neural Code Generation? A Model Robustness Perspective" :)

Cheers

Could you add PandaLM to your survey?

PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. PandaLM is the first to evaluate LLMs using a fine-tuned LLM.

Can you add our recent work to your survey?

Hi,

I have read your insightful paper and found it to be a valuable contribution to the field.

I would like to kindly suggest adding our recent work to your survey.

📄 Paper : Ask Again, Then Fail: Large Language Models' Vacillations in Judgement

This paper shows that the judgement consistency of LLMs decreases dramatically when they are confronted with disruptions such as questioning, negation, or misleading follow-ups, even when their earlier judgements were correct. It also explores several prompting methods to mitigate this issue and demonstrates their effectiveness.

Thank you for your consideration! :)

What is the provenance of the WGlaw dataset?

http://openeval.org.cn/dataset_detail?datasetname=WGlaw

SeaEval: Multilingual LLM Evaluation

Please note our paper on evaluation, which could be an important building block for multilingual evaluation and cultural understanding.

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning

Why do we list inaccessible benchmarks?

Where are the results of these two benchmarks for domain evaluation in the OpenEval leaderboard?

Add License

Hi all,

This is an amazing resource! Thanks for sharing it openly! I was just wondering under which conditions one could reuse the materials provided here. Would you mind adding a license file? If you're new to licensing or wonder which license to use, you can read more in this blog post: https://focalplane.biologists.com/2023/05/06/if-you-license-it-itll-be-harder-to-steal-it-why-we-should-license-our-work/

Thanks!

Best,
Robert

RAGAS: Automated Evaluation of Retrieval Augmented Generation

in 📚Knowledge and Capability Evaluation -> Question Answering
RAGAS: Automated Evaluation of Retrieval Augmented Generation
paper
RAGAS framework on GitHub

@jjmachan
@shahules786

Which metrics are chosen for the leaderboard?

http://openeval.org.cn/doc?id=2

I cannot find any further explanation of the chosen metrics.

The leaderboard is missing from the page...

Go to http://openeval.org.cn/rank and click the 知识能力 (intellectual ability) option; the leaderboard does not show up.

Code-Related Benchmarks

Hi TJUNLP team, thanks so much for the great work! I think a paper that presents a holistic view of current NLP benchmarks is relevant amidst the many ongoing efforts.

To this end, I'd like to point out a couple of works on evaluating language models on coding-related tasks, such as completion, patch generation, and language agents that use code as actions.

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation; paper, site
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?; paper, site
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback; paper, site

Thanks in advance!

Add SpyGame

Hi there,

Thanks for the effort in putting up this repo on LLMs evaluation.

I'd like to suggest adding our work, SpyGame, a framework for evaluating language model intelligence. We propose using word guessing games to assess the language and theory-of-mind intelligence of LLMs.

Paper: Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models
GitHub: https://github.com/Skytliang/SpyGame