
vila-lab / atlas


A principled instruction benchmark on formulating effective queries and prompts for large language models (LLMs). Our paper: https://arxiv.org/abs/2312.16171

License: Apache License 2.0

Python 100.00%

atlas's People

Contributors

aidarmyrzakhan, eltociear, nnwhisperer, shandelier, sondosbsharat, szq0214


atlas's Issues

Numerous egregious issues with this paper

Here's a list of issues others and I have found with your paper, code, data, methodology, and experiment design:

  1. Issues pertaining to overall experiment design and methodology

    • Quality vs. Correctness Discrepancy - How exactly do you differentiate between these two metrics, given that the correctness of a response is correlated with its overall quality? E.g. how is it possible that you claim significant improvements in correctness while seeing only a partial improvement in overall quality (Principles 17, 18, and 19 in particular stick out to me: >60% improvement in correctness but <40% improvement in overall quality; Principle 1 is the most egregious, but that is due to another issue entirely)?
      • Missing Methodology - What exactly are the guidelines by which you measured the quality or correctness of a response, given that both seem subjective and can vary significantly with context? E.g. for Principles 2 & 5, are you assessing quality and correctness from the standpoint of whatever audience you're prompting the LLM to address? What about prompts such as "What were the main global events of 2022?" or "What are the potential future trends in renewable energy?"?
    • Comparative Analysis - Where are the baseline instructions and baseline responses for your comparison?
    • Unlikely Results - Many of the instructions are overly simple tasks where one would expect only marginal improvements, especially for larger models. Specifically, I've noticed that many instructions across different principles (8, 6, 19) are extremely simple, yet there is somehow >50% improvement in correctness? There are also certain prompts where your results cannot be replicated, such as "###Instruction###\nTranslate a given word from English to French.\n### Question ###\nWhat is the French word for "book"?" on Llama7b.
    • Choice of Model - Why did you use the base models for your small and medium sizes but dialogue/preference-tuned models for your large models? Given that models with an entirely different architecture and training format were used, why compare base models against tuned ones when alternative 70b baselines (WizardLM, Orca) exist? Furthermore, there is a massive gap in parameter count within the large class: 70b vs. 200b vs. 1t+. All of this makes me extremely dubious of your findings, given that the majority of the performance gap between the size classes in your paper can be explained simply by the large models being tuned and having far more parameters. This shows in your detailed percentages: there is a massive gap between GPT-4 and all other models simply because it has >1t parameters.
    • Inconsistent Handling of Responses - Why did you prompt the GPT-3.5 and GPT-4 models 10 times while prompting the open-source models only once? How did you even choose which response to use? Was this treatment consistent with how you generated the baseline (I won't take your word for this one, given the numerous flaws and errors I've observed so far)? If not, how are your results not biased (already biased IMO, given the lack of evaluation guidelines combined with your model choice)?
    • Misc - Was your evaluation done blind? Did the evaluators know which prompts were the baseline and which were the principled ones? Who evaluated these results?
  2. Issues pertaining to code, implementation, and the actual data

    • Unprincipled Prompts - For Principle 1, which was "No need to be polite with LLM so there is no need to add phrases like "please"", anyone who bothered to even take a look could see that none of your instructions follows your own principle. All of them are polite, yet you somehow see a difference in both quality AND correctness? How is this even possible, and what were the baselines for this principle that produced these improvements?
    • Literally Impossible Data - Based on the generate.py code you've released, it is literally impossible to generate the responses shown for Prompt 14, since all you're doing is calling the model with the same prompt on every iteration, without ever updating it with the model's follow-up questions or the user's responses (a sketch of what an interactive loop would actually need to do follows this list):

      ATLAS/generate.py, lines 40 to 43 at commit 03511d3:

      for _ in range(10):              # q is never modified inside the loop
          a = generate_answers(q)
          questions.extend(q)
          answers.extend(a)

      Furthermore, using these clearly fabricated responses, you claim to have somehow achieved 100% improvement across all three model sizes? Really?
    • Inconsistencies between Code and Data Format - In the code, the output is written without the model's name, yet in the data all of the models' names are magically filled in?
      qa_pairs = [{"instruction": q, "output": a} for q, a in zip(questions, answers)]
      How can you actually guarantee the data comes from the model you claim it does, given that you clearly modified the data with external code?
    • Inconsistencies between Data and Paper - In the paper you claimed to have used Llama-70b-chat; why isn't this reflected in your data?
    • Missing Data - I noticed that correctness data for Principles 14, 15, 21, 22, and 23 was outright omitted from the paper. Why is this the case?
    • Mixing of Principles - I won't even bother citing direct examples for this; many of your instructions mix CoT with whatever principle the instruction is supposedly for.
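
For reference, here is a rough sketch of my own (not from the repo) of the kind of multi-turn loop that the Prompt 14 responses would actually require: the conversation history has to grow with the model's follow-up questions and the user's replies, unlike the fixed-prompt loop quoted above. This assumes the current openai Python client; collect_user_reply is a made-up stand-in for however the user's answers would be gathered.

      from openai import OpenAI

      client = OpenAI()

      def collect_user_reply(model_message: str) -> str:
          # Stand-in for however the user's answer would actually be collected.
          return input(f"{model_message}\n> ")

      def generate_interactive_answer(question: str, max_turns: int = 5) -> str:
          # The history accumulates both sides of the exchange;
          # re-sending the same static prompt 10 times cannot produce this.
          messages = [{"role": "user", "content": question}]
          for _ in range(max_turns):
              reply = client.chat.completions.create(
                  model="gpt-3.5-turbo",
                  messages=messages,
              ).choices[0].message.content
              messages.append({"role": "assistant", "content": reply})
              if "?" not in reply:  # crude heuristic: the model stopped asking questions
                  return reply
              messages.append({"role": "user", "content": collect_user_reply(reply)})
          return messages[-1]["content"]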

There are significant issues with your paper that make your findings "dubious", to say the least. Was this written by freshman undergrads over two to three weeks? The paper comes off as sloppy, and the way it is written makes me think the authors were trying to fill pages without regard for the quality of the content. Almost a fifth of the pages are devoted just to the Gemini and GPT-4 references, and no other (decent) paper that cites either of them does so in this manner. I get that this was released on arXiv, but how such glaring flaws weren't caught by your advisor is honestly beyond me.

Will there be an update to the paper for the newer GPT models like GPT-4o?

I found this paper fantastic. It seems to be a great way to eliminate hallucinations. However, with the new GPT-4o model, some of these principles don't seem to be as effective anymore. Has there been any thought about updating the paper to reflect these newer and more cost-effective models?

Mistake in graphic?

The graphic lists P21 under "User Interaction and Engagement" as "Style-Consistency", but P21 is "Detailed Writing" under "Specificity and Information" and also in the tables in the paper.

If P21 is referenced from two sections, it should be "Detailed Writing" in both. Otherwise it is unclear what "Style-Consistency" is and how it would differ from P22, "Preserve-Style".

Where are the principles?

I am looking at this project, but I don't see any of the principles you describe, or the modified prompts. I'm pretty confused, and there are no instructions for running generate.py yourself, since that requires installing the OpenAI library.

All I see are the written out instructions + outputs from models. But where are the modified prompts for each principle? Am I missing that?

Let's build better tools together 🧰

I found my people, maybe, hopefully: let's build better tools 🛠️🧰🚀
I am limited by my tools. I need more big brains to help implement my ideas 🚀🧠 together 🛤️🚀
@techbrotino

Extend the repository for self-benchmark

Hello!

I've found your paper very helpful for my studies, but I'd like to run some of these benchmarks myself. Out of the box, the only script provided expects principle files in plain text (which aren't included) rather than the provided JSON, and it depends on the OpenAI library. It would be great to have some resources for running these benchmarks against a variety of different models (such as the Phi models), which could be implemented with a universal library like litellm or something similar; a rough sketch of what I have in mind is below.
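
To make this concrete, here is a rough illustrative sketch of my own of a model-agnostic runner built on litellm. It assumes the principle files follow the {"instruction": ..., "output": ...} layout that generate.py writes out; the file name and model strings below are placeholders.

      import json
      from litellm import completion

      def regenerate(principle_file: str, model: str) -> list[dict]:
          # Re-run every instruction in a principle JSON file against an arbitrary model.
          with open(principle_file) as f:
              pairs = json.load(f)
          results = []
          for pair in pairs:
              response = completion(
                  model=model,
                  messages=[{"role": "user", "content": pair["instruction"]}],
              )
              results.append({
                  "instruction": pair["instruction"],
                  "original_output": pair["output"],
                  "new_output": response.choices[0].message.content,
                  "model": model,
              })
          return results

      if __name__ == "__main__":
          # Both the file name and the model string are placeholders, e.g.
          # regenerate("principle_14.json", "ollama/phi3") for a local Phi model.
          print(json.dumps(regenerate("principle_14.json", "gpt-3.5-turbo"), indent=2))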
