Coder Social home page Coder Social logo

text-davinci-001: majority_vote_ function does not handle cases where pred[prediction_field] is a list of str instead of str about dspy HOT 12 CLOSED

jupiterepoch avatar jupiterepoch commented on June 24, 2024
text-davinci-001: majority_vote_ function does not handle cases where pred[prediction_field] is a list of str instead of str

from dspy.

Comments (12)

tsunrise avatar tsunrise commented on June 24, 2024 1

Hi @okhat , I believe the problem may come from the extract function in template_v2.py (https://github.com/stanfordnlp/dsp/blob/12d7f1106f2f524be161d35865a81e0ea016e929/dsp/templates/template_v2.py#L131).

The function takes an example, and overwrites example.answer if an answer is found. However, in some case, the response from LM is malformed. For example, when we add rationale to our template, we expect the LM to follow Answer: {LM's answer} format. but we do not enforce it.For example, here is one possible output generated from LM:

Context:
[1] «BTC-e | Gina Esparon and Evaline Sophie Joubert and two people with significant control: Alexander Buyanov and Andrii Shvets. The US Justice Dept attempted to close down BTC-e on the 26th of July 2017 when they charged Alexander Vinnik and BTC-e in a 21-count indictment for operating an alleged international money laundering scheme and allegedly laundering funds from the hack of Mt. Gox. The exchange plans to relaunch in September 2018. BTC-e started in July 2011, handling a few coin pairs, including Bitcoin/U. S. dollar and I0Coin to Bitcoin. By October 2011, they supported many different currency pairs, including Litecoin to dollars,»
[2] «Andy Harris (politician) | Furthermore, "Harris then asked if he could purchase insurance from the government to cover the gap," added an aide, who was struck by the similarity to Harris's request and the public option he denounced as a gateway to socialized medicine. On October 16, 2013, Harris voted against the motion to end the government shutdown and raise the debt ceiling. Harris has defended Hungarian leader Viktor Orbán, an ally of Vladimir Putin who has pledged to turn his country into an "illiberal democracy". Harris decried the State Departments' efforts to support free and independent media in the country. In 2014, Harris»
[3] «BuysUSA.com | a Mercedes and other luxury vehicles with his profits. The Federal Bureau of Investigation investigated the site and was ultimately responsible for shutting down the site permanently in October 2005. On Friday, August 25, 2006, Danny Ferrer appeared before U.S. District Court Judge T.T. Ellis III in Alexandria, Virginia. He was told, "You extended your hand into the pockets of these people," and that, "If severe penalties were not attached, people would line up from here to Los Angeles to do what you've done." Danny Ferrer claimed to the judge that he started selling the software to pay for a»
[4] «United States federal government shutdown of 1990 | the Smithsonian museums. Although none of the appropriations bills had been passed, not all government agencies actually shut down. Full shutdowns occurred in the Environmental Protection Agency, Department of Labor, Office of Personnel Management, and Department of Housing and Urban Development, and partial shutdowns occurred in the Library of Congress, Government Printing Office, and the Departments of Energy, Interior, and State. According to a study by the General Accounting Office, Interior (which includes the National Park Service) furloughed about 2,800 workers, the Library of Congress around 100, and the other agencies fewer than 10 each. In the morning of Tuesday,»
[5] «Jinkanpo Atsugi Incinerator | the poor air quality. In May 2001, the Japanese government purchased the plant for nearly 40 million dollars and shut it down following a United States Department of Justice lawsuit against the private incinerator owner. Dismantling was completed by the end of that year. Some former residents of Atsugi NAF still complain of health problems related to the incinerator's emissions and report that the USN has been reluctant to address their concerns. The incinerator contaminated the base, especially the housing area, with dioxin, heavy metals, and other deadly toxins. In June, 2007, the USN's Environmental Health Center announced that it»

Question: What government department did Buyantu shut down?

Rationale: Let's think step by step. There is no mention of Buyantu shutting down a government department in the provided context. Therefore, the answer is: N/A. 

In this case, the template fails to extract N/A from the answer, and leaves example.answer as is.

In dataset like SQUAD, example.answer is usually a list, so that's the reason why completion.answer is a list.

My guess comes from the following observation. Here is a code snippet from my CS224U homework:

test_case = dev_exs[56] # an entry of dev set of SQUAD 
print("Question:", test_case.question)
print("Expected answer:", test_case.answer)
print(“Generated answer:”, few_shot_openqa_custom(test_case, 3).answer)

And this is the output:

Question: What government department did Buyantu shut down?
Expected answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']
Generated answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']

If you want, I can write a PR to set answer to an empty string if template fails to extract an answer. Otherwise, this may lead to overestimation of performance during evaluation.

from dspy.

okhat avatar okhat commented on June 24, 2024

Thanks a lot for catching this! Do you have an example of where that might happen?

from dspy.

jupiterepoch avatar jupiterepoch commented on June 24, 2024

image

from dspy.

okhat avatar okhat commented on June 24, 2024

Hmm, thanks! We do already do CoT+SC with dsp.majority in the intro notebook.

Here's a simplified version.

@dsp.transformation
def QA_predict(example: dsp.Example):
    example, completions = dsp.generate(qa_template_with_CoT, n=20, temperature=0.7)(example, stage='qa')
    completions = dsp.majority(completions)
    
    return example.copy(answer=completions.answer)

The only difference I see is that we extract the answer at the end. Is there anyother difference?

from dspy.

okhat avatar okhat commented on June 24, 2024

Maybe you're saying we have some regression. The same code from the intro notebook no longer works on the latest branch?

from dspy.

jupiterepoch avatar jupiterepoch commented on June 24, 2024

I think I know what the problem is. text-davinci-001 sometimes returns a List[str], whereas gpt-3.5-turbo or text-davinci-002 consistently returns str in my experiments. You could try running the intro notebook with text-davinci-001.

I guess either specifying that the code does not support certain GPT models, or adding codes to handle the different LM behaviors could work.

If you want to go down the second path, the evaluate would probably need to change as well.

from dspy.

okhat avatar okhat commented on June 24, 2024

If your PR fixes this, I'm happy to merge it. But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?

from dspy.

jupiterepoch avatar jupiterepoch commented on June 24, 2024

If your PR fixes this, I'm happy to merge it.

Please don't merge just yet. As I mentioned, the evaluate function has similar problems. If you want to fix it, it's better to fix both.

But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?

Since completions: list[dict[str, Any]] = generator(prompt, **kwargs) at here does not check for the type of the dict values, a completion in completions could have completion[prediction_field] being multiple types. Different LMs do fill that prediction_field with different types. And downstream functions such as majority does not take this into consideration.

I'm on a $5 budget for openai calls, so I'm afraid I won't be able to help you test more cases.

from dspy.

okhat avatar okhat commented on June 24, 2024

Thanks for reporting this and helping with it. We haven’t been able to reproduce it unfortunately. cc: @lawliet19189

from dspy.

okhat avatar okhat commented on June 24, 2024

Thanks @tsunrise ! I strongly recommend not passing a populated .answer field to your program. Otherwise you risk unintended leakage one way or another. Luckily, your evaluation function seems to expect a string and will raise an exception, I believe, if you try feeding it a list, so hopefully this ensures no leakage in your specific case.

If you remove that field before calling generate, dsp.generate will invoke the LLM as many times as needed to make sure all output fields are produced.

A PR that unsets output fields that are not overwritten by extract will be highly welcome, although the correct semantics in this case seem ambiguous: what does the user expect when some of the output fields are already set?

In general, always make sure to pass only the questions (or example objects with only questions set) to your program. (Or, at least, unset the output values you expect dsp.generate to produce, when there are multiple output fields.)

from dspy.

tsunrise avatar tsunrise commented on June 24, 2024

Thanks @tsunrise ! I strongly recommend not passing a populated .answer field to your program. Otherwise you risk unintended leakage one way or another. Luckily, your evaluation function seems to expect a string and will raise an exception, I believe, if you try feeding it a list, so hopefully this ensures no leakage in your specific case.

If you remove that field before calling generate, dsp.generate will invoke the LLM as many times as needed to make sure all output fields are produced.

A PR that unsets output fields that are not overwritten by extract will be highly welcome, although the correct semantics in this case seem ambiguous: what does the user expect when some of the output fields are already set?

In general, always make sure to pass only the questions (or example objects with only questions set) to your program. (Or, at least, unset the output values you expect dsp.generate to produce, when there are multiple output fields.)

Sounds good! I have created a PR for that

from dspy.

okhat avatar okhat commented on June 24, 2024

Thanks for opening this, this got fundamentally resolved in v2 (DSPy). Closing.

from dspy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.