Comments (12)
Hi @okhat , I believe the problem may come from the extract
function in template_v2.py
(https://github.com/stanfordnlp/dsp/blob/12d7f1106f2f524be161d35865a81e0ea016e929/dsp/templates/template_v2.py#L131).
The function takes an example, and overwrites example.answer
if an answer is found. However, in some case, the response from LM is malformed. For example, when we add rationale
to our template, we expect the LM to follow Answer: {LM's answer}
format. but we do not enforce it.For example, here is one possible output generated from LM:
Context:
[1] «BTC-e | Gina Esparon and Evaline Sophie Joubert and two people with significant control: Alexander Buyanov and Andrii Shvets. The US Justice Dept attempted to close down BTC-e on the 26th of July 2017 when they charged Alexander Vinnik and BTC-e in a 21-count indictment for operating an alleged international money laundering scheme and allegedly laundering funds from the hack of Mt. Gox. The exchange plans to relaunch in September 2018. BTC-e started in July 2011, handling a few coin pairs, including Bitcoin/U. S. dollar and I0Coin to Bitcoin. By October 2011, they supported many different currency pairs, including Litecoin to dollars,»
[2] «Andy Harris (politician) | Furthermore, "Harris then asked if he could purchase insurance from the government to cover the gap," added an aide, who was struck by the similarity to Harris's request and the public option he denounced as a gateway to socialized medicine. On October 16, 2013, Harris voted against the motion to end the government shutdown and raise the debt ceiling. Harris has defended Hungarian leader Viktor Orbán, an ally of Vladimir Putin who has pledged to turn his country into an "illiberal democracy". Harris decried the State Departments' efforts to support free and independent media in the country. In 2014, Harris»
[3] «BuysUSA.com | a Mercedes and other luxury vehicles with his profits. The Federal Bureau of Investigation investigated the site and was ultimately responsible for shutting down the site permanently in October 2005. On Friday, August 25, 2006, Danny Ferrer appeared before U.S. District Court Judge T.T. Ellis III in Alexandria, Virginia. He was told, "You extended your hand into the pockets of these people," and that, "If severe penalties were not attached, people would line up from here to Los Angeles to do what you've done." Danny Ferrer claimed to the judge that he started selling the software to pay for a»
[4] «United States federal government shutdown of 1990 | the Smithsonian museums. Although none of the appropriations bills had been passed, not all government agencies actually shut down. Full shutdowns occurred in the Environmental Protection Agency, Department of Labor, Office of Personnel Management, and Department of Housing and Urban Development, and partial shutdowns occurred in the Library of Congress, Government Printing Office, and the Departments of Energy, Interior, and State. According to a study by the General Accounting Office, Interior (which includes the National Park Service) furloughed about 2,800 workers, the Library of Congress around 100, and the other agencies fewer than 10 each. In the morning of Tuesday,»
[5] «Jinkanpo Atsugi Incinerator | the poor air quality. In May 2001, the Japanese government purchased the plant for nearly 40 million dollars and shut it down following a United States Department of Justice lawsuit against the private incinerator owner. Dismantling was completed by the end of that year. Some former residents of Atsugi NAF still complain of health problems related to the incinerator's emissions and report that the USN has been reluctant to address their concerns. The incinerator contaminated the base, especially the housing area, with dioxin, heavy metals, and other deadly toxins. In June, 2007, the USN's Environmental Health Center announced that it»
Question: What government department did Buyantu shut down?
Rationale: Let's think step by step. There is no mention of Buyantu shutting down a government department in the provided context. Therefore, the answer is: N/A.
In this case, the template fails to extract N/A
from the answer, and leaves example.answer
as is.
In dataset like SQUAD
, example.answer
is usually a list, so that's the reason why completion.answer
is a list.
My guess comes from the following observation. Here is a code snippet from my CS224U homework:
test_case = dev_exs[56] # an entry of dev set of SQUAD
print("Question:", test_case.question)
print("Expected answer:", test_case.answer)
print(“Generated answer:”, few_shot_openqa_custom(test_case, 3).answer)
And this is the output:
Question: What government department did Buyantu shut down?
Expected answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']
Generated answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']
If you want, I can write a PR to set answer
to an empty string if template fails to extract an answer. Otherwise, this may lead to overestimation of performance during evaluation.
from dspy.
Thanks a lot for catching this! Do you have an example of where that might happen?
from dspy.
from dspy.
Hmm, thanks! We do already do CoT+SC with dsp.majority in the intro notebook.
Here's a simplified version.
@dsp.transformation
def QA_predict(example: dsp.Example):
example, completions = dsp.generate(qa_template_with_CoT, n=20, temperature=0.7)(example, stage='qa')
completions = dsp.majority(completions)
return example.copy(answer=completions.answer)
The only difference I see is that we extract the answer at the end. Is there anyother difference?
from dspy.
Maybe you're saying we have some regression. The same code from the intro notebook no longer works on the latest branch?
from dspy.
I think I know what the problem is. text-davinci-001
sometimes returns a List[str]
, whereas gpt-3.5-turbo
or text-davinci-002
consistently returns str
in my experiments. You could try running the intro notebook with text-davinci-001.
I guess either specifying that the code does not support certain GPT models, or adding codes to handle the different LM behaviors could work.
If you want to go down the second path, the evaluate
would probably need to change as well.
from dspy.
If your PR fixes this, I'm happy to merge it. But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?
from dspy.
If your PR fixes this, I'm happy to merge it.
Please don't merge just yet. As I mentioned, the evaluate
function has similar problems. If you want to fix it, it's better to fix both.
But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?
Since completions: list[dict[str, Any]] = generator(prompt, **kwargs)
at here does not check for the type of the dict values, a completion in completions could have completion[prediction_field]
being multiple types. Different LMs do fill that prediction_field with different types. And downstream functions such as majority
does not take this into consideration.
I'm on a $5 budget for openai calls, so I'm afraid I won't be able to help you test more cases.
from dspy.
Thanks for reporting this and helping with it. We haven’t been able to reproduce it unfortunately. cc: @lawliet19189
from dspy.
Thanks @tsunrise ! I strongly recommend not passing a populated .answer field to your program. Otherwise you risk unintended leakage one way or another. Luckily, your evaluation function seems to expect a string and will raise an exception, I believe, if you try feeding it a list, so hopefully this ensures no leakage in your specific case.
If you remove that field before calling generate, dsp.generate will invoke the LLM as many times as needed to make sure all output fields are produced.
A PR that unsets output fields that are not overwritten by extract will be highly welcome, although the correct semantics in this case seem ambiguous: what does the user expect when some of the output fields are already set?
In general, always make sure to pass only the questions (or example objects with only questions set) to your program. (Or, at least, unset the output values you expect dsp.generate to produce, when there are multiple output fields.)
from dspy.
Thanks @tsunrise ! I strongly recommend not passing a populated .answer field to your program. Otherwise you risk unintended leakage one way or another. Luckily, your evaluation function seems to expect a string and will raise an exception, I believe, if you try feeding it a list, so hopefully this ensures no leakage in your specific case.
If you remove that field before calling generate, dsp.generate will invoke the LLM as many times as needed to make sure all output fields are produced.
A PR that unsets output fields that are not overwritten by extract will be highly welcome, although the correct semantics in this case seem ambiguous: what does the user expect when some of the output fields are already set?
In general, always make sure to pass only the questions (or example objects with only questions set) to your program. (Or, at least, unset the output values you expect dsp.generate to produce, when there are multiple output fields.)
Sounds good! I have created a PR for that
from dspy.
Thanks for opening this, this got fundamentally resolved in v2 (DSPy). Closing.
from dspy.
Related Issues (20)
- KeyError: 'content' error trying rag with weaviate and ollama HOT 2
- Using llama-3-70B via Groq DSPy, I found it sends invalid arguments for COPRO optimizer HOT 1
- ValueError: Required 'max_tokens' or 'max_output_tokens' not specified in settings when using meta.llama2-13b-chat-v1 in AWS SageMaker HOT 1
- Getting this backing off problem when I use COPRO HOT 1
- Can't load Llama Model HOT 7
- Add Support to FAISS Retiever! HOT 2
- MultiChainComparison input from ChainOfThought HOT 3
- Update Google and Google VertexAI for Gemini 1.5 Pro / Flash HOT 2
- How to optimize only a part of instruction? HOT 1
- GROQ client not available in packages HOT 2
- Predict forward() does not use lm in kwargs HOT 1
- DSPY Copro Tutorial for Hotpot QA doesn't work with Bedrock Claude Sonnet Model HOT 4
- ETA on new release? HOT 2
- RAG.forward() missing 1 required positional argument: 'question' HOT 5
- Enhancement: Would be great if we can load dspy dataset directly from pandas dataframe
- Clarification on the use of hard-written output field prompt in TextToSQLAnswer Signature HOT 1
- dspy and langfuse integration HOT 6
- about openai version
- Best practice for train and validation set separation HOT 2
- Triggering the Azure OpenAi (GPT 4) Triggers content warning and DSPy exception
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dspy.