Evaluate new LLMs on relevant Tasks
- TODO: Make the Summarization and Verify prompts work for both llama-3 and qwen2
- Q: Do we need a per-LLM prompt override? Probably yes (see the prompt-override sketch after this list)
- TODO: How to eval? With GPT-4 as a judge? (see the judge sketch after this list)
- Run the same tasks via the OpenAI API, in addition to vLLM (see the backend sketch after this list): GPT-4o --> GPT eval judge -->
- Problem: the GT in summarization and long-generation tasks is not well defined; GPT-4 and Opus may disagree and still both be right, in their respective ways. See the rubric-based judging sketch after this list.
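A minimal sketch of one way to do per-LLM prompt overrides: a shared default prompt per task, with optional per-model replacements. All names here (`DEFAULT_PROMPTS`, `MODEL_PROMPT_OVERRIDES`, `get_prompt`, and the example model IDs) are illustrative assumptions, not existing code in this repo.

```python
# Hypothetical per-model prompt override table; anything not overridden
# falls back to the shared default for that task.

DEFAULT_PROMPTS = {
    "summarize": "Summarize the following document in 3-5 sentences:\n\n{document}",
    "verify": (
        "Given the claim and the source, answer 'supported' or 'unsupported'.\n\n"
        "Claim: {claim}\nSource: {source}"
    ),
}

MODEL_PROMPT_OVERRIDES = {
    # Example: Llama-3 gets a stricter verify prompt (assumed model ID).
    "meta-llama/Meta-Llama-3-8B-Instruct": {
        "verify": (
            "You are a strict fact checker. Answer only 'supported' or 'unsupported'.\n\n"
            "Claim: {claim}\nSource: {source}"
        ),
    },
    # Example: Qwen2 uses the defaults unchanged (assumed model ID).
    "Qwen/Qwen2-7B-Instruct": {},
}

def get_prompt(model_name: str, task: str, **fields) -> str:
    """Return the prompt for (model, task), preferring a per-model override."""
    template = MODEL_PROMPT_OVERRIDES.get(model_name, {}).get(task, DEFAULT_PROMPTS[task])
    return template.format(**fields)
```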
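For running the same tasks against both a local vLLM model and the OpenAI API, a single `generate()` entry point that switches backends could look roughly like this. The function names and the backend switch are assumptions; only the vLLM and OpenAI client calls are standard library usage.

```python
# Hypothetical common generation interface over vLLM and the OpenAI API.
from openai import OpenAI               # pip install openai
from vllm import LLM, SamplingParams    # pip install vllm

def generate_vllm(model_name: str, prompt: str, max_tokens: int = 512) -> str:
    llm = LLM(model=model_name)  # in practice, build once and reuse across prompts
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text

def generate_openai(model_name: str, prompt: str, max_tokens: int = 512) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    return resp.choices[0].message.content

def generate(backend: str, model_name: str, prompt: str) -> str:
    """backend is 'openai' or 'vllm' (assumed convention)."""
    fn = generate_openai if backend == "openai" else generate_vllm
    return fn(model_name, prompt)
```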
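Because the GT for summarization and long generation is not uniquely defined, one option is to have the GPT-4o judge score each candidate against the source document with a rubric, rather than against a single reference answer. The rubric, scale, and function names below are assumptions; the API call itself uses the standard OpenAI chat-completions client.

```python
# Hypothetical "GPT eval judge" step: rubric-based scoring of a summary
# against its source document instead of a single gold reference.
import json
from openai import OpenAI

JUDGE_PROMPT = """You are grading a summary of a document.
Rate it 1-5 on each axis: faithfulness (no unsupported claims),
coverage (main points included), and fluency.
Return JSON: {{"faithfulness": int, "coverage": int, "fluency": int, "comment": str}}

Document:
{document}

Summary:
{summary}"""

def judge_summary(document: str, summary: str, judge_model: str = "gpt-4o") -> dict:
    client = OpenAI()
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(document=document, summary=summary)}],
        temperature=0.0,
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```

A pairwise variant (judge picks the better of two candidate outputs) would sidestep absolute scoring entirely, but either way the judge's scores should be spot-checked by hand before trusting them for model comparisons.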