Comments (5)

SivilTaram commented on August 22, 2024

@ganler

Following the exact same evaluation setting with bigcode-evaluation-harness, I obtain the following pass@1 result for the 33B model:

Evaluating generations...
{
  "humaneval": {
    "pass@1": 0.5365853658536586
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "deepseek-ai/deepseek-coder-33b-base",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 650,
    "precision": "fp32",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": false,
    "save_generations_path": "generations.json",
    "save_references": false,
    "prompt": "prompt",
    "max_memory_per_gpu": "auto",
    "check_references": false
  }
}

Considering the slightly different operations during post-processing, I think the claimed pass@1 should be reproducible.
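For readers who want to sanity-check the number above, here is a minimal sketch of the commonly used unbiased pass@k estimator. It assumes HumanEval's 164 problems and one greedy sample per problem (`n_samples: 1`); the helper names and the 88-solved breakdown are illustrative assumptions derived from the reported score, not output copied from the harness.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if n - c < k:
        # Fewer than k failing samples, so any k-subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimate over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Assumption: 164 problems, one sample each, 88 of them solved.
# 88 / 164 is approximately 0.5366, consistent with the reported pass@1.
correct_counts = [1] * 88 + [0] * 76
score = mean_pass_at_k(correct_counts, n=1, k=1)
print(f"pass@1 = {score:.4f}")
```

With a single sample per problem the estimator reduces to the plain fraction of solved problems, which is why small post-processing differences that flip a handful of problems move the score by roughly 0.6 points each.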

from deepseek-coder.

ganler commented on August 22, 2024

@SivilTaram Thanks for the reference! I think all of these results are attainable with better generation/inference parameters. For example, we use bf16 while your config uses fp32, and our maximum generation length is 512 while yours is larger (650). I believe our post-processing is robust: only two of the deepseek-coder solutions fail to compile (we checked manually that this is caused by the output enumerating items endlessly), and the rest all look good.
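As a rough illustration of the compilability check mentioned above, the following sketch flags generated solutions that do not compile. It is an assumption about how such a check could be done with Python's built-in `compile`, not the actual evaluation code; the task IDs and sample solutions are hypothetical.

```python
def is_compilable(source: str) -> bool:
    """Return True if the source parses and compiles as a Python module."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers truncated or malformed code; ValueError covers
        # sources containing null bytes on some Python versions.
        return False


# Hypothetical generations keyed by a made-up task ID.
solutions = {
    "example/0": "def add(a, b):\n    return a + b\n",
    # Cut off mid-list, as happens when a model enumerates items
    # until it hits the generation length limit.
    "example/1": "def broken(a, b):\n    return [a, b, a, b, a, b,",
}

not_compilable = [tid for tid, src in solutions.items() if not is_compilable(src)]
print(not_compilable)
```

A check like this only confirms that the code compiles; whether a solution is correct is still decided by running the benchmark tests.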

Anyway, I also got a copy of the author-provided solutions (thanks to the authors!), which reproduce the claimed results:

image

I will close the issue as I don't have further questions. Thanks for the attention and the great work! :)

ganler commented on August 22, 2024

An interesting finding, though: https://evalplus.github.io/leaderboard.html

image

In our independent evaluation, codellama-34b is slightly stronger than deepseekcoder-33b on the original HumanEval, but deepseekcoder-33b does better on HumanEval+. This suggests that deepseekcoder produces more robust code: HumanEval+ adds test cases to HumanEval, whose original tests may be insufficient to establish correctness.
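To make the point about insufficient tests concrete, here is a toy sketch of a solution that passes a small base test set but fails once extra cases are added. The function and both test sets are hypothetical and are not taken from HumanEval or HumanEval+.

```python
def median_buggy(values):
    """Hypothetical model output: ignores the even-length case."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# Base tests only cover odd-length inputs, so the bug goes unnoticed.
base_tests = [([3, 1, 2], 2), ([5], 5)]
# Extra tests (in the spirit of an augmented test suite) add even-length inputs.
extra_tests = [([1, 2, 3, 4], 2.5), ([10, 20], 15.0)]


def passes(func, tests) -> bool:
    """True only if func returns the expected value for every test case."""
    return all(func(list(args)) == expected for args, expected in tests)


base_ok = passes(median_buggy, base_tests)
plus_ok = passes(median_buggy, base_tests + extra_tests)
print(f"base tests: {base_ok}, base + extra tests: {plus_ok}")
```

A model whose solutions tend to survive the added cases would keep more of its score when moving from the base suite to the augmented one, which is the pattern described above.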

ganler commented on August 22, 2024

BTW, could you share the instruction used for deepseek-ai/deepseek-coder-33b-instruct on HumanEval? Thanks!

ganler commented on August 22, 2024

Update: I am able to get the following result using the prompt here and some post-cleaning to make all produced solutions compilable:

image
