
Darwin: A Tailored GPT for the Scientific Domain 🇦🇺


Organization: University of New South Wales (UNSW) AI4Science & GreenDynamics Pty. Ltd

Darwin is an open-source project dedicated to fine-tuning the LLaMA model on scientific literature and datasets. Specifically designed for the scientific domain with an emphasis on materials science, chemistry, and physics, Darwin integrates structured and unstructured scientific knowledge to enhance the efficacy of language models in scientific research.

Usage and License Notices: Darwin is licensed and intended for research use only. The dataset is released under CC BY-NC 4.0, which permits non-commercial use only. Models trained on this dataset should not be used outside of research purposes. The weight diff is also released under the CC BY-NC 4.0 license.

Model Overview

Darwin, based on the 7B LLaMA model, is trained on over 100,000 instruction-following data points generated by the Darwin Scientific Instruction Generator (SIG) from various scientific FAIR datasets and a literature corpus. By focusing on the factual correctness of the model's responses, Darwin represents a significant stride towards leveraging Large Language Models (LLMs) for scientific discovery. Preliminary human evaluations indicate that Darwin-7B outperforms GPT-4 on scientific Q&A and outperforms fine-tuned GPT-3 (e.g., gptChem) on solving chemistry problems.

We are actively developing Darwin for more advanced scientific domain experiments, and we're also integrating Darwin with LangChain to solve more complex scientific tasks (like a private research assistant for personal computers).

Please note, Darwin is still under development, and many limitations need to be addressed. Most importantly, we have yet to fine-tune Darwin for maximum safety. We encourage users to report any concerning behavior to help improve the model's safety and ethical considerations.

Model Comparison

[Figure: model comparison]

Getting Started

Installation

First install the requirements:

pip install -r requirements.txt

Preparing the Darwin Weights

Download the Darwin-7B weight checkpoints from OneDrive. Once you've downloaded the model, you can try our demo:

python inference.py <your path to darwin-7b>

Note that inference requires at least 10GB of GPU memory for Darwin-7B. We are working on a Colab version of the demo.
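If you prefer to load the weights directly instead of running inference.py, a minimal sketch using the Hugging Face transformers library is shown below. The prompt and generation settings are illustrative assumptions, not the exact configuration used by inference.py.

# Minimal sketch: load Darwin-7B with Hugging Face transformers.
# The prompt and generation settings below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "<your path to darwin-7b>"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory usage
    device_map="auto",
)

prompt = "What factors limit the efficiency of perovskite solar cells?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))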

Fine-tuning

To further fine-tune Darwin-7B on different datasets, the command below works on a machine with A100 80GB GPUs; set --nproc_per_node to the number of GPUs available on your machine.

torchrun  --nproc_per_node=8 --master_port=1212 train.py \
    --model_name_or_path <your path to darwin-7b> \
    --data_path <your path to dataset> \
    --bf16 True \
    --output_dir <your output dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 False
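The expected dataset format is not spelled out here. Assuming train.py follows the common Alpaca-style instruction-tuning recipe (an assumption; check train.py for the fields it actually reads), --data_path would point to a JSON file of instruction/input/output records, for example:

# Hypothetical Alpaca-style instruction-tuning file for --data_path.
# The field names are an assumption; verify them against train.py.
import json

records = [
    {
        "instruction": "What is the band gap of anatase TiO2?",
        "input": "",
        "output": "Anatase TiO2 has a band gap of approximately 3.2 eV.",
    },
]

with open("my_dataset.json", "w") as f:
    json.dump(records, f, indent=2)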

Datasets Information

Our data comes from two primary sources:

A raw literature corpus containing 6.0M papers related to materials science, chemistry, and physics published after 2000. The publishers include ACS, RSC, Springer Nature, Wiley, and Elsevier. We thank them for their support.

FAIR datasets: we have collected data from 10 FAIR datasets.

Data Generation

We developed Darwin-SIG to generate scientific instructions. It can memorize long full-text articles (on average ~5,000 words) and generate question-and-answer (Q&A) data based on scientific literature keywords (from the Web of Science API).

Note: You could also use GPT-3.5 or GPT-4 for generation, but these options can be costly.

Please be aware that due to agreements with the publishers, we can't share the training dataset.
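Because the training set cannot be redistributed, you can generate your own instruction data from literature you are licensed to mine. A minimal sketch of the GPT-3.5/GPT-4 alternative mentioned above is given below; it is not Darwin-SIG itself, and the prompt wording and openai client usage are illustrative assumptions.

# Sketch: generate Q&A pairs from a paper excerpt with the OpenAI API.
# This is not Darwin-SIG; the prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_qa(excerpt: str, keyword: str, n_pairs: int = 3) -> str:
    prompt = (
        f"You are given an excerpt from a scientific paper about '{keyword}'.\n"
        f"Write {n_pairs} question-and-answer pairs grounded only in the excerpt.\n\n"
        f"Excerpt:\n{excerpt}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content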

Authors

This project is a collaborative effort by the following:

UNSW & GreenDynamics: Tong Xie, Shaozhou Wang, Qingyuan Linghu

UNSW: Imran Razzak, Cody Huang, Zhenyu Yin

GreenDynamics: Yuwei Wan (CityU HK), Yixuan Liu (University of Melbourne)

All authors are advised by Bram Hoex and Wenjie Zhang from UNSW Engineering.

Citation

If you use the data or code from this repository in your work, please cite it accordingly.

Acknowledgements

This project has drawn on several open-source projects.

Special thanks to NCI Australia for their HPC support.

We are continuously expanding Darwin's knowledge by feeding it more scientific literature. Join us on this exciting journey of advancing scientific research with AI!
