
PACuna: Automated Fine-Tuning of Language Models for Particle Accelerators

- We are working on PACuna 2.0: better, more factual, and publicly available. Planned release date: 1.5.2024.

Sources are in the sources folder. Unfortunately, some training sources (books) are licensed; therefore neither the full source data nor the model can be published.

Paper: PDF | arXiv

Pipeline.

Navigating the landscape of particle accelerators has become increasingly challenging with recent surges in contributions. These intricate devices challenge comprehension, even within individual facilities. To address this, we introduce PACuna, a fine-tuned language model refined through publicly available accelerator resources such as conferences, pre-prints, and books. We automated data collection and question generation to minimize expert involvement and make the data publicly available. PACuna demonstrates proficiency in addressing intricate accelerator questions, validated by experts. Our approach shows that adapting language models to scientific domains by fine-tuning on technical texts and auto-generated corpora capturing the latest developments can produce models that answer some intricate questions that commercially available assistants cannot, and that can serve as intelligent assistants for individual facilities.

Code

Each LLM needs a properly formatted prompt to know that we are asking a question:

def prompt_formatter(question, answer=""):
    return f'### Human:\n{question}\n### Assistant:\n{answer}'
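For example, the formatted prompt for a question with an empty answer looks like this (the question text is only an illustration):

print(prompt_formatter("What is a klystron?"))
# ### Human:
# What is a klystron?
# ### Assistant: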

PDF to Markdown

To transform PDFs to MMD, we use Nougat OCR (https://github.com/facebookresearch/nougat):

nougat *.pdf -o .

In the publication we mistakenly omitted --no-skipping; we suggest that others use this parameter.
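The recommended invocation is therefore:

nougat *.pdf -o . --no-skipping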

Preprocessing

Nougat OCR emits MultiMarkdown (MMD), which expresses display equations with \[...\], inline math with \(...\), and tables as LaTeX environments. We prefer a standardized format in which equations are expressed with $...$ and $$...$$ and tables are converted to plain text:

    import re
    import pypandoc

    # x holds the MMD text of one document

    # replace display math \[ ... \] with $$ ... $$
    eq_expr = re.compile(r'\\\[(.*?)\\\]', flags=re.DOTALL)
    while len(re.findall(eq_expr, x)) != 0:
        eq_re = re.search(eq_expr, x)
        x = x[:eq_re.start()] + '$$' + x[eq_re.start() + 2:eq_re.end() - 2] + '$$' + x[eq_re.end():]

    # replace inline math \( ... \) with $ ... $
    eq_expr = re.compile(r'\\\((.*?)\\\)', flags=re.DOTALL)
    while len(re.findall(eq_expr, x)) != 0:
        eq_re = re.search(eq_expr, x)
        x = x[:eq_re.start()] + '$' + x[eq_re.start() + 2:eq_re.end() - 2] + '$' + x[eq_re.end():]

    # replace LaTeX table environments with a plain-text rendering via pandoc
    tabular_expr = tag_expr_full('table')
    while len(re.findall(tabular_expr, x)):
        tabular_re = re.search(tabular_expr, x)
        try:
            tabular_plain = pypandoc.convert_text(tabular_re.group(), 'markdown', format='latex', extra_args=['--wrap=none'])
        except Exception:
            # drop tables that pandoc cannot parse
            tabular_plain = ""
        x = x.replace(tabular_re.group(), tabular_plain)
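The helper tag_expr_full is defined elsewhere in the repository and is not shown here; a plausible sketch that matches how it is used above (an assumption, not the actual definition) is:

    import re

    def tag_expr_full(tag):
        # Hypothetical helper (not the repository's definition): compiled regex
        # matching a complete \begin{tag} ... \end{tag} LaTeX environment,
        # non-greedy and spanning multiple lines.
        return re.compile(r'\\begin\{' + tag + r'\}.*?\\end\{' + tag + r'\}', flags=re.DOTALL)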

At this stage, the unsupervised (plain-text) training data is prepared.

Supervised Data (Q&A Pairs)

To generate Q&A pairs, we query the Vicuna-1.5-16k model with the following prompt:

query = prompt_formatter(f"Generate ten questions with answers for paper:\"{data_}\"")

where data_ contains the text of the paper itself.

To generate text from the LLM, we use the following function (max_length defaults to the model's 16k context window):

def test(q, model, tokenizer, max_length=16384, temperature=0.6):
    # tokenize the prompt and move it to the inference device ('cuda', set below)
    inputs = tokenizer(q, return_tensors="pt", return_token_type_ids=False).to(device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        # repetition_penalty=1.18,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_p=1.0,  # top_k=64,
        temperature=temperature,
    )
    # decode the full sequence (prompt + completion)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

To extract individual Q&A pairs, we use the following heuristic:

def extract_qa_pairs_from_result(text):
    qa_pairs = []
    # split on numbered questions ("1. ...?") followed by the answer line
    text = re.split(r'\d+\.\s(.*\?)\n(.*)', text)
    # drop whitespace-only and empty fragments left over from the split
    text = [t for t in text if not t.isspace() and len(t) > 0]
    # only accept the result if questions and answers pair up cleanly
    if len(text) % 2 == 0:
        for i in range(0, len(text), 2):
            q = text[i]
            a = re.sub(r'^Answer:', '', text[i + 1]).strip()
            qa_pairs.append({'question': q,
                             'answer': a})
    return qa_pairs
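A small self-contained example (the sample text is illustrative; real model outputs vary in format, which is why the extraction is only a heuristic):

sample = (
    "1. What is PACuna?\n"
    "Answer: A language model fine-tuned on particle accelerator texts.\n"
    "2. Which OCR tool converts the PDFs?\n"
    "Answer: Nougat.\n"
)
print(extract_qa_pairs_from_result(sample))
# [{'question': 'What is PACuna?', 'answer': 'A language model fine-tuned on particle accelerator texts.'},
#  {'question': 'Which OCR tool converts the PDFs?', 'answer': 'Nougat.'}]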

Fine-tuning LLM

See lora_unsup.ipynb
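The notebook is the authoritative reference. The block below is only a minimal sketch of a comparable LoRA setup with PEFT; the base model name, dataset file, and all hyperparameters (r, alpha, target modules, learning rate, sequence length) are illustrative assumptions, not the values used for PACuna:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base = "lmsys/vicuna-13b-v1.5-16k"  # assumed base model, not necessarily PACuna's
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")

# attach LoRA adapters; hyperparameters are placeholders
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# hypothetical JSON file with a "text" column holding prompt_formatter(q, a) strings
ds = load_dataset("json", data_files="qa_pairs.json")["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pacuna-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, bf16=True, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("pacuna-lora")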

Test

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftConfig, PeftModel

    # model_folder is the output directory of the fine-tuning run
    peft_model_id = model_folder + "checkpoint-x/"  # path to the last checkpoint
    # peft_model_id = model_folder + "/checkpoint-13200/"

    config = PeftConfig.from_pretrained(peft_model_id)
    model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path,
        torch_dtype="auto",
        device_map="auto",
        offload_folder="offload",
        offload_state_dict=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    # attach the fine-tuned LoRA adapter to the base model
    model = PeftModel.from_pretrained(model, peft_model_id)
    # model = model.merge_and_unload()  # optionally merge the adapter into the base weights

Querying LLM

device = 'cuda'
max_length = 256
T = 0.7

p = "What is a klystron?"
test(prompt_formatter(p), model, tokenizer, max_length=max_length, temperature=T)
