Comments (5)

shubham526 commented on August 23, 2024

@bclavie Any comments on how to resolve this?

shubham526 commented on August 23, 2024

Ok, fixed this. It was an issue with the gcc and gxx versions. I looked at the conda yml file in the official ColBERT repository and created a new environment with exactly those package versions.
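
For anyone hitting the same build error, recreating the environment from the ColBERT repository's conda file is the straightforward route. A minimal sketch, assuming the conda_env.yml file in the stanford-futuredata/ColBERT repository (check the repo for the current file and environment names):

git clone https://github.com/stanford-futuredata/ColBERT.git
cd ColBERT
conda env create -f conda_env.yml   # pins gcc/gxx alongside the other dependencies
conda activate colbert              # the environment name is defined inside the yml file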

4entertainment commented on August 23, 2024

Hi @shubham526!
First of all, I wish you good work and success.
Could you please share the code that loads the fine-tuned model, as well as the fine-tuning code you used?
Thank you for your interest.

shubham526 commented on August 23, 2024

I just used the code given in this repository. Look at the examples here: https://github.com/bclavie/RAGatouille/tree/main/examples
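
For reference, loading and querying a fine-tuned checkpoint follows the same pattern as those examples. A minimal sketch using RAGatouille's RAGPretrainedModel; the checkpoint path and documents below are placeholders:

from ragatouille import RAGPretrainedModel

# Point RAGatouille at the saved checkpoint directory (the folder containing
# config.json, model.safetensors, artifact.metadata, and the tokenizer files).
RAG = RAGPretrainedModel.from_pretrained("path/to/checkpoints/colbert")

# Build an index over a small collection, then query it.
RAG.index(collection=["first document ...", "second document ..."],
          index_name="my_index")
results = RAG.search(query="your query here", k=3)
print(results)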

4entertainment commented on August 23, 2024

Thank you so much for your reply, @shubham526! I have a few more questions; I would be happy if you could answer them when you are available. Here is my fine-tuning code:

from ragatouille import RAGTrainer
from ragatouille.data import CorpusProcessor, llama_index_sentence_splitter
import os
import glob
import random

def main():
    trainer = RAGTrainer(model_name="ColBERT_1.0",  # ColBERT_1 for the first sample
                         # pretrained_model_name="colbert-ir/colbertv2.0",
                         pretrained_model_name="intfloat/e5-base",
                         language_code="tr"
                         )
    # pretrained_model_name: the base model to fine-tune
    # model_name: the name given to the newly trained model



    # Path to the directory containing all the `.txt` files for indexing
    folder_path = "/text"  # the text folder contains several .txt files
    # Initialize lists to store the texts and their corresponding file names
    all_texts = []
    document_ids = []
    # Read all `.txt` files in the specified folder and extract file names
    for file_path in glob.glob(os.path.join(folder_path, "*.txt")):
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
            all_texts.append(content)
            document_ids.append(os.path.splitext(os.path.basename(file_path))[0])  # Extract file name without extension


    # chunking
    corpus_processor = CorpusProcessor(document_splitter_fn=llama_index_sentence_splitter)
    documents = corpus_processor.process_corpus(documents=all_texts, document_ids=document_ids, chunk_size=256)  # chunk overlap of 0.1 chosen

    # To train retrieval models like ColBERT, we need training triplets: queries, positive passages, and negative passages for each query.
    # Since there is no labeled data here, build fake query-relevant passage pairs.
    queries = ["document relevant query-1",
               "document relevant query-2",
               "document relevant query-3",
               "document relevant query-4",
               "document relevant query-5",
               "document relevant query-6"
    ] * 3
    pairs = []
    for query in queries:
        fake_relevant_docs = random.sample(documents, 10)
        for doc in fake_relevant_docs:
            pairs.append((query, doc))


    # prepare training data
    trainer.prepare_training_data(raw_data=pairs,
                                  data_out_path="./data_out_path",
                                  all_documents=all_texts,
                                  num_new_negatives=10,
                                  mine_hard_negatives=True
                                  )
    trainer.train(batch_size=32,
                  nbits=4,  # how many bits the trained model will use when compressing indexes
                  maxsteps=500000,
                  use_ib_negatives=True,  # use in-batch negatives when computing the loss
                  dim=128,  # each embedding will have 128 dimensions
                  learning_rate=5e-6,  # small values in [3e-6, 3e-5] work best for BERT-like base models; 5e-6 is often the sweet spot
                  doc_maxlen=256,  # maximum document length
                  use_relu=False,  # disable ReLU
                  warmup_steps="auto",  # defaults to 10% of total steps
    )

if __name__ == "__main__":
    main()

When I run my code, a model with the structure below is saved to the checkpoints directory:

colbert/

  • vocab.txt
  • tokenizer_config.json
  • tokenizer.json
  • special_tokens_map.json
  • model.safetensors
  • config.json
  • artifact.metadata

I need to fine-tune the intfloat/e5-base or intfloat/multilingual-e5-base model on my own data with ColBERT. Do you know of any changes I need to make to my code, or to the library's internal code?

Also, how can I load and try the model with the structure I shared above, which I fine-tuned using my code? Do you have code for loading and testing it?

Thanks again for your interest.
