Number of Triplets (ragatouille issue, CLOSED)

nico2rdj avatar nico2rdj commented on July 19, 2024
Number of Triplets

from ragatouille.

Comments (13)

bclavie avatar bclavie commented on July 19, 2024

Merged in 0.0.6b0! With the shuffling & duplicate fixes 😄

bclavie avatar bclavie commented on July 19, 2024

Hey,

Just looking at the trainer actually! You shouldn't be getting more than 8M triplets, as the defaults are to mine 10 hard negative examples per query (which you're doing), and also to ensure that each query has a maximum of 20 triplets 🤔.

Could you share your code? It's possible that the pairs pathway could accidentally be generating too many negatives!

nico2rdj avatar nico2rdj commented on July 19, 2024

Sure here is the code:

 from datasets import load_dataset
 from ragatouille import RAGTrainer
 from tqdm import tqdm

 def run():
     print("load dataset")
     dataset = load_dataset('unicamp-dl/mmarco', 'french')

     # Collect (query, positive passage) pairs from the training split.
     pairs = []
     for data in tqdm(dataset['train']):
         query = data['query']
         doc = data['positive']
         pairs.append((query, doc))

     trainer = RAGTrainer(model_name="colBERT", pretrained_model_name="almanach/camembert-base", language_code="fr")

     # Mine 10 hard negatives per query and write the training data to ./data.
     trainer.prepare_training_data(raw_data=pairs, data_out_path="./data", all_documents=None, num_new_negatives=10, mine_hard_negatives=True, hard_negative_model_size="base")

bclavie avatar bclavie commented on July 19, 2024

Thank you, I'll try to figure out the exact issue soon!

In the meantime, I see that you're doing this for French. If useful, you might want to check out https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR which is a ColBERT model also initialized from CamemBERT-base and trained on the MMARCO French split. It was trained with the upstream ColBERT codebase, so it should be plug-and-play with RAGatouille!

nico2rdj avatar nico2rdj commented on July 19, 2024

When I print the generated triplets, there appear to be duplicates 🤔

nico2rdj avatar nico2rdj commented on July 19, 2024

[image]

bclavie avatar bclavie commented on July 19, 2024

Thank you! It's not immediately obvious what the issue is, but this helps a lot with diagnosing it. I have a few ideas of what the cause could be, but I need to look into it more deeply.

That said, this has already allowed me to spot a related bug which could cause duplicates to appear, but shouldn't cause the total number of entries to go up 🤔 (when extra_triplets_needed > 0, there was no check to ensure the new triplets were unique)
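
The kind of uniqueness check described above can be sketched as follows. This is a minimal illustration only, not RAGatouille's actual code: the function `top_up_triplets`, its parameters, and the representation of triplets as (query, positive, negative) tuples are all assumptions made for the example.

```python
import random


def top_up_triplets(existing, candidates, extra_needed, seed=0):
    """Hypothetical helper: add up to `extra_needed` new triplets.

    Candidates already present in `existing`, or repeated within
    `candidates`, are skipped so the result contains no duplicates.
    """
    rng = random.Random(seed)
    seen = set(existing)
    # dict.fromkeys drops repeated candidates while keeping their order.
    pool = [t for t in dict.fromkeys(candidates) if t not in seen]
    rng.shuffle(pool)
    return list(existing) + pool[:extra_needed]


existing = [("q", "pos", "neg1")]
candidates = [("q", "pos", "neg1"), ("q", "pos", "neg2"),
              ("q", "pos", "neg2"), ("q", "pos", "neg3")]
topped_up = top_up_triplets(existing, candidates, extra_needed=5)
```

Without the `seen` filter, the same triplet could be drawn again whenever extra triplets are needed, which would produce duplicates without changing the total count.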

bclavie avatar bclavie commented on July 19, 2024

Progress will be in #78

bclavie avatar bclavie commented on July 19, 2024

Hey, so #78 should resolve (at least partially!) the duplicates issue.

As for the order of magnitude issue, I have loaded the data using your code above, and there are 39780811 (39,780,811) pairs, so it'd make sense that you end up with roughly ~40M triplets? Or do you run some extra processing on the pairs to reduce them to 400k?

nico2rdj avatar nico2rdj commented on July 19, 2024

Thank you for your responsiveness :)
It's clearer now. We have 39,780,811 pairs before using the prepare_training_data function, and we end up with approximately the same number, which is around 40 million triplets. It is true that, after removing duplicates, we have 457,040 pairs before the prepare_training_data function, so logically, we should end up with about 45 million unique triplets. I assume that the prepare_training_data function also removes duplicates, which is why we end up with around 40 million. However, there was a peculiar issue regarding the number of duplicate triplets. I am currently running this function again, this time removing the duplicates beforehand.
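
Removing duplicate pairs before calling prepare_training_data could look like the sketch below. It assumes the pairs are (query, document) tuples, as in the code shared earlier in the thread; `dedupe_pairs` is a hypothetical helper name, not part of RAGatouille.

```python
def dedupe_pairs(pairs):
    """Hypothetical helper: drop exact duplicate (query, document) pairs.

    dict.fromkeys keeps the first occurrence of each pair and preserves
    the original order, which a plain set() would not guarantee.
    """
    return list(dict.fromkeys(pairs))


pairs = [
    ("query a", "doc 1"),
    ("query a", "doc 1"),  # exact duplicate, dropped
    ("query a", "doc 2"),
    ("query b", "doc 1"),
]
unique_pairs = dedupe_pairs(pairs)
print(len(pairs), "->", len(unique_pairs))  # prints "4 -> 3"
```

The deduplicated list can then be passed as raw_data in place of the original pairs.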

nico2rdj avatar nico2rdj commented on July 19, 2024

I assume the issue is somewhere else. I just reviewed prepare_training_data, and you do remove duplicates 🤔 I am trying your fix :)

nico2rdj avatar nico2rdj commented on July 19, 2024

We are good, it works perfectly now: we end up with 4M triplets (I previously said we should end up with 45M, but in fact it is 4.5M) and no duplicates! Thank you for the fix, Benjamin :)
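
The corrected figure can be checked with a quick calculation. This is only a rough estimate, assuming each of the 457,040 unique pairs yields one triplet per mined hard negative (num_new_negatives=10); the per-query cap of 20 triplets mentioned earlier in the thread can lower the real total. The function name `estimate_triplets` is made up for the example.

```python
def estimate_triplets(unique_pairs, negatives_per_pair):
    """Rough upper estimate: one triplet per (pair, mined negative)."""
    return unique_pairs * negatives_per_pair


estimate = estimate_triplets(unique_pairs=457_040, negatives_per_pair=10)
print(f"{estimate:,}")  # prints "4,570,400", i.e. about 4.5M rather than 45M
```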
