Hello Benjamin, Again thank you for this amazing work! :) There

Number of Triplets about ragatouille HOT 13 CLOSED

nico2rdj commented on July 19, 2024

Number of Triplets

from ragatouille.

Comments (13)

bclavie commented on July 19, 2024

Merged in 0.0.6b0! With the shuffling & duplicate fixes 😄


bclavie commented on July 19, 2024

Hey,

Just looking at the trainer actually! You shouldn't be getting more than 8M triplets, as the defaults are to mine 10 hard negative examples per query (which you're doing) and to cap each query at a maximum of 20 triplets 🤔.

Could you share your code? It's possible that the pairs pathway could accidentally be generating too many negatives!
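
As a rough illustration of where an 8M ceiling could come from, here is a minimal sketch. It assumes roughly 400k unique queries (the figure mentioned later in this thread) and the cap of 20 triplets per query described above; `max_triplets` is a hypothetical helper written for this example, not part of RAGatouille.

```python
def max_triplets(num_queries: int, max_per_query: int = 20) -> int:
    """Upper bound on triplets when every query is capped at max_per_query."""
    return num_queries * max_per_query


# Assumed figure: ~400k unique queries, each capped at 20 triplets.
print(max_triplets(400_000))  # 8000000
```

Anything far above this bound suggests that either the input holds many more queries than expected or the cap is not being applied.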


nico2rdj commented on July 19, 2024

Sure here is the code:

    from datasets import load_dataset
    from ragatouille import RAGTrainer
    from tqdm import tqdm


    def run():
        print("load dataset")
        dataset = load_dataset('unicamp-dl/mmarco', 'french')

        # Collect every (query, positive passage) pair from the training split.
        pairs = []
        for data in tqdm(dataset['train']):
            query = data['query']
            doc = data['positive']
            pairs.append((query, doc))

        trainer = RAGTrainer(model_name="colBERT", pretrained_model_name="almanach/camembert-base", language_code="fr")

        # Mine 10 hard negatives per query and write the training data to ./data.
        trainer.prepare_training_data(raw_data=pairs, data_out_path="./data", all_documents=None, num_new_negatives=10, mine_hard_negatives=True, hard_negative_model_size="base")


bclavie commented on July 19, 2024

Thank you, I'll try to figure out the exact issue soon!

In the meantime, I see that you're doing this for French. If useful, you might want to check out https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR, which is a ColBERT model also initialised from CamemBERT-base and trained on the French split of MMARCO. It was trained with the upstream ColBERT codebase, so it should be plug & play with RAGatouille!


nico2rdj commented on July 19, 2024

When I print the generated triplets, there appear to be duplicates 🤔


nico2rdj commented on July 19, 2024


bclavie commented on July 19, 2024

Thank you! It's not immediately obvious what the issue is, but this helps a lot with diagnosing it... I have a few ideas about what the cause could be, but I need to look into it more deeply...

That said, this has already allowed me to spot a related bug which could cause duplicates to appear, but shouldn't cause the total number of entries to go up 🤔 (when extra_triplets_needed > 0, there was no check to ensure the new triplets were unique)
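
To make that kind of bug concrete, here is a minimal sketch of a uniqueness check when topping up a triplet list. It is not the RAGatouille implementation: `top_up_unique`, its parameters, and the tuple representation of triplets are assumptions made for illustration; only the name `extra_triplets_needed` comes from the comment above.

```python
import random


def top_up_unique(triplets, candidates, extra_triplets_needed, seed=42):
    """Append up to extra_triplets_needed candidates that are not already present.

    Triplets are assumed to be hashable (query, positive, negative) tuples.
    """
    seen = set(triplets)
    # dict.fromkeys drops repeated candidates while keeping first-seen order.
    pool = [c for c in dict.fromkeys(candidates) if c not in seen]
    rng = random.Random(seed)
    rng.shuffle(pool)
    return list(triplets) + pool[:max(extra_triplets_needed, 0)]


existing = [("q", "p", "n1")]
candidates = [("q", "p", "n1"), ("q", "p", "n2"), ("q", "p", "n2"), ("q", "p", "n3")]
result = top_up_unique(existing, candidates, extra_triplets_needed=2)
print(len(result), len(set(result)))  # 3 3
```

Without the `seen` filter, sampling extra triplets from the candidates could return entries that already exist, which gives duplicates without changing the requested total.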


bclavie commented on July 19, 2024

Progress will be in #78


bclavie commented on July 19, 2024

Hey, so #78 should resolve (at least partially!) the duplicates issue.

As for the order-of-magnitude issue, I have loaded the data using your code above, and there are 39,780,811 pairs, so it'd make sense that you end up with roughly 40M triplets? Or do you run some extra processing on the pairs to reduce them to 400k?


nico2rdj commented on July 19, 2024

Thank you for your responsiveness :)
It's clearer now. We have 39,780,811 pairs before calling prepare_training_data, and we end up with approximately the same number, around 40 million triplets. After removing duplicates, though, we have only 457,040 pairs before prepare_training_data, so logically we should end up with about 45 million unique triplets. I assume that prepare_training_data also removes duplicates, which is why we end up with around 40 million. However, there is (or was) a peculiar issue with the number of duplicate triplets. I am currently re-running the function after removing the duplicates beforehand.
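
Removing duplicate pairs before calling prepare_training_data can be sketched as below. This is only an illustration: `dedupe_pairs` is a hypothetical helper, and it assumes the pairs are hashable (query, passage) tuples, as in the code shared earlier in the thread.

```python
def dedupe_pairs(pairs):
    """Drop exact duplicate (query, passage) pairs, keeping first-seen order."""
    # Dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(pairs))


pairs = [("q1", "doc a"), ("q1", "doc a"), ("q1", "doc b"), ("q2", "doc a")]
print(dedupe_pairs(pairs))
# [('q1', 'doc a'), ('q1', 'doc b'), ('q2', 'doc a')]
```

For tens of millions of pairs this holds every unique pair in memory at once, so a streaming or on-disk approach may suit larger inputs better.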


nico2rdj commented on July 19, 2024

I assume the issue is somewhere else: I just reviewed prepare_training_data, and you do remove duplicates 🤔 I am trying your fix :)


nico2rdj commented on July 19, 2024

We are good, it works perfectly now! We end up with 4M triplets (I previously said we should end up with 45M, but in fact it is 4.5M) and no duplicates! Thank you for the fix Benjamin :)
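
The corrected estimate can be checked with a line of arithmetic. This sketch assumes each unique pair yields one triplet per mined negative, using the 457,040 unique pairs and 10 negatives per query quoted in this thread; the variable names are illustrative only.

```python
unique_pairs = 457_040       # pairs left after removing duplicates
negatives_per_pair = 10      # num_new_negatives used in the call above

expected_triplets = unique_pairs * negatives_per_pair
print(f"{expected_triplets:,}")  # 4,570,400
```

That is about 4.5M, not 45M, which is in line with the roughly 4M triplets observed once duplicates are gone.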


bclavie commented on July 19, 2024

No worries, glad your issue is fixed, and thanks for the debugging assistance! I'll release the PR on PyPI later today so that it also includes the main-branch fix for proper shuffling before training. Right now, in the branch, there are some cases where triplets aren't shuffled properly, which makes training less efficient because of in-batch negatives.
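
For readers wondering why shuffling matters here: with in-batch negatives, the other passages in a batch are typically treated as negatives, so a batch made of consecutive triplets from the same query tends to contain repeated or closely related passages. A minimal sketch of a reproducible shuffle follows. It is not the fix that was merged: `shuffle_triplets` and its `seed` parameter are hypothetical.

```python
import random


def shuffle_triplets(triplets, seed=42):
    """Return a shuffled copy so triplets from one query are spread across batches."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    shuffled = list(triplets)
    rng.shuffle(shuffled)
    return shuffled


triplets = [(f"q{i // 10}", f"pos{i // 10}", f"neg{i}") for i in range(100)]
shuffled = shuffle_triplets(triplets)
print(len(shuffled), sorted(shuffled) == sorted(triplets))  # 100 True
```

Returning a copy keeps the original ordering available for inspection, and the fixed seed makes the training order repeatable between runs.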
