
Comments (8)

okhat commented on July 23, 2024

Hi Alex! Training expects triples (query, positive, negative) and you can shuffle them and use more negatives (or repeat triples) as many times as you need. However, epochs are almost never needed in practice, because instead of repeating the same triple, you can just use (query, positive, NEW negative) to see more examples!

You can find top-200 or top-1000 BM25 negatives, giving you a total of up to 3200 x 10 x 1000 = 32M triples! With batch size 32, this is 1M training steps! So you don't need epochs.
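As a rough sketch of what that expansion can look like (this is not ColBERT's own data-prep code; the qrels/bm25 structures and the numbers are assumed for illustration), each (query, positive) pair is simply crossed with many BM25-ranked negatives:

import random

def make_triples(qrels, bm25, negatives_per_pair=1000, seed=12345):
    """Cross each (query, positive) pair with many BM25 negatives.

    qrels: dict mapping qid -> set of positive pids (assumed format).
    bm25:  dict mapping qid -> ranked list of candidate pids, e.g. top-1000.
    """
    random.seed(seed)
    triples = []
    for qid, positives in qrels.items():
        # Anything BM25 retrieved that is not a known positive is treated as a negative.
        negatives = [pid for pid in bm25.get(qid, []) if pid not in positives]
        for pos in positives:
            for neg in negatives[:negatives_per_pair]:
                triples.append((qid, pos, neg))
    random.shuffle(triples)  # shuffle so each batch mixes queries and negatives
    return triples

# e.g. 3200 queries x 10 positives x 1000 negatives ~= 32M (qid, pos, neg) triples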


alexjout commented on July 23, 2024

Thank you for your quick reply.
Yes, that makes quite a lot of training data indeed.
Also, can you confirm that there is no early stopping of training due to loss stagnation or anything else?
And by any chance, have you already encountered this problem where training stops before the maximum number of steps, but the process doesn't exit?


okhat commented on July 23, 2024

there is no early stopping of training due to loss stagnation or anything else

Training shouldn't stop until all data is exhausted or --maxsteps is over!
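Schematically (this is only an illustration of the two stop conditions, not the actual ColBERT training loop), that means:

def train_step(batch):
    pass  # placeholder for one optimizer update on a batch of triples

maxsteps = 1000               # stands in for the --maxsteps argument
batches = iter(range(5000))   # stands in for the stream of training batches

for step, batch in enumerate(batches):
    if step >= maxsteps:      # exit 1: the step budget is reached
        break
    train_step(batch)
# exit 2: the loop ends on its own once `batches` is exhausted.
# There is no loss-stagnation or other early-stopping check.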

And by any chance, have you already encountered this problem where training stops before the maximum number of steps, but the process doesn't exit?

No, could you please share more info? Are you training with multiple GPUs? Did this problem happen multiple times? Sometimes PyTorch can be (infrequently) unstable with multiple GPUs, but this shouldn't repeat.


alexjout commented on July 23, 2024

I'm currently training on a single 12 GB GPU. In this line, inside the if block, I added the line

torch.distributed.init_process_group(
    backend='nccl', init_method='env://')

otherwise I couldn't launch training and got an error message. So I don't know whether that could be the cause.

I also changed this line to qid, pos, neg = line.strip().split('\t').

Otherwise I didn't change anything. The stop typically occurs after 600 or 3000 steps. I can still see the process with htop, and the GPU memory is still allocated but doesn't change (typically 43% is used).

Maybe it comes from the triples.p file that I created. I will try relaunching with the --resume argument to see if the problem comes from the same triples (making sure the triples are fed in the same order).


okhat commented on July 23, 2024

I see. For one GPU, you can just do python -m colbert.train [....] without the torch.distributed stuff! This would work more smoothly.
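If you do want one script to handle both cases, a common pattern (just a sketch of the general idea, not how the ColBERT code itself is organized) is to initialize the process group only when a distributed launcher has set the usual environment variables:

import os
import torch

# torch.distributed.launch / torchrun export WORLD_SIZE (and RANK, LOCAL_RANK);
# a plain `python -m colbert.train ...` invocation does not, so the distributed
# setup can be skipped entirely for single-GPU runs.
if int(os.environ.get("WORLD_SIZE", "1")) > 1:
    torch.distributed.init_process_group(backend='nccl', init_method='env://')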

For the line that you changed to .split('\t'), are you passing qid, pos, and neg as IDs to the collection.tsv and queries.tsv? Or text?


alexjout commented on July 23, 2024

I see. For one GPU, you can just do python -m colbert.train [....] without the torch.distributed stuff! This would work more smoothly.

Oh OK, that makes more sense, thank you!

For the line that you changed to .split('\t'), are you passing qid, pos, and neg as IDs to the collection.tsv and queries.tsv? Or text?

Yes, I forgot to say that I changed the next line to triples.append((int(qid), int(pos), int(neg))).
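Putting the two edits together, the loading loop would look roughly like this (a sketch: the file path and surrounding code are assumed, not the exact code in the repo):

triples_path = 'triples.p'   # assumed path to the tab-separated ID triples file

triples = []
with open(triples_path) as f:
    for line in f:
        qid, pos, neg = line.strip().split('\t')        # the changed parsing line
        # Keep the IDs as ints so they can index into queries.tsv / collection.tsv.
        triples.append((int(qid), int(pos), int(neg)))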


alexjout commented on July 23, 2024

It seems to work for me after deleting the torch.distributed.init_process_group(backend='nccl', init_method='env://') line and launching it directly! Thank you very much!


tdieu29 commented on July 23, 2024

Hi Alex! Training expects triples (query, positive, negative) and you can shuffle them and use more negatives (or repeat triples) as many times as you need. However, epochs are almost never needed in practice, because instead of repeating the same triple, you can just use (query, positive, NEW negative) to see more examples!

You can find top-200 or top-1000 BM25 negatives, giving you a total of up to 3200 x 10 x 1000 = 32M triples! With batch size 32, this is 1M training steps! So you don't need epochs.

Hi Omar, wouldn't this unbalanced data (multiple negative examples for one query and one positive example) lead to a big bias in the prediction model towards the positive example, or am I missing something? Thanks!

