
Comments (8)

okhat commented on July 23, 2024

Hi Alex! Training expects triples (query, positive, negative) and you can shuffle them and use more negatives (or repeat triples) as many times as you need. However, epochs are almost never needed in practice, because instead of repeating the same triple, you can just use (query, positive, NEW negative) to see more examples!

You can find top-200 or top-1000 BM25 negatives, giving you a total of up to 3200 x 10 x 1000 = 32M triples! With batch size 32, this is 1M training steps! So you don't need epochs.
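As a rough sketch of what that expansion can look like (this is not ColBERT's own data-prep code; the qrels/bm25 structures and the numbers are assumed for illustration), each (query, positive) pair is simply crossed with many BM25-ranked negatives:

import random

def make_triples(qrels, bm25, negatives_per_pair=1000, seed=12345):
    """Cross each (query, positive) pair with many BM25 negatives.

    qrels: dict mapping qid -> set of positive pids (assumed format).
    bm25:  dict mapping qid -> ranked list of candidate pids, e.g. top-1000.
    """
    random.seed(seed)
    triples = []
    for qid, positives in qrels.items():
        # Anything BM25 retrieved that is not a known positive is treated as a negative.
        negatives = [pid for pid in bm25.get(qid, []) if pid not in positives]
        for pos in positives:
            for neg in negatives[:negatives_per_pair]:
                triples.append((qid, pos, neg))
    random.shuffle(triples)  # shuffle so each batch mixes queries and negatives
    return triples

# e.g. 3200 queries x 10 positives x 1000 negatives ~= 32M (qid, pos, neg) triples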


alexjout commented on July 23, 2024

Thank you for your quick reply.
Yes, that makes quite a lot of training data indeed.
Also, can you confirm that there is no early stopping of training due to loss stagnation or anything else?
And by any chance, have you already encountered this problem where training stops before the maximum number of steps, but the process doesn't exit?


okhat commented on July 23, 2024

there is no early stopping of training due to loss stagnation or anything else

Training shouldn't stop until all data is exhausted or --maxsteps is over!
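Schematically (this is only an illustration of the two stop conditions, not the actual ColBERT training loop), that means:

def train_step(batch):
    pass  # placeholder for one optimizer update on a batch of triples

maxsteps = 1000               # stands in for the --maxsteps argument
batches = iter(range(5000))   # stands in for the stream of training batches

for step, batch in enumerate(batches):
    if step >= maxsteps:      # exit 1: the step budget is reached
        break
    train_step(batch)
# exit 2: the loop ends on its own once `batches` is exhausted.
# There is no loss-stagnation or other early-stopping check.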

And by any chance, have you already encountered this problem where training stops before the maximum number of steps, but the process doesn't exit?

No, could you please share more info? Are you training with multiple GPUs? Did this problem happen multiple times? Sometimes PyTorch can be (infrequently) unstable with multiple GPUs, but this shouldn't repeat.


alexjout commented on July 23, 2024

I'm currently training on a single 12 GB GPU. In this line, inside the if block, I added the line

torch.distributed.init_process_group(
    backend='nccl', init_method='env://')

otherwise I couldn't launch training and got an error message. So I don't know whether that could be the cause.

I also changed this line to qid, pos, neg = line.strip().split('\t').

Otherwise I didn't change anything. The stop typically occurs after 600 or 3000 steps. I can still see the process with htop, and the GPU memory is still allocated but doesn't change (typically 43% is used).

Maybe it comes from the triples.p file that I created. I will try relaunching with the --resume argument to see if the problem comes from the same triples (making sure the triples are fed in the same order).


okhat commented on July 23, 2024

I see. For one GPU, you can just do python -m colbert.train [....] without the torch.distributed stuff! This would work more smoothly.
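If you do want one script to handle both cases, a common pattern (just a sketch of the general idea, not how the ColBERT code itself is organized) is to initialize the process group only when a distributed launcher has set the usual environment variables:

import os
import torch

# torch.distributed.launch / torchrun export WORLD_SIZE (and RANK, LOCAL_RANK);
# a plain `python -m colbert.train ...` invocation does not, so the distributed
# setup can be skipped entirely for single-GPU runs.
if int(os.environ.get("WORLD_SIZE", "1")) > 1:
    torch.distributed.init_process_group(backend='nccl', init_method='env://')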

For the line that you changed to .split('\t'), are you passing qid, pos, and neg as IDs to the collection.tsv and queries.tsv? Or text?


alexjout commented on July 23, 2024

I see. For one GPU, you can just do python -m colbert.train [....] without the torch.distributed stuff! This would work more smoothly.

Oh OK, that makes more sense, thank you!

For the line that you changed to .split('\t'), are you passing qid, pos, and neg as IDs to the collection.tsv and queries.tsv? Or text?

Yes, I forgot to say that I changed the next line to triples.append((int(qid), int(pos), int(neg))).
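Putting the two edits together, the loading loop would look roughly like this (a sketch: the file path and surrounding code are assumed, not the exact code in the repo):

triples_path = 'triples.p'   # assumed path to the tab-separated ID triples file

triples = []
with open(triples_path) as f:
    for line in f:
        qid, pos, neg = line.strip().split('\t')        # the changed parsing line
        # Keep the IDs as ints so they can index into queries.tsv / collection.tsv.
        triples.append((int(qid), int(pos), int(neg)))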


alexjout commented on July 23, 2024

It seems to work for me after deleting the torch.distributed.init_process_group(backend='nccl', init_method='env://') line and launching it directly! Thank you very much!


tdieu29 commented on July 23, 2024

Hi Alex! Training expects triples (query, positive, negative) and you can shuffle them and use more negatives (or repeat triples) as many times as you need. However, epochs are almost never needed in practice, because instead of repeating the same triple, you can just use (query, positive, NEW negative) to see more examples!

You can find top-200 or top-1000 BM25 negatives, giving you a total of up to 3200 x 10 x 1000 = 32M triples! With batch size 32, this is 1M training steps! So you don't need epochs.

Hi Omar, wouldn't this unbalanced data (multiple negative examples for one query and one positive example) lead to a big bias in the prediction model towards the positive example, or am I missing something? Thanks!

