Comments (8)
Hi Alex! Training expects triples (query, positive, negative) and you can shuffle them and use more negatives (or repeat triples) as many times as you need. However, epochs are almost never needed in practice, because instead of repeating the same triple, you can just use (query, positive, NEW negative) to see more examples!
You can mine the top-200 or top-1000 BM25 negatives per query, giving you up to 3200 x 10 x 1000 = 32M triples! With batch size 32, that's 1M training steps, so you don't need epochs.
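The triple-expansion idea above can be sketched as follows (a minimal sketch; the function name and data layout are hypothetical, not from the ColBERT codebase):

```python
import random

def make_triples(qrels, negatives, per_query=1000, seed=0):
    """Expand (query, positive) pairs into training triples by pairing
    each positive with up to `per_query` BM25-mined negatives.

    qrels:     dict mapping query id -> list of positive passage ids
    negatives: dict mapping query id -> list of BM25-ranked negative ids
    """
    rng = random.Random(seed)
    triples = []
    for qid, positives in qrels.items():
        negs = negatives.get(qid, [])
        for pos in positives:
            # Each triple reuses the same (query, positive) with a NEW
            # negative, so epoch-style repetition is unnecessary.
            for neg in rng.sample(negs, min(per_query, len(negs))):
                triples.append((qid, pos, neg))
    rng.shuffle(triples)
    return triples
```

With 3200 queries, 10 positives each, and 1000 negatives per query, this yields the 32M triples mentioned above.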
from colbert.
Thank you for your quick reply.
Yes, that does make quite a lot of training data indeed.
Also, can you confirm that there is no early stopping of the training due to loss stagnation or anything else?
And have you by chance already encountered this problem of training stopping before the maximum number of steps, but without the process exiting?
> can you confirm that there is no early stopping of the training due to loss stagnation or anything else?
Training shouldn't stop until all the data is exhausted or --maxsteps is reached!
> have you by chance already encountered this problem of training stopping before the maximum number of steps, but without the process exiting?
No, could you please share more info? Are you training with multiple GPUs? Did this problem happen multiple times? Sometimes PyTorch can be (infrequently) unstable with multiple GPUs, but this shouldn't repeat.
I'm currently training on one GPU with 12 GB. In this line, inside the if, I added
torch.distributed.init_process_group(
    backend='nccl', init_method='env://')
because otherwise I couldn't launch the training and got an error message. So I don't know whether that could be the cause.
I also changed this line to qid, pos, neg = line.strip().split('\t').
Otherwise I didn't change anything. The stall can occur after 600 steps or 3000, typically. I can still see the process with htop, and the GPU memory is still allocated but doesn't change (typically 43% is used).
Maybe it comes from the triples.p file that I created. I will try relaunching with the --resume argument to see whether it stalls on the same triples (making sure the triples are fed in the same order).
I see. For one GPU, you can just do python -m colbert.train [....] without the torch.distributed stuff! That would work more smoothly.
For the line that you changed to .split('\t'), are you passing qid, pos, and neg as IDs into collection.tsv and queries.tsv, or as text?
> I see. For one GPU, you can just do python -m colbert.train [....] without the torch.distributed stuff! That would work more smoothly.
Oh OK, that makes more sense, thank you!
> For the line that you changed to .split('\t'), are you passing qid, pos, and neg as IDs into collection.tsv and queries.tsv, or as text?
Yes, I forgot to say that I also changed the next line to triples.append((int(qid), int(pos), int(neg))).
It seems to work for me after deleting the torch.distributed.init_process_group(backend='nccl', init_method='env://') line and launching it directly! Thank you very much.
> Training expects triples (query, positive, negative) and you can shuffle them and use more negatives (or repeat triples) as many times as you need. However, epochs are almost never needed in practice, because instead of repeating the same triple, you can just use (query, positive, NEW negative) to see more examples!
> You can mine the top-200 or top-1000 BM25 negatives per query, giving you up to 3200 x 10 x 1000 = 32M triples! With batch size 32, that's 1M training steps, so you don't need epochs.
Hi Omar, wouldn't this unbalanced data (multiple negative examples for one query and one positive example) lead to a large bias in the model towards the positive example, or am I missing something? Thanks!
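For context on this question: pairwise training over triples scores exactly one positive against one negative per example, so each step is balanced even though the same positive recurs across many triples with different negatives. A minimal sketch of such a pairwise softmax loss (an illustration, not ColBERT's actual implementation):

```python
import math

def pairwise_loss(pos_score, neg_score):
    """Softmax cross-entropy over one (positive, negative) score pair:
    -log( exp(pos) / (exp(pos) + exp(neg)) ).

    Each triple contributes exactly one positive and one negative, so
    the per-step objective is balanced; repeating a positive across
    triples only means it is contrasted against more negatives.
    """
    m = max(pos_score, neg_score)  # subtract the max for stability
    return -(pos_score - m) + math.log(
        math.exp(pos_score - m) + math.exp(neg_score - m)
    )
```

When the positive scores well above the negative, the loss is near zero; when the ranking is inverted, the loss grows.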