Comments (6)
The link you provided for JaColBERT is not working.
Oh, good catch! Updated it, the proper link is here
What would you suggest if I want to train on a new language? Can you share your experience? I hope you can write down some steps so I can follow them. I'd be happy to bring ColBERT to a new language if I can train it successfully.
There's some information on the training in the technical report. Otherwise, it should be pretty straightforward to train a ColBERT model using the utils in RAGatouille, since training JaColBERT is what led to writing the lib! The training utilities handle hard negative mining, etc. for you.
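For readers unfamiliar with the term, here is a minimal sketch of what "hard negative mining" means. It is a toy illustration only and not RAGatouille's implementation: the token-overlap scorer stands in for a real retriever, and the names `overlap_score`, `mine_hard_negatives` and `build_triplets` are hypothetical helpers made up for this example.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercase whitespace tokenizer; a real pipeline would use a language-aware one."""
    return text.lower().split()


def overlap_score(query, document):
    """Toy relevance score (assumption): shared-token count, normalised by document length."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(document))
    shared = sum((q & d).values())
    return shared / math.sqrt(len(tokenize(document)) or 1)


def mine_hard_negatives(query, positives, corpus, num_negatives=2):
    """Return the highest-scoring corpus documents that are NOT labelled positive."""
    candidates = [doc for doc in corpus if doc not in positives]
    ranked = sorted(candidates, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:num_negatives]


def build_triplets(pairs, corpus, num_negatives=2):
    """Turn (query, positive) pairs into (query, positive, negative) triplets."""
    positives_by_query = {}
    for query, positive in pairs:
        positives_by_query.setdefault(query, set()).add(positive)
    triplets = []
    for query, positive in pairs:
        negatives = mine_hard_negatives(
            query, positives_by_query[query], corpus, num_negatives
        )
        for negative in negatives:
            triplets.append((query, positive, negative))
    return triplets


# Made-up example data, purely for illustration.
corpus = [
    "tokyo is the capital of japan",
    "kyoto was the capital of japan for centuries",
    "osaka is famous for street food",
    "paris is the capital of france",
]
pairs = [("what is the capital of japan", "tokyo is the capital of japan")]
triplets = build_triplets(pairs, corpus, num_negatives=2)
```

The point is that the negatives are documents that look relevant to the query but are not labelled as positives, which makes them more informative for training than randomly sampled ones. A real setup would rank candidates with a proper retriever rather than token overlap.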
I forgot to mention: how long did it take, and how much did it cost, to train it? If you trained it on a home device, what are your specs?
I was lucky to get some GPU credits, so I trained it on 8 Nvidia L4 GPUs for around 10 hours. I'm pretty sure you could still get decent results with less data and weaker hardware!
Hey!
ColBERT is essentially a family of models; the examples use the original ColBERTv2, which is English-only.
There's a lot of interest in multilingual models and I'm hoping to be able to make that happen eventually, and I know other people are also working on it. Supporting more languages with ColBERT models would be fantastic! I'll leave this issue open as Help Wanted!
As of right now, I'm aware of these models that support non-English languages:
- 🇯🇵 bclavie/JaColBERT (by myself, trained using a mix of ColBERTv1 and ColBERTv2 methods)
- 🇫🇷 antoinelouis/colbertv1-camembert-base-mmarcoFR
- 🇬🇧🇨🇳🇷🇺... Suraj Nair et al.'s family of models for some multilingual support (English queries retrieving documents in multiple languages), but these are not fully compatible with the main ColBERT codebase, which RAGatouille is based on (definitely interesting to merge in the future!)
The link you provided for JaColBERT is not working:
The model is trained on the Japanese split of MMARCO, augmented with hard negatives. The data, including the hard negatives, is available on Hugging Face datasets.
Hey!
I had a go at training ColBERT for Spanish a few weeks ago. Unfortunately, I still haven't had time to properly evaluate it, but if anyone wants to try it, it can be found here:
AdrienB134/ColBERTv1.0-bert-based-spanish-mmarcoES
Hello @bclavie,
Maybe this one is interesting:
ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval
https://github.com/ant-louis/xm-retrievers
https://huggingface.co/antoinelouis/colbert-xm/