
nlpjcl / rag-retrieval


Unified efficient fine-tuning for RAG retrieval, including Embedding, ColBERT, and Cross Encoder.

License: MIT License

Python 97.77% Shell 2.23%
ai llm nlp rag retrieval-augmented-generation

rag-retrieval's Introduction

Hi there 车中草同学 (A Grass in the Cart) 👋

rag-retrieval's People

Contributors

buaadreamer · nlpjcl


rag-retrieval's Issues

In-batch negative sampling


Hi. With random in-batch negative sampling, a sample's negative may happen to be another in-batch sample's positive. If such a sample is drawn, could it hurt model performance? And if this random sampling approach is used, does the dataset need to be sufficiently large?
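A minimal sketch of the mechanism being asked about (hypothetical helper names, not this repository's code): in-batch InfoNCE treats every off-diagonal document in the batch as a negative, and an optional mask can drop known false negatives when duplicate documents can be identified by id.

import torch
import torch.nn.functional as F

def in_batch_infonce(query_emb, doc_emb, temperature=0.02):
    # query_emb, doc_emb: [B, D]; row i of doc_emb is the positive for
    # query i, and every other row in the batch counts as a negative.
    sim = query_emb @ doc_emb.transpose(-1, -2) / temperature  # [B, B]
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

def masked_in_batch_infonce(query_emb, doc_emb, doc_ids, temperature=0.02):
    # Same loss, but off-diagonal documents sharing an id with the positive
    # (false negatives) are masked out; doc_ids: [B] long tensor.
    sim = query_emb @ doc_emb.transpose(-1, -2) / temperature
    same = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)  # [B, B]
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(same & ~eye, float("-inf"))
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

With a large, deduplicated dataset such collisions are rare, which is why plain in-batch sampling usually works well despite them.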

Error when fine-tuning bge-m3

Hi, when I fine-tune bge-m3 with colbert I get an error. What could be the cause?

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
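As the error text itself suggests, one common workaround is to forward find_unused_parameters=True to DDP. A sketch assuming the training script drives DDP through 🤗 Accelerate (as the tracebacks elsewhere in these issues suggest):

from accelerate import Accelerator, DistributedDataParallelKwargs

# Forward find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel.
# This tolerates parameters that receive no gradient in a step (common when
# only part of a model participates in the loss), at some speed cost.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])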

How to prepare a Reranker fine-tuning dataset?

I fine-tuned the embedding model with a dataset in the "query with positives and hard negatives" format. How should I prepare a fine-tuning dataset for the Reranker? In the introduction you mention that two kinds of data can be used; are there examples of each, and how should they be prepared?
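For illustration only (hypothetical field names; the repository's exact schema isn't quoted in this thread), reranker training data commonly takes one of two shapes: pairwise data in the same query/pos/neg layout used for embeddings, or pointwise data with an explicit relevance label per query-passage pair:

{"query": "...", "pos": ["a relevant passage"], "neg": ["a hard negative passage"]}
{"query": "...", "content": "a candidate passage", "label": 1}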

Should fine-tuning BCEmbedding also use the XLMRoberta configuration?

In the README under the Reranker directory, you note that fine-tuning a multilingual model like BCERanker requires the XLMRoberta configuration file. Since BCEmbedding is also multilingual, should it use the XLMRoberta configuration as well?

The Hugging Face issue shows as merged; can these two lines of code be removed?

While studying your embedding model code, I noticed a comment in the save_pretrained function about a bug caused by Hugging Face. Looking into the referenced Hugging Face issue, it seems that bug has already been fixed (issue link). I'm not familiar with Hugging Face's saving APIs, so I'd like to ask: can the two lines below be removed?

merge_command = f"rm  {save_dir}/model.safetensors"
subprocess.run(merge_command, shell=True)

Looking forward to your reply.
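As an aside, a pure-Python equivalent of those two lines (a sketch, not a confirmed safe change to the repository) avoids the shell and tolerates the file already being absent:

import os

# Remove the stale model.safetensors only if it exists, instead of
# shelling out to rm; a no-op once the upstream bug no longer writes it.
path = os.path.join(save_dir, "model.safetensors")
if os.path.exists(path):
    os.remove(path)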

How to inspect model quality more intuitively

In the embedding example for loading a model and running inference, the inputs are encoded and the resulting tensor is printed. How should I judge my model's quality from this output?

embedding = Embedding.from_pretrained(
    ckpt_path,
)
embedding.to(cuda_device)
input_lst = ['我喜欢**', '我爱爬泰山']
embedding = embedding.encode(input_lst, device=cuda_device)
print(embedding.tolist())
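One quick sanity check (a sketch; it assumes encode() returns an [N, D] torch tensor, as print(embedding.tolist()) above suggests, and reuses ckpt_path/cuda_device from that snippet): related sentences should score noticeably higher cosine similarity than unrelated ones.

import torch.nn.functional as F

model = Embedding.from_pretrained(ckpt_path)
model.to(cuda_device)
sentences = ['我爱爬泰山', '泰山是五岳之首', '今天天气不错']
emb = model.encode(sentences, device=cuda_device)  # assumed [3, D] tensor
emb = F.normalize(emb, dim=-1)                     # unit-normalize rows
sim = emb @ emb.transpose(-1, -2)                  # [3, 3] cosine matrix
print(sim)  # expect sim[0][1] (related) > sim[0][2] (unrelated)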

Errors when training bge-m3 and bce-embedding-base_v1

Training either of the above models with train_embedding.sh fails:
Traceback (most recent call last):
  File "train_embedding.py", line 182, in <module>
    main()
  File "train_embedding.py", line 109, in main
    model = accelerator.prepare(model)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1292, in prepare
    result = tuple(
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1443, in prepare_model
    self.state.fsdp_plugin.set_auto_wrap_policy(model)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/utils/dataclasses.py", line 1182, in set_auto_wrap_policy
    raise Exception("Could not find the transformer layer class to wrap in the model.")
Exception: Could not find the transformer layer class to wrap in the model.
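This exception comes from the FSDP auto-wrap policy: the layer class named in the accelerate config (../configs/default_fsdp.yaml in the script below) is not found inside these models. A hedged sketch of the relevant config key (surrounding keys depend on the accelerate version; both models are multilingual, XLMRoberta-style checkpoints, matching the XLMRoberta advice in the issue above):

distributed_type: FSDP
fsdp_config:
  # Must name a module class that actually occurs in the model;
  # for XLMRoberta-based checkpoints that is XLMRobertaLayer, not BertLayer.
  fsdp_transformer_layer_cls_to_wrap: XLMRobertaLayer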

Problems during evaluation after fine-tuning colbert

Following the readme in the colbert directory and only swapping in my own data, I found after training that the evaluation results differ on every run.

from sentence_transformers import SentenceTransformer, CrossEncoder, util
cross_encoder = CrossEncoder("./m3/model")
corpus = ["西安","太原","北京","海南"] 
top_res=cross_encoder.rank(query="陕西省的省会城市是哪个", documents=corpus, top_k=3, return_documents=True)
print(top_res)
  1. The following message is printed; is this normal?

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at ./m3/model and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

  2. The results differ on every run; executing the same code twice gives:

[{'corpus_id': 0, 'score': 0.5507802, 'text': '西安'}, {'corpus_id': 2, 'score': 0.54930556, 'text': '北京'}, {'corpus_id': 1, 'score': 0.5357058, 'text': '太原'}]

[{'corpus_id': 1, 'score': 0.5277521, 'text': '太原'}, {'corpus_id': 2, 'score': 0.5160099, 'text': '北京'}, {'corpus_id': 0, 'score': 0.4968127, 'text': '西安'}]

Could you help look into what causes this?
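One experiment consistent with the warning above (a sketch, not a confirmed diagnosis): the warning says the classifier head is newly initialized at load time, so each unseeded load gets different random head weights and hence different scores. Fixing the seed before loading should make two runs agree, which would confirm the varying results come from the missing head rather than the evaluation itself:

import torch
from sentence_transformers import CrossEncoder

torch.manual_seed(0)  # make the randomly initialized classifier head repeatable
cross_encoder = CrossEncoder("./m3/model")
corpus = ["西安", "太原", "北京", "海南"]
print(cross_encoder.rank(query="陕西省的省会城市是哪个", documents=corpus, top_k=3, return_documents=True))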

Why is the loss NaN when training the "m3e-base" model?

Training the open-source m3e-base model with the same parameters, the loss has been NaN from the first iteration onward, while training bge-base-zh-v1.5 with the same configuration and parameters works fine. Why might that be?
The training script is as follows:
CUDA_VISIBLE_DEVICES="0,1" nohup accelerate launch --config_file ../configs/default_fsdp.yaml train_embedding.py \
    --model_name_or_path "../models/m3e-base" \
    --dataset "../datas/embedding/t2rank_100.json" \
    --output_dir "../results/m3e-base" \
    --batch_size 8 \
    --lr 2e-5 \
    --epochs 4 \
    --save_on_epoch_end 1 \
    --gradient_accumulation_steps 24 \
    --log_with 'wandb' \
    --warmup_proportion 0.1 \
    --neg_nums 2 \
    --temperature 0.02 \
    --query_max_len 128 \
    --passage_max_len 512 \
    > ../logs/m3e-base.log &
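A generic way to localize this (a sketch, not specific to this repository; variable names in the usage comment are hypothetical): when the loss is NaN from the very first step, the forward pass usually already produces non-finite values, e.g. overflowing logits after division by a small temperature, so asserting finiteness at each stage shows where NaN first enters:

import torch

def check_finite(name, tensor):
    # Raise as soon as a tensor contains NaN/Inf so the first offending
    # stage (embeddings, similarity matrix, loss) is easy to identify.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN/Inf")

# Usage inside the training loop (hypothetical names):
#   check_finite("query_embeddings", query_embeddings)
#   check_finite("sim_matrix", sim_matrix)
#   check_finite("loss", loss)
# torch.autograd.set_detect_anomaly(True) can additionally pinpoint the
# backward op producing NaN gradients (slow; for debugging only).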

Training question

A question for the author: does this implementation give better results than BGE's own training code?

Loss error when fine-tuning the embedding with query-pos format data

Base model: bge_large
Training data format: a jsonl file, one line each like {"query": "ssss", "pos": ["xxx"]}
Without modifying any code, fine-tuning the embedding per the readme raises:

AttributeError: 'builtin_function_or_method' object has no attribute 'transpose'

The offending code is at https://github.com/NLPJCL/RAG-Retrieval/blob/master/rag_retrieval/train/embedding/model.py#L79:
sim_matrix = query_embeddings @ pos_doc_embeddings.unsqueeze.transpose(-1, -2)
There is no need to add a dimension here; changing it to:
sim_matrix = query_embeddings @ pos_doc_embeddings.transpose(-1, -2)
makes training work normally.
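For clarity, a self-contained sketch of the corrected line with assumed shapes (random tensors, not the repository's training code):

import torch

B, D = 4, 8
query_embeddings = torch.randn(B, D)     # one query embedding per row
pos_doc_embeddings = torch.randn(B, D)   # matching positive per row

# [B, D] @ [D, B] -> [B, B]; entry (i, j) scores query i against doc j,
# so the diagonal holds each query's score with its own positive.
sim_matrix = query_embeddings @ pos_doc_embeddings.transpose(-1, -2)
print(sim_matrix.shape)  # torch.Size([4, 4])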
