
nlpjcl / rag-retrieval


Unified efficient fine-tuning for RAG retrieval, including Embedding, ColBERT, and Cross Encoder.

License: MIT License

Python 97.77% Shell 2.23%
ai llm nlp rag retrieval-augmented-generation

rag-retrieval's Introduction

Hi there 车中草同学 (A Grass in the Cart) 👋

rag-retrieval's People

Contributors

buaadreamer · nlpjcl


rag-retrieval's Issues

In-batch negative sampling


Hi. With random in-batch negative sampling, a sample's negative may happen to be another in-batch sample's positive. If such a sample is drawn, could it hurt model performance? And if this random sampling approach is used, does the dataset need to be sufficiently large?
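A minimal sketch of the mechanism being asked about (hypothetical helper names, not this repository's code): in-batch InfoNCE treats every off-diagonal document in the batch as a negative, and an optional mask can drop known false negatives when duplicate documents can be identified by id.

import torch
import torch.nn.functional as F

def in_batch_infonce(query_emb, doc_emb, temperature=0.02):
    # query_emb, doc_emb: [B, D]; row i of doc_emb is the positive for
    # query i, and every other row in the batch counts as a negative.
    sim = query_emb @ doc_emb.transpose(-1, -2) / temperature  # [B, B]
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

def masked_in_batch_infonce(query_emb, doc_emb, doc_ids, temperature=0.02):
    # Same loss, but off-diagonal documents sharing an id with the positive
    # (false negatives) are masked out; doc_ids: [B] long tensor.
    sim = query_emb @ doc_emb.transpose(-1, -2) / temperature
    same = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)  # [B, B]
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(same & ~eye, float("-inf"))
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

With a large, deduplicated dataset such collisions are rare, which is why plain in-batch sampling usually works well despite them.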

Error when fine-tuning bge-m3

Hi, when I fine-tune bge-m3 with colbert I get an error. What could be the cause?

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
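As the error text itself suggests, one common workaround is to forward find_unused_parameters=True to DDP. A sketch assuming the training script drives DDP through 🤗 Accelerate (as the tracebacks elsewhere in these issues suggest):

from accelerate import Accelerator, DistributedDataParallelKwargs

# Forward find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel.
# This tolerates parameters that receive no gradient in a step (common when
# only part of a model participates in the loss), at some speed cost.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])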

How to prepare a Reranker fine-tuning dataset?

I fine-tuned the embedding model with a dataset in the "query with positives and hard negatives" format. How should I prepare a fine-tuning dataset for the Reranker? In the introduction you mention that two kinds of data can be used; are there examples of each, and how should they be prepared?
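For illustration only (hypothetical field names; the repository's exact schema isn't quoted in this thread), reranker training data commonly takes one of two shapes: pairwise data in the same query/pos/neg layout used for embeddings, or pointwise data with an explicit relevance label per query-passage pair:

{"query": "...", "pos": ["a relevant passage"], "neg": ["a hard negative passage"]}
{"query": "...", "content": "a candidate passage", "label": 1}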

Should fine-tuning BCEmbedding also use the XLMRoberta configuration?

In the README under the Reranker directory, you note that fine-tuning a multilingual model like BCERanker requires the XLMRoberta configuration file. Since BCEmbedding is also multilingual, should it use the XLMRoberta configuration as well?

The Hugging Face issue shows as merged; can these two lines of code be removed?

While studying your embedding model code, I noticed a comment in the save_pretrained function about a bug caused by Hugging Face. Looking into the referenced Hugging Face issue, it seems that bug has already been fixed (issue link). I'm not familiar with Hugging Face's saving APIs, so I'd like to ask: can the two lines below be removed?

merge_command = f"rm  {save_dir}/model.safetensors"
subprocess.run(merge_command, shell=True)

Looking forward to your reply.
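As an aside, a pure-Python equivalent of those two lines (a sketch, not a confirmed safe change to the repository) avoids the shell and tolerates the file already being absent:

import os

# Remove the stale model.safetensors only if it exists, instead of
# shelling out to rm; a no-op once the upstream bug no longer writes it.
path = os.path.join(save_dir, "model.safetensors")
if os.path.exists(path):
    os.remove(path)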

How to inspect model quality more intuitively

In the embedding example for loading a model and running inference, the inputs are encoded and the resulting tensor is printed. How should I judge my model's quality from this output?

embedding = Embedding.from_pretrained(
    ckpt_path,
)
embedding.to(cuda_device)
input_lst = ['我喜欢**', '我爱爬泰山']
embedding = embedding.encode(input_lst, device=cuda_device)
print(embedding.tolist())
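One quick sanity check (a sketch; it assumes encode() returns an [N, D] torch tensor, as print(embedding.tolist()) above suggests, and reuses ckpt_path/cuda_device from that snippet): related sentences should score noticeably higher cosine similarity than unrelated ones.

import torch.nn.functional as F

model = Embedding.from_pretrained(ckpt_path)
model.to(cuda_device)
sentences = ['我爱爬泰山', '泰山是五岳之首', '今天天气不错']
emb = model.encode(sentences, device=cuda_device)  # assumed [3, D] tensor
emb = F.normalize(emb, dim=-1)                     # unit-normalize rows
sim = emb @ emb.transpose(-1, -2)                  # [3, 3] cosine matrix
print(sim)  # expect sim[0][1] (related) > sim[0][2] (unrelated)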

Errors when training bge-m3 and bce-embedding-base_v1

Training either of the above models with train_embedding.sh fails:
Traceback (most recent call last):
  File "train_embedding.py", line 182, in <module>
    main()
  File "train_embedding.py", line 109, in main
    model = accelerator.prepare(model)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1292, in prepare
    result = tuple(
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 1443, in prepare_model
    self.state.fsdp_plugin.set_auto_wrap_policy(model)
  File "/home/nlp/miniconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/utils/dataclasses.py", line 1182, in set_auto_wrap_policy
    raise Exception("Could not find the transformer layer class to wrap in the model.")
Exception: Could not find the transformer layer class to wrap in the model.
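This exception comes from the FSDP auto-wrap policy: the layer class named in the accelerate config (../configs/default_fsdp.yaml in the script below) is not found inside these models. A hedged sketch of the relevant config key (surrounding keys depend on the accelerate version; both models are multilingual, XLMRoberta-style checkpoints, matching the XLMRoberta advice in the issue above):

distributed_type: FSDP
fsdp_config:
  # Must name a module class that actually occurs in the model;
  # for XLMRoberta-based checkpoints that is XLMRobertaLayer, not BertLayer.
  fsdp_transformer_layer_cls_to_wrap: XLMRobertaLayer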

Problems during evaluation after fine-tuning colbert

Following the readme in the colbert directory and only swapping in my own data, I found after training that the evaluation results differ on every run.

from sentence_transformers import SentenceTransformer, CrossEncoder, util
cross_encoder = CrossEncoder("./m3/model")
corpus = ["西安","太原","北京","海南"] 
top_res=cross_encoder.rank(query="陕西省的省会城市是哪个", documents=corpus, top_k=3, return_documents=True)
print(top_res)
  1. The following message is printed; is this normal?

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at ./m3/model and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

  2. The results differ on every run; executing the same code twice gives:

[{'corpus_id': 0, 'score': 0.5507802, 'text': '西安'}, {'corpus_id': 2, 'score': 0.54930556, 'text': '北京'}, {'corpus_id': 1, 'score': 0.5357058, 'text': '太原'}]

[{'corpus_id': 1, 'score': 0.5277521, 'text': '太原'}, {'corpus_id': 2, 'score': 0.5160099, 'text': '北京'}, {'corpus_id': 0, 'score': 0.4968127, 'text': '西安'}]

Could you help look into what causes this?
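One experiment consistent with the warning above (a sketch, not a confirmed diagnosis): the warning says the classifier head is newly initialized at load time, so each unseeded load gets different random head weights and hence different scores. Fixing the seed before loading should make two runs agree, which would confirm the varying results come from the missing head rather than the evaluation itself:

import torch
from sentence_transformers import CrossEncoder

torch.manual_seed(0)  # make the randomly initialized classifier head repeatable
cross_encoder = CrossEncoder("./m3/model")
corpus = ["西安", "太原", "北京", "海南"]
print(cross_encoder.rank(query="陕西省的省会城市是哪个", documents=corpus, top_k=3, return_documents=True))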

Why is the loss NaN when training the "m3e-base" model?

Training the open-source m3e-base model with the same parameters, the loss has been NaN from the first iteration onward, while training bge-base-zh-v1.5 with the same configuration and parameters works fine. Why might that be?
The training script is as follows:
CUDA_VISIBLE_DEVICES="0,1" nohup accelerate launch --config_file ../configs/default_fsdp.yaml train_embedding.py \
    --model_name_or_path "../models/m3e-base" \
    --dataset "../datas/embedding/t2rank_100.json" \
    --output_dir "../results/m3e-base" \
    --batch_size 8 \
    --lr 2e-5 \
    --epochs 4 \
    --save_on_epoch_end 1 \
    --gradient_accumulation_steps 24 \
    --log_with 'wandb' \
    --warmup_proportion 0.1 \
    --neg_nums 2 \
    --temperature 0.02 \
    --query_max_len 128 \
    --passage_max_len 512 \
    > ../logs/m3e-base.log &
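A generic way to localize this (a sketch, not specific to this repository; variable names in the usage comment are hypothetical): when the loss is NaN from the very first step, the forward pass usually already produces non-finite values, e.g. overflowing logits after division by a small temperature, so asserting finiteness at each stage shows where NaN first enters:

import torch

def check_finite(name, tensor):
    # Raise as soon as a tensor contains NaN/Inf so the first offending
    # stage (embeddings, similarity matrix, loss) is easy to identify.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN/Inf")

# Usage inside the training loop (hypothetical names):
#   check_finite("query_embeddings", query_embeddings)
#   check_finite("sim_matrix", sim_matrix)
#   check_finite("loss", loss)
# torch.autograd.set_detect_anomaly(True) can additionally pinpoint the
# backward op producing NaN gradients (slow; for debugging only).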

Training question

A question for the author: does this implementation give better results than BGE's own training code?

Loss error when fine-tuning the embedding with query-pos format data

Base model: bge_large
Training data format: a jsonl file, one line each like {"query": "ssss", "pos": ["xxx"]}
Without modifying any code, fine-tuning the embedding per the readme raises:

AttributeError: 'builtin_function_or_method' object has no attribute 'transpose'

The offending code is at https://github.com/NLPJCL/RAG-Retrieval/blob/master/rag_retrieval/train/embedding/model.py#L79:
sim_matrix = query_embeddings @ pos_doc_embeddings.unsqueeze.transpose(-1, -2)
There is no need to add a dimension here; changing it to:
sim_matrix = query_embeddings @ pos_doc_embeddings.transpose(-1, -2)
makes training work normally.
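For clarity, a self-contained sketch of the corrected line with assumed shapes (random tensors, not the repository's training code):

import torch

B, D = 4, 8
query_embeddings = torch.randn(B, D)     # one query embedding per row
pos_doc_embeddings = torch.randn(B, D)   # matching positive per row

# [B, D] @ [D, B] -> [B, B]; entry (i, j) scores query i against doc j,
# so the diagonal holds each query's score with its own positive.
sim_matrix = query_embeddings @ pos_doc_embeddings.transpose(-1, -2)
print(sim_matrix.shape)  # torch.Size([4, 4])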
