arvinzhuang / dsi-transformers
A huggingface transformers implementation of "Transformer Memory as a Differentiable Search Index"
License: MIT License
I notice the code sets max_steps=1000000 (1000k), but the hits@1 and hits@10 figures only show scores up to 120k steps. Does training continue to the full 1000k steps?
Hello, I can't get step 2 of the official code to run. Here is the error message:
Traceback (most recent call last):
File "train.py", line 153, in
main()
File "train.py", line 77, in main
tokenizer = T5Tokenizer.from_pretrained(model_name, cache_dir='cache')
File "/opt/conda/envs/wyc_308/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1724, in from_pretrained
resolved_vocab_files[file_id] = cached_path(
File "/opt/conda/envs/wyc_308/lib/python3.8/site-packages/transformers/file_utils.py", line 1921, in cached_path
output_path = get_from_cache(
File "/opt/conda/envs/wyc_308/lib/python3.8/site-packages/transformers/file_utils.py", line 2177, in get_from_cache
raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/projects/wyc/dsi/wandb/offline-run-20230718_015122-1nx1bm3h
wandb: Find logs at: ./wandb/offline-run-20230718_015122-1nx1bm3h/logs
Do you have any suggestions?
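The traceback shows `from_pretrained` failing because the tokenizer files can't be downloaded. One common workaround, sketched below, is to fetch the files once on a machine with internet access and then run training fully offline. (This assumes the `model_name` in train.py is `t5-base`; adjust to whatever name the script actually uses.)

```shell
# Download the model/tokenizer files once into a local directory
# (requires the huggingface_hub CLI and a working connection).
huggingface-cli download t5-base --local-dir ./t5-base-local

# Then run training without any network access; transformers will
# refuse to hit the network and use only local/cached files.
TRANSFORMERS_OFFLINE=1 python train.py
```

You can also point `T5Tokenizer.from_pretrained("./t5-base-local")` directly at the downloaded directory instead of the hub name.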
Three types of docid representation are introduced in the paper "Transformer Memory as a Differentiable Search Index": Unstructured Atomic Identifiers, Naively Structured String Identifiers, and Semantically Structured Identifiers.
Your code currently implements only the first type, Unstructured Atomic Identifiers: in the decoding phase, only integer docids are generated. I believe one potential cause of the lower performance compared to the source paper might be a suboptimal selection of INT_TOKEN_IDS.
I suggest removing this section and retraining the DSI model.
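For context, the idea behind an `INT_TOKEN_IDS` set is to restrict decoding to vocabulary entries that are purely digits, so the model can only emit integer docids. Below is a minimal sketch of how such a set could be built; the toy `vocab` dict stands in for `tokenizer.get_vocab()`, and `build_int_token_ids` is an illustrative helper name, not necessarily what this repo calls it.

```python
def build_int_token_ids(vocab, eos_token_id):
    """Collect ids of vocab entries that are purely digits, plus EOS.

    `vocab` maps token string -> token id, as returned by
    tokenizer.get_vocab(). SentencePiece marks word-initial pieces
    with a leading "▁", so that prefix is stripped before checking.
    """
    ids = [tid for tok, tid in vocab.items() if tok.lstrip("▁").isdigit()]
    ids.append(eos_token_id)  # decoding must be able to terminate
    return sorted(set(ids))

# Toy vocabulary: both "7" and word-initial "▁7" qualify; "a1" does not.
toy_vocab = {"▁the": 5, "7": 100, "▁7": 101, "42": 102, "a1": 103, "</s>": 1}
print(build_int_token_ids(toy_vocab, eos_token_id=1))  # [1, 100, 101, 102]
```

The resulting list can then be supplied to `model.generate(..., prefix_allowed_tokens_fn=lambda batch_id, input_ids: INT_TOKEN_IDS)` so every decoding step is constrained to digits.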
Hi, maybe a semantic string docid would help improve the performance of DSI? In data/create_NQ_train_vali.py, a random doc id is used.
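For reference, the paper's Semantically Structured Identifiers recursively cluster document embeddings and append the cluster index at each level, so similar documents share docid prefixes. The sketch below illustrates only the id-assignment recursion; for brevity it splits on the median of the first embedding dimension, whereas the paper uses hierarchical k-means, so a real implementation would swap in a proper clustering step. All names here are illustrative.

```python
def assign_semantic_ids(embeddings, indices=None, prefix="", leaf_size=2):
    """Recursively assign prefix-structured string docids.

    embeddings: list of embedding vectors (one per document).
    Returns {doc_index: docid_string}; documents placed in the same
    subtree share a docid prefix.
    """
    if indices is None:
        indices = list(range(len(embeddings)))
    if len(indices) <= leaf_size:
        # Small enough: enumerate remaining docs within this leaf.
        return {doc: prefix + str(i) for i, doc in enumerate(indices)}
    # Stand-in "clustering": split on the median of the first dimension.
    vals = [embeddings[i][0] for i in indices]
    median = sorted(vals)[len(vals) // 2]
    left = [i for i in indices if embeddings[i][0] < median]
    right = [i for i in indices if embeddings[i][0] >= median]
    if not left or not right:  # degenerate split: stop recursing
        return {doc: prefix + str(i) for i, doc in enumerate(indices)}
    ids = {}
    ids.update(assign_semantic_ids(embeddings, left, prefix + "0", leaf_size))
    ids.update(assign_semantic_ids(embeddings, right, prefix + "1", leaf_size))
    return ids

ids = assign_semantic_ids([[0.1], [0.2], [0.9], [0.8], [0.15], [0.85]])
# docs 0, 1, 4 (small first dimension) share the "0" prefix; 2, 3, 5 share "1"
```

Because prefixes carry meaning, the decoder can generalize across docids with shared structure, which is the intuition behind the performance gain reported in the paper.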
When I download the dataset, an error appears:
OSError: Expected to be able to read 451045432 bytes for message body, got 280142879
Hello, I've read your code carefully and have a question about the data loading part (I'm trying to apply it to computer vision, though it may not work there).
One thing I don't understand: the dataset you train on contains both questions and documents, but I don't see any special handling that distinguishes them in the dataset.
Shouldn't the input be one question together with multiple documents, from which the question's docid is autoregressively generated? (From your code I suspect I may have misunderstood this.)
Could you explain this a little?
Many thanks.
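On the question above: in the DSI setup described in the paper, each training example maps one input text to one docid string; there is no question-plus-many-documents input. Indexing examples (document text → docid) and retrieval examples (query → docid) are simply mixed in the same dataset. A minimal sketch, with illustrative names and toy data (not this repo's exact code):

```python
def make_training_pairs(docs, queries):
    """Build (input_text, target_docid) pairs for seq2seq training.

    docs:    {docid: document_text} for the indexing task.
    queries: list of (query_text, docid) for the retrieval task.
    """
    pairs = []
    for docid, text in docs.items():
        pairs.append((text, str(docid)))   # indexing: document -> docid
    for query, docid in queries:
        pairs.append((query, str(docid)))  # retrieval: query -> docid
    return pairs

pairs = make_training_pairs({17: "nile river facts"},
                            [("how long is the nile", 17)])
print(pairs)  # [('nile river facts', '17'), ('how long is the nile', '17')]
```

Each pair is then tokenized independently, so the model sees one input sequence and one docid target per example.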