devsinghsachan / art
Code and models for the paper "Questions Are All You Need to Train a Dense Passage Retriever (TACL 2023)"
License: Other
Hi, thanks for your excellent work. I ran into a problem when rebuilding the evidence embeddings during the training stage.
Batch 39000 | Total 19968000
Batch 40000 | Total 20480000
Batch 41000 | Total 20992000
[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801009 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801846 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800915 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): terminate called after throwing an instance of '[Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801009 milliseconds before timing out.std::runtime_error
This looks like a timeout caused by building the index. How can I solve this problem? What is the performance impact if I don't rebuild evidence embeddings during the training stage?
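The log reports Timeout(ms)=1800000, which is 30 minutes. If the ALLREDUCE is simply waiting on ranks that are still busy while the evidence embeddings are rebuilt, one thing that may help is a longer collective timeout. The sketch below is not taken from this repository: parse_watchdog_timeouts and init_with_longer_timeout are hypothetical helpers, and it assumes the training script reaches torch.distributed.init_process_group (which does accept a timeout argument) and that the launcher sets the usual env:// rendezvous variables. It does not address the question about the performance impact of skipping the rebuild.

```python
import re
from datetime import timedelta

# Matches the watchdog line format shown in the log above.
WATCHDOG_RE = re.compile(
    r"\[Rank (\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(\d+), OpType=(\w+), Timeout\(ms\)=(\d+)\)"
)


def parse_watchdog_timeouts(log_text):
    """Return (rank, seq_num, op_type, timeout) tuples found in log_text."""
    return [
        (int(rank), int(seq), op, timedelta(milliseconds=int(ms)))
        for rank, seq, op, ms in WATCHDOG_RE.findall(log_text)
    ]


def init_with_longer_timeout(hours=3):
    """Hypothetical helper: initialise NCCL with a longer collective timeout.

    Assumption: called once per process, in place of the existing
    init_process_group call, with env:// rendezvous variables already set.
    The 3-hour default is an arbitrary illustrative value.
    """
    import torch.distributed as dist  # imported lazily; requires PyTorch

    dist.init_process_group(backend="nccl", timeout=timedelta(hours=hours))


# One line copied from the log above, used to confirm the configured timeout.
sample = (
    "[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective "
    "operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, "
    "Timeout(ms)=1800000) ran for 1801009 milliseconds before timing out."
)
events = parse_watchdog_timeouts(sample)
```

Here events holds one entry for rank 3 with a 30-minute timeout, which matches the PyTorch default for process groups at the time of this log. Raising the timeout only hides the symptom if one rank is genuinely stuck, so it is worth checking first that the rank missing from the error list is still making progress on the index.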
I cannot find the JSON files after running the data download script examples/helper-scripts/download_data.sh.
Where can I find those files? Thank you in advance.