This is the source code for the paper *Towards End-to-End Open Conversational Machine Reading*.
(The code is still being cleaned and updated.)
Please refer to MUDERN and OSCAR for preparing the OR-CMR raw datasets under the folder `./data`, then follow the processing steps below.
For convenience, we build a discourse-segmented version of our rule-text knowledge base beforehand.
- Pytorch==0.4.1
- NLTK==3.4.5
- numpy==1.18.1
- pycparser==2.20
- six==1.14.0
- tqdm==4.44.1
- Run `cd segedu`
- Run `pip install -r requirements.txt`
- Run `python open_sharc_discourse_segmentation.py`
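The segmentation script above relies on SegEDU, a trained neural discourse segmenter. Purely as an illustration of what a discourse-segmented rule text looks like, the toy sketch below splits a rule text into clause-like units at a few common discourse markers; the marker list and `naive_segment` helper are illustrative stand-ins, not the actual SegEDU model:

```python
import re

# Illustrative discourse markers; the real SegEDU segmenter is a trained neural model.
MARKERS = r"\b(if|unless|when|while|and|but|or)\b"

def naive_segment(rule_text):
    """Split a rule text into rough clause-like units (EDUs) at discourse markers."""
    # Insert a boundary marker before each discourse marker, then split and clean up.
    marked = re.sub(MARKERS, r"|\1", rule_text)
    return [seg.strip() for seg in marked.split("|") if seg.strip()]

segments = naive_segment(
    "You can claim the benefit if you live in the UK and you are over 18."
)
```

Each resulting unit roughly corresponds to one condition the model can later ask a follow-up question about.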
For convenience, we precompute the retrieved rule texts for every rule text beforehand.
- numpy
- scikit-learn
- regex
- tqdm
- Scipy
- NLTK
- elasticsearch
- pexpect==4.2.1
- Run `pip install -r requirements.txt`
- Build the SQLite DB (here `base_dir=./data` and `db_path=./data/sharc_raw/json/sharc_open_id2snippet.json`) via:

  ```
  mkdir -p ${base_dir}/tfidf
  python3 build_db.py ${db_path} ${base_dir}/tfidf/db.db --num-workers 60
  ```
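Conceptually, `build_db.py` stores every rule-text snippet in a SQLite table keyed by its id, in the style of a DrQA document database. A minimal sketch, assuming `sharc_open_id2snippet.json` maps snippet ids to rule texts (the table schema here is an assumption, not the repository's exact one):

```python
import sqlite3

def build_db(id2snippet, db_path):
    """Store (id, text) pairs in a SQLite table, mimicking a DrQA-style document DB."""
    # In the real pipeline, id2snippet is loaded from
    # ./data/sharc_raw/json/sharc_open_id2snippet.json
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT);")
    conn.executemany("INSERT INTO documents VALUES (?, ?);", id2snippet.items())
    conn.commit()
    return conn

# Toy mapping standing in for the real id2snippet file.
id2snippet = {"snippet-1": "You can claim the benefit if you live in the UK."}
conn = build_db(id2snippet, ":memory:")
rows = conn.execute(
    "SELECT text FROM documents WHERE id = ?", ("snippet-1",)
).fetchall()
```

The retriever then only needs the id of a top-ranked snippet to fetch its full text.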
- Run the following command to build the TF-IDF index:

  ```
  python3 build_tfidf.py ${base_dir}/tfidf/db.db ${base_dir}/tfidf
  ```

  It will save the TF-IDF index in `${base_dir}/tfidf`.
- Run the inference code to save the retrieval results:

  ```
  bash inference_tfidf.sh
  ```
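At retrieval time, each rule text is scored against the query by its TF-IDF term weights. The pure-Python sketch below shows the idea on whitespace-tokenized text; the actual `build_tfidf.py` index (hashed n-gram features, sparse matrices) is more elaborate:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a list of whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def retrieve(query, docs, k=1):
    """Return the top-k documents by summed TF-IDF weight of the query terms."""
    doc_vecs = tfidf_vectors(docs)
    q_terms = query.lower().split()
    scores = [sum(vec.get(t, 0.0) for t in q_terms) for vec in doc_vecs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

docs = [
    "you can claim the benefit if you live in the uk",
    "how to apply for a visa from outside the country",
    "rules for claiming a state pension early",
]
top = retrieve("claim benefit", docs, k=1)
```

Terms that occur in many rule texts get a low inverse-document-frequency weight, so the ranking is driven by the query's distinctive words.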
Tokenize the user information and construct the dialogue tree.
- Python 3.6
- Pytorch (1.6.0)
- NLTK (3.4.5)
- spacy (2.0.16)
- transformers (4.3.2)
- Run `cd ./UniCMR`
- Run `pip install -r requirements.txt`
- Run `bash preprocess.sh`
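The preprocessing step tokenizes the user information and builds a dialogue tree over the rule text. The exact format is defined by the scripts behind `preprocess.sh`; the dataclass below is a purely illustrative sketch of such a tree, where each node holds one question or condition and its follow-up subtree:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogueNode:
    """Illustrative tree node (not the repository's actual data format)."""
    text: str
    children: List["DialogueNode"] = field(default_factory=list)

    def depth(self):
        # Depth of the subtree rooted at this node.
        return 1 + max((c.depth() for c in self.children), default=0)

# Toy tree: an initial question with two follow-up conditions, one of them nested.
root = DialogueNode("Can I claim the benefit?", [
    DialogueNode("Do you live in the UK?"),
    DialogueNode("Are you over 18?", [DialogueNode("Are you a full-time student?")]),
])
```

A tree like this lets the model decide, at each turn, whether to answer directly or to ask the next unresolved condition.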
Training and inference of our UniCMR.
- Python 3.6
- Pytorch (1.6.0)
- NLTK (3.4.5)
- spacy (2.0.16)
- transformers (4.3.2)
- Run `cd ./UniCMR`
- Run `pip install -r requirements.txt`
- Run `bash run.sh`
Part of our code is borrowed from Open-Retrieval Conversational Machine Reading; many thanks.