ccpr_release

1. Introduction

The code for our paper Cross-lingual Contextualized Phrase Retrieval.

NOTE: Since this project contains many pipelines and each part was finished separately over the course of this long-term project, I have not tested the whole project from scratch again; that is one item on my TODO list. However, I think the code and scripts are helpful for people who are curious about how we implemented our method. Please feel free to ask any questions about this project. My email address: li.huayang.lh6 [at] is.naist.jp

2. TODO List

  • Unify the python environment
  • Test those scripts one more time
  • Release the human annotated data for retrieval
  • Release the code for training
  • Release the pre-trained model
  • Release the code for retrieval inference
  • Release the code for MT inference

3. Environment

TODO: Unify the python environment

  1. [Preparing Training Data] GIZA++ requires python2.7
  2. [Training Model] Our project requires python3.9 + transformers==4.27.1
  3. [SFT LLM for MT] Platypus requires transformers>=4.31.0
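Until the environments are unified, one workable setup is to keep them apart in separate conda environments. Below is a minimal sketch; the environment names are ours, not part of the project, and Python 2.7 may require an older conda channel:

# illustrative conda setup; environment names are not part of the project
conda create -n ccpr-giza python=2.7       # [Preparing Training Data] GIZA++
conda create -n ccpr-train python=3.9      # [Training Model]
conda run -n ccpr-train pip install transformers==4.27.1
conda create -n ccpr-mt python=3.9         # [SFT LLM for MT] Platypus
conda run -n ccpr-mt pip install "transformers>=4.31.0"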

Below is a short explanation of the four critical folders:

  1. mytools: A library containing commonly used functions in this project, such as reading and saving files.
  2. mgiza: The GIZA++ code for automatically inducing word-alignment information from parallel data, which is important for collecting the training data for CCPR.
  3. code: The main code of our project, including the model, dataloader, indexing, searching, etc.
  4. Platypus: The code for the LLM-based translator. In our paper, we use the CCPR model to augment the LLM-based translator by integrating the retrieved information into the LLM prompt.

NOTE: Please install these libraries according to their respective README files.

4. Download

4.1 HF Model

Please ensure the mytools library is installed.

python ./mytools/mytools/hf_tools.py download_hf_model "sentence-transformers/LaBSE" ./huggingface/LaBSE
python ./mytools/mytools/hf_tools.py download_hf_model "FacebookAI/xlm-roberta-base" ./huggingface/xlm-roberta-base

Please also make sure you have the checkpoint of Llama-2-7B, which will be used for the MT task.
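If you fetch Llama-2-7B from Hugging Face, the same helper should work; note this is our assumption, and the meta-llama repository is gated, so you need to log in with your access token first:

# assumes your HF account has been granted access to the gated meta-llama repo
huggingface-cli login
python ./mytools/mytools/hf_tools.py download_hf_model "meta-llama/Llama-2-7b-hf" ./huggingface/Llama-2-7b-hf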

4.2 HF Dataset

for L1 in "de" "cs" "fi" "ru" "ro" "tr"
do
    python ./mytools/mytools/hf_tools.py download_wmt --ds_name wmt16 --lang_pair ${L1}-en --save_path ./wmt16_${L1}en
done
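After the loop finishes, a quick sanity check (a plain directory listing, nothing project-specific) is:

ls -d ./wmt16_*en
# expect: wmt16_csen wmt16_deen wmt16_fien wmt16_roen wmt16_ruen wmt16_tren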

4.3 Human-annotated Word Alignment (for Retrieval Evaluation)

If you want to pre-process the human-annotated word-alignment data yourself, please download the raw data as follows:

Data     URL                                                                            Save to
De->En   https://www-i6.informatik.rwth-aachen.de/goldAlignment/                       ./align_data/DeEn
Cs->En   https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1804               ./align_data/CzEn
Ro->En   http://web.eecs.umich.edu/~mihalcea/wpt05/data/Romanian-English.test.tar.gz   ./align_data/RoEn
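For example, Ro->En is a direct download (the other two links require visiting the page first); below is a minimal sketch, assuming the standard tar layout:

mkdir -p ./align_data/RoEn
wget http://web.eecs.umich.edu/~mihalcea/wpt05/data/Romanian-English.test.tar.gz
tar -xzf Romanian-English.test.tar.gz -C ./align_data/RoEn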

4.4 Newscrawl Monolingual Data (for MT Evaluation)

# make sure you are under the root directory of the project
mkdir -p newscrawl
cd newscrawl
YY=16 # an example
LANG=tr # an example
wget https://data.statmt.org/news-crawl/${LANG}/news.20${YY}.${LANG}.shuffled.deduped.gz
gzip -d news.20${YY}.${LANG}.shuffled.deduped.gz

where YY is the last two digits of the year, e.g., 16, and LANG is the language of the data, e.g., tr.
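The same pattern can be looped over all six languages used in the WMT16 experiments above; this is a sketch, and we have not verified that every language has a newscrawl release for every year:

YY=16
for LANG in "de" "cs" "fi" "ru" "ro" "tr"
do
    wget https://data.statmt.org/news-crawl/${LANG}/news.20${YY}.${LANG}.shuffled.deduped.gz
    gzip -d news.20${YY}.${LANG}.shuffled.deduped.gz
done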

5. Usage

5.1 Inference: Retrieval

Before running the following script, please remember to complete the configs in the script, e.g., the path to python, and also make sure that you have installed the required libraries.
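As an illustration of such configs, the variables you typically need to point at your own setup look like the following; the names here are hypothetical, so check the top of eval_retriever.sh for the real ones:

# hypothetical config lines; the actual variable names live in eval_retriever.sh
PYTHON=/path/to/envs/ccpr-train/bin/python
PROJECT_ROOT=/path/to/ccpr_release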

Please download the pre-processed data-bin from [this link], and put it in the root directory of this project, i.e., this folder.

You can download our pre-trained retriever through this link. If you want to train your own model, please refer to Section 5.3. In addition, please download the pre-processed human-annotated data and the indices of high-quality phrases selected by humans, and put them under the root directory of our project (this folder).

cd code
bash eval_retriever.sh

If you want to pre-process your own data for retrieval, please un-comment the code for data processing in eval_retriever.sh.

5.2 Inference: MT

Step-1: Prepare the retriever model

Please unzip the pre-trained retriever and save the ckpts folder to the root path of this project (this folder). If you don't want to use the pre-trained model, please see Section 5.3 to train your own model.
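For example, assuming the retriever was downloaded as a zip archive (the file name below is hypothetical):

# make sure you are under the root directory of the project
unzip ccpr_retriever_ckpts.zip   # hypothetical name; should produce ./ckpts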

Step-2: Data Processing, Indexing & Searching

Please make sure you have downloaded the newscrawl monolingual data and installed the required libraries.

cd code
bash index_and_search.sh

Step-3: Instruction-tuning the LLM for translation

Please unzip the pre-trained LLM-based translator and save the ckpts folder to the root path of this project (this folder). If you don't want to use the pre-trained model, prepare the training data and train the model yourself following the README of Platypus. For the training data, you need to sample from the WMT training dataset and retrieve cross-lingual phrases using the index built in the previous step.

# skip this if you want to use the pre-trained translator
cd Platypus
bash fine-tuning.sh

Step-4: Decoding & Reporting the Score

You can use the following script to prepare the prompts for translation. Note that you can also prepare the training data based on the same script.

cd Platypus
bash prepare_alpaca_data_phrase_test_enxx.sh
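To give an intuition for what this script produces: as described in Section 3, the retrieved cross-lingual phrases are integrated into the LLM prompt. An illustrative (not actual) prompt could look like:

# illustrative only; the real template is defined by the script above
# Instruction: Translate the following German sentence into English.
# Hints: Bundesregierung => federal government; Haushaltsplan => budget plan
# Input: <German source sentence>
# Response: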

Then, run the script for decoding and evaluation. The score for the method will be printed.

cd Platypus
bash inference_test.sh

5.3 CCPR Training

Please check whether you need to set project configs, e.g., the project path, for each script.

Step-1: Get the word-alignment information from parallel data using GIZA++

cd mgiza/mgizapp/build
bash install.sh
bash run.sh
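If install.sh fails, the usual culprits are missing build dependencies: mgiza is a C++ project built with CMake and, to our knowledge, depends on Boost. On Debian/Ubuntu, for example:

# install common build dependencies (a C++ toolchain, CMake, Boost)
sudo apt-get install build-essential cmake libboost-all-dev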

Step-2: Run the following code to automatically induce the cross-lingual phrase pairs from the parallel data.

cd code
bash preprocess_data.sh

Step-3: Train the model

cd code
bash train_retriever_labse_multilingual.sh
