HIF-KAT

Source code for ACL 2021 paper "Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making".

Environments

External Resources

We provide our external resources on Tsinghua Cloud: the fastText word embeddings and the conda environment we used.

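Because of file-size limits, we compressed each resource and split it into pieces with the commands below. The trailing dot is part of the split output prefix, so the pieces are named embedding.tar.gz.aa, embedding.tar.gz.ab, and so on.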

tar -zcvf - fasttext.wiki.en.300d.bin | split -b 1024m - embedding.tar.gz.
tar -zcvf - em2 | split -b 2048m - em2.tar.gz.

How to Unzip

cat embedding.tar.gz.a* > embedding.tar.gz
tar -xf embedding.tar.gz

cat em2.tar.gz.a* > em2.tar.gz
tar -xf em2.tar.gz
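
For machines without cat and tar, the reassembly step can also be sketched in Python. This is a minimal sketch, not part of the released code: the helper names reassemble and extract are hypothetical, and it assumes the pieces carry the default split suffixes (aa, ab, ...) and sit in the current directory.

```python
import glob
import shutil
import tarfile
from typing import List


def reassemble(prefix: str, out_path: str) -> List[str]:
    """Concatenate the split pieces prefix+'aa', prefix+'ab', ... into out_path."""
    # Sort so the pieces are joined in the order `split` wrote them.
    parts = sorted(glob.glob(prefix + "a*"))
    if not parts:
        raise FileNotFoundError("no pieces match " + prefix + "a*")
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the copy so multi-gigabyte pieces are never held in memory.
                shutil.copyfileobj(src, out)
    return parts


def extract(archive: str, dest: str = ".") -> None:
    """Unpack a gzip-compressed tar archive into dest."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

Calling reassemble("embedding.tar.gz.", "embedding.tar.gz") and then extract("embedding.tar.gz") mirrors the two shell commands above; the same pair of calls with the em2 names handles the environment archive.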

Embeddings

  1. Download fasttext.wiki.en.300d.bin from the Tsinghua Cloud.
  2. Create the directory $HOME/.vector_cache/fasttext if it does not already exist.
  3. Place fasttext.wiki.en.300d.bin in $HOME/.vector_cache/fasttext.
  4. Check it with ls -al ~/.vector_cache/fasttext/fasttext.wiki.en.300d.bin, which should print something like this:
-rw-r--r-- 1 zijun zijun 8493673445 Jan 14 20:48 /home/zijun/.vector_cache/fasttext/fasttext.wiki.en.300d.bin
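
The same check can be scripted before launching an experiment. The sketch below is an illustration only: embedding_path and check_embedding are hypothetical helpers, not functions from this repository, and the only thing they assume is the file location given in the steps above.

```python
from pathlib import Path
from typing import Optional

EMBEDDING_NAME = "fasttext.wiki.en.300d.bin"


def embedding_path(home: Optional[Path] = None) -> Path:
    """Return the location where the steps above place the embeddings."""
    home = Path.home() if home is None else Path(home)
    return home / ".vector_cache" / "fasttext" / EMBEDDING_NAME


def check_embedding(home: Optional[Path] = None) -> int:
    """Return the file size in bytes, or raise if the file is missing."""
    path = embedding_path(home)
    if not path.is_file():
        raise FileNotFoundError("expected embeddings at " + str(path))
    return path.stat().st_size
```

A size far below the 8493673445 bytes shown in the listing above would suggest an incomplete download or a failed reassembly.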

Python Environments

We recommend installing Anaconda (or Miniconda) and creating a new environment for our code by cloning the one we provide on Tsinghua Cloud.

  1. Download our environment from the Tsinghua Cloud and name it em2.
  2. Create a new virtual environment: conda create -n em --clone em2.
  3. Enter the new environment: conda activate em.

About the Data

  1. Go to the dataset directory: cd dataset
  2. Run 1.bigtable-attrdrop-ind.py, 2.mag-table.py, 4.mag.py, and 5.traditinal_feature.py in sequence.
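
The two steps above can be wrapped in a small driver so that a failure in one script stops the sequence. This is a sketch under assumptions: build_commands and run_all are hypothetical names, and it assumes each script runs without command-line arguments, which the README does not state either way.

```python
import subprocess
import sys
from typing import List

# Script names exactly as listed in step 2, in the order they must run.
SCRIPTS = [
    "1.bigtable-attrdrop-ind.py",
    "2.mag-table.py",
    "4.mag.py",
    "5.traditinal_feature.py",
]


def build_commands(python: str = sys.executable) -> List[List[str]]:
    """Return one command per preprocessing script, in order."""
    return [[python, script] for script in SCRIPTS]


def run_all(dataset_dir: str = "dataset") -> None:
    """Run every script from inside dataset_dir, stopping at the first failure."""
    for cmd in build_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, cwd=dataset_dir, check=True)
```

Calling run_all() from the repository root is equivalent to cd dataset followed by running the four scripts by hand.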

Note that we already provide the data needed to reproduce Table 3 and Table 4. To reproduce Figure 3, prepare the datasets by running our data preprocessing code with different drop_rate and train_rate values.

Structured Data

music: I-A_1

citation: D-S_1

citeacm: D-A_1

Dirty Data

dmusic: I-A_2

dcitation: D-S_2

dciteacm: D-A_2

Real Data

For commercial reasons, we cannot publish the Real dataset.

Reproducing Table 3

cd 1-HRF-dt
bash run.sh

cd 1-HRF-gini
bash run.sh

cd 1-HRF-xgb
bash run.sh

The final results are recorded in the logs directory.

Reproducing Table 4

cd 1-HRF-dt
bash run_full.sh

cd 1-HRF-gini
bash run_full.sh

cd 1-HRF-xgb
bash run_full.sh

The final results are recorded in the logs directory.
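
Because each command group above starts with cd, the three variants have to be launched from the same starting directory. A minimal sketch of a driver that handles this is shown below; plan and run_experiments are hypothetical names, and the sketch assumes the three directories sit directly under the repository root.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

# Directory names as they appear in the commands above.
EXPERIMENT_DIRS = ["1-HRF-dt", "1-HRF-gini", "1-HRF-xgb"]


def plan(script: str, root: str = ".") -> List[Tuple[str, List[str]]]:
    """Return (working directory, command) pairs for one script name."""
    return [(str(Path(root) / d), ["bash", script]) for d in EXPERIMENT_DIRS]


def run_experiments(script: str = "run.sh", root: str = ".") -> None:
    """Run the script in each experiment directory, one after another."""
    for cwd, cmd in plan(script, root):
        # cwd replaces the manual `cd`; check=True stops on the first failure.
        subprocess.run(cmd, cwd=cwd, check=True)
```

run_experiments("run.sh") corresponds to the Table 3 commands and run_experiments("run_full.sh") to the Table 4 commands.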

Cite

If you use the code, please cite this paper:

Zijun Yao, Chengjiang Li, Tiansi Dong, Xin Lv, Jifan Yu, Lei Hou, Juanzi Li, Yichi Zhang, Zelin Dai. Interpretable and Low-Resource Entity Matching via Decoupling Feature Learning from Decision Making. The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021).
