InfoSurgeon

Get the Datasets

First, download and decompress data/NYTimes_orig, data/VOA, and NLP_toolbox.
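Before running the preprocessing, it can help to confirm that the three downloads ended up where the scripts expect them. The following is a minimal sketch, assuming all three paths are relative to the repository root (the README does not state where NLP_toolbox should live); the function name check_data_dirs is hypothetical and not part of the repo.

```shell
# Report which of the expected directories are missing under a repo root.
# Assumption: data/NYTimes_orig, data/VOA, and NLP_toolbox all sit directly
# under the repository root. check_data_dirs is a hypothetical helper.
check_data_dirs() {
    root="${1:-.}"
    missing=0
    for d in data/NYTimes_orig data/VOA NLP_toolbox; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $d"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

Running `check_data_dirs .` from the repository root prints one "missing: ..." line per absent directory and returns a non-zero status if any of them is missing.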

Preprocess Data

Our pipeline requires several preprocessing steps that build on prior work:

  • Step 0: Parse the raw data into a standard format shared by the two datasets, NYTimes and VOA

  • Step 1: BERT tokenization for textual summarization features [paper] [code]

  • Step 2: Bottom-Up-Attention visual semantic feature extraction [paper] [code]

  • Step 3: Building the IE/KG given the news article [paper] [code]

We are porting the code for the above components into our repo so that they can be run with the following commands.

dataset=NYTimes

# When running the dataset for the first time:
if [ "$dataset" = "NYTimes" ]; then
    python scripts/get_NYTimes_data.py  #step 0
fi
sh scripts/preproc_bert.sh "" "" ${dataset}  #step 1a
sh scripts/preproc_bert.sh "" caption ${dataset}  #step 1b
sh scripts/preproc_bert.sh "" title ${dataset}  #step 1c
sh scripts/preproc_bua.sh ${dataset}  #step 2
sh scripts/preproc_IE.sh ${dataset}  #step 3
python data_preproc/prepare_indicator_factors.py ${dataset}

# git clone https://github.com/NVIDIA/apex.git && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . && cd .. && rm -rf apex
# git clone https://github.com/NVIDIA/apex.git
# cd apex
# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
# pip install .
# cd ..

Note that this port is still in beta. If you encounter setup or runtime issues, please check out and run the original preprocessing source code and documentation linked above.
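Since the port is still in beta, it may be useful to list the commands for a dataset before running them. Below is a minimal sketch that only prints the steps above in order and runs nothing. The function name print_preproc_steps is hypothetical; the commands themselves are copied from this README.

```shell
# Print (do not run) the preprocessing commands for one dataset, in order.
# print_preproc_steps is a hypothetical helper, not part of the repo.
print_preproc_steps() {
    dataset="$1"
    # Step 0 applies only to NYTimes, as in the commands above.
    if [ "$dataset" = "NYTimes" ]; then
        echo 'python scripts/get_NYTimes_data.py'
    fi
    # Steps 1a-1c, 2, 3, and the indicator-factor preparation.
    cat <<EOF
sh scripts/preproc_bert.sh "" "" $dataset
sh scripts/preproc_bert.sh "" caption $dataset
sh scripts/preproc_bert.sh "" title $dataset
sh scripts/preproc_bua.sh $dataset
sh scripts/preproc_IE.sh $dataset
python data_preproc/prepare_indicator_factors.py $dataset
EOF
}
```

For example, `print_preproc_steps VOA` lists six commands and omits step 0, while `print_preproc_steps NYTimes` lists seven.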

Run Misinformation Detection

# Example usage for doc-level detection task:
python code/engine.py --task doc-level --data_dir data/${dataset}/ --lrate 5e-6 --num_epochs 5 --ckpt_name ${dataset}

# Example usage for KE-level detection task:
# python code/engine.py --task KE-level --data_dir data/VOA/ --lrate 0.001 --ckpt_name VOA

Credits & Acknowledgements

The NYTimes dataset originated from GoodNews, and Tan et al. (2020) extended it into the multimedia NeuralNews dataset.

The pristine (unmanipulated) VOA news articles used in our data were originally collected by Manling Li. Many thanks to her.

General Tips

If you would like to view a Jupyter notebook running on a remote server from your local machine, do something along the lines of:

jupyter notebook --no-browser --port=5050  # on the server
ssh -N -f -L localhost:5051:localhost:5050 username@server-entry-address  # on the local machine
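The ssh command above forwards local port 5051 to port 5050 on the server, so the notebook is then reachable at http://localhost:5051 on the local machine. Below is a small sketch that builds the same command for other ports or hosts. The function name notebook_tunnel_cmd is hypothetical, and its defaults simply mirror the ports used above.

```shell
# Print the ssh port-forwarding command for a notebook on a remote server.
# notebook_tunnel_cmd is a hypothetical helper; defaults mirror the example above.
notebook_tunnel_cmd() {
    host="$1"                  # e.g. username@server-entry-address
    remote_port="${2:-5050}"   # port passed to `jupyter notebook --port` on the server
    local_port="${3:-5051}"    # port to open on the local machine
    echo "ssh -N -f -L localhost:${local_port}:localhost:${remote_port} ${host}"
    echo "# then open http://localhost:${local_port} in a local browser"
}
```

Calling `notebook_tunnel_cmd username@server-entry-address` prints the same ssh line as above, and passing two extra arguments changes the remote and local ports.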

infosurgeon's Issues

Where to find "IE_results.pkl" or how to generate "IE_results.pkl"?

Dear Author Yi, I think your work on InfoSurgeon is really great and very insightful. I am trying to build on its ideas and reproduce the results in your paper, but I ran into some difficulty: the file "IE_results.pkl", which the source code requires for the VOA dataset, cannot be found. Could you please explain where to find "IE_results.pkl" or how to generate it using the other code? I would really appreciate your explanation and help!

Source Code reference:

with open(self.data_dir[0]+"/IE_results.pkl", "rb") as f:
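A quick presence check can confirm whether the file exists in a data directory before launching training. This is a sketch only: the helper name has_ie_results is hypothetical, and the README does not say which preprocessing step, if any, writes IE_results.pkl.

```shell
# Return 0 if IE_results.pkl exists in the given data directory, 1 otherwise.
# has_ie_results is a hypothetical helper, not part of the repo.
has_ie_results() {
    data_dir="${1%/}"   # drop one trailing slash, e.g. data/VOA/ -> data/VOA
    if [ -f "$data_dir/IE_results.pkl" ]; then
        echo "found: $data_dir/IE_results.pkl"
    else
        echo "not found: $data_dir/IE_results.pkl"
        return 1
    fi
}
```

For example, `has_ie_results data/VOA/` checks the directory passed as --data_dir in the KE-level example.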
