CATENA

CAusal and TEmporal relation extraction from NAtural language texts

CATENA is a sieve-based system for temporal and causal relation extraction and classification from English texts, exploiting the interaction between the temporal and the causal model. The system requires text pre-annotated with EVENT and TIMEX3 tags according to the TimeML annotation standard, as these annotations are used as features to extract the relations.

Requirements

  • Java Runtime Environment (JRE) 1.7.x or higher

Maven

CATENA is now available on Maven Central. Please add the following dependency to your pom.xml:

<dependency>
  <groupId>com.github.paramitamirza</groupId>
  <artifactId>CATENA</artifactId>
  <version>1.0.3</version>
</dependency>

To build the fat (executable) JAR:

  • Install the WS4J library in your local Maven repo, e.g., mvn install:install-file -Dfile=./lib/ws4j-1.0.1.jar -DgroupId=edu.cmu.lti -DartifactId=ws4j -Dversion=1.0.1 -Dpackaging=jar
  • Run mvn package to build the executable JAR file (in target/CATENA-<version>.jar).

Text processing tools:

Other libraries:

Other resources:

  • Temporal and causal signal lists, available in resource/. This folder must be placed within the root folder of the project.
  • Classification models, available in models/, including: catena-event-timex.model, catena-event-dct.model, catena-event-event.model and catena-causal-event-event.model.

Usage

! The input file(s) must be in the TimeML annotation format or CoNLL column format (one token per line) !

usage: Catena
 -i,--input <arg>        Input TimeML file/directory path
 -f,--col                (optional) Input files are in column format (.col)
 -tl,--tlinks <arg>      (optional) Input file containing list of gold temporal links
 -cl,--clinks <arg>      (optional) Input file containing list of gold causal links
 -gl,--gold              (optional) Gold candidate pairs to be classified are given
 -y,--clinktype          (optional) Output the type of CLINK (ENABLE, PREVENT, etc.) from the rule-based sieve
        
 -x,--textpro <arg>      TextPro directory path
 -l,--matelemma <arg>    Mate tools' lemmatizer model path   
 -g,--matetagger <arg>   Mate tools' PoS tagger model path
 -p,--mateparser <arg>   Mate tools' parser model path      
 
 -t,--ettemporal <arg>   CATENA model path for E-T temporal classifier    
 -d,--edtemporal <arg>   CATENA model path for E-D temporal classifier                       
 -e,--eetemporal <arg>   CATENA model path for E-E temporal classifier
 -c,--eecausal <arg>     CATENA model path for E-E causal classifier
 
 -b,--train              (optional) Train the models
 -m,--tempcorpus <arg>   (optional) Directory path (containing .tml or .col files) for training temporal classifiers
 -u,--causcorpus <arg>   (optional) Directory path (containing .tml or .col files) for training causal classifier     

For example

java -Xmx2G -jar ./target/CATENA-1.0.3.jar -i ./data/example_COL/ --col --tlinks ./data/TempEval3.TLINK.txt --clinks ./data/Causal-TimeBank.CLINK.txt -l ./models/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model -g ./models/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model -p ./models/CoNLL2009-ST-English-ALL.anna-3.3.parser.model -x ./tools/TextPro2.0/ -d ./models/catena-event-dct.model -t ./models/catena-event-timex.model -e ./models/catena-event-event.model -c ./models/catena-causal-event-event.model -b -m ./data/Catena-train_COL/ -u ./data/Causal-TimeBank_COL/
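
When CATENA is launched from another Java program, the same command can be assembled as an argument list. The following is a minimal sketch, not part of CATENA: the class `CatenaCommand` and its `build` method are hypothetical names, and the model file names are simply those from the example command above. Only flags listed in the usage message are used; the gold-link and training flags are left out for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of CATENA: assembles a command line like the example above. */
final class CatenaCommand {

    /** Returns the argument list for relation extraction on column-format input. */
    static List<String> build(String jarPath, String inputPath, String modelDir, String textProDir) {
        // Common prefix of the Mate tools model files, as they appear in the example command.
        String mate = modelDir + "/CoNLL2009-ST-English-ALL.anna-3.3.";
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-Xmx2G", "-jar", jarPath));
        cmd.addAll(Arrays.asList("-i", inputPath, "--col"));
        cmd.addAll(Arrays.asList("-x", textProDir));
        cmd.addAll(Arrays.asList("-l", mate + "lemmatizer.model"));
        cmd.addAll(Arrays.asList("-g", mate + "postagger.model"));
        cmd.addAll(Arrays.asList("-p", mate + "parser.model"));
        cmd.addAll(Arrays.asList("-t", modelDir + "/catena-event-timex.model"));
        cmd.addAll(Arrays.asList("-d", modelDir + "/catena-event-dct.model"));
        cmd.addAll(Arrays.asList("-e", modelDir + "/catena-event-event.model"));
        cmd.addAll(Arrays.asList("-c", modelDir + "/catena-causal-event-event.model"));
        return cmd;
    }

    public static void main(String[] args) {
        // The JAR path is a placeholder; substitute the file produced by "mvn package".
        List<String> cmd = build("./target/CATENA-<version>.jar", "./data/example_COL/",
                                 "./models", "./tools/TextPro2.0/");
        System.out.println(cmd);
        // To launch CATENA, the list can be handed to a ProcessBuilder, e.g.:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the arguments in a list rather than a single string avoids quoting problems when paths contain spaces.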

CoNLL column format

The input document must be in tab-separated 'one-token-per-line' format, with each column as: | token | token-id | sentence-id | lemma | event-id | event-class | event-tense+aspect+polarity | timex-id | timex-type | timex-value | signal-id | causal-signal-id | pos-tag | chunk | lemma | pos-tag | dependencies | main-verb |

  • event-id and event-class: TimeML event ID and attributes
  • timex-id and timex-type and timex-value: TimeML timex ID and attributes
  • signal-id and causal-signal-id: temporal and causal signal ID
  • event-tense+aspect+polarity: optional attributes of an event; if given as O, CATENA infers them automatically from PoS tags and dependency relations
  • pos-tag: BNC tagset (the default tagset used to build the models) or Penn Treebank tagset
  • chunk:
  • dependencies: in the format dep1:deprel1||dep2:deprel2||...; the dependency relations are produced by Mate-tools

See data/example_COL/ for an example.
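
The column layout described above can be read with a few lines of Java. This is a minimal sketch, not CATENA's own parser: the class `ColumnRow`, its constants, and its methods are hypothetical. It assumes exactly 18 tab-separated columns in the order listed, treats the dependency entries as opaque `dep:deprel` strings, and assumes that `O` marks an empty dependencies field (the README only states this convention for the event attributes).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for one row of the column format; not CATENA's own parser. */
final class ColumnRow {
    static final int COLUMN_COUNT = 18;
    // Zero-based positions of a few columns, following the order listed above.
    static final int TOKEN = 0;
    static final int EVENT_ID = 4;
    static final int EVENT_CLASS = 5;
    static final int DEPENDENCIES = 16;

    /** Splits a tab-separated row and checks that all 18 columns are present. */
    static String[] split(String line) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length != COLUMN_COUNT) {
            throw new IllegalArgumentException(
                "Expected " + COLUMN_COUNT + " columns, got " + cols.length);
        }
        return cols;
    }

    /** Parses "dep1:deprel1||dep2:deprel2" into an ordered dep -> deprel map. */
    static Map<String, String> parseDependencies(String field) {
        Map<String, String> deps = new LinkedHashMap<>();
        if (field.isEmpty() || field.equals("O")) { // assumption: "O" marks an empty value
            return deps;
        }
        for (String pair : field.split("\\|\\|")) {
            int sep = pair.indexOf(':');
            if (sep > 0) { // entries without a "dep:deprel" shape are skipped
                deps.put(pair.substring(0, sep), pair.substring(sep + 1));
            }
        }
        return deps;
    }

    public static void main(String[] args) {
        // Invented row for illustration only; the values are not taken from the CATENA data.
        String row = "said\tt3\t1\tsay\te1\tREPORTING\tO\tO\tO\tO\tO\tO"
                   + "\tVVD\tB-VP\tsay\tVBD\tt2:SBJ||t5:OBJ\tO";
        String[] cols = split(row);
        System.out.println(cols[TOKEN] + " " + cols[EVENT_ID] + " " + cols[EVENT_CLASS]);
        System.out.println(parseDependencies(cols[DEPENDENCIES]));
    }
}
```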

Output format

The output will be a list of temporal and/or causal relations, one relation per line, in the format of:

filename  entity_1  entity_2  TLINK_type/CLINK/CLINK-R
  • TLINK_type: One of TLINK types according to TimeML, e.g., BEFORE, AFTER, SIMULTANEOUS
  • CLINK: entity_1 CAUSE entity_2
  • CLINK-R: entity_1 IS_CAUSED_BY entity_2
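
A downstream program can read these lines back and normalise the direction of causal relations. The sketch below is a hypothetical reader, not part of CATENA: the class `RelationLine` is an invented name, and the code assumes that the four fields are separated by whitespace and that file names contain no spaces. It does not handle the CLINK types (ENABLE, PREVENT, etc.) that the -y option adds to the output.

```java
/** Hypothetical reader for one line of CATENA output; not part of CATENA itself. */
final class RelationLine {
    final String fileName;
    final String entity1;
    final String entity2;
    final String label;

    private RelationLine(String fileName, String entity1, String entity2, String label) {
        this.fileName = fileName;
        this.entity1 = entity1;
        this.entity2 = entity2;
        this.label = label;
    }

    /** Parses "filename entity_1 entity_2 label", assuming whitespace-separated fields. */
    static RelationLine parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields: " + line);
        }
        return new RelationLine(fields[0], fields[1], fields[2], fields[3]);
    }

    /** True for CLINK and CLINK-R; every other label is treated as a TLINK type. */
    boolean isCausal() {
        return label.equals("CLINK") || label.equals("CLINK-R");
    }

    /** Returns {cause, effect}: CLINK keeps the entity order, CLINK-R reverses it. */
    String[] causeAndEffect() {
        if (label.equals("CLINK")) {
            return new String[] {entity1, entity2};
        }
        if (label.equals("CLINK-R")) {
            return new String[] {entity2, entity1};
        }
        throw new IllegalStateException("Not a causal relation: " + label);
    }

    public static void main(String[] args) {
        // Invented output line for illustration only.
        RelationLine r = parse("example.tml\te5\te7\tCLINK-R");
        String[] pair = r.causeAndEffect();
        System.out.println(pair[0] + " causes " + pair[1]);
    }
}
```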

System architecture

[Figure: CATENA system architecture]

CATENA contains two main modules:

  1. Temporal module, a combination of rule-based and supervised classifiers, with a temporal reasoner module in between.
  2. Causal module, a combination of a rule-based classifier based on causal verbs and a supervised classifier that takes into account syntactic and context features, especially causal signals appearing in the text.

The two modules interact, based on the assumption that the notion of causality is tightly connected with the temporal dimension: (i) TLINK labels for event-event pairs, resulting from the rule-based sieve + temporal reasoner, are used as features for the CLINK classifier, and (ii) CLINK labels are used as a post-editing step to correct event pairs wrongly labelled by the Temporal module.
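
The post-editing step (ii) can be illustrated with a small sketch. This is not CATENA's actual code: the class `CausalPostEdit` is hypothetical, and the correction rule is an assumption made for illustration, namely that a cause precedes its effect, so a pair labelled CLINK is relabelled BEFORE and a pair labelled CLINK-R is relabelled AFTER. Event pairs are keyed by a string such as "e1 e2".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of causal post-editing; not CATENA's implementation. */
final class CausalPostEdit {

    /**
     * Returns a copy of the temporal labels in which pairs that also carry a causal
     * label are corrected. Assumed rule: CLINK implies BEFORE, CLINK-R implies AFTER.
     * Pairs without an existing temporal label are left out, since the step only
     * corrects labels that the Temporal module has already assigned.
     */
    static Map<String, String> postEdit(Map<String, String> tlinks, Map<String, String> clinks) {
        Map<String, String> edited = new LinkedHashMap<>(tlinks);
        for (Map.Entry<String, String> clink : clinks.entrySet()) {
            String pair = clink.getKey();
            if (!edited.containsKey(pair)) {
                continue;
            }
            if (clink.getValue().equals("CLINK")) {
                edited.put(pair, "BEFORE");
            } else if (clink.getValue().equals("CLINK-R")) {
                edited.put(pair, "AFTER");
            }
        }
        return edited;
    }

    public static void main(String[] args) {
        // Invented labels for illustration only.
        Map<String, String> tlinks = new LinkedHashMap<>();
        tlinks.put("e1 e2", "AFTER");
        tlinks.put("e3 e4", "SIMULTANEOUS");
        Map<String, String> clinks = new LinkedHashMap<>();
        clinks.put("e1 e2", "CLINK");
        System.out.println(postEdit(tlinks, clinks));
    }
}
```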

Publication

Paramita Mirza and Sara Tonelli. 2016. CATENA: CAusal and TEmporal relation extraction from NAtural language texts. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, December. [pdf]

Dataset

  • Training data for the Temporal module is taken from the TempEval-3 shared task, particularly the combination of TBAQ-cleaned (English training data) and TE3-platinum (English test data).
  • Training data for the Causal module is Causal-TimeBank, the TimeBank corpus annotated with causal information.
  • The TimeBank-Dense corpus is used in one of the evaluation schemes for temporal relation extraction.
  • Causal-TempEval3-eval.txt (available in data/) is used in one of the evaluation schemes for causal relation extraction.

! Whenever making reference to this resource please cite the paper in the Publication section. !

Web Service

Soon!

Contact

For more information please contact Paramita Mirza ([email protected]).

catena's Issues

About tokenchunk embedding

Hi, I see that you use the tokenChunk embedding, but in the code it is obtained with "getPhraseEmbedding("http://137.132.82.174:8080/", m.getText())". Could you please explain how the embedding is created? (I may want to add some new chunks or causal signals.)

data/...._TML files not included in github, but needed for running CATENA

There are a few places in the code that rely on the presence of TML data files. Yet all of the *_TML directories and files have been gitignored:

/Catena-train_TML/
/Causal-TimeBank_TML/
/TempEval3-eval_TML/
/TempEval3-train_TML/

Here is one manifestation:

Train CATENA temporal and causal models...
Exception in thread "main" java.io.FileNotFoundException: /homes/hny2/mfeb/causal/CATENA_project/CATENA/./data/Catena-train_TML/ABC19980108.1830.0711.tml (No such file or directory)

Could these please be added back? I'd like to work with TML files if that's possible.

About TextPro2

Hi, I am following your work and installed TextPro2.0, but I ran into some problems while using it.
Below is my result. I also tested entity and chunk; they fail in the same way as lemma.
For pos, the code ran successfully, but the output file does not contain the PoS tags.
Have you ever encountered this problem?
My Java version is 1.8.0.2 and my Perl version is 5.30. The error description follows.
Thanks for your attention.
#TextPro is running on test/input/storace-assunta.html
#Detected language: italian-utf
#TokenPro... 324ms
#TagPro... 83ms
#MorphoPro... 2ms
#LemmaPro...
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at eu.fbk.textpro.wrapper.wrapper.prepareTmpFile(wrapper.java:1322)
at eu.fbk.textpro.wrapper.wrapper.runModule(wrapper.java:1006)
at eu.fbk.textpro.wrapper.wrapper.sortandrunDependentModulesValue(wrapper.java:868)
at eu.fbk.textpro.wrapper.wrapper.runFile(wrapper.java:688)
at eu.fbk.textpro.wrapper.wrapper.manageFile(wrapper.java:435)
at eu.fbk.textpro.wrapper.wrapper.manageInput(wrapper.java:418)
at eu.fbk.textpro.wrapper.wrapper.main(wrapper.java:399)

Query regarding input to CATENA

Hi,
I am having trouble converting my raw documents into the input format you have specified.

Some specific queries regarding the input format:
|token | token-id | sentence-id | lemma | event-id | event-class | event-tense+aspect+polarity | timex-id | timex-type| timex-value | signal-id | causal-signal-id | pos-tag | chunk | lemma | pos-tag | dependencies | main-verb |

  1. Why is the same information requested twice, e.g. lemma and pos-tag?
  2. The meaning of chunk is not intuitive (its description is missing from the wiki).
  3. It is a bit confusing which attributes are optional in the input.
  4. Is there a standard library for converting raw documents into the required format?

I will be thankful to you for resolving these queries.

Thanks in advance
Shikhar

No such file or directory in CATENA ./tools/TextPro2.0/

java -Xmx6G -jar ./target/CATENA-1.0.3.jar -i /root/dater/data/catena_input.xmlsample.txt.info.xml.tml -l ./models/CoNLL2009-ST-English-ALL.anna-3.3.lemmatizer.model -g ./models/CoNLL2009-ST-English-ALL.anna-3.3.postagger.model -p ./models/CoNLL2009-ST-English-ALL.anna-3.3.parser.model -x ./tools/TextPro2.0/ -d ./models/catena-event-dct.model -t ./models/catena-event-timex.model -e ./models/catena-event-event.model -c ./models/catena-causal-event-event.model > /root/dater/data/neural_input.tml

I ran the above command and got the following error:

Convert TimeML files to column format...
Exception in thread "main" java.io.FileNotFoundException: ./tools/TextPro2.0/temp_catena_input.xmlsample.txt.info.xml.txt.txp (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.util.Scanner.<init>(Scanner.java:611)
at catena.parser.TextProParser.run(TextProParser.java:140)
at catena.parser.TimeMLToColumns.convert(TimeMLToColumns.java:491)
at catena.parser.TimeMLToColumns.convert(TimeMLToColumns.java:514)
at catena.Temporal.filePreprocessing(Temporal.java:77)
at catena.Temporal.extractRelations(Temporal.java:734)
at catena.Catena.extractRelations(Catena.java:489)
at catena.Catena.extractRelationsString(Catena.java:402)
at catena.Catena.main(Catena.java:89)

Can anyone please help me to fix this?
