rockt / chemspot

ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. ChemSpot is released under the Common Public License 1.0.

Home Page: https://www.informatik.hu-berlin.de/forschung/gebiete/wbi/resources/chemspot/chemspot/

License: Other

Java 99.45% Scala 0.55%


chemspot's Issues

Reduce size of LINNAEUS automaton

Integrate OPSIN for normalizing IUPAC entities

Tagging 1000 documents

This is not really an issue; I am just looking for suggestions. If, for example, we want to annotate 1000 text files, how can we automate it?

java -Xmx16G -jar chemspot.jar -t sample.txt -o predict.txt
The line above tags one file, but I want to tag 1000 files and cannot do that manually. What is the alternative?
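
One possible way to automate this is a small Java driver that lists the text files in a directory and launches the command above once per file. This is only a sketch and not part of ChemSpot: the class name BatchTagger, the two directory arguments, and the ".predict.txt" output suffix are assumptions made for illustration; only the -t and -o flags and the -Xmx16G setting are taken from the command shown above. It also assumes chemspot.jar sits in the working directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchTagger {

    /** Builds the command line for one input file, mirroring the command shown above. */
    public static List<String> commandFor(Path input, Path outputDir) {
        String name = input.getFileName().toString();
        // Hypothetical naming scheme: sample.txt -> sample.predict.txt
        String base = name.endsWith(".txt") ? name.substring(0, name.length() - 4) : name;
        Path output = outputDir.resolve(base + ".predict.txt");
        return Arrays.asList("java", "-Xmx16G", "-jar", "chemspot.jar",
                "-t", input.toString(), "-o", output.toString());
    }

    /** Lists the .txt files directly inside inputDir, sorted for a stable order. */
    public static List<Path> textFiles(Path inputDir) throws IOException {
        try (Stream<Path> files = Files.list(inputDir)) {
            return files.filter(p -> p.getFileName().toString().endsWith(".txt"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Path inputDir = Paths.get(args[0]);   // directory holding the text files
        Path outputDir = Paths.get(args[1]);  // directory that receives the predictions
        Files.createDirectories(outputDir);
        for (Path input : textFiles(inputDir)) {
            // Run ChemSpot once per file and wait for it to finish.
            Process process = new ProcessBuilder(commandFor(input, outputDir)).inheritIO().start();
            int exit = process.waitFor();
            if (exit != 0) {
                System.err.println("ChemSpot failed on " + input + " (exit code " + exit + ")");
            }
        }
    }
}
```

It could be run as, for example, java BatchTagger texts predictions. Because every file starts a fresh JVM, the start-up cost is paid each time; keeping the output directory separate from the input directory avoids re-tagging the predictions on a second run.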

Use IDs of other Databases for Normalization (incl. ChEBI)

Fix application parameter settings

Certain parameter combinations for the ChemSpot main application produce a strange and somewhat arbitrary behavior. This should be changed so that all parameters work as expected.

Unable to access jarfile chemspot.jar

Hi, I tried to follow the commands with some sample text but received the following error on Ubuntu 16.04:

Unable to access jarfile chemspot.jar

eumed-light.jar with newer version of scala

eumed-light.jar was compiled with an old version of Scala (2.9.2). I need either a version compiled with Scala 2.10

or the source code, so that I can compile it myself.

Improve normalization

Find more (offline) sources to retrieve IDs for chemicals from and integrate them into ChemSpot

Solve problem with abstract entities in CRAFT corpus

Find a graceful way to deal with more abstract entities such as "protein" or "molecule" in the CRAFT corpus, rather than simply "removing unwanted entities" that ChemSpot does not find (and was not designed to find).

Make UIMA descriptors accessible from within a jar

Chemspot REST interface

Hi,
FYI, I've quickly developed a wrapper around ChemSpot that offers it as a REST service. This way you don't need 16 GB of memory every time you want to tag a new document.

You can find it here: https://bitbucket.org/lfoppiano/chemspot-web
Regards
Luca

Seeking description of the columns in the output

Hey, I can guess what several of the columns in the output are, but have no idea about a number of them. Is there a data dictionary somewhere, e.g., col1 = 'this', col2 = 'that', etc.? I couldn't find one after a non-trivial search.

Thanks for the fabulous tool!

Tagging text is slow

934a481

    public List<Mention> tag(String text) throws UIMAException {
        JCas jcas = JCasFactory.createJCas(typeSystem);
        jcas.setDocumentText(text);
        PubmedDocument pd = new PubmedDocument(jcas);
        pd.setBegin(0);
        pd.setEnd(text.length());
        pd.setPmid("");
        pd.addToIndexes(jcas);
        return tag(jcas);
    }

This is slow because a new JCas is initialized every time we want to tag a string. Instead, keep one pre-initialized JCas and reset it each time this method is called.
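
The create-once, reset-on-reuse idea can be sketched independently of UIMA. The class ReusableResource below is hypothetical and not part of ChemSpot; it uses a StringBuilder as a stand-in for the expensive object so that it runs with the JDK alone. In ChemSpot, the factory step would presumably correspond to JCasFactory.createJCas(typeSystem) and the reset step to the reset() method of the UIMA JCas, but that mapping is an assumption and has not been tested against the code above.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Lazily creates one expensive object and resets it before every reuse. */
public class ReusableResource<T> {
    private final Supplier<T> factory;  // expensive creation step
    private final Consumer<T> reset;    // cheap step that clears previous state
    private T instance;                 // created on first use, then kept

    public ReusableResource(Supplier<T> factory, Consumer<T> reset) {
        this.factory = factory;
        this.reset = reset;
    }

    /** Returns a clean instance. Not thread-safe: the single instance is shared. */
    public T acquire() {
        if (instance == null) {
            instance = factory.get();   // pay the creation cost only once
        } else {
            reset.accept(instance);     // wipe state left by the previous call
        }
        return instance;
    }

    public static void main(String[] args) {
        int[] created = {0};
        ReusableResource<StringBuilder> holder = new ReusableResource<>(
                () -> { created[0]++; return new StringBuilder(); },
                sb -> sb.setLength(0));
        holder.acquire().append("first document");
        StringBuilder second = holder.acquire();
        System.out.println(created[0] + " instance(s) created, reused buffer length " + second.length());
    }
}
```

The trade-off is that a shared instance makes the method unsafe to call from several threads at once, so concurrent callers would need one holder per thread or external locking.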

Tagging text from command-line does not work

java -jar -Xmx9G chemspot.jar -m crf_model.bin -s sentence_model.bin.gz -d dict.zip -i ids.zip -t sample.txt -o predict.iob

Exception in thread "main" java.io.IOException: There are no corpora defined.
at de.berlin.hu.chemspot.App.promptForCorpus(App.java:146)
at de.berlin.hu.chemspot.App.main(App.java:270)

Suppress "Couldn't open edu.umass.cs.mallet.base.util.MalletLogger resources/logging.properties file" error

Grab the missing 'logging.properties' file at

https://github.com/clulab/banner/blob/master/src/main/java/edu/umass/cs/mallet/base/util/resources/logging.properties

Open the 'ChemSpot/chemspot-2.0/chemspot.jar' file using an archive manager (I used Archive Manager on Ubuntu) and add the 'logging.properties' file at the following location:

cc.mallet.util.resources.logging.properties

[i.e., "/cc/mallet/util/resources/logging.properties"]
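
If no graphical archive manager is at hand, the same step can be sketched in Java with the JDK's built-in zip file system. The class name AddToJar and the relative file locations in main are assumptions for illustration; the entry path is the one given above. It is worth working on a copy of the jar, since the archive is modified in place.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Collections;

public class AddToJar {

    /** Copies a file into an existing jar (zip) archive at entryPath, replacing any existing entry. */
    public static void addEntry(Path jar, Path file, String entryPath) throws IOException {
        // A "jar:" URI makes the JDK mount the archive as a writable file system.
        URI uri = URI.create("jar:" + jar.toUri());
        try (FileSystem zip = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
            Path target = zip.getPath(entryPath);
            if (target.getParent() != null) {
                Files.createDirectories(target.getParent());  // create missing folders inside the jar
            }
            Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
        }  // closing the file system writes the changes back to the jar
    }

    public static void main(String[] args) throws IOException {
        // Assumed locations: both files in the current working directory.
        addEntry(Paths.get("chemspot.jar"), Paths.get("logging.properties"),
                "/cc/mallet/util/resources/logging.properties");
    }
}
```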

Move match expansion to separate component

Develop a new component for the expansion of matches of chemicals

Evaluate OSCAR on CRAFT

Brics error when initializing ChemSpot

When calling ChemSpot from within a different Java project, the dictionary fails to load. Perhaps this problem is related to #22?

Failed initializing ChemSpot.
org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "de.berlin.hu.uima.ae.tagger.brics.BricsTagger" failed. (Descriptor: jar:file:/media/Data/workspaces/wbi/prototype/lib/chemspot.jar!/desc/ae/tagger/BricsTaggerAE.xml)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:254)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:158)
at org.uimafit.factory.AnalysisEngineFactory.createPrimitive(AnalysisEngineFactory.java:403)
at de.berlin.hu.chemspot.ChemSpot.<init>(ChemSpot.java:118)
at ChemSpotRunner.main(ChemSpotRunner.java:10)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: org.apache.uima.resource.ResourceInitializationException
at de.berlin.hu.uima.ae.tagger.brics.BricsTagger.initialize(BricsTagger.java:59)
at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:252)
... 9 more
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:214)
at java.util.zip.ZipFile.<init>(ZipFile.java:144)
at java.util.zip.ZipFile.<init>(ZipFile.java:115)
at de.berlin.hu.uima.ae.tagger.brics.BricsMatcher.<init>(BricsMatcher.java:42)
at de.berlin.hu.uima.ae.tagger.brics.BricsTagger.initialize(BricsTagger.java:55)
... 10 more

Upgrade to LINNAEUS 2.0

Generic drug tagger constructor

DrugTagger(String, String, String)

Add constructor for Drug Tagger with more generic input, such as an InputStream.

Evaluate normalization on CRAFT

Evaluate the normalization on all ChEBIs in the CRAFT corpus.

Maven deployment

ChemSpot can be installed via Maven, but it would also be nice to have the build automatically create a runnable jar, copy all required files, and optionally tar/compress them.

Loading resources from within the jar fails

It seems that some resources are not properly loaded from within the jar, i.e.,
resources/chebi/chebi_ontology_fulldepth.txt
resources/prefixes.txt
resources/phare.txt
resources/suffixes-filtered.txt

Use this.getClass().getClassLoader().getResource(PATH TO FILE) to access them.
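
A minimal sketch of that approach follows. The helper class ClasspathResources is hypothetical, and UTF-8 is assumed as the file encoding. It reads through getResourceAsStream rather than turning the URL from getResource into a java.io.File, because an entry inside a jar is not a file on disk and cannot be opened as one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ClasspathResources {

    /** Reads a classpath resource such as "resources/prefixes.txt" into a list of lines. */
    public static List<String> readLines(String path) throws IOException {
        ClassLoader loader = ClasspathResources.class.getClassLoader();
        // Works both for files in a directory on the classpath and for entries inside a jar.
        InputStream in = loader.getResourceAsStream(path);
        if (in == null) {
            throw new IOException("Resource not found on classpath: " + path);
        }
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        }
    }
}
```

Note that paths passed to a ClassLoader are relative to the classpath root and should not start with a slash.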

change scala version to 2.11.8

ChemSpot would be wonderful if we could use it with Scala 2.11.8 in our NLP pipeline. However, we have run into the problem that the class file for scala.ScalaObject cannot be found.
The reason is that de.berlin.hu.enumed.EntityTagger uses scala.ScalaObject, and we cannot change the compiled code.
Is there any chance of getting the source code of the Maven dependency below?

<dependency>
    <groupId>eumed</groupId>
    <artifactId>eumed-rg</artifactId>
    <version>1.0.0</version>
</dependency>

If we can get the source code, we could update the ChemSpot code and use it with state-of-the-art libraries.

Thanks.

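import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

PrintStream oldErr = System.err;
// Temporarily redirect System.err to a buffer that is simply discarded.
System.setErr(new PrintStream(new ByteArrayOutputStream()));
try {
    // do your work
} finally {
    // Restore the original stream even if the work throws.
    System.setErr(oldErr);
}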