The horus-ner from smartdataanalytics

Installation: conda env

I executed: conda env create -f environment.yml

And I get:

Traceback (most recent call last):
  File "/home/rspeck/anaconda2/lib/python2.7/site-packages/conda/exceptions.py", line 626, in conda_exception_handler
    return_value = func(*args, **kwargs)
  File "/home/rspeck/anaconda2/lib/python2.7/site-packages/conda_env/cli/main_create.py", line 78, in execute
    directory=os.getcwd())
  File "/home/rspeck/anaconda2/lib/python2.7/site-packages/conda_env/specs/__init__.py", line 23, in detect
    raise SpecNotFound(build_message(specs))
SpecNotFound: Runtime error: Can't process without a name
Conda Env Exception: environment.yml file not found
There is no requirements.txt

Some hints.

(Horus: environment.yml) Solving environment failed error

@diegoesteves Creating the environment failed: it raises ResolvePackageNotFound again, and even with the updated files it still asks for some packages to be installed.
Details of the system:
OS: Windows 10 Education 64-bit

Parallel Processing

At some point we should parallelise horus. Some libs at https://wiki.python.org/moin/ParallelProcessing
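A minimal sketch of what a first parallelisation step could look like, using only the standard library. `process_sentence` is a hypothetical stand-in for the per-sentence work, not a function from the repository. Threads are used on the assumption that the work is dominated by I/O (search engine calls); for CPU-bound stages `ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sentence(sentence):
    # Hypothetical placeholder for the per-sentence horus pipeline.
    return [token.lower() for token in sentence.split()]


def process_all(sentences, max_workers=4):
    # executor.map returns results in the same order as the input sentences.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_sentence, sentences))


if __name__ == "__main__":
    print(process_all(["Tom Jobim was born in Rio de Janeiro"]))
```

Keeping the unit of work at sentence level means results can be collected in input order without shared state between workers.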

System Parametrization

OpenCV Error: Assertion failed (The data should normally be NULL!) in allocate...

Using opencv package from conda's repo...

Index Flickr

horus/core/search_engines.py and horus/core/service.py files.

State-of-the-art (Baseline @Ritter)

http://curtis.ml.cmu.edu/w/courses/index.php/Ritter_et_al,_EMNLP_2011._Named_Entity_Recognition_in_Tweets:_An_Experimental_Study

precision  recall  f-measure  system

0.57       0.42    0.49       COTRAIN-NER (PLO)
0.73       0.49    0.59       T-NER (PLO)
0.30       0.27    0.29       Stanford NER (PLO)
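For reference, a small sketch of how the f-measure column relates to the other two, assuming it is the balanced F1 (harmonic mean of precision and recall). The function name `f_measure` is illustrative. Because the precision and recall above are already rounded to two decimals, recomputing from them may differ from the reported value in the last digit.

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


rows = [
    (0.57, 0.42, "COTRAIN-NER (PLO)"),
    (0.73, 0.49, "T-NER (PLO)"),
    (0.30, 0.27, "Stanford NER (PLO)"),
]
for p, r, system in rows:
    print("%.2f %.2f %.2f [%s]" % (p, r, f_measure(p, r), system))
```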

Text Module Problem

text module is not working (at search engine level)

Stanford NER

check preliminary results using Stanford NER (to integrate into horus_matrix after merging Twitter NLP)

Webservice for GERBIL Integration

create a simple webservice to be used as an integration point to GERBIL calls.

Architecture

the architecture does not support package installation; check that for portability and for running horus as a service

HORUS Conceptual Architecture

Model features, images, annotations, etc..

Sequence Labeler

integrate a sequence labeling system

CONLL-2003 - NER dataset integration.

http://www.cnts.ua.ac.be/conll2003/ner/

d_theta and y' (correction): final annotation function()

sentence: coca/NOUN/NN/LOC cola/NOUN/NN/LOC has/VERB/VBZ/0 a/DET/DT/0 strange/ADJ/JJ/0 flavor/NOUN/NN/LOC

distance_theta: flavor should not be annotated as LOC, since the dtheta threshold is 1 and flavor got 2,2,0, i.e. dtheta = 0!
coca cola should be updated, since the compound correctly returned 2,3,0 with a high bias to not(LOC) => -40 !!!

These errors are propagating and directly impacting the performance measures. This sentence, for instance, should have 100% accuracy and is getting 0 👎

POS encoding

Each POS tag model may have a different tag set, which makes the encoder fail when an unseen POS tag is used as a feature. Technically the solution is simple (a vector containing all possibilities), but it can lead to a worse predictor, since it (unnecessarily) increases the number of dimensions without adding extra value. This has to be checked carefully later.
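One possible alternative to a vector over the union of all tag sets, sketched here as an assumption rather than the encoder used in horus: keep the known tags and route anything unseen to a single shared "unknown" slot, which adds exactly one dimension. The class name `PosEncoder` and the example tags are illustrative.

```python
class PosEncoder:
    """One-hot encoder over a fixed tag list plus one shared 'unknown' slot."""

    def __init__(self, known_tags):
        # Index 0 is reserved for any tag not seen when the encoder was built.
        self.index = {tag: i + 1 for i, tag in enumerate(sorted(set(known_tags)))}
        self.size = len(self.index) + 1

    def encode(self, tag):
        vector = [0] * self.size
        vector[self.index.get(tag, 0)] = 1
        return vector


encoder = PosEncoder(["NOUN", "VERB", "DET", "ADJ"])
print(encoder.encode("NOUN"))
print(encoder.encode("PROPN"))  # unseen tag falls into the unknown slot
```

The trade-off is that all unseen tags become indistinguishable, so whether this helps or hurts the predictor would still need to be measured.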

HORUS UI/REST API

integrate microsoft graph

https://concept.research.microsoft.com/api/Concept/ScoreByProb?instance={value}&topK={value}

GERBIL Integration

integrate GERBIL's platform on HORUS

Search Engine (Genesis)

Integrate GENESIS
http://genesis.aksw.org/

Issues on a fresh clone

I tried to clone the repository and I have some problems getting it up and running.

Initially, I copied horus_dist.ini to ~/horus.ini and set the parameters.

Next, I tried to install the module with:
python setup.py install --record files.txt
And it failed with the following error message:
error: package directory 'src/horus/sift' does not exist

The next thing I tried was to run the source code directly without having the whole module installed. I had two issues:

The horus.ini file that I copied from src/horus/resource/horus_dist.ini has some missing keys. I just added an empty value for each of those keys, but I'm not sure whether that is going to resolve all the issues.
I did not have the database in advance. I tried to create it with ./src/horus/components/util/script_db.py, but the db model in this file is obsolete: when I ran ./src/horus/components/webservice/rest.py, the model needed some columns that did not exist in my database.
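A quick way to see which keys are missing, sketched with the standard `configparser` module (Python 3 naming). It assumes both files are plain INI files. The helper `missing_keys` and the section and key names in the example are hypothetical, not taken from the repository.

```python
import configparser


def missing_keys(template_text, user_text):
    # Parse both files and report (section, key) pairs present only in the template.
    template = configparser.ConfigParser()
    template.read_string(template_text)
    user = configparser.ConfigParser()
    user.read_string(user_text)
    missing = []
    for section in template.sections():
        for key in template[section]:
            if not user.has_option(section, key):
                missing.append((section, key))
    return missing


# Hypothetical contents, for illustration only.
template_text = "[database]\npath = db.sqlite\ntimeout = 5\n"
user_text = "[database]\npath = /tmp/db.sqlite\n"
print(missing_keys(template_text, user_text))
```

In practice the two strings would be read from the distributed template and from ~/horus.ini.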

Feature Extraction - Finish TX and CV feature builders and test it!

Test topic modeling

Update this file (experiments/text_classification/topic_modelling.py). Current accuracy is 0.98% on data/dataset/Wikipedia/wiki_3classes2.csv for PER, LOC and ORG (from DBpedia).
Basically, transform that into a python notebook and perform more experiments in order to detect more topics. Save the models here: horus/resources/models/horus-text/**topic_modeling**/

hint: https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html

NIF Integration

https://github.com/NLP2RDF/NIF-lib

Cognitive Services

Refactor Microsoft Bing Search Engine

CRF layer performance

Compare the standard feature function with the dynamic feature builder. There is still a 0.1 difference in F1. Check why! Notebook: src/training/notebooks/horus_v1/03-horus-training-ner-crf.ipynb

Detecting Object Module

It's too slow, check that

Cache_results problem

Function cache_results is not searching by PROPN tags.

Cache

At some point we could also cache the visual features (extractors) in order to speed up the pipeline.
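A minimal sketch of such a cache, assuming features can be keyed by a hash of the raw image bytes. `FeatureCache` and `fake_extractor` are hypothetical names. The store here is an in-memory dict; a persistent store would be needed to survive restarts.

```python
import hashlib


class FeatureCache:
    """In-memory cache of extractor output, keyed by a hash of the image bytes."""

    def __init__(self, extractor):
        self.extractor = extractor
        self.store = {}

    def get(self, image_bytes):
        key = hashlib.sha1(image_bytes).hexdigest()
        if key not in self.store:
            # Only run the (expensive) extractor on a cache miss.
            self.store[key] = self.extractor(image_bytes)
        return self.store[key]


calls = []


def fake_extractor(image_bytes):
    # Hypothetical stand-in for a visual feature extractor.
    calls.append(1)
    return len(image_bytes)


cache = FeatureCache(fake_extractor)
cache.get(b"abc")
cache.get(b"abc")
print(len(calls))  # number of times the extractor actually ran
```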

Noun Phrase Parsing

Some basic sequences like {[Tom Jobim] was born in [Rio de Janeiro]} get only partial annotations, e.g. {[Tom Jobim] was born in Rio de Janeiro}.

Integrate Twitter NLP (POS Tagger)

http://www.cs.cmu.edu/~ark/TweetNLP/

python wrappers at

*.none

an error when saving some images leads to a *.None extension

W-NUT 2015 - NER dataset integration.

http://noisy-text.github.io/2015/index.html

Pre-processing (HORUS -> CONLL format)

transform HORUS format to CONLL format as output
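A rough sketch of the conversion, under two loud assumptions: that HORUS output follows the token/UPOS/POS/label pattern seen in the d_theta example elsewhere in this list, and that a label of "0" marks a non-entity token. The output is a simplified three-column layout (token, POS, label) without chunk tags or BIO prefixes, so it is not a full CoNLL-2003 file. `horus_to_conll` is a hypothetical name.

```python
def horus_to_conll(sentence):
    # Assumed input: whitespace-separated "token/UPOS/POS/label" items.
    lines = []
    for item in sentence.split():
        # Split from the right so a "/" inside the token itself is preserved.
        token, _upos, pos, label = item.rsplit("/", 3)
        # Assumption: "0" marks a non-entity token; CoNLL uses "O" for that.
        if label == "0":
            label = "O"
        lines.append("%s %s %s" % (token, pos, label))
    # A blank line terminates each sentence in CoNLL-style files.
    return "\n".join(lines) + "\n\n"


print(horus_to_conll("cola/NOUN/NN/LOC has/VERB/VBZ/0"))
```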

Dimensionality Reduction

What is a reasonable threshold we can reach when applying PCA/SVD techniques in order to improve the processing time of HORUS (CV mod)? Getting a reasonable approximation of the current data without having to store everything might be a valid workaround...
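One way to explore the threshold question, sketched with NumPy: compute how many principal components are needed to keep a given share of the variance. The function name `components_for_variance` and the 0.95 default are illustrative assumptions, and the toy matrix stands in for real CV feature vectors.

```python
import numpy as np


def components_for_variance(features, threshold=0.95):
    # Center the columns, then read the variance per component from the singular values.
    centered = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # Smallest k whose leading components reach the requested share of variance.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))


# Toy data: all variation lies along a single direction.
toy = np.array([[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0]])
print(components_for_variance(toy))
```

Running this over a sample of the real features for several thresholds would show how quickly the required number of components grows. The sketch does not handle the degenerate case where all rows are identical.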

Environment file

The conda environment file does not carry over correctly from OSX to Linux.

Index Wikipedia

horus/core/search_engines.py and horus/core/service.py files.

Last Features Checking

media_mod1 0.58243437863
media_mod2 0.620822320117 --> RandomForest / 20 estimators
media_mod3 0.520944537582
media_mod4 0.547179384203
media_mod5 0.646284496886 --> ensemble.Voting (1,2,3)
media_mod6 0.527850685331
media_mod7 0.462354638825

HORUS_MODEL_DT_EXP_001 - POS_Stanford_Sequence_DecisionTree

to include n-grams (POS Stanford) as features
sampling: 5-folds cross-validation (80/20)
dataset: ritter
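For illustration, a small sketch of how 5 folds give an 80/20 split on each round. This is not the experiment's actual sampling code, which is not described here. `k_fold_indices` is a hypothetical helper that works on sentence indices and does no shuffling.

```python
def k_fold_indices(n_items, k=5):
    # Yield (train, test) index lists; each item lands in exactly one test fold.
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


for train, test in k_fold_indices(10):
    print(len(train), len(test))
```

Splitting by sentence index rather than by token keeps each sentence whole, which matters for sequence features such as POS n-grams.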

baseline experiment (no extra features)
ritter dataset (test)
no tuning parameters, max always wins (greedy strategy)
stanford POS tagger (no optimised solution)

HORUS_MODEL_EXP_003_POS_Tagger

New Experimental Idea:

Try to minimize the error propagation (POS Tagger) by considering 2 (or more) sequences of POS annotations.
Obtain a new set of NOUNS Ns

If hypothesis 3 is correct, then it will improve the overall classification.

Train CNN models

similar to SIFT-based ones, as shown here: https://arxiv.org/abs/1710.11027

PER
ORG
LOC (highway, building, etc..)
