smartdataanalytics / horus-ner
HORUS: A framework to boost NLP tasks
License: Apache License 2.0
I executed: conda env create -f environment.yml
And I get:
Traceback (most recent call last):
File "/home/rspeck/anaconda2/lib/python2.7/site-packages/conda/exceptions.py", line 626, in conda_exception_handler
return_value = func(*args, **kwargs)
File "/home/rspeck/anaconda2/lib/python2.7/site-packages/conda_env/cli/main_create.py", line 78, in execute
directory=os.getcwd())
File "/home/rspeck/anaconda2/lib/python2.7/site-packages/conda_env/specs/__init__.py", line 23, in detect
raise SpecNotFound(build_message(specs))
SpecNotFound: Runtime error: Can't process without a name
Conda Env Exception: environment.yml file not found
There is no requirements.txt
Some hints.
@diegoesteves Creating the environment failed again with ResolvePackageNotFound. Even with the updated ones, it still prompts for some packages to be installed.
Details of the system:
OS: Windows 10: Education 64-bit
Using opencv package from conda's repo...
horus/core/search_engines.py and horus/core/service.py files.
precision, recall, f-measure [system]
Model features, images, annotations, etc..
sentence: coca/NOUN/NN/LOC cola/NOUN/NN/LOC has/VERB/VBZ/0 a/DET/DT/0 strange/ADJ/JJ/0 flavor/NOUN/NN/LOC
distance_theta: should not annotate flavor as LOC, since dtheta is = 1 and flavor got 2,2,0, thus dtheta = 0!
coca cola should be updated, since the compound correctly returned 2,3,0 with a high bias to not(LOC) => -40!
These errors are propagating and directly impacting the performance measures. This sentence, for instance, should have 100% accuracy and is getting 0.
Each POS tag model may have a different tag set, so the encoder fails when an unseen POS tag is used as a feature. Technically the solution is simple (a vector containing all possible tags), but it can lead to a worse predictor, since it increases the number of dimensions unnecessarily without adding extra value. This has to be checked carefully later.
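One possible middle ground, sketched below, is to keep the vector limited to the tags seen at training time and reserve a single extra slot for anything unseen, so the encoder never fails and the dimensionality grows by only one. The class name `TagEncoder` and the `<UNK>` marker are hypothetical and not part of HORUS; whether this helps or hurts the predictor would still need to be measured.

```python
class TagEncoder:
    """One-hot encoder for POS tags with one shared slot for unseen tags.

    Hypothetical helper for illustration; not the encoder used in HORUS.
    """

    UNKNOWN = "<UNK>"

    def __init__(self, known_tags):
        # Index 0 is reserved for any tag that was not seen at fit time.
        self.index = {self.UNKNOWN: 0}
        for tag in sorted(set(known_tags)):
            self.index[tag] = len(self.index)

    def encode(self, tag):
        # Unknown tags fall back to index 0 instead of raising an error.
        vector = [0] * len(self.index)
        vector[self.index.get(tag, 0)] = 1
        return vector


encoder = TagEncoder(["NN", "VBZ", "DT", "JJ"])
known = encoder.encode("NN")    # tag seen at fit time
unseen = encoder.encode("HT")   # tag from another tag set
```

Both calls return a vector of the same length, so a model trained with one tagger can still consume features produced by another.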
Integrate GENESIS
http://genesis.aksw.org/
I tried to clone the repository and I have some problems getting it up and running.
Initially, I copied horus_dist.ini to ~/horus.ini and set the parameters.
Next, I tried to install the module with:
python setup.py install --record files.txt
And it failed with the following error message:
error: package directory 'src/horus/sift' does not exist
The next thing I tried was to run the source code directly without having the whole module installed. I had two issues:
The horus.ini file that I copied from src/horus/resource/horus_dist.init has some missing keys. I just added an empty value for each of those keys, but I'm not sure whether that will resolve all the issues.
I did not have the database in advance. I tried to create it with ./src/horus/components/util/script_db.py, but the db model in this file is obsolete. I tried to run ./src/horus/components/webservice/rest.py and the model needed some columns that did not exist in my database.
Update this file (experiments/text_classification/topic_modelling.py). Current accuracy is 0.98% on data/dataset/Wikipedia/wiki_3classes2.csv for PER, LOC and ORG (from dbpedia).
Basically, transform that into a python notebook and perform more experiments in order to detect more topics. Save the models here: horus/resources/models/horus-text/**topic_modeling**/
hint: https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html
Refactor Microsoft Bing Search Engine
Compare the standard feature function with the dynamic feature builder. There is still a 0.1 difference in F1. Check why! Notebook: src/training/notebooks/horus_v1/03-horus-training-ner-crf.ipynb
It's too slow, check that
Function cache_results is not searching by PROPN tags.
Some basic sequences like {[Tom Jobim] was born in [Rio de Janeiro]} had only partial annotations, e.g. {[Tom Jobim] was born in Rio de Janeiro}.
error on saving some images leads to *.None extension
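The root cause is not stated here. One plausible explanation, offered only as an assumption, is that the image type detector returned None and that value was formatted straight into the file name. A minimal guard could look like the sketch below; `image_filename` and the `jpg` default are hypothetical and not taken from the HORUS code.

```python
def image_filename(stem, detected_type, default="jpg"):
    """Build a file name, falling back to a default extension.

    `detected_type` is whatever the (unspecified) type detector returned;
    None or an empty string means detection failed.
    """
    extension = detected_type if detected_type else default
    return "{}.{}".format(stem, extension)


unguarded = "{}.{}".format("img_001", None)   # reproduces the symptom
guarded = image_filename("img_001", None)     # falls back to the default
```

Whether to fall back to a default extension or to skip the image entirely is a design choice that depends on how the downstream CV module treats unreadable files.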
transform HORUS format to CONLL format as output
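A rough sketch of such a converter follows. It assumes the HORUS output looks like the sentence quoted in the issue above, i.e. whitespace-separated items of the form token/UPOS/POS/NER with `0` meaning "no entity", and it assumes a three-column, tab-separated CoNLL layout (token, POS, label). Both assumptions, and the function name `horus_to_conll`, are illustrative and would need to be checked against the real output format.

```python
def horus_to_conll(sentence):
    """Convert 'token/UPOS/POS/NER ...' into CoNLL-style lines.

    Assumed input layout: token/UPOS/POS/NER, with '0' marking no entity.
    Output: one 'token<TAB>POS<TAB>label' line per token.
    """
    lines = []
    for item in sentence.split():
        # rsplit from the right keeps tokens that contain '/' intact.
        token, _upos, pos, ner = item.rsplit("/", 3)
        label = "O" if ner == "0" else ner
        lines.append("\t".join([token, pos, label]))
    return "\n".join(lines) + "\n"


example = horus_to_conll("coca/NOUN/NN/LOC cola/NOUN/NN/LOC has/VERB/VBZ/0")
```

The sketch does not add B-/I- prefixes; if the target evaluation script expects BIO labels, that step would have to be added.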
What's a reasonable threshold when applying PCA/SVD techniques to improve the processing time of HORUS (CV mod)? Getting a reasonable approximation of the current data without having to store everything might be a valid workaround...
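One common way to frame the threshold is as a fraction of explained variance: keep the smallest number of components whose squared singular values add up to at least that fraction of the total. The sketch below assumes the singular values of the (centered) feature matrix are already available; the function name and the 0.95 default are assumptions for illustration, not values validated on HORUS data.

```python
def components_for_variance(singular_values, threshold=0.95):
    """Return the smallest k whose components reach the variance threshold.

    For centered data, the variance explained by each component is
    proportional to its squared singular value.
    """
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running / total >= threshold:
            return k
    return len(energies)


k_loose = components_for_variance([3.0, 1.0, 1.0], threshold=0.8)
k_tight = components_for_variance([3.0, 1.0, 1.0], threshold=0.9)
```

Sweeping the threshold and plotting the resulting F1 against processing time would show where the approximation starts to cost accuracy.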
conda is not interoperating correctly from OSX to LINUX.
media_mod1 0.58243437863
media_mod2 0.620822320117 --> RandomForest / 20 estimators
media_mod3 0.520944537582
media_mod4 0.547179384203
media_mod5 0.646284496886 --> ensemble.Voting (1,2,3)
media_mod6 0.527850685331
media_mod7 0.462354638825
Integrate as many datasets as possible
e.g.: MIRFLICKR
then use word2vec + stemmer + wordnet + possible further similarity functions to obtain related terms and minimize the sparsity of the main dataset
The old one (ritter_ner.tsv) has some missed annotations.
http://webdatacommons.org/structureddata/2017-12/files/html-rdfa.list
New Experimental Idea:
If hypothesis 3 is correct, then it will leverage the overall classification.
similar to SIFT-based ones, as shown here: https://arxiv.org/abs/1710.11027
Try HORUS on Chinese data