kensuke-mitsuzawa / documentfeatureselection

A set of metrics for feature selection from text data

License: Other

Python 99.16% Dockerfile 0.84%
nlp feature-selection feature-extraction python-3 pmi tf-idf bns soa docker webapp

documentfeatureselection's Issues

Information needed to compute SOA

  • Number of sentences held by label e (list)
  • Mapping between labels and indices (dict)
  • Number of times word w appears under label e (term-frequency matrix, scipy CSR)
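
As a rough illustration of how these three inputs fit together, the sketch below assumes the commonly cited definition SOA(w, e) = PMI(w, e) - PMI(w, ¬e), which simplifies to log2(freq(w, e) * freq(¬e) / (freq(e) * freq(w, ¬e))). The function name `soa`, the extra `vocab2index` mapping, and the plain nested lists standing in for the scipy CSR matrix are assumptions made for this example, not the library's API.

```python
import math

def soa(freq_matrix, n_docs_per_label, label2index, vocab2index, word, label):
    """Strength of association of `word` with `label` (hypothetical helper).

    freq_matrix[e][w] is the number of times word w occurs under label e.
    Returns None when either count is zero, because the ratio is then
    zero or undefined and its logarithm is not a finite number.
    """
    e = label2index[label]
    w = vocab2index[word]
    freq_w_e = freq_matrix[e][w]
    # Occurrences of the word under every other label.
    freq_w_not_e = sum(row[w] for i, row in enumerate(freq_matrix) if i != e)
    freq_e = n_docs_per_label[e]
    freq_not_e = sum(n_docs_per_label) - freq_e
    if freq_w_e == 0 or freq_w_not_e == 0:
        return None
    return math.log2((freq_w_e * freq_not_e) / (freq_e * freq_w_not_e))

# Toy inputs mirroring the three items listed above.
n_docs_per_label = [2, 2]                      # sentences per label (list)
label2index = {"label_a": 0, "label_b": 1}     # label -> row index (dict)
vocab2index = {"good": 0, "movie": 1, "great": 2}
freq_matrix = [
    [3, 1, 2],  # label_a
    [1, 1, 0],  # label_b
]

score_good = soa(freq_matrix, n_docs_per_label, label2index, vocab2index, "good", "label_a")
score_movie = soa(freq_matrix, n_docs_per_label, label2index, vocab2index, "movie", "label_a")
score_great = soa(freq_matrix, n_docs_per_label, label2index, vocab2index, "great", "label_a")
print(score_good, score_movie, score_great)
```

A word that never occurs outside the label ("great" here) has a zero denominator. The sketch returns None for that case rather than an infinite score; how the library itself treats it is not stated here.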

Assertion Error

I copied and pasted the example code and got an AssertionError (traceback at the end of this post).
I'm running this code on Windows 10 with Python 3.6 and Anaconda.

I printed out scored_matrix and found that some of the values are inf. I believe this caused the assertion error, since the assertion checks that the value is an integer or float.

But I don't know why it generates infinite values...

Thank you for making this package!

(0, 0) inf
(0, 1) inf
(0, 2) 1.309320092201233
(0, 3) inf
(0, 4) 0.7615997791290283
(0, 5) 0.27188003063201904
(0, 6) inf
(1, 0) inf
(1, 1) inf
(1, 2) 1.309320092201233
(1, 3) inf
(1, 4) 0.7615997791290283
(1, 5) 0.27188003063201904
(1, 6) inf

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 276, in convert_score_matrix2score_record
frequency_matrix=self.frequency_matrix

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 367, in get_feature_dictionary
weight_value_index_items = self.make_non_zero_information(weighted_matrix)

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 317, in make_non_zero_information
self.__get_value_index(row_indexes[i], column_indexes[i], weight_csr_matrix))

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 292, in __get_value_index
assert isinstance(row_index, (int, int32, int64))

AssertionError
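
For anyone debugging something similar, here is a minimal sketch of two checks that may help narrow it down. Note that the assertion shown in the traceback tests the type of `row_index`, not the score value. The helper names `is_valid_index` and `split_by_finiteness` are made up for this example, and the (row, column, value) triples are copied from the printout above. `numbers.Integral` is used on the assumption that NumPy integer scalars register with it, which avoids listing specific types such as `int32` or `int64`.

```python
import math
import numbers

def is_valid_index(index):
    """True for any integer-like index; bool is excluded on purpose."""
    return isinstance(index, numbers.Integral) and not isinstance(index, bool)

def split_by_finiteness(entries):
    """Split (row, column, value) triples into finite and non-finite scores."""
    finite, non_finite = [], []
    for row, col, value in entries:
        if not (is_valid_index(row) and is_valid_index(col)):
            raise TypeError("index is not an integer: %r, %r" % (row, col))
        if math.isfinite(value):
            finite.append((row, col, value))
        else:
            non_finite.append((row, col, value))
    return finite, non_finite

# A few entries taken from the scored_matrix printout in this post.
entries = [
    (0, 0, float("inf")),
    (0, 2, 1.309320092201233),
    (0, 4, 0.7615997791290283),
    (1, 6, float("inf")),
]

finite, non_finite = split_by_finiteness(entries)
print(len(finite), "finite,", len(non_finite), "non-finite")
```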

Cannot use a large dictionary as the input data source

Problem

The interface expects a dict object. However, dict objects tend to become huge, and the process then runs out of memory.

Solution-1

Merit

Use the corpus object of textacy.
This object can generate a document-term matrix easily.

Negative point

It is unclear whether this fits this repository.

Solution-2

Use shelve.

Merit

It is part of the Python standard library.
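
A minimal sketch of what Solution-2 could look like, assuming the input is a mapping from label to a list of tokenized documents. The function names `store_labeled_docs` and `iter_labeled_docs` and the file name are hypothetical. The idea is that each label's documents are written to disk once and read back one label at a time, so the full dictionary never has to sit in memory.

```python
import os
import shelve
import tempfile

def store_labeled_docs(path, labeled_docs):
    """Write (label, documents) pairs to a shelf on disk, one label at a time."""
    with shelve.open(path) as db:
        for label, docs in labeled_docs:
            db[label] = docs  # shelve keys must be str; values are pickled

def iter_labeled_docs(path):
    """Yield (label, documents) pairs lazily from an existing shelf."""
    with shelve.open(path, flag="r") as db:
        for label in db:
            yield label, db[label]

# Toy data; a real corpus would be streamed from files instead.
sample = [
    ("label_a", [["I", "like", "tea"], ["tea", "is", "good"]]),
    ("label_b", [["coffee", "is", "fine"]]),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_path = os.path.join(tmp_dir, "labeled_docs")
    store_labeled_docs(shelf_path, sample)
    loaded = dict(iter_labeled_docs(shelf_path))

print(sorted(loaded))
```

The trade-off is that every access goes through pickle and disk I/O, so this favors memory use over speed.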

BNS Clarification needed: DocFreq or TermFreq?

I have been looking through the examples in the repo to see how BNS is computed and came across the following. Judging by the method names, the first case uses term frequency while the second uses document frequency. Could you please clarify which is correct?

  1. In examples/basic_example.py, which uses DocumentFeatureSelection/interface.py, line 128:

    matrix_data_object = data_converter.DataConverter().convert_multi_docs2term_frequency_matrix(...)

  2. In tests/test_bns_python3.py, line 105:

    data_csr_matrix = data_converter.DataConverter().labeledMultiDocs2DocFreqMatrix(...)

Thank you in advance,
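
For reference, a sketch of the distinction being asked about. It assumes Forman's formulation of Bi-Normal Separation, BNS = |F⁻¹(tpr) - F⁻¹(fpr)|, where tp and fp count documents containing the term and the rates are clipped to [0.0005, 0.9995]. The helper names and the `eps` parameter are hypothetical, and this thread does not confirm which counting the library uses. `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def document_frequency(docs, term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def term_frequency(docs, term):
    """Total number of occurrences of the term across all documents."""
    return sum(doc.count(term) for doc in docs)

def bns(pos_docs, neg_docs, term, eps=0.0005):
    """Bi-Normal Separation computed from document frequencies."""
    tpr = document_frequency(pos_docs, term) / len(pos_docs)
    fpr = document_frequency(neg_docs, term) / len(neg_docs)
    # Clip the rates so the inverse normal CDF stays finite at 0 and 1.
    tpr = min(max(tpr, eps), 1 - eps)
    fpr = min(max(fpr, eps), 1 - eps)
    inv_cdf = NormalDist().inv_cdf
    return abs(inv_cdf(tpr) - inv_cdf(fpr))

pos_docs = [["good", "good", "movie"], ["good", "plot"]]
neg_docs = [["bad", "movie"], ["bad", "plot"]]

df_good = document_frequency(pos_docs, "good")  # counts documents
tf_good = term_frequency(pos_docs, "good")      # counts occurrences
bns_good = bns(pos_docs, neg_docs, "good")
bns_movie = bns(pos_docs, neg_docs, "movie")
print(df_good, tf_good, bns_good, bns_movie)
```

The two counts differ whenever a term repeats within a document, so the choice of input matrix changes the resulting rates.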

Running library leaves lots of temporary directories

I have been learning how to use this library (I am particularly interested in Bi-Normal Separation), and when running the tests with the command:

python3 setup.py test

I noticed that there are lots of tmp directories created that are not deleted afterwards. Would it be possible to implement automatic cleanup in the library itself?

I am using Python 3.4.2
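
One possible shape for such a cleanup is sketched below, using `tempfile.TemporaryDirectory` from the standard library. How the library actually creates its temporary directories is not shown in this thread, so the helper names `run_in_scratch_dir` and `write_intermediate_file` and the file name are purely illustrative.

```python
import os
import tempfile

def run_in_scratch_dir(work):
    """Run work(scratch_dir) and delete the directory afterwards, even on error."""
    with tempfile.TemporaryDirectory(prefix="dfs_example_") as scratch_dir:
        result = work(scratch_dir)
    # The directory and its contents are gone once the with-block exits.
    return result, scratch_dir

def write_intermediate_file(scratch_dir):
    """Stand-in for a step that needs on-disk working space."""
    path = os.path.join(scratch_dir, "matrix.tmp")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("intermediate data")
    return os.path.getsize(path)

size, used_dir = run_in_scratch_dir(write_intermediate_file)
print(size, os.path.exists(used_dir))
```

Because the context manager removes the directory on exit, no separate cleanup step is needed after the tests finish.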
