kensuke-mitsuzawa / documentfeatureselection

45.0 6.0 12.0 546 KB

A set of metrics for feature selection from text data

License: Other

Python 99.16% Dockerfile 0.84%

nlp feature-selection feature-extraction python-3 pmi tf-idf bns soa docker webapp

documentfeatureselection's People

Contributors

Stargazers

Watchers

Forkers

sandy4321 python3pkg fahad92virgo undarmaa guoyu07 o0windseed0o avsolatorio aaasss0636 kyran0255 lapulasitu chandupentela72 gkud

documentfeatureselection's Issues

Incorrect BNS function in code

The current BNS code is incorrect. The score should be abs(norm.ppf(tpr) - norm.ppf(fpr)).
You must also filter out words that occur fewer than 3 times (tp + fp counts).
If you only want positively associated words, drop the abs().

DocumentFeatureSelection/DocumentFeatureSelection/bns/bns_python3.py

Line 65 in b038d26

bns_score = np.abs(norm.ppf(norm.cdf(tpr)) - norm.ppf(norm.cdf(fpr)))
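A minimal sketch of the corrected score, assuming tpr and fpr are derived from per-class document counts. It uses statistics.NormalDist().inv_cdf from the standard library in place of scipy.stats.norm.ppf (both compute the inverse normal CDF). The function name bns_score, the min_count parameter and the clipping bound eps are illustrative assumptions, not names taken from the repository. Clipping keeps both rates strictly inside (0, 1), because the inverse CDF diverges at 0 and 1.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def bns_score(tp, fp, n_pos, n_neg, min_count=3, eps=0.0005, signed=False):
    """Bi-Normal Separation for one word (hypothetical helper).

    tp / fp: number of positive / negative documents containing the word.
    n_pos / n_neg: total number of positive / negative documents.
    Returns None for words seen fewer than `min_count` times (tp + fp).
    """
    if tp + fp < min_count:
        return None  # too rare to score reliably
    # Clip the rates away from 0 and 1, where the inverse CDF diverges.
    tpr = min(max(tp / n_pos, eps), 1.0 - eps)
    fpr = min(max(fp / n_neg, eps), 1.0 - eps)
    diff = _STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr)
    # Without abs(), only positively associated words get a positive score.
    return diff if signed else abs(diff)


print(bns_score(tp=40, fp=5, n_pos=100, n_neg=100))
print(bns_score(tp=1, fp=1, n_pos=100, n_neg=100))  # filtered out: None
```

Passing signed=True corresponds to dropping abs(), so words associated with the negative class receive a negative score.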

sk-learn style interface

Use from sklearn.base import BaseEstimator and implement the interface on top of it.

Importing library leaves behind tmp directories

This is a follow-up to the issue "Running library leaves lots of temporary directories".

It turns out that tmp directories are created as soon as the library is imported, and they are not deleted afterwards. Steps to reproduce:

start python3 in the terminal
type the command
from DocumentFeatureSelection import interface
Observe 5 new empty directories created in tmp/
Press Ctrl-D to exit python session
Observe tmp directories are still there
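One way a library can avoid this behaviour, sketched under the assumption that the directories come from tempfile.mkdtemp() calls made at import time (the issue does not say where they originate): create the directory lazily and register its removal with atexit. get_work_dir and the "dfs_" prefix are hypothetical, not part of DocumentFeatureSelection.

```python
import atexit
import os
import shutil
import tempfile

_work_dir = None  # created on first use, not at import time


def get_work_dir():
    """Return a process-wide temporary directory, creating it lazily."""
    global _work_dir
    if _work_dir is None:
        _work_dir = tempfile.mkdtemp(prefix="dfs_")
        # Remove the directory when the interpreter exits normally.
        atexit.register(shutil.rmtree, _work_dir, ignore_errors=True)
    return _work_dir


path = get_work_dir()
print(os.path.isdir(path))
```

With this pattern, importing the module alone creates nothing, and a session that does use the directory cleans it up on a normal exit such as Ctrl-D.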

Regards,

Processing big data takes a lot of time

Plans to make it faster

Use feature extraction in scikit-learn

http://scikit-learn.org/stable/modules/feature_extraction.html

Delete/rename improper method/module names

Slow construction of the frequency matrix

Consider using gensim

Information needed to compute SOA

Number of sentences held by label e (list)
Mapping between labels and indexes (dict)
Number of times word w appears in label e (term-frequency matrix, SciPy CSR)
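The three inputs listed above are enough to compute SOA (strength of association) for one word and one label. The sketch below assumes the usual definition, SOA(w, e) = log2(freq(w, e) * freq(not e) / (freq(e) * freq(w, not e))), and uses a dense list of lists as a stand-in for the CSR matrix. The function and parameter names are illustrative, not the repository's.

```python
import math


def soa(term_freq, n_sents_per_label, label2index, label, word_index):
    """Strength of association between one word and one label (sketch).

    term_freq: rows are labels, columns are words (dense stand-in for CSR).
    n_sents_per_label: number of sentences held by each label (list).
    label2index: mapping from label name to row index (dict).
    """
    e = label2index[label]
    freq_w_e = term_freq[e][word_index]
    freq_w_not_e = sum(row[word_index] for i, row in enumerate(term_freq) if i != e)
    freq_e = n_sents_per_label[e]
    freq_not_e = sum(n_sents_per_label) - freq_e
    if freq_w_e == 0 or freq_w_not_e == 0:
        return None  # the log ratio is undefined; the caller decides what to do
    return math.log2((freq_w_e * freq_not_e) / (freq_e * freq_w_not_e))


term_freq = [[4, 1], [1, 3]]
print(soa(term_freq, [10, 10], {"pos": 0, "neg": 1}, "pos", 0))  # log2(4) = 2.0
```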

Replace dictionary with numpy array

http://stackoverflow.com/questions/18041848/efficient-way-to-hold-and-process-a-big-dict-in-memory-in-python

Consider using Cython in the bottleneck parts

These parts take a lot of time because of their O(n^2) computation cost:
Computing SOA
Computing PMI

Assertion Error

I copied and pasted the example code and got an AssertionError (traceback at the end of this post).
I'm running this code on Windows 10 and Python 3.6 with Anaconda.

I printed out the scored_matrix and found that some of the values are inf, which I believe caused the assertion error, since the assertion checks whether the value is an integer or a float.

But I don't know why it generates infinite values.

Thank you for making this package!

(0, 0) inf
(0, 1) inf
(0, 2) 1.309320092201233
(0, 3) inf
(0, 4) 0.7615997791290283
(0, 5) 0.27188003063201904
(0, 6) inf
(1, 0) inf
(1, 1) inf
(1, 2) 1.309320092201233
(1, 3) inf
(1, 4) 0.7615997791290283
(1, 5) 0.27188003063201904
(1, 6) inf

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 276, in convert_score_matrix2score_record
frequency_matrix=self.frequency_matrix

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 367, in get_feature_dictionary
weight_value_index_items = self.make_non_zero_information(weighted_matrix)

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 317, in make_non_zero_information
self.__get_value_index(row_indexes[i], column_indexes[i], weight_csr_matrix))

File "C:\Users\Ryan\Anaconda3\lib\site-packages\DocumentFeatureSelection\models.py", line 292, in __get_value_index
assert isinstance(row_index, (int, int32, int64))

AssertionError

TF-IDF score is always same

This happens because the following line should be computed per document:

https://github.com/Kensuke-Mitsuzawa/DocumentFeatureSelection/blob/master/DocumentFeatureSelection/common/labeledMultiDocs2labeledDocsSet.py#L51

Impossible to use a large dictionary as the data source

Problem

The interface expects a dict object, but such dict objects tend to be huge and can exhaust the available memory.

Solution-1

Merit

Use the Corpus object of textacy.
This object can generate a document-term matrix easily.

Negative point

It's not cleared this repository.

Solution-2

Use shelve.

Good point

It is part of the Python standard library.
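A small sketch of the shelve idea, assuming the input is a mapping from label to a list of tokenized documents (the sample data and the file name docs_shelf are made up for illustration). Each label is written to disk under its own key, so a reader only loads the key it asks for instead of the whole dict.

```python
import os
import shelve
import tempfile

# Hypothetical input: label -> list of tokenized documents.
docs_by_label = {
    "label_a": [["I", "like", "tea"], ["tea", "is", "good"]],
    "label_b": [["coffee", "is", "bitter"]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "docs_shelf")

    # Write each label's documents to disk one key at a time.
    with shelve.open(db_path) as shelf:
        for label, docs in docs_by_label.items():
            shelf[label] = docs

    # Read back lazily: only the requested key is unpickled.
    with shelve.open(db_path) as shelf:
        labels = sorted(shelf.keys())
        n_docs_a = len(shelf["label_a"])

print(labels, n_docs_a)
```

In a real pipeline the documents would be streamed into the shelf rather than built as an in-memory dict first. Whether the library's interface could accept a shelf in place of a dict is not stated in the issue.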

BNS Clarification needed: DocFreq or TermFreq?

I have been looking through the examples in the repo of how BNS is computed and came across the following. As the method names show, the first case uses term frequency while the second uses document frequency. Could you please clarify which is correct?

in examples/basic_example.py that uses DocumentFeatureSelection/interface.py, line 128:

matrix_data_object = data_converter.DataConverter().convert_multi_docs2term_frequency_matrix(...)
in tests/test_bns_python3.py line 105:

data_csr_matrix = data_converter.DataConverter().labeledMultiDocs2DocFreqMatrix(...)

Thank you in advance,

Installation error when Cython is missing

This line causes an error.

This is because Extension (from distutils.extension import Extension) has not been imported at this line.

normalized PMI
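For reference, normalized PMI is commonly defined as PMI(x, y) divided by -log p(x, y), which bounds the score to [-1, 1]. The sketch below computes it from raw counts. The function name and argument names are illustrative and do not come from the repository.

```python
import math


def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI in [-1, 1] from raw co-occurrence counts (sketch)."""
    if n_xy == 0:
        return -1.0  # conventional limit when x and y never co-occur
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 1.0  # x and y always co-occur; avoids dividing by log(1) = 0
    pmi = math.log(p_xy / ((n_x / n_total) * (n_y / n_total)))
    return pmi / -math.log(p_xy)


print(npmi(n_xy=10, n_x=10, n_y=10, n_total=100))  # complete co-occurrence
print(npmi(n_xy=25, n_x=50, n_y=50, n_total=100))  # independence
```

Complete co-occurrence gives a value of 1 (up to floating-point rounding) and independence gives 0, which makes scores comparable across words with very different frequencies.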

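Create a function that converts directly into the feature space

Create a function named fit_transform(), as in scikit-learn.

It only needs to compute the value for each index and return a matrix.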
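A rough sketch of what such an interface might look like. The class name LabelWordScorer and the score_func signature are hypothetical. A real implementation would presumably inherit from sklearn.base.BaseEstimator and work on sparse matrices, but plain Python lists are used here to keep the example self-contained.

```python
class LabelWordScorer:
    """Hypothetical scikit-learn-style scorer; not part of the library."""

    def __init__(self, score_func):
        # score_func(count, label_total, word_total, grand_total) -> float
        self.score_func = score_func

    def fit(self, counts):
        # counts: rows are labels, columns are words.
        self.label_totals_ = [sum(row) for row in counts]
        self.word_totals_ = [sum(col) for col in zip(*counts)]
        self.total_ = sum(self.label_totals_)
        return self

    def transform(self, counts):
        # Compute one score per (label, word) index and return a matrix.
        return [
            [self.score_func(c, self.label_totals_[i], self.word_totals_[j], self.total_)
             for j, c in enumerate(row)]
            for i, row in enumerate(counts)
        ]

    def fit_transform(self, counts):
        return self.fit(counts).transform(counts)


# Example scoring function: relative frequency of the word within its label.
scorer = LabelWordScorer(lambda c, label_total, word_total, total: c / label_total)
print(scorer.fit_transform([[2, 2], [1, 3]]))  # [[0.5, 0.5], [0.25, 0.75]]
```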

Consuming a lot of memory to save objects during computation

Problem

When computing PMI on an input matrix of size 20 x 10972064, the code consumes around 20 GB of memory.

Ways to reduce memory usage

for matrix objects
- numpy.memmap
- pytables
for saving other objects
- use a persistent dict class to hold them

Convert the input into a JSON string inside the package when the input object is a list/tuple

Web API with Docker

Purpose

It makes this tool usable from any language.

Incorrect computation of TF-IDF when the input list mixes N-gram lengths

{'category-name': [ unigram-object, unigram-object, bigram-object ]}

This input is invalid for the TF-IDF computation.
Each N-gram length should be computed separately:

{'category-name': [ unigram-object, unigram-object ]}

{'category-name': [ bigram-object ]}
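The split described above could be done before scoring. The sketch below assumes a unigram is a str and a longer N-gram is a tuple of str. The issue does not say how these objects are represented, and split_by_ngram_length is a hypothetical helper.

```python
from collections import defaultdict


def split_by_ngram_length(labeled_docs):
    """Group each category's features by N-gram length (sketch).

    Assumes a unigram is a str and a longer N-gram is a tuple of str.
    Returns {n: {category: [features of length n]}}.
    """
    by_length = defaultdict(dict)
    for category, features in labeled_docs.items():
        for feature in features:
            n = 1 if isinstance(feature, str) else len(feature)
            by_length[n].setdefault(category, []).append(feature)
    return dict(by_length)


mixed = {"category-name": ["cat", "dog", ("hot", "dog")]}
print(split_by_ngram_length(mixed))
```

Each value of the returned dict can then be passed to the TF-IDF computation on its own, so unigram and bigram statistics are never mixed.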

Building the matrix takes an abnormally long time

Consider using multiprocessing.

https://github.com/Kensuke-Mitsuzawa/document_feature_selection/blob/master/document_feature_selection/pmi/pmi_csr_matrix.py#L181

Running library leaves lots of temporary directories

I have been learning how to use this library (I am particularly interested in Bi-Normal Separation), and when running the tests with the command:

python3 setup.py test

I noticed that there are lots of tmp directories created that are not deleted afterwards. Would it be possible to implement automatic cleanup in the library itself?

I am using Python 3.4.2
