scikit-multilearn / scikit-multilearn

A scikit-learn based module for multi-label et. al. classification

License: BSD 2-Clause "Simplified" License

scikit-multilearn's Introduction

scikit-multilearn

scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on top of various scientific Python packages (numpy, scipy) and follows an API similar to that of scikit-learn.

Website: scikit.ml
Documentation: scikit-multilearn Documentation

Features

Native Python implementation. A native Python implementation for a variety of multi-label classification algorithms. To see the list of all supported classifiers, check this link.
Interface to Meka. A Meka wrapper class is implemented for reference purposes and integration. This provides access to all methods available in MEKA, MULAN, and WEKA — the reference standard in the field.
Builds upon giants! Team-up with the power of numpy and scikit. You can use scikit-learn's base classifiers as scikit-multilearn's classifiers. In addition, the two packages follow a similar API.

Installation & Dependencies

To install scikit-multilearn, simply type the following command:

$ pip install scikit-multilearn

This will install the latest release from the Python package index. If you wish to install the bleeding-edge version, then clone this repository and run setup.py:

$ git clone https://github.com/scikit-multilearn/scikit-multilearn.git
$ cd scikit-multilearn
$ python setup.py install

In most cases, the requirements are installed automatically when you run pip install scikit-multilearn or python setup.py install. There are also optional dependencies: pip install scikit-multilearn[gpl,keras,meka] installs the GPL-incurring igraph for the igraph-based clusterers, keras for the Keras classifiers, and the requirements for the MEKA bridge, respectively.

To install openNE, run:

pip install 'openne @ git+https://github.com/thunlp/OpenNE.git@master#subdirectory=src'

Note that installing the GPL-licensed graphtool, needed for the graphtool-based clusterers, is complicated and must be done manually. Please see: graphtool install instructions

Basic Usage

Before proceeding to classification, this library assumes that you have a dataset with the following matrices:

X_train, X_test: training and test feature matrices of size (n_samples, n_features)
y_train, y_test: training and test label matrices of size (n_samples, n_labels)

Suppose we want to apply a problem-transformation method called Binary Relevance, which treats each label as a separate single-label classification problem, with a support vector machine (SVM) as the base classifier. We simply perform the following steps:

# Import BinaryRelevance from skmultilearn
from skmultilearn.problem_transform import BinaryRelevance

# Import SVC classifier from sklearn
from sklearn.svm import SVC

# Setup the classifier
classifier = BinaryRelevance(classifier=SVC(), require_dense=[False,True])

# Train
classifier.fit(X_train, y_train)

# Predict
y_pred = classifier.predict(X_test)

More examples and use-cases can be seen in the documentation. For using the MEKA wrapper, check this link.
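
To make the idea behind Binary Relevance concrete, here is a minimal pure-Python sketch of the decomposition. It is not the library's implementation: `BinaryRelevanceSketch` and the toy base learner `MajorityClassifier` are hypothetical names introduced only for illustration, and the data is made up.

```python
from collections import Counter
import copy


class MajorityClassifier:
    """Hypothetical toy base learner: always predicts the most common training label."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class BinaryRelevanceSketch:
    """Illustrative only: trains one independent copy of the base learner per label."""

    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, Y):
        n_labels = len(Y[0])
        self.models_ = []
        for j in range(n_labels):
            column = [row[j] for row in Y]          # single-label target for label j
            model = copy.deepcopy(self.classifier)  # fresh, unfitted copy per label
            self.models_.append(model.fit(X, column))
        return self

    def predict(self, X):
        per_label = [model.predict(X) for model in self.models_]
        # Transpose per-label predictions back into (n_samples, n_labels) rows.
        return [list(row) for row in zip(*per_label)]


X_train = [[0.0], [1.0], [2.0], [3.0]]
Y_train = [[1, 0], [1, 0], [0, 0], [1, 1]]
br = BinaryRelevanceSketch(MajorityClassifier()).fit(X_train, Y_train)
Y_pred = br.predict([[4.0], [5.0]])
```

Each label column is learned in isolation, so any single-label classifier with fit and predict can be plugged in; the price is that correlations between labels are ignored.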

Contributing

This project is open for contributions. Here are some of the ways for you to contribute:

Bug reports/fixes
Feature requests
Use-case demonstrations
Documentation updates

In case you want to implement your own multi-label classifier, please read our Developer's Guide to help you integrate your implementation in our API.

To make a contribution, just fork this repository, push the changes in your fork, open up an issue, and make a Pull Request!

We're also available on Slack! Just go to our Slack group.

Cite

If you used scikit-multilearn in your research or project, please cite our work:

@ARTICLE{2017arXiv170201460S,
   author = {{Szyma{\'n}ski}, P. and {Kajdanowicz}, T.},
   title = "{A scikit-based Python environment for performing multi-label classification}",
   journal = {ArXiv e-prints},
   archivePrefix = "arXiv",
   eprint = {1702.01460},
   year = 2017,
   month = feb
}

scikit-multilearn's Issues

Implement hierarchical MLG

MLknn

Sorry, I do not have the time to make a pull request, but there is an "error", and I would also like an extra function:
In mlknn's compute_cond, MLkNN also counts the first entry of kneighbors, which is the instance itself. I would only count the neighborhood:
for instance in xrange(self.num_instances):
    neighbors = self.knn.kneighbors(X[instance], self.k, return_distance=False)[0,1:]
    for label in xrange(self.num_labels):
        delta = sum(y[neighbor][label] for neighbor in neighbors)
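
The proposed change can be looked at in isolation. The following brute-force sketch (pure Python, hypothetical function name, made-up points) returns the k nearest neighbours of a training point without counting the point itself:

```python
def neighbours_excluding_self(points, index, k):
    """Indices of the k points nearest to points[index], not counting itself."""
    query = points[index]

    def sq_dist(p):
        # Squared Euclidean distance is enough for ranking neighbours.
        return sum((a - b) ** 2 for a, b in zip(p, query))

    others = [i for i in range(len(points)) if i != index]
    return sorted(others, key=lambda i: sq_dist(points[i]))[:k]


points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [5.0, 5.0]]
nearest = neighbours_excluding_self(points, index=0, k=2)
```

Note that slicing `[0,1:]` off a result of `self.k` neighbours leaves only k-1 of them, so a real fix would presumably have to request k+1 neighbours before dropping the first one.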

also the rankings would be nice:

def predict_rankings(self, X):
    result = np.zeros((len(X), self.num_labels), dtype='i8')
    ranks = np.zeros((len(X), self.num_labels))
    for instance in xrange(len(X)):
        neighbors = self.knn.kneighbors(X[instance], self.k, return_distance=False)
        for label in xrange(self.num_labels):
            delta = sum(self.predictions[neighbor][label] for neighbor in neighbors[0])
            p_true = self.prior_prob_true[label] * self.cond_prob_true[label][delta]
            p_false = self.prior_prob_false[label] * self.cond_prob_false[label][delta]
            prediction = (p_true >= p_false)
            ranks[instance][label] = p_true/(p_true+p_false)
            result[instance][label] = int(prediction)
    return ranks,result
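
The ranking value used here is the normalised posterior of each label. As a small standalone sketch for a single label, assuming the prior and conditional probabilities have already been estimated and smoothed so that the denominator is non-zero (the function name `mlknn_label_score` and all numbers are hypothetical):

```python
def mlknn_label_score(prior_true, cond_true, cond_false, delta):
    """Return (rank, prediction) for one label given `delta` positive neighbours.

    prior_true : P(label is present)
    cond_true  : cond_true[d]  = P(d neighbours have the label | label present)
    cond_false : cond_false[d] = P(d neighbours have the label | label absent)
    """
    p_true = prior_true * cond_true[delta]
    p_false = (1.0 - prior_true) * cond_false[delta]
    rank = p_true / (p_true + p_false)   # normalised posterior, between 0 and 1
    prediction = int(p_true >= p_false)  # same decision rule as in the method above
    return rank, prediction


# Made-up probabilities for k = 2 neighbours (delta can be 0, 1 or 2).
rank, prediction = mlknn_label_score(
    prior_true=0.5,
    cond_true=[0.1, 0.3, 0.6],
    cond_false=[0.6, 0.3, 0.1],
    delta=2,
)
```

Returning the score alongside the 0/1 decision lets callers sort labels by confidence instead of only thresholding them.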

Cheers

fix a problem with predict_proba in BR

Crashes when testing on some datasets, e.g. genbase.

Implement ML-c4.5

A. Clare, R.D. King, Knowledge discovery in multi-label phenotype data, in: Proceedings of the 5th European Conference on PKDD, 2001, pp. 42–53.

Multi-Label C4.5 (ML-C4.5) [11] is an adaptation of the well-known C4.5 algorithm for multi-label learning that allows multiple labels in the leaves of the tree. Clare et al. [11] modified the formula for calculating entropy (see Eq. (1)) to solve multi-label problems. The modified entropy sums the entropies of the individual class labels. The key property of ML-C4.5 is its computational efficiency:

$$\mathrm{entropy}(E) = -\sum_{i=1}^{N} \left( p(c_i)\log p(c_i) + q(c_i)\log q(c_i) \right)$$

where $E$ is the set of examples, $p(c_i)$ is the relative frequency of class label $c_i$, and $q(c_i) = 1 - p(c_i)$.
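
As a rough illustration of the modified entropy, the sketch below computes it for a small 0/1 label matrix. The function name is hypothetical, N is taken to be the number of label columns, and base-2 logarithms and the convention 0 · log 0 = 0 are assumptions, since the text above does not specify them.

```python
import math


def multilabel_entropy(label_rows):
    """Sum of per-label binary entropies over a set of examples.

    label_rows: list of 0/1 lists, one row per example, one column per class label.
    """
    n_examples = len(label_rows)
    n_labels = len(label_rows[0])
    total = 0.0
    for i in range(n_labels):
        p = sum(row[i] for row in label_rows) / n_examples  # relative frequency p(c_i)
        q = 1.0 - p                                          # q(c_i) = 1 - p(c_i)
        for prob in (p, q):
            if prob > 0.0:  # treat 0 * log(0) as 0
                total -= prob * math.log2(prob)
    return total


E = [[1, 0], [1, 1], [0, 0], [0, 1]]
value = multilabel_entropy(E)
```

Here both labels are present in exactly half of the examples, so each contributes one bit and the total is 2.0; a set in which every label is constant has entropy 0.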

Possible Indentation Error in the "fixed.py"

It seems I can't load the LabelSpacePartitioningClassifier. Python reports an IndentationError in the "fixed.py" file.

predict_proba for problem transformation

The problem transformation base classifiers are missing the predict_proba method. Is there any reason why that is? I'd propose to check the used classifier(s) if it/they have it, if so, use it, otherwise throw "not implemented". What's your opinion?
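
A minimal sketch of this proposal, assuming a hypothetical wrapper class rather than the library's actual base classes: forward the call if the wrapped classifier offers predict_proba, otherwise raise NotImplementedError.

```python
class ProbaDelegatingWrapper:
    """Illustrative wrapper: forwards predict_proba only if the base classifier has it."""

    def __init__(self, classifier):
        self.classifier = classifier

    def predict_proba(self, X):
        if not hasattr(self.classifier, "predict_proba"):
            raise NotImplementedError(
                "%s does not implement predict_proba" % type(self.classifier).__name__
            )
        return self.classifier.predict_proba(X)


class WithProba:
    """Toy classifier that exposes probabilities."""

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class WithoutProba:
    """Toy classifier without probability support."""


probs = ProbaDelegatingWrapper(WithProba()).predict_proba([[1], [2]])

try:
    ProbaDelegatingWrapper(WithoutProba()).predict_proba([[1]])
    raised = False
except NotImplementedError:
    raised = True
```

Checking at call time rather than in the constructor keeps the wrapper usable with classifiers that only need predict.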

Loading Arff Data not working

The first line from the official docs on loading datasets,
from skmultilearn.dataset import Dataset,
raises the error "ImportError: cannot import name 'Dataset'".

Implement BRkNN

Implement the simple Binary Relevance kNN (both versions) as per: http://link.springer.com/chapter/10.1007%2F978-3-540-87881-0_40

Preferably subclass the meta.BR classifier.

Implement Calibrated Label Ranking

profile classifiers to verify best sparsity approach to X

get_params and set_params for MLClassifierBase erroneous

Hi,

While writing the test case for cross validation I noticed that I missed something in my implementation of said functions, which gives me time to discuss some shortcomings of the code.

In the scikit-learn implementation, there is quite a lot of code to find out the attributes of each object for the get_params function, namely _get_param_names and signature. I think that for simple cases like the meta estimators we have here, this would just be overhead. Thus I'd advocate for a list of attributes to be copied, defined by the estimators themselves. Or at least MLClassifierBase could define the common ones, like copyable_attrs = ["require_dense", "classifier"], and, if needed, the inheriting classifiers would override this attribute with their own attributes, or extend it.

In get_params we can then run over the keys and copy them; if they are deep-copyable and deep=True, we deep-copy them too. We can also check the attribute names in set_params, providing another safety measure to prevent writing attributes to the object that weren't set in the first place. (This could result in some nasty bugs, I think.)
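
A rough sketch of what this proposal could look like. The class names are hypothetical, and `deep=True` here means "deep-copy the values" as described in this issue, which differs from scikit-learn, where `deep` controls whether parameters of nested estimators are included.

```python
import copy


class ParamsBase:
    """Hypothetical base class; `copyable_attrs` lists the parameters to expose."""

    copyable_attrs = ["classifier", "require_dense"]

    def __init__(self, classifier=None, require_dense=None):
        self.classifier = classifier
        self.require_dense = require_dense

    def get_params(self, deep=True):
        params = {}
        for name in self.copyable_attrs:
            value = getattr(self, name)
            params[name] = copy.deepcopy(value) if deep else value
        return params

    def set_params(self, **params):
        for name, value in params.items():
            # Reject names that were never declared, instead of silently adding them.
            if name not in self.copyable_attrs:
                raise ValueError("Invalid parameter %r for %s" % (name, type(self).__name__))
            setattr(self, name, value)
        return self


class WithLabelsetSize(ParamsBase):
    # A subclass extends the inherited list with its own attributes.
    copyable_attrs = ParamsBase.copyable_attrs + ["labelset_size"]

    def __init__(self, classifier=None, require_dense=None, labelset_size=3):
        super().__init__(classifier, require_dense)
        self.labelset_size = labelset_size


est = WithLabelsetSize(classifier="svc", require_dense=[False, True])
params = est.get_params()
est.set_params(labelset_size=5)

try:
    est.set_params(not_a_param=1)
    rejected = False
except ValueError:
    rejected = True
```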

What is your take on this?

PS: I'll commit the bug fixes. Additionally, I've removed get_params/set_params from the label powerset meta estimator. They were wrong anyway (missing self as the first argument).

Implement HOMER

Implement infrastructure for hierarchical classifiers and HOMER

return sparse matrices on predict* in mlaram

IOError: [Errno 2] No such file or directory: 'README.md'

hello:scikitlearn mukeshtiwari$ sudo pip install scikit-multilearn
Password:
The directory '/Users/mukeshtiwari/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/mukeshtiwari/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting scikit-multilearn
/Library/Python/2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading scikit-multilearn-0.0.1.tar.gz (32.5MB)
    100% |████████████████████████████████| 32.5MB 11kB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/tmp/pip-build-BguQD2/scikit-multilearn/setup.py", line 11, in <module>
        long_description=open('README.md').read(),
    IOError: [Errno 2] No such file or directory: 'README.md'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/tmp/pip-build-BguQD2/scikit-multilearn

Implement MLkNN

Implement MLkNN as per http://www.sciencedirect.com/science/article/pii/S0031320307000027

Cannot import BinaryRelevance

The following code, based on this section of the docs, does not run and the error is not clear (to me!):

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB

This code returns the following ImportError:

ImportError                               Traceback (most recent call last)
<ipython-input-29-6a0165cb8005> in <module>()
----> 1 from skmultilearn.problem_transform import BinaryRelevance
      2 from sklearn.naive_bayes import GaussianNB

D:\XXX\Anaconda2\lib\site-packages\skmultilearn\problem_transform\__init__.py in <module>()
      9 """
     10 
---> 11 from .br import BinaryRelevance
     12 from .cc import ClassifierChain
     13 from .lp import LabelPowerset

D:\XXX\Anaconda2\lib\site-packages\skmultilearn\problem_transform\br.py in <module>()
----> 1 from builtins import range
      2 from ..base.problem_transformation import ProblemTransformationBase
      3 from scipy.sparse import hstack, coo_matrix
      4 from sklearn.utils import check_array
      5 import copy

ImportError: No module named builtins

Any ideas of what the problem is?

Holm's post-hoc test

http://sci2s.ugr.es/sicidm/pdf/2011-Derrac-SWEVO.pdf

type_of_target off for labelpowerset and binary relevance strategies

Hi!

While trying to get support for k-fold cross validation (cv) I noticed that the scikit-learn classifiers/strategies use the target classes/shapes as given by train. I.e. my target_true might look like this: [array([0, 1, 0, 1, 1, 0, 0, 0])] and thus the scikit-learn strategies/classifiers return an np-array of targets (in this case 1-D matrix). cv checks for the same target type when trying to score the classifiers.

The problem now is that classifiers of scikit-multilearn return a list of vectors, where sklearn.utils.multiclass.type_of_target returns multiclass-multioutput, but the sklearn classifiers, as well as my targets are multilabel-indicator. Funny thing however is that when you'd print both, they both look like this: [array([0, 1, 0, 1, 1, 0, 0, 0])].

Thus, to support cv in conjunction with scikit-multilearn, we'd either need to alter the implementation of type_of_target (I might open an issue on scikit-learn to see whether this is expected behaviour) or cast the predicted targets by transforming the output to np.array.

Any input on this? I'd do a PR on this if I can get a solution to this.

Parallel knn in MLKNN and some other questions

Hello, I am using scikit-multilearn to run tests, and I was looking at
mlknn
I'm wondering whether I can set the n_jobs parameter in this line to speed up my experiments, since scikit-learn supports it, as in
NearestNeighbors

And besides, I am just using MLKNN like

knn = MLkNN(k=num)
scores = cross_val_score(
    knn, data, target, cv=10, n_jobs=10, scoring=score_name)
print(scores.mean())

and the program just works without crash, can MLKNN be used like this?

Implement SECC

http://link.springer.com/chapter/10.1007/978-3-642-38067-9_13

Implement QWML

E.L. Mencía, S.-H. Park, J. Fürnkranz
Efficient voting prediction for pairwise multilabel classification
Neurocomputing, 73 (2010), pp. 1164–1176

NameError when using GridSearchCV on Problem Transformed Classifier

File: base/problem_transformation.py

I was just fiddling around with GridSearchCV and caused the problem transformer to receive invalid parameters in a later set_params by the grid search submodule. As name is not defined in base.py and problem_transformation.py it results in a name error:

NameError: global name 'name' is not defined

My guess is that name should've been actually sub_obj_name or parameter. I changed it and could indeed produce the error as expected:

ValueError: Invalid parameter kernel for estimator BinaryRelevance(classifier=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
        require_dense=[True, True]). Check the list of available parameters with `estimator.get_params().keys()`.

Shaffer's static/dynamic post hoc test

http://sci2s.ugr.es/sicidm/pdf/2011-Derrac-SWEVO.pdf

Implement PCTs

H. Blockeel, L.D. Raedt, J. Ramon, Top-down induction of clustering trees, in: Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 55–63.

fix rakelO error when crossvalidating

================================================================================================= FAILURES ==================================================================================================
______________________________________________________________________________ RakelOTest.test_if_works_with_cross_validation _______________________________________________________________________________

self = <skmultilearn.ensemble.tests.test_rakelo.RakelOTest testMethod=test_if_works_with_cross_validation>

    def test_if_works_with_cross_validation(self):
        classifier = RakelO(classifier = self.get_labelpowerset_with_nb(), model_count = 20, labelset_size = 5)

>       self.assertClassifierWorksWithCV(classifier)

skmultilearn/ensemble/tests/test_rakelo.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
skmultilearn/tests/classifier_basetest.py:22: in assertClassifierWorksWithCV
    scores = cross_validation.cross_val_score(classifier, X, y, cv=cv, scoring='f1_macro')
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/cross_validation.py:1433: in cross_val_score
    for train, test in cv)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:804: in __call__
    while self.dispatch_one_batch(iterator):
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:662: in dispatch_one_batch
    self._dispatch(tasks)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:570: in _dispatch
    job = ImmediateComputeBatch(batch)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:183: in __init__
    self.results = batch()
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py:72: in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/cross_validation.py:1550: in _fit_and_score
    test_score = _score(estimator, X_test, y_test, scorer)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/cross_validation.py:1606: in _score
    score = scorer(estimator, X_test, y_test)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/metrics/scorer.py:90: in __call__
    **self._kwargs)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:639: in f1_score
    sample_weight=sample_weight)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:756: in fbeta_score
    sample_weight=sample_weight)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:956: in precision_recall_fscore_support
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:74: in _check_targets
    type_pred = type_of_target(y_pred)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

y = array(<30x5 sparse matrix of type '<type 'numpy.int64'>'
    with 150 stored elements in Compressed Sparse Column format>, dtype=object)

    def type_of_target(y):
        """Determine the type of data indicated by target `y`

        Parameters
        ----------
        y : array-like

        Returns
        -------
        target_type : string
            One of:
            * 'continuous': `y` is an array-like of floats that are not all
              integers, and is 1d or a column vector.
            * 'continuous-multioutput': `y` is a 2d array of floats that are
              not all integers, and both dimensions are of size > 1.
            * 'binary': `y` contains <= 2 discrete values and is 1d or a column
              vector.
            * 'multiclass': `y` contains more than two discrete values, is not a
              sequence of sequences, and is 1d or a column vector.
            * 'multiclass-multioutput': `y` is a 2d array that contains more
              than two discrete values, is not a sequence of sequences, and both
              dimensions are of size > 1.
            * 'multilabel-indicator': `y` is a label indicator matrix, an array
              of two dimensions with at least two columns, and at most 2 unique
              values.
            * 'unknown': `y` is array-like but none of the above, such as a 3d
              array, sequence of sequences, or an array of non-sequence objects.

        Examples
        --------
        >>> import numpy as np
        >>> type_of_target([0.1, 0.6])
        'continuous'
        >>> type_of_target([1, -1, -1, 1])
        'binary'
        >>> type_of_target(['a', 'b', 'a'])
        'binary'
        >>> type_of_target([1.0, 2.0])
        'binary'
        >>> type_of_target([1, 0, 2])
        'multiclass'
        >>> type_of_target([1.0, 0.0, 3.0])
        'multiclass'
        >>> type_of_target(['a', 'b', 'c'])
        'multiclass'
        >>> type_of_target(np.array([[1, 2], [3, 1]]))
        'multiclass-multioutput'
        >>> type_of_target([[1, 2]])
        'multiclass-multioutput'
        >>> type_of_target(np.array([[1.5, 2.0], [3.0, 1.6]]))
        'continuous-multioutput'
        >>> type_of_target(np.array([[0, 1], [1, 1]]))
        'multilabel-indicator'
        """
        valid = ((isinstance(y, (Sequence, spmatrix)) or hasattr(y, '__array__'))
                 and not isinstance(y, string_types))

        if not valid:
            raise ValueError('Expected array-like (array or non-string sequence), '
                             'got %r' % y)

        if is_multilabel(y):
            return 'multilabel-indicator'

        try:
            y = np.asarray(y)
        except ValueError:
            # Known to fail in numpy 1.3 for array of arrays
            return 'unknown'

        # The old sequence of sequences format
        try:
            if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
                    and not isinstance(y[0], string_types)):
                raise ValueError('You appear to be using a legacy multi-label data'
                                 ' representation. Sequence of sequences are no'
                                 ' longer supported; use a binary array or sparse'
                                 ' matrix instead.')
        except IndexError:
            pass

        # Invalid inputs
>       if y.ndim > 2 or (y.dtype == object and len(y) and
                          not isinstance(y.flat[0], string_types)):
E                         TypeError: len() of unsized object

../../.virtualenvs/work/local/lib/python2.7/site-packages/sklearn/utils/multiclass.py:259: TypeError

write tests for the cloning functionality in mlcb/ptb

Use Christian's tests from notebook

some question about BinaryRelevance using svm(one classes)

Hello, I am using BR to train a multi-label classifier, and my target is

[[ 1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.]
 [ 1.  1.  0.  0.  0.  1.]
 [ 0.  0.  0.  0.  1.  0.]
 [ 1.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 1.  1.  1.  0.  0.  0.]]

we can see that the fourth column is all zero, so the number of classes is one, and I get the following crash:

Traceback (most recent call last):
  File "multi_model.py", line 157, in <module>
    train_svc_manual()
  File "multi_model.py", line 98, in train_svc_manual
    clf.fit(X_train, y_train)
  File "/usr/local/lib/python2.7/dist-packages/skmultilearn/problem_transform/br.py", line 71, in fit
    X), self.ensure_output_format(y_subset))
  File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 152, in fit
    y = self._validate_targets(y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.py", line 526, in _validate_targets
    % len(cls))
ValueError: The number of classes has to be greater than one; got 1

Should BR handle this case?
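
One conceivable way to handle it, sketched in pure Python with hypothetical names (this is not what the library does): before fitting the base classifier for a label, check whether that label column contains a single value and, if so, substitute a constant predictor.

```python
class ConstantPredictor:
    """Hypothetical stand-in used when a label column contains a single value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


def fit_label_column(make_classifier, X, column):
    """Fit one per-label model, falling back to a constant for one-class columns."""
    distinct = set(column)
    if len(distinct) == 1:
        return ConstantPredictor(distinct.pop())
    return make_classifier().fit(X, column)


class RefusesSingleClass:
    """Toy base learner that, like SVC, rejects targets with only one class."""

    def fit(self, X, y):
        if len(set(y)) < 2:
            raise ValueError("The number of classes has to be greater than one")
        self.first_ = y[0]
        return self

    def predict(self, X):
        return [self.first_ for _ in X]


X = [[0.0], [1.0], [2.0]]
Y = [[1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # the second label is always 0
columns = [[row[j] for row in Y] for j in range(len(Y[0]))]
models = [fit_label_column(RefusesSingleClass, X, col) for col in columns]
predictions = [m.predict([[3.0]])[0] for m in models]
```

The alternative is to drop constant label columns from the training data and re-insert them after prediction, which keeps the classifier itself unchanged.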

Implement basic classifiers

Make sure we have:

binary relevance
label powerset
rakel's

Bergman-Hommel

http://sci2s.ugr.es/sicidm/pdf/2011-Derrac-SWEVO.pdf

Error in the Example of hyperparameter tuning

I ran the test code from the documentation page http://scikit.ml/api/model_estimation.html#model-estimation on hyper-parameter tuning and got an error in the ensemble classifier case.
------------ Code ---------------
from skmultilearn.ensemble.rakeld import RakelD
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset
# needed for make_multilabel_classification below
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

x, y = make_multilabel_classification(sparse=True, n_labels=5,
    return_indicator='sparse', allow_unlabeled=False)

parameters = {
    'labelset_size': range(2, 3),
    'classifier': [LabelPowerset(), BinaryRelevance()],
    'classifier__classifier': [MultinomialNB()],
    'classifier__classifier__alpha': [0.7, 1.0],
}

clf = GridSearchCV(RakelD(), parameters, scoring='f1_macro')
clf.fit(x, y)
-------------Error -------------------------------
ValueError: Found input variables with inconsistent numbers of samples: [66, 1]

Friedman Aligned Ranks

http://sci2s.ugr.es/sicidm/pdf/2011-Derrac-SWEVO.pdf

MemoryError when running CrossValTesting

Hi Piotr. I was trying to run one of the tests you include in the project (CrossValTesting).

You can view it on this link

I even bought more RAM for my PC because I thought I had too little, but when I run this only about 3.5GB of my RAM is used, and I have 8GB of RAM in total, so I don't think that's the cause.

Did this happen to you as well? Do you know what may be causing it? Thanks!

MLClassifierBase - > ProblemTransformationBase

The current version of MLClassifierBase is modeled on Problem Transformation approaches, while there is no need for require_dense in ensembles or algorithm adaptation approaches. Split MLCB into MLCB and PTB.

scikit-learn BaseEstimator compliance

Hey there,

Background: methods like grid search and k-fold cross validation need an estimator to comply with the API stated here. The problem is that MLClassifierBase lacks the get_params and set_params methods that allow cloning of estimators, as used by grid search and the cross_validation module (see here).

The solution would either be to inherit BaseEstimator or to implement these mentioned methods. I patched them on my distribution because I use cross validation and depend on it. If you're up to it I'd provide a PR and you could review it if interested.
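
To illustrate why these two methods matter for cloning, here is a deliberately simplified sketch. `simple_clone` and `TinyEstimator` are hypothetical; scikit-learn's real clone function does more, such as handling nested estimators and running sanity checks.

```python
def simple_clone(estimator):
    """Simplified idea of cloning: rebuild an unfitted estimator from its parameters."""
    return type(estimator)(**estimator.get_params(deep=False))


class TinyEstimator:
    def __init__(self, k=3):
        self.k = k

    def get_params(self, deep=True):
        return {"k": self.k}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        self.fitted_ = True  # state learned during fit must not survive cloning
        return self


original = TinyEstimator(k=7).fit([[0]], [0])
copy_ = simple_clone(original)
```

Cross-validation and grid search rely on this kind of copy to train a fresh model per fold or parameter combination, which is why an estimator without get_params cannot be used with them.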

Cheers!

Implement a general ensemble of classifiers classifier

J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, in: Proceedings of the 20th European Conference on Machine Learning, 2009, pp. 254–269.

Ensembles of classifier chains (ECC) [16] are an ensemble multi-label classification technique that uses classifier chains as a base classifier. ECC trains m CC classifiers C_1, C_2, …, C_m. Each C_k is trained with a random chain ordering (of ℒ) and a random subset of 𝒳. Hence each C_k model is likely to be unique and able to give different multi-label predictions. These predictions are summed per label so that each label receives a number of votes. A threshold is used to select the most popular labels, which form the final predicted multi-label set.
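
The voting step described above can be sketched as follows. This covers only the aggregation, not the training of the chains; the function name is hypothetical, and normalising the votes by the number of chains with a default threshold of 0.5 is an assumption made for the example.

```python
def ecc_vote(chain_predictions, threshold=0.5):
    """Combine 0/1 predictions from m chains for one instance.

    chain_predictions: list of m lists, each with one 0/1 entry per label.
    Returns 1 for every label whose share of votes reaches `threshold`.
    """
    m = len(chain_predictions)
    n_labels = len(chain_predictions[0])
    votes = [sum(pred[j] for pred in chain_predictions) for j in range(n_labels)]
    return [1 if votes[j] / m >= threshold else 0 for j in range(n_labels)]


# Three hypothetical chains voting on four labels for a single instance.
chains = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
final = ecc_vote(chains, threshold=0.5)
```

With these made-up predictions the labels receive 3, 1, 2 and 0 votes, so only the first and third label pass the threshold.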

ImportError: No module named 'sphinx_pypi_upload' on installing from github

Hi. I've tried to install scikit-multilearn directly from github like this, and got this error:

(venv3)felipe@felipe-XPS-8300:~/auto-tagger$ pip install git+https://github.com/queirozfcom/scikit-multilearn.git
Collecting git+https://github.com/queirozfcom/scikit-multilearn.git
  Cloning https://github.com/queirozfcom/scikit-multilearn.git to /tmp/pip-agw71wfn-build
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-agw71wfn-build/setup.py", line 3, in <module>
        import sphinx_pypi_upload
    ImportError: No module named 'sphinx_pypi_upload'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-agw71wfn-build

Any thoughts? By the way, I'm using Python 3. Should I use Python 2?

tests for meka

write tests for meka wrapper

Implement non-overlapping non-hierarchical MLG

Implement non-overlapping non-hierarchical MLG as per http://link.springer.com/chapter/10.1007%2F978-3-642-40846-5_43

Quade test

Cannot install whatsoever

Hey there,

Neither pip install $tarball, pip install scikit-multilearn or running python setup.py build && python setup.py install in the cloned repo (with the fix suggested in #21) does the magic. So now I'm moving the skmultilearn folder to my project. I'd say there is currently no way to install this project properly. Any suggestions/input?

Implement PCC

Python 2 vs Python 3

What are the plans regarding python 2 vs python 3? Stick with python 2 for the time being?

adapt/lazy not importable in pip-installed skmultilearn

I am not able to import anything from the submodules lazy (or nowadays adapt):

/u/l/l/p/s/skmultilearn >>>  (master ↩) ls
__init__.py       base.py           dataset.py        ext               utils.pyc
__init__.pyc      base.pyc          dataset.pyc       problem_transform
base              cluster           ensemble          utils.py

Meka example not working

I am running OSX and meka 1.9.0 and tried out the example from the docs, posted below:

from sklearn.datasets import make_multilabel_classification
from sklearn.cross_validation import train_test_split
from sklearn.metrics import hamming_loss
from skmultilearn.ext import Meka


X, y = make_multilabel_classification(sparse = True,
    return_indicator = 'sparse')

X_train, X_test, y_train, y_test = train_test_split(X,
    y,
    test_size=0.33)

meka = Meka(
    meka_classifier = "meka.classifiers.multilabel.LC",
    weka_classifier = "weka.classifiers.bayes.NaiveBayes",
    meka_classpath = "/Users/user/Downloads/meka-release-1.9.0/lib/",
    java_command = '/usr/bin/java')

meka.fit(X_train, y_train)

predictions = meka.predict(X_test)

hamming_loss(y_test, predictions)

But rather than getting a response, I got the following error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-8-d38086ff8db1> in <module>()
     18     java_command = '/usr/bin/java')
     19 
---> 20 meka.fit(X_train, y_train)
     21 
     22 predictions = meka.predict(X_test)

/Users/user/anaconda/lib/python2.7/site-packages/skmultilearn/ext/meka.pyc in fit(self, X, y)
    122             ]
    123 
--> 124             self.run_meka_command(input_args)
    125 
    126             self.classifier_dump = None

/Users/user/anaconda/lib/python2.7/site-packages/skmultilearn/ext/meka.pyc in run_meka_command(self, args)
     96 
     97         if pipes.returncode != 0:
---> 98             raise Exception, self.output
     99 
    100     def fit(self, X, y):

Exception: 

Evaluation Options:

-h
    Output help information.
-t <name of training file>
    Sets training file.
-T <name of test file>
    Sets test file.
-x <number of folds>
    Do cross-validation with this many folds.
-R
    Randomize the order of instances in the dataset.
-split-percentage <percentage>
    Sets the percentage for the train/test set split, e.g., 66.
-split-number <number>
    Sets the number of training examples, e.g., 800
-i
    Invert the specified train/test split.
-s <random number seed>
    Sets random number seed (use with -R, for different CV or train/test splits).
-threshold <threshold>
    Sets the type of thresholding; where
        'PCut1' automatically calibrates a threshold (the default);
        'PCutL' automatically calibrates one threshold for each label;
        any number, e.g. '0.5', specifies that threshold.
-C <number of labels>
    Sets the number of target variables (labels) to assume (indexed from the beginning).
-d <classifier_file>
    Specify a file to dump classifier into.
-l <classifier_file>
    Specify a file to load classifier from.
-verbosity <verbosity level>
    Specify more/less evaluation output


Classifier Options:

-W
    Full name of base classifier.
    (default: weka.classifiers.trees.J48)
-output-debug-info
    If set, classifier is run in debug mode and
    may output additional info to the console
--do-not-check-capabilities
    If set, classifier capabilities are not checked before classifier is built
    (use with caution).
-

-K
    Use kernel density estimator rather than normal
    distribution for numeric attributes
-D
    Use supervised discretization to process numeric attributes

-O
    Display model in old format (good when there are many classes)

-output-debug-info
    If set, classifier is run in debug mode and
    may output additional info to the console
--do-not-check-capabilities
    If set, classifier capabilities are not checked before classifier is built
    (use with caution).

Any ideas?

status

Is there a to-do list or a more general status for the individual classifiers? For example, the classifier chain ensembles are not importable, which probably means that they are not functional yet. Have you done any work preparing for the models listed in the issues?

find a way to make apidocs per module and not per subsubmodules work

Currently, the API docs are generated for each sub-submodule instead of for the submodule. For example, they are generated for skmultilearn.base.base instead of skmultilearn.base.

I can't find a solution for this, but scikit-learn manages to do it. Ideally there would be a nice skmultilearn.cluster documentation with classes linked to cluster.Class and not to cluster.submodule.Class.

label powerset broken

scikit-multilearn/skmultilearn/meta/lp.py in fit(self, X, y)
     24         last_id = 0
     25         train_vector    = []
---> 26         for labels_applied in y_lil.rows:
     27             label_string = ",".join(map(str,labels_applied))
     28 

NameError: global name 'y_lil' is not defined
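
For context, the loop in the traceback appears to map each distinct label combination to a single class id, which is the core of the label powerset transformation. The following is a minimal standalone sketch of that step, not the library's actual fix. The function name label_powerset_transform and the variable combo_to_id are hypothetical; rows is assumed to hold, per sample, the list of active label indices, which is the shape that the rows attribute of a scipy.sparse.lil_matrix provides.

```python
def label_powerset_transform(rows):
    """Map each distinct label combination to one multi-class id.

    rows: one list of active label indices per sample (assumed input shape).
    Returns the per-sample class ids and the combination-to-id mapping.
    """
    combo_to_id = {}      # hypothetical name: label string -> class id
    train_vector = []
    for labels_applied in rows:
        # Same key construction as in the traceback above.
        label_string = ",".join(map(str, labels_applied))
        if label_string not in combo_to_id:
            combo_to_id[label_string] = len(combo_to_id)
        train_vector.append(combo_to_id[label_string])
    return train_vector, combo_to_id


rows = [[0, 2], [1], [0, 2], []]
train_vector, combo_to_id = label_powerset_transform(rows)
print(train_vector)
```

If y is a SciPy sparse matrix, the NameError suggests that a conversion such as y_lil = lil_matrix(y) is missing before the loop, but that is a guess from the traceback alone.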

Installation fails

Hi,
when I clone the repo and run python setup.py install I obtain the following error:
C:\Anaconda\lib\distutils\dist.py:267: UserWarning: Unknown distribution option:
'include_package_data'
warnings.warn(msg)
running install
running build
running build_py
error: package directory 'data' does not exist

kind regards

Friedman/nemenyi tests

Friedman and Nemenyi tests, per Demšar's work.
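
As a rough illustration of what this request involves, here is a pure-Python sketch of the Friedman chi-square statistic and the Nemenyi critical difference computed from average ranks, following the formulas commonly attributed to Demšar's 2006 comparison paper. The function names and the score table are made up for illustration, higher scores are assumed to be better, and the critical value q_alpha must be supplied by the caller (2.343 is the tabulated value for three classifiers at alpha = 0.05). No p-value is computed here.

```python
def average_ranks(scores):
    """scores: N datasets x k classifiers, higher is better. Ties share ranks."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend the block while the next score ties with the current one.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2.0 + 1  # mean of ranks i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += shared
            i = j + 1
    return [total / n for total in totals]


def friedman_chi_square(avg_ranks, n_datasets):
    """Friedman statistic: 12N / (k(k+1)) * (sum R_j^2 - k(k+1)^2 / 4)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def nemenyi_critical_difference(q_alpha, k, n_datasets):
    """Two classifiers differ if their average ranks differ by at least this."""
    return q_alpha * (k * (k + 1) / (6.0 * n_datasets)) ** 0.5


# Made-up scores: 4 datasets (rows) x 3 classifiers (columns).
scores = [[0.90, 0.80, 0.70],
          [0.85, 0.80, 0.60],
          [0.95, 0.70, 0.75],
          [0.90, 0.85, 0.80]]
ranks = average_ranks(scores)
chi2 = friedman_chi_square(ranks, len(scores))
cd = nemenyi_critical_difference(2.343, 3, len(scores))
print(ranks, chi2, cd)
```

In practice the statistic would be compared against a chi-square distribution with k - 1 degrees of freedom, and the Nemenyi post-hoc test is only applied when the Friedman test rejects the null hypothesis.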

Implement classifier chains

See: Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank. Classifier Chains for Multi-label Classification. Machine Learning Journal. Springer. Vol. 85(3), pp 333-359. (May 2011).
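
To make the request concrete, here is a minimal sketch of the chaining idea from that paper: the classifier for label j is trained on the features plus the true values of labels 0..j-1, and at prediction time the earlier predictions are fed forward in place of the true labels. This is not scikit-multilearn code. The class names SimpleClassifierChain and OneNearestNeighbour are hypothetical, the base learner is a toy 1-nearest-neighbour written only to keep the sketch self-contained, and the label order is fixed to the column order.

```python
class OneNearestNeighbour:
    """Toy base learner, an assumption for this sketch only."""

    def fit(self, X, y):
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            distances = [sum((a - b) ** 2 for a, b in zip(row, seen))
                         for seen in self.X_]
            predictions.append(self.y_[distances.index(min(distances))])
        return predictions


class SimpleClassifierChain:
    """One binary model per label; each sees the labels earlier in the chain."""

    def __init__(self, make_base):
        self.make_base = make_base  # callable returning a fresh base learner

    def fit(self, X, Y):
        self.models_ = []
        for j in range(len(Y[0])):
            # Features augmented with the true values of labels 0..j-1.
            X_aug = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            target = [y[j] for y in Y]
            self.models_.append(self.make_base().fit(X_aug, target))
        return self

    def predict(self, X):
        chains = [[] for _ in X]
        for model in self.models_:
            # Features augmented with the labels predicted so far.
            X_aug = [list(x) + so_far for x, so_far in zip(X, chains)]
            for so_far, label in zip(chains, model.predict(X_aug)):
                so_far.append(label)
        return chains


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
chain = SimpleClassifierChain(OneNearestNeighbour).fit(X, Y)
predicted = chain.predict(X)
print(predicted)
```

Any object with fit and predict methods could be passed as the base learner factory, which is the same design that lets scikit-learn estimators be plugged into problem-transformation methods.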
