garydoranjr / misvm Goto Github PK

View Code? Open in Web Editor NEW

230.0 10.0 81.0 183 KB

Multiple-Instance Support Vector Machines

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

misvm's Introduction

MISVM: Multiple-Instance Support Vector Machines

by Gary Doran ([email protected])

Overview

MISVM contains a Python implementation of numerous support vector machine (SVM) algorithms for the multiple-instance (MI) learning framework. The implementations were created for use in the following publication:

Doran, Gary and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. To appear in Machine Learning Journal. 2013.

Installation

This package can be installed in two ways (the easy way):

# If needed:
# pip install numpy
# pip install scipy
# pip install cvxopt
pip install -e git+https://github.com/garydoranjr/misvm.git#egg=misvm

or by running the setup file manually

git clone [the url for misvm]
cd misvm
python setup.py install

Note the code depends on the numpy, scipy, and cvxopt packages. So have those installed first. The build will likely fail if it can't find them. For more information, see:

NumPy: Library for efficient matrix math in Python
SciPy: Library for more MATLAB-like functionality
CVXOPT: Efficient convex (including quadratic program) optimization

The MISVM package currently implements the following algorithms:

SIL

Single-Instance Learning (SIL) is a "naive" approach that assigns each instance the label of its bag, creating a supervised learning problem but mislabeling negative instances in positive bags. It works surprisingly well for many problems.

Ray, Soumya, and Mark Craven. Supervised versus multiple instance learning: an empirical comparison. Proceedings of the 22nd International Conference on Machine Learning. 2005.

MI-SVM and mi-SVM

These approaches modify the standard SVM formulation so that the constraints on instance labels correspond to the MI assumption that at least one instance in each bag is positive. For more information, see:

Andrews, Stuart, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems. 2002.

NSK and STK

The normalized set kernel (NSK) and statistics kernel (STK) approaches use kernels to map entire bags into a features, then use the standard SVM formulation to find bag classifiers:

Gärtner, Thomas, Peter A. Flach, Adam Kowalczyk, and Alex J. Smola. Multi-instance kernels. Proceedings of the 19th International Conference on Machine Learning. 2002.

MissSVM

MissSVM uses a semi-supervised learning approach, treating the instances in positive bags as unlabeled data:

Zhou, Zhi-Hua, and Jun-Ming Xu. On the relation between multi-instance learning and semi-supervised learning. Proceedings of the 24th International Conference on Machine Learning. 2007.

MICA

The "multiple-instance classification algorithm" (MICA) represents each bag using a convex combinations of its instances. The optimization program is then solved by iteratively solving a series of linear programs. In our formulation, we use L2 regularization, so we solve alternating linear and quadratic programs. For more information on the original algorithm, see:

Mangasarian, Olvi L., and Edward W. Wild. Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications 137.3 (2008): 555-568.

sMIL, stMIL, and sbMIL

This family of approaches intentionally bias SVM formulations to handle the assumption that there are very few positive instances in each positive bag. In the case of sbMIL, prior knowledge on the "sparsity" of positive bags can be specified or found via cross-validation:

Bunescu, Razvan C., and Raymond J. Mooney. Multiple instance learning for sparse positive bags. Proceedings of the 24th International Conference on Machine Learning. 2007.

How to Use

The classifier implementations are loosely based on those found in the scikit-learn library. First, construct a classifier with the desired parameters:

>>> import misvm
>>> classifier = misvm.MISVM(kernel='linear', C=1.0, max_iters=50)

Use Python's help functionality as in help(misvm.MISVM) or read the documentation in the code to see which arguments each classifier takes. Then, call the fit function with some data:

>>> classifier.fit(bags, labels)

Here, the bags argument is a list of "array-like" (could be NumPy arrays, or a list of lists) objects representing each bag. Each (array-like) bag has m rows and f columns, which correspond to m instances, each with f features. Of course, m can be different across bags, but f must be the same. Then labels is an array-like object containing a label corresponding to each bag. Each label must be either +1 or -1. You will likely get strange results if you try using 0/1-valued labels. After training the classifier, you can call the predict function as:

>>> labels = classifier.predict(bags)

Here bags has the same format as for fit, and the function returns an array of real-valued predictions (use numpy.sign(labels) to get -1/+1 class predictions).

In order to get instance-level predictions from a classifier, use the instancePrediction flag, as in:

>>> bag_labels, instance_labels = classifier.predict(bags, instancePrediction=True)

The instancePrediction flag is not available for bag-level classifiers such as the NSK. However, you can always predict the labels of "singleton" bags containing a single instance to assign a label to that instance. In this case, one should use caution in interpreting the label of an instance produced by a bag-level classifier, since these classifiers are designed to make predictions based on properties of an entire bag.

An example script is included that trains classifiers on the musk1 dataset; see:

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Install the package or add the misvm directory to the PYTHONPATH environment variable before attempting to run the example using python example.py within the example directory.

Questions and Issues

If you find any bugs or have any questions about this code, please create an issue on GitHub, or contact Gary Doran at [email protected]. Of course, I cannot guarantee any support for this software.

misvm's People

Contributors

Stargazers

Watchers

Forkers

wqren skarnik-rmn orthogonal twistedmove plsang sathappanspm hbredin xypan1232 mikeseven wshenx andreas-koukorinis forestliurui giffy1 bhanu-pilani guojiangwei2 yukehit tuzihao criminalking centiment-io samuel1208 satpreetsingh bobye z01nl1o02 lynzhangyl dipanshawucr zjjj ghostintheshellarise annamalai-nr mlnjsh ludai0106 vibhormehta myouesfi jnothman actank evildj xuecong ngonthier kheffah afcarl maojingyi reckhhh dosea cyanph gtenren swarmchaos alexfrontxq nwschurink lvkd84 alpapado ammarkamoona zbodalal romaincendre johnvorsten sophiajia fireofearth fagan2888 dillonkn xyyue574 zhang-shui-shui gitouyou shsh88 saqibmamoon zfxiaobai georgezywang gregoryverghese aamin20 1996lixingyu1996 inkiinki schen1618 qiming-huang sunruizhe188 buptpriswang tangj1905 andreped levaithanjealous bzp92 anilgavade humaperveen antoniolmcandido fouada

misvm's Issues

Attribute Error running example code

Hey there,
I was happy to see this extensive work on Multiple Instance Learning, but I ran into an error running your example code. Could it be that I installed it or its dependencies incorrectly? The problem is in cvxopt. Am I not using the correct version?

pcost dcost gap pres dres
0: -4.7135e+01 -1.9465e+00 3e+03 5e+01 4e-09
Traceback (most recent call last):

...

File "/Users/snoran/misvm_example/example.py", line 38, in main
classifier.fit(train_bags, train_labels)

File "/Users/snoran/cvxopt-1.1.8/src/misvm/misvm/sil.py", line 44, in fit
super(SIL, self).fit(svm_X, svm_y)

File "/Users/snoran/cvxopt-1.1.8/src/misvm/misvm/svm.py", line 67, in fit
self.verbose)

File "/Users/snoran/cvxopt-1.1.8/src/misvm/misvm/quadprog.py", line 105, in quadprog
return qp.solve(verbose)

File "/Users/snoran/cvxopt-1.1.8/src/misvm/misvm/quadprog.py", line 65, in solve
initvals=self.last_results)

File "build/bdist.macosx-10.5-x86_64/egg/cvxopt/coneprog.py", line 4468, in qp
return coneqp(P, q, G, h, None, A, b, initvals, options = options)

File "build/bdist.macosx-10.5-x86_64/egg/cvxopt/coneprog.py", line 2243, in coneqp
if iters == 0: W = misc.compute_scaling(s, z, lmbda, dims)

File "build/bdist.macosx-10.5-x86_64/egg/cvxopt/misc.py", line 285, in compute_scaling
W['d'] = base.sqrt( base.div( s[mnl:mnl+m], z[mnl:mnl+m] ))

AttributeError: 'module' object has no attribute 'div'

Prediction scores instead of labels

Hello, is there a way to obtain scores of the predictions instead of labels? As you may guess, I am interested in computing AUC, rather than accuracy or similar metric.
When I create the classifier object with MISVM, I can see a score(X,y) function. However, when I pass the test bags as X and test bag labels as y, it gives me error.

ValueError: Classification metrics can't handle a mix of binary and continuous targets

Getting a warning for numerical instability and algorithm is unable to optimize

Getting below Warning when the data size is more than 200 rows and 25 columns
Warning: numerical instability (primal simplex, phase I)

And the optimization job does not complete

Instance level predictions

May I know if MISVM supports instance-level predictions?

the example does not run on my machine

Hi, first of all thank you for the implementation of all these versions of multiple instance learning !
I used conda with python3.6 and I installed the required libraries.
However when I executed the file "example.py"

$ python example/example.py

I obtained the following message:

Traceback (most recent call last):
File "./example/example.py", line 49, in
main()
File "./example/example.py", line 40, in main
classifier.fit(train_bags, train_labels)
File "/home/piero/miniconda3/lib/python3.6/site-packages/misvm-1.0-py3.6.egg/misvm/misssvm.py", line 57, in fit
File "/home/piero/miniconda3/lib/python3.6/site-packages/misvm-1.0-py3.6.egg/misvm/util.py", line 61, in getattr
File "/home/piero/miniconda3/lib/python3.6/site-packages/numpy/core/shape_base.py", line 237, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: need at least one array to concatenate

Can you suggest me what is the mistake I made?
Thanks
Piero

same sign for all instance-level prediction

I generated some synthetic data using the 20newsgroup to run experiment on mi-SVM and MI-SVM. I noticed that, if I predict the labels in instance-level (actually that happens for me on bag-level too), all predictions share the same sign (either positive or negative, depending on the data). The AUC looks good, which means the ranking is correct. I guess this might due to some issues caused by library version, or not? Does anyone come across the same issue? Or, which versions of libraries should we use? Thanks!

Example Code not Working

I get this error when i try to run your example code
for algorithm, classifier in classifiers.items(): classifier.fit(train_bags, train_labels)

`ValueError Traceback (most recent call last)
in ()
26 accuracies = {}
27 for algorithm, classifier in classifiers.items():
---> 28 classifier.fit(train_bags, train_labels)
29 predictions = classifier.predict(test_bags)
30 accuracies[algorithm] = np.average(test_labels == np.sign(predictions))

~/misvm/src/misvm/misvm/misssvm.py in fit(self, bags, y)
55 bs.pos_instances,
56 bs.pos_instances,
---> 57 bs.neg_instances])
58 self._y = np.vstack([np.matrix(np.ones((bs.X_p + bs.L_p, 1))),
59 -np.matrix(np.ones((bs.L_p + bs.L_n, 1)))])

~/misvm/src/misvm/misvm/util.py in getattr(self, name)
59 return self.neg_bags
60 elif name == 'neg_instances':
---> 61 self.neg_instances = np.vstack(self.neg_bags)
62 return self.neg_instances
63 elif name == 'pos_instances':

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/numpy/core/shape_base.py in vstack(tup)
235
236 """
--> 237 return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
238
239 def hstack(tup):

ValueError: need at least one array to concatenate `

This is problem with MissSVM on example code!

When i try other classifiers , I get different errors, for example sbMIL
3 clf = misvm.sbMIL(kernel='linear', eta=0.1, C=1.0)
----> 4 clf.fit(train_bags, train_labels)
5 predictions = clf.predict(test_bags)
6 print (np.average(test_labels == np.sign(predictions)))

~/misvm/src/misvm/misvm/sbmil.py in fit(self, bags, y)
54 scale_C=self.scale_C, verbose=self.verbose,
55 sv_cutoff=self.sv_cutoff)
---> 56 initial_classifier.fit(bags, y)
57 if self.verbose:
58 print('Computing initial instance labels for sbMIL...')

~/misvm/src/misvm/misvm/smil.py in fit(self, bags, y)
49 if self.scale_C:
50 iC = float(self.C) / bs.L_n
---> 51 bC = float(self.C) / bs.X_p
52 else:
53 iC = self.C

ZeroDivisionError: float division by zero

SIL works as expected. However, for MISVM:

clf = misvm.MISVM(kernel='linear', C=1.0, max_iters=50) clf.fit(train_bags, [-1,1])

Gives me fitting and output. How does this work? There are 82 train bags and only 2 labels, why is there no "dimension mismatch" error as when i try this with SIL?

memory error on medium scale

Hi,
I am trying to run missSVM and MICA but I am getting memory error. The total number of instances across all training bags is about 120,000 with dimensions 100. Is there a way to get it running on a 16Gb RAM computer ?

Help -- ValueError : Rank(A) < p or Rank([P; A; G]) < n

Hi, I'm trying to use the algorithm but I get this error :

ValueErrorTraceback (most recent call last)
in ()
1
----> 2 classifier.fit(train_bags, train_labels)

3 frames
/content/src/misvm/misvm/quadprog.py in solve(self, verbose)
75 self._ensure_pd(eps)
76 else:
---> 77 raise e
78
79 _apply_options(old_settings)

ValueError: Rank(A) < p or Rank([P; A; G]) < n

Can you help me solve this ? thank you in advance for any help

Class weights

Hello,

I have an issue with the distribution of classes. I have 4000 negative bags and 500 positive bags and it always predicts as negative. I would like to add the class_weight parameter, but the parameters for the mi-SVM are limited.

Thanks!

MissSVM result

According to the released code, however, the accuracy of MissSVM is only 40%, I have read the original paper, but have not found any difference between code and paper. So, what do you think the reason why the accuracy is only 40%? Thank you!!

"get_params" of miSVM

Hi, thank you for your great work and enjoying your program.

In l298 of "mi_svm.py",
I think
super_args = super(MISVM, self).get_params()
should be
super_args = super(miSVM, self).get_params().
Could you check this part and please modify if it is necessary.

Thank you.

Multi label implement

I wonder how to realize multi label classification task using this binary classification MISVM.

NaN predicted values

The classifier.predict() method outputs NaN values for some test bags (in case of sbMIL classifier that I checked on my dataset). Could you please advise why it would predict NaN values and how to resolve it?

Problem: Rank(A) < p or Rank([P; A; G]) < n

Here is the problem I got:

Traceback (most recent call last):
File "/Applications/PyCharm Edu.app/Contents/helpers/pydev/pydevd.py", line 1599, in
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm Edu.app/Contents/helpers/pydev/pydevd.py", line 1026, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/chilab/PycharmProjects/MILSVM/mis_stress.py", line 87, in
classifier.fit(X_train, y_train)
File "/Users/chilab/src/misvm/misvm/sil.py", line 45, in fit
super(SIL, self).fit(svm_X, svm_y)
File "/Users/chilab/src/misvm/misvm/svm.py", line 68, in fit
self.verbose)
File "/Users/chilab/src/misvm/misvm/quadprog.py", line 106, in quadprog
return qp.solve(verbose)
File "/Users/chilab/src/misvm/misvm/quadprog.py", line 77, in solve
raise e
ValueError: Rank(A) < p or Rank([P; A; G]) < n

Thx!

Different results in python 2.7.8 and python 3.6.8

Hello,
I have noticed a considerable difference on my results when running the exact same code in python 2.7.8 and python 3.6.8.
I have been considering results as correct when running on python 3.6.8 taking into account that the support/test for the used libraries is better for python 3.6.8 but I would like to ask whether you have fully tested your implementation for python 3.6 or i should consider python 2.7.
Does anyone come across the same issue? Or, are there specific versions of libraries that I should use?
Thanks in advance,
Redona

Overflow- INT_MAX reached

Hi everyone,
I'm actually running some of my work on your library ( pretty weel designed by the way), and I'm facing some issue during fit step.
I got arround 100 000 instances split on 5000 bags.

Problem is encountered in quadprog.convert() method at line P = cvxmat(H), where H is a sample*sample matrix. I got some matrix that contains arround 10 000 000 000 values, and by the ways exceed the authorized number of values (limited to INT_MAX values).

Do someone have some clues on how to solve that?
Best regards