drusk / pml
Simple interface to Python machine learning algorithms.
License: MIT License
When plotting the dataset in test/datasets/3f_ids_header.csv with radviz there is an error:
pml:5> data = load("3f_ids_header.csv")
ValueError                                Traceback (most recent call last)
/home/drusk/workspace_pml/PythonMachineLearning/shell/shell.py in <module>()
----> 1 plot_radviz(data)

/home/drusk/workspace_pml/PythonMachineLearning/pml/plotting.py in plot_radviz(dataset)
     34     # contains class membership info.
     35     # therefore need to pass in the dataset's merged data and labels
---> 36     radviz(dataset.get_data_frame(), dataset.get_labels().name)
     37     plt.show()

/usr/local/lib/python2.7/dist-packages/pandas-0.9.0rc2-py2.7-linux-x86_64.egg/pandas/tools/plotting.pyc in radviz(frame, class_column, ax, **kwds)
    209                    color=random_color(class_),
    210                    label=com.stringify(class_), **kwds)
--> 211     ax.legend()
    212
    213     ax.add_patch(patches.Circle((0.0, 0.0), radius=1.0, facecolor='none'))

/usr/lib/pymodules/python2.7/matplotlib/axes.pyc in legend(self, *args, **kwargs)
   4517
   4518
-> 4519         self.legend_ = mlegend.Legend(self, handles, labels, **kwargs)
   4520         return self.legend_
   4521

/usr/lib/pymodules/python2.7/matplotlib/legend.pyc in __init__(self, parent, handles, labels, loc, numpoints, markerscale, scatterpoints, scatteryoffsets, prop, pad, labelsep, handlelen, handletextsep, axespad, borderpad, labelspacing, handlelength, handleheight, handletextpad, borderaxespad, columnspacing, ncol, mode, fancybox, shadow, title, bbox_to_anchor, bbox_transform, frameon, handler_map)
    363
    364         # init with null renderer
--> 365         self._init_legend_box(handles, labels)
    366
    367         self._loc = loc

/usr/lib/pymodules/python2.7/matplotlib/legend.pyc in _init_legend_box(self, handles, labels)
    625                                         #xdescent, ydescent, width, height,
    626                                         fontsize,
--> 627                                         handlebox)
    628                 handle_list.append(handle)
    629

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in __call__(self, legend, orig_handle, fontsize, handlebox)
    108         a_list = self.create_artists(legend, orig_handle,
    109                                      xdescent, ydescent, width, height, fontsize,
--> 110                                      handlebox.get_transform())
    111
    112         # create_artists will return a list of artists.

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in create_artists(self, legend, orig_handle, xdescent, ydescent, width, height, fontsize, trans)
    350
    351         sizes = self.get_sizes(legend, orig_handle, xdescent, ydescent,
--> 352                                width, height, fontsize)
    353
    354         p = self.create_collection(orig_handle, sizes,

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in get_sizes(self, legend, orig_handle, xdescent, ydescent, width, height, fontsize)
    305                   xdescent, ydescent, width, height, fontsize):
    306         if self._sizes is None:
--> 307             size_max = max(orig_handle.get_sizes()) * legend.markerscale ** 2
    308             size_min = min(orig_handle.get_sizes()) * legend.markerscale ** 2
    309

ValueError: max() arg is an empty sequence
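The immediate cause is visible in the last frame: max() raises ValueError on an empty sequence, and get_sizes() returns nothing when one of the plotted classes ends up as an empty scatter collection. A guarded version of that size computation (a hypothetical helper for illustration, not actual matplotlib code) would avoid the crash:

```python
def safe_size_max(sizes, markerscale=1.0, default=1.0):
    # max() raises ValueError on an empty sequence, which is exactly
    # what happens when a legend handler asks an empty scatter
    # collection for its marker sizes; fall back to a default instead.
    if len(sizes) == 0:
        return default * markerscale ** 2
    return max(sizes) * markerscale ** 2
```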
Now that the docs are online there should be a command in the PML Shell to launch them in a web browser.
Create a collections matcher to simplify test cases that need to compare lists, etc.
The indices of the main dataframe and of any labels or clustering results should be checked for consistency. If they do not all match, an error should be raised.
Turn various ValueErrors into custom error classes. They should be as user friendly as possible in their error messages.
The IPython qtconsole allows features such as inline plotting, so it would be nice to have the option to use it.
Need to be able to handle the case where the first column is meant to be ids and not a feature.
Extract code from classifiers module (currently holds KNN implementation) and reuse for naive Bayes. KNN should probably go in a module called knn instead of classifiers (classifiers can hold the common parts perhaps).
Create a new utility module for operations such as computing the accuracy of a classifier's results given the labelled dataset.
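A minimal sketch of such a utility (the signature is an assumption; the integration tests elsewhere in this list import a compute_accuracy from a metrics module):

```python
def compute_accuracy(predicted, actual):
    """Fraction of predictions that match the true labels (sketch)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot compute accuracy of empty results")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))
```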
Currently when labeled data is loaded the labels are interpreted as an extra feature. Need a mechanism when loading to specify whether the data is labeled or not. The data model should be updated to clearly distinguish between the feature data and labels.
Add the ability to generate a confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix) from classifier results.
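The core bookkeeping is small; one possible sketch (representation is an assumption, a dict keyed by label pairs that could later be rendered as a table):

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    # Count how often each (actual label, predicted label) pair occurs.
    counts = defaultdict(int)
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return dict(counts)
```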
There should be a method called "info" or something similar in the DataSet class which gathers summary information about the DataSet which can be displayed to the user.
The information should include:
-the feature list -> "Features: [...]"
-the number of samples
-whether there are missing values -> "Missing values?: %s" % <yes/no>
-whether the dataset is labelled -> "Is labelled?: %s" % <yes/no>
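A plain-Python sketch of how that summary could be assembled (the helper name and data representation are assumptions, not pml's actual data model):

```python
def dataset_info(features, samples, labels=None):
    """Build the summary described above from a list of feature names,
    rows of values (None marks a missing value), and optional labels."""
    missing = any(v is None for row in samples for v in row)
    lines = [
        "Features: %s" % features,
        "Number of samples: %d" % len(samples),
        "Missing values?: %s" % ("yes" if missing else "no"),
        "Is labelled?: %s" % ("yes" if labels is not None else "no"),
    ]
    return "\n".join(lines)
```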
Implement a decision trees classifier.
It should be possible to write a data set to a delimited text file (which can easily be read back in) since the user may have made modifications to their data set which they want to preserve.
All functions, classes, etc., even those intended to be private, are being loaded into the top level namespace for the shell. This should be changed to carefully select only the functions and classes that should be exposed to the shell user.
Provide the ability to combine labels in a dataset.
Develop principal component analysis capabilities.
The DataSet class seems to have outgrown its current home in the loader module. There should probably be a separate module for data modeling classes.
Consider using https://readthedocs.org/ to automatically update docs when commits are pushed to GitHub.
Add a method in the PCA module for determining which features are most important.
Sphinx is leaving out the constructors in the docs it generates. I know I have seen solutions for this somewhere, Stackoverflow I think.
There should be a method for creating a RadViz plot from a DataSet. pandas has support for creating RadViz plots: http://pandas.pydata.org/pandas-docs/stable/visualization.html
It would be nice to package the code up so that simply running pml at the command line starts an interactive session.
Allow the user to calculate the percentage of variance represented in the selected principal components.
Need to be able to list the percentage of variance captured by each principal component. Should also be able to plot the cumulative variance captured by the inclusion of each subsequent principal component.
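Both quantities fall out of the eigenvalues of the covariance matrix: each component's share of the total, and the running total as components are added. A sketch (function names are assumptions):

```python
def variance_ratios(eigenvalues):
    """Fraction of total variance captured by each principal component,
    given the covariance matrix eigenvalues in descending order."""
    total = float(sum(eigenvalues))
    return [ev / total for ev in eigenvalues]

def cumulative_variance(eigenvalues):
    """Running total of variance captured as each component is added;
    this is the series one would plot cumulatively."""
    out, running = [], 0.0
    for ratio in variance_ratios(eigenvalues):
        running += ratio
        out.append(running)
    return out
```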
Add the ability to calculate metrics for the quality of clustering results.
Document the process of building the docs and putting them online.
The DataSet.split method should take an optional parameter "random" which defaults to False. If set to True, the rows of the DataSet going into each of the split results is randomized. This is important for ensuring a good mix of training data when splitting test and training from one set.
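The behaviour described above can be sketched on plain lists (the real method would operate on a DataSet; the seed parameter is an extra assumption added to keep the sketch testable):

```python
import random as _random

def split(rows, ratio=0.5, random=False, seed=None):
    """Split rows into (first, second) at the given ratio; if random is
    True, shuffle the rows first so each part gets a good mix."""
    rows = list(rows)
    if random:
        _random.Random(seed).shuffle(rows)
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]
```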
While doing #29, refactor the matchers.
Implement a naive bayes classifier.
Consider creating a custom matcher specifically for DataSet objects.
The DataSet class should have a method for setting the default value of cells with missing values. This should be flexible enough to accommodate common strategies such as setting to 0's, setting to column averages, etc.
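A sketch of the strategy dispatch on plain rows of values, with None marking a missing cell (the strategy names and function shape are assumptions):

```python
def fill_missing(rows, strategy="zero"):
    """Replace None cells column by column: 'zero' fills with 0,
    'mean' fills with the column average of the present values."""
    fills = []
    for col in zip(*rows):
        if strategy == "zero":
            fills.append(0)
        elif strategy == "mean":
            present = [v for v in col if v is not None]
            fills.append(sum(present) / float(len(present)))
        else:
            raise ValueError("unknown strategy: %s" % strategy)
    return [[v if v is not None else fills[j] for j, v in enumerate(row)]
            for row in rows]
```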
Create a pandas DataFrame matcher. Also consider updating DataSet matcher to check labels.
The results of classification (specifically KNN) are currently a pandas Series. This series should be placed and returned within a new class such as "ClassifiedDataSet". This class will provide methods for calculating metrics, etc. on results, thereby making these operations much simpler.
Test modules should be renamed from tests.py to test.py. Also, perhaps the tests directory should have a package structure mirroring the source code structure, for ease of navigation.
Create an integration test for the full process of loading a dataset and using a KNN classifier to classify each observation, and compare with expected values to determine accuracy.
This method is misleading. Rework it or at least rename.
Should be able to filter a data set for samples with certain values for specified features. Should also be able to filter by labels.
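One way the two filters could combine, sketched on plain rows and labels (the helper and its parameters are hypothetical):

```python
def filter_samples(rows, labels, feature_values=None, wanted_labels=None):
    """Keep samples whose features match feature_values (a dict of
    column index -> required value) and whose label is in wanted_labels;
    either filter may be omitted."""
    kept_rows, kept_labels = [], []
    for row, label in zip(rows, labels):
        if feature_values and any(row[i] != v
                                  for i, v in feature_values.items()):
            continue
        if wanted_labels is not None and label not in wanted_labels:
            continue
        kept_rows.append(row)
        kept_labels.append(label)
    return kept_rows, kept_labels
```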
Add packages under pml for supervised learning algorithms, unsupervised learning algorithms, etc. to aid navigation.
Put the project documentation generated by Sphinx online, hopefully on the Github project page for this repo.
Investigate possible strategies for tie breaking when there is no decisive winner of the KNN vote. Currently the winner in such circumstances is arbitrary.
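One candidate strategy, sketched for illustration only (not pml's current behaviour): break ties by preferring the tied label whose neighbours are closest in total.

```python
from collections import Counter

def vote_with_tiebreak(neighbour_labels, distances):
    """Majority vote over the k nearest neighbours; if several labels
    tie for the most votes, pick the one with the smallest total
    distance to the query point."""
    counts = Counter(neighbour_labels)
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]
    def total_distance(label):
        return sum(d for lab, d in zip(neighbour_labels, distances)
                   if lab == label)
    return min(tied, key=total_distance)
```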
SimpleCV has a mode where it starts up an IPython shell with many of its modules already imported as well as additional help and tutorial features. Something similar should be done with this project to enable quick exploratory work.
Sometimes PyDev reports "Unresolved import" errors, but the code works fine. This is likely a path problem.
Known cases:
-metrics_tests: import metrics
-integration_tests: from metrics import compute_accuracy
-plotting: from pandas.tools.plotting import radviz
Add a method to recommend the number of principal components that should be selected in order to keep a minimum specified percentage of the original data's variance.
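The recommendation reduces to walking the cumulative variance until the threshold is met; a sketch on covariance eigenvalues (function name is an assumption):

```python
def recommend_num_components(eigenvalues, min_variance=0.9):
    """Smallest number of leading principal components whose combined
    variance ratio is at least min_variance."""
    total = float(sum(eigenvalues))
    running = 0.0
    for i, ev in enumerate(eigenvalues, start=1):
        running += ev / total
        if running >= min_variance:
            return i
    return len(eigenvalues)
```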
The current factory methods for DataSet construction seem clunky. It should probably be refactored to handle the different cases in the constructor.
The 1 integration test takes longer to run than all the rest of the tests put together. It should be nearly instantaneous like the rest. Might have to eliminate the randomization used.
It is easy to forget that DataSet.fill_missing returns the filled DataSet rather than filling it in-place. It is also extra typing for the most common use case. Wanting to keep an unfilled copy would be rare; it would be better to have some other mechanism for that special case.
Therefore, change the workflow from data = data.fill_missing(val) to data.fill_missing(val).
Blank lines at the end of a CSV file are being read in as observations with NaN for every feature value. This must be the default behaviour of pandas. Investigate if there is a keyword option for ignoring blank lines at the end of the file. By default in pml we probably want to ignore blank lines at the end, but have our own keyword option that can be set explicitly to include them.
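A plain-Python sketch of the proposed default behaviour (the helper and its keyword are hypothetical; pandas may also expose its own option for this):

```python
def strip_trailing_blank_lines(lines, keep_blank=False):
    """Drop blank lines at the end of a file's contents before parsing,
    unless the caller explicitly asks to keep them as NaN rows."""
    lines = list(lines)
    if keep_blank:
        return lines
    while lines and lines[-1].strip() == "":
        lines.pop()
    return lines
```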
Implement a k-means clustering algorithm.