drusk / pml
Simple interface to Python machine learning algorithms.
License: MIT License
When plotting the dataset in test/datasets/3f_ids_header.csv with radviz there is an error:
pml:5> data = load("3f_ids_header.csv")
ValueError                                Traceback (most recent call last)
/home/drusk/workspace_pml/PythonMachineLearning/shell/shell.py in <module>()
----> 1 plot_radviz(data)

/home/drusk/workspace_pml/PythonMachineLearning/pml/plotting.py in plot_radviz(dataset)
     34     # contains class membership info.
     35     # therefore need to pass in the dataset's merged data and labels
---> 36     radviz(dataset.get_data_frame(), dataset.get_labels().name)
     37     plt.show()

/usr/local/lib/python2.7/dist-packages/pandas-0.9.0rc2-py2.7-linux-x86_64.egg/pandas/tools/plotting.pyc in radviz(frame, class_column, ax, **kwds)
    209                    color=random_color(class_),
    210                    label=com.stringify(class_), **kwds)
--> 211     ax.legend()
    212
    213     ax.add_patch(patches.Circle((0.0, 0.0), radius=1.0, facecolor='none'))

/usr/lib/pymodules/python2.7/matplotlib/axes.pyc in legend(self, *args, **kwargs)
   4517
   4518
-> 4519         self.legend_ = mlegend.Legend(self, handles, labels, **kwargs)
   4520         return self.legend_
   4521

/usr/lib/pymodules/python2.7/matplotlib/legend.pyc in __init__(self, parent, handles, labels, loc, numpoints, markerscale, scatterpoints, scatteryoffsets, prop, pad, labelsep, handlelen, handletextsep, axespad, borderpad, labelspacing, handlelength, handleheight, handletextpad, borderaxespad, columnspacing, ncol, mode, fancybox, shadow, title, bbox_to_anchor, bbox_transform, frameon, handler_map)
    363
    364         # init with null renderer
--> 365         self._init_legend_box(handles, labels)
    366
    367         self._loc = loc

/usr/lib/pymodules/python2.7/matplotlib/legend.pyc in _init_legend_box(self, handles, labels)
    625                                         #xdescent, ydescent, width, height,
    626                                         fontsize,
--> 627                                         handlebox)
    628                 handle_list.append(handle)
    629

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in __call__(self, legend, orig_handle, fontsize, handlebox)
    108         a_list = self.create_artists(legend, orig_handle,
    109                                      xdescent, ydescent, width, height, fontsize,
--> 110                                      handlebox.get_transform())
    111
    112         # create_artists will return a list of artists.

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in create_artists(self, legend, orig_handle, xdescent, ydescent, width, height, fontsize, trans)
    350
    351         sizes = self.get_sizes(legend, orig_handle, xdescent, ydescent,
--> 352                                width, height, fontsize)
    353
    354         p = self.create_collection(orig_handle, sizes,

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in get_sizes(self, legend, orig_handle, xdescent, ydescent, width, height, fontsize)
    305                   xdescent, ydescent, width, height, fontsize):
    306         if self._sizes is None:
--> 307             size_max = max(orig_handle.get_sizes()) * legend.markerscale ** 2
    308             size_min = min(orig_handle.get_sizes()) * legend.markerscale ** 2
    309

ValueError: max() arg is an empty sequence
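The immediate cause is visible in the last frame: max() raises ValueError on an empty sequence, and get_sizes() returns nothing when one of the plotted classes ends up as an empty scatter collection. A guarded version of that size computation (a hypothetical helper for illustration, not actual matplotlib code) would avoid the crash:

```python
def safe_size_max(sizes, markerscale=1.0, default=1.0):
    # max() raises ValueError on an empty sequence, which is exactly
    # what happens when a legend handler asks an empty scatter
    # collection for its marker sizes; fall back to a default instead.
    if len(sizes) == 0:
        return default * markerscale ** 2
    return max(sizes) * markerscale ** 2
```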
Now that the docs are online there should be a command in the PML Shell to launch them in a web browser.
Create a collections matcher to simplify test cases that need to compare lists, etc.
The indices of the main dataframe and of any labels or clustering results should be checked for consistency. If they do not all match, an error should be raised.
Turn various ValueErrors into custom error classes. They should be as user friendly as possible in their error messages.
The IPython qtconsole allows features such as inline plotting, so it would be nice to have the option to use it.
Need to be able to handle the case where the first column is meant to be ids and not a feature.
Extract code from classifiers module (currently holds KNN implementation) and reuse for naive Bayes. KNN should probably go in a module called knn instead of classifiers (classifiers can hold the common parts perhaps).
Create a new utility module for operations such as computing the accuracy of a classifier's results given the labelled dataset.
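A minimal sketch of such a utility (the signature is an assumption; the integration tests elsewhere in this list import a compute_accuracy from a metrics module):

```python
def compute_accuracy(predicted, actual):
    """Fraction of predictions that match the true labels (sketch)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot compute accuracy of empty results")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))
```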
Currently when labeled data is loaded the labels are interpreted as an extra feature. Need a mechanism when loading to specify whether the data is labeled or not. The data model should be updated to clearly distinguish between the feature data and labels.
Add the ability to generate a confusion matrix (http://en.wikipedia.org/wiki/Confusion_matrix) from classifier results.
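The core bookkeeping is small; one possible sketch (representation is an assumption, a dict keyed by label pairs that could later be rendered as a table):

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    # Count how often each (actual label, predicted label) pair occurs.
    counts = defaultdict(int)
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return dict(counts)
```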
There should be a method called "info" or something similar in the DataSet class which gathers summary information about the DataSet which can be displayed to the user.
The information should include:
-the feature list -> "Features: [...]"
-the number of samples
-whether there are missing values -> "Missing values?: %s" % <yes/no>
-whether the dataset is labelled -> "Is labelled?: %s" % <yes/no>
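A plain-Python sketch of how that summary could be assembled (the helper name and data representation are assumptions, not pml's actual data model):

```python
def dataset_info(features, samples, labels=None):
    """Build the summary described above from a list of feature names,
    rows of values (None marks a missing value), and optional labels."""
    missing = any(v is None for row in samples for v in row)
    lines = [
        "Features: %s" % features,
        "Number of samples: %d" % len(samples),
        "Missing values?: %s" % ("yes" if missing else "no"),
        "Is labelled?: %s" % ("yes" if labels is not None else "no"),
    ]
    return "\n".join(lines)
```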
Implement a decision trees classifier.
It should be possible to write a data set to a delimited text file (which can easily be read back in) since the user may have made modifications to their data set which they want to preserve.
All functions, classes, etc., even those intended to be private, are being loaded into the top level namespace for the shell. This should be changed to carefully select only the functions and classes that should be exposed to the shell user.
Provide the ability to combine labels in a dataset.
Develop principal component analysis capabilities.
The DataSet class seems to have outgrown its current home in the loader module. There should probably be a separate module for data modeling classes.
Consider using https://readthedocs.org/ to automatically update docs when commits are pushed to GitHub.
Add a method in the PCA module for determining which features are most important.
Sphinx is leaving out the constructors in the docs it generates. I know I have seen solutions for this somewhere, Stackoverflow I think.
There should be a method for creating a RadViz plot from a DataSet. pandas has support for creating RadViz plots: http://pandas.pydata.org/pandas-docs/stable/visualization.html
It would be nice to package the code up so that simply running pml at the command line starts an interactive session.
Allow the user to calculate the percentage of variance represented in the selected principal components.
Need to be able to list the percentage of variance captured by each principal component. Should also be able to plot the cumulative variance captured by the inclusion of each subsequent principal component.
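Both quantities fall out of the eigenvalues of the covariance matrix: each component's share of the total, and the running total as components are added. A sketch (function names are assumptions):

```python
def variance_ratios(eigenvalues):
    """Fraction of total variance captured by each principal component,
    given the covariance matrix eigenvalues in descending order."""
    total = float(sum(eigenvalues))
    return [ev / total for ev in eigenvalues]

def cumulative_variance(eigenvalues):
    """Running total of variance captured as each component is added;
    this is the series one would plot cumulatively."""
    out, running = [], 0.0
    for ratio in variance_ratios(eigenvalues):
        running += ratio
        out.append(running)
    return out
```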
Add the ability to calculate metrics for the quality of clustering results.
Document the process of building the docs and putting them online.
The DataSet.split method should take an optional parameter "random" which defaults to False. If set to True, the rows of the DataSet going into each of the split results is randomized. This is important for ensuring a good mix of training data when splitting test and training from one set.
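The behaviour described above can be sketched on plain lists (the real method would operate on a DataSet; the seed parameter is an extra assumption added to keep the sketch testable):

```python
import random as _random

def split(rows, ratio=0.5, random=False, seed=None):
    """Split rows into (first, second) at the given ratio; if random is
    True, shuffle the rows first so each part gets a good mix."""
    rows = list(rows)
    if random:
        _random.Random(seed).shuffle(rows)
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]
```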
While doing #29, refactor the matchers.
Implement a naive bayes classifier.
Consider creating a custom matcher specifically for DataSet objects.
The DataSet class should have a method for setting the default value of cells with missing values. This should be flexible enough to accommodate common strategies such as setting to 0's, setting to column averages, etc.
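A sketch of the strategy dispatch on plain rows of values, with None marking a missing cell (the strategy names and function shape are assumptions):

```python
def fill_missing(rows, strategy="zero"):
    """Replace None cells column by column: 'zero' fills with 0,
    'mean' fills with the column average of the present values."""
    fills = []
    for col in zip(*rows):
        if strategy == "zero":
            fills.append(0)
        elif strategy == "mean":
            present = [v for v in col if v is not None]
            fills.append(sum(present) / float(len(present)))
        else:
            raise ValueError("unknown strategy: %s" % strategy)
    return [[v if v is not None else fills[j] for j, v in enumerate(row)]
            for row in rows]
```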
Create a pandas DataFrame matcher. Also consider updating DataSet matcher to check labels.
The results of classification (specifically KNN) are currently a pandas Series. This series should be placed and returned within a new class such as "ClassifiedDataSet". This class will provide methods for calculating metrics, etc. on results, thereby making these operations much simpler.
Test modules should be renamed from tests.py to test.py. Also, perhaps the tests directory should have a package structure mirroring the source code structure, for ease of navigation.
Create an integration test for the full process of loading a dataset and using a KNN classifier to classify each observation, and compare with expected values to determine accuracy.
This method is misleading. Rework it or at least rename.
Should be able to filter a data set for samples with certain values for specified features. Should also be able to filter by labels.
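One way the two filters could combine, sketched on plain rows and labels (the helper and its parameters are hypothetical):

```python
def filter_samples(rows, labels, feature_values=None, wanted_labels=None):
    """Keep samples whose features match feature_values (a dict of
    column index -> required value) and whose label is in wanted_labels;
    either filter may be omitted."""
    kept_rows, kept_labels = [], []
    for row, label in zip(rows, labels):
        if feature_values and any(row[i] != v
                                  for i, v in feature_values.items()):
            continue
        if wanted_labels is not None and label not in wanted_labels:
            continue
        kept_rows.append(row)
        kept_labels.append(label)
    return kept_rows, kept_labels
```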
Add packages under pml for supervised learning algorithms, unsupervised learning algorithms, etc. to aid navigation.
Put the project documentation generated by Sphinx online, hopefully on the Github project page for this repo.
Investigate possible strategies for tie breaking when there is no decisive winner of the KNN vote. Currently the winner in such circumstances is arbitrary.
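One candidate strategy, sketched for illustration only (not pml's current behaviour): break ties by preferring the tied label whose neighbours are closest in total.

```python
from collections import Counter

def vote_with_tiebreak(neighbour_labels, distances):
    """Majority vote over the k nearest neighbours; if several labels
    tie for the most votes, pick the one with the smallest total
    distance to the query point."""
    counts = Counter(neighbour_labels)
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]
    def total_distance(label):
        return sum(d for lab, d in zip(neighbour_labels, distances)
                   if lab == label)
    return min(tied, key=total_distance)
```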
SimpleCV has a mode where it starts up an IPython shell with many of its modules already imported as well as additional help and tutorial features. Something similar should be done with this project to enable quick exploratory work.
Sometimes PyDev reports "Unresolved import" errors, but the code works fine. This is likely a path problem.
Known cases:
-metrics_tests: import metrics
-integration_tests: from metrics import compute_accuracy
-plotting: from pandas.tools.plotting import radviz
Add a method to recommend the number of principal components that should be selected in order to keep a minimum specified percentage of the original data's variance.
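The recommendation reduces to walking the cumulative variance until the threshold is met; a sketch on covariance eigenvalues (function name is an assumption):

```python
def recommend_num_components(eigenvalues, min_variance=0.9):
    """Smallest number of leading principal components whose combined
    variance ratio is at least min_variance."""
    total = float(sum(eigenvalues))
    running = 0.0
    for i, ev in enumerate(eigenvalues, start=1):
        running += ev / total
        if running >= min_variance:
            return i
    return len(eigenvalues)
```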
The current factory methods for DataSet construction seem clunky. It should probably be refactored to handle the different cases in the constructor.
The 1 integration test takes longer to run than all the rest of the tests put together. It should be nearly instantaneous like the rest. Might have to eliminate the randomization used.
It is easy to forget that DataSet.fill_missing returns the filled DataSet rather than filling it in-place. It is also extra typing for the most common use case. Wanting to keep an unfilled copy would be rare; it would be better to have some other mechanism for that special case.
Therefore, change the workflow from data = data.fill_missing(val) to data.fill_missing(val).
Blank lines at the end of a CSV file are being read in as observations with NaN for every feature value. This must be the default behaviour of pandas. Investigate if there is a keyword option for ignoring blank lines at the end of the file. By default in pml we probably want to ignore blank lines at the end, but have our own keyword option that can be set explicitly to include them.
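A plain-Python sketch of the proposed default behaviour (the helper and its keyword are hypothetical; pandas may also expose its own option for this):

```python
def strip_trailing_blank_lines(lines, keep_blank=False):
    """Drop blank lines at the end of a file's contents before parsing,
    unless the caller explicitly asks to keep them as NaN rows."""
    lines = list(lines)
    if keep_blank:
        return lines
    while lines and lines[-1].strip() == "":
        lines.pop()
    return lines
```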
Implement a k-means clustering algorithm.