pml's People

Contributors

drusk

pml's Issues

plot_radviz error

When plotting the dataset in test/datasets/3f_ids_header.csv with radviz there is an error:

pml:5> data = load("3f_ids_header.csv")

pml:6> plot_radviz(data)

ValueError                                Traceback (most recent call last)
/home/drusk/workspace_pml/PythonMachineLearning/shell/shell.py in ()
----> 1 plot_radviz(data)

/home/drusk/workspace_pml/PythonMachineLearning/pml/plotting.py in plot_radviz(dataset)
     34     # contains class membership info.
     35     # therefore need to pass in the dataset's merged data and labels
---> 36     radviz(dataset.get_data_frame(), dataset.get_labels().name)
     37     plt.show()

/usr/local/lib/python2.7/dist-packages/pandas-0.9.0rc2-py2.7-linux-x86_64.egg/pandas/tools/plotting.pyc in radviz(frame, class_column, ax, **kwds)
    209                    color=random_color(class_),
    210                    label=com._stringify(class_), **kwds)
--> 211     ax.legend()
    212
    213     ax.add_patch(patches.Circle((0.0, 0.0), radius=1.0, facecolor='none'))

/usr/lib/pymodules/python2.7/matplotlib/axes.pyc in legend(self, *args, **kwargs)
   4517
   4518
-> 4519         self.legend_ = mlegend.Legend(self, handles, labels, **kwargs)
   4520         return self.legend_
   4521

/usr/lib/pymodules/python2.7/matplotlib/legend.pyc in __init__(self, parent, handles, labels, loc, numpoints, markerscale, scatterpoints, scatteryoffsets, prop, pad, labelsep, handlelen, handletextsep, axespad, borderpad, labelspacing, handlelength, handleheight, handletextpad, borderaxespad, columnspacing, ncol, mode, fancybox, shadow, title, bbox_to_anchor, bbox_transform, frameon, handler_map)
    363
    364         # init with null renderer
--> 365         self._init_legend_box(handles, labels)
    366
    367         self._loc = loc

/usr/lib/pymodules/python2.7/matplotlib/legend.pyc in _init_legend_box(self, handles, labels)
    625                                  #xdescent, ydescent, width, height,
    626                                  fontsize,
--> 627                                  handlebox)
    628                 handle_list.append(handle)
    629

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in __call__(self, legend, orig_handle, fontsize, handlebox)
    108         a_list = self.create_artists(legend, orig_handle,
    109                                      xdescent, ydescent, width, height, fontsize,
--> 110                                      handlebox.get_transform())
    111
    112         # create_artists will return a list of artists.

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in create_artists(self, legend, orig_handle, xdescent, ydescent, width, height, fontsize, trans)
    350
    351         sizes = self.get_sizes(legend, orig_handle, xdescent, ydescent,
--> 352                                width, height, fontsize)
    353
    354         p = self.create_collection(orig_handle, sizes,

/usr/lib/pymodules/python2.7/matplotlib/legend_handler.pyc in get_sizes(self, legend, orig_handle, xdescent, ydescent, width, height, fontsize)
    305                   xdescent, ydescent, width, height, fontsize):
    306         if self._sizes is None:
--> 307             size_max = max(orig_handle.get_sizes())*legend.markerscale**2
    308             size_min = min(orig_handle.get_sizes())*legend.markerscale**2
    309

ValueError: max() arg is an empty sequence

Check indices on DataSet construction

The indices of the main DataFrame and of any labels or clustering results should be checked to make sure they are all consistent. If they are not, an error should be raised.
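A minimal sketch of such a check, assuming the feature data is a pandas DataFrame and the labels are a pandas Series. The function name `check_indices` is hypothetical and not part of pml:

```python
import pandas as pd


def check_indices(data_frame, *others):
    """Raise ValueError if any Series/DataFrame in `others` has an index
    that differs from the index of `data_frame`."""
    for other in others:
        # Index.equals compares the index values element by element, in order.
        if not data_frame.index.equals(other.index):
            raise ValueError(
                "Index mismatch: expected %s but found %s"
                % (list(data_frame.index), list(other.index))
            )


features = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["a", "b"])
good_labels = pd.Series(["cat", "dog"], index=["a", "b"])
bad_labels = pd.Series(["cat", "dog"], index=["a", "c"])

check_indices(features, good_labels)  # consistent indices: passes silently
```

Calling `check_indices(features, bad_labels)` raises a ValueError because the second index entry differs.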

Extract Error Classes

Turn various ValueErrors into custom error classes. Their error messages should be as user-friendly as possible.
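One way this could look, as a sketch only. The class names `PmlError` and `InconsistentIndexError` are hypothetical. Also inheriting from ValueError is a design assumption that keeps existing `except ValueError` handlers working:

```python
class PmlError(Exception):
    """Hypothetical base class for all library-specific errors."""


class InconsistentIndexError(PmlError, ValueError):
    """Hypothetical error for mismatched indices. It is still a ValueError,
    so code that already catches ValueError keeps working."""

    def __init__(self, expected, found):
        self.expected = expected
        self.found = found
        message = (
            "The labels do not line up with the data: expected index %s "
            "but found %s." % (expected, found)
        )
        super().__init__(message)


error = InconsistentIndexError(["a", "b"], ["a", "c"])
```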

Support IDs in DataSet

Need to be able to handle the case where the first column is meant to be ids and not a feature.
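If loading goes through pandas, one possible approach is the `index_col` argument of `read_csv`. The CSV content below is made up for illustration:

```python
import io

import pandas as pd

# Made-up CSV content whose first column holds sample ids, not a feature.
csv_text = "id,height,weight\ns1,1.5,60\ns2,1.8,75\n"

# index_col=0 tells pandas to use the first column as the row index
# instead of treating it as data.
frame = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

The ids then appear in `frame.index`, and only `height` and `weight` remain as feature columns.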

Refactor classifiers

Extract code from classifiers module (currently holds KNN implementation) and reuse for naive Bayes. KNN should probably go in a module called knn instead of classifiers (classifiers can hold the common parts perhaps).

Loading labeled data

Currently when labeled data is loaded the labels are interpreted as an extra feature. Need a mechanism when loading to specify whether the data is labeled or not. The data model should be updated to clearly distinguish between the feature data and labels.
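A rough sketch of one such mechanism, assuming the loaded data is a pandas DataFrame. The helper `split_labels` and its `label_column` parameter are hypothetical names:

```python
import pandas as pd


def split_labels(frame, label_column=None):
    """Return (features, labels). `labels` is None for unlabeled data."""
    if label_column is None:
        return frame, None
    labels = frame[label_column]
    # Drop the label column so it is no longer treated as a feature.
    features = frame.drop(columns=[label_column])
    return features, labels


loaded = pd.DataFrame(
    {"petal": [1.4, 4.7], "sepal": [5.1, 7.0], "species": ["setosa", "versicolor"]}
)
features, labels = split_labels(loaded, label_column="species")
```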

Concise info for DataSet

There should be a method called "info" or something similar in the DataSet class which gathers summary information about the DataSet which can be displayed to the user.

The information should include:
-the feature list -> "Features: [...]"
-the number of samples
-whether there are missing values -> "Missing values?: %s" % <yes/no>
-whether the dataset is labelled -> "Is labelled?: %s" % <yes/no>
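The summary could be assembled roughly as follows. This is a sketch written as a standalone function over a pandas DataFrame and optional labels rather than as the actual DataSet method, and the "Samples: %d" wording is an assumption:

```python
import pandas as pd


def info(data_frame, labels=None):
    """Build the summary text from a feature DataFrame and optional labels."""
    # isnull().any().any() is True if at least one cell is missing.
    has_missing = bool(data_frame.isnull().any().any())
    lines = [
        "Features: %s" % list(data_frame.columns),
        "Samples: %d" % len(data_frame),
        "Missing values?: %s" % ("yes" if has_missing else "no"),
        "Is labelled?: %s" % ("yes" if labels is not None else "no"),
    ]
    return "\n".join(lines)


frame = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]})
summary = info(frame)
```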

Write DataSet to delimited text file

It should be possible to write a data set to a delimited text file (which can easily be read back in) since the user may have made modifications to their data set which they want to preserve.
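With pandas underneath, a round trip might look like the sketch below. An in-memory buffer stands in for a real file, and the tab delimiter is an arbitrary choice:

```python
import io

import pandas as pd

frame = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["s1", "s2"])
frame.index.name = "id"

buffer = io.StringIO()          # stands in for a file opened for writing
frame.to_csv(buffer, sep="\t")  # the index is written as the first column

buffer.seek(0)
# Reading with the same delimiter and index_col=0 restores the ids as the index.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
```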

Clean up top-level package imports

All functions, classes, etc., even those intended to be private, are being loaded into the top level namespace for the shell. This should be changed to carefully select only the functions and classes that should be exposed to the shell user.

Move DataSet to a new module

The DataSet class seems to have outgrown its current home in the loader module. There should probably be a separate module for data modeling classes.

Add constructors to docs

Sphinx is leaving out the constructors in the docs it generates. I know I have seen solutions for this somewhere, on Stack Overflow I think.
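One setting that may address this, assuming the docs are generated with the `sphinx.ext.autodoc` extension, is `autoclass_content` in the project's `conf.py`:

```python
# conf.py (Sphinx configuration)
extensions = ["sphinx.ext.autodoc"]

# "class" (the default) uses only the class docstring; "both" appends the
# __init__ docstring so constructor documentation appears as well.
autoclass_content = "both"
```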

Variance per principal component

Need to be able to list the percentage of variance captured by each principal component. Should also be able to plot the cumulative variance captured by the inclusion of each subsequent principal component.
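The arithmetic involved is small. The sketch below assumes the eigenvalues of the covariance matrix are already available, sorted in descending order. The function names and sample values are made up:

```python
from itertools import accumulate


def variance_percentages(eigenvalues):
    """Percentage of total variance captured by each principal component.

    `eigenvalues` holds one covariance-matrix eigenvalue per component,
    sorted in descending order."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


def cumulative_percentages(eigenvalues):
    """Running total of the percentages, suitable for a cumulative plot."""
    return list(accumulate(variance_percentages(eigenvalues)))


eigenvalues = [6.0, 3.0, 1.0]  # made-up values for illustration
per_component = variance_percentages(eigenvalues)
cumulative = cumulative_percentages(eigenvalues)
```

The cumulative list is what would be plotted against the component number.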

Clustering metrics

Add the ability to calculate metrics for the quality of clustering results.

DataSet random split

The DataSet.split method should take an optional parameter "random" which defaults to False. If set to True, the rows of the DataSet going into each of the split results are randomized. This is important for ensuring a good mix of training data when splitting test and training sets from one data set.
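A sketch of the idea on a plain list of rows. The flag is called `randomize` here rather than "random" to avoid shadowing the standard library module, and the `seed` parameter is an added assumption to make results reproducible:

```python
import random


def split_rows(rows, fraction, randomize=False, seed=None):
    """Split `rows` into two lists; the first gets `fraction` of the rows.

    With randomize=True the rows are shuffled before splitting, so both
    parts draw from the whole data set rather than from contiguous blocks."""
    rows = list(rows)
    if randomize:
        random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * fraction))
    return rows[:cut], rows[cut:]


ordered_train, ordered_test = split_rows(range(10), 0.7)
mixed_train, mixed_test = split_rows(range(10), 0.7, randomize=True, seed=0)
```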

Handling missing values

The DataSet class should have a method for setting the default value of cells with missing values. This should be flexible enough to accommodate common strategies such as setting to 0's, setting to column averages, etc.
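Both strategies mentioned above map onto pandas `fillna`. The sketch operates directly on a DataFrame with made-up values:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, 4.0, 8.0]})

# Strategy 1: replace every missing cell with a constant.
filled_with_zero = frame.fillna(0)

# Strategy 2: replace each missing cell with the mean of its column.
# frame.mean() returns one mean per column (NaN cells are skipped), and
# fillna matches those means to columns by name.
filled_with_mean = frame.fillna(frame.mean())
```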

DataFrame Matcher

Create a pandas DataFrame matcher. Also consider updating DataSet matcher to check labels.

Refactor Classification Results

The results of classification (specifically KNN) are currently a pandas Series. This series should be placed and returned within a new class such as "ClassifiedDataSet". This class will provide methods for calculating metrics, etc. on results, thereby making these operations much simpler.
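A bare-bones sketch of what such a class might look like. It stores plain lists instead of a pandas Series to stay self-contained, and the constructor arguments and method name are assumptions:

```python
class ClassifiedDataSet:
    """Hypothetical wrapper pairing predicted labels with true labels."""

    def __init__(self, predicted, actual):
        if len(predicted) != len(actual):
            raise ValueError("predicted and actual must have the same length")
        self.predicted = list(predicted)
        self.actual = list(actual)

    def compute_accuracy(self):
        """Fraction of samples whose predicted label matches the true label."""
        if not self.actual:
            raise ValueError("cannot compute accuracy of an empty result")
        matches = sum(p == a for p, a in zip(self.predicted, self.actual))
        return matches / float(len(self.actual))


results = ClassifiedDataSet(["a", "b", "a", "b"], ["a", "b", "b", "b"])
accuracy = results.compute_accuracy()
```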

Refactor test directory

Test modules should be renamed from tests.py to test.py. Also, perhaps the tests directory should have a package structure mirroring the source code structure for ease of navigation.

KNN Integration Test

Create an integration test for the full process of loading a dataset and using a KNN classifier to classify each observation, and compare with expected values to determine accuracy.

DataSet filtering

Should be able to filter a data set for samples with certain values for specified features. Should also be able to filter by labels.
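In pandas terms, both kinds of filtering reduce to boolean masks. The sketch below uses a bare DataFrame and Series with made-up values rather than the DataSet class:

```python
import pandas as pd

frame = pd.DataFrame(
    {"colour": ["red", "blue", "red"], "size": [1, 2, 3]},
    index=["s1", "s2", "s3"],
)
labels = pd.Series(["cat", "dog", "dog"], index=frame.index)

# Filter by feature value: keep samples whose "colour" is "red".
red_samples = frame[frame["colour"] == "red"]

# Filter by label: the boolean mask built from the labels lines up with
# the frame through their shared index.
dog_samples = frame[labels == "dog"]
```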

Refactor package structure

Add packages under pml for supervised learning algorithms, unsupervised learning algorithms, etc. to aid navigation.

Put docs online

Put the project documentation generated by Sphinx online, hopefully on the GitHub project page for this repo.

Tie breaking for KNN

Investigate possible strategies for tie breaking when there is no decisive winner of the KNN vote. Currently the winner in such circumstances is arbitrary.
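One candidate strategy, offered as an illustration rather than a settled choice, is to let the closest neighbour among the tied classes decide. The function name and the (distance, label) input format are assumptions:

```python
from collections import Counter


def vote(neighbours):
    """Pick a class from `neighbours`, a list of (distance, label) pairs.

    Ties in the vote count are broken in favour of the tied class that
    owns the single closest neighbour."""
    counts = Counter(label for _, label in neighbours)
    top = max(counts.values())
    tied = {label for label, count in counts.items() if count == top}
    # Scan neighbours from nearest to farthest; the first one whose
    # label is among the tied classes decides the vote.
    for _, label in sorted(neighbours, key=lambda pair: pair[0]):
        if label in tied:
            return label


tie_winner = vote([(0.5, "a"), (0.1, "b"), (0.9, "a"), (0.7, "b")])
clear_winner = vote([(0.1, "a"), (0.2, "b"), (0.3, "b")])
```

In the first call both classes get two votes, so the nearest neighbour (distance 0.1, class "b") settles it. In the second call "b" wins outright even though the nearest neighbour is "a".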

IPython Shell Mode

SimpleCV has a mode where it starts up an IPython shell with many of its modules already imported as well as additional help and tutorial features. Something similar should be done with this project to enable quick exploratory work.

Investigate unresolved imports

Sometimes PyDev reports "Unresolved import" errors, but the code works fine. This is likely a path problem.

Known cases:
-metrics_tests: import metrics
-integration_tests: from metrics import compute_accuracy
-plotting: from pandas.tools.plotting import radviz

Recommend number of components for PCA

Add a method to recommend how many principal components should be selected in order to retain a specified minimum percentage of the original data's variance.
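A sketch of the selection rule, assuming the per-component variance ratios (fractions that sum to 1, sorted in descending order) have already been computed. The function name and the tolerance value are assumptions:

```python
def recommend_components(variance_ratios, min_variance):
    """Smallest number of leading components whose summed variance ratio
    reaches `min_variance` (both expressed as fractions between 0 and 1)."""
    captured = 0.0
    for count, ratio in enumerate(variance_ratios, start=1):
        captured += ratio
        # Small tolerance so floating-point round-off does not demand
        # an extra component.
        if captured >= min_variance - 1e-12:
            return count
    raise ValueError("min_variance exceeds the total variance available")


ratios = [0.6, 0.3, 0.1]  # made-up values for illustration
needed_for_90 = recommend_components(ratios, 0.9)
```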

Rework DataSet Construction

The current factory methods for DataSet construction seem clunky. It should probably be refactored to handle the different cases in the constructor.

Speed up integration test runtime

The single integration test takes longer to run than all the rest of the tests put together. It should be nearly instantaneous like the rest. Might have to eliminate the randomization used.

Make DataSet.fill_missing in-place

It is easy to forget that DataSet.fill_missing returns the filled DataSet rather than filling it in place. It also means extra typing for the most common use case. Wanting to keep an unfilled copy would be rare, so it would be better to have some other mechanism for that special case.

Therefore, change the workflow from data = data.fill_missing(val) to data.fill_missing(val).

Blank lines at end of CSV

Blank lines at the end of a CSV file are being read in as observations with NaN for every feature value. This must be the default behaviour of pandas. Investigate if there is a keyword option for ignoring blank lines at the end of the file. By default in pml we probably want to ignore blank lines at the end, but have our own keyword option that can be set explicitly to include them.
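Recent pandas versions have a `skip_blank_lines` option on `read_csv`. Independently of that, trailing empty rows can be trimmed after loading. The helper below is a hypothetical sketch, and the CSV text is made up:

```python
import io

import pandas as pd


def drop_trailing_blank_rows(frame):
    """Remove rows at the end of `frame` in which every value is NaN.
    All-NaN rows in the middle of the data are left alone."""
    has_data = frame.notna().any(axis=1).tolist()
    # Walk back from the end until a row with at least one value is found.
    end = len(has_data)
    while end > 0 and not has_data[end - 1]:
        end -= 1
    return frame.iloc[:end]


csv_text = "a,b\n1,2\n3,4\n\n\n"
# skip_blank_lines=False asks pandas to keep blank lines as all-NaN rows,
# which reproduces the behaviour described in this issue.
raw = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)
trimmed = drop_trailing_blank_rows(raw)
```

A keyword on the pml loader could then decide whether this trimming step is applied.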
