jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML

License: GNU Affero General Public License v3.0

Java 87.81% Python 12.19%

jpmml-sklearn's Issues

Implementation of sklearn.preprocessing.Normalizer

Hi, I've tried to implement sklearn.preprocessing.Normalizer with the L1 norm as a custom Transformer.

public List<Feature> encodeFeatures(List<Feature> features, SkLearnEncoder encoder) {

        List<Feature> result = new ArrayList<>();

        Apply sumExpression = PMMLUtil.createApply("+");
        for(Feature feature : features){
            sumExpression.addExpressions(feature.toContinuousFeature().ref());
        }
        FieldName name = FieldName.create("sum-of-features");
        DerivedField sumField = encoder.createDerivedField(name, sumExpression);
        ContinuousFeature sumFeature = new ContinuousFeature(encoder, sumField);

        for(int i = 0; i < features.size(); i++) {
            Feature feature = features.get(i);
            ContinuousFeature continuousFeature = feature.toContinuousFeature();
            Expression expression = continuousFeature.ref();
            expression = PMMLUtil.createApply("/", expression, sumFeature.ref());
            DerivedField derivedField = encoder.createDerivedField(createName(continuousFeature), expression);
            result.add(new ContinuousFeature(encoder, derivedField));
        }
        return result;
}

It builds the PMML, but evaluation fails with the error "Expected 2 arguments, but got 3000 arguments", where 3000 is features.size().

Am I doing something wrong?
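A possible explanation, offered as a sketch rather than a confirmed diagnosis: the PMML built-in "+" is a two-argument operator, while the built-in "sum" accepts a variable number of arguments. Note also that the L1 norm is the sum of absolute values, not the plain sum. The toy model below uses nested tuples in place of PMML Apply elements; the helper names (apply, nest_binary, evaluate) are hypothetical and are not part of the JPMML API.

```python
from functools import reduce

def apply(function, *args):
    """Toy stand-in for a PMML Apply element: a tuple of (function, arguments...)."""
    return (function,) + args

def nest_binary(function, operands):
    """Fold a list of operands into nested two-argument applications."""
    return reduce(lambda left, right: apply(function, left, right), operands)

def evaluate(expr, row):
    """Evaluate a toy expression against a dict of field values."""
    if isinstance(expr, str):
        return row[expr]  # a bare string is a field reference
    function, args = expr[0], [evaluate(arg, row) for arg in expr[1:]]
    if function == "+":
        # Mimic the evaluator's arity check for the binary operator.
        if len(args) != 2:
            raise ValueError("Expected 2 arguments, but got %d arguments" % len(args))
        return args[0] + args[1]
    if function == "sum":
        return sum(args)  # variadic
    if function == "abs":
        return abs(args[0])
    if function == "/":
        return args[0] / args[1]
    raise ValueError(function)

fields = ["x1", "x2", "x3"]
row = {"x1": 1.0, "x2": -3.0, "x3": 4.0}

abs_refs = [apply("abs", name) for name in fields]
l1_norm_variadic = apply("sum", *abs_refs)   # one variadic call
l1_norm_nested = nest_binary("+", abs_refs)  # ("+", ("+", a, b), c)

# Divide every field by the L1 norm of the row.
normalized = [evaluate(apply("/", name, l1_norm_variadic), row) for name in fields]
```

Either form avoids passing all features to a single "+" call. With thousands of features the variadic form keeps the expression flat, whereas the nested form grows one level deeper per feature.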

Perform OneHotEncoding and LabelEncoding within the same pipeline

I am trying to use both a LabelEncoder() and a OneHotEncoder() within the same pipeline (as OneHotEncoder does not support string values) and I cannot find the right way to do so.

I found examples such as

my_mapper = DataFrameMapper([
  ("cat_col_1", OneHotEncoder()),
  ("bin_col_2", LabelBinarizer()),
  ("target", None)
])

But in my case it is the same column that is LabelEncoded and then OneHotEncoded.
I tried the following:

mapper = DataFrameMapper([
    ("cat_col_1", [LabelEncoder(), OneHotEncoder()])
])
classifier = RandomForestClassifier()

pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(df, df["target"])

Which results in an error:
ValueError: Number of labels=16677 does not match number of samples=1

The problem seems to be that LabelEncoder outputs an array of shape (n_samples,), while OneHotEncoder expects an array of shape (n_samples, 1) when, as here, there is a single feature.

Is there any way to properly integrate a LabelEncoder prior to a OneHotEncoder?

EDIT: I found a workaround. Instead of one mapper I use two, and I set the 'df_out' parameter of the first mapper to True so that its output is still a DataFrame rather than a plain array, which allows the second mapper to refer to columns by label ("cat_col_1"). Is this the right way to do it?

When parsing a pipeline with two mappers, the following error is raised:

Exception in thread "main" java.lang.UnsupportedOperationException
	at sklearn_pandas.DataFrameMapper.getOpType(DataFrameMapper.java:47)
	at org.jpmml.sklearn.SkLearnEncoder.updateFeatures(SkLearnEncoder.java:42)
	at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:118)
	at org.jpmml.sklearn.Main.run(Main.java:146)
	at org.jpmml.sklearn.Main.main(Main.java:93)
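The shape mismatch described in this issue (a flat array of labels where a single-column 2-D array is expected) can be illustrated without any library. This is only a pure-Python sketch of what the two encoding steps compute; the function names are made up for illustration and it does not use the scikit-learn classes themselves.

```python
def label_encode(values):
    """Map each distinct value to an integer code, with classes in sorted order."""
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    return classes, [index[value] for value in values]

def to_column(codes):
    """Turn a flat list of n codes into n rows of one column each."""
    return [[code] for code in codes]

def one_hot_encode(column, n_values):
    """Expand a single-column 2-D list of integer codes into indicator rows."""
    return [
        [1 if code == position else 0 for position in range(n_values)]
        for (code,) in column
    ]

colors = ["red", "green", "blue", "green"]
classes, codes = label_encode(colors)   # flat: shape (n_samples,)
column = to_column(codes)               # nested: shape (n_samples, 1)
encoded = one_hot_encode(column, len(classes))
```

The to_column step is the piece that is missing when the two encoders are chained directly: without it, the flat list of n codes looks like a single sample with n features.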

Conversion of `sklearn.pipeline.Pipeline` objects

As you explain in the tutorial, jpmml-sklearn can convert two kinds of objects to PMML (both stored in .pkl files): mappers and estimators. Mappers come from DataFrameMappers created with sklearn-pandas, and estimators come from models created with scikit-learn.
In my case I would like to convert a Pipeline object created directly with scikit-learn. Of course I could rewrite my Python code in terms of DataFrameMappers, but it would be more convenient to convert the Pipeline object directly. Thank you.

ExtraTreesRegressor does not seem to have target variables for the segments declared

getting

ERROR [2015-11-19 22:23:55,309] io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: 5b4145018068a76d
! org.jpmml.evaluator.TypeCheckException: Expected DOUBLE, but got null
! at org.jpmml.evaluator.TypeUtil.toDouble(TypeUtil.java:571) ~[pmml-evaluator-1.2.6.jar:na]
! at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:378) ~[pmml-evaluator-1.2.6.jar:na]
! at org.jpmml.evaluator.TypeUtil.parseOrCast(TypeUtil.java:66) ~[pmml-evaluator-1.2.6.jar:na]
! at org.jpmml.evaluator.MiningModelEvaluator.aggregateValues(MiningModelEvaluator.java:458) ~[pmml-evaluator-1.2.6.jar:na]

Since aggregateValues is getting back a null result from

for(SegmentResultMap segmentResult : segmentResults){
            Object targetValue = EvaluatorUtil.decode(segmentResult.getTargetValue());

which to me seems to imply that the trees are not returning target values.

It looks like the recent refactoring may have something to do with it: in
27858a1#diff-d4ca34d7102c57121516753b9faf5e41 the standalone variable is used to set the target field only when it is true, whereas
27858a1#diff-b6e00c7675e0a9b5c3c0432ddf12c47eL126 always set the target field regardless of the standalone variable. This is just from glancing through the code.

Support for sklearn VotingClassifier

Any plans on adding this? :)

Import SkLearn2PMML-Plugin example transformers

The SkLearn2PMML/JPMML-SkLearn stack should provide adequate coverage of PMML built-in functions: http://dmg.org/pmml/v4-3/BuiltinFunctions.html

Convert Model with categorical features to PMML

I used LabelBinarizer to convert categorical features to dummy variables, and trained a GBM model. However, when converting the trained model and the DataFrameMapper to PMML, a CalledProcessError was raised by the Java call. Would you mind having a look at the issue? Thanks.

DataFrameMapper step

cat = [feature_names[i] for i in categorical_features]
num = [feature_names[i] for i in range(15) if i not in categorical_features]

transform = [(column, None) if column in num else (column, sklearn.preprocessing.LabelBinarizer())
for column in train_dt.columns]
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper(transform)
train_array = mapper.fit_transform(train_dt)

model training step

colNum = train_array.shape[1]
from sklearn.ensemble import GradientBoostingClassifier
gbtree = GradientBoostingClassifier(random_state=10)
gbtree.fit(train_array[:,0:colNum-1], train_array[:,colNum-1])

convert to PMML

from sklearn2pmml import sklearn2pmml
sklearn2pmml(gbtree, mapper, "testLabelBinarizer.pmml", with_repr = True)

error:
CalledProcessError: Command '['java', '-cp', '/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-converter-1.0.3.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.0-SNAPSHOT.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-agent-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-metro-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-schema-1.2.11.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pyrolite-4.10.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-api-1.7.18.jar:/Users/minw/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.18.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/var/folders/_3/rcqhhsv17jlc9wk_83zvrjqw0000gn/T/tmpAuhOT9.pkl', '--repr-estimator', "GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',\n max_depth=3, max_features=None, max_leaf_nodes=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=100,\n presort='auto', random_state=10, subsample=1.0, verbose=0,\n warm_start=False)", '--pkl-mapper-input', '/var/folders/_3/rcqhhsv17jlc9wk_83zvrjqw0000gn/T/tmpe7k2oy.pkl', '--repr-mapper', 
"DataFrameMapper(features=[('Age', None), ('Workclass', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)), ('fnlwgt', None), ('Education', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)), ('Education-Num', None), ('Marital Status', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=Fa...None), ('Country', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)), ('Labels', None)],\n sparse=False)", '--pmml-output', 'testLabelBinarizer.pmml']' returned non-zero exit status 1

Support for `sklearn.preprocessing.FunctionTransformer`

Would it be possible to add FunctionTransformer to the supported preprocessing tools? It currently works with the pickle serialization of the DataFrameMapper object, but fails on the call to the JAR. I imagine full functionality would be difficult given the limited set of functions in the PMML format, but at least some common transforms wrapped in a FunctionTransformer (e.g., basic numerical operations) are supported in PMML.

Interaction between categorical features does not make sense

from sklearn.feature_selection import SelectKBest
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, PolynomialFeatures
from sklearn.linear_model import LinearRegression
import pandas


df = pandas.read_csv('Audit.csv')
df = df.dropna()  # dropna() returns a new DataFrame; assign it to keep the result

for c in df.columns:
    df[c] = df[c].astype(str)

df['Adjusted'] = df.Adjusted.astype(int)

mapper = DataFrameMapper([
    ('Education', LabelBinarizer()),
    ('Occupation', LabelBinarizer()),
])

poly = PolynomialFeatures()

regressor = LinearRegression()

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("selector", SelectKBest(k=20)),
    ("polynomial", poly),
    ("regressor", regressor)
])
pipeline.fit(df, df.Adjusted)
sklearn2pmml(pipeline, 'demo.pmml', with_repr=True)

AdaBoostClassifier class not found

The ensemble package has AdaBoostRegressor but not AdaBoostClassifier. What should be done?

error on LabelBinarizer when feature has two labels

An error appears when sklearn.preprocessing.LabelBinarizer is used on a feature that has two labels. In that case LabelBinarizer switches to "binary mode" and produces only one column.

As a result, the following check later fails in DataFrame.updatePMML():

if (inputNames.size() != numberOfInputs) {
   throw new IllegalArgumentException();
}

The problem was discovered on a fairly old version of jpmml-sklearn (1.0-SNAPSHOT), but I see that the handling of sklearn.preprocessing.LabelBinarizer has not changed, so I suppose the problem still exists. I am going to check with the latest version of jpmml-sklearn later.
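The "binary mode" mentioned above can be sketched in pure Python. This is an illustration of the output width only, not the scikit-learn implementation; label_binarize is a hypothetical helper and the sketch assumes at least two distinct labels.

```python
def label_binarize(values):
    """Return (classes, rows); two classes collapse to a single indicator column."""
    classes = sorted(set(values))
    if len(classes) == 2:
        # Binary mode: one column that flags the second class.
        return classes, [[1 if value == classes[1] else 0] for value in values]
    # Multiclass mode: one indicator column per class.
    return classes, [[1 if value == cls else 0 for cls in classes] for value in values]

binary_classes, binary_rows = label_binarize(["no", "yes", "no"])
multi_classes, multi_rows = label_binarize(["a", "b", "c", "a"])

binary_width = len(binary_rows[0])  # number of output columns for two labels
multi_width = len(multi_rows[0])    # number of output columns for three labels
```

A converter that assumes one output column per class would therefore expect two columns for a two-label feature and find only one, which is consistent with the size check quoted above.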

numpy function is log1p not ln1p

Currently the FunctionTransformer (https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/preprocessing/FunctionTransformer.java#L120) looks for a numpy ufunc named ln1p, but the numpy function is actually called log1p (https://docs.scipy.org/doc/numpy/reference/ufuncs.html#math-operations).
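For reference, the Python standard library follows the same naming as numpy: the function is log1p, and there is no ln1p. A short sketch using only the math module, showing what the function computes:

```python
import math

x = 1e-10

# log1p(x) computes ln(1 + x) without first rounding 1 + x to a double,
# which matters when x is very small.
naive = math.log(1.0 + x)
accurate = math.log1p(x)

has_log1p = hasattr(math, "log1p")
has_ln1p = hasattr(math, "ln1p")
```

For small x the result is very close to x itself, which is the case the dedicated function exists to handle.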

Add Predicate caching for TreeModel producers

As requested in the JPMML mailing list:
https://groups.google.com/d/msg/jpmml/nIpr9gWcAq8/TuilwMX9DQAJ

Runtime exception while converting a random forest classifier model trained on a bag of words using sklearn TfidfVectorizer

I am unable to obtain PMML when converting a random forest trained on a TfidfVectorizer bag of words. The following exception is thrown: RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

Meta-estimators for consensus prediction

We are looking to ensemble different models using several rules, for example, multiplying the results of two models and averaging the product with the result of a third model.

There could be a meta-estimator class that combines the predictions of child estimators based on some formula. The simplest meta-estimator configuration would be the one that accepts two estimators as arguments; all other meta-estimator configurations can be derived from this one via nesting.

For example:

regressor = ConsensusRegressor(regressor1, regressor2, formula = "(y[0] + y[1]) / 2")
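To make the nesting idea concrete, here is a minimal pure-Python sketch. ConsensusSketch is a hypothetical stand-in, not an existing sklearn2pmml class; child models are plain functions that map a list of rows to a list of predictions, and the formula is a callable instead of a string.

```python
class ConsensusSketch:
    """Hypothetical meta-estimator: combines child predictions with a formula."""

    def __init__(self, predictors, formula):
        self.predictors = predictors  # callables: rows -> list of predictions
        self.formula = formula        # callable: list of child outputs -> value

    def predict(self, rows):
        per_model = [predictor(rows) for predictor in self.predictors]
        # zip(*per_model) regroups the predictions row by row.
        return [self.formula(list(y)) for y in zip(*per_model)]

def model_a(rows):
    return [row + 1.0 for row in rows]

def model_b(rows):
    return [2.0 * row for row in rows]

def model_c(rows):
    return [10.0 for _ in rows]

# "Multiply the results of two models, then average the product with a third":
product = ConsensusSketch([model_a, model_b], lambda y: y[0] * y[1])
consensus = ConsensusSketch([product.predict, model_c], lambda y: (y[0] + y[1]) / 2)

predictions = consensus.predict([1.0, 2.0])
```

The outer instance takes the inner instance's predict method as one of its two children, which is how a three-model rule is built from two-argument combinations.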

Output pmml version

How can we specify the pmml version when generating the output?

Support for incremental learning of sklearn.neural_network.MLPRegressor

I am using the following code to train a sklearn.neural_network.MLPRegressor and then convert it to PMML. Is there any way to train my model incrementally on new data with the current version of the library?

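import pandas
from sklearn.neural_network import MLPRegressor
from sklearn2pmml import PMMLPipeline, sklearn2pmml

# filename is the path to the training CSV, which must contain a "label" column
predictivescore_df = pandas.read_csv(filename)

clf = MLPRegressor(solver='lbfgs', alpha=1e-5,
                   hidden_layer_sizes=(5,), random_state=1, max_iter=100)

pipeline = PMMLPipeline([
    ('clf', clf)
])
pipeline.fit(predictivescore_df[predictivescore_df.columns.difference(["label"])], predictivescore_df["label"])
sklearn2pmml(pipeline, "PredictiveModel.pmml")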

Support for `sklearn_pandas` 1.2.X

The 1.2.X development branch appears to utilize CPython classes:

python:  3.4.5
sklearn:  0.17.1
sklearn.externals.joblib: 0.9.4
sklearn_pandas:  1.2.0
sklearn2pmml:  0.12.0

SEVERE: Failed to parse DataFrameMapper PKL
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pandas.indexes.base._new_Index)
	at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)

Ensure the case-sensitivity of TF-IDF transformations

User feedback indicates that there might be an issue with the default values of TextIndex@isCaseInsensitive and TextIndexNormalization@isCaseInsensitive attributes when converting TF-IDF pipelines:
#50 (comment)

Support for `DictVectorizer`

As requested in the JPMML mailing list: https://groups.google.com/d/msg/jpmml/f75d-yN3CvI/HEJ5oygLAQAJ

Support `handle_unknown` param for OneHotEncoder

Currently jpmml-sklearn doesn't appear to provide any support for sklearn's handle_unknown='ignore' setting.
So values falling outside the pre-specified set will yield an error instead of the sklearn 'ignore' behavior where all output values are just set to 0.

That's fine if it's by design or a PMML limitation, but I wanted to check if it's supportable.
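For clarity, the two behaviors being compared can be sketched in pure Python. The one_hot helper below is hypothetical and only mirrors the semantics described in this issue (error versus all zeros); it is not how either library implements it.

```python
def one_hot(value, categories, handle_unknown="error"):
    """Encode one value as an indicator row over a fixed list of categories."""
    if value not in categories:
        if handle_unknown == "ignore":
            # Unknown value: every indicator column is 0.
            return [0] * len(categories)
        raise ValueError("Unknown category: %r" % (value,))
    return [1 if value == category else 0 for category in categories]

categories = ["cat", "dog", "fish"]

known = one_hot("dog", categories)
ignored = one_hot("bird", categories, handle_unknown="ignore")

try:
    one_hot("bird", categories)
    raised = False
except ValueError:
    raised = True
```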

Generate Multiple target pmml failed

Hi, I want to generate PMML that includes multiple targets. For RandomForestRegressor, the generated PMML appears to be single-target. For MultiOutputRegressor, the error "Failed to convert java.lang.IllegalArgumentException" is raised.

RandomForestRegressor example:
Here is the code:
from sklearn import datasets
from sklearn.datasets.base import Bunch
import csv
import numpy as np
from time import time
import pandas as pd
import scipy

caseName = "6MultipleOutputRandomForestRegressor_conti"
df = pd.read_csv("/D/AC/5.0/ScoringWithScikitLearn/Tests/data/employ_salary.csv",sep=",")
test_X = df.iloc[:,4:7]
test_y = df[['average_montly_hours','satisfaction_level']]

from sklearn2pmml import sklearn2pmml
from sklearn2pmml import PMMLPipeline
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

max_depth = 30
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (list(test_X.columns.values), [ContinuousDomain(), Imputer()])
    ])),
    ("regression", RandomForestRegressor(max_depth=max_depth, random_state=0)),
])

pipeline.fit(test_X,test_y)
sklearn2pmml(pipeline, "/D/AC/5.0/ScoringWithScikitLearn/Tests/out/"+caseName+"_PyPMML.xml", with_repr = True)

Expected behavior: multiple target field in MiningSchema in pmml
Current behavior: only one target field in MiningSchema in pmml, like this,

MultiOutputRegressor example:
Here is the code (similar to the RandomForestRegressor example, just with a different model):
...
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (list(test_X.columns.values), [ContinuousDomain(), Imputer()])
    ])),
    ("regression", MultiOutputRegressor(RandomForestRegressor(max_depth=max_depth, random_state=0))),
])
...

Here is the error:
Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 56 ms.
Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
INFO: Converting..
Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:74)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:74)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)
Traceback (most recent call last):
  File "6MultipleOutputReg_conti.py", line 40, in <module>
    sklearn2pmml(pipeline, "/D/AC/5.0/ScoringWithScikitLearn/Tests/out/"+caseName+"_PyPMML.xml", with_repr = True)
  File "/Users/lihuaw/.local/lib/python2.7/site-packages/sklearn2pmml/__init__.py", line 142, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

Note: I also tried KNeighborsClassifier, KNeighborsRegressor, and MLPRegressor; all of them hit the same error with MultiOutputRegressor. I guess multi-target PMML is not currently supported by sklearn2pmml. Am I right?
Is there any plan to support this?
Please correct me if I'm wrong.
Thanks a lot,

Using more than one CountVectorizer in DataFrameMapper gives error

I am trying to use more than one CountVectorizer in a DataFrameMapper, and it gives an error.

DataFrameMapper([('attachment_links',countVectorizer2),('meta_email_subject',countVectorizer1)])
When I use a single CountVectorizer, there is no issue.
Here is the stack trace:

Jul 20, 2017 5:38:55 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 785 ms.
Jul 20, 2017 5:38:55 PM org.jpmml.sklearn.Main run
INFO: Converting..
 at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
 at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122)
 at org.jpmml.sklearn.Main.run(Main.java:144)
 at org.jpmml.sklearn.Main.main(Main.java:93)
java.lang.IllegalArgumentException: tf
 at org.jpmml.converter.PMMLEncoder.addDefineFunction(PMMLEncoder.java:215)
 at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:126)
 at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:74)
 at sklearn.Initializer.encodeFeatures(Initializer.java:53)
Jul 20, 2017 5:38:55 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: tf
 at org.jpmml.converter.PMMLEncoder.addDefineFunction(PMMLEncoder.java:215)
 at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:126)
 at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:74)
 at sklearn.Initializer.encodeFeatures(Initializer.java:53)
 at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
 at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122)
 at org.jpmml.sklearn.Main.run(Main.java:144)
 at org.jpmml.sklearn.Main.main(Main.java:93)

Support for 32-bit Python pickle files

The encoding of the Tree object is platform dependent. An attempt to convert a 32-bit Python pickle object on a 64-bit Java throws a java.lang.StackOverflowError in method sklearn.tree.TreeModelUtil#encodeNode(Node, ..).

Supporting `sklearn.feature_extraction.text.TfidfVectorizer`

It would be great if the sklearn.feature_extraction.text.TfidfVectorizer transformation could be supported by JPMML-sklearn. It would be even better if both sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfTransformer could be supported (TfidfVectorizer is the combination of these two).

FunctionTransformer usage

I would like to know the limitations of using FunctionTransformer in scikit-learn in order to create a valid PMML pipeline.
I have no experience in Java, so it is difficult to figure out its limitations myself, but I see in https://github.com/jpmml/jpmml-sklearn/blob/c0a3414c7486e880edf1aa79d4fd4b70d346cd2e/src/main/java/sklearn/preprocessing/FunctionTransformer.java that it supports various functions.

But my question is: what about combinations of those functions with simple +, -, *, / math operations (e.g. np.log(np.ceil(x/100)+1))? Is there any way to create similar static expressions via FunctionTransformer?

If not, is there a way to do it in Python and compile it to a valid PMML model using this library?
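Whether the converter accepts such a composite is not established here. The sketch below only shows that the example expression decomposes into a chain of elementary steps (divide, ceil, add, natural log), each of which has a counterpart among the PMML built-in functions. It uses the standard math module in place of numpy, and the step encoding is a made-up illustration.

```python
import math

def transform(x):
    """Scalar equivalent of np.log(np.ceil(x / 100) + 1)."""
    return math.log(math.ceil(x / 100) + 1)

# The same computation as an explicit chain of single operations.
STEPS = [
    ("/", 100),     # x / 100
    ("ceil", None), # round up to an integer
    ("+", 1),       # add one
    ("ln", None),   # natural logarithm
]

def run_steps(x, steps=STEPS):
    """Apply each elementary step in order to the running value."""
    operations = {
        "/": lambda value, arg: value / arg,
        "+": lambda value, arg: value + arg,
        "ceil": lambda value, arg: math.ceil(value),
        "ln": lambda value, arg: math.log(value),
    }
    for name, arg in steps:
        x = operations[name](x, arg)
    return x

inputs = (50, 250, 1000)
direct = [transform(x) for x in inputs]
chained = [run_steps(x) for x in inputs]
```

If composites are not supported in a single transformer, one step per transformer in a list is a natural thing to try, since each step is then a single function or arithmetic operation.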

Support for `sklearn.svm.OneClassSVM`

It would be awesome to add the sklearn.svm.OneClassSVM outlier detection system to the other supported LibSVM-based models!

can mapper be just the StandardScaler

Can I use the StandardScaler from sklearn.preprocessing as the mapper that I export as a pkl file, instead of exporting a sklearn_pandas.DataFrameMapper?

Why Estimator and Mapper

Hi Villu, I am reading through your tutorial and wondering whether there is any specific reason to use DataFrameMapper. With the DataFrameMapper you end up with two pickle objects. I am planning to export the PMML to Openscoring; will Openscoring be able to deploy both the mapper and the estimator as a single model?
BTW, really great work!

Add possibility to add information to the pmml header.

Hello!

Great little tool. I would like to be able to add information to the header of the PMML file, as specified in the PMML standard.

For example, one could add text to the tag of the header. I find this useful for storing information about how the model was trained. I don't know what the best way to do this is. I guess it could be done via command-line arguments?

The project we are working on will rely heavily on sklearn2pmml.
I guess adding something like sklearn2pmml(model, mapper, path, dict_with_header_information) is feasible?

Thanks!

Can't convert a neural_network.MLPRegressor model into PMML

I have been exploring the new MLPRegressor model from sklearn version 0.18, but I can't convert a trained model to PMML with my regular flow (where all other models get converted).

I can't seem to pinpoint the exact problem, but I believe support for this model is outdated.

If this is not the case, do you have any guess as to what the problem may be? Thanks!

PS: Thanks a lot for the work on this library. It has been really useful.

`FunctionTransformer` serialization in `DataFrameMapper` failing on `numpy.log`

When using the log ufunc in the FunctionTransformer preprocessing option within a DataFrameMapper, the conversion to PMML now fails:

SEVERE: Failed to convert DataFrameMapper
java.lang.IllegalArgumentException: statsmodels.datasets.anes96.data
    at sklearn.preprocessing.FunctionTransformer.parseUFunc(FunctionTransformer.java:102)
    at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:57)
    at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:70)
    at org.jpmml.sklearn.Main.run(Main.java:146)
    at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" java.lang.IllegalArgumentException: statsmodels.datasets.anes96.data
    at sklearn.preprocessing.FunctionTransformer.parseUFunc(FunctionTransformer.java:102)
    at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:57)
    at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:70)
    at org.jpmml.sklearn.Main.run(Main.java:146)
    at org.jpmml.sklearn.Main.main(Main.java:107)

Is this due to a dependency on statsmodels? Interestingly, the process still works with the log10 ufunc.

XGBClassifier wrapper failing to convert

When using the xgboost.XGBClassifier wrapper, the estimator fails to convert. I get the error:

Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsing DataFrameMapper PKL..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsed DataFrameMapper PKL in 31 ms.
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Converting DataFrameMapper..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Converted DataFrameMapper in 27 ms.
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsing Estimator PKL..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Parsed Estimator PKL in 5 ms.
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
INFO: Converting Estimator..
Aug 29, 2016 4:21:50 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert Estimator
java.lang.ClassCastException: numpy.core.NDArray cannot be cast to java.util.List
        at xgboost.sklearn.XGBClassifier.getClasses(XGBClassifier.java:55)
        at sklearn.Classifier.createSchema(Classifier.java:43)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" java.lang.ClassCastException: numpy.core.NDArray cannot be cast to java.util.List
        at xgboost.sklearn.XGBClassifier.getClasses(XGBClassifier.java:55)
        at sklearn.Classifier.createSchema(Classifier.java:43)
        at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
        at org.jpmml.sklearn.Main.run(Main.java:189)
        at org.jpmml.sklearn.Main.main(Main.java:107)

My Version Info

numpy 1.11.1
pandas 0.18.1
xgboost 0.6
sklearn 0.17.1
joblib 0.10.0
java 1.8.0_91
python 2.7.10

Example

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper
from sklearn.datasets import load_iris
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml import sklearn2pmml

iris = load_iris()

iris_df = pd.concat((pd.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pd.DataFrame(iris.target, columns = ["Species"])), axis = 1)
# change to binary classification problem
iris_df = iris_df[iris_df['Species'] > 0]

# EDIT, not included in original example
iris_mapper = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()]),
    ("Species", None)
])

iris = iris_mapper.fit_transform(iris_df)
iris_X = iris[:, 0:4]
iris_y = iris[:, 4]

iris_clf = XGBClassifier()
iris_clf.fit(iris_X, iris_y)

sklearn2pmml(estimator = iris_clf, mapper = iris_mapper, pmml = "code_output/irisXGB.pmml", with_repr = True)

Add support for `distance` weight function in KNN models

I am getting the "returned non-zero exit status 1" error with the new version 0.17 of sklearn2pmml when using it with GridSearchCV.

Version info

('python: ', '2.7.6')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.19.2')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.17.0')

Code to reproduce

1) Working correctly:

from sklearn.datasets import load_boston
boston_data = load_boston()
X = boston_data.data
y = boston_data.target

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml

knn_pipe = PMMLPipeline([
("regressor", KNeighborsRegressor())
])

knn_pipe.fit(X,y)
sklearn2pmml(knn_pipe, ".../SimpleFit.pmml", with_repr = True, debug = True)

2) Throwing error:

from sklearn.datasets import load_boston
boston_data = load_boston()
X = boston_data.data
y = boston_data.target

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml

knn_pipe = PMMLPipeline([
("regressor", KNeighborsRegressor())
])

param_grid = {"regressor__n_neighbors": [3, 2,10],
          "regressor__weights": ["uniform","distance"],
          "regressor__algorithm": ["auto", "ball_tree", "kd_tree"]}
cv = GridSearchCV(knn_pipe, param_grid=param_grid)
cv.fit(X,y)

Using the following line gives "TypeError: The pipeline object is not an instance of PMMLPipeline" which is understandable.

sklearn2pmml(cv, ".../GridSearchFit.pmml", with_repr = True, debug = True)

So I tried using cv.best_estimator_ in it, but it throws the "returned non-zero exit status 1" error.

sklearn2pmml(cv.best_estimator_, ".../GridSearchFit.pmml", with_repr = True, debug = True)

Stack trace of error:

('python: ', '2.7.6')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.19.2')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.17.0')
java -cp /usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jcommander-1.48.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-19.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.1.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.2.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.5.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.4.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-yd1bTD.pkl.z --repr-pipeline PMMLPipeline(steps=[('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='distance'))]) --pmml-output /home/.../GridSearchFit.pmml
('Preserved joblib dump file(s): ', '/tmp/pipeline-yd1bTD.pkl.z')
Traceback (most recent call last):

  File "<ipython-input-12-b7a0923021e7>", line 1, in <module>
    sklearn2pmml(cv.best_estimator_, "/home/.../GridSearchFit.pmml", with_repr = True, debug = True)

  File "/usr/local/lib/python2.7/dist-packages/sklearn2pmml/__init__.py", line 132, in sklearn2pmml
    subprocess.check_call(cmd)

  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)

CalledProcessError: Command '['java', '-cp', '/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jcommander-1.48.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-19.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.1.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.2.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.5.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.4.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', '/tmp/pipeline-yd1bTD.pkl.z', '--repr-pipeline', "PMMLPipeline(steps=[('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',\n          metric_params=None, n_jobs=1, n_neighbors=10, p=2,\n          weights='distance'))])", '--pmml-output', '/home/.../GridSearchFit.pmml']' returned non-zero exit status 1

Here is the saved pickle file for this error. I renamed it from Grid_pipeline-yd1bTD.pkl.z to Grid_pipeline-yd1bTD.pkl.zip so that I could upload it here.
Grid_pipeline-yd1bTD.pkl.zip

Need help to understand the generated PMML for neural network

I am using the following scikit-learn code to generate the PMML file. I need help understanding how the PMML produces the final result.

pipeline.txt

clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5), random_state=1, max_iter=100)

clf.fit(X_train, y_train)
pipeline = PMMLPipeline([
  ('clf', clf)
])

sklearn2pmml(pipeline, "pipeline.pmml",debug = True)

1) I don't understand the significance of the neuron with id=logistic/1. I specified only one hidden layer with 5 neurons, so why has it been generated in the final PMML? Moreover, my final output is supposed to be a single neuron, so why does the PMML show two, with neuron ids event/true and event/false?

<NeuralLayer activationFunction="logistic">
			<Neuron id="logistic/1" bias="0.0">
				<Con from="2/1" weight="1.0"/>
			</Neuron>
		</NeuralLayer>
		<NeuralLayer activationFunction="identity">
			<Neuron id="event/false" bias="1.0">
				<Con from="logistic/1" weight="-1.0"/>
			</Neuron>
			<Neuron id="event/true" bias="0.0">
				<Con from="logistic/1" weight="1.0"/>
			</Neuron>
		</NeuralLayer>
		<NeuralOutputs>
			<NeuralOutput outputNeuron="event/false">
				<DerivedField optype="categorical" dataType="string">
					<NormDiscrete field="y" value="0"/>
				</DerivedField>
			</NeuralOutput>
			<NeuralOutput outputNeuron="event/true">
				<DerivedField optype="categorical" dataType="string">
					<NormDiscrete field="y" value="1"/>
				</DerivedField>
			</NeuralOutput>
		</NeuralOutputs>
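
Reading the weights and biases in the fragment above literally, the two trailing layers can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the usual PMML neuron rule (activation applied to bias plus weighted inputs) and assuming that "2/1" is the single raw output of the preceding layer; the function names are hypothetical and not part of any library.

```python
import math


def logistic(z):
    """Standard logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_output_layers(z):
    """Apply the two layers from the XML fragment to z, the value of neuron "2/1"."""
    # Layer with activationFunction="logistic": bias 0.0, weight 1.0 from "2/1".
    p = logistic(0.0 + 1.0 * z)
    # Layer with activationFunction="identity":
    event_false = 1.0 + (-1.0) * p  # bias 1.0, weight -1.0 -> 1 - p
    event_true = 0.0 + 1.0 * p      # bias 0.0, weight 1.0  -> p
    # The NeuralOutputs element maps event/false to y="0" and event/true to y="1".
    return {"0": event_false, "1": event_true}
```

Under these assumptions the extra neurons add no trainable parameters: they turn one raw output into a pair of complementary values, one per class label.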

2) Also, I am using the following code to predict the score. I don't understand the difference between evaluator.getOutputFields() and evaluator.getTargetFields(). Moreover, I use boolean target = (boolean)(targetFieldValue instanceof Boolean); because casting the value to Double gives an error. Ideally, my neural network should give the probability as a double, whereas it is giving boolean values. Please help.

			PMML pmml = readPMML(new File("pipeline.pmml"));        	
			ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
			ModelEvaluator<?> modelEvaluator = modelEvaluatorFactory.newModelEvaluator(pmml);
        	Evaluator evaluator = (Evaluator)modelEvaluator;
        	
        	List<InputField> inputFields = evaluator.getInputFields();
        	Map<FieldName, FieldValue> arguments = new LinkedHashMap<FieldName,FieldValue>();
			
			Double[] inputArray = new Double[]{0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9,0.9};

			for (int i = 0; i < inputFields.size(); i++) {
				FieldName inputFieldName = inputFields.get(i).getName();			 
			 	FieldValue inputFieldValue = inputFields.get(i).prepare(inputArray[i]);
			 	arguments.put(inputFieldName, inputFieldValue);
			}
			Map<FieldName, ?> results = evaluator.evaluate(arguments);
			Set keys = results.keySet();

   			for (Iterator i = keys.iterator(); i.hasNext(); ) {
				FieldName targetFieldName = (FieldName)i.next();
    			System.out.println("The field name is " + targetFieldName.getValue());      
   			}

			List<TargetField> targetFields = evaluator.getTargetFields();
			System.out.println(targetFields.size());
			for (int i =0 ; i < targetFields.size(); i++){
				FieldName targetFieldName = targetFields.get(i).getName();
    			System.out.println("The field name is " + targetFieldName.getValue());
    			Object targetFieldValue = results.get(targetFieldName);
    			boolean target = (boolean)(targetFieldValue instanceof Boolean);
    			System.out.println("the output is" + target);
			}

			List<OutputField> outputFields = evaluator.getOutputFields();	
			System.out.println(outputFields.size());			
			for(int i =0 ; i < outputFields.size(); i++ ){
			    FieldName outputFieldName = outputFields.get(i).getName();
				System.out.println("The field name is " + outputFieldName.getValue());

			    Object outputFieldValue = results.get(outputFieldName);
				boolean output = (boolean)(outputFieldValue instanceof Boolean);
				System.out.println("output value is " + output);
			}

XGBClassifier Loses Feature Names

When converting a pickled XGBClassifier instance, the resulting PMML appears to lose the feature names in the data dictionary.

Full reproduction:

data (iris.csv, modified to have headers):

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
...

python:

import pandas as pd
import xgboost as xgb
from sklearn.externals import joblib

iris = pd.read_csv('iris.csv')

model = xgb.XGBClassifier()
model.fit(iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']],
          iris.species)

joblib.dump(model, 'model.pkl')

conversion:

java -jar converter-executable-1.1-SNAPSHOT.jar \
--pkl-input model.pkl \
--pmml-output model.pmml

result:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
	<Header>
		<Application name="JPMML-SkLearn" version="1.1-SNAPSHOT"/>
		<Timestamp>2016-12-09T15:34:49Z</Timestamp>
	</Header>
	<DataDictionary>
		<DataField name="x1" optype="continuous" dataType="float"/>
		<DataField name="x2" optype="continuous" dataType="float"/>
		<DataField name="x3" optype="continuous" dataType="float"/>
		<DataField name="x4" optype="continuous" dataType="float"/>
		<DataField name="y" optype="categorical" dataType="string">
			<Value value="Iris-setosa"/>
			<Value value="Iris-versicolor"/>
			<Value value="Iris-virginica"/>
		</DataField>
	</DataDictionary>
    ...

Passing Parameters to Custom Transformation Function

Hi, I have a column with continuous values. In pandas I can use pd.cut() to group it into bins. I don't want to keep the bin values in a separate file (as suggested in some other issues).

mapper = DataFrameMapper([('customer_id',LabelEncoder()),(['date1' ,'date2'],[customtransform_add(function='subtract-dates')])])

Is there a way I can pass parameters specifying the bins to my own custom transformation function?
Any help would be great.
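
For illustration only, one common way to pass parameters to a transformer on the Python side is through its constructor. The sketch below uses a hypothetical class name (CutTransformer) and plain lists instead of pandas; it mimics the default right-closed intervals of pd.cut(). Whether such a custom class can then be converted to PMML is a separate matter that this sketch does not address.

```python
import bisect


class CutTransformer:
    """Hypothetical transformer that assigns values to bins given at construction time."""

    def __init__(self, bins):
        # Bin edges in ascending order, e.g. [0, 10, 20] for (0, 10] and (10, 20].
        self.bins = list(bins)

    def fit(self, X, y=None):
        # Nothing to learn: the bins are supplied as a parameter.
        return self

    def transform(self, X):
        last = len(self.bins) - 1
        result = []
        for x in X:
            # Index of the right-closed interval (edge[i], edge[i + 1]] containing x.
            idx = bisect.bisect_left(self.bins, x) - 1
            # Values outside all intervals get None (pd.cut() would give NaN).
            result.append(idx if 0 <= idx < last else None)
        return result
```

Usage would look like CutTransformer(bins=[0, 10, 20]).fit(values).transform(values), so the bin edges travel with the object instead of living in a separate file.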

VotingClassifier failure

I'm getting an error when running sklearn2pmml() on an sklearn VotingClassifier.

Here is the code:

gb_1 = GradientBoostingClassifier(n_estimators=n_est_ens)
ext_1 = ExtraTreesClassifier(n_estimators=n_est_ens, criterion='gini', max_features='sqrt')
ext_2 = ExtraTreesClassifier(n_estimators=n_est_ens, criterion='entropy', max_features='log2')
rf_1 = RandomForestClassifier(n_estimators=n_est_ens, criterion='gini', random_state=randint(1,10000))
rf_2 = RandomForestClassifier(n_estimators=n_est_ens, criterion='entropy', random_state=randint(1,10000))
clf = VotingClassifier(estimators=[ ('gb_1', gb_1), ('ext_1', ext_1), ('ext_2', ext_2), ('rf_1', rf_1), ('rf_2', rf_2)], voting=voting, weights=[3,1,1,1,1])

clf.fit(X_train, y_train)
sklearn2pmml(clf, None, pmml_file, with_repr = False, debug = True)

Here is the error:

('python: ', '2.7.12')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('sklearn_pandas: ', '1.1.0')
('sklearn2pmml: ', '0.12.1')
java -cp ~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/guava-19.0.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pyrolite-4.14.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/serpent-1.15.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.4.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar org.jpmml.sklearn.Main --pkl-estimator-input /tmp/estimator-XOBeGz.pkl.z --pmml-output RF200.pmml
Dec 09, 2016 3:48:19 PM org.jpmml.sklearn.Main run
INFO: Parsing Estimator PKL..
Dec 09, 2016 3:48:19 PM org.jpmml.sklearn.Main run
INFO: Parsed Estimator PKL in 35 ms.
Exception in thread "main" java.lang.IllegalArgumentException: The estimator object (Python class numpy.ndarray) is not an Estimator or is not a supported Estimator subclass
	at org.jpmml.sklearn.Main.run(Main.java:180)
	at org.jpmml.sklearn.Main.main(Main.java:107)
('Preserved joblib dump file(s): ', '/tmp/estimator-XOBeGz.pkl.z')
Traceback (most recent call last):
  File "grid_search.py", line 251, in <module>
    sklearn2pmml(clf, None, pmml_file, with_repr = False, debug = True)
  File "~/.local/lib/python2.7/site-packages/sklearn2pmml/__init__.py", line 65, in sklearn2pmml
    subprocess.check_call(cmd)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['java', '-cp', '~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/guava-19.0.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pyrolite-4.14.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/serpent-1.15.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.4.jar:~/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/tmp/estimator-XOBeGz.pkl.z', '--pmml-output', 'RF200.pmml']' returned non-zero exit status 1

Support for Multinomial Naive Bayes

Is there any estimate on when we can expect PMML support for Multinomial naive bayes model in sklearn?

TfidfVectorizer IllegalArgumentException issue while converting to PMML

I'm trying to convert a pipeline saved as pkl into a pmml file using the command:
java -jar converter-executable-1.2-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml

And I get this:

SEVERE: Failed to convert
java.lang.IllegalArgumentException: l2
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:45)
        at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:71)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:102)
        at org.jpmml.sklearn.Main.run(Main.java:133)
        at org.jpmml.sklearn.Main.main(Main.java:99)

Exception in thread "main" java.lang.IllegalArgumentException: l2
        at sklearn.feature_extraction.text.TfidfVectorizer.encodeFeatures(TfidfVectorizer.java:45)
        at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:71)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:102)
        at org.jpmml.sklearn.Main.run(Main.java:133)
        at org.jpmml.sklearn.Main.main(Main.java:99)

The same thing happens, of course, when I try directly from sklearn2pmml:

sklearn2pmml(pipeline, "pipeline.pmml", with_repr=True)

Versions:
sklearn 0.18.1
joblib 0.10.3
sklearn-pandas 1.3.0
sklearn2pmml 0.17.0

Support for sparse matrix types

Selected SkLearn estimators can be fitted with sparse data. This is rather common when working with high-dimensional (n > 10'000) feature spaces.

Currently, such estimators can be handled by JPMML-SkLearn when they have been post-processed by converting attribute values from sparse matrix datatypes (eg. scipy.sparse.csc, scipy.sparse.csr) to numpy.ndarray datatype.

In the future, it would be desirable to perform such conversions in Java library code.
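
The post-processing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's own procedure: the helper name is hypothetical, and it relies only on the fact that scipy.sparse matrices expose a toarray() method that returns a dense numpy.ndarray.

```python
def densify_attributes(estimator):
    """Replace sparse-matrix attributes of a fitted estimator with dense arrays, in place.

    Returns the names of the attributes that were converted.
    """
    converted = []
    for name, value in list(vars(estimator).items()):
        # Duck-typed check: scipy.sparse matrices (csr, csc, ...) provide toarray().
        if callable(getattr(value, "toarray", None)):
            setattr(estimator, name, value.toarray())
            converted.append(name)
    return converted
```

The estimator would then be dumped again (for example with joblib) before running the converter. Note that densifying can use a lot of memory for very high-dimensional attributes.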

Supporting `sklearn.neural_network.MLPClassifier`

Are there any plans to support PMML export of the sklearn multilayer perceptron classifier implementation?

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/multilayer_perceptron.py

Support for custom transformers

I'm trying to use a custom transformer, for example, this simple one takes the max value across columns:

class MaxTransform(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x.max(1) # or np.max(x, 1)

mapper = DataFrameMapper([
    (["x1","x2"], [MaxTransform()])
])

model=ensemble.GradientBoostingClassifier()

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", model)
])

However, when I try to build the pmml, I'm getting :
"The value object (Python class main.MaxTransform) is not a Transformer or is not a supported Transformer subclass"

Am I doing something wrong or is this not currently supported? I'm able to fit the pipeline, but not convert it to pmml.

Support for scikit-neuralnetwork model types

I am using scikit-neuralnetwork, the Python wrapper of the scikit-learn Multilayer Perceptron, to train a neural network and save it to a file. Now I want to expose it in production to predict in real time, so I was thinking of using Java/Golang for better concurrency than Python. Hence my question: how do I use this library to read a model written with Python or the above wrapper? Below is the code I am using to train the model; the last three lines are the ones I want to port to Java/Golang to expose in production.

import pickle
import numpy as np
import pandas as pd
from sknn.mlp import Classifier, Layer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

f = open("TrainLSDataset.csv")
data = np.loadtxt(f,delimiter = ',')

x = data[:, 1:]
y = data[:, 0]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

nn = Classifier(
    layers=[            	    
        Layer("Rectifier", units=5),
        Layer("Softmax")],
    learning_rate=0.001,
    n_iter=100)

nn.fit(X_train, y_train)
filename = 'finalized_model.pkl'
pickle.dump(nn, open(filename, 'wb'))

# The code below is what I want to write in GoLang to expose it in production:
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, y_test)
y_pred = loaded_model.predict(X_test)

I tried using the following command to convert the model produced by the above code to PMML, but it gives the following error. Could you please tell me what I am doing wrong here?

Also, can you share a link to an example where a model is trained in Python and used from Java/Golang to score or predict?

java -jar target/converter-executable-1.2-SNAPSHOT.jar --pkl-input finalized_model.pkl --pmml-output finalized_model.pmml

SEVERE: Failed to parse PKL
net.razorvine.pickle.PickleException: failed to reconstruct()
	at net.razorvine.pickle.objects.Reconstructor.construct(Reconstructor.java:22)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:708)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:176)
	at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:136)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:100)
	at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:157)
	at org.jpmml.sklearn.Main.run(Main.java:111)
	at org.jpmml.sklearn.Main.main(Main.java:99)
Caused by: java.lang.NoSuchMethodException: net.razorvine.pickle.objects.ClassDictConstructor.reconstruct(java.lang.Object, java.lang.Object)
	at java.lang.Class.getMethod(Class.java:1786)
	at net.razorvine.pickle.objects.Reconstructor.construct(Reconstructor.java:19)
	... 7 more

Exception in thread "main" net.razorvine.pickle.PickleException: failed to reconstruct()
	at net.razorvine.pickle.objects.Reconstructor.construct(Reconstructor.java:22)
	at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:708)
	at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:176)
	at org.jpmml.sklearn.PickleUtil$1.dispatch(PickleUtil.java:136)
	at net.razorvine.pickle.Unpickler.load(Unpickler.java:100)
	at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:157)
	at org.jpmml.sklearn.Main.run(Main.java:111)
	at org.jpmml.sklearn.Main.main(Main.java:99)
Caused by: java.lang.NoSuchMethodException: net.razorvine.pickle.objects.ClassDictConstructor.reconstruct(java.lang.Object, java.lang.Object)
	at java.lang.Class.getMethod(Class.java:1786)
	at net.razorvine.pickle.objects.Reconstructor.construct(Reconstructor.java:19)
	... 7 more

Loss of features while converting PMMLPipeline to pmml

Hi,
I have fitted a PMMLPipeline and dumped it to .pkl.z, as seen in this code:

clf = RandomForestClassifier(n_estimators = 10)
mapper = DataFrameMapper([
    (data_class_m.columns.values, [ContinuousDomain(), StandardScaler()])
])
pipeline = PMMLPipeline([
   ("mapper", mapper),
    ("estimator", clf)
])
pipeline.fit(data_class_m, target_m)
joblib.dump(pipeline, "p.pkl.z", compress=9)

and then I converted the pipeline to PMML (I tried both ways - directly from python sklearn2pmml(pipeline, "pipeline.pmml", with_repr=True) and from command line java -jar converter-executable-1.2-SNAPSHOT.jar --pkl-input p.pkl.z --pmml-output pipeline.pmml)

However, after I loaded the PMML into Java using JPMML, I found out that there are only 63 features (InputFields) instead of the 1000 features that were used for fitting. Even when I load the p.pkl.z file back into Python, I can see all of the 1000 features. Is it possible that the converter stores only those features that are actually used in the trained model? Or is there any other reason why I lost so many features during the conversion?
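
One way to check the hypothesis in the question from the Python side is to count how many distinct features the fitted trees actually split on. The sketch below is a hedged illustration: the helper name is hypothetical, and it assumes the per-tree arrays come from scikit-learn's tree_.feature attribute, where leaf nodes carry a negative placeholder value. It does not confirm what the converter itself does.

```python
def used_feature_indices(per_tree_features):
    """Return the sorted indices of features that appear in at least one split.

    per_tree_features: one sequence of node feature indices per tree,
    with negative values marking leaf nodes.
    """
    used = set()
    for features in per_tree_features:
        # Keep only real split features; skip the negative leaf markers.
        used.update(int(f) for f in features if f >= 0)
    return sorted(used)
```

With a fitted RandomForestClassifier named clf, the input could be built as [est.tree_.feature for est in clf.estimators_]. If the resulting count is close to 63, that would be consistent with unused features being dropped.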

[Feature Request] pmml->sklearn

In order for the pmml-sklearn workflow to be fully useful, there needs to be a way of turning PMML representations into scikit-learn models.

AFAIK, this functionality does not exist in any sklearn-PMML library. This means the current flow is one-directional.

Aliases are not generated when sklearn classes are fitted without pipeline

Could you please explain the difference between objRegressor.fit(FeatureSet_df[FeatureSet_df.columns.difference(["Label"])], FeatureSet_df["Output"])
and
pipeline.fit(FeatureSet_df[FeatureSet_df.columns.difference(["Label"])], FeatureSet_df["Output"])? In the latter case, aliases are generated as named in the dataframe in the following code, whereas in the former case they are not. Why is that?

FeatureSet_df = pandas.read_csv("PredictiveData.csv")
objRegressor = MLPRegressor(solver='sgd', alpha=1e-6,
               hidden_layer_sizes=(10), random_state=1, max_iter=150)

objRegressor.fit(FeatureSet_df[FeatureSet_df.columns.difference(["Label"])], FeatureSet_df["Output"])

pipeline = PMMLPipeline([
  ('objRegressor', objRegressor)
])

sklearn2pmml(pipeline, "Testing.pmml")

OneHotEncoder parsing failure

It seems that the OneHotEncoder parsing fails for an unknown reason. Maybe I am using it wrong. My pipeline is as follows :

categorical_features=[1,2,3,4]
oh_encoder = sklearn.preprocessing.OneHotEncoder(categorical_features=categorical_features)
rfc = sklearn.ensemble.RandomForestClassifier(n_estimators=100, min_samples_leaf=20)
pipeline = PMMLPipeline([
   ("encoder", oh_encoder),
   ("classifier", rfc)
])
pipeline.fit(X, y)
joblib.dump(pipeline, "ohe_rfc_pipeline.pkl.z", compress=9)

I then go through the converter jar and get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Expected 1 element(s), got 4 element(s)
	at org.jpmml.sklearn.ClassDictUtil.checkSize(ClassDictUtil.java:136)
	at sklearn.preprocessing.OneHotEncoder.getValues(OneHotEncoder.java:103)
	at sklearn.preprocessing.OneHotEncoder.getDataType(OneHotEncoder.java:52)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:115)
	at org.jpmml.sklearn.Main.run(Main.java:146)
	at org.jpmml.sklearn.Main.main(Main.java:93)

If I set up the OneHotEncoder with only one categorical feature (instead of 4), the error becomes:

Exception in thread "main" java.lang.ClassCastException: sklearn.preprocessing.OneHotEncoder cannot be cast to sklearn.HasNumberOfFeatures
	at sklearn2pmml.PMMLPipeline.initFeatures(PMMLPipeline.java:153)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:115)
	at org.jpmml.sklearn.Main.run(Main.java:146)
	at org.jpmml.sklearn.Main.main(Main.java:93)

Maybe I am missing something in my use of the oneHotEncoder ?

Converting to pmml programmatically

The converter-executable jar converts .pkl files to .pmml files and is very convenient to use.
However, I wish to do that process programmatically as part of my application lifecycle.
Is there a way to do that without manually packing the converter-executable jar with my code base and calling its Main class' main() method? Maybe a java class that does the exact same thing from within any of the artifacts of the evaluator maven dependency?

10x

Support numerical classes in classifiers

It seems when we convert an sklearn model to PMML, the target field is assumed to be a string (even if we had defined numerical classes in sklearn):
https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn/Classifier.java#L51

This makes it challenging to build ensemble models that combine the outputs of multiple different PMML models, for example by taking the average/min/median/max/etc.

Is there any way to ensure these numerical classes are preserved in conversion from sklearn to PMML with the APIs today?