In this repository you'll find an open source Python module that gives you a simple binding to interact with BigML. You can use it to easily create, retrieve, list, update, and delete BigML resources (i.e., sources, datasets, models, and predictions).
This module is licensed under the Apache License, Version 2.0.
Please report problems and bugs to our BigML.io issue tracker. Discussions about the different bindings take place in the general BigML mailing list, or join us in our Campfire chatroom.
The only mandatory dependency is the requests library. This library is automatically installed during the setup.
The bindings will also use simplejson if you happen to have it installed, but that is optional: we fall back to Python's built-in JSON libraries if simplejson is not found.
To install:
$ python setup.py install
You can also install the bindings directly from this git repository using pip:
$ pip install -e git://github.com/bigmlcom/python.git#egg=bigml_python
Once installed, you can import the whole module:
import bigml.api
Alternatively you can just import the BigML class:
from bigml.api import BigML
All the requests to BigML.io must be authenticated using your username and API key and are always transmitted over HTTPS.
This module will look for your username and API key in the environment variables BIGML_USERNAME and BIGML_API_KEY, respectively. You can add the following lines to your .bashrc or .bash_profile to set those variables automatically when you log in:
export BIGML_USERNAME=myusername
export BIGML_API_KEY=ae579e7e53fb9abd646a6ff8aa99d4afe83ac291
With that environment set up, connecting to BigML is a breeze:
from bigml.api import BigML
api = BigML()
Otherwise, you can initialize the connection directly when instantiating the BigML class, as follows:
api = BigML('myusername', 'ae579e7e53fb9abd646a6ff8aa99d4afe83ac291')
To run the tests you will need to install lettuce:
$ pip install lettuce
and set up your authentication via environment variables, as explained above. With that in place, you can run the test suite simply by:
$ cd tests
$ lettuce
Imagine that you want to use this csv file containing the Iris flower dataset to predict the species of a flower whose sepal length is 5 and whose sepal width is 2.5. A preview of the dataset is shown below. It has four numeric fields (sepal length, sepal width, petal length, petal width) and a categorical field (species). By default, BigML considers the last field in the dataset as the objective field (i.e., the field that you want to generate predictions for).
sepal length,sepal width,petal length,petal width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
...
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
...
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
You can easily generate a prediction following these steps:
from bigml.api import BigML
api = BigML()
source = api.create_source('./data/iris.csv')
dataset = api.create_dataset(source)
model = api.create_model(dataset)
prediction = api.create_prediction(model, {'sepal length': 5, 'sepal width': 2.5})
You can then print the prediction using the pprint method:
api.pprint(prediction)
You'll see:
species for {"sepal width": 2.5, "sepal length": 5} is Iris-virginica
BigML automatically generates identifiers for each field. To see the fields and the ids and types that have been assigned to a source you can use get_fields:
source = api.get_source(source)
api.pprint(api.get_fields(source))
and you'll get:
{ u'000000': { u'column_number': 0,
u'name': u'sepal length',
u'optype': u'numeric'},
u'000001': { u'column_number': 1,
u'name': u'sepal width',
u'optype': u'numeric'},
u'000002': { u'column_number': 2,
u'name': u'petal length',
u'optype': u'numeric'},
u'000003': { u'column_number': 3,
u'name': u'petal width',
u'optype': u'numeric'},
u'000004': { u'column_number': 4,
u'name': u'species',
u'optype': u'categorical'}}
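Since the create_prediction call takes input data keyed by field name (or id), it can be handy to invert that mapping. A minimal sketch, using a hardcoded copy of the fields dictionary shown above:

```python
# Fields dictionary as returned by api.get_fields(source) for the Iris source
fields = {
    '000000': {'column_number': 0, 'name': 'sepal length', 'optype': 'numeric'},
    '000001': {'column_number': 1, 'name': 'sepal width', 'optype': 'numeric'},
    '000002': {'column_number': 2, 'name': 'petal length', 'optype': 'numeric'},
    '000003': {'column_number': 3, 'name': 'petal width', 'optype': 'numeric'},
    '000004': {'column_number': 4, 'name': 'species', 'optype': 'categorical'},
}

def field_id_by_name(fields):
    """Build a name -> id lookup from a BigML fields dictionary."""
    return {info['name']: field_id for field_id, info in fields.items()}

ids = field_id_by_name(fields)
print(ids['species'])   # '000004'
```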
If you want to get some basic statistics for each field, you can retrieve the fields from the dataset as follows:
dataset = api.get_dataset(dataset)
api.pprint(api.get_fields(dataset))
You will get a dictionary keyed by field id:
{ u'000000': { u'column_number': 0,
u'datatype': u'double',
u'name': u'sepal length',
u'optype': u'numeric',
u'summary': { u'maximum': 7.9,
u'median': 5.77889,
u'minimum': 4.3,
u'missing_count': 0,
u'population': 150,
u'splits': [ 4.51526,
4.67252,
4.81113,
[... snip ... ]
u'000004': { u'column_number': 4,
u'datatype': u'string',
u'name': u'species',
u'optype': u'categorical',
u'summary': { u'categories': [ [ u'Iris-versicolor',
50],
[u'Iris-setosa', 50],
[ u'Iris-virginica',
50]],
u'missing_count': 0}}}
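For quick sanity checks you might pull a few summary values out of that structure with plain Python. A sketch over a trimmed, hardcoded copy of the dictionary above (only two of the five fields are kept, for brevity):

```python
# Trimmed copy of the dataset fields dictionary shown above
dataset_fields = {
    '000000': {'name': 'sepal length', 'optype': 'numeric',
               'summary': {'minimum': 4.3, 'maximum': 7.9,
                           'missing_count': 0}},
    '000004': {'name': 'species', 'optype': 'categorical',
               'summary': {'categories': [['Iris-versicolor', 50],
                                          ['Iris-setosa', 50],
                                          ['Iris-virginica', 50]],
                           'missing_count': 0}},
}

def numeric_ranges(fields):
    """Return {name: (min, max)} for every numeric field with a summary."""
    return {info['name']: (info['summary']['minimum'],
                           info['summary']['maximum'])
            for info in fields.values() if info['optype'] == 'numeric'}

print(numeric_ranges(dataset_fields))  # {'sepal length': (4.3, 7.9)}
```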
One of the greatest things about BigML is that the models that it generates for you are fully white-boxed. To get the model for the example above you can retrieve it as follows:
model = api.get_model(model)
api.pprint(model['object']['model']['root'])
You will get an explicit tree-like predictive model:
{u'children': [
{u'children': [
{u'children': [{u'count': 38,
u'distribution': [[u'Iris-virginica', 38]],
u'output': u'Iris-virginica',
u'predicate': {u'field': u'000002',
u'operator': u'>',
u'value': 5.05}},
u'children': [
[ ... ]
{u'count': 50,
u'distribution': [[u'Iris-setosa', 50]],
u'output': u'Iris-setosa',
u'predicate': {u'field': u'000002',
u'operator': u'<=',
u'value': 2.45}}]},
{u'count': 150,
u'distribution': [[u'Iris-virginica', 50],
[u'Iris-versicolor', 50],
[u'Iris-setosa', 50]],
u'output': u'Iris-virginica',
u'predicate': True}]}}}
(Note that the output in the snippet above has been abbreviated for readability: the full predictive model you get will contain many more details.)
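Because the model is plain JSON, you can walk it with ordinary Python. The sketch below collects the leaf outputs of a small tree in the same shape as model['object']['model']['root']; the tree literal is a simplified, hardcoded example, not the full Iris model:

```python
# A simplified tree in the same shape as model['object']['model']['root']
root = {
    'children': [
        {'count': 50, 'output': 'Iris-setosa',
         'predicate': {'field': '000002', 'operator': '<=', 'value': 2.45}},
        {'count': 38, 'output': 'Iris-virginica',
         'predicate': {'field': '000002', 'operator': '>', 'value': 5.05}},
    ],
    'count': 150,
    'output': 'Iris-virginica',
    'predicate': True,
}

def leaf_outputs(node):
    """Recursively collect the output label of every leaf node."""
    children = node.get('children', [])
    if not children:
        return [node['output']]
    outputs = []
    for child in children:
        outputs.extend(leaf_outputs(child))
    return outputs

print(leaf_outputs(root))  # ['Iris-setosa', 'Iris-virginica']
```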
Newly-created resources are returned in a dictionary with the following keys:
- code: If the request is successful you will get a bigml.api.HTTP_CREATED (201) status code. Otherwise, it will be one of the standard HTTP error codes detailed in the documentation.
- resource: The identifier of the new resource.
- location: The location of the new resource.
- object: The resource itself, as computed by BigML.
- error: If an error occurs and the resource cannot be created, it will contain an additional code and a description of the error. In this case, location and resource will be None.
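In code you would typically check those keys before going on. A minimal sketch: the 201 literal stands in for bigml.api.HTTP_CREATED, and the two response dictionaries (including the resource id) are hardcoded, hypothetical examples:

```python
HTTP_CREATED = 201  # the value bigml.api.HTTP_CREATED denotes

def created_ok(response):
    """Return the new resource id, or None if creation failed."""
    if response['code'] == HTTP_CREATED:
        return response['resource']
    print("creation failed: %s" % response['error'])
    return None

# Hypothetical successful and failed creation responses
ok = {'code': 201, 'resource': 'source/4f603fe203ce89bb2d000000',
      'location': 'https://bigml.io/source/4f603fe203ce89bb2d000000',
      'object': {}, 'error': None}
bad = {'code': 400, 'resource': None, 'location': None, 'object': None,
       'error': {'code': 400, 'status': {'message': 'Bad request'}}}

print(created_ok(ok))    # 'source/4f603fe203ce89bb2d000000'
print(created_ok(bad))   # None
```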
Please bear in mind that resource creation is almost always asynchronous (predictions are the only exception). Therefore, when you create a new source, a new dataset or a new model, even if you receive an immediate response from the BigML servers, the full creation of the resource can take from a few seconds to a few days, depending on the size of the resource and BigML's load. A resource is not fully created until its status is bigml.api.FINISHED. See the documentation on status codes for the listing of potential states and their semantics. So, depending on your application, you might need to import the following constants:
from bigml.api import WAITING
from bigml.api import QUEUED
from bigml.api import STARTED
from bigml.api import IN_PROGRESS
from bigml.api import SUMMARIZED
from bigml.api import FINISHED
from bigml.api import FAULTY
from bigml.api import UNKNOWN
from bigml.api import RUNNABLE
You can query the status of any resource with the status method:
api.status(source)
api.status(dataset)
api.status(model)
api.status(prediction)
Before invoking the creation of a new resource, the library checks that the status of the resource that is passed as a parameter is FINISHED. You can change how often the status will be checked with the wait_time argument. By default, it is set to 3 seconds.
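A common pattern in client code is to poll until a resource reaches FINISHED (or fails). The sketch below factors the loop so it can run against any status function; the numeric codes are illustrative stand-ins for the constants listed above, and the fake status sequence stands in for repeated api.status(...) calls:

```python
import time

FINISHED, FAULTY = 5, -1   # illustrative stand-ins for bigml.api constants

def wait_until_finished(get_status, wait_time=3, max_tries=10):
    """Poll get_status() every wait_time seconds until FINISHED or FAULTY."""
    for _ in range(max_tries):
        code = get_status()
        if code == FINISHED:
            return True
        if code == FAULTY:
            return False
        time.sleep(wait_time)
    raise TimeoutError("resource did not finish in time")

# Simulate a resource that finishes on the third poll
statuses = iter([1, 3, 5])   # e.g. QUEUED, IN_PROGRESS, FINISHED
print(wait_until_finished(lambda: next(statuses), wait_time=0))  # True
```

In real code, get_status would wrap something like `lambda: api.status(dataset)`, unpacking whatever status structure your version of the bindings returns.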
To create a source from a local data file, you can use the create_source method. The only required parameter is the path to the data file. You can use a second optional parameter to specify any of the options for source creation described in the BigML API documentation. Here's a sample invocation:
from bigml.api import BigML
api = BigML()
source = api.create_source('./data/iris.csv',
{'name': 'my source', 'source_parser': {'missing_tokens': ['?']}})
As already mentioned, source creation is asynchronous: the initial resource status code will be either WAITING or QUEUED. You can retrieve the updated status at any time using the corresponding get method. For example, to get the status of our source we would use:
api.status(source)
Once you have created a source, you can create a dataset. The only required argument to create a dataset is a source id. You can add all the additional arguments accepted by BigML and documented here.
For example, to create a dataset named "my dataset" with the first 1024 bytes of a source, you can submit the following request:
dataset = api.create_dataset(source, {"name": "my dataset", "size": 1024})
Upon success, the dataset creation job will be queued for execution, and you can follow its evolution using api.status(dataset).
Once you have created a dataset, you can create a model. The only required argument to create a model is a dataset id. You can also include in the request all the additional arguments accepted by BigML and documented here.
For example, to create a model only including the first two fields and the first 10 instances in the dataset, you can use the following invocation:
model = api.create_model(dataset, {
"name": "my model", "input_fields": ["000000", "000001"], "range": [1, 10]})
Again, the model is scheduled for creation, and you can retrieve its status at any time by means of api.status(model).
You can now use the model resource identifier together with some input parameters to ask for predictions, using the create_prediction method. You can also give the prediction a name:
prediction = api.create_prediction(model,
{"sepal length": 5,
"sepal width": 2.5},
{"name": "my prediction"})
To see the prediction you can use pprint:
api.pprint(prediction)
When retrieved individually, resources are returned as a dictionary identical to the one you get when you create a new resource. However, the status code will be bigml.api.HTTP_OK if the resource can be retrieved without problems, or one of the standard HTTP error codes otherwise.
You can list resources with the appropriate api method:
api.list_sources()
api.list_datasets()
api.list_models()
api.list_predictions()
You will receive a dictionary with the following keys:
- code: If the request is successful you will get a bigml.api.HTTP_OK (200) status code. Otherwise, it will be one of the standard HTTP error codes. See the BigML documentation on status codes for more info.
- meta: A dictionary including the following keys that can help you paginate listings:
  - previous: Path to get the previous page, or None if there is no previous page.
  - next: Path to get the next page, or None if there is no next page.
  - offset: How far off from the first entry in the resources is the first one listed in the resources key.
  - limit: Maximum number of resources that you will get listed in the resources key.
  - total_count: The total number of resources in BigML.
- objects: A list of resources as returned by BigML.
- error: If an error occurs, it will contain an additional code and a description of the error. In this case, meta and resources will be None.
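Using the meta keys you can walk a listing page by page. A sketch that works against any listing function; the hardcoded pages below stand in for successive listing responses such as those returned by api.list_sources:

```python
def all_objects(list_page):
    """Collect objects across pages, following meta['next'] until it is None."""
    objects, path = [], None
    while True:
        page = list_page(path)
        objects.extend(page['objects'])
        path = page['meta']['next']
        if path is None:
            return objects

# Two fake pages standing in for real listing responses
pages = {
    None: {'meta': {'previous': None, 'next': '/page2'},
           'objects': [{'name': 'a'}, {'name': 'b'}]},
    '/page2': {'meta': {'previous': '/page1', 'next': None},
               'objects': [{'name': 'c'}]},
}
print([o['name'] for o in all_objects(lambda p: pages[p])])  # ['a', 'b', 'c']
```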
You can filter resources in listings using the syntax and fields labeled as filterable in the BigML documentation for each resource.
A few examples:
[source['resource'] for source in
api.list_sources("limit=5;created__lt=2012-04-1")['objects']]
[dataset['name'] for dataset in
api.list_datasets("limit=10;size__gt=1048576")['objects']]
[model['name'] for model in api.list_models("columns__gt=5")['objects']]
[prediction['resource'] for prediction in
api.list_predictions("model_status=true")['objects']]
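The filter strings in the examples above appear to be field__operator=value pairs joined by ';', so you can assemble them from a dictionary. A small hypothetical helper (build_query is not part of the bindings):

```python
def build_query(filters):
    """Join filter pairs into the ';'-separated syntax used by list methods."""
    return ';'.join('%s=%s' % (key, value)
                    for key, value in sorted(filters.items()))

query = build_query({'limit': 5, 'created__lt': '2012-04-1'})
print(query)  # 'created__lt=2012-04-1;limit=5'
# ...which could then be passed as api.list_sources(query)
```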
You can order resources in listings using the syntax and fields labeled as sortable in the BigML documentation for each resource.
A few examples:
[source['name'] for source in api.list_sources("order_by=size")['objects']]
[dataset['rows'] for dataset in
api.list_datasets("created__lt=2012-04-1;order_by=size")['objects']]
[model['resource'] for model in
api.list_models("order_by=-number_of_predictions")['objects']]
[prediction['name'] for prediction in
api.list_predictions("order_by=name")['objects']]
When you update a resource, it is returned in a dictionary exactly like the one you get when you create a new one. However, the status code will be bigml.api.HTTP_ACCEPTED if the resource can be updated without problems, or one of the HTTP standard error codes otherwise.
api.update_source(source, {"name": "new name"})
api.update_dataset(dataset, {"name": "new name"})
api.update_model(model, {"name": "new name"})
api.update_prediction(prediction, {"name": "new name"})
Resources can be deleted individually using the corresponding method for each type of resource.
api.delete_source(source)
api.delete_dataset(dataset)
api.delete_model(model)
api.delete_prediction(prediction)
Each of the calls above will return a dictionary with the following keys:
- code: If the request is successful, the code will be a bigml.api.HTTP_NO_CONTENT (204) status code. Otherwise, it will be one of the standard HTTP error codes. See the documentation on status codes for more info.
- error: If the request does not succeed, it will contain a dictionary with an error code and a message. It will be None otherwise.