polyaxon / traceml

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

License: Apache License 2.0

Language: Python 100.00%
Topics: pandas, pandas-summary, dataframes, data-science, spark, dask, plotly, statistics, matplotlib, data-profiling

traceml's Introduction


TraceML

Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.

Install

pip install traceml

If you would like to use the tracking features, you need to install polyaxon as well:

pip install polyaxon traceml

[WIP] Local sandbox

Coming soon

Offline usage

You can enable the offline mode to track runs without an API:

export POLYAXON_OFFLINE="true"

Or pass the offline flag when initializing the tracking module:

from traceml import tracking

tracking.init(..., is_offline=True, ...)

Simple usage in a Python script

import random

from traceml import tracking

tracking.init(
    is_offline=True,
    project='quick-start',
    name="my-new-run",
    description="trying TraceML",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

# Tracking some data refs
tracking.log_data_ref(content=X_train, name='x_train')
tracking.log_data_ref(content=y_train, name='y_train')

# Tracking inputs
tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)

def get_loss(step):
    result = 10 / (step + 1)
    noise = (random.random() - 0.5) * 0.5 * result
    return result + noise

# Track metrics
for step in range(100):
    loss = get_loss(step)
    tracking.log_metrics(
        loss=loss,
        accuracy=(100 - loss) / 100.0,
    )

# Track some one time results
tracking.log_outputs(validation_score=0.66)

# Optionally manually stop the tracking process
tracking.stop()

Integration with deep learning and machine learning libraries and frameworks

Keras

You can use TraceML's callback to automatically save all metrics and collect outputs and models; you can also track additional information using the logging methods:

from traceml import tracking
from traceml.integrations.keras import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="keras-run",
    description="trying TraceML & Keras",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)
tracking.log_data_ref(content=x_train, name='x_train')
tracking.log_data_ref(content=y_train, name='y_train')
tracking.log_data_ref(content=x_test, name='x_test')
tracking.log_data_ref(content=y_test, name='y_test')

# ...

model.fit(
    x_train,
    y_train,
    validation_data=(x_test, y_test),
    epochs=epochs,
    batch_size=100,
    callbacks=[Callback()],
)

PyTorch

You can log metrics, inputs, and outputs of PyTorch experiments using the tracking module:

from traceml import tracking

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="pytorch-run",
    description="trying TraceML & PyTorch",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)

# Metrics
for batch_idx, (data, target) in enumerate(train_loader):
    # zero the gradients from the previous step
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    tracking.log_metrics(loss=loss.item())

asset_path = tracking.get_outputs_path('model.ckpt')
torch.save(model.state_dict(), asset_path)

# log model
tracking.log_artifact_ref(asset_path, framework="pytorch", ...)

TensorFlow

You can log metrics, outputs, and models of TensorFlow experiments and distributed TensorFlow experiments using the tracking module:

from traceml import tracking
from traceml.integrations.tensorflow import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="tf-run",
    description="trying TraceML & Tensorflow",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)

# log model
estimator.train(hooks=[Callback(log_image=True, log_histo=True, log_tensor=True)])

Fastai

You can log metrics, outputs, and models of Fastai experiments using the tracking module:

from traceml import tracking
from traceml.integrations.fastai import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="fastai-run",
    description="trying TraceML & Fastai",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

# Log model metrics
learn.fit(..., cbs=[Callback()])

PyTorch Lightning

You can log metrics, outputs, and models of PyTorch Lightning experiments using the tracking module:

from traceml import tracking
from traceml.integrations.pytorch_lightning import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="pytorch-lightning-run",
    description="trying TraceML & Lightning",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

...
trainer = pl.Trainer(
    gpus=0,
    progress_bar_refresh_rate=20,
    max_epochs=2,
    logger=Callback(),
)

HuggingFace

You can log metrics, outputs, and models of HuggingFace experiments using the tracking module:

from traceml import tracking
from traceml.integrations.hugging_face import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="hg-run",
    description="trying TraceML & HuggingFace",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

...
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    callbacks=[Callback],
    # ...
)

Tracking artifacts

import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
from bokeh.plotting import figure
from vega_datasets import data

from traceml import tracking


def plot_mpl_figure(step):
    np.random.seed(19680801)
    data = np.random.randn(2, 100)

    figure, axs = plt.subplots(2, 2, figsize=(5, 5))
    axs[0, 0].hist(data[0])
    axs[1, 0].scatter(data[0], data[1])
    axs[0, 1].plot(data[0], data[1])
    axs[1, 1].hist2d(data[0], data[1])

    tracking.log_mpl_image(figure, 'mpl_image', step=step)


def log_bokeh(step):
    factors = ["a", "b", "c", "d", "e", "f", "g", "h"]
    x = [50, 40, 65, 10, 25, 37, 80, 60]

    dot = figure(title="Categorical Dot Plot", tools="", toolbar_location=None,
                 y_range=factors, x_range=[0, 100])

    dot.segment(0, factors, x, factors, line_width=2, line_color="green", )
    dot.circle(x, factors, size=15, fill_color="orange", line_color="green", line_width=3, )

    factors = ["foo 123", "bar:0.2", "baz-10"]
    x = ["foo 123", "foo 123", "foo 123", "bar:0.2", "bar:0.2", "bar:0.2", "baz-10", "baz-10",
         "baz-10"]
    y = ["foo 123", "bar:0.2", "baz-10", "foo 123", "bar:0.2", "baz-10", "foo 123", "bar:0.2",
         "baz-10"]
    colors = [
        "#0B486B", "#79BD9A", "#CFF09E",
        "#79BD9A", "#0B486B", "#79BD9A",
        "#CFF09E", "#79BD9A", "#0B486B"
    ]

    hm = figure(title="Categorical Heatmap", tools="hover", toolbar_location=None,
                x_range=factors, y_range=factors)

    hm.rect(x, y, color=colors, width=1, height=1)

    tracking.log_bokeh_chart(name='confusion-bokeh', figure=hm, step=step)


def log_altair(step):
    source = data.cars()

    brush = alt.selection(type='interval')

    points = alt.Chart(source).mark_point().encode(
        x='Horsepower:Q',
        y='Miles_per_Gallon:Q',
        color=alt.condition(brush, 'Origin:N', alt.value('lightgray'))
    ).add_selection(
        brush
    )

    bars = alt.Chart(source).mark_bar().encode(
        y='Origin:N',
        color='Origin:N',
        x='count(Origin):Q'
    ).transform_filter(
        brush
    )

    chart = points & bars

    tracking.log_altair_chart(name='altair_chart', figure=chart, step=step)


def log_plotly(step):
    df = px.data.tips()

    fig = px.density_heatmap(df, x="total_bill", y="tip", facet_row="sex", facet_col="smoker")
    tracking.log_plotly_chart(name="2d-hist", figure=fig, step=step)


plot_mpl_figure(100)
log_bokeh(100)
log_altair(100)
log_plotly(100)

Tracking DataFrames

Summary

An extension to the pandas DataFrame describe() function.

The module contains a DataFrameSummary object that extends describe() with:

  • properties
    • dfs.columns_stats: counts, uniques, missing, missing_perc, and type per column
    • dfs.columns_types: a count of the types of columns
    • dfs[column]: a more in-depth summary of the column
  • functions
    • summary(): extends the describe() output with the values from columns_stats (a usage sketch follows below)

DataFrameSummary expects a pandas DataFrame to summarize.

from traceml.summary.df import DataFrameSummary

dfs = DataFrameSummary(df)
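
For illustration, here is a minimal, hedged sketch of the summary() call mentioned above; the toy DataFrame is an assumption and the exact set of returned rows may vary by version:

import pandas as pd

from traceml.summary.df import DataFrameSummary

# toy DataFrame, purely illustrative
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, None], "b": ["x", "y", "y", "z"]})

dfs = DataFrameSummary(df)

# summary() returns describe() extended with the columns_stats values
# (counts, uniques, missing, missing_perc, types)
print(dfs.summary())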

Getting the column types:

dfs.columns_types


numeric     9
bool        3
categorical 2
unique      1
date        1
constant    1
dtype: int64

Getting the column stats:

dfs.columns_stats


                      A            B        C              D              E
counts             5802         5794     5781           5781           4617
uniques            5802            3     5771            128            121
missing               0            8       21             21           1185
missing_perc         0%        0.14%    0.36%          0.36%         20.42%
types            unique  categorical  numeric        numeric        numeric

Getting a single column summary, e.g. a numerical column:

# columns can also be accessed by position, e.g. dfs[1]
dfs['A']

std                                                                 0.2827146
max                                                                  1.072792
min                                                                         0
variance                                                           0.07992753
mean                                                                0.5548516
5%                                                                  0.1603367
25%                                                                 0.3199776
50%                                                                 0.4968588
75%                                                                 0.8274732
95%                                                                  1.011255
iqr                                                                 0.5074956
kurtosis                                                            -1.208469
skewness                                                            0.2679559
sum                                                                  3207.597
mad                                                                 0.2459508
cv                                                                  0.5095319
zeros_num                                                                  11
zeros_perc                                                               0.1%
deviating_of_mean                                                          21
deviating_of_mean_perc                                                  0.36%
deviating_of_median                                                        21
deviating_of_median_perc                                                0.36%
top_correlations                         {u'D': 0.702240243124, u'E': -0.663}
counts                                                                   5781
uniques                                                                  5771
missing                                                                    21
missing_perc                                                            0.36%
types                                                                 numeric
Name: A, dtype: object

[WIP] Summaries

  • Add summary analysis between columns, e.g. dfs[[1, 2]]

[WIP] Visualizations

  • Add summary visualization with matplotlib.
  • Add summary visualization with plotly.
  • Add summary visualization with altair.
  • Add predefined profiling.

[WIP] Catalog and Versions

  • Add possibility to persist summary and link to a specific version.
  • Integrate with quality libraries.

traceml's People

Contributors

antoinetoubhans, antonfriberg, ashton-sidhu, axelperschmann, claytantor, denisoliveirac, dependabot[bot], dxist, elyase, flackbash, gzcf, j-kohn, javidgon, lgeiger, mmourafiq, mofef, nathandemaria, onilton, polyaxon-ci, polyaxon-team, pyup-bot, qipengzhou, ricardofbarros, saintmalik, shotarok, sid-at-granular, timorohner, vfdev-5, wbuchwalter, yu-iskw


traceml's Issues

PyPI is considering 0.0.41 as the latest version

Hello,

Thank you for all the work creating this project. It appears PyPI is considering 0.0.41 the latest/default version instead of 0.0.5; therefore, 'pip install pandas-summary' installs the old version.

Ryan

Show missing columns only

Thanks for the awesome plugin!

  1. Would it be possible to add colors to point out missing values, e.g. a light shade of red if missing is > 0?
  2. Would it be possible to display only the columns with missing values? Sometimes a dataframe has a lot of columns and the user is mostly interested in the missing information (a possible workaround sketch follows below).

I am new to Python; if you point me to where to look, I can create a pull request. Thank you.
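
A possible user-side workaround sketch for point 2, assuming columns_stats is laid out as in the README example above (stat names as rows, dataframe columns as columns); the toy DataFrame is an assumption:

import numpy as np
import pandas as pd

from traceml.summary.df import DataFrameSummary

# toy DataFrame with one column containing missing values (illustrative only)
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

dfs = DataFrameSummary(df)
stats = dfs.columns_stats

# keep only the columns that have at least one missing value
missing_only = stats.loc[:, stats.loc["missing"] > 0]
print(missing_only)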

support for non-hashable objects

If a column has non-hashable objects, creating a DataFrameSummary raises an error:

TypeError: unhashable type: 'list'

It would be great if it also worked for these entries (e.g. by looking at their length), or at the very least discarded these variables.
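
Until this is supported, a possible workaround sketch is to convert unhashable cells (e.g. lists) to a hashable representation before building the summary; the column names here are assumptions:

import pandas as pd

from traceml.summary.df import DataFrameSummary

# toy DataFrame with a column of lists, which are not hashable
df = pd.DataFrame({"tags": [["a", "b"], ["c"], ["a"]], "value": [1, 2, 3]})

# convert list cells to tuples so uniqueness counts can hash them
df["tags"] = df["tags"].apply(lambda v: tuple(v) if isinstance(v, list) else v)

dfs = DataFrameSummary(df)
print(dfs.columns_stats)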

Error in PyTorch Lightning Callback method

In the log_model_summary function there is a reference to a non-existent attribute, self.run.

Using this function raises an error:

rel_path = self.run.get_outputs_path("model_summary.txt")
AttributeError: 'Callback' object has no attribute 'run'

Possible fix:
replacing self.run with self.experiment (as used in other methods) seems to fix the issue.

Merge this into pandas-profiling?

Should we think about merging this with pandas-profiling? Effort would be fairly low, and I like the idea of creating a DataFrameSummary object that you can query and which returns information.

Use more than one thread

I'd love to see these calculations split across multiple threads to reduce the wall-clock computation time, since I often deal with really large data sets.

bad import

The readme examples' import should be 'from datatile.summary.df import DataFrameSummary'.

Python 3 support

Right now it works only on Python 2 (at least due to the print statements; I didn't check whether there are any bigger differences). Do you plan to make it Python 3 compatible?

Patch level went down?

Hi, I am trying to get pandas-summary working, and had version 0.0.41 installed. I ran into the issue fixed by #11/#12, so I tried to upgrade. It seems like the version was changed to 0.0.5 with 42227d5

When I try to upgrade to 0.0.5, pip picks up the 0.0.41 version since its patch level is considered greater than 0.0.5. I am somewhat new to the Python world, so I could be missing an easy way to do this, but wondering if the new version should be updated to 0.0.42 or 0.1.0 to better comply with semver.

pip install pandas-summary==
Collecting pandas-summary==
  Could not find a version that satisfies the requirement pandas-summary== (from versions: 0.0.3, 0.0.4, 0.0.5, 0.0.41)
No matching distribution found for pandas-summary==

For now my workaround is to use pip install pandas-summary==0.0.5 to install the exact version that I want to use.
