whylabs / whylogs

An open-source data logging library for machine learning models and data pipelines. πŸ“š Provides visibility into data quality & model performance over time. πŸ›‘οΈ Supports privacy-preserving data collection, ensuring safety & robustness. πŸ“ˆ

Home Page: https://whylogs.readthedocs.io/

License: Apache License 2.0

Topics: ai-pipelines, approximate-statistics, statistical-properties, data-quality, calculate-statistics, python, logging, mlops, dataops, ml-pipelines

whylogs's Introduction


The open standard for data logging


What is whylogs

whylogs is an open source library for logging any kind of data. With whylogs, users are able to generate summaries of their datasets (called whylogs profiles) which they can use to:

  1. Track changes in their dataset
  2. Create data constraints to know whether their data looks the way it should
  3. Quickly visualize key summary statistics about their datasets

These three functionalities enable a variety of use cases for data scientists, machine learning engineers, and data engineers:

  • Detect data drift in model input features
  • Detect training-serving skew, concept drift, and model performance degradation
  • Validate data quality in model inputs or in a data pipeline
  • Perform exploratory data analysis of massive datasets
  • Track data distributions & data quality for ML experiments
  • Enable data auditing and governance across the organization
  • Standardize data documentation practices across the organization
  • And more

If you have any questions, comments, or just want to hang out with us, please join our Slack Community. You can also help this project by giving us a ⭐ in the upper right corner of this page.

Python Quickstart

Installing whylogs using the pip package manager is as easy as running pip install whylogs in your terminal.

From here, you can quickly log a dataset:

import whylogs as why
import pandas as pd

# load your data into a pandas dataframe
df = pd.read_csv("path/to/file.csv")
results = why.log(df)

And there you have it, you now have a whylogs profile. To learn more about what a whylogs profile is and what you can do with it, read on.
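
As a quick, hedged sketch of what you can do next (method names from the whylogs v1 API), the captured summary statistics can be rendered as a pandas dataframe for inspection:

# Each row of the summary describes one column of the logged dataset
summary = results.view().to_pandas()
print(summary.head())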


whylogs Profiles

What are profiles

whylogs profiles are the core of the whylogs library. They capture key statistical properties of data, such as the distribution (far beyond simple mean, median, and standard deviation measures), the number of missing values, and a wide range of configurable custom metrics. By capturing these summary statistics, we are able to accurately represent the data and enable all of the use cases described in the introduction.

whylogs profiles have three properties that make them ideal for data logging: they are efficient, customizable, and mergeable.


Efficient: whylogs profiles efficiently describe the dataset that they represent. This high fidelity representation of datasets is what enables whylogs profiles to be effective snapshots of the data. They are better at capturing the characteristics of a dataset than a sample would beβ€”as discussed in our Data Logging: Sampling versus Profiling blog postβ€”and are very compact.


Customizable: The statistics that whylogs profiles collect are easily configured and customizable. This is useful because different data types and use cases require different metrics, and whylogs users need to be able to easily define custom trackers for those metrics. It’s the customizability of whylogs that enables our text, image, and other complex data trackers.


Mergeable: One of the most powerful features of whylogs profiles is their mergeability. Mergeability means that whylogs profiles can be combined to form new profiles which represent the aggregate of their constituent profiles. This enables logging for distributed and streaming systems, and allows users to view aggregated data across any time granularity.
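
To make mergeability concrete, here is a minimal sketch assuming two pandas dataframes, df_monday and df_tuesday, holding two days of data; profile views expose a merge method in the whylogs v1 API:

import whylogs as why

view_monday = why.log(df_monday).view()
view_tuesday = why.log(df_tuesday).view()

# The merged view describes both days as if they had been logged together
merged_view = view_monday.merge(view_tuesday)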


How do you generate profiles

Once whylogs is installed, it's easy to generate profiles in both Python and Java environments.

To generate a profile from a Pandas dataframe in Python, simply run:

import whylogs as why
import pandas as pd

# load your data into a pandas dataframe
df = pd.read_csv("path/to/file.csv")
results = why.log(df)

What can you do with profiles

Once you’ve generated whylogs profiles, a few things can be done with them:

In your local Python environment, you can set data constraints or visualize your profiles. Setting data constraints on your profiles allows you to get notified when your data don’t match your expectations, allowing you to do data unit testing and some baseline data monitoring. With the Profile Visualizer, you can visually explore your data, allowing you to understand it and ensure that your ML models are ready for production.

In addition, you can send whylogs profiles to the SaaS ML monitoring and AI observability platform WhyLabs. With WhyLabs, you can automatically set up monitoring for your machine learning models, getting notified on both data quality and data change issues (such as data drift). If you’re interested in trying out WhyLabs, check out the always free Starter edition, which allows you to experience the entire platform’s capabilities with no credit card required.

WhyLabs

WhyLabs is a managed service offering built for helping users make the most of their whylogs profiles. With WhyLabs, users can ingest profiles and set up automated monitoring as well as gain full observability into their data and ML systems. With WhyLabs, users can ensure the reliability of their data and models, and debug any problems that arise with them.

Ingesting whylogs profiles into WhyLabs is easy. After obtaining your access credentials from the platform, you’ll need to set them in your Python environment, log a dataset, and write it to WhyLabs, like so:

import whylogs as why
import os
import pandas as pd

os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-0" # ORG-ID is case-sensitive
os.environ["WHYLABS_API_KEY"] = "YOUR-API-KEY"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-0" # the dataset ID of your model project, e.g. "model-0"

df = pd.read_csv("path/to/file.csv")

results = why.log(df)

results.writer("whylabs").write()


If you’re interested in trying out WhyLabs, check out the always free Starter edition, which allows you to experience the entire platform’s capabilities with no credit card required.

Data Constraints

Constraints are a powerful feature built on top of whylogs profiles that enable you to quickly and easily validate that your data looks the way that it should. There are numerous types of constraints that you can set on your data (that numerical data will always fall within a certain range, that text data will always be in JSON format, etc.), and if your dataset fails to satisfy a constraint, you can fail your unit tests or your CI/CD pipeline.

A simple example of setting and testing a constraint is:

import whylogs as why
import pandas as pd
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number

df = pd.read_csv("path/to/file.csv")

profile_view = why.log(df).view()
builder = ConstraintsBuilder(profile_view)
builder.add_constraint(greater_than_number(column_name="col_name", number=0.15))

constraints = builder.build()
constraints.report()
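
If you want the constraint suite to gate a unit test or CI/CD run, the overall result is available as a single boolean; assuming the same v1 constraints API as above, constraints.validate() returns False when any constraint fails:

# Fail the test run (and print the per-constraint report) on any violation
assert constraints.validate(), constraints.report()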

To learn more about constraints, check out the Constraints Example.

Profile Visualization

In addition to being able to automatically get notified about potential issues in data, it’s also useful to be able to inspect your data manually. With the profile visualizer, you can generate interactive reports about your profiles (either a single profile or comparing profiles against each other) directly in your Jupyter notebook environment. This enables exploratory data analysis, data drift detection, and data observability.

To access the profile visualizer, install the [viz] module of whylogs by running pip install "whylogs[viz]" in your terminal. One type of profile visualization that we can create is a drift report; here's a simple example of how to analyze the drift between two profiles:

import whylogs as why

from whylogs.viz import NotebookProfileVisualizer

# df_target and df_reference are pandas dataframes, e.g. current vs. baseline data
result = why.log(pandas=df_target)
prof_view = result.view()

result_ref = why.log(pandas=df_reference)
prof_view_ref = result_ref.view()

visualization = NotebookProfileVisualizer()
visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=prof_view_ref)

visualization.summary_drift_report()


To learn more about visualizing your profiles, check out the Visualizer Example.

Data Types

whylogs supports both structured and unstructured data, specifically:

Data type        | Features | Notebook Example
Tabular Data     | ✅       | Getting started with structured data
Image Data       | ✅       | Getting started with images
Text Data        | ✅       | String Features
Embeddings       | ✅       | Embeddings Distance Logging
Other Data Types | ✋       | Do you have a request for a data type that you don’t see listed here? Raise an issue or join our Slack community and make a request! We’re always happy to help.

Integrations


whylogs can seamlessly interact with different tooling along your Data and ML pipelines. We have currently built integrations with:

  • AWS S3
  • Apache Airflow
  • Apache Spark
  • Mlflow
  • GCS

and much more!

If you want to check out our complete list, please refer to our integrations examples page.

Examples

For a full set of our examples, please check out the examples folder.

Benchmarks of whylogs

By design, whylogs runs directly in the data pipeline or in a sidecar container, and uses highly scalable streaming algorithms to compute statistics. Since data logging with whylogs happens in the same infrastructure where the raw data is being processed, it's important to think about the compute overhead. For the majority of use cases, the overhead is minimal, usually under 1%. For very large data volumes with thousands of features and 10M+ QPS, it can add ~5% overhead. However, for large data volumes, customers are typically in a distributed environment such as Ray or Apache Spark. This means they benefit from whylogs parallelization, and the map-reducible property of whylogs profiles keeps the compute overhead to a minimum. Below are benchmarks that demonstrate how efficient whylogs is at processing tabular data with default configurations (tracking distributions, missing values, counts, cardinality, and schema). Two important advantages of this approach are that parallelization speeds up the calculation and that whylogs scales with the number of features rather than the number of rows. Learn more about how whylogs scales here.

DATA VOLUME | TOTAL COST OF RUNNING WHYLOGS | INSTANCE TYPE | CLUSTER SIZE | PROCESSING TIME
10 GB, ~34M rows x 43 columns | ~$0.026 per 10 GB, or $2.45 per TB | c5a.2xlarge, 8 CPU 16GB RAM, $0.308 on-demand price per hour | 2 instances | 2.6 minutes of profiling time per instance (running in parallel)
10 GB, ~34M rows x 43 columns | ~$0.016 per 10 GB, estimated $1.60 per TB | c6g.2xlarge, 8 CPU 16GB RAM, $0.272 on-demand price per hour | 2 instances | 1.7 minutes of profiling time per instance (running in parallel)
10 GB, ~34M rows x 43 columns | ~$0.045 per 10 GB | c5a.2xlarge, 8 CPU 16GB RAM, $0.308 on-demand price per hour | 16 instances | 33 seconds of profiling time per instance (running in parallel)
80 GB, 83M rows x 119 columns | ~$0.139 per 80 GB | c5a.2xlarge, 8 CPU 16GB RAM, $0.308 on-demand price per hour | 16 instances | 1.7 minutes of profiling time per instance (running in parallel)
100 GB, 290M rows x 43 columns | ~$0.221 per 100 GB | c5a.2xlarge, 8 CPU 16GB RAM, $0.308 on-demand price per hour | 16 instances | 2.7 minutes of profiling time per instance (running in parallel)
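
The map-reducible behavior described above can be sketched in plain Python, with no cluster involved; each chunk of a CSV (path assumed) is profiled independently, and the per-chunk views are folded into one aggregate profile:

from functools import reduce

import pandas as pd
import whylogs as why

# "Map": profile each chunk independently, as each worker would in Spark or Ray
views = [why.log(chunk).view() for chunk in pd.read_csv("path/to/file.csv", chunksize=1_000_000)]

# "Reduce": merge the per-chunk profiles into a single aggregate profile
merged_view = reduce(lambda left, right: left.merge(right), views)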

Usage Statistics

Starting with whylogs v1.0.0, whylogs by default collects anonymous information about a user’s environment. These usage statistics do not include any information about the users or the data they are profiling, only the environment in which the user is running whylogs.

To read more about what usage statistics whylogs collects, check out the relevant documentation.

To turn off Usage Statistics, simply set the WHYLOGS_NO_ANALYTICS environment variable to True, like so:

import os
os.environ['WHYLOGS_NO_ANALYTICS']='True'

Community

If you have any questions, comments, or just want to hang out with us, please join our Slack channel.

Contribute

How to Contribute

We welcome contributions to whylogs. Please see our contribution guide and our development guide for details.

Contributors

Made with contrib.rocks.


whylogs's Issues

"Failed to import MLFLow" too verbose

Whenever you import whylogs when MLflow is not installed, it prints a warning-style message (not emitted as an actual warning or routed through a logger). We should probably change this to a debug-level logging statement:

In [1]: import whylogs
Failed to import MLFLow
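
A minimal sketch of the proposed fix (module layout assumed): guard the import and route the message through the standard logging module at debug level rather than printing it unconditionally:

import logging

logger = logging.getLogger(__name__)

try:
    import mlflow  # noqa: F401
except ImportError:
    # Quiet by default; only visible when debug logging is enabled
    logger.debug("Failed to import MLFLow")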

MLFlow: Utility to visualize MLFlow data based on "runs"

  • MLFlow provides an API to query model artifacts

  • Users should be able to:

    • List all dataset profile names collected
    • For a given name, list all the actual profiles
    • Plot the data for a given profile in a chart

  • Given an experiment, users should be able to visualize whylogs data across different runs

MLFlow: option to output whylogs metrics to MLFlow

  • WhyLogs automatically collects column-level metrics
  • Customers will have the option to output whylogs metrics to MLFlow metrics via the mlflow.log_metric call, as sketched below
  • This will result in a lot of metrics in the output model, so the option should be disabled by default
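
As a hedged sketch of what the opt-in output might look like: mlflow.log_metric is the real MLflow API, but the column_metrics dictionary here is hypothetical, standing in for the column-level values whylogs would extract from a profile:

import mlflow

# Hypothetical per-column metrics pulled from a whylogs profile
column_metrics = {"feature_1.mean": 0.42, "feature_1.null_count": 3.0}

with mlflow.start_run():
    for name, value in column_metrics.items():
        mlflow.log_metric(name, value)  # opt-in only, per the proposal above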

Method to revert matplotlib theming

The MatplotlibProfileVisualizer updates matplotlib rcParams upon initialization. It should either provide a way to revert those changes or apply them only within a temporary scope (e.g. a context manager).

See also: https://stackoverflow.com/questions/35394564/is-there-a-context-manager-for-temporarily-changing-matplotlib-settings
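
For reference, matplotlib already ships a context manager for temporarily scoped settings, as the linked answer suggests; a small sketch with an assumed rcParams override:

import matplotlib.pyplot as plt

# Any rcParams changed inside the block are restored on exit
with plt.rc_context({"figure.figsize": (10, 4)}):
    plt.plot([1, 2, 3])
    plt.show()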

MLFlow NameError

When I don't have the optional MLFlow dependency installed I get the following exception the first time I try to import the numbertracker. The second time I run the import, everything works just fine.

from whylogs.core.statistics import numbertracker



Failed to import MLFLow
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-3964e19b3cb4> in <module>
----> 1 from whylogs.core.statistics import numbertracker

~/src/whylogs-github/src/whylogs/__init__.py in <module>
      4 from .app.session import get_or_create_session
      5 from .app.session import reset_default_session
----> 6 from .mlflow import enable_mlflow
      7 
      8 __all__ = [

~/src/whylogs-github/src/whylogs/mlflow/__init__.py in <module>
----> 1 from .patcher import enable_mlflow
      2 
      3 __all__ = ["enable_mlflow"]

~/src/whylogs-github/src/whylogs/mlflow/patcher.py in <module>
    145 
    146 _active_whylogs = []
--> 147 _original_end_run = mlflow.tracking.fluent.end_run
    148 
    149 

NameError: name 'mlflow' is not defined

Support configurable top-k values

Currently whylogs collects a fixed number of top-k values (128 entries).

Users should be able to:

  • Disable top-k value collection (privacy concerns)
  • Enable for a number of features (whitelisting)
  • Or enable all, but disable on a number of features (blacklisting)
  • Customize k values

Overly specified requirements

Currently the requirements.txt (and really the requirements-dev.txt) specify exact dependency versions. We should change them all from package == version to package >= min_version

Enable merging dataset profiles with mismatched tags

Tags were created to enable grouping when consuming data.

However, if the user tries to merge two dataset profiles with mismatched tags, the operation fails due to very strict tag checking. We should relax this.

Two options:

  • ignore_tags param (bool): this will drop the mismatched tags (might result in empty tags)
  • group_by_tags: List[str]: only check for tag matching among these tags

Boolean Python objects treated as unknown type

If you pass a Python boolean type via a Pandas dataframe, it is treated as type_unknown and is not captured by the number or string tracker objects. Perhaps the expected behavior is for us to treat booleans as a boolean type. (We may also want to have a bool tracker or similar object.)

This would match the behavior when we pass in the strings "True" and "False".

[MAJOR] Using Apache Arrow for managing segment datasets instead of in-memory cache

Proposal: using Apache Arrow to manage multiple datasets instead of relying on in-memory storage

Motivation:

  • Reduce memory footprint of whylogs when running with multiple segments
  • Apache Arrow allows us to use disk to reduce memory access

Challenges:

  • Arrow + Protobuf will require serialization/deserialization, and that can be challenging. Long term we might consider re-implementing the protobuf format using Apache Arrow, or mapping the Protobuf schema to a Parquet schema
  • Arrow only expects write-once-read-many. Also, the Java API is much more difficult to integrate with (we need feature parity)

Cannot track a null item

We cannot track a None item with the standard DatasetProfile.track() interface.

This code succeeds:

In [7]: from whylogs import DatasetProfile
   ...: prof = DatasetProfile("name")
   ...: prof.track("column_name", 1)

But this code fails:

In [6]: from whylogs import DatasetProfile
   ...: prof = DatasetProfile("name")
   ...: prof.track("column_name", None)
   ...:
[autoreload of whylogs.core.datasetprofile failed: Traceback (most recent call last):
  File "/Users/ibackus/miniconda3/envs/whylogs/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "/Users/ibackus/miniconda3/envs/whylogs/lib/python3.7/site-packages/IPython/extensions/autoreload.py", line 394, in superreload
    module = reload(module)
  File "/Users/ibackus/miniconda3/envs/whylogs/lib/python3.7/imp.py", line 314, in reload
    return importlib.reload(module)
  File "/Users/ibackus/miniconda3/envs/whylogs/lib/python3.7/importlib/__init__.py", line 169, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 630, in _exec
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/ibackus/src/whylogs-github/src/whylogs/core/datasetprofile.py", line 780, in <module>
    prof.track("column_name", None)
  File "/Users/ibackus/src/whylogs-github/src/whylogs/core/datasetprofile.py", line 176, in track
    for column_name, data in columns.items():
AttributeError: 'str' object has no attribute 'items'
]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-f6c1e172475f> in <module>
      1 from whylogs import DatasetProfile
      2 prof = DatasetProfile("name")
----> 3 prof.track("column_name", None)

~/src/whylogs-github/src/whylogs/core/datasetprofile.py in track(self, columns, data)
    174             self.track_datum(columns, data)
    175         else:
--> 176             for column_name, data in columns.items():
    177                 self.track_datum(column_name, data)
    178

AttributeError: 'str' object has no attribute 'items'

TypeError: '>' not supported between instances of 'int' and 'NoneType'

When .whylogs.yaml is missing from the path

import whylogs
session = whylogs.get_or_create_session()
logger = session.logger('live', with_rotation_time='s')
logger.log({"a": 1})

Output:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-739ae624067c> in <module>
      2 session = whylogs.get_or_create_session()
      3 logger = session.logger('live', with_rotation_time='s')
----> 4 logger.log({"a": 1})

/Volumes/Workspace/whylogs-python/src/whylogs/app/logger.py in log(self, features, feature_name, value, segments, profile_full_dataset)
    304 
    305         if self.should_rotate():
--> 306             self._rotate_time()
    307 
    308         # segmnet check  in case segments are just keys

/Volumes/Workspace/whylogs-python/src/whylogs/app/logger.py in _rotate_time(self)
    213         self.flush(rotation_suffix)
    214 
--> 215         if len(self._profiles) > self.cache:
    216             self._profiles[-self.cache-1] = None
    217 

TypeError: '>' not supported between instances of 'int' and 'NoneType'

I think this is caused by this line here: https://github.com/whylabs/whylogs-python/blob/mainline/src/whylogs/app/session.py#L50

And cache is not passed in when the session is created
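
A possible guard, sketched as a standalone snippet (attribute names taken from the traceback above): fall back to a default cache size when none was forwarded by the session:

# Sketch: what Logger ends up with when Session doesn't forward a cache value
cache = None
cache = cache if cache is not None else 1  # proposed default
profiles = ["profile-a", "profile-b"]
if len(profiles) > cache:
    profiles[-cache - 1] = None  # no longer compares int to None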

[OS] Support windows

Current datasketches library is not published for Windows.

In order to support Windows we need to:

  • Publish datasketches for Windows
  • Support Windows pathing logic (testing + verification)

WhyLogs Community

Let's have a community on Slack/Discord for discussions, issues, or for getting help from one another.

Stacktrace at the end of the unit test

See: https://github.com/whylabs/whylogs-python/pull/87/checks?check_run_id=1329517256

============================= 136 passed in 3.26s ==============================
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/__init__.py", line 32, in <module>
    import mlflow.tracking._model_registry.fluent
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/tracking/__init__.py", line 8, in <module>
    from mlflow.tracking.client import MlflowClient
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/tracking/client.py", line 8, in <module>
    from mlflow.entities import ViewType
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/entities/__init__.py", line 6, in <module>
    from mlflow.entities.experiment import Experiment
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/entities/experiment.py", line 2, in <module>
    from mlflow.entities.experiment_tag import ExperimentTag
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/entities/experiment_tag.py", line 2, in <module>
    from mlflow.protos.service_pb2 import ExperimentTag as ProtoExperimentTag
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/protos/service_pb2.py", line 18, in <module>
    from .scalapb import scalapb_pb2 as scalapb_dot_scalapb__pb2
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/mlflow/protos/scalapb/scalapb_pb2.py", line 15, in <module>
    from google.protobuf import descriptor_pb2 as google_dot_protobuf_dot_descriptor__pb2
  File "/opt/hostedtoolcache/Python/3.7.9/x64/lib/python3.7/site-packages/google/protobuf/descriptor_pb2.py", line 1912, in <module>
    '__module__' : 'google.protobuf.descriptor_pb2'
TypeError: A Message class can only inherit from Message

Support passing arbitrary AWS credentials for s3 logging

whylogs currently takes the default AWS profile via .aws config or environment variables, but it would be beneficial to be able to directly pass arbitrary AWS credentials to the logger. whylogs currently uses smart-open to write to S3, which already supports passing a boto3 session for this purpose with this PR.

Perhaps this can be implemented in the .whylogs.yaml or by passing a boto3 session to the logger and then passing to smart-open?
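
A hedged sketch of the boto3-session approach: recent smart-open releases accept an explicit S3 client via transport_params (older releases took a session parameter instead, so check your installed version):

import boto3
from smart_open import open as smart_open

session = boto3.Session(aws_access_key_id="...", aws_secret_access_key="...")

# Use explicit credentials instead of the default AWS profile
with smart_open("s3://bucket/profile.bin", "wb", transport_params={"client": session.client("s3")}) as f:
    f.write(b"...")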

DatasetProfile.track cannot handle `None` data

I encountered this when processing a dataframe with values equal to None. A DatasetProfile cannot be generated for data that has None values when running track_dataframe.

Probably we can have separate methods for tracking an individual value versus a dictionary of values.

from whylogs import DatasetProfile
p = DatasetProfile('name')
p.track('column name', None)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-52-9c0f4562f8ba> in <module>
      1 from whylogs import DatasetProfile
      2 p = DatasetProfile('name')
----> 3 p.track('column name', None)

~/miniconda3/envs/data/lib/python3.8/site-packages/whylogs/core/datasetprofile.py in track(self, columns, data)
    182             self.track_datum(columns, data)
    183         else:
--> 184             for column_name, data in columns.items():
    185                 self.track_datum(column_name, data)
    186 

AttributeError: 'str' object has no attribute 'items'

Expose dataset profile logging

We should expose dataset profile logging directly. This is very useful. Users should be able to override dataset metadata when logging as well (such as dataset name).

Pypi metadata are missing/incorrect

  • Missing short description
  • Incorrect license (should be Apache v2)
  • The project links/documentation/stars/forks etc. are linked to pyscaffold rather than whylogs (we used pyscaffold to set up the package structure)

logging

We need to add lots more logging, especially at the debug level, all over WhyLogs.

Exception on log_dataframe() after RAPIDS change

Calling log_dataframe() on either a Session or Logger object produces the following error. This appears to be because we are passing data to the internal DatasetProfile.track() incorrectly, or track() is mishandling the string object.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-8d9b5da2f1f0> in <module>
      1 sess = whylogs.get_or_create_session()
----> 2 profile = sess.log_dataframe(df)

~/miniconda3/envs/wlprod/lib/python3.8/site-packages/whylogs/app/session.py in log_dataframe(self, df, dataset_name, dataset_timestamp, session_timestamp, tags, metadata)
    151         )
    152 
--> 153         ylog.log_dataframe(df)
    154 
    155         return ylog.close()

~/miniconda3/envs/wlprod/lib/python3.8/site-packages/whylogs/app/logger.py in log_dataframe(self, df)
    148         if not self._active:
    149             return
--> 150         self._profile.track_dataframe(df)
    151 
    152     def is_active(self) -> bool:

~/miniconda3/envs/wlprod/lib/python3.8/site-packages/whylogs/core/datasetprofile.py in track_dataframe(self, df)
    228             x = df[col].values
    229             for xi in x:
--> 230                 self.track(col_str, xi)
    231 
    232     def to_properties(self):

~/miniconda3/envs/wlprod/lib/python3.8/site-packages/whylogs/core/datasetprofile.py in track(self, columns, data)
    182             self.track_datum(columns, data)
    183         else:
--> 184             for column_name, data in list(columns.items()):
    185                 self.track_datum(column_name, data)
    186 

AttributeError: 'str' object has no attribute 'items'

None field values being logged as string type

When trying to log a list of dictionaries like this:

to_send =  [
            {
                "residual": None,
                "timestamp": 1553004853455
            },
            {
                "residual": None,
                "timestamp": 1553004853662
            },
            {
                "residual": None,
                "timestamp": 1553004853868
            },
            {
                "residual": 0.1,
                "timestamp": 1553004854485
            },
            {
                "residual": 0.2,
                "timestamp": 1553004854697
            },
            {
                "residual": 0.3,
                "timestamp": 1553004854909
            }
            ]

Using the logger API:

for r in to_send:
    logger.log(r)

The None values are being treated as str, and no information about these values is retrieved by the plot_missing_values function.

I believe the ideal behaviour would be to treat None values the same way np.nan values are currently treated.
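
Until that changes, one workaround is to normalize the records before logging them; a sketch assuming the to_send list and logger from the snippets above:

import numpy as np

for r in to_send:
    # Replace None with np.nan so the values reach the number tracker
    cleaned = {k: (np.nan if v is None else v) for k, v in r.items()}
    logger.log(cleaned)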

whylogs mainline requires Pillow as a hard dependency

Trying to run whylogs for Pandas and getting this:

  File "/miniconda/envs/custom_env/lib/python3.7/site-packages/whylogs/__init__.py", line 2, in <module>
    from .app.config import SessionConfig, WriterConfig
  File "/miniconda/envs/custom_env/lib/python3.7/site-packages/whylogs/app/__init__.py", line 5, in <module>
    from .session import SessionConfig
  File "/miniconda/envs/custom_env/lib/python3.7/site-packages/whylogs/app/session.py", line 12, in <module>
    from whylogs.app.logger import Logger
  File "/miniconda/envs/custom_env/lib/python3.7/site-packages/whylogs/app/logger.py", line 10, in <module>
    from whylogs.app.writers import Writer
  File "/miniconda/envs/custom_env/lib/python3.7/site-packages/whylogs/app/writers.py", line 15, in <module>
    from whylogs.core import DatasetProfile
  File "/miniconda/envs/custom_env/lib/python3.7/site-packages/whylogs/core/__init__.py", line 8, in <module>
    from .image_profiling import TrackImage, _METADATA_DEFAULT_ATTRIBUTES
  File "/miniconda/envs/custom_env/lib/python3.7/site-packages/whylogs/core/image_profiling.py", line 2, in <module>
    from PIL.Image import Image as ImageType
ModuleNotFoundError: No module named 'PIL'

Stop using s3fs

S3fs consumes aiobotocore, which pins botocore version: aio-libs/aiobotocore#840

With the new pip resolution logic, we are seeing failures when consuming whylogs in an environment that depends on a different version of boto3.

This means that users cannot install whylogs AND use a different version of boto3. This is a major blocker for most people using the package.

Solution: need to stop using s3fs

pandas warning related to viz module

  A value is trying to be set on a copy of a slice from a DataFrame.
  Try using .loc[row_indexer,col_indexer] = value instead
  
  See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
    isetter(ilocs[0], value)

I keep seeing these warnings in our unit test log
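
The warning itself comes from pandas chained assignment; the generic fix, shown here on a toy dataframe unrelated to whylogs internals, is to assign through a single .loc call:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Chained indexing such as df[df["a"] > 1]["b"] = 0 writes to a copy and warns;
# a single .loc assignment writes to the original frame
df.loc[df["a"] > 1, "b"] = 0.0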

[WIP] whylogs API revamp

whylogs API:

Motivations:

  • Complex configurations for whylogs
  • YAML configurations are hard to write
  • Tagging is confusing at the moment
  • APIs are complex
  • We need to support various writers (whylabs, s3)
  • Need to enable custom top-k for log collection
  • Need to be able to disable data collection at session/dataset/feature level
  • Support lineage
  • Support tagging
  • Support segmentation
  • Compatible cross language

Design inspiration:

  • Spark DataFrame API: works well across different APIs
  • Builder pattern for creating a session
  • Configurations can be stored in files
session = whylogs.WhyLogsSession.builder
		# pipeline support
		.pipeline('facial-recognition-1')
		.stage('beta')
		# config for logger
		.config('whylogs.tracker.topk', 'enabled')
		.config('whylogs.tracker.topk.log_k', '7')
		.config('whylogs.tracker.counters.nullString', 'NaN,NaT,Nil')
		# config targeting feature name across session
		.config('whylogs.feature[feature_1].tracker.topK', 'disabled')
		.config('whylogs.feature[feature_1].tracking', 'disabled') # disable tracking entirely
		# writers
		.config("whylogs.writers.s3", "enabled")
		.config("whylogs.writers.whylabs", "enabled")
		# integration
		.config('whylogs.mlflow.enabled', 'true')
		.config('whylogs.pyspark.enabled', 'true')
		# summary config
		.config('whylogs.summary.histogram.n_bin', 128)
		# whylabs
		.config('whylabs.api.org_id', 'org-123')
		.config('whylabs.api.key', API_key)
		.config('whylabs.api.endpoint', endpoint)
		# metadata
		.tag("key", "value").tag("key", "value")
		.getOrCreate()

Logger

Create a session from config

session = whylogs.WhyLogsSession.builder.getOrCreate() 

Accessing config object

session._conf
session.describe()
Pipeline ID: facial-recognition-1
Pipeline ID: UUID
print(session.run_id) # UUID about the session

Creating a logger:

Similarly, using the builder pattern:

logger = session.logger(name=Optional[str])
      .withId(custom_uuid)
      .withDatasetTimestamp(dt: datetime.datetime, ts: int)
      .withTags(...)
      .withTag(....)
      .withSegmentKeys(...)
      .withSegments(....)
      .withRotation(rotation_time: string)
      .withCacheSize(...)
      .withConstraints(...)
      .withNLP(TBD)
      .withImageMetrics(TBD)
      .withProgress(...)
      .withSegmentConfig(...) # TODO: figure out how to fit metadata into this picture
      .withWhyLabs() # Enable Whylabs integration
      .withWriters(writers..)  # Overwrite default writer
      .withOutputFormats(...)
      .dependsOn(trace_id: str, dataset_profile: whylogs.DatasetProfile)

logger.withTags(override tags).logPandas(...)
logger.withTags(...).logParquet(...)
logger.withConstraints(another_constraint).logCsv(...)
logger1 = session.logger('dataset-1') # always need to specify a dataset name

Override top-level config:

logger1.setWriters(writers).setOutputFormats(...).config("whylogs.feature[feature_1].tracker.topK", "enabled")

Do logging:

logger1.close() # write to writer

Tracing:

logger1.id # logger UUID
logger1.trace
logger2 = session.logger('dataset-2', datetime(2019, 1, 23, 0, 0)) # dataset name + time

Tracing between datasets

logger3 = session.logger('dataset-1').dependsOn(logger1)

Decorators

  • Introduce decorator pattern
@whylogger('dataset-id', datetime(...), ...TBD)
def my_datasource() -> pd.DataFrame:
	return dataframe

MLFlow integration: batch logging support

MLFlow is the most popular model deployment framework at the moment. We'd like to support WhyLogs logging in MLFlow.

  • WhyLogs as a model artifact: when running an MLFlow experiment run, we'd like to output WhyLogs as a model artifact using the mlflow.log_artifact(..) call (see the sketch after this list). This will enable WhyLogs to appear under the artifacts/whylogs path of the model artifact storage system. See PR
  • WhyLogs metrics as experiment metrics: when users trigger whylogs in mlflow, the library will call into mlflow.log_metrics() to emit whylogs metrics if enabled. See #70
  • There will be a "detailed" metric mode where metrics of all columns are reported. This mode is disabled by default as it can increase the workload of the MLFlow system. See #70
  • Analyzing whylogs with mlflow runs. See #71
  • Production monitoring for python_function model deployments. See #69
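
A minimal sketch of the artifact half of this, using the real mlflow.log_artifact call and assuming a profile has already been written to a local file:

import mlflow

profile_path = "whylogs-output/profile.bin"  # assumed to exist already

with mlflow.start_run():
    # Lands under the artifacts/whylogs path of the run's artifact store
    mlflow.log_artifact(profile_path, artifact_path="whylogs")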

Cannot log multiple batches with a single session

I've got a list of dataframes, each representing a batch of data, each with a different corresponding data timestamp.
I should be able to, with a single session, log each of them independently, but I cannot.

Example

# .whylogs.yaml

# Example WhyLogs YAML configuration
project: example-project
pipeline: example-pipeline
verbose: false
writers:
# Save out the full protobuf datasketches data locally
- formats:
    - protobuf
  output_path: whylogs-output
  # Template variables can be accessed via $variable or ${variable}
  path_template: $name/dataset_profile
  filename_template: datase_profile-$dataset_timestamp
  type: local
# Save out the flat summary data locally, separately from the protobuf
- formats:
    - flat
    - json
  output_path: whylogs-output
  path_template: $name/dataset_summary
  filename_template: dataset_summary-$dataset_timestamp
  type: local

Code

from whylogs.app.session import get_or_create_session, reset_session

# Just make sure no previous session is active
reset_session()

# Load config from disk and return the active session
session = get_or_create_session()
for df in data_batches:
    data_timestamp = df['issue_d'].max()
    print('Data batch timestamp:', data_timestamp)
    with session.logger(dataset_timestamp=data_timestamp) as ylog:
        ylog.log_dataframe(df)

Output:

Data batch timestamp: 2020-08-26 00:00:00
WARNING: attempting to close a closed logger
Data batch timestamp: 2020-08-27 00:00:00
WARNING: attempting to close a closed logger
Data batch timestamp: 2020-08-28 00:00:00
WARNING: attempting to close a closed logger
Data batch timestamp: 2020-08-29 00:00:00
WARNING: attempting to close a closed logger
Data batch timestamp: 2020-08-30 00:00:00
WARNING: attempting to close a closed logger
Data batch timestamp: 2020-08-31 00:00:00
WARNING: attempting to close a closed logger

Additionally, only one set of files is generated, rather than one per batch:

ibackus@WhyLabs-Isaac:$ tree whylogs-output/
whylogs-output/
└── example-project
    β”œβ”€β”€ dataset_profile
    β”‚Β Β  └── protobuf
    β”‚Β Β      └── datase_profile-1598400000000.bin
    └── dataset_summary
        β”œβ”€β”€ flat_table
        β”‚Β Β  └── dataset_summary-1598400000000.csv
        β”œβ”€β”€ freq_numbers
        β”‚Β Β  └── dataset_summary-1598400000000.json
        β”œβ”€β”€ frequent_strings
        β”‚Β Β  └── dataset_summary-1598400000000.json
        β”œβ”€β”€ histogram
        β”‚Β Β  └── dataset_summary-1598400000000.json
        └── json
            └── dataset_summary-1598400000000.json

9 directories, 6 files

`Writer.path_suffix` fails with no name specified

When no name is specified, Writer.path_suffix can break output by returning a path suffix starting with '/'. Later path joining uses os.path (probably not good behavior) and can make this the root, overriding other path output.

Use https instead of Git SSH protocol

Users will get a 128 exit code error if they don't have an SSH key registered with GitHub:

    check=True,
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/git', 'clone', '--depth', '1', '[email protected]:whylabs/whylogs-examples.git']' returned non-zero exit status 128.

We should switch to using https:

https://github.com/whylabs/whylogs-python/blob/1304d196c216c2075c4b530257564c79422a7b0d/src/whylogs/cli/demo_cli.py#L21

Support human-readable datetime for output paths

From the community slack:

and a followup question is in the yaml file when i am calling the $dataset_timestamp variable, it outputs a unix timestamp. Is there any way i can use this variable and still get a normal date like β€˜2020-12-18’. Actually i am using this variable to name the subfolders and i want them to be dated normally, not in a unix timestamp

This is currently not supported by whylogs, simply because we couldn't decide what granularity to support (if you go down to seconds you get ':' characters, which make Linux paths very unhappy).

We can probably have some precanned template values such as dataset_timestamp_iso8601_date (you'll get YYYY-MM-DD) and dataset_timestamp_iso8601_hour (you'll get YYYY-MM-DD"T"HH, like 2020-12-18T09).

Further granularity should probably be thought through, since full ISO 8601 timestamps involve ':' characters, which Unix file system paths will scream at you about.
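
The precanned values suggested above are easy to derive from the millisecond timestamps whylogs already writes into paths; a sketch of the two proposed granularities:

from datetime import datetime, timezone

ts_millis = 1598400000000  # a $dataset_timestamp value, as seen in output paths
dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)

print(dt.strftime("%Y-%m-%d"))     # dataset_timestamp_iso8601_date -> 2020-08-26
print(dt.strftime("%Y-%m-%dT%H"))  # dataset_timestamp_iso8601_hour -> 2020-08-26T00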

Use pydantic to parse whylogs config

We're using marshmallow to parse the whylogs config from YAML.

However, Pydantic is much more powerful, as it allows users to set config via various mechanisms, from YAML and JSON to environment settings.

We should consider moving to pydantic.
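
A hedged sketch of what a pydantic-based config might look like; the class and field names are illustrative (loosely mirroring the .whylogs.yaml example elsewhere in these issues), not the actual whylogs schema, and BaseSettings is the pydantic v1 spelling:

import yaml
from pydantic import BaseSettings

class SessionConfig(BaseSettings):
    project: str = "default-project"
    pipeline: str = "default-pipeline"
    verbose: bool = False

    class Config:
        env_prefix = "WHYLOGS_"  # e.g. WHYLOGS_PROJECT overrides the default

# Values can come from YAML, the environment, or both
with open(".whylogs.yaml") as f:
    overrides = {k: v for k, v in yaml.safe_load(f).items() if k != "writers"}
config = SessionConfig(**overrides)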
