aai-institute / sensai
The Python library for sensible AI.
Home Page: https://aai-institute.github.io/sensAI/docs/
License: Other
(wip on v0-legacy on my machine)
Currently we support graph-related functionality only for geospatial data. If it is ever needed in a more general context (e.g. for arbitrary Euclidean or even non-Euclidean data, arbitrary triangulations and so on), the classes there should be refactored and extended. Here is a snippet of a conversation highlighting some of the problems with the current structure:
It is related to coordinates, since a Delaunay triangulation only makes sense for Euclidean points. If we want functionality related only to the graph representation, it shouldn't be tightly coupled to a special type of graph (one representing a Delaunay triangulation). The class
```python
class SpanningTree:
    """
    Wrapper around a tree-finding algorithm that will be applied on the Delaunay graph of the datapoints
    """
    def __init__(self, datapoints: np.ndarray, tree_finder: Callable[[nx.Graph], nx.Graph] = nx.minimum_spanning_tree):
```
should not depend on an np.ndarray (which has to be an array of Euclidean coordinates, since a Delaunay triangulation is computed) but rather on an nx.Graph. Moreover, the parameter tree_finder allows constructing a subgraph, which may not be a spanning tree at all. Actually, the class only provides convenience methods for inspecting a weighted (nx) graph. So either the class should be
```python
class WeightedGraphWrapper:
    """
    Wrapper around a nx.Graph
    """
    def __init__(self, graph: nx.Graph):
```
Or, if you really would like to have a spanning tree, it may look like
```python
class SpanningTree:
    def __init__(self, graph: nx.Graph, mode="min"):
        self.tree = nx.minimum_spanning_tree if mode == "min" else ...
```
(or, respectively, using an enum for min/max). In any case, I don't know whether this is enough functionality to justify a class.
As it stands, it is tightly coupled to Euclidean point graphs (even more specifically, Delaunay graphs), so I would put it into geoanalytics. What do you think?
Originally posted by @schroedk in #50 (comment)
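The enum-based variant proposed above could be sketched as follows. This is only an illustration of the suggested refactoring, with hypothetical names (`SpanningTreeMode`, `total_weight`); it is not sensAI's actual API:

```python
from enum import Enum

import networkx as nx


class SpanningTreeMode(Enum):
    MIN = "min"
    MAX = "max"


class SpanningTree:
    """Spanning tree of an arbitrary weighted nx.Graph (no coupling to
    Euclidean coordinates or Delaunay triangulations)."""

    def __init__(self, graph: nx.Graph, mode: SpanningTreeMode = SpanningTreeMode.MIN):
        tree_finder = (nx.minimum_spanning_tree if mode == SpanningTreeMode.MIN
                       else nx.maximum_spanning_tree)
        self.tree = tree_finder(graph)

    def total_weight(self) -> float:
        # convenience method for inspecting the weighted tree
        return sum(w for _, _, w in self.tree.edges.data("weight", default=1.0))
```

Note that the tree-finding strategy is now constrained to actual spanning-tree algorithms, addressing the objection that an arbitrary `tree_finder` callable may not return a spanning tree at all.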
This is already done in the GitLab pipeline; we should also do it on GitHub. Probably won't be totally trivial, though.
See https://github.com/jambit/sensAI/runs/2884663556?check_suite_focus=true
```
RuntimeError: Kernel died before replying to kernel_info
zmq.error.ZMQError: Address already in use
```
@MischaPanch, are you familiar with such errors? Could this be related to the newly introduced caching (#31)?
This can be done through the extras_require mechanism, so that we support something like pip install sensai[tensorflow] and so on.
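A minimal sketch of how such optional dependency groups could be declared for setup.py's extras_require. The extra names and packages here are illustrative, not sensAI's actual configuration:

```python
# optional dependency groups, keyed by extra name (illustrative only)
EXTRAS_REQUIRE = {
    "torch": ["torch"],
    "tensorflow": ["tensorflow"],
    "lightgbm": ["lightgbm"],
}
# an aggregate extra that installs all optional backends at once
EXTRAS_REQUIRE["full"] = sorted({d for deps in EXTRAS_REQUIRE.values() for d in deps})

# in setup.py, this dict would be passed as:
#   setup(..., extras_require=EXTRAS_REQUIRE, ...)
# enabling e.g.:  pip install sensai[tensorflow]
```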
@MischaPanch, new failure encountered.
Looks like the version of pytorch-lightning being used is incompatible with the version of torch being used; probably due to the relaxed version specification (pytorch-lightning~=1.1) in tox.ini.
Originally posted by @opcode81 in #57 (comment)
The overrides of the rule-based models violate the substitution principle. The specific fit methods should be deleted.
It is also unexpected that the fitPreprocessors argument is completely ignored for rule-based models in VectorModel.fit.
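One way to resolve both problems could be to keep a single uniform fit() in the base class and let each model declare whether its underlying model requires fitting, instead of overriding fit() with incompatible signatures. The names below (VectorModelBase, _underlying_model_requires_fitting, fit_preprocessors) are a hypothetical sketch, not sensAI's actual API:

```python
import pandas as pd


class VectorModelBase:
    """Base class with one uniform fit() signature for all models."""

    def _underlying_model_requires_fitting(self) -> bool:
        return True

    def fit(self, X: pd.DataFrame, Y: pd.DataFrame, fit_preprocessors: bool = True):
        if fit_preprocessors:
            self._fit_preprocessors(X)  # honoured for every model, never silently ignored
        if self._underlying_model_requires_fitting():
            self._fit(X, Y)

    def _fit_preprocessors(self, X: pd.DataFrame):
        pass  # preprocessor fitting would happen here

    def _fit(self, X: pd.DataFrame, Y: pd.DataFrame):
        raise NotImplementedError


class RuleBasedModel(VectorModelBase):
    # no fit() override, so the substitution principle is preserved
    def _underlying_model_requires_fitting(self) -> bool:
        return False
```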
Persist several types of models via the v0-legacy branch and add tests that load and apply them: torch models and sklearn models alike, classification and regression.
We get very strange behavior when we add this dependency to setup.py. When it is installed separately during the build, everything is fine, though...
See commit 2625669 and the TODO in it.
With the higher-level vector model interfaces, validation during training of neural networks can currently only be performed with a train-test split. On the other hand, the lower-level NNOptimizer itself supports passing a separate validation set. Pytorch-lightning's Trainer class also has similar capabilities, which are currently disabled in sensai because of VectorModel's fit interface. I ran into this myself because I want to fit pytorch-lightning models to a data frame where the validation set cannot be created by splitting (due to data leakage).
Note that it is not sufficient to use evaluators for performing validation on separate sets since this way the validation set cannot be used for early stopping.
I propose to address this by relaxing VectorModel's interface and allowing a validation set to be passed in fit and fit_classifier.
Something like fit(X, Y, X_validation=None, Y_validation=None)
@opcode81 @schroedk
If you think this is reasonable, I would prepare a PR in the next days.
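A minimal sketch of the relaxed interface with the signature proposed above; the class name and internals are hypothetical placeholders for the actual VectorModel implementation:

```python
from typing import Optional

import pandas as pd


class VectorModelSketch:
    """Hypothetical model accepting an explicit validation set in fit()."""

    def __init__(self):
        self.used_explicit_validation = False

    def fit(self, X: pd.DataFrame, Y: pd.DataFrame,
            X_validation: Optional[pd.DataFrame] = None,
            Y_validation: Optional[pd.DataFrame] = None):
        if X_validation is not None:
            # hand the user-provided validation set down to the underlying
            # trainer, e.g. for early stopping in pytorch-lightning
            self.used_explicit_validation = True
        # ... actual training would happen here ...
        return self
```

This keeps the default behaviour (split-based validation) intact while allowing callers with leakage-prone data to supply their own validation set.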
https://github.com/jambit/sensAI/runs/3392942386?check_suite_focus=true#step:9:307
```
>       if msg['parent_header'].get('msg_id') == msg_id:
E       TypeError: 'coroutine' object is not subscriptable
```
Likely cause: issue in nbconvert
jupyter/jupyter_client#637
Build of branch develop apparently tried to merge develop, which fails if it's non-fast-forward.
Sometimes (rarely) they fail due to the desired accuracy not being reached.
This is crucial for datasets that don't fit in RAM. Special care must be taken with feature generators and DFT transformers, since they typically cannot be trained batch-wise.
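One common way to handle this constraint is to fit the non-incremental preprocessing on an in-memory sample first, and then stream the full dataset chunk by chunk through an incremental learner. The function and parameter names below are a hypothetical sketch, not sensAI's API:

```python
import numpy as np


def fit_out_of_core(transformer, model, chunks, fit_sample: np.ndarray):
    """Fit a non-incremental transformer (e.g. a feature generator or DFT
    transformer) on a manageable in-memory sample, then train an incremental
    model on the full dataset chunk by chunk."""
    transformer.fit(fit_sample)  # fitted once, on the sample only
    for chunk in chunks:         # chunks may be streamed from disk
        model.partial_fit(transformer.transform(chunk))
    return model
```

The obvious caveat is that the transformer only sees the sample, so the sample must be representative of the full dataset.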
For the current workflow, one needs to manually update the rst files (using the script build_scripts/update_docu.py). While the script generates rst files for new modules, it does not delete rst files for modules that no longer exist. So either the build (hopefully) breaks at the documentation build step, or the resulting documentation references non-existing code.
@MischaPanch, is there any reason we track rst files in the repository instead of generating them on the fly? In a different project, we use a script for exactly this task, which is called in the build pipeline, so the documentation always reflects the actual code base.
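Independently of whether the rst files stay tracked, the stale-file half of the problem could be handled by a small cleanup helper like the sketch below. The naming convention assumed here ("pkg.module.rst" documents package_dir/module.py) is hypothetical and would need to match what build_scripts/update_docu.py actually generates:

```python
from pathlib import Path
from typing import List


def remove_stale_rst(rst_dir: str, package_dir: str) -> List[str]:
    """Delete generated .rst files whose corresponding module no longer exists."""
    removed = []
    for rst in sorted(Path(rst_dir).glob("*.rst")):
        if rst.name == "index.rst":
            continue  # keep the hand-written index
        # assumption: "pkg.module.rst" documents package_dir/module.py
        module = rst.stem.split(".")[-1]
        if not (Path(package_dir) / f"{module}.py").exists():
            rst.unlink()
            removed.append(rst.name)
    return removed
```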
This issue is only about setting up models and training, evaluation of such models will be handled in a separate issue. Thus, it is mainly about writing the TensorModel abstraction and some implementations of it.
The goal is to enable training of autoencoders, prediction of 2-dimensional tensors for geospatial analysis and so on.
I am not entirely sure how to deal with models that take multi-dimensional data and predict scalars, like CNN-based classifiers and regressors (a very common use case). VectorModel does not feel like the right fit, nor does TensorModel (because it will focus on tensor-like prediction). @opcode81 Do you have an idea about that?
Since I will probably write some data loaders for images anyway, I wonder whether this can be generalized a bit and used in the rest of sensai. For example, we might have a base DataSet that generates batches of numpy arrays, which are converted to tensors either in a child class or directly in the methods consuming the dataset.
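The base-class idea could look like the following sketch, where the base DataSet yields numpy batches and a framework-specific child performs the tensor conversion. All class and method names are hypothetical:

```python
from typing import Iterator

import numpy as np


class DataSet:
    """Base class yielding batches as plain numpy arrays."""

    def __init__(self, data: np.ndarray, batch_size: int):
        self.data = data
        self.batch_size = batch_size

    def iter_batches(self) -> Iterator[np.ndarray]:
        for start in range(0, len(self.data), self.batch_size):
            yield self.data[start:start + self.batch_size]


class TorchDataSet(DataSet):
    """Child class performing the framework-specific tensor conversion."""

    def iter_batches(self):
        import torch  # imported lazily so the base class stays torch-free
        for batch in super().iter_batches():
            yield torch.from_numpy(batch)
```

Keeping the conversion in the subclass means non-torch consumers (sklearn, tensorflow) can share the same batching logic.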
Originally posted by @MischaPanch in #23 (comment)
We should consider switching to conda for venv configuration (using the YAML-based environment specification: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#sharing-an-environment).
Installing libraries like tensorflow via conda has considerable advantages. In particular, the conda-based installation will install CUDA and CuDNN in versions that match the tf installation, whereas the pip-based installation requires that the matching installations be provided externally, which is extremely painful.
Originally posted by @opcode81 in #21 (comment)
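For illustration, a YAML-based environment specification along the lines proposed above might look like this. The file content is a hedged sketch; package names and versions are placeholders, not sensAI's actual requirements:

```yaml
# environment.yml -- illustrative sketch only
name: sensai
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8
  - tensorflow        # conda resolves matching CUDA/cuDNN builds automatically
  - pip
  - pip:
      - some-pip-only-package   # hypothetical placeholder for pip-only deps
```

The environment would then be created with `conda env create -f environment.yml`.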
For some reason, it stopped working. It should be triggered when releases are created, but this did not happen for the last releases (which were themselves created by a workflow; maybe that's the reason). Because of that, I uploaded the last release to PyPI manually.
I think we can safely update to the newest versions of everything apart from tf, torch and lgbm. Only for pandas would there be a major version bump, but I am pretty confident it does not break any of our code.
This will include new metrics (like intersection over union, structural similarity indices and so on) and also new visualization methods for the special case of 2-dimensional data.
@opcode81 You have a better overview of which branches are still needed (all of them?).
We might also want to update the dependencies, as we should support the latest stable versions of those libs.
pip install fails.
Perhaps this can be fixed by using conda (#24) or a different version of geopandas.
ClusteringModel is not sufficiently general; it should be renamed to EuclidianClusterer, as it assumes that data points are points in a Euclidean space. All subclasses should be named accordingly.
Since geopandas is a problematic dependency (see #45), all geo-analytics-related code should be bundled in a new package, geoanalytics.
This package should also contain all the utils that depend on geopandas.
Restructuring of the package:
- clustering/base/clustering.py -> clustering/clustering_base.py
- clustering/sklearn_clustering.py: shall contain sklearn base classes and specialisations (using prefix "Sk", not "SK")
- clustering/coordinate_clustering: move to geoanalytics
Let's see whether anyone is interested in that
various combinations, e.g. classification with and without probabilities
As a user, I would object to receiving warnings that do not apply. Ideally, we should not warn that there might be a problem but know if there is a problem (by having the components be aware of the fitting requirement) - and throw an exception iff so.
Originally posted by @opcode81 in #33 (comment)
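The "know, don't warn" idea above can be sketched as follows: each component declares whether it requires fitting and tracks whether it has been fitted, and an exception is raised exactly when an unfitted component that requires fitting is used. Class and attribute names are hypothetical:

```python
class Component:
    """Pipeline component that knows its own fitting requirement."""

    def __init__(self, requires_fitting: bool):
        self.requires_fitting = requires_fitting
        self.is_fitted = False

    def fit(self, data):
        self.is_fitted = True
        return self

    def apply(self, data):
        # raise iff there actually is a problem -- no speculative warnings
        if self.requires_fitting and not self.is_fitted:
            raise RuntimeError(f"{type(self).__name__} requires fitting before use")
        return data
```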
test_notebooks.py
.index.rst (most are currently commented out)
To make this "cleaner", we could consider changing the interface of _predict so that it does not return a DF but a more low-level result, and make predict construct the DF correctly.
Originally posted by @opcode81 in #33 (comment)
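The proposed split could be sketched like this: _predict returns a low-level result (here assumed to be a numpy array plus column names, which is an illustrative choice), and the public predict() is solely responsible for assembling a correctly indexed DataFrame. Names and internals are hypothetical:

```python
import numpy as np
import pandas as pd


class ModelSketch:
    def predict(self, X: pd.DataFrame) -> pd.DataFrame:
        values, columns = self._predict(X)
        # DataFrame construction happens in exactly one place,
        # reusing the input's index
        return pd.DataFrame(values, columns=columns, index=X.index)

    def _predict(self, X: pd.DataFrame):
        # dummy low-level result: a numpy array plus column names
        return np.zeros((len(X), 1)), ["prediction"]
```

Subclasses would then only implement _predict and could never construct a malformed DataFrame.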