aai-institute / sensai
The Python library for sensible AI.
Home Page: https://aai-institute.github.io/sensAI/docs/
License: Other
(wip on v0-legacy on my machine)
Currently we support graph-related functionality only for geospatial data. If it is ever needed in a more general context (e.g. for arbitrary Euclidean or even non-Euclidean data, arbitrary triangulations and so on), the classes there should be refactored and extended. Here is a snippet of a conversation highlighting some of the problems with the current structure:
It is related to coordinates, since a Delaunay triangulation only makes sense for Euclidean points. If we want functionality related only to the graph representation, it shouldn't be tightly coupled to a special type of graph (one representing a Delaunay triangulation). The class
```python
class SpanningTree:
    """
    Wrapper around a tree-finding algorithm that will be applied on the Delaunay graph of the datapoints
    """
    def __init__(self, datapoints: np.ndarray, tree_finder: Callable[[nx.Graph], nx.Graph] = nx.minimum_spanning_tree):
```
should not depend on an np.ndarray (which has to be an array of Euclidean coordinates, since a Delaunay triangulation is computed) but rather on an nx.Graph. Moreover, the parameter tree_finder allows constructing a subgraph, which may not be a spanning tree at all. Actually, the class only provides convenience methods for inspecting a weighted (nx) graph. So either the class should be
```python
class WeightedGraphWrapper:
    """
    Wrapper around a nx.Graph
    """
    def __init__(self, graph: nx.Graph):
```
Or, if you really would like to have a spanning tree, it may look like
```python
class SpanningTree:
    def __init__(self, graph: nx.Graph, mode="min"):
        self.tree = nx.minimum_spanning_tree if mode == "min" else ...
```
(or, respectively, using an enum for min/max). In any case, I don't know whether this is enough functionality to justify a class.
As it stands, it is tightly coupled to Euclidean point graphs (even more specifically, Delaunay graphs), so I would put it into geoanalytics. What do you think?
Originally posted by @schroedk in #50 (comment)
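The enum-based variant proposed above could be sketched as follows. This is only an illustration of the suggested refactoring, with hypothetical names (`SpanningTreeMode`, `total_weight`); it is not sensAI's actual API:

```python
from enum import Enum

import networkx as nx


class SpanningTreeMode(Enum):
    MIN = "min"
    MAX = "max"


class SpanningTree:
    """Spanning tree of an arbitrary weighted nx.Graph (no coupling to
    Euclidean coordinates or Delaunay triangulations)."""

    def __init__(self, graph: nx.Graph, mode: SpanningTreeMode = SpanningTreeMode.MIN):
        tree_finder = (nx.minimum_spanning_tree if mode == SpanningTreeMode.MIN
                       else nx.maximum_spanning_tree)
        self.tree = tree_finder(graph)

    def total_weight(self) -> float:
        # convenience method for inspecting the weighted tree
        return sum(w for _, _, w in self.tree.edges.data("weight", default=1.0))
```

Note that the tree-finding strategy is now constrained to actual spanning-tree algorithms, addressing the objection that an arbitrary `tree_finder` callable may not return a spanning tree at all.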
This is already done in the GitLab pipeline; we should also do it on GitHub. Probably won't be totally trivial, though.
See https://github.com/jambit/sensAI/runs/2884663556?check_suite_focus=true
```
RuntimeError: Kernel died before replying to kernel_info
zmq.error.ZMQError: Address already in use
```
@MischaPanch, are you familiar with such errors? Could this be related to the newly introduced caching (#31)?
This can be done through the extras_require mechanism, so that we support something like pip install sensai[tensorflow] and so on.
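A minimal sketch of how such optional dependency groups could be declared for setup.py's extras_require. The extra names and packages here are illustrative, not sensAI's actual configuration:

```python
# optional dependency groups, keyed by extra name (illustrative only)
EXTRAS_REQUIRE = {
    "torch": ["torch"],
    "tensorflow": ["tensorflow"],
    "lightgbm": ["lightgbm"],
}
# an aggregate extra that installs all optional backends at once
EXTRAS_REQUIRE["full"] = sorted({d for deps in EXTRAS_REQUIRE.values() for d in deps})

# in setup.py, this dict would be passed as:
#   setup(..., extras_require=EXTRAS_REQUIRE, ...)
# enabling e.g.:  pip install sensai[tensorflow]
```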
@MischaPanch, new failure encountered.
Looks like the version of pytorch-lightning being used is incompatible with the version of torch being used; probably due to the relaxed version specification (pytorch-lightning~=1.1) in tox.ini.
Originally posted by @opcode81 in #57 (comment)
The overrides of the rule-based models violate the substitution principle. The specific fit methods should be deleted.
It is also unexpected that the fitPreprocessors argument is completely ignored for rule-based models in VectorModel.fit.
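One way to resolve both problems could be to keep a single uniform fit() in the base class and let each model declare whether its underlying model requires fitting, instead of overriding fit() with incompatible signatures. The names below (VectorModelBase, _underlying_model_requires_fitting, fit_preprocessors) are a hypothetical sketch, not sensAI's actual API:

```python
import pandas as pd


class VectorModelBase:
    """Base class with one uniform fit() signature for all models."""

    def _underlying_model_requires_fitting(self) -> bool:
        return True

    def fit(self, X: pd.DataFrame, Y: pd.DataFrame, fit_preprocessors: bool = True):
        if fit_preprocessors:
            self._fit_preprocessors(X)  # honoured for every model, never silently ignored
        if self._underlying_model_requires_fitting():
            self._fit(X, Y)

    def _fit_preprocessors(self, X: pd.DataFrame):
        pass  # preprocessor fitting would happen here

    def _fit(self, X: pd.DataFrame, Y: pd.DataFrame):
        raise NotImplementedError


class RuleBasedModel(VectorModelBase):
    # no fit() override, so the substitution principle is preserved
    def _underlying_model_requires_fitting(self) -> bool:
        return False
```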
Persist several types of models via the v0-legacy branch and add tests that load and apply them: torch models and sklearn models alike, classification and regression.
We get very strange behavior when we add this dependency to setup.py. When it is installed separately during the build, everything is fine, though...
See commit 2625669 and the TODO in it.
With the higher-level vector model interfaces, validation during training of neural networks can currently only be performed with a train-test split. On the other hand, the lower-level NNOptimizer itself supports passing a separate validation set. Pytorch-lightning's Trainer class also has similar capabilities, which are currently disabled in sensai because of VectorModel's fit interface. I ran into this myself because I want to fit pytorch-lightning models to a data frame where the validation set cannot be created by splitting (due to data leakage).
Note that it is not sufficient to use evaluators for performing validation on separate sets since this way the validation set cannot be used for early stopping.
I propose to address this by relaxing VectorModel's interface and allowing a validation set to be passed in fit and fit_classifier.
Something like fit(X, Y, X_validation=None, Y_validation=None)
@opcode81 @schroedk
If you think this is reasonable, I would prepare a PR in the next days.
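A minimal sketch of the relaxed interface with the signature proposed above; the class name and internals are hypothetical placeholders for the actual VectorModel implementation:

```python
from typing import Optional

import pandas as pd


class VectorModelSketch:
    """Hypothetical model accepting an explicit validation set in fit()."""

    def __init__(self):
        self.used_explicit_validation = False

    def fit(self, X: pd.DataFrame, Y: pd.DataFrame,
            X_validation: Optional[pd.DataFrame] = None,
            Y_validation: Optional[pd.DataFrame] = None):
        if X_validation is not None:
            # hand the user-provided validation set down to the underlying
            # trainer, e.g. for early stopping in pytorch-lightning
            self.used_explicit_validation = True
        # ... actual training would happen here ...
        return self
```

This keeps the default behaviour (split-based validation) intact while allowing callers with leakage-prone data to supply their own validation set.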
https://github.com/jambit/sensAI/runs/3392942386?check_suite_focus=true#step:9:307
```
>       if msg['parent_header'].get('msg_id') == msg_id:
E       TypeError: 'coroutine' object is not subscriptable
```
Likely cause: issue in nbconvert
jupyter/jupyter_client#637
Build of branch develop apparently tried to merge develop, which fails if it's non-fast-forward.
Sometimes (rarely) they fail due to the desired accuracy not being reached.
This is crucial for datasets that don't fit in RAM. Special care must be taken with feature generators and DFT transformers, since they typically cannot be trained batch-wise.
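One common way to handle this constraint is to fit the non-incremental preprocessing on an in-memory sample first, and then stream the full dataset chunk by chunk through an incremental learner. The function and parameter names below are a hypothetical sketch, not sensAI's API:

```python
import numpy as np


def fit_out_of_core(transformer, model, chunks, fit_sample: np.ndarray):
    """Fit a non-incremental transformer (e.g. a feature generator or DFT
    transformer) on a manageable in-memory sample, then train an incremental
    model on the full dataset chunk by chunk."""
    transformer.fit(fit_sample)  # fitted once, on the sample only
    for chunk in chunks:         # chunks may be streamed from disk
        model.partial_fit(transformer.transform(chunk))
    return model
```

The obvious caveat is that the transformer only sees the sample, so the sample must be representative of the full dataset.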
For the current workflow, one needs to manually update the rst files (using the script build_scripts/update_docu.py). While the script generates rst files for new modules, it does not delete rst files for modules that no longer exist. So either the build (hopefully) breaks at the documentation build step, or the resulting documentation references non-existing code.
@MischaPanch, is there any reason we track rst files in the repository instead of generating them on the fly? In a different project, we use a script for exactly this task, which is called in the build pipeline, so the documentation always reflects the actual code base.
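Independently of whether the rst files stay tracked, the stale-file half of the problem could be handled by a small cleanup helper like the sketch below. The naming convention assumed here ("pkg.module.rst" documents package_dir/module.py) is hypothetical and would need to match what build_scripts/update_docu.py actually generates:

```python
from pathlib import Path
from typing import List


def remove_stale_rst(rst_dir: str, package_dir: str) -> List[str]:
    """Delete generated .rst files whose corresponding module no longer exists."""
    removed = []
    for rst in sorted(Path(rst_dir).glob("*.rst")):
        if rst.name == "index.rst":
            continue  # keep the hand-written index
        # assumption: "pkg.module.rst" documents package_dir/module.py
        module = rst.stem.split(".")[-1]
        if not (Path(package_dir) / f"{module}.py").exists():
            rst.unlink()
            removed.append(rst.name)
    return removed
```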
This issue is only about setting up models and training, evaluation of such models will be handled in a separate issue. Thus, it is mainly about writing the TensorModel abstraction and some implementations of it.
The goal is to enable training of autoencoders, prediction of 2-dimensional tensors for geospatial analysis and so on.
I am not entirely sure how to deal with models that take multi-dimensional data and predict scalars, like CNN-based classifiers and regressors (a very common use case). VectorModel does not feel like the right fit, nor does TensorModel (because it will focus on tensor-like prediction). @opcode81 Do you have an idea about that?
Since I will probably write some data loaders for images anyway, I wonder whether this can be generalized a bit and used in the rest of sensai. For example, we might have a base DataSet that generates batches of numpy arrays, which are converted to tensors either in a child class or directly in the methods consuming the dataset.
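The base-class idea could look like the following sketch, where the base DataSet yields numpy batches and a framework-specific child performs the tensor conversion. All class and method names are hypothetical:

```python
from typing import Iterator

import numpy as np


class DataSet:
    """Base class yielding batches as plain numpy arrays."""

    def __init__(self, data: np.ndarray, batch_size: int):
        self.data = data
        self.batch_size = batch_size

    def iter_batches(self) -> Iterator[np.ndarray]:
        for start in range(0, len(self.data), self.batch_size):
            yield self.data[start:start + self.batch_size]


class TorchDataSet(DataSet):
    """Child class performing the framework-specific tensor conversion."""

    def iter_batches(self):
        import torch  # imported lazily so the base class stays torch-free
        for batch in super().iter_batches():
            yield torch.from_numpy(batch)
```

Keeping the conversion in the subclass means non-torch consumers (sklearn, tensorflow) can share the same batching logic.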
Originally posted by @MischaPanch in #23 (comment)
We should consider switching to conda for venv configuration (using the YAML-based environment specification: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#sharing-an-environment).
Installing libraries like tensorflow via conda has considerable advantages. In particular, the conda-based installation will install CUDA and CuDNN in versions that match the tf installation, whereas the pip-based installation requires that the matching installations be provided externally, which is extremely painful.
Originally posted by @opcode81 in #21 (comment)
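For illustration, a YAML-based environment specification along the lines proposed above might look like this. The file content is a hedged sketch; package names and versions are placeholders, not sensAI's actual requirements:

```yaml
# environment.yml -- illustrative sketch only
name: sensai
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8
  - tensorflow        # conda resolves matching CUDA/cuDNN builds automatically
  - pip
  - pip:
      - some-pip-only-package   # hypothetical placeholder for pip-only deps
```

The environment would then be created with `conda env create -f environment.yml`.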
For some reason, it stopped working. It should be triggered when releases are created, but this did not happen for the last releases (which were themselves created by a workflow; maybe that's the reason). Because of that, I uploaded the last release to PyPI manually.
I think we can safely update to the newest versions of everything apart from tf, torch and lgbm. Only for pandas would there be a major version bump, but I am pretty confident it does not break any of our code.
This will include new metrics (like intersection over union, structural similarity indices and so on) and also new visualization methods for the special case of 2-dimensional data.
@opcode81 You have a better overview of which branches are still needed (all of them?).
We might also want to update the dependencies, as we should support the latest stable versions of those libs.
pip install fails.
Perhaps this can be fixed by using conda (#24) or a different version of geopandas.
ClusteringModel is not sufficiently general; it should be renamed to EuclidianClusterer, as it assumes that data points are points in a Euclidean space. All subclasses should be named accordingly.
Since geopandas is a problematic dependency (see #45), all geo-analytics-related code should be bundled in a new package, geoanalytics.
This package should also contain all the utils that depend on geopandas.
Restructuring of the package:
- clustering/base/clustering.py -> clustering/clustering_base.py
- clustering/sklearn_clustering.py: shall contain sklearn base classes and specialisations (using prefix "Sk", not "SK")
- clustering/coordinate_clustering: move to geoanalytics
Let's see whether anyone is interested in that
various combinations, e.g. classification with and without probabilities
As a user, I would object to receiving warnings that do not apply. Ideally, we should not warn that there might be a problem but know if there is a problem (by having the components be aware of the fitting requirement) - and throw an exception iff so.
Originally posted by @opcode81 in #33 (comment)
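The "know, don't warn" idea above can be sketched as follows: each component declares whether it requires fitting and tracks whether it has been fitted, and an exception is raised exactly when an unfitted component that requires fitting is used. Class and attribute names are hypothetical:

```python
class Component:
    """Pipeline component that knows its own fitting requirement."""

    def __init__(self, requires_fitting: bool):
        self.requires_fitting = requires_fitting
        self.is_fitted = False

    def fit(self, data):
        self.is_fitted = True
        return self

    def apply(self, data):
        # raise iff there actually is a problem -- no speculative warnings
        if self.requires_fitting and not self.is_fitted:
            raise RuntimeError(f"{type(self).__name__} requires fitting before use")
        return data
```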
test_notebooks.py
.index.rst (most are currently commented out)
To make this "cleaner", we could consider changing the interface of _predict so that it does not return a DF but a more low-level result, and make predict construct the DF correctly.
Originally posted by @opcode81 in #33 (comment)
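The proposed split could be sketched like this: _predict returns a low-level result (here assumed to be a numpy array plus column names, which is an illustrative choice), and the public predict() is solely responsible for assembling a correctly indexed DataFrame. Names and internals are hypothetical:

```python
import numpy as np
import pandas as pd


class ModelSketch:
    def predict(self, X: pd.DataFrame) -> pd.DataFrame:
        values, columns = self._predict(X)
        # DataFrame construction happens in exactly one place,
        # reusing the input's index
        return pd.DataFrame(values, columns=columns, index=X.index)

    def _predict(self, X: pd.DataFrame):
        # dummy low-level result: a numpy array plus column names
        return np.zeros((len(X), 1)), ["prediction"]
```

Subclasses would then only implement _predict and could never construct a malformed DataFrame.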