
License: Apache License 2.0

Topics: data-profiling, datascience, knowledge-graph, pipelines, linked-data-science

kglids's Introduction

KGLiDS - Linked Data Science Powered by Knowledge Graphs

*Figure: KGLiDS architecture overview.*

In recent years, we have witnessed growing interest from academia and industry in applying data science technologies to analyze large amounts of data. While this process creates a myriad of artifacts (datasets, pipeline scripts, etc.), there has so far been no systematic attempt to holistically collect and exploit the knowledge and experience implicitly contained in those artifacts. Instead, data scientists resort to recovering information and experience from colleagues, or learn via trial and error. Hence, we present KGLiDS, a scalable system that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables a variety of downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML, and shows that KGLiDS is significantly faster and has a lower memory footprint than the state of the art while achieving comparable or better accuracy.

Quickstart on Colab

Try out our KGLiDS Colab Demo and KGLiDS DataPrep Demo, which demonstrate our APIs on Kaggle data!

Linked Data Science: Systems and Applications

To learn more about Linked Data Science and its applications, please watch Dr. Mansour's talk at the Waterloo DSG Seminar (here).

Installation

  • Clone the kglids repo
  • Create a kglids Conda environment (Python 3.8) and activate it
  • Install the pip requirements

```bash
git clone https://github.com/CoDS-GCS/kglids.git
cd kglids
conda create -n kglids python=3.8 -y
conda activate kglids
pip install -r requirements.txt
```

Generating the LiDS graph:

First, configure the data sources to process, e.g.:

```python
# sample configuration
# list of data sources to process
data_sources = [DataSource(name='benchmark',
                           path='/home/projects/sources/kaggle',
                           file_type='csv')]
```

Then run the three components in order (run each step from the repository root):

```bash
# 1. profile the data sources
cd kg_governor/data_profiling/src/
python main.py

# 2. build the knowledge graph from the profiles
cd kg_governor/knowledge_graph_construction/src/
python data_global_schema_builder.py

# 3. abstract the pipeline scripts
cd kg_governor/pipeline_abstraction/
python pipelines_analysis.py
```

Uploading the LiDS graph to the graph engine (we recommend GraphDB): please see populate_graphdb.py for an example of uploading graphs to GraphDB.
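Alternatively, a minimal upload can be scripted directly against GraphDB's RDF4J-style REST endpoint. The sketch below uses only the standard library; the base URL, repository name, and file path are placeholders, and populate_graphdb.py remains the reference example:

```python
import urllib.request

def statements_endpoint(base_url: str, repository: str) -> str:
    """Build the RDF4J-style statements endpoint GraphDB exposes per repository."""
    return f"{base_url.rstrip('/')}/repositories/{repository}/statements"

def upload_turtle(base_url: str, repository: str, turtle_path: str) -> int:
    """POST a Turtle file into the repository; GraphDB answers 204 on success."""
    with open(turtle_path, "rb") as f:
        payload = f.read()
    request = urllib.request.Request(
        statements_endpoint(base_url, repository),
        data=payload,
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# e.g. upload_turtle("http://localhost:7200", "kglids", "lids_graph.ttl")
```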


Using the KGLiDS APIs:

KGLiDS provides predefined operations in the form of Python APIs that allow seamless integration with a conventional data science pipeline. Check out the full list of KGLiDS APIs.

LiDS Ontology

To store the created knowledge graph in a standardized and well-structured way, we developed an ontology for linked data science: the LiDS Ontology.
Check out the LiDS Ontology!
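To illustrate the flavor of such a graph, a LiDS-style fragment might look like the following Turtle. The prefix and term names here are hypothetical placeholders for illustration, not the actual ontology vocabulary:

```turtle
@prefix kglids: <http://kglids.org/ontology/> .   # hypothetical prefix
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# a column node, its parent table, and a profiled property (illustrative only)
<http://kglids.org/resource/benchmark/movies/movieid>
    a                  kglids:Column ;
    kglids:isPartOf    <http://kglids.org/resource/benchmark/movies> ;
    kglids:hasDataType "int" .
```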

Benchmarks

The following benchmark datasets were used to evaluate KGLiDS:

KGLiDS APIs

See the full list of supported APIs here.

Citing Our Work

If you find our work useful, please cite it in your research.

@article{kglids,
  title   = {Linked Data Science Powered by Knowledge Graphs},
  author  = {Mossad Helali and Shubham Vashisth and Philippe Carrier and Katja Hose and Essam Mansour},
  year    = {2023},
  journal = {arXiv preprint arXiv:2303.02204},
  url     = {https://arxiv.org/abs/2303.02204}
}

Contributions

We encourage contributions and bug fixes; please don't hesitate to open a PR or create an issue if you encounter any bugs.

Questions

For any questions, please contact us:

[email protected]

[email protected]

kglids's People

Contributors

gmossadhelali, mansoure, mossadhelali, nikimonjazeb, p-carrier, shubhamvashisth7


kglids's Issues

Reading tables in data profiling

@mossadhelali I think there is a problem here when dealing with some messy CSV files:

2022-11-16 21:25:48.854185 : Creating tables, Getting columns
Traceback (most recent call last):
  File "/mnt/miniconda3/envs/kglids/lib/python3.8/site-packages/pandas/io/parsers/python_parser.py", line 742, in _next_iter_line
    return next(self.data)
_csv.Error: ',' expected after '"'
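As a possible workaround until the profiler handles this, malformed lines can be skipped at read time. This is a sketch using only the standard library, not the profiler's code; note that parsing line by line drops legitimate multi-line quoted fields:

```python
import csv

def read_rows_skipping_bad(path):
    """Parse a CSV file line by line, skipping lines that violate strict
    quoting rules (the same condition that raises the error above)."""
    rows = []
    with open(path, newline="") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            try:
                rows.append(next(csv.reader([line], strict=True)))
            except csv.Error:
                continue  # e.g. ',' expected after '"'
    return rows
```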

Inclusion dependency for columns with single values

If a column has exactly one unique value, it is excluded from inclusion-dependency detection. Consequently, if two columns contain exactly the same single unique value, no inclusion dependency (and therefore no PK-FK relationship) is detected between them, which shouldn't be the case.
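A minimal sketch of set-based inclusion-dependency scoring (not the KGLiDS implementation) shows why filtering out single-valued columns drops valid candidates:

```python
def inclusion_score(col_a, col_b):
    """Fraction of col_a's unique values that also appear in col_b;
    1.0 means col_a is fully included in col_b."""
    a, b = set(col_a), set(col_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

# Two columns that each contain a single identical value are fully
# included in one another, yet a filter requiring "more than one unique
# value" would never compare them at all.
status_a = ["active", "active", "active"]
status_b = ["active"]
print(inclusion_score(status_a, status_b))  # 1.0
```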

Tuning Content Similarity Parameters

For the MovieLens graph on the discovery VM, even with a score threshold of 0.75, these pairs are undetected: [('movies2actors', 'actorid', 'actors', 'actorid'), ('movies2actors', 'movieid', 'movies', 'movieid'), ('movies2directors', 'movieid', 'movies', 'movieid'), ('movies2directors', 'directorid', 'directors', 'directorid'), ('u2base', 'movieid', 'movies', 'movieid'), ('u2base', 'userid', 'users', 'userid')]
I have tried these sets of parameters:

```python
LABEL_SIM_THRESHOLD = 0.75
BOOLEAN_SIM_THRESHOLD = 0.75
EMBEDDING_SIM_THRESHOLD = 1.0
```

and

```python
LABEL_SIM_THRESHOLD = 0.75
BOOLEAN_SIM_THRESHOLD = 0.75
EMBEDDING_SIM_THRESHOLD = 0.75
```

Do you have any suggestions for how I could tune this to get better results?
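One way to narrow this down: the missed pairs all have identical column names (e.g. actorid vs. actorid), so any reasonable label-similarity measure should already score them 1.0, well above 0.75. That suggests the embedding/content side is what filters these pairs out. A rough check using difflib (not the project's actual similarity function):

```python
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two column labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

missed_pairs = [("actorid", "actorid"), ("movieid", "movieid"),
                ("directorid", "directorid"), ("userid", "userid")]
for left, right in missed_pairs:
    print(left, right, label_similarity(left, right))  # all 1.0, above 0.75
```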

Column data types are incorrectly inferred

Spark's read_csv infers column types automatically. For some reason, some numerical columns are inferred as string columns. This might be a reason for the performance drop.

HINT: it appears that if a column has a NaN value, it is automatically inferred as string. Verify and fix accordingly.
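A possible fix, sketched here independently of Spark: infer the type from the non-missing values only, so a NaN cell cannot force the whole column to string. This is a minimal standard-library sketch, not the profiler's actual code:

```python
def infer_column_type(values):
    """Return 'int', 'float', or 'string' for a column of raw cell strings,
    ignoring missing entries instead of letting them force 'string'."""
    missing = {"", "nan", "na", "null", "none"}
    present = [v for v in values if str(v).strip().lower() not in missing]
    if not present:
        return "string"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)  # raises ValueError if any value doesn't fit
            return name
        except ValueError:
            continue
    return "string"

print(infer_column_type(["1", "2", "NaN", "4"]))  # int, despite the NaN
print(infer_column_type(["1.5", "", "2.0"]))      # float
```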

Inclusion dependencies are duplicated with different scores

The original implementation generates duplicate similarity triples between columns with different scores. Example:

<<col1 hasInclusionDependency col2>> withCertainty 0.97
<<col2 hasInclusionDependency col1>> withCertainty 0.97
<<col1 hasInclusionDependency col2>> withCertainty 0.96
<<col2 hasInclusionDependency col1>> withCertainty 0.96

Proposed Solution: make the scores asymmetric?

<<col1 hasInclusionDependency col2>> withCertainty 0.97
<<col2 hasInclusionDependency col1>> withCertainty 0.96
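Independent of whether the scores themselves become asymmetric, the duplicates can be collapsed by keeping a single best-scoring triple per ordered column pair. A sketch, not the actual triple-generation code:

```python
def deduplicate_triples(triples):
    """Collapse duplicate (subject, object, score) inclusion triples,
    keeping the highest score seen for each ordered pair."""
    best = {}
    for subj, obj, score in triples:
        key = (subj, obj)
        if key not in best or score > best[key]:
            best[key] = score
    return sorted((s, o, sc) for (s, o), sc in best.items())

raw = [("col1", "col2", 0.97), ("col2", "col1", 0.97),
       ("col1", "col2", 0.96), ("col2", "col1", 0.96)]
print(deduplicate_triples(raw))
# [('col1', 'col2', 0.97), ('col2', 'col1', 0.97)]
```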

Profiler error

I'm getting errors when running the profiler. I have tried both the SAP and Credit datasets, and I have attached the logs. I think there is an error in classifying the data types, as both of these datasets were profiled with the older version of kglids.
SAP-error.txt
Credit-error.txt
