
License: Apache License 2.0

Topics: data-profiling, datascience, knowledge-graph, pipelines, linked-data-science

kglids's Introduction

KGLiDS - Linked Data Science Powered by Knowledge Graphs

*Figure: KGLiDS architecture overview.*

In recent years, we have witnessed growing interest from academia and industry in applying data science technologies to analyze large amounts of data. While this process creates a myriad of artifacts (datasets, pipeline scripts, etc.), there has so far been no systematic attempt to holistically collect and exploit the knowledge and experience implicitly contained in those artifacts. Instead, data scientists resort to recovering information and experience from colleagues, or learn via trial and error. Hence, we present KGLiDS, a scalable system that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables a variety of downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML, and shows that KGLiDS is significantly faster and has a lower memory footprint than the state of the art while achieving comparable or better accuracy.

Quickstart on Colab

Try out our KGLiDS Colab Demo and KGLiDS DataPrep Demo, which demonstrate our APIs on Kaggle data!

Linked Data Science: Systems and Applications

To learn more about Linked Data Science and its applications, please watch Dr. Mansour's talk at the Waterloo DSG Seminar (here).

Installation

  • Clone the kglids repo
  • Create a kglids Conda environment (Python 3.8) and activate it
  • Install the pip requirements

```bash
git clone https://github.com/CoDS-GCS/kglids.git
cd kglids
conda create -n kglids python=3.8 -y
conda activate kglids
pip install -r requirements.txt
```

Generating the LiDS graph:

First, configure the data sources to process, e.g.:

```python
# sample configuration
# list of data sources to process
data_sources = [DataSource(name='benchmark',
                           path='/home/projects/sources/kaggle',
                           file_type='csv')]
```

Then run the three components in order (run each step from the repository root):

```bash
# 1. profile the data sources
cd kg_governor/data_profiling/src/
python main.py

# 2. build the knowledge graph from the profiles
cd kg_governor/knowledge_graph_construction/src/
python data_global_schema_builder.py

# 3. abstract the pipeline scripts
cd kg_governor/pipeline_abstraction/
python pipelines_analysis.py
```

Uploading the LiDS graph to the graph engine (we recommend GraphDB): please see populate_graphdb.py for an example of uploading graphs to GraphDB.
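Alternatively, a minimal upload can be scripted directly against GraphDB's RDF4J-style REST endpoint. The sketch below uses only the standard library; the base URL, repository name, and file path are placeholders, and populate_graphdb.py remains the reference example:

```python
import urllib.request

def statements_endpoint(base_url: str, repository: str) -> str:
    """Build the RDF4J-style statements endpoint GraphDB exposes per repository."""
    return f"{base_url.rstrip('/')}/repositories/{repository}/statements"

def upload_turtle(base_url: str, repository: str, turtle_path: str) -> int:
    """POST a Turtle file into the repository; GraphDB answers 204 on success."""
    with open(turtle_path, "rb") as f:
        payload = f.read()
    request = urllib.request.Request(
        statements_endpoint(base_url, repository),
        data=payload,
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# e.g. upload_turtle("http://localhost:7200", "kglids", "lids_graph.ttl")
```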


Using the KGLiDS APIs:

KGLiDS provides predefined operations in the form of Python APIs that allow seamless integration with a conventional data science pipeline. Check out the full list of KGLiDS APIs.

LiDS Ontology

To store the created knowledge graph in a standardized and well-structured way, we developed an ontology for linked data science: the LiDS Ontology.
Check out the LiDS Ontology!
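To illustrate the flavor of such a graph, a LiDS-style fragment might look like the following Turtle. The prefix and term names here are hypothetical placeholders for illustration, not the actual ontology vocabulary:

```turtle
@prefix kglids: <http://kglids.org/ontology/> .   # hypothetical prefix
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# a column node, its parent table, and a profiled property (illustrative only)
<http://kglids.org/resource/benchmark/movies/movieid>
    a                  kglids:Column ;
    kglids:isPartOf    <http://kglids.org/resource/benchmark/movies> ;
    kglids:hasDataType "int" .
```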

Benchmarks

The following benchmark datasets were used to evaluate KGLiDS:

KGLiDS APIs

See the full list of supported APIs here.

Citing Our Work

If you find our work useful, please cite it in your research.

@article{kglids,
  title   = {Linked Data Science Powered by Knowledge Graphs},
  author  = {Mossad Helali and Shubham Vashisth and Philippe Carrier and Katja Hose and Essam Mansour},
  year    = {2023},
  journal = {arXiv preprint arXiv:2303.02204},
  url     = {https://arxiv.org/abs/2303.02204}
}

Contributions

We encourage contributions and bug fixes; please don't hesitate to open a PR or create an issue if you encounter any bugs.

Questions

For any questions, please contact us:

[email protected]

[email protected]

kglids's People

Contributors

gmossadhelali, mansoure, mossadhelali, nikimonjazeb, p-carrier, shubhamvashisth7


kglids's Issues

Reading tables in data profiling

@mossadhelali I think there is a problem here when dealing with some messy CSV files:

2022-11-16 21:25:48.854185 : Creating tables, Getting columns
Traceback (most recent call last):
  File "/mnt/miniconda3/envs/kglids/lib/python3.8/site-packages/pandas/io/parsers/python_parser.py", line 742, in _next_iter_line
    return next(self.data)
_csv.Error: ',' expected after '"'
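As a possible workaround until the profiler handles this, malformed lines can be skipped at read time. This is a sketch using only the standard library, not the profiler's code; note that parsing line by line drops legitimate multi-line quoted fields:

```python
import csv

def read_rows_skipping_bad(path):
    """Parse a CSV file line by line, skipping lines that violate strict
    quoting rules (the same condition that raises the error above)."""
    rows = []
    with open(path, newline="") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            try:
                rows.append(next(csv.reader([line], strict=True)))
            except csv.Error:
                continue  # e.g. ',' expected after '"'
    return rows
```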

Inclusion dependency for columns with single values

If a column has exactly one unique value, it is excluded from inclusion-dependency detection. Consequently, if two columns contain exactly the same single unique value, no inclusion dependency (and therefore no PK-FK relationship) is detected between them, which shouldn't be the case.
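A minimal sketch of set-based inclusion-dependency scoring (not the KGLiDS implementation) shows why filtering out single-valued columns drops valid candidates:

```python
def inclusion_score(col_a, col_b):
    """Fraction of col_a's unique values that also appear in col_b;
    1.0 means col_a is fully included in col_b."""
    a, b = set(col_a), set(col_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

# Two columns that each contain a single identical value are fully
# included in one another, yet a filter requiring "more than one unique
# value" would never compare them at all.
status_a = ["active", "active", "active"]
status_b = ["active"]
print(inclusion_score(status_a, status_b))  # 1.0
```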

Tuning Content Similarity Parameters

For the MovieLens graph on the discovery VM, even with a score threshold of 0.75, these pairs are undetected: [('movies2actors', 'actorid', 'actors', 'actorid'), ('movies2actors', 'movieid', 'movies', 'movieid'), ('movies2directors', 'movieid', 'movies', 'movieid'), ('movies2directors', 'directorid', 'directors', 'directorid'), ('u2base', 'movieid', 'movies', 'movieid'), ('u2base', 'userid', 'users', 'userid')]
I have tried these sets of parameters:

```python
LABEL_SIM_THRESHOLD = 0.75
BOOLEAN_SIM_THRESHOLD = 0.75
EMBEDDING_SIM_THRESHOLD = 1.0
```

and

```python
LABEL_SIM_THRESHOLD = 0.75
BOOLEAN_SIM_THRESHOLD = 0.75
EMBEDDING_SIM_THRESHOLD = 0.75
```

Do you have any suggestions for how I could tune this to get better results?
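One way to narrow this down: the missed pairs all have identical column names (e.g. actorid vs. actorid), so any reasonable label-similarity measure should already score them 1.0, well above 0.75. That suggests the embedding/content side is what filters these pairs out. A rough check using difflib (not the project's actual similarity function):

```python
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two column labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

missed_pairs = [("actorid", "actorid"), ("movieid", "movieid"),
                ("directorid", "directorid"), ("userid", "userid")]
for left, right in missed_pairs:
    print(left, right, label_similarity(left, right))  # all 1.0, above 0.75
```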

Column data types are incorrectly inferred

Spark's read_csv infers column types automatically. For some reason, some numerical columns are inferred as string columns. This might be a reason for the performance drop.

HINT: it appears that if a column has a NaN value, it is automatically inferred as string. Verify and fix accordingly.
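A possible fix, sketched here independently of Spark: infer the type from the non-missing values only, so a NaN cell cannot force the whole column to string. This is a minimal standard-library sketch, not the profiler's actual code:

```python
def infer_column_type(values):
    """Return 'int', 'float', or 'string' for a column of raw cell strings,
    ignoring missing entries instead of letting them force 'string'."""
    missing = {"", "nan", "na", "null", "none"}
    present = [v for v in values if str(v).strip().lower() not in missing]
    if not present:
        return "string"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)  # raises ValueError if any value doesn't fit
            return name
        except ValueError:
            continue
    return "string"

print(infer_column_type(["1", "2", "NaN", "4"]))  # int, despite the NaN
print(infer_column_type(["1.5", "", "2.0"]))      # float
```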

Inclusion dependencies are duplicated with different scores

The original implementation generates duplicate similarity triples between columns with different scores. Example:

<<col1 hasInclusionDependency col2>> withCertainty 0.97
<<col2 hasInclusionDependency col1>> withCertainty 0.97
<<col1 hasInclusionDependency col2>> withCertainty 0.96
<<col2 hasInclusionDependency col1>> withCertainty 0.96

Proposed Solution: make the scores asymmetric?

<<col1 hasInclusionDependency col2>> withCertainty 0.97
<<col2 hasInclusionDependency col1>> withCertainty 0.96
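Independent of whether the scores themselves become asymmetric, the duplicates can be collapsed by keeping a single best-scoring triple per ordered column pair. A sketch, not the actual triple-generation code:

```python
def deduplicate_triples(triples):
    """Collapse duplicate (subject, object, score) inclusion triples,
    keeping the highest score seen for each ordered pair."""
    best = {}
    for subj, obj, score in triples:
        key = (subj, obj)
        if key not in best or score > best[key]:
            best[key] = score
    return sorted((s, o, sc) for (s, o), sc in best.items())

raw = [("col1", "col2", 0.97), ("col2", "col1", 0.97),
       ("col1", "col2", 0.96), ("col2", "col1", 0.96)]
print(deduplicate_triples(raw))
# [('col1', 'col2', 0.97), ('col2', 'col1', 0.97)]
```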

Profiler error

I'm getting errors when running the profiler. I have tried both the SAP and Credit datasets, and I have attached the logs. I think there is an error in classifying the data types, as both of these datasets were profiled with the older version of kglids.
SAP-error.txt
Credit-error.txt
