🧱 blocking methods for entity resolution

Home Page: https://klinker.readthedocs.io

License: MIT License

klinker

Installation

Clone the repo and change into the directory:

git clone https://github.com/dobraczka/klinker.git
cd klinker

For GPU usage, create a micromamba environment:

micromamba env create -n klinker-conda --file=klinker-conda.yaml

Activate it and install the remaining dependencies:

micromamba activate klinker-conda
pip install -e .

Alternatively, if you don't intend to use a GPU, you can install it in a virtual environment:

python -m venv klinker-env
source klinker-env/bin/activate
pip install -e ".[all]"

or via poetry:

poetry install

Usage

Load a dataset:

from sylloge import MovieGraphBenchmark
from klinker.data import KlinkerDataset

ds = KlinkerDataset.from_sylloge(MovieGraphBenchmark(graph_pair="tmdb-tvdb"))

Create blocks and write to parquet:

from klinker.blockers import SimpleRelationalTokenBlocker

blocker = SimpleRelationalTokenBlocker()
blocks = blocker.assign(left=ds.left, right=ds.right, left_rel=ds.left_rel, right_rel=ds.right_rel)
blocks.to_parquet("tmdb-tvdb-tokenblocked")
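
To illustrate the general idea behind token blocking, here is a minimal, library-free sketch. It is not klinker's implementation: the function name `token_blocks` and the toy records are hypothetical, and the relational blocker above additionally uses information from neighboring entities, which this sketch ignores. Every token that occurs in both sources becomes a block holding the entity ids that contain it.

```python
import re
from collections import defaultdict


def token_blocks(left, right):
    """Group entity ids from two sources by the tokens they share.

    `left` and `right` map entity ids to a text description (toy input format).
    """

    def index(records):
        # Inverted index: token -> set of entity ids containing that token.
        inverted = defaultdict(set)
        for entity_id, text in records.items():
            for token in re.findall(r"\w+", text.lower()):
                inverted[token].add(entity_id)
        return inverted

    left_index, right_index = index(left), index(right)
    # Only tokens present in both sources can link entities across them.
    return {
        token: (left_index[token], right_index[token])
        for token in left_index.keys() & right_index.keys()
    }


left = {"l1": "The Matrix 1999", "l2": "Blade Runner"}
right = {"r1": "Matrix Reloaded", "r2": "Runner Runner 2013"}
blocks = token_blocks(left, right)
```

Here `blocks` contains two blocks, keyed by the shared tokens `matrix` and `runner`; only entities inside the same block would be compared later.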

Read blocks from parquet and evaluate:

from klinker import KlinkerBlockManager
from klinker.eval_metrics import Evaluation

kbm = KlinkerBlockManager.read_parquet("tmdb-tvdb-tokenblocked")
ev = Evaluation.from_dataset(blocks=kbm, dataset=ds)
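
As a rough picture of what a blocking evaluation usually measures, the sketch below computes three common quantities from candidate pairs and a gold standard. It is an assumption-laden illustration, not klinker's `Evaluation` class: the helper names, the block format (token mapped to a pair of id sets), and the metric names are all hypothetical.

```python
from itertools import product


def candidate_pairs(blocks):
    """Collect every cross-source pair that shares at least one block."""
    pairs = set()
    for left_ids, right_ids in blocks.values():
        pairs.update(product(left_ids, right_ids))
    return pairs


def blocking_quality(blocks, gold, n_left, n_right):
    """Compare candidate pairs against known matches (`gold`)."""
    candidates = candidate_pairs(blocks)
    true_positives = len(candidates & gold)
    return {
        # Share of true matches that survive blocking.
        "recall": true_positives / len(gold),
        # Share of candidate pairs that are true matches.
        "precision": true_positives / len(candidates) if candidates else 0.0,
        # Share of the full cross product that no longer needs comparing.
        "reduction_ratio": 1 - len(candidates) / (n_left * n_right),
    }


toy_blocks = {"matrix": ({"l1"}, {"r1"}), "runner": ({"l2"}, {"r2"})}
toy_gold = {("l1", "r1")}
metrics = blocking_quality(toy_blocks, toy_gold, n_left=2, n_right=2)
```

With this toy input, the single gold pair is retained, half of the candidate pairs are correct, and half of the four possible comparisons are avoided.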

Reproduce Experiments

The experiment.py script provides commands for datasets and blockers. Run python experiment.py --help to list the available commands. Subcommands offer their own help, e.g. python experiment.py gcn-blocker --help.

You have to use a dataset command before a blocker command.

For example, if you used micromamba for installation:

micromamba run -n klinker-conda python experiment.py movie-graph-benchmark-dataset --graph-pair "tmdb-tvdb" relational-token-blocker

This is similar to the steps described in the Usage section above.
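
If you want to launch such runs from a Python script, the ordering rule (dataset command first, blocker command second) can be encoded in a small helper. This is only a sketch: `build_experiment_command` is a hypothetical name, and the subcommands and options are the ones from the example above.

```python
import shlex


def build_experiment_command(dataset_args, blocker_args, env_name="klinker-conda"):
    """Assemble the argument list with the dataset command before the blocker command."""
    return [
        "micromamba", "run", "-n", env_name,
        "python", "experiment.py",
        *dataset_args,  # dataset subcommand and its options come first
        *blocker_args,  # blocker subcommand and its options follow
    ]


command = build_experiment_command(
    dataset_args=["movie-graph-benchmark-dataset", "--graph-pair", "tmdb-tvdb"],
    blocker_args=["relational-token-blocker"],
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.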

To reproduce the results from the paper precisely, we provide run scripts (adapted from our SLURM batch scripts) in the run_scripts folder. Please consult run_scripts/README.md for further information. For archival purposes, the experiment artifacts and the source code are stored on Zenodo.
