eqtpartners / companykg

A Large-Scale Company Relation Graph for Investment Industry

License: MIT License

Python 72.41% Shell 1.61% Jupyter Notebook 25.98%
benchmarking company dataset graph-algorithms graph-learning graph-neural-networks heterogeneous-graph invetsments knowledge-graph private-equity

companykg's Introduction

⚠️ Repository Upgraded and Migrated to Version 2.x ⚠️

This repository corresponds to CompanyKG Version 1.x. We have extended this work to Version 2.x, hosted in a new repository. For ease of maintenance, we recommend submitting issues and pull requests for both Version 1.x and 2.x to the CompanyKG2 repository.


A Large-Scale Heterogeneous Graph for Company Similarity Quantification

This repository contains all code accompanying the release of the CompanyKG knowledge graph, illustrated in Figure 1 below. For details of the dataset and benchmark experiments, see the official release of the paper and dataset.

[Figure 1: CompanyKG illustration]

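There are two main parts to the code release: the companykg Python package, described under Setup and Basic usage, and the benchmark models, described under Training benchmark models.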

Pre-Requisites

  • Python 3.8

There are also optional dependencies, if you want to be able to convert the KG to one of the data structures used by these packages:

Setup

The companykg Python package provides a data structure to load CompanyKG into memory, convert between different graph representations and run evaluation of trained embeddings or other company-ranking predictors on three evaluation tasks.

To install the companykg package and its Python dependencies, activate a virtual environment (such as Virtualenv or Conda) and run:

pip install -e .

The first time you instantiate the CompanyKG class, if the dataset is not already available (in the default subdirectory or another location you specify), the latest version will be automatically downloaded from Zenodo.

Basic usage

By default, the CompanyKG dataset will be loaded from (and, if necessary, downloaded to) a data subdirectory of the working directory. To load the dataset from this default location, simply instantiate the CompanyKG class:

from companykg import CompanyKG

ckg = CompanyKG()

If you have already downloaded the dataset and want to load it from its current location, specify the path:

ckg = CompanyKG(data_root_folder="/path/to/stored/companykg/directory")

The graph can be loaded with different vector representations (embeddings) of the company description data associated with the nodes: msbert (mSBERT), simcse (SimCSE), ada2 (ADA2) or pause (PAUSE). For example:

ckg = CompanyKG(nodes_feature_type="pause")

If you want to experiment with different embedding types, you can also load embeddings of a different type into an already-loaded graph:

ckg.change_feature_type("simcse")

By default, edge weights are not loaded into the graph. To load them, use:

ckg = CompanyKG(load_edges_weights=True)
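
The constructor options shown in this section can also be combined in a single call. The following is a minimal sketch, assuming the three keyword arguments above can be passed together; build_companykg_kwargs and load_companykg are hypothetical helper names used for illustration and are not part of the companykg package.

```python
# Feature types named in this README.
KNOWN_FEATURE_TYPES = ("msbert", "simcse", "ada2", "pause")


def build_companykg_kwargs(feature_type=None, with_edge_weights=None, data_root=None):
    """Hypothetical helper: collect only the options that were explicitly set.

    Options left as None are omitted so that CompanyKG falls back to its own
    defaults for them.
    """
    kwargs = {}
    if feature_type is not None:
        if feature_type not in KNOWN_FEATURE_TYPES:
            raise ValueError(
                f"unknown feature type {feature_type!r}; "
                f"expected one of {KNOWN_FEATURE_TYPES}"
            )
        kwargs["nodes_feature_type"] = feature_type
    if with_edge_weights is not None:
        kwargs["load_edges_weights"] = bool(with_edge_weights)
    if data_root is not None:
        kwargs["data_root_folder"] = data_root
    return kwargs


def load_companykg(**options):
    """Hypothetical helper: validate the options, then instantiate CompanyKG."""
    # Imported lazily so that build_companykg_kwargs can be used (and tested)
    # without the companykg package or the dataset being available.
    from companykg import CompanyKG

    return CompanyKG(**build_companykg_kwargs(**options))
```

With this helper, load_companykg(feature_type="pause", with_edge_weights=True) would make the same call as CompanyKG(nodes_feature_type="pause", load_edges_weights=True), while rejecting a misspelled feature type before any data is loaded.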

A tutorial showing further ways to use CompanyKG is here.

Training benchmark models

Implementations of various benchmark graph-based learning models are provided in this repository.

To use them, install the ckg_benchmarks Python package, along with its dependencies, from the benchmarks subdirectory. First install companykg as above and then:

cd benchmarks
pip install -e .

Further instructions for using the benchmarks package for model training are provided in the benchmarks README file.

External Results

We collect all benchmarking results on this dataset here. You are welcome to reach out to us (via a GitHub issue or the email address given in our paper) if you wish to have your experimental results included.

Cite This Work

Cite the paper:

@article{cao2023companykg,
    author = {Lele Cao and
              Vilhelm von Ehrenheim and
              Mark Granroth-Wilding and
              Richard Anselmo Stahl and
              Drew McCornack and
              Armin Catovic and
              Dhiana Deva Cavacanti Rocha},
    title = {{CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification}},
    journal = {IEEE Transactions on Big Data},
    year = {2024},
    doi = {10.1109/TBDATA.2024.3407573}
}

Cite the official release of the CompanyKG dataset on Zenodo:

@article{companykg_2023_8010239,
    author = {Lele Cao and
              Vilhelm von Ehrenheim and
              Mark Granroth-Wilding and
              Richard Anselmo Stahl and
              Drew McCornack and
              Armin Catovic and
              Dhiana Deva Cavacanti Rocha},
    title = {{CompanyKG Dataset: A Large-Scale Heterogeneous Graph for Company Similarity Quantification}},
    month = June,
    year = 2023,
    publisher = {Zenodo},
    version = {1.1},
    doi = {10.5281/zenodo.8010239},
    url = {https://doi.org/10.5281/zenodo.8010239}
}

companykg's Issues

Improved results by 'augmenting' matrix with fastRP algorithm

Hello!

I tried to use the companyKG graph together with the fastRP algorithm https://arxiv.org/pdf/1908.11512.pdf
I implemented the algorithm in apache spark https://github.com/Knorreman/graphxfastRP/tree/master
Forgive me for the incomplete README etc... :)

Here is the result when using the 512-dim mSBERT vectors as the initial vectors instead of randomly initialized ones.
{"source": "embed torch.Size([1169931, 512])", "sp_auc": 0.848861754181647, "sr_validation_acc": 0.6195652173913043, "sr_test_acc": 0.6532258064516129, "cr_topk_hit_rate": [0.227659109895952, 0.32893550163287005, 0.4052213868003342, 0.47640123034859877, 0.566618724842409, 0.6384498177261335, 0.7838241436925648, 0.850617072985494]}
I could not get any results for SimCSE and ADA2: because of their large size, I ran into OOM problems on my PC. The mSBERT run took about 8-10h with my Spark code... You could easily implement the fastRP algorithm in numpy/torch and get much better performance, but I wanted to make the algorithm distributable with Spark! :)

I used alpha1 and alpha2 as 1.0 and I also weighted the starting vector to 1.0 in the linear combination.
As you can see, the 'sp_auc' and the 'cr_topk_hit_rate' @50 and @100 are better than the results presented in the paper. However, the 'sr_test_acc' is not quite as good.
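
For readers who want to try the idea on a small graph, here is a minimal pure-Python sketch of a FastRP-style propagation that starts from existing node features rather than a random projection. It is not the Spark implementation linked above: the row normalization, the mean aggregation over neighbours, and the names fastrp_like, init_weight and alphas are assumptions made for illustration, and the degree-based scaling described in the FastRP paper is omitted.

```python
import math


def l2_normalize_rows(matrix):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm > 0 else list(row))
    return out


def propagate(adjacency, features):
    """One propagation step: each node takes the mean of its neighbours' vectors."""
    dim = len(features[0])
    out = []
    for neighbours in adjacency:
        acc = [0.0] * dim
        for j in neighbours:
            for d in range(dim):
                acc[d] += features[j][d]
        if neighbours:
            acc = [v / len(neighbours) for v in acc]
        out.append(acc)
    return out


def fastrp_like(adjacency, init_features, init_weight=1.0, alphas=(1.0, 1.0)):
    """Weighted sum of the initial features and their 1..k-hop propagated versions.

    adjacency is a list of neighbour-index lists, one per node.
    """
    current = l2_normalize_rows(init_features)
    combined = [[init_weight * v for v in row] for row in current]
    for alpha in alphas:
        # Propagate one more hop, renormalize, and add with this hop's weight.
        current = l2_normalize_rows(propagate(adjacency, current))
        for i, row in enumerate(current):
            for d, v in enumerate(row):
                combined[i][d] += alpha * v
    return combined


# Toy example: a path graph 0 - 1 - 2 with 2-dimensional node features.
toy_adjacency = [[1], [0, 2], [1]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_embeddings = fastrp_like(toy_adjacency, toy_features)
```

The default arguments mirror the setting described above (the starting vector and both propagation steps weighted 1.0). A practical version for a graph of this size would use sparse matrix products in numpy or torch instead of Python loops.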

GraphMAE has similar results for 'cr_topk_hit_rate' but is not as good on 'sp_auc'.

I didn't tune any hyperparameters for fastRP, since I had so much trouble even getting it to work with such a large graph and vector size. So there are potentially even better results to be gained with more tuning!

I hope you find it interesting! :) And I can share the torch matrix I found if I can figure out a good host to upload it.
