
dlr-sc / gitlab2prov


Extract provenance information (W3C PROV) from GitLab projects.

License: MIT License

Python 100.00%
provenance gitlab software-analytics graphs prov-generation knowledge-graph w3c-prov extract-provenance-information python git


gitlab2prov's People

Contributors

cdboer, daniel-mohr, onyame

gitlab2prov's Issues

Evaluate GitLab Rate Limiting

To mine many repositories at once, it would help to know whether the rate limit is applied per user or per IP address.

Bug: Node id collisions when inserting multiple documents

There seem to be merge issues when inserting documents whose node IDs are already assigned to nodes in the Neo4j instance. This suggests that the assumption that nodes with the same ID are treated as different when they belong to different bundles is incorrect.

Approach to fix:

  • include project/bundle id in each node id for nodes of a bundle

Create different id namespaces for different bundles.
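
The namespacing idea above could be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `qualify_node_ids` and the `bundle/node` separator are assumptions made for the example.

```python
def qualify_node_ids(bundle_id, node_ids):
    """Map each node id to an id that is prefixed with its bundle/project id.

    The "/" separator is an arbitrary choice for this sketch.
    """
    return {node_id: f"{bundle_id}/{node_id}" for node_id in node_ids}


# The same raw id in two bundles no longer collides after qualification.
first = qualify_node_ids("project-1", ["commit-abc"])
second = qualify_node_ids("project-2", ["commit-abc"])
```

With this scheme, two documents can be inserted into the same database without their nodes being merged, as long as the bundle ids themselves are unique.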

Multiple regex per event classifier

Older versions of GitLab have different system note strings denoting events. Some older versions do still have labeling information in system note strings. This could potentially lead to duplicated label events if they are also returned by the label event api endpoint.

A fix would be to allow a number of different regular expressions per classifier to cover old notations. The labeling issue remains to be investigated.
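
A classifier that accepts several regular expressions might look like the sketch below. The note strings and the names `CLASSIFIERS` and `classify` are hypothetical; the actual wording of system notes in older GitLab versions would have to be collected first.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical patterns: each event type lists one regex per known notation.
CLASSIFIERS = {
    "add_label": [
        re.compile(r"^added ~(?P<label_id>\d+) label$"),  # assumed current wording
        re.compile(r"^Added ~(?P<label_id>\d+) label$"),  # assumed older wording
    ],
}


def classify(note: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (event type, captured groups) for the first matching pattern."""
    for event_type, patterns in CLASSIFIERS.items():
        for pattern in patterns:
            match = pattern.search(note)
            if match:
                return event_type, match.groupdict()
    return None
```

Because all notations map to the same event type, later de-duplication against the label event endpoint only has to compare event types and captured values, not raw strings.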

Originally posted by @cdboer in #30 (comment)

PROV Data Model

Develop a simple graphical and/or tabular representation of the PROV model in use.
The initial model can be the one used by #2.
Annotate nodes with their type.

Event Parsing

Implement event parsing for labels, awards, notes and system notes.

Relate file to its original entity

The PROV model for commits requires files to be related to their original entity and to proclaim themselves as a specialization of that entity. Finding this original entity is already implemented, but the implementation is quite slow and does not scale well with a growing number of files. Rethink the implementation so that it is faster and more resilient. Keep an eye on performance: this shouldn't take longer to compute than necessary.

Right now the implementation searches through the ever-changing file paths of all files, computes tracks for each file and chooses the longest track as the correct history of a file. Given the computational overhead this brings along, it does not seem to be the best solution to the problem.
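
One possible alternative is a single pass over the changes in chronological order, carrying the original path along on every rename. The sketch below is an assumption-laden illustration, not the project's algorithm: it assumes each change is available as an `(old_path, new_path)` pair (GitLab diff entries carry `old_path` and `new_path` fields), that added files have identical old and new paths, and that history is linear. The name `resolve_original_paths` is hypothetical.

```python
def resolve_original_paths(changes):
    """Map each current file path to the path it was first added under.

    `changes` is an iterable of (old_path, new_path) tuples in
    chronological order. Branching and merges are ignored in this sketch.
    """
    origin = {}
    for old_path, new_path in changes:
        if old_path == new_path:
            # Addition or plain modification: keep an already known origin.
            origin.setdefault(new_path, new_path)
        else:
            # Rename: the new path inherits the origin of the old path.
            origin[new_path] = origin.pop(old_path, old_path)
    return origin


history = [("a.py", "a.py"), ("a.py", "b.py"), ("b.py", "c.py"), ("x.py", "x.py")]
origins = resolve_original_paths(history)
```

This runs in time linear in the number of changes, since each change is a constant-time dictionary update.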

Complete command line options

All configuration options that are available in config.ini should be present as command line arguments.
This is to avoid the need for a config.ini file.

Evaluate Neo4J Write Action Bottleneck

In the latest test runs with repositories of 500 commits and more, I noticed that it took longer than expected to store sizable PROV documents in a Neo4J instance. Evaluate whether this is normal behaviour, or if a speed-up is possible. Maybe chunking could help?
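
If chunking is tried, a generic helper along these lines could split the nodes or relations of a document into batches, each of which would then be written in its own transaction. Whether this actually speeds up the writes is untested; the helper name `chunked` is an assumption for this sketch.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` elements from `items`."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


batches = list(chunked(range(5), 2))
```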

Error when processing parent commits

For some larger GitLab projects (> 50 commits) the following error occurs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "gitlab2prov\gitlab2prov\__init__.py", line 31, in compute_graph
    graphs = self.run_pipelines(url)
  File "gitlab2prov\gitlab2prov\__init__.py", line 50, in run_pipelines
    packages = pipe.process(*data)
  File "gitlab2prov\gitlab2prov\pipelines.py", line 33, in process
    packages = CommitProcessor.process(commits, diffs)
  File "gitlab2prov\gitlab2prov\procs\__init__.py", line 19, in process
    parents = [commits[parent_id] for parent_id in commit["parent_ids"]]
  File "gitlab2prov\gitlab2prov\procs\__init__.py", line 19, in <listcomp>
    parents = [commits[parent_id] for parent_id in commit["parent_ids"]]
KeyError: 'e4a41bbde6dfc8b152b2d94528edc5f78073baf4'

The problem is that commits are fetched in batches/pages of 50 commits. If the parent commit of a commit is not included in the same batch, the line parents = [commits[parent_id] for parent_id in commit["parent_ids"]] leads to a key error.
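
One way to avoid the error is to merge all pages into a single lookup table before resolving parents. The sketch below assumes commit dictionaries with the `id` and `parent_ids` keys seen in the traceback; the function names are hypothetical, and silently skipping parents that were never fetched is a design choice of this example, not necessarily the right behaviour for the project.

```python
def index_commits(pages):
    """Merge all fetched pages into one dict keyed by commit id."""
    commits = {}
    for page in pages:
        for commit in page:
            commits[commit["id"]] = commit
    return commits


def parents_of(commit, commits):
    """Look up the parents of a commit, skipping ids that were never fetched."""
    return [commits[pid] for pid in commit["parent_ids"] if pid in commits]


# The parent "a" sits on a different page than its child "b".
pages = [
    [{"id": "b", "parent_ids": ["a"]}],
    [{"id": "a", "parent_ids": []}],
]
commits = index_commits(pages)
parents = parents_of(commits["b"], commits)
```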

Update event parsing documentation

Update the list of events in /docs to reflect the current set of parsed events. Also add descriptions for what events denote and what information certain events convey. (As in what keys are added by events to the property labels.)

Paper: Git2PROV - Exposing Version Control System Content as W3C PROV

Thread regarding the paper (pdf) by authors Tom De Nies, Sara Magliacane, Ruben Verborgh, Sam Coppens, Paul Groth and Rik Van de Walle.

The related GitHub repository can be found here.

The paper has likely been the inspiration for #2 as well as for this project. Additional resources, ideas and comments concerning Git2PROV will be posted in the comments of this issue.

Paper: GitHub2PROV - Provenance for Supporting Software Project Management

Thread regarding the paper GitHub2PROV: Provenance for Supporting Software Project Management by Heather S. Packer, Adriane Chapman and Leslie Carr, all of the University of Southampton.

The paper has been published as part of the USENIX publication for the 11th International Workshop on Theory and Practice of Provenance in June 2019.

Additional resources, ideas and comments concerning GitHub2PROV will be posted in the comments of this issue.

Support more than one output format through argument chaining

It'd be great if you were able to write > 1 format at the same time, e.g. gitlab2prov -t {token} -f json -f rdf -f xml -p {url} -r 1 > provout/{outfilename}.{format}. This would help when building a larger dataset with provenance data. This would be fairly easy to achieve using click or typer as well.
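
A repeatable format option can be sketched with the standard library's argparse, used here instead of click or typer only to keep the example dependency-free. The long option name `--format` and the attribute name `formats` are assumptions; the short flag and the format names are taken from the request above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="gitlab2prov")
    # action="append" collects every occurrence of -f into one list.
    parser.add_argument(
        "-f",
        "--format",
        dest="formats",
        action="append",
        choices=["json", "rdf", "xml"],
        help="output format; may be given more than once",
    )
    return parser


args = build_parser().parse_args(["-f", "json", "-f", "rdf"])
```

Note that `args.formats` is `None` when no `-f` is given, so a default format would have to be applied after parsing.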

Add qualified relations

To have qualified relations instead of binary relations between PROV class elements, add attributes to relations, at least:

  • wasGeneratedBy

    • prov:time
    • prov:role
  • used

    • prov:time
    • prov:role
  • wasInvalidatedBy

    • prov:time

GitLab API Support

Basic information retrieval using GitLab API. Use data to fill in data classes defined by gitlab2prov. Should be interchangeable with different data sources (APIs). (Hint: github API)

Asynchronous diff fetching

To retrieve the files that have been used in a certain commit, we have to request the diff of the commit in question. The GitLab API does this by synchronously sending HTTP GET requests. If we have to fetch a large amount of diffs, this will take a while. To speed this up, we can exploit the rate limit of the GitLab Instance by asynchronously sending requests. The default rate limit is set at 10 requests/second which is roughly 3-4 times faster than waiting for synchronous requests.
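
The idea can be sketched with asyncio by starting requests at a fixed interval and awaiting them together. Here `fetch_diff` is a hypothetical placeholder that only simulates latency; a real version would issue the HTTP GET with an asynchronous HTTP client. The 10 requests/second default is taken from the text above.

```python
import asyncio


async def fetch_diff(commit_id):
    """Placeholder for an HTTP GET of one commit diff (hypothetical)."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {"commit": commit_id, "diff": []}


async def fetch_all(commit_ids, requests_per_second=10):
    """Start at most `requests_per_second` requests per second, then await all."""
    interval = 1 / requests_per_second
    tasks = []
    for commit_id in commit_ids:
        tasks.append(asyncio.create_task(fetch_diff(commit_id)))
        await asyncio.sleep(interval)  # space out request starts
    # gather returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


diffs = asyncio.run(fetch_all(["a", "b", "c"]))
```

Spacing out the request starts keeps the client under the rate limit, while responses that are slow to arrive no longer block the requests that follow them.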

Python modules:

Undefined names in Lint check

Error at Lint check:

./gl2p/register.py:35:14: F821 undefined name 'Union'
commits: Union[Dict[str, Any], List[Any]]
^
./gl2p/register.py:36:12: F821 undefined name 'Union'
diffs: Union[Dict[str, Any], List[Any]]
^
./gl2p/models.py:118:35: F821 undefined name 'List'
events = resource.events # type: List[ResourceEvent]
^
./gl2p/models.py:118:35: F821 undefined name 'ResourceEvent'
events = resource.events # type: List[ResourceEvent]
^
F821 undefined name 'List'
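
F821 is reported here because the names used in annotations and type comments are never imported. A minimal sketch of the fix is shown below; `ResourceEvent` is a stand-in class and both functions are hypothetical, since the actual module contents are not shown in the lint output.

```python
from typing import Any, Dict, List, Union


class ResourceEvent:
    """Stand-in; in the project this would be imported from its own module."""


def count_commits(commits: Union[Dict[str, Any], List[Any]]) -> int:
    # Union, Dict, List and Any now resolve because they are imported above.
    return len(commits)


def count_events(events):
    # type: (List[ResourceEvent]) -> int
    # Names in type comments are checked as well, so they need imports too.
    return len(events)
```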

Remove resource names from attributes keys

Non-PROV attribute names are prefixed with the name of the resource that they describe.
To shorten attribute keys and to make identical keys uniform, we should remove the resource name from the keys.
Attributes should remain distinguishable, as they are attached to resources that still carry a prov:type.

In short, attributes such as tag_message on the resource tag and commit_message on the resource commit convey the same thing and therefore shouldn't be named differently.

Examples include:

IS                      SHOULD BE
commit_id               id
file_path_at_addition   path_at_addition
release_description     description
tag_message             message
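
The renaming above amounts to stripping a `<resource>_` prefix from each key. A minimal sketch, assuming attributes are held in a plain dictionary and using the hypothetical function name `strip_resource_prefix`:

```python
def strip_resource_prefix(resource, attributes):
    """Remove a leading '<resource>_' from every attribute key that has it."""
    prefix = f"{resource}_"
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in attributes.items()
    }


commit_attrs = strip_resource_prefix(
    "commit",
    {"commit_id": "abc", "commit_message": "fix typo", "prov:type": "commit"},
)
```

Keys without the prefix, such as prov:type, pass through unchanged.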

Dataclasses

Multiple sources with potentially different structured data have to be added. This calls for a unified interface of simple dataclasses. The process of adding data sources would then simplify to populating these classes with the available data. The underlying translation process to PROV vocabulary would not need to be changed.

API client pagination - can't use 'x-total-pages' for GitLab.com

For some resources of projects hosted at GitLab.com, the API client does not request more than the first page.
The pagination approach relies on the key x-total or x-total-pages being present in the response headers.

The GitLab API docs state the following concerning x-total and x-total-pages:

For performance reasons, if a query returns more than 10,000 records, GitLab doesn't return the following headers:

x-total.
x-total-pages.
rel="last" link.

If both keys are missing from the response headers, gitlab2prov naively assumes that there is only one page of the requested resource.
Not all GitLab.com projects are affected. Updates will follow.

See also this section in the official GitLab documentation.
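
A pagination loop that does not depend on the total count could follow the x-next-page header until it is empty. The sketch below assumes that this header is still returned when x-total and x-total-pages are omitted, which should be verified against the GitLab documentation. `fetch_page` is a hypothetical callable wrapping one HTTP GET that returns the decoded items together with the response headers.

```python
def fetch_all_pages(fetch_page):
    """Collect items from all pages without relying on x-total-pages.

    `fetch_page(page)` must return a tuple (items, headers).
    """
    results = []
    page = 1
    while True:
        items, headers = fetch_page(page)
        results.extend(items)
        next_page = headers.get("x-next-page", "")
        # Stop on an empty page or when no next page is announced.
        if not items or not next_page:
            break
        page = int(next_page)
    return results


# Simulated two-page response for illustration.
fake_pages = {
    1: ([1, 2], {"x-next-page": "2"}),
    2: ([3], {"x-next-page": ""}),
}
all_items = fetch_all_pages(lambda page: fake_pages[page])
```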

PROV Data Model Resources

Resources used to gain a foothold in PROV:

It is recommended to start with the PROV-Overview as it leads into the broader document structure of the PROV document stack. It is also recommended to follow the PROV document roadmap when first digging into PROV. In my experience this holds true.

Although I found the initial introduction to the concept of provenance to be a bit short, there is also this extensive version in the Provenance XG Final Report.

Additional Resources:

Collection of documents that I expect to be useful in the future.

Express file revisions in PROV model.

Currently, for commit actions that change a file entity there is a wasDerivedFrom relation between the current and the new revision. This should be changed to the wasRevisionOf relation.

GitHub API Support

Basic data retrieval using GitHub API. Use gitlab2prov defined dataclasses such that the translation of dataclasses to PROV vocabulary only has to be written once. Interchangeable for other data sources. (APIs)

Potential for duplicate label events

Older versions of GitLab have different system note strings denoting events. Some older versions do still have labeling information in system note strings. This could potentially lead to duplicated label events if they are also returned by the label event api endpoint.

A fix would be to allow a number of different regular expressions per classifier to cover old notations. The labeling issue remains to be investigated.

Originally posted by @cdboer in #30 (comment)
