The gitlab2prov from dlr-sc

Evaluate GitLab Rate Limiting

To mine many repositories at once, it would help to know whether the rate limit is calculated per user or per IP address.

Bug: Node id collisions when inserting multiple documents

There seem to be merge issues when inserting documents whose node ids are already assigned to nodes in the Neo4j instance. This suggests that the assumption that nodes with the same id are treated as distinct when they belong to different bundles is incorrect.

Approach to fix:

Include the project/bundle id in the node id of every node that belongs to a bundle.

Create different id namespaces for different bundles.
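
A minimal sketch of the first approach, assuming node ids are plain strings and that a numeric project id is available for every bundle. The helper name `qualify_node_id` and the `project-<id>/` prefix are hypothetical and only illustrate the idea:

```python
def qualify_node_id(project_id: int, node_id: str) -> str:
    """Prefix a node id with the id of the project (bundle) it belongs to."""
    return f"project-{project_id}/{node_id}"


# Two bundles that both contain a node with the local id "commit-1".
bundles = {
    101: ["commit-1", "user-7"],
    202: ["commit-1"],
}

qualified = [
    qualify_node_id(project_id, node_id)
    for project_id, node_ids in bundles.items()
    for node_id in node_ids
]
```

With the prefix in place, the two `commit-1` nodes no longer share an id, so merging on id cannot fuse nodes of different bundles.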

Rate Limited Asynchronous Request Client

Implement a rate limited asynchronous request client.
Should support basic authentication, such as tokens etc. A simple header should suffice.
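
One possible shape for such a client, sketched with the standard library only. The `RateLimiter` class and the `fetch` stand-in are hypothetical; a real client would issue the request with a library such as aiohttp where the stand-in returns its dictionary. `PRIVATE-TOKEN` is the header GitLab accepts for token authentication.

```python
import asyncio
import time


class RateLimiter:
    """Allow at most `rate` calls per second by spacing them evenly."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self) -> None:
        # The lock serialises callers so that each one gets its own time slot.
        async with self._lock:
            now = time.monotonic()
            delay = self._next_slot - now
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_slot = max(now, self._next_slot) + self._interval


async def fetch(url: str, limiter: RateLimiter, headers: dict) -> dict:
    """Stand-in for an HTTP GET (assumption: no network access here)."""
    await limiter.wait()
    return {"url": url, "token": headers.get("PRIVATE-TOKEN")}


async def fetch_all(urls: list, rate: float, token: str) -> list:
    limiter = RateLimiter(rate)
    headers = {"PRIVATE-TOKEN": token}
    return await asyncio.gather(*(fetch(url, limiter, headers) for url in urls))
```

Calling `asyncio.run(fetch_all(urls, rate=10, token="..."))` would start all requests at once but release them at no more than ten per second.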

Multiple regex per event classifier

Older versions of GitLab use different system note strings to denote events. Some older versions still carry labeling information in system note strings. This could potentially lead to duplicated label events if they are also returned by the label event API endpoint.

A fix would be to allow a number of different regular expressions per classifier to cover old notations. The labeling issue remains to be investigated.

Originally posted by @cdboer in #30 (comment)
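
A rough sketch of a classifier table with several patterns per event type. The event names and the note strings below are illustrative assumptions, not the notations GitLab actually used in any particular version:

```python
import re
from typing import Dict, List, Optional, Pattern

# Hypothetical table: every event type maps to a list of patterns,
# for example one per notation used by different GitLab versions.
CLASSIFIERS: Dict[str, List[Pattern[str]]] = {
    "change_title": [
        re.compile(r"^changed title from \*\*(?P<old>.+)\*\* to \*\*(?P<new>.+)\*\*$"),
        re.compile(r"^title changed from \*\*(?P<old>.+)\*\* to \*\*(?P<new>.+)\*\*$"),
    ],
    "close": [
        re.compile(r"^closed$"),
        re.compile(r"^status changed to closed$", re.IGNORECASE),
    ],
}


def classify(body: str) -> Optional[dict]:
    """Return the first matching event with its named groups, else None."""
    for event_type, patterns in CLASSIFIERS.items():
        for pattern in patterns:
            match = pattern.search(body)
            if match:
                return {"event": event_type, **match.groupdict()}
    return None
```

Returning `None` for an unmatched body leaves the decision of how to treat unknown notes to the caller.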

PROV Data Model

Develop a simple graphical and/or tabular representation of what the PROV model in use looks like.
The initial model can be the one used by #2.
Annotate nodes with their type.

Neo4J Connection

Based on findings of #11. Use prov-db-connector.

API client: refactoring

Event Parsing

Implement event parsing for labels, awards, notes and system notes.

Relate file to its original entity

The PROV model for commits requires files to be related to their original entity and to be declared a specialization of that entity. Finding this original entity is already implemented, but it is quite slow and does not scale well with a growing number of files. Rethink the implementation so that it is faster and more resilient. Keep an eye on performance; this shouldn't compute longer than necessary.

Right now the implementation searches through the ever-changing file paths of all files, computes tracks for each file and chooses the longest track as the correct history of a file. Since this brings along a lot of computational overhead, it does not seem to be the best solution to the problem.
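
One alternative worth evaluating is a single pass over the diffs in commit order that carries the original path forward through renames. The sketch below is an assumption about how that could look, not the current implementation. It assumes a linear history ordered oldest to newest and diff entries carrying the `old_path`, `new_path`, `new_file`, `renamed_file` and `deleted_file` fields of GitLab diff objects; merges and path swaps within one commit are not handled.

```python
from typing import Dict, List


def original_paths(commit_diffs: List[List[dict]]) -> Dict[str, str]:
    """Map each current file path to the path the file was first added under.

    `commit_diffs` is ordered oldest to newest; each inner list holds the
    diff entries of one commit.
    """
    origin: Dict[str, str] = {}
    for diffs in commit_diffs:
        for diff in diffs:
            old, new = diff["old_path"], diff["new_path"]
            if diff.get("new_file"):
                origin[new] = new
            elif diff.get("deleted_file"):
                origin.pop(old, None)
            elif diff.get("renamed_file"):
                # Carry the original path over to the new name.
                origin[new] = origin.pop(old, old)
    return origin
```

Each diff entry is visited once, so the cost grows linearly with the total number of diff entries rather than with the number of candidate tracks per file.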

Complete command line options

All configuration options that are available in config.ini should be present as command line arguments.
This makes it possible to run the tool without a config.ini file.

Create and test a Python application

Create and test a Python application with GitHub action.

Activate the test after #21 is finished.

Evaluate code coverage automation, increase coverage and add a badge.

Use pytest to implement test cases.

Relation 'wasGeneratedBy' for 'tags' incorrect.

In the generated PROV document, the wasGeneratedBy relation between the tag_event Activity and the tag Entity has the wrong direction. In the model file model.py it is correct.

Check if prov doc for project already exists in neo4j

It is unnecessary to insert documents that are already present. Updating documents will be the goal for UPDATE functionality.

Error handling for missing event classifiers

The tool stops for missing event classifiers (#30) with
Exception: No match found for body:

Evaluate Neo4J Write Action Bottleneck

In the latest test runs with repositories of 500 commits and more, I noticed that it took longer than expected to store sizable PROV documents in a Neo4j instance. Evaluate whether this is normal behaviour or whether a speedup is possible. Maybe chunking could help?
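
If chunking is evaluated, a generic helper like the following could split the records of a document into batches before they are written. This is only a sketch under the assumption that records can be written independently; the name `chunked` is hypothetical and the actual write call is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Whether batched writes are actually faster would still have to be measured against the Neo4j instance.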

Error when processing parent commits

For some larger GitLab projects (> 50 commits) the following error occurs:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "gitlab2prov\gitlab2prov\__init__.py", line 31, in compute_graph
    graphs = self.run_pipelines(url)
  File "gitlab2prov\gitlab2prov\__init__.py", line 50, in run_pipelines
    packages = pipe.process(*data)
  File "gitlab2prov\gitlab2prov\pipelines.py", line 33, in process
    packages = CommitProcessor.process(commits, diffs)
  File "gitlab2prov\gitlab2prov\procs\__init__.py", line 19, in process
    parents = [commits[parent_id] for parent_id in commit["parent_ids"]]
  File "gitlab2prov\gitlab2prov\procs\__init__.py", line 19, in <listcomp>
    parents = [commits[parent_id] for parent_id in commit["parent_ids"]]
KeyError: 'e4a41bbde6dfc8b152b2d94528edc5f78073baf4'

The problem is that commits are fetched in batches/pages of 50 commits. If the parent commit of a commit is not included in the same batch, the line parents = [commits[parent_id] for parent_id in commit["parent_ids"]] leads to a key error.
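
A possible fix, sketched under the assumption that all pages can be collected before parents are resolved: merge the pages into one lookup table first and skip parent ids that were never fetched. The function names are hypothetical; `id` and `parent_ids` are the fields used in the traceback above.

```python
from typing import Dict, Iterable, List


def index_commits(pages: Iterable[List[dict]]) -> Dict[str, dict]:
    """Merge all fetched pages into a single lookup table keyed by commit id."""
    return {commit["id"]: commit for page in pages for commit in page}


def parents_of(commit: dict, commits: Dict[str, dict]) -> List[dict]:
    """Return the known parents of a commit, skipping ids that were not fetched."""
    return [commits[pid] for pid in commit["parent_ids"] if pid in commits]
```

Skipping unknown parents silently is a design choice; logging them instead would make truncated histories easier to spot.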

Mypy type hints

Add correct type hints to satisfy mypy.

Bug: list index out of range for releases that lack evidence

It seems possible for a GitLab release to not have associated release evidence files. Currently this case is not considered and should be fixed within the meta.py file when converting JSON to pre-model datatypes.
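
A defensive accessor along these lines would avoid the index error. It assumes the release JSON carries its evidence in an `evidences` list, which may be empty or missing; the helper name `first_evidence` is hypothetical.

```python
from typing import Optional


def first_evidence(release: dict) -> Optional[dict]:
    """Return the first evidence entry of a release, or None if there is none."""
    evidences = release.get("evidences") or []
    return evidences[0] if evidences else None
```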

Update event parsing documentation

Update the list of events in /docs to reflect the current set of parsed events. Also add descriptions for what events denote and what information certain events convey. (As in what keys are added by events to the property labels.)

Create setup script.

Create a new setup.py script for distutils.

Paper: Git2PROV - Exposing Version Control System Content as W3C PROV

Thread regarding the paper (pdf) by authors Tom De Nies, Sara Magliacane, Ruben Verborgh, Sam Coppens, Paul Groth and Rik Van de Walle.

The related GitHub repository can be found here.

The paper has likely been the inspiration for #2 as well as for this project. Additional resources, ideas and comments concerning Git2PROV will be posted in the comments of this issue.

Translation Of Dataclasses To PROV

Translate dataclasses to PROV vocabulary. Keep as simple as possible.

Paper: GitHub2PROV - Provenance for Supporting Software Project Management

Thread regarding the aforementioned paper GitHub2PROV: Provenance for Supporting Software Project Management by authors Heather S. Packer, Adriane Chapman and Leslie Carr, all of the University of Southampton.

The paper has been published as part of the USENIX publication for the 11th International Workshop on Theory and Practice of Provenance in June 2019.

Additional resources, ideas and comments concerning GitHub2PROV will be posted in the comments of this issue.

Support more than one output format through argument chaining

It'd be great if you were able to write > 1 format at the same time, e.g. gitlab2prov -t {token} -f json -f rdf -f xml -p {url} -r 1 > provout/{outfilename}.{format}. This would help when building a larger dataset with provenance data. This would be fairly easy to achieve using click or typer as well.
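
The issue suggests click or typer; to stay dependency-free, the sketch below shows the same idea with argparse from the standard library. Only the `-f` option is modelled, the fallback to `json` is an assumption, and `output_paths` is a hypothetical helper that derives one file name per requested format.

```python
import argparse
from typing import List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="gitlab2prov")
    # action="append" collects every occurrence of -f into one list.
    parser.add_argument(
        "-f", "--format",
        dest="formats",
        action="append",
        choices=["json", "rdf", "xml"],
    )
    return parser


def output_paths(formats: Optional[Sequence[str]], stem: str) -> List[str]:
    """Build one output path per distinct format, keeping the given order."""
    chosen = list(dict.fromkeys(formats or ["json"]))
    return [f"{stem}.{fmt}" for fmt in chosen]
```

For `-f json -f rdf`, `output_paths(args.formats, "provout/graph")` yields `provout/graph.json` and `provout/graph.rdf`.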

Answer-Request mapping of async client slows down for big batch size

Related to the implementation of #10.
Individual GET requests can take longer than others.
As batch sizes grow, so does the number of slow requests. This complicates the mapping of answers to their asynchronous requests and slows down request fetching.
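
One way to sidestep an explicit answer-request mapping is to rely on `asyncio.gather`, which returns results in the order of the awaitables passed in, regardless of which request finishes first. The sketch uses a hypothetical `fake_get` with an artificial delay in place of a real HTTP request:

```python
import asyncio
from typing import Dict, List


async def fake_get(url: str, delay: float) -> str:
    """Stand-in for a GET request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"response for {url}"


async def fetch_in_order(requests: Dict[str, float]) -> List[str]:
    # The result list follows the order of `requests`, not completion order.
    return await asyncio.gather(
        *(fake_get(url, delay) for url, delay in requests.items())
    )
```

Here the slow request is listed first and still comes first in the result, so position alone identifies which answer belongs to which request.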

Colon in Identifier creates fake QualifiedName

Force UTF-8 encoding for qualified names.

Code refactorings.

Change code style, conventions, package layout...

Add qualified relations

To have qualified relations instead of binary relations between PROV class elements, add attributes to the relations, at least:

wasGeneratedBy
- prov:time
- prov:role
used
- prov:time
- prov:role
wasInvalidatedBy
- prov:time

Deployment to AWS

Deploy to AWS

AWS Lambda
AWS Fargate + ECS

Add a CITATION.cff file.

Add a CITATION file in Citation File Format (CFF).

GitLab API Support

Basic information retrieval using the GitLab API. Use the data to fill in the data classes defined by gitlab2prov. Should be interchangeable with different data sources (APIs). (Hint: GitHub API)

Asynchronous diff fetching

To retrieve the files that have been used in a certain commit, we have to request the diff of the commit in question. The GitLab API does this by synchronously sending HTTP GET requests. If we have to fetch a large number of diffs, this will take a while. To speed this up, we can exploit the rate limit of the GitLab instance by sending requests asynchronously. The default rate limit is set at 10 requests/second, which is roughly 3-4 times faster than waiting for synchronous requests.

Python modules:

asyncio
aiohttp

Store each PROV graph in a dedicated Neo4j instance

For scalability each PROV graph should be stored in a single/new instance of Neo4j 4.x.

Prerequisite is issue DLR-SC/prov-db-connector#86 for support of Neo4j Fabric.

Duplicated Agents on gitlab.com

On gitlab.com, the same user appears twice in the provenance graph.

For example at repo https://gitlab.com/onyame/provtest1 the result after giving a "Thumbs Up" (award_emoji event):
provtest1-step3.pdf

Undefined names in Lint check

Error at Lint check:

./gl2p/register.py:35:14: F821 undefined name 'Union'
commits: Union[Dict[str, Any], List[Any]]
^
./gl2p/register.py:36:12: F821 undefined name 'Union'
diffs: Union[Dict[str, Any], List[Any]]
^
./gl2p/models.py:118:35: F821 undefined name 'List'
events = resource.events # type: List[ResourceEvent]
^
./gl2p/models.py:118:35: F821 undefined name 'ResourceEvent'
events = resource.events # type: List[ResourceEvent]
^
F821 undefined name 'List'

Remove resource names from attributes keys

Non-PROV attribute names are prepended with the name of the resource that they are describing.
To shorten attribute keys and to achieve uniformity for the same keys, we should remove the resource name from keys.
Discriminability should still be given, as the attributes are fixed to the resources that still carry a prov:type.

In short, attributes such as tag_message on the resource tag and commit_message on the resource commit convey the same thing and therefore shouldn't be named differently.

Examples include:

IS	SHOULD BE
`commit_id`	`id`
`file_path_at_addition`	`path_at_addition`
`release_description`	`description`
`tag_message`	`message`
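
The renaming in the table can be sketched as a small helper that drops a leading `<resource>_` from every key and leaves all other keys, such as `prov:type`, untouched. The function name is hypothetical:

```python
def strip_resource_prefix(resource: str, attributes: dict) -> dict:
    """Remove the leading '<resource>_' from every attribute key that has it."""
    prefix = f"{resource}_"
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in attributes.items()
    }
```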

Dataclasses

Multiple sources with potentially different structured data have to be added. This calls for a unified interface of simple dataclasses. The process of adding data sources would then simplify to populating these classes with the available data. The underlying translation process to PROV vocabulary would not need to be changed.
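
As an illustration of the idea, one dataclass could be populated from two differently structured sources. The `Commit` class and both converter functions are hypothetical. The field names read from the payloads (`id`, `message`, `parent_ids` for GitLab; `sha`, `commit.message`, `parents` for GitHub) follow the respective commit API responses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Commit:
    """Source-independent view of a commit."""
    id: str
    message: str
    parent_ids: List[str] = field(default_factory=list)


def commit_from_gitlab(payload: dict) -> Commit:
    return Commit(payload["id"], payload["message"], list(payload["parent_ids"]))


def commit_from_github(payload: dict) -> Commit:
    return Commit(
        payload["sha"],
        payload["commit"]["message"],
        [parent["sha"] for parent in payload["parents"]],
    )
```

Everything downstream of these converters, including the translation to PROV, would only ever see `Commit`.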

API client pagination - can't use 'x-total-pages' for GitLab.com

The API client does not request more than the first page for some resources from projects hosted at GitLab.com.
The pagination approach relies on the key x-total or x-total-pages being present in the request response headers.

The GitLab API docs state the following concerning x-total and x-total-pages:

For performance reasons, if a query returns more than 10,000 records, GitLab doesn’t return the following headers:

x-total.
x-total-pages.
rel="last" link.

If both keys are missing from the response headers, gitlab2prov naively assumes that there is only one page of the requested resource.
Not all GitLab.com projects are affected. Updates will follow.

See also this section in the official GitLab documentation.
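
A pagination loop that does not depend on the total count could follow the `x-next-page` header instead, on the assumption that GitLab still returns it for large result sets and leaves it empty on the last page. In this sketch, `fetch_page` is a hypothetical callable that returns the items and the response headers of one page:

```python
from typing import Callable, Iterator, List, Mapping, Tuple

Page = Tuple[List[dict], Mapping[str, str]]


def iter_pages(fetch_page: Callable[[int], Page]) -> Iterator[List[dict]]:
    """Yield the items of every page, following 'x-next-page' until it is empty."""
    page = 1
    while True:
        items, headers = fetch_page(page)
        yield items
        next_page = headers.get("x-next-page", "")
        if not next_page:
            return
        page = int(next_page)
```

HTTP header names are case-insensitive, so a real client should look the header up through its library's case-insensitive header mapping rather than a plain dict.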

PROV Data Model Resources

Resources used to gain a foothold in PROV:

It is recommended to start with the PROV-Overview as it leads into the broader document structure of the PROV document stack. It is also recommended to follow the PROV document roadmap when first digging into PROV. In my experience this holds true.

Although I found the initial introduction to the concept of provenance to be a bit short, there is also this extensive version in the Provenance XG Final Report.

Additional Resources:

Collection of documents that I expect to be useful in the future.

Add File Headers With License And Copyright Notice

Create Docker image

Docker image for deployment, see #20

Add support for 'tags' and 'releases'.

Add support for generated tags and releases.

For example: https://gitlab.com/lucaapp/android/-/releases

Express file revisions in PROV model.

Currently, for commit actions that change a file entity there is a wasDerivedFrom relation between the current and the new revision. This should be changed to the wasRevisionOf relation.

Live updates using gitlab webhooks

Implement live update capabilities using GitLab project webhooks. TBD

GitHub API Support

Basic data retrieval using GitHub API. Use gitlab2prov defined dataclasses such that the translation of dataclasses to PROV vocabulary only has to be written once. Interchangeable for other data sources. (APIs)

Exception handling for Neo4j errors

ValueError: invalid literal for int() with base 10: '7474:7687'

if config.ini contains

[NEO4J]
host = localhost:7474
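
The error message suggests that the port embedded in `host` ends up next to a separately configured port. A small validation step could turn this into a clear message. The sketch below is an assumption about where such a check might live; `split_host_port` is a hypothetical helper, 7687 is Neo4j's default Bolt port, and IPv6 literals are not handled.

```python
from typing import Tuple


def split_host_port(host: str, default_port: int = 7687) -> Tuple[str, int]:
    """Split 'name' or 'name:port' into its parts, rejecting malformed ports."""
    name, separator, port = host.partition(":")
    if not separator:
        return name, default_port
    if not port.isdigit():
        raise ValueError(f"invalid port in NEO4J host setting: {host!r}")
    return name, int(port)
```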

Recreating the Git2PROV commit model using the GitLab API

This issue aims to recreate the commit model of #3 that Packer et al. extended in #2.
Resources, data sources and ideas will be posted in the comments.

Create typing stub files

Mypy doc on stub files.
https://mypy.readthedocs.io/en/stable/stubs.html

Create stub files for the third-party libraries in use that do not provide their own types.
Create stub files for gitlab2prov code to allow projects that import modules from gitlab2prov to also import their type signatures.

Sub-Issue of #26

Potential for duplicate label events

Older versions of GitLab use different system note strings to denote events. Some older versions still carry labeling information in system note strings. This could potentially lead to duplicated label events if they are also returned by the label event API endpoint.

A fix would be to allow a number of different regular expressions per classifier to cover old notations. The labeling issue remains to be investigated.

Originally posted by @cdboer in #30 (comment)
