dlr-sc / gitlab2prov
Extract provenance information (W3C PROV) from GitLab projects.
License: MIT License
To mine many repositories at once, it would help to know whether the rate limit is calculated per user or per IP address.
There seem to be merge issues when inserting documents whose node IDs are already assigned to nodes in the Neo4j instance. This suggests that the assumption that nodes with the same ID are distinct as long as they belong to different bundles is incorrect.
Approach to fix:
Create different id namespaces for different bundles.
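A minimal sketch of the namespace idea, assuming that prefixing each local node id with a bundle-specific name is enough to keep ids apart. The function name `qualify_ids` and the bundle names are hypothetical and not part of gitlab2prov.

```python
from __future__ import annotations


def qualify_ids(bundle_name: str, node_ids: list[str]) -> dict[str, str]:
    """Map each local node id to an id that is unique per bundle."""
    return {node_id: f"{bundle_name}:{node_id}" for node_id in node_ids}


# Two bundles that both contain a node with the local id "commit-1".
bundle_a = qualify_ids("project-a", ["commit-1", "file-1"])
bundle_b = qualify_ids("project-b", ["commit-1"])

# The qualified ids no longer collide when both bundles are inserted.
print(bundle_a["commit-1"], bundle_b["commit-1"])
```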
Implement a rate limited asynchronous request client.
Should support basic authentication, such as tokens etc. A simple header should suffice.
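One way such a client could look, sketched with the standard library only. The `fetch` coroutine is a hypothetical stand-in for the actual HTTP call, and `RateLimiter` and `fetch_all` are illustrative names, not existing gitlab2prov code. The token is sent in GitLab's `PRIVATE-TOKEN` header.

```python
import asyncio
import time


class RateLimiter:
    """Allow at most `rate` acquisitions per second by spacing them evenly."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def acquire(self) -> None:
        # The lock serialises callers so that each one gets its own time slot.
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            if wait > 0:
                await asyncio.sleep(wait)
            self._next_slot = max(now, self._next_slot) + self._interval


async def fetch_all(urls, fetch, token, rate=10.0):
    """Fetch all urls concurrently without exceeding `rate` requests/second."""
    limiter = RateLimiter(rate)
    headers = {"PRIVATE-TOKEN": token}

    async def fetch_one(url):
        await limiter.acquire()
        return await fetch(url, headers)

    # gather() returns results in the order of the urls passed in.
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

Because `asyncio.gather` keeps the input order, responses can be matched to their requests by position, regardless of which request finishes first.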
Older versions of GitLab have different system note strings denoting events. Some older versions do still have labeling information in system note strings. This could potentially lead to duplicated label events if they are also returned by the label event api endpoint.
A fix would be to allow a number of different regular expressions per classifier to cover old notations. The labeling issue remains to be investigated.
Originally posted by @cdboer in #30 (comment)
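A rough sketch of the proposed fix, with several regular expressions per classifier. The event names and note strings below are made up for illustration and are not taken from any GitLab version.

```python
import re
from typing import Optional

# Hypothetical classifiers: each event type accepts several notations,
# for example one for current and one for older GitLab versions.
CLASSIFIERS = {
    "close": [
        re.compile(r"^closed$"),
        re.compile(r"^status changed to closed$"),
    ],
    "reopen": [
        re.compile(r"^reopened$"),
        re.compile(r"^status changed to reopened$"),
    ],
}


def classify(body: str) -> Optional[str]:
    """Return the first event type with a matching pattern, else None."""
    for event_type, patterns in CLASSIFIERS.items():
        if any(pattern.search(body) for pattern in patterns):
            return event_type
    return None
```

Returning `None` for an unknown note leaves it to the caller to decide whether to skip the note or to raise.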
Develop a simple graphical and/or tabular representation of the PROV model in use.
The initial model can be the one used by #2.
Annotate nodes with their type.
Based on findings of #11. Use prov-db-connector.
Implement event parsing for labels, awards, notes and system notes.
The PROV model for commits requires files to be related to their original entity and to proclaim themselves as a specialization of that entity. Finding this original entity is already implemented, though quite slow and does not scale well with a growing amount of files. Rethink the implementation for this to be faster and more resilient. Keep an eye on performance, this shouldn't compute longer than necessary.
Right now the implementation searches through the ever-changing file paths of all files, computes tracks for each file, and chooses the longest track as the correct history of a file. Since this brings along a lot of computational overhead, it does not seem to be the best solution to the problem.
All configuration options that are available in config.ini should be present as command line arguments.
This is to avoid having a config.ini file.
Create and test a Python application with GitHub action.
Activate the test after #21 is finished.
Use pytest to implement test cases.
In the generated PROV document, the wasGeneratedBy relation between the tag_event Activity and the tag Entity has the wrong direction. In the model file model.py it is correct.
It is unnecessary to insert documents that are already present. Updating documents will be the goal for UPDATE functionality.
The tool stops for missing event classifiers (#30) with
Exception: No match found for body:
In the latest test runs with repositories of 500 commits and more, I noticed that it took longer than expected to store sizable PROV documents in a Neo4j instance. Evaluate whether this is normal behaviour or if a speed-up is possible. Maybe chunking could help?
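For reference, chunking itself could be as small as the helper below. Whether inserting in chunks actually speeds up storing documents in Neo4j is still open. The name `chunks` and the chunk size are assumptions.

```python
from itertools import islice


def chunks(records, size):
    """Yield successive lists of at most `size` items from `records`."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Each chunk could then be stored in its own transaction.
for chunk in chunks(range(5), 2):
    print(chunk)
```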
For some larger GitLab projects (> 50 commits) the following error occurs:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "gitlab2prov\gitlab2prov\__init__.py", line 31, in compute_graph
    graphs = self.run_pipelines(url)
  File "gitlab2prov\gitlab2prov\__init__.py", line 50, in run_pipelines
    packages = pipe.process(*data)
  File "gitlab2prov\gitlab2prov\pipelines.py", line 33, in process
    packages = CommitProcessor.process(commits, diffs)
  File "gitlab2prov\gitlab2prov\procs\__init__.py", line 19, in process
    parents = [commits[parent_id] for parent_id in commit["parent_ids"]]
  File "gitlab2prov\gitlab2prov\procs\__init__.py", line 19, in <listcomp>
    parents = [commits[parent_id] for parent_id in commit["parent_ids"]]
KeyError: 'e4a41bbde6dfc8b152b2d94528edc5f78073baf4'
The problem is that commits are fetched in batches/pages of 50 commits. If the parent commit of a commit is not included in the same batch, the line parents = [commits[parent_id] for parent_id in commit["parent_ids"]] leads to a key error.
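A possible way around this, sketched under the assumption that all pages can be collected before parents are resolved: index the commits of every page in one dictionary first, then look parents up in that combined index. The helper names `merge_pages` and `resolve_parents` are hypothetical.

```python
def merge_pages(pages):
    """Index the commits of all pages by id before any parent lookup."""
    return {commit["id"]: commit for page in pages for commit in page}


def resolve_parents(commit, commits):
    """Return the known parent commits, skipping ids that were not fetched."""
    return [commits[pid] for pid in commit["parent_ids"] if pid in commits]


# Two pages: the parent of "c2" only appears on the second page.
pages = [
    [{"id": "c3", "parent_ids": ["c2"]}, {"id": "c2", "parent_ids": ["c1"]}],
    [{"id": "c1", "parent_ids": []}],
]
commits = merge_pages(pages)
parents_of_c2 = resolve_parents(commits["c2"], commits)
```

Skipping unknown parent ids avoids the crash but hides missing data, so logging them might be preferable in practice.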
Add correct type hints to satisfy mypy.
It seems possible for a GitLab release to not have associated release evidence files. Currently this case is not considered and should be fixed within the meta.py file when converting JSON to pre-model datatypes.
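A small sketch of a tolerant lookup. It assumes that the release JSON carries its evidence under an `evidences` key with a `filepath` per entry, and the function name `evidence_paths` is hypothetical. The actual structure handled in meta.py may differ.

```python
def evidence_paths(release: dict) -> list:
    """Return evidence file paths, or an empty list if there are none."""
    # Treat a missing key and an explicit null value the same way.
    evidences = release.get("evidences") or []
    return [entry["filepath"] for entry in evidences if "filepath" in entry]
```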
Update the list of events in /docs to reflect the current set of parsed events. Also add descriptions for what events denote and what information certain events convey. (As in what keys are added by events to the property labels.)
Create a new setup.py script for distutils.
Thread regarding the paper (pdf) by authors Tom De Nies, Sara Magliacane, Ruben Verborgh, Sam Coppens, Paul Groth and Rik Van de Walle.
The related GitHub repository can be found here.
The paper has likely been the inspiration for #2 as well as for this project. Additional resources, ideas and comments concerning Git2PROV will be posted in the comments of this issue.
Translate dataclasses to PROV vocabulary. Keep as simple as possible.
Thread regarding the aforementioned paper GitHub2PROV: Provenance for Supporting Software Project Management by authors Heather S. Packer, Adriane Chapman and Leslie Carr - all of the University of Southampton.
The paper has been published as part of the USENIX publication for the 11th International Workshop on Theory and Practice of Provenance in June 2019.
Additional resources, ideas and comments concerning GitHub2PROV will be posted in the comments of this issue.
It'd be great if you were able to write more than one format at the same time, e.g. gitlab2prov -t {token} -f json -f rdf -f xml -p {url} -r 1 > provout/{outfilename}.{format}. This would help when building a larger dataset with provenance data. This would be fairly easy to achieve using click or typer as well.
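To illustrate the repeated `-f` flag, here is a sketch that uses the standard library's argparse instead of click or typer, purely to stay dependency-free. The long option `--format`, the default of `json`, and the helper `output_paths` are assumptions.

```python
import argparse


def parse_args(argv):
    """Parse a repeatable -f option into a list of output formats."""
    parser = argparse.ArgumentParser(prog="gitlab2prov")
    parser.add_argument(
        "-f", "--format",
        dest="formats",
        action="append",  # every occurrence of -f adds one entry
        choices=["json", "rdf", "xml"],
    )
    args = parser.parse_args(argv)
    args.formats = args.formats or ["json"]  # assumed default
    return args


def output_paths(basename, formats):
    """Build one output file name per requested format."""
    return [f"{basename}.{fmt}" for fmt in formats]


args = parse_args(["-f", "json", "-f", "rdf", "-f", "xml"])
print(output_paths("provout/example", args.formats))
```

Writing one file per format replaces the single stdout redirect from the command above.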
Related to the implementation of #10.
Individual GET requests can take longer than others.
As batch sizes grow, so does the number of slow requests. This convolutes the mapping between asynchronous requests and their responses and slows down fetching.
Force UTF-8 encoding for qualified names.
Change code style, conventions, package layout...
To have qualified relations instead of binary relations between PROV class elements, add attributes to relations, at least:
wasGeneratedBy: prov:time, prov:role
used: prov:time, prov:role
wasInvalidatedBy: prov:time
Deploy to AWS
Add a CITATION file in Citation File Format (CFF).
Basic information retrieval using GitLab API. Use data to fill in data classes defined by gitlab2prov. Should be interchangeable with different data sources (APIs). (Hint: github API)
To retrieve the files that have been used in a certain commit, we have to request the diff of the commit in question. The GitLab API does this by synchronously sending HTTP GET requests. If we have to fetch a large amount of diffs, this will take a while. To speed this up, we can exploit the rate limit of the GitLab Instance by asynchronously sending requests. The default rate limit is set at 10 requests/second which is roughly 3-4 times faster than waiting for synchronous requests.
Python modules:
For scalability each PROV graph should be stored in a single/new instance of Neo4j 4.x.
Prerequisite is issue DLR-SC/prov-db-connector#86 for support of Neo4j Fabric.
On gitlab.com, the same user appears twice in the provenance graph.
For example at repo https://gitlab.com/onyame/provtest1 the result after giving a "Thumbs Up" (award_emoji event):
provtest1-step3.pdf
Error at Lint check:
./gl2p/register.py:35:14: F821 undefined name 'Union'
commits: Union[Dict[str, Any], List[Any]]
^
./gl2p/register.py:36:12: F821 undefined name 'Union'
diffs: Union[Dict[str, Any], List[Any]]
^
./gl2p/models.py:118:35: F821 undefined name 'List'
events = resource.events # type: List[ResourceEvent]
^
./gl2p/models.py:118:35: F821 undefined name 'ResourceEvent'
events = resource.events # type: List[ResourceEvent]
^
F821 undefined name 'List'
Non-PROV attribute names are prepended with the name of the resource that they describe.
To shorten attribute keys and to achieve uniformity for the same keys, we should remove the resource name from keys.
Attributes should still be distinguishable, as they are attached to resources that still carry a prov:type.
In short, attributes such as tag_message on the resource tag and commit_message on the resource commit convey the same thing and therefore shouldn't be named differently.
Examples include:
IS | SHOULD BE |
---|---|
commit_id | id |
file_path_at_addition | path_at_addition |
release_description | description |
tag_message | message |
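The renaming in the table could be sketched as a single helper that strips the resource name from each key. The function name `strip_resource_prefix` is hypothetical, and keys without the prefix (such as prov:type) are left untouched.

```python
def strip_resource_prefix(resource: str, attributes: dict) -> dict:
    """Remove a leading '<resource>_' from every attribute key."""
    prefix = f"{resource}_"
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in attributes.items()
    }


commit_attributes = {
    "commit_id": "abc123",
    "commit_message": "Fix typo",
    "prov:type": "commit",
}
print(strip_resource_prefix("commit", commit_attributes))
```

If an unprefixed key with the same name already exists, one of the two values is overwritten, so a real implementation would have to check for such collisions.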
Multiple sources with potentially different structured data have to be added. This calls for a unified interface of simple dataclasses. The process of adding data sources would then simplify to populating these classes with the available data. The underlying translation process to PROV vocabulary would not need to be changed.
The API client does not request more than the first page for some resources from projects hosted at GitLab.com.
The pagination approach relies on the key x-total or x-total-pages being present in the response headers.
The GitLab API docs state the following concerning x-total and x-total-pages:
For performance reasons, if a query returns more than 10,000 records, GitLab doesn't return the following headers:
x-total.
x-total-pages.
rel="last" link.
If both keys are missing from the response headers, gitlab2prov naively assumes that there is only one page of the requested resource.
Not all GitLab.com projects are affected. Updates will follow.
See also this section in the official GitLab documentation.
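One header-independent approach, sketched here as an assumption rather than as the fix that was eventually chosen: keep requesting pages until an empty page comes back. `fetch_page` is a hypothetical callable that returns the decoded list of items for a given page number and page size.

```python
def fetch_all_pages(fetch_page, per_page=50):
    """Collect items page by page until an empty page is returned."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        if not batch:
            return items
        items.extend(batch)
        page += 1
```

This costs one extra request per resource compared to reading x-total-pages, but it does not depend on headers that GitLab may omit.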
Resources used to gain a foothold in PROV:
It is recommended to start with the PROV-Overview as it leads into the broader document structure of the PROV document stack. It is also recommended to follow the PROV document roadmap when first digging into PROV. In my experience this holds true.
Although I found the initial introduction to the concept of provenance to be a bit short, there is also this extensive version in the Provenance XG Final Report.
Collection of documents that I expect to be useful in the future.
Docker image for deployment, see #20
Add support for generated tags and releases.
For example: https://gitlab.com/lucaapp/android/-/releases
Currently, for commit actions that change a file entity there is a wasDerivedFrom relation between the current and the new revision. This should be changed to the wasRevisionOf relation.
Implement live update capabilities using gitlab project webhooks. TBD
Basic data retrieval using GitHub API. Use gitlab2prov defined dataclasses such that the translation of dataclasses to PROV vocabulary only has to be written once. Interchangeable for other data sources. (APIs)
ValueError: invalid literal for int() with base 10: '7474:7687'
if config.ini contains
[NEO4J]
host = localhost:7474
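A sketch of a more forgiving way to read the host option, assuming the value may or may not carry a port. The helper name `split_host_port` and the fallback port 7687 (taken from the error message above) are assumptions, and IPv6 literals are not handled.

```python
def split_host_port(value: str, default_port: int = 7687):
    """Split 'host:port' into (host, port); fall back to the default port."""
    host, separator, port = value.rpartition(":")
    if not separator:
        # No colon in the value, so it is a bare host name.
        return value, default_port
    return host, int(port)


print(split_host_port("localhost:7474"))
print(split_host_port("localhost"))
```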
Mypy doc on stub files.
https://mypy.readthedocs.io/en/stable/stubs.html
Create stub files for the third-party libraries in use that do not provide their own types.
Create stub files for gitlab2prov code to allow projects that import modules from gitlab2prov to also import their type signatures.
Sub-Issue of #26