
kit-data-manager / base-repo


Basic repository service used as central component of the KIT DM 2.0 infrastructure.

Home Page: https://kit-data-manager.github.io/webpage/base-repo/index.html

License: Apache License 2.0

Java 98.16% Dockerfile 0.43% Shell 1.40%
research-data-management research-data-repository datacite

base-repo's People

Contributors

dependabot[bot], github-actions[bot], pjokit, thomasjejkal, volkerhartmann


base-repo's Issues

Data Versioning Support

In addition to the versioning of metadata, versioning of data might also be desirable in the future. Therefore, existing approaches should be evaluated and implemented if possible. Furthermore, while evaluating data versioning approaches, ways to reduce the amount of stored data, e.g. by storing differential information, should be evaluated as well.

Datacite input not working

Describe the bug
Providing official datacite metadata for POST /api/v1/dataresources/ is not working properly. The interface expects content type application/json+datacite, whereas the converter is triggered on application/datacite+json.

Furthermore, the datacite format returned by DataCite Search has changed, such that the Jolt conversion has to be updated.
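The two content types in this report contain the same tokens in a different order, which makes the mismatch easy to overlook. The following is a minimal Python sketch, not base-repo code, that compares the two strings quoted above; the helper names `normalize_media_type` and `same_media_type` are hypothetical.

```python
def normalize_media_type(value: str) -> str:
    """Lower-case a media type and drop parameters such as charset."""
    return value.split(";", 1)[0].strip().lower()


def same_media_type(a: str, b: str) -> bool:
    """Compare two media types, ignoring case and parameters."""
    return normalize_media_type(a) == normalize_media_type(b)


# Values quoted in the issue text
EXPECTED_BY_INTERFACE = "application/json+datacite"
HANDLED_BY_CONVERTER = "application/datacite+json"

# The two strings differ, so a request accepted by the interface
# would not reach a converter registered for the other type.
print(same_media_type(EXPECTED_BY_INTERFACE, HANDLED_BY_CONVERTER))
```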

OpenAPI documentation misleading/incomplete

Describe the bug
According to the OpenAPI documentation, the endpoints

  • /api/v1/dataresources/{id}
  • /api/v1/dataresources/{id}/data/**

are only related to obtaining audit information, which is wrong. Actually, there are different results possible for both endpoints depending on content negotiation. This should also be reflected in the OpenAPI documentation.

Improper dealing with escaped slashes in request URL

Describe the bug
If a user creates a resource giving a PID as its primary identifier, this PID is also used to access the resource (in comparison to not providing a PID, where an internally created UUID is mainly used). For uploading content or editing the resource, the escaped PID becomes part of the request URL. For base-repo <=1.4.0 accessing such resources will result in a misleading CORS error message and the resource is not accessible for write operations.

Expected behavior
The request should be handled properly, with the PID being unescaped only inside base-repo.
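To illustrate what "escaped PID as part of the request URL" means on the client side, here is a small Python sketch. The PID value is made up, the base URL reuses the host and port from the curl examples of other issues, and `resource_url` is a hypothetical helper.

```python
from urllib.parse import quote, unquote

# Assumed base URL, taken from the curl examples in other issues
BASE_URL = "http://localhost:8090/api/v1/dataresources/"


def resource_url(identifier: str) -> str:
    """Build the resource URL; safe="" forces "/" to be encoded as %2F."""
    return BASE_URL + quote(identifier, safe="")


pid = "10.1234/example"  # hypothetical PID containing a slash
url = resource_url(pid)
print(url)

# The server side would have to reverse the escaping exactly once
last_segment = url.rsplit("/", 1)[1]
print(unquote(last_segment))
```

Without `safe=""`, `quote` leaves the slash untouched and the PID would be split into two path segments.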

Creator can be deleted when reusing its ID

When creating a new data resource an already existing creator can be removed from an old resource by reusing a creator id again.

To Reproduce

  1. Create a new data resource, using this documentation.
  2. Note down the id of the newly created resource and the id the database assigned to the creator.
  3. Create another new data resource according to the documentation and set id of the creator to the one from the previous step.
  4. Request the resource created in step 1.

Now the resource doesn't have a creator anymore.

Expected behavior
Either the creator with the same id gets updated and the newly created resource gets assigned to it, or setting an id in the request isn't allowed.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04.2 LTS
  • Browser: curl ;)
  • Version: 7.81.0

Some tables in documentation broken

There is something wrong with some table columns in the documentation. They should have three columns, but only have two and the values of the third column are pushed into the two columns.

If versioning is not enabled locationUri shouldn't contain version.

Tested with docker kitdm/base-repo:latest (Version 1.1.0?) with default configuration (without versioning).
After an ingest, the service returns a locationUri with a version query parameter,
e.g.: http://baseRepo:8080/api/v1/dataresources/823ac31e-5a8e-43e9-a7f5-d18df326c820/data/randomFile.txt?version=1
Accessing the resource via this locationUri ends with a 404.
Without the query parameter it works fine.
If versioning is disabled, locationUri shouldn't contain the version query parameter.
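Until this is fixed, a client could strip the parameter itself before following the link. The following Python sketch shows one way to do that; `drop_version_parameter` is a hypothetical helper, not something base-repo provides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_version_parameter(location_uri: str) -> str:
    """Return the URI without its 'version' query parameter."""
    parts = urlsplit(location_uri)
    # Keep every query parameter except 'version'
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "version"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# URI quoted in the issue text
uri = (
    "http://baseRepo:8080/api/v1/dataresources/"
    "823ac31e-5a8e-43e9-a7f5-d18df326c820/data/randomFile.txt?version=1"
)
print(drop_version_parameter(uri))
```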

Search for state REVOKED returns VOLATILE and FIXED

When searching for state REVOKED the results contain resources with state VOLATILE and FIXED but not with state REVOKED.

To Reproduce

Change the state of a resource to REVOKED or create a new resource like so:

curl 'http://localhost:8090/api/v1/dataresources/' -i -X POST \
    -H 'Content-Type: application/json' \
    -d '{
            "creators" : [{
            "familyName" : "Doe",
            "givenName" : "John"
        }],
        "titles" : [{
            "value" : "Most basic resource for testing"
        }],
        "resourceType" : {
            "value" : "testingSample",
            "typeGeneral" : "DATASET"
        },
        "state" : "REVOKED"
    }'

The response should confirm that a resource with state == REVOKED exists.

Now search for resources with state == REVOKED like so:

curl 'http://localhost:8090/api/v1/dataresources/search' -i -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "state" : "REVOKED"
    }'

The result contains all (as far as they exist) resources with state VOLATILE as well as with state FIXED.

On the other hand, searches for state VOLATILE, FIXED or GONE return resources where the state is set as expected.

Expected behavior
Search for state == REVOKED only returns resources with appropriate state.
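The expectation can be checked on the client side with a few lines of Python. This is only a sketch: it assumes the search response has been parsed into a list of dictionaries that carry a "state" key like the request body above, and the sample data is invented.

```python
def resources_with_wrong_state(resources, expected_state):
    """Return all resources whose 'state' differs from the expected one."""
    return [r for r in resources if r.get("state") != expected_state]


# Invented sample resembling the faulty result described in this issue
search_result = [
    {"id": "a", "state": "VOLATILE"},
    {"id": "b", "state": "FIXED"},
]

wrong = resources_with_wrong_state(search_result, "REVOKED")
print([r["id"] for r in wrong])
```

An empty list would mean that the search honoured the requested state.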

Version
Tested version 1.4.1-SNAPSHOT as well as 1.5.4-SNAPSHOT with no obvious difference.

weak search performance under certain conditions

Describe the bug
Under certain conditions, search performance using the '/api/v1/dataresources/search' endpoint is unexpectedly slow. Mainly when the resourceType attribute is included, queries may take about 10 times longer. Because queries are relatively fast in general, this effect only becomes relevant for very large repositories (spotted at 84k DataResources).

To Reproduce
Steps to reproduce the behavior:

  1. Use search endpoint, e.g., http://{{HOSTNAME}}:{{PORT}}/api/v1/dataresources/search
  2. Provide a (valid) example DataResource including resourceType attribute in the request body, e.g.
{
   "resourceType":{
       "typeGeneral":"TEXT"
   },
   "publicationYear":"2019"
}
  3. Measure the response time, e.g., 2 seconds
  4. Change the example DataResource to include a full resourceType, e.g.
{
    "resourceType":{
        "typeGeneral":"TEXT",
        "value":"manuscriptMetadata"
    },
    "publicationYear":"2019"
}
  5. Measure the response time, e.g., 20 seconds
  6. Change the example DataResource again to include another top-level, primitive attribute, e.g., publisher
  7. Measure the response time, e.g., 2 seconds

Expected behavior
The expected behaviour is, that the response time does not change within an entire order of magnitude. Including resourceType will require a more complex query than querying only for, e.g., publisher and publicationYear. However, the impact should be much less and it should not depend on the number of primitive attributes within the query (see difference in measurement 5 and 7).
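For a more reproducible comparison than single measurements in a REST tool, the timing could be scripted. The Python sketch below only provides the measuring part; `median_seconds` and `slowdown_factor` are hypothetical helpers, and the callable that actually sends a request body to the search endpoint is left to the reader.

```python
import statistics
import time


def median_seconds(action, repeats=5):
    """Run a zero-argument callable several times, return the median duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def slowdown_factor(baseline_seconds, candidate_seconds):
    """How many times slower the candidate is compared to the baseline."""
    return candidate_seconds / baseline_seconds


# Request bodies from the reproduction steps above
partial_type = {"resourceType": {"typeGeneral": "TEXT"}, "publicationYear": "2019"}
full_type = {
    "resourceType": {"typeGeneral": "TEXT", "value": "manuscriptMetadata"},
    "publicationYear": "2019",
}

# With the example timings from this report (2 s versus 20 s)
print(slowdown_factor(2.0, 20.0))
```

Using the median of several runs reduces the influence of a single slow request, e.g. caused by a cold cache.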

Screenshots
none

Desktop (please complete the following information):

  • OS: MacOS, Windows, Unix
  • REST Tool: Postman
  • Version 7.36.0

Additional context
none

Add ro-crate support for exporting (and importing?) Data Resources

Is your feature request related to a problem? Please describe.
Currently, base-repo allows downloading data, optionally in a zip file. Metadata can also be accessed via RESTful endpoints, but is kept separate from downloaded data. There is no way to download a self-describing package of an entire Data Resource.

Describe the solution you'd like
Providing support for RO-Crates could tackle this issue. Accessing a Data Resource with a special content type, which produces an RO-Crate for the entire Data Resource, might be helpful, e.g., for archiving contents. Optionally, RO-Crates could also be used for importing Data Resources, e.g., for migration.
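As a rough idea of what such an export would have to contain, the Python sketch below builds a minimal ro-crate-metadata.json following the RO-Crate 1.1 layout (metadata descriptor, root Dataset, File entities). It is not a proposal for the actual mapping from Data Resources: `build_ro_crate_metadata` is a hypothetical helper, the title and file name are reused from other issues on this page, and further properties recommended for the root Dataset (e.g. license, datePublished) are omitted.

```python
import json


def build_ro_crate_metadata(name, file_paths):
    """Build a minimal RO-Crate 1.1 metadata document as a dictionary."""
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "hasPart": [{"@id": path} for path in file_paths],
    }
    files = [{"@id": path, "@type": "File"} for path in file_paths]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root] + files,
    }


crate = build_ro_crate_metadata("Most basic resource for testing", ["randomFile.txt"])
print(json.dumps(crate, indent=2))
```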

Unordered array content on PATCH

Maybe it's a problem on layer 8, but it seems that the order of array content is non-deterministic after a patch, even when using the same index multiple times.

To Reproduce

Create a new resource like so:

curl 'http://localhost:8090/api/v1/dataresources/' -i -X POST \
    -H 'Content-Type: application/json' \
    -d '{
            "creators" : [{
            "familyName" : "Doe",
            "givenName" : "John",
            "affiliations" : [ "test1", "test2" ]
        }],
        "titles" : [{
            "value" : "Most basic resource for testing"
        }],
        "resourceType" : {
            "value" : "testingSample",
            "typeGeneral" : "DATASET"
        }
    }'

This results in a first inconsistency, as affiliations will be ["test2","test1"].

Now add another entry to the array (replace ID and ETAG with values of the step before):

curl 'http://127.0.0.1:8090/api/v1/dataresources/<ID>' -i -X PATCH \
    -H 'Content-Type: application/json-patch+json' \
    -H 'If-Match: <ETAG>' \
    -d '[{
        "op" : "add",
        "path" : "/creators/0/affiliations/0",
        "value" : "test3"
    }]'

Fetch the changed resource:

curl 'http://127.0.0.1:8090/api/v1/dataresources/<ID>' -i -X GET

Now affiliations is ["test2","test3","test1"], the insertion happened between elements 0 and 1.

Repeating the previous steps (patch, get), results in:

  • ["test4","test2","test3","test1"], this is the only result that is as expected (at least for my understanding).
  • ["test4","test5","test2","test3","test1"], again between elements 0 and 1.
  • ["test4","test5","test2","test3","test6","test1"], insertion between elements 3 and 4.

Expected behavior

Adding [ "test1", "test2" ] leads to [ "test1", "test2" ] added to resource.

Adding an entry respects the path given, for example
{ "affiliations": [ "test1", "test2" ] }
with patch:
[{ "op": "add", "path": "/affiliations/0", "value": "test3" }]
leads to:
{ "affiliations": [ "test3", "test1", "test2" ] }
whereas patch:
[{ "op": "add", "path": "/affiliations/1", "value": "test3" }]
leads to:
{ "affiliations": [ "test1", "test3", "test2" ] }
and patch:
[{ "op": "add", "path": "/affiliations/2", "value": "test3" }]
as well as patch:
[{ "op": "add", "path": "/affiliations/-", "value": "test3" }]
leads to:
{ "affiliations": [ "test1", "test2", "test3" ] }
according to Specification:

An element to add to an existing array - whereupon the supplied value is added to the array at the indicated location. Any elements at or above the specified index are shifted one position to the right.
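The quoted rule can be written down in a few lines of Python. This is a reference sketch of the "add" operation on an array as described in the specification text above, not the implementation used by base-repo; `json_patch_add_to_array` is a hypothetical helper that receives only the last token of the path.

```python
def json_patch_add_to_array(array, index_token, value):
    """Return a new list with value added at the position given by index_token.

    index_token is the last path segment: a decimal index or "-" for "append".
    """
    index = len(array) if index_token == "-" else int(index_token)
    if index < 0 or index > len(array):
        raise IndexError(f"index {index_token!r} is out of range")
    # Elements at or above the index are shifted one position to the right
    return array[:index] + [value] + array[index:]


affiliations = ["test1", "test2"]
print(json_patch_add_to_array(affiliations, "0", "test3"))
print(json_patch_add_to_array(affiliations, "1", "test3"))
print(json_patch_add_to_array(affiliations, "-", "test3"))
```

Since the function never reorders existing elements, repeated calls with index "0" always prepend, which is the behaviour expected in this report.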

Version
Tested version 1.4.1-SNAPSHOT as well as 1.5.4-SNAPSHOT with no obvious difference.

Bump CodeQL from v1 to v2.

Is your feature request related to a problem? Please describe.
Use semantic code analysis engine to find security vulnerabilities.

Describe the solution you'd like
Add this file to your workflow:
.github/workflows/codeql-analysis.yml

# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL"

on:
  push:
    branches: [ main ]
  pull_request:
    # The branches below must be a subset of the branches above
    branches: [ main ]
  schedule:
    - cron: '29 1 * * 3'

jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
      security-events: write

    strategy:
      fail-fast: false
      matrix:
        language: [ 'java', 'javascript' ]
        # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ]
        # Learn more about CodeQL language support at https://git.io/codeql-language-support

    steps:
    - name: Checkout repository
      uses: actions/checkout@v3

    # Initializes the CodeQL tools for scanning.
    - name: Initialize CodeQL
      uses: github/codeql-action/init@v2
      with:
        languages: ${{ matrix.language }}
        # If you wish to specify custom queries, you can do so here or in a config file.
        # By default, queries listed here will override any specified in a config file.
        # Prefix the list here with "+" to use these queries and those in the config file.
        # queries: ./path/to/local/query, your-org/your-repo/queries@main

    # Autobuild attempts to build any compiled languages  (C/C++, C#, or Java).
    # If this step fails, then you should remove it and run the build manually (see below)
    - name: Autobuild
      uses: github/codeql-action/autobuild@v2

    # ℹ️ Command-line programs to run using the OS shell.
    # 📚 https://git.io/JvXDl

    # ✏️ If the Autobuild fails above, remove it and uncomment the following three lines
    #    and modify them (or add more) to build your code if your project
    #    uses a compiled language

    #- run: |
    #   make bootstrap
    #   make release

    - name: Perform CodeQL Analysis
      uses: github/codeql-action/analyze@v2

Evaluate Caching Support

With an increasing size of a repository installation and an increasing number of users, caching might become relevant. Therefore, caching mechanisms available via Spring Boot should be evaluated and supported if possible.

Swagger not accessible

Describe the bug
Accessing the API Docs via /swagger-ui/index.html returns HTTP 404, while accessing the API docs via /v3/api-docs works fine.

To Reproduce
Start base-repo 1.5.4 and navigate to http://<host>:<port>/swagger-ui/index.html

Expected behavior
The UI should be visible.

Properties validation not working

Describe the bug
It seems that, since one of the last dependency updates, property validation no longer works as expected. This can be spotted in ApplicationProperties.java with 'basepath'. According to the annotation, it should be validated to be a folder, and if the folder does not exist, it should be created (implemented in service-base). This no longer seems to work, as even a malformed URL is accepted.

Elastic search proxy at /api/v1/search not working

Describe the bug
With the update to Spring Boot 3.x, changes in the ProxyExchange implementation lead to a wrong Content-Length header being submitted to Elastic, which causes all requests to fail as incomplete (see Issue #3154: https://github.com/spring-cloud/spring-cloud-gateway/issues/3154).

As a workaround, the property 'spring.cloud.gateway.proxy.sensitive=content-length' should be added to application.properties.

In the next release, this property will either be part of the default application.properties or the issue will be solved in the code.

Codecov badge is broken

Describe the bug
The badge for codecov says 'unknown'.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'https://github.com/kit-data-manager/base-repo'
  2. Look at the Codecov badge inside README.md

Expected behavior
The current code coverage in % should be displayed on the badge.

Describe the solution you'd like
Markdown should look like this:
[![Codecov](https://codecov.io/gh/kit-data-manager/base-repo/graph/badge.svg)](https://codecov.io/gh/kit-data-manager/base-repo)

Implement PID support

Configurable PID support will be required in the future in order to cover the Findability aspect of FAIR. For technical reasons, the first supported PID provider should be handle.net.

Listing DataResource Content Information

When listing the data elements registered for a specific data resource via curl (http://localhost:8080/static/docs/documentation.html#_listing_content_information) and only the path up to /data is used, the service returns all elements of the first level below /data. However, if multiple directory levels are registered (i.e. /data/1level/2level), entries deeper than the first level (1level in this case) are not returned. To get the content information for these lower directory levels, one needs to provide the full path. Since it might be unknown beforehand which directories and how many levels are registered in a data resource, it might be more convenient to return all directories (all data elements) when the data resource is queried for content information using only the path up to /data.

Hateoas links broken

Due to an issue in the service-base library, navigation links returned in response headers are not working.

Mediatype detection issue not solved

Describe the bug
According to the Changelog of v1.3.0, detecting the mediaType of content should work. Due to a missing update of an internal library, this change is not part of v1.3.0 and should be fixed in the next version.

fileVersion seems to be based on version

When uploading a new version of a file, fileVersion jumps to the current version (or previous version + 1). Maybe this is intentional but at first glance it seems to be a bug.

To Reproduce

  1. Upload a file to a resource (replace ID with appropriate value):
    curl 'http://127.0.0.1:8090/api/v1/dataresources/<ID>/data/test.txt' -i -X POST -H 'Content-Type: multipart/form-data' -F '[email protected];type=multipart/form-data'

  2. Patch the metadata of the file (replace ETAG with value from previous response):

curl 'http://127.0.0.1:8090/api/v1/dataresources/<ID>/data/test.txt' -i -X PATCH \
    -H 'Content-Type: application/json-patch+json' \
    -H 'If-Match: "<ETAG>"' \
    -d '[ {
  "op" : "add",
  "path" : "/tags/-",
  "value" : "test"
} ]'
  3. Check version and fileVersion using a get request:

curl 'http://127.0.0.1:8090/api/v1/dataresources/<ID>/data/test.txt' -i -X GET -H 'Accept: application/vnd.datamanager.content-information+json'

Result will be: version = 2 and fileVersion = 1

  4. Repeat step 1

  5. Repeat step 3

Result will be: version = 3 and fileVersion = 3

Expected behavior

fileVersion only increments by one when uploading a new version of the file.

In the case mentioned previously result would have been: version = 3 and fileVersion = 2.
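The expected counting can be modelled with two independent counters. The Python sketch below is a simplified model of the behaviour requested in this issue, not base-repo code; the class `ContentCounters` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContentCounters:
    """Simplified model: version counts every change, file_version only uploads."""

    version: int = 0
    file_version: int = 0

    def upload(self):
        # A new file upload changes both the entry and the file itself
        self.version += 1
        self.file_version += 1

    def patch_metadata(self):
        # A metadata patch changes the entry only
        self.version += 1


counters = ContentCounters()
counters.upload()          # step 1
counters.patch_metadata()  # step 2
counters.upload()          # step 4
print(counters.version, counters.file_version)
```

After the sequence upload, patch, upload, this model ends at version = 3 and fileVersion = 2, matching the expectation stated above.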

Version (output of actuator/info)

{
"git": {
  "branch": "main",
  "commit": {
    "id": "4e90c6a",
    "time": "2023-07-31T11:21:49Z"
  }
  },
  "build": {
    "artifact": "base-repo",
    "name": "base-repo",
    "time": "2023-12-13T15:17:52.545Z",
    "version": "1.4.1-SNAPSHOT"
  }
}
