A simple REST API to expose Solr view and download statistics for items in a DSpace repository.

License: GNU General Public License v3.0


DSpace Statistics API

DSpace stores item view and download events in a Solr "statistics" core. This information is available for use in the various DSpace user interfaces, but is not exposed externally via any APIs. The DSpace 4/5/6 REST API, for example, only exposes metadata about communities, collections, items, and bitstreams.

This project contains an indexer and a Falcon-based web application to make the item, community, and collection statistics available via a simple REST API. You can read more about the Solr queries used to gather the item view and download statistics on the DSpace wiki.

If you use the DSpace Statistics API please cite:

Orth, A. 2018. DSpace statistics API. Nairobi, Kenya: ILRI. https://hdl.handle.net/10568/99143.

Requirements

Installation

Create a Python virtual environment and install the dependencies:

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Running

Set up the environment variables for Solr and PostgreSQL:

$ export SOLR_SERVER=http://localhost:8080/solr
$ export DATABASE_NAME=dspacestatistics
$ export DATABASE_USER=dspacestatistics
$ export DATABASE_PASS=dspacestatistics
$ export DATABASE_HOST=localhost

Index the Solr statistics core to populate the PostgreSQL database:

$ python -m dspace_statistics_api.indexer

Run the REST API:

$ gunicorn dspace_statistics_api.app

Test to see if there are any statistics:

$ curl 'http://localhost:8000/items?limit=1'
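The same check can be scripted. The following is a minimal Python sketch, not part of the project: the helper names items_url and fetch_items are hypothetical, and the base URL assumes the API is reachable on localhost port 8000, as in the curl command above. It validates limit and page against the ranges documented for GET /items before building the URL.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def items_url(base_url, limit=100, page=0):
    """Build a GET /items URL (hypothetical helper, not part of the project)."""
    # Ranges from the endpoint description: 1 <= limit <= 100 and page >= 0.
    if not 1 <= limit <= 100:
        raise ValueError("limit must be an integer between 1 and 100")
    if page < 0:
        raise ValueError("page must be an integer greater than or equal to 0")
    query = urlencode({"limit": limit, "page": page})
    return f"{base_url.rstrip('/')}/items?{query}"


def fetch_items(base_url, limit=100, page=0):
    """Request one page of item statistics and decode the JSON response."""
    with urlopen(items_url(base_url, limit, page)) as response:
        return json.load(response)
```

Calling fetch_items("http://localhost:8000", limit=1) would then perform the same request as the curl command.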

Testing

Install development packages using pip:

$ pip install -r requirements-dev.txt

Run tests:

$ pytest

Deployment

There are example systemd service and timer units in the contrib directory. The API service listens on localhost by default so you will need to expose it publicly using a web server like nginx.

An example nginx configuration is:

server {
    #...

    location ~ /rest/statistics/?(.*) {
        access_log /var/log/nginx/statistics.log;
        proxy_pass http://statistics_api/$1$is_args$args;
    }
}

upstream statistics_api {
    server 127.0.0.1:5000;
}

This would expose the API at /rest/statistics.
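To make the rewrite concrete, here is a small Python sketch that mimics how the location block above maps a public request onto the upstream request. It only illustrates the regular expression and the $1$is_args$args substitution; it is not how nginx is implemented, and the function name upstream_url is made up for this example.

```python
import re

# Same pattern as the nginx location block. nginx matches it against the
# request path only; the query string is handled separately.
LOCATION = re.compile(r"/rest/statistics/?(.*)")


def upstream_url(path, query=""):
    """Return the URL the request would be proxied to, or None if no match."""
    match = LOCATION.search(path)
    if match is None:
        return None
    # $is_args is "?" when the request has a query string, otherwise empty.
    is_args = "?" if query else ""
    return f"http://statistics_api/{match.group(1)}{is_args}{query}"
```

Under these assumptions, a request for /rest/statistics/items with the query string limit=1 is forwarded as /items?limit=1.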

Using the API

The API exposes the following endpoints:

  • GET / — return a basic API documentation page.
  • GET /items — return views and downloads for all items that Solr knows about¹. Accepts limit and page query parameters for pagination of results (limit must be an integer between 1 and 100, and page must be an integer greater than or equal to 0).
  • POST /items — return views and downloads for an arbitrary list of items with an optional date range. Accepts limit, page, dateFrom, and dateTo parameters².
  • GET /item/id — return views and downloads for a single item (id must be a UUID). Returns HTTP 404 if an item id is not found.
  • GET /communities — return views and downloads for all communities that Solr knows about¹. Accepts limit and page query parameters for pagination of results (limit must be an integer between 1 and 100, and page must be an integer greater than or equal to 0).
  • POST /communities — return views and downloads for an arbitrary list of communities with an optional date range. Accepts limit, page, dateFrom, and dateTo parameters².
  • GET /community/id — return views and downloads for a single community (id must be a UUID). Returns HTTP 404 if a community id is not found.
  • GET /collections — return views and downloads for all collections that Solr knows about¹. Accepts limit and page query parameters for pagination of results (limit must be an integer between 1 and 100, and page must be an integer greater than or equal to 0).
  • POST /collections — return views and downloads for an arbitrary list of collections with an optional date range. Accepts limit, page, dateFrom, and dateTo parameters².
  • GET /collection/id — return views and downloads for a single collection (id must be a UUID). Returns HTTP 404 if a collection id is not found.

The id is the internal UUID for an item, community, or collection. You can get these from the standard DSpace REST API.

¹ We are querying the Solr statistics core, which technically only knows about items, communities, or collections that have either views or downloads. If an item, community, or collection is not present here you can assume it has zero views and zero downloads, but not necessarily that it does not exist in the repository.

² POST requests to /items, /communities, and /collections should be in JSON format with the following parameters (for communities or collections, replace the "items" list with a "communities" or "collections" list accordingly):

{
    "limit": 100, // optional, integer between 1 and 100, default 100
    "page": 0, // optional, integer greater than or equal to 0, default 0
    "dateFrom": "2020-01-01T00:00:00Z", // optional, default *
    "dateTo": "2020-09-09T00:00:00Z", // optional, default *
    "items": [
        "f44cf173-2344-4eb2-8f00-ee55df32c76f",
        "2324aa41-e9de-4a2b-bc36-16241464683e",
        "8542f9da-9ce1-4614-abf4-f2e3fdb4b305",
        "0fe573e7-042a-4240-a4d9-753b61233908"
    ]
}
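Note that the // comments in the example above are annotations only: JSON has no comment syntax, so a real request body must leave them out. A minimal Python sketch of preparing such a request with the standard library follows. The helper name build_items_request is hypothetical, and the base URL passed to it is an assumption about where the API is running.

```python
import json
from urllib.request import Request


def build_items_request(base_url, items, date_from=None, date_to=None, limit=100, page=0):
    """Prepare a POST /items request (hypothetical helper, not part of the project)."""
    body = {"limit": limit, "page": page, "items": list(items)}
    # The date range is optional; keys left out fall back to the API defaults.
    if date_from is not None:
        body["dateFrom"] = date_from
    if date_to is not None:
        body["dateTo"] = date_to
    return Request(
        f"{base_url.rstrip('/')}/items",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen to send the request. For communities or collections, the path and the list key would change accordingly.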

TODO

  • Better logging
  • Version API (or at least include a /version endpoint?)
    • Probably use /status with a version in the response
  • Use JSON in PostgreSQL
  • Add top items endpoint, perhaps /top/items or /items/top?
    • Actually we could add /items?limit=10&sort=views

License

This work is licensed under the GPLv3.

The license allows you to use and modify the work for personal and commercial purposes, but if you distribute the work you must provide users with a means to access the source code for the version you are distributing. Read more about the GPLv3 at TL;DR Legal.

dspace-statistics-api's People

Contributors

alanorth, renovate[bot]

dspace-statistics-api's Issues

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Detected dependencies

droneci
.drone.yml
  • postgres 15-alpine
  • python 3.10-slim
  • postgres 15-alpine
  • postgres 15-alpine
  • python 3.9-slim
  • postgres 15-alpine
  • postgres 15-alpine
  • postgres 15-alpine
  • python 3.8-slim
github-actions
.github/workflows/python-app.yml
  • actions/checkout v4
  • actions/setup-python v5
  • postgres 15-alpine
  • ubuntu 22.04
pep621
pyproject.toml
  • poetry >=0.12
poetry
pyproject.toml
  • python ^3.8.1
  • gunicorn ^21.0.0
  • falcon 3.1.3
  • psycopg2 ^2.9.1
  • requests ^2.24.0
  • falcon-swagger-ui falcon3-update-swagger-ui
  • black ^23.0.0
  • fixit ^2.1.0
  • flake8 ^7.0.0
  • isort ^5.9.1
  • pytest ^7.0.0

KeyError: 'stats'

@alanorth, when I tried running python -m dspace_statistics_api.indexer, I received this error:

  File "C:\Users\euler\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\euler\.pyenv\pyenv-win\versions\3.7.9\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\dspace-statistics-api\dspace_statistics_api\indexer.py", line 223, in <module>
    index_views("items", "id")
  File "D:\dspace-statistics-api\dspace_statistics_api\indexer.py", line 56, in index_views
    results_totalNumFacets = res.json()["stats"]["stats_fields"][facetField][
KeyError: 'stats'

I tried this in a repository with no shards, and another with sharded statistics. Both repositories are using DSpace version 6.3 running on Windows 2019 Server and tested with Python versions 3.7.9, 3.9.1, and 3.9.10. What could I be missing?

Add tests

Perhaps I could learn from the tests in responder. Testing locally would work because a PostgreSQL and Solr server could be available. I'm not sure how to test on a remote CI environment, though.

KeyError missing facet_counts while running indexer

I'm not sure how this scenario arises, but if there are no facet_counts in a page of the results, then the indexing process dies with the following error:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/ilri/dspace-statistics-api/dspace_statistics_api/indexer.py", line 239, in <module>
    index_views("items", "id")
  File "/ilri/dspace-statistics-api/dspace_statistics_api/indexer.py", line 114, in index_views
    views = res.json()["facet_counts"]["facet_fields"]
KeyError: 'facet_counts'

It could be due to so many unmigrated IDs or something else, but I wonder if it makes sense to check for the key in the json response before referencing it on https://github.com/ilri/dspace-statistics-api/blob/v6_x/dspace_statistics_api/indexer.py#L114 and https://github.com/ilri/dspace-statistics-api/blob/v6_x/dspace_statistics_api/indexer.py#L196.
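A defensive lookup along those lines might look like the sketch below. It is illustrative only and not the project's code: the function name facet_fields is hypothetical, and payload stands for the dictionary returned by res.json().

```python
def facet_fields(payload):
    """Return payload["facet_counts"]["facet_fields"], or {} if either key is absent."""
    facet_counts = payload.get("facet_counts")
    if not isinstance(facet_counts, dict):
        return {}
    fields = facet_counts.get("facet_fields")
    return fields if isinstance(fields, dict) else {}
```

A caller could then log and skip a page that yields an empty dictionary instead of dying with a KeyError.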

Issue with legacy IDs in Solr statistics on DSpace 6.x

First off, this is a great tool for reviewing DSpace statistics. Thanks for releasing it to the community. I wanted to ask if you have run into issues with DSpace 6.x instances that have been migrated from prior major versions and thus potentially contain non-UUID IDs in the Solr statistics?

After running solr-upgrade-statistics-6x on my instance I was left with some IDs in Solr that couldn't be migrated and thus were labeled "XXXXX-unmigrated". When I run the indexer while on the v6_x branch I see it fails when it comes across an unmigrated ID. So I'm wondering if some sort of UUID validation step would be useful before the calls to update views/downloads statistics in PostgreSQL?
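A validation step of that kind could be as small as the following sketch. The function name is_uuid is hypothetical and this is not the project's code; it simply shows one way to reject IDs such as the "-unmigrated" ones before they reach the database.

```python
import uuid


def is_uuid(value):
    """Return True only for canonical, hyphenated UUID strings."""
    try:
        # uuid.UUID also accepts braces, a "urn:uuid:" prefix, and missing
        # hyphens, so compare against the canonical form to stay strict.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False
```

The indexer could call such a check on each ID returned by Solr and skip the ones that fail.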

DSpace 7 compatibility?

I'm curious if you've had any chance to test some of your Solr queries on a DSpace 7 instance. So far, I've tried only a few of the example views and downloads queries from the documentation (https://wiki.lyrasis.org/display/DSPACE/Solr), but they either return no results or return results that don't make sense.

Fix statistics for sharded Solr cores

A DSpace repository's Solr statistics core may or may not be sharded, depending on whether the site has run the yearly dspace stats-util -s task. If the statistics core is sharded, the view and download statistics returned by the API will be inaccurate because the indexer currently only queries the main statistics core.

To illustrate the problem, here are the results of querying Solr for views of a single item in the statistics and statistics-2018 cores:

$ http 'http://localhost:3000/solr/statistics/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
<result name="response" numFound="33" start="0">
$ http 'http://localhost:3000/solr/statistics-2018/select?indent=on&rows=0&q=type:2+id:11576&fq=isBot:false&fq=statistics_type:view' | grep numFound
<result name="response" numFound="241" start="0">

So we definitely have to handle sharded cores somehow. Perhaps with multicore join queries.
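One simple way to see the combined figure, short of changing the indexer, is to run the same query against every core and add up the counts. The sketch below does that under several assumptions: the helper names are hypothetical, the caller supplies the list of core names, and the query requests wt=json so that the count can be read from response.numFound. It illustrates the problem and is not necessarily the fix the project should adopt.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def core_query_url(solr_url, core, item_id):
    """Build the item view count query for one statistics core."""
    params = [
        ("rows", 0),
        ("wt", "json"),
        ("q", f"type:2 id:{item_id}"),
        ("fq", "isBot:false"),
        ("fq", "statistics_type:view"),
    ]
    return f"{solr_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def fetch_num_found(url):
    """Fetch a Solr JSON response and return its response.numFound value."""
    with urlopen(url) as response:
        return json.load(response)["response"]["numFound"]


def total_views(solr_url, cores, item_id, fetch=fetch_num_found):
    """Sum the view counts for one item across all the given cores."""
    return sum(fetch(core_query_url(solr_url, core, item_id)) for core in cores)
```

With the two counts shown above, 33 and 241, total_views would report 274 views for the item.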
