sorse / sorse-data-filter

License: MIT License

Python 100.00%

sorse-data-filter's Introduction

sorse-data-filter

This project manages the visibility of data and automatises several workflows centred around the abstract submission process of SORSE – A Series of Online Research Software Events. Our aim is to ensure that no connection between personal data and diversity information can be derived from data exported from the Indico event management platform.

The tool takes the data exported from the Indico system as input. Via a command line interface, a user can pick one of the available workflows. For each workflow, the allowed fields from the imported data are specified along with the required output format.

The tool currently supports the following automatised workflows:

website: export of Markdown-formatted data after acceptance to the website
scheduling: export of data to a Google Drive to support the scheduling process

Further workflows are planned including:

rejected: export of information on rejected abstracts
mentoring: export of data to a Google Drive to support the mentoring process
statistics: export of statistical data to a Google Drive

Configuring workflows

The export of data is centred around the concepts of allowing specific data fields and providing formatting with the help of the templating engine Jinja2.

Configuring allowed fields

Allowed fields within a workflow are specified in the workflows.yaml file. For each workflow an entry is created that holds a section for allow_lists. The entries in this field follow the model specified in models and include the contribution itself, persons and questionnaires holding specific information for each contribution type as well as diversity information.
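
As a rough illustration of the allow-list idea, the sketch below keeps only explicitly allowed keys of a nested record. The record layout, the field names and the `filter_allowed` helper are assumptions made for this example; the actual schema is defined by models and workflows.yaml.

```python
from typing import Any, Dict


def filter_allowed(record: Dict[str, Any], allow_list: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of record that contains only the fields named in allow_list.

    allow_list maps a field name either to True (keep the value as is) or to
    a nested allow list (filter the nested dict, or each dict in a list).
    """
    result: Dict[str, Any] = {}
    for key, rule in allow_list.items():
        if key not in record:
            continue
        value = record[key]
        if rule is True:
            result[key] = value
        elif isinstance(value, list):
            result[key] = [filter_allowed(item, rule) for item in value]
        else:
            result[key] = filter_allowed(value, rule)
    return result


# Hypothetical contribution record and allow list, for illustration only.
contribution = {
    "title": "A talk",
    "persons": [{"name": "Jane Doe", "email": "jane@example.org"}],
    "diversity": {"gender": "prefer not to say"},
}
website_allow_list = {"title": True, "persons": {"name": True}}

filtered = filter_allowed(contribution, website_allow_list)
print(filtered)
```

Anything not named in the allow list, such as the e-mail address or the diversity block here, never reaches the output, which is the property the tool aims for.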

Configuring templates

The available templates are contained in the templates folder. For example, the template for the website workflow is templates/website.md. The available templating constructs are described in the Jinja2 documentation. In principle, any text-based file format can be produced with this templating engine.

The name of a template for a given workflow is given via the field output_template in the configuration file.

sorse-data-filter's Issues

Improve wording: allow_lists and deny_lists

I should improve the wording used to filter visible fields and work with allow_lists and deny_lists.

Where to write data for rejected contributions?

I am currently hesitant to export data of rejected contributions into the spreadsheet for scheduling.
We could either

create another spreadsheet or
create another sheet within the spreadsheet.

Export Scheduling Data

Proper scheduling requires proper information on accepted abstracts so that we can design an inclusive and effective programme for SORSE.

Create allow_list for scheduling following defined data visibility
Export data to Google Drive

My current plan for scheduling is to create a document or spreadsheet in the Google Drive. We might consider having one document for all tracks and all submission rounds, appending new information to the end of the document. However, I am still unsure how this would work best.

Create links in metadata

As mentioned by @Chilipp, links in the website metadata should be replaced by their corresponding HTML links.

instructions: "<a href='https://www.sphinx-doc.org/en/master/usage/installation.html#installation-from-pypi'>https://sphinx-doc.org</a>"

Originally posted by @Chilipp in SORSE/sorse.github.io#258 (comment)
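
One possible way to do this is a regular-expression pass over the metadata values. The following is a minimal sketch, assuming the values are plain text containing bare http(s) URLs (text that already contains anchor tags would need extra handling). The `linkify` helper and the pattern are hypothetical and not part of the tool, and the link text here is simply the URL itself.

```python
import html
import re

# Bare http(s) URLs: everything up to whitespace, angle brackets or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def linkify(text: str) -> str:
    """Wrap every bare http(s) URL in text in an HTML anchor tag."""

    def _replace(match: "re.Match[str]") -> str:
        url = match.group(0)
        # Keep sentence punctuation that directly follows the URL outside the link.
        stripped = url.rstrip(".,;:!?)")
        trailing = url[len(stripped):]
        escaped = html.escape(stripped, quote=True)
        return f"<a href='{escaped}'>{escaped}</a>{trailing}"

    return URL_PATTERN.sub(_replace, text)


print(linkify("See https://example.org/docs."))
```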

Add event ID

As far as I remember, each submission gets a unique ID in Indico, right? Can we add this to the YAML front-matter and include it in the generated markdown? I.e. something like event-{{ eventID }}.md?

Ensure proper encoding of free text fields

Proper encoding is important here because users can write into free text fields. As seen in SORSE/sorse.github.io@a4031b2 and SORSE/sorse.github.io#248 (comment), this can lead to issues with PDF rendering.

It is also important to distinguish between the different types of fields, e.g. the title is especially important for PDF rendering (see SORSE/sorse.github.io#248 (comment)). The change should take this into account as well.

Strip whitespace in contents

As mentioned by @Chilipp we should take care to properly strip whitespace in the paragraphs of the contribution contents.

The challenges facing research software development are manifold and have long been a major topic at RSE conferences.

remove whitespace at the end of paragraph

Originally posted by @Chilipp in SORSE/sorse.github.io#258 (comment)
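
A minimal sketch of such a clean-up step, assuming paragraphs are separated by blank lines. The `strip_paragraphs` helper is hypothetical and only illustrates the idea; it removes trailing whitespace on every line as well as leading and trailing whitespace of each paragraph.

```python
import re


def strip_paragraphs(content: str) -> str:
    """Strip surrounding whitespace from each paragraph and drop empty paragraphs."""
    cleaned = []
    # Paragraphs are assumed to be separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", content.strip()):
        text = "\n".join(line.rstrip() for line in paragraph.splitlines()).strip()
        if text:
            cleaned.append(text)
    return "\n\n".join(cleaned)


print(repr(strip_paragraphs("First paragraph.   \n\n  Second paragraph. \n")))
```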

Updates to Indico form

Remove the streaming question completely.
Change the recording question to this -- 'I agree that my contribution may be recorded, the recording published and my talk streamed on youtube or similar'.
Under that same recording question add this sentence 'We will stream the talk on Youtube or similar for diversity, inclusivity and accessibility reasons.' Please add after the first sentence and before the sentence starting with 'For each contribution'.
Remove 'Track' if possible because we already have this question as 'Contribution type' further up the form
Remove 'Latest Delivery date'
Move all the talk questions to above the workshop questions, starting under 'Gender' so that the talk questions are first
Under 'Posters Mentoring' the button needs to change from 'Montoring wanted' to 'Mentoring wanted' (spelling mistake)
Under 'Panelists' - please change 'BY PROVIDING THESE NAMES YOU CONFIRM THAT THESE INDIVIDUAL HAVE AGREED TO PARTICIPATE AND HAVE GIVEN PERMISSION FOR YOU TO SHARE THEIR INFORMATION.' To 'BY PROVIDING THESE NAMES YOU CONFIRM THAT THESE INDIVIDUALS HAVE AGREED TO PARTICIPATE AND HAVE GIVEN PERMISSION FOR YOU TO SHARE THEIR INFORMATION.'
Please change 'In case you seek for mentoring' to 'More details about mentoring needs'
Under 'Posters only, mentoring' please change this text 'We aim to provide one-on-one mentoring to support inexperienced applicants who's poster is accepted. Please provide further information below.' to 'We aim to provide one-to-one mentoring to support inexperienced applicants whose poster is accepted. Please provide further information below.'

Add registration url

Each event has an individual URL to register as a participant. Can we add this URL here?

Export Mentoring Data

To organise mentoring, information is required on who actually needs mentoring.
My current plan is to export mentoring information to the Google Drive.

Create allow_list information for mentoring
Export to Google Drive

I still don't know whether a document or a spreadsheet is the best option; I suppose a spreadsheet. Each entry should then contain a link to the website to allow checking information such as the abstract.

Change name of contribution types

We have to change the name of software_demos to software-demos. Otherwise we have inconsistent paths on the website (I forgot to do this myself and only did it for the dummy events in 2f40df3).

Originally posted by @Chilipp in SORSE/sorse.github.io#234 (comment)

The other contribution types are now plural as well, so they also need to be changed.

Add possibility to filter abstracts

Currently, we cannot configure which abstract states are relevant for a given workflow. This should be added as a filter in the main configuration file.
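
A sketch of how such a filter might look, assuming each workflow lists the abstract states it cares about. The `Abstract` class, the state names and the `workflow_states` mapping are hypothetical and only stand in for the real models and configuration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Abstract:
    title: str
    state: str


def select_abstracts(abstracts: Iterable[Abstract], relevant_states: Iterable[str]) -> List[Abstract]:
    """Keep only abstracts whose state is listed as relevant for a workflow."""
    allowed = set(relevant_states)
    return [abstract for abstract in abstracts if abstract.state in allowed]


# Hypothetical per-workflow configuration, e.g. loaded from the main configuration file.
workflow_states = {"website": ["accepted"], "rejected": ["rejected"]}

abstracts = [
    Abstract("Talk A", "accepted"),
    Abstract("Talk B", "rejected"),
    Abstract("Talk C", "submitted"),
]
website_abstracts = select_abstracts(abstracts, workflow_states["website"])
print([abstract.title for abstract in website_abstracts])
```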

Introduce date as publication date

Metadata for the website should include date, i.e. the date of publication, which should be the current date.
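
For illustration, the date could be added when the metadata is assembled. This is a sketch only: the `publication_metadata` helper and the plain-dict metadata are assumptions, and the ISO format (YYYY-MM-DD) is one reasonable choice rather than a confirmed requirement.

```python
import datetime
from typing import Any, Dict, Optional


def publication_metadata(metadata: Dict[str, Any], today: Optional[datetime.date] = None) -> Dict[str, Any]:
    """Return a copy of metadata with a date entry set to the publication date."""
    # Default to the current date; a fixed date can be passed in for testing.
    today = today or datetime.date.today()
    return {**metadata, "date": today.isoformat()}


print(publication_metadata({"title": "A talk"}, datetime.date(2020, 9, 2)))
```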

fix YAML encoding of &speaker

I just realized,

authors:
    - &speaker name:  someone
author: *speaker

does not work because *speaker renders to name. It needs to be

authors:
    - &speaker 
      name:  someone
author: *speaker

Only one speaker reference in YAML

As mentioned in SORSE/sorse.github.io#279 (review), valid YAML syntax may only hold one alias reference. In case several people are considered speakers, only the first should be explicitly written.

Allow optional fields when traversing

The questions for the different contribution types may differ. However, the allow_list currently handles all types together, even though each contribution is directly tied to one contribution type. Thus, an allowed field may be unavailable when a contribution of another type is processed. For these cases it should become possible to define optional fields while traversing the namespace.
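
One way this could work is sketched below, assuming fields are addressed by dotted paths into a nested dict. The `collect_fields` helper, the path syntax and the field names are hypothetical; the real traversal operates on the project's own models.

```python
from typing import Any, Dict, Iterable

_MISSING = object()


def collect_fields(record: Dict[str, Any], paths: Iterable[str], optional: Iterable[str] = ()) -> Dict[str, Any]:
    """Look up dotted paths in record.

    Missing optional paths are skipped silently; missing required paths raise KeyError.
    """
    optional_paths = set(optional)
    result: Dict[str, Any] = {}
    for path in paths:
        value: Any = record
        for part in path.split("."):
            if isinstance(value, dict) and part in value:
                value = value[part]
            else:
                value = _MISSING
                break
        if value is _MISSING:
            if path in optional_paths:
                continue
            raise KeyError(f"required field {path!r} is missing")
        result[path] = value
    return result


# A talk has no poster-specific question, so that field is declared optional.
talk = {"title": "A talk", "questionnaire": {"duration": 20}}
fields = collect_fields(
    talk,
    ["title", "questionnaire.duration", "questionnaire.poster_size"],
    optional=["questionnaire.poster_size"],
)
print(fields)
```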

Group affiliations in website template

@eileen-kuehn: related to SORSE/sorse.github.io#248, do you have any problems if we group the affiliations in the YAML front-matter? This would make it easier in the PDF generation workflow and it's also easy to implement in the website. Instead of

authors:
    - name: Dr. Jack Brookes
      bio: University Health Network
    - name: Mr. Matthew Warburton
      bio: University Health Network
    - name: Prof. Mark Mon-Williams
      bio: University Health Network
    - name: Dr. Faisal Mushtaq
      bio: University Health Network

we would then have

authors:
    - name: Dr. Jack Brookes
      affiliation: 1
    - name: Mr. Matthew Warburton
      affiliation: 1
    - name: Prof. Mark Mon-Williams
      affiliation: 1
    - name: Dr. Faisal Mushtaq
      affiliation: 1
affiliations:
    - name: University Health Network
      index: 1

If you like, I can make a PR to do this. From an OOP perspective you likely don't want the affiliation number to be part of the Person class because the latter should work stand-alone. But we can avoid this and do everything with Jinja2 in the website.md template. I'd just need to add a property named affiliations to the Contribution class, something like

@property
def affiliations(self):
    """Unique affiliations of the contributors"""
    affiliations = []
    for person in self.persons:
        if person.affiliation and person.affiliation not in affiliations:
            affiliations.append(person.affiliation)
    return affiliations
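
To see how the property would feed the grouped front-matter, here is a self-contained sketch. The `Person` and `Contribution` dataclasses are simplified stand-ins for the real models, "Jane Doe" and "Example Institute" are made-up test data, and the index lookup is done in Python here purely for illustration (the proposal above is to do it in the Jinja2 template).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    name: str
    affiliation: Optional[str] = None


@dataclass
class Contribution:
    persons: List[Person] = field(default_factory=list)

    @property
    def affiliations(self) -> List[str]:
        """Unique affiliations of the contributors, in order of first appearance."""
        affiliations: List[str] = []
        for person in self.persons:
            if person.affiliation and person.affiliation not in affiliations:
                affiliations.append(person.affiliation)
        return affiliations


contribution = Contribution(
    persons=[
        Person("Dr. Jack Brookes", "University Health Network"),
        Person("Mr. Matthew Warburton", "University Health Network"),
        Person("Jane Doe", "Example Institute"),
    ]
)

affiliations = contribution.affiliations
authors = []
for person in contribution.persons:
    entry = {"name": person.name}
    if person.affiliation:
        # 1-based index into the list of unique affiliations.
        entry["affiliation"] = affiliations.index(person.affiliation) + 1
    authors.append(entry)

print(affiliations)
print(authors)
```

Keeping the index out of `Person` means the class stays usable on its own, which matches the concern raised above.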