An OpenRefine extension that helps with Wikimedia Commons editing: start projects from Wikimedia Commons categories; Commons-specific GREL functions.

License: BSD 3-Clause "New" or "Revised" License

JavaScript 23.24% Java 66.92% HTML 2.97% Less 6.86%

java openrefine sdc wikimedia wikicommons extension

commonsextension's Introduction

Wikimedia Commons Extension for OpenRefine

This extension provides several helpful functionalities for OpenRefine users who want to edit (structured data of) media files (images, videos, PDFs...) on Wikimedia Commons. For more info, documentation and how-tos about OpenRefine for Wikimedia Commons, see https://commons.wikimedia.org/wiki/Commons:OpenRefine.

Features included in this extension:

Start an OpenRefine project by loading file names from one or more Wikimedia Commons categories (including category depth)
Add columns with Commons categories and/or M-ids of each file name
File names will already be reconciled when starting the project
A few dedicated GREL commands allow basic processing and extraction of Wikitext: extractFromTemplate and value.extractCategories
(In this extension's 0.1.1 release and later) Basic support for file thumbnail previews of existing Wikimedia Commons files. Thumbnails are displayed for some (but not all) file types/extensions. There is currently thumbnail support for jpeg, gif, png, djvu, pdf, svg, webm and ogv files.

It works with OpenRefine 3.6.x and later versions of OpenRefine. It is not compatible with OpenRefine 3.5.x or earlier. (OpenRefine supports editing Wikimedia Commons from version 3.6; this is not possible in earlier versions.)

This extension was first released in October 2022. It has been funded by a Wikimedia project grant.

How to use this extension

Install this extension in OpenRefine

Download the .zip file of the latest release of this extension. Unzip this file and place the unzipped folder in your OpenRefine extensions folder. Read more about installing extensions in OpenRefine's user manual.

When this extension is installed correctly, you will now see the additional option 'Wikimedia Commons' when starting a new project in OpenRefine.

Start an OpenRefine project from one or more Wikimedia Commons categories

After installing this extension, click the 'Wikimedia Commons' option to start a new project in OpenRefine. You will be prompted to add one or more Wikimedia Commons categories.

There's no need to type the Category: prefix.

You can specify category depth by typing or selecting a number in the input field after each category. Depth 0 means only files from the current category level; depth 1 will retrieve files from one sub-category level down, etc.

Next, in the project preview screen (Configure parsing options), you can choose to also include a column with each file's M-id (unique MediaInfo identifier) and/or Commons categories.

File names will already be reconciled when your project starts.

When you load larger categories (thousands of files) in a new project, OpenRefine will start slowly and will give you a memory warning. This is a known issue. Wait for a bit; the project will eventually start. The Commons Extension has been tested with a project of more than 450,000 files.

GREL commands to extract data from Wikitext

The Wikimedia Commons Extension also enables two dedicated GREL commands, which help to extract specific information from the Wikitext of Wikimedia Commons files. (GREL, General Refine Expression Language, is a dedicated scripting language used in OpenRefine for many flexible data operations. For a general reference on using GREL in OpenRefine, see https://docs.openrefine.org/manual/grelfunctions.)

Firstly, retrieve the Wikitext from a list of Commons files in your project. In the column menu of the reconciled file names' column, select Edit column > Add column from reconciled values... and select Wikitext in the resulting dialog window.

From this new column with Wikitext, you can now extract values and categories as described below. Start by selecting Edit column > Add column based on this column... in the column menu. In the next dialog window, you can use various specific GREL commands:

Extract values from template parameters: `extractFromTemplate`

Use the following syntax:

extractFromTemplate(value, "BHL", "source")[0]

where you replace BHL with the name of the template (without curly brackets) and source with the parameter from which you want to extract the value. This GREL syntax will return the first (and usually the only) value of said parameter, e.g. https://www.flickr.com/photos/biodivlibrary/10329116385.

Extract Wikimedia Commons categories: `value.extractCategories`

Use the following syntax:

value.extractCategories().join('#')

This GREL syntax will return all categories mentioned in the Wikitext, separated by the # character, which you can then use to split the resulting cell further as needed.
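
To make concrete what category extraction involves, here is a minimal Java sketch. It is not the extension's actual implementation: the class name, the method name and the regex-based approach are assumptions for illustration only. It recognises links of the form [[Category:Name]] and [[Category:Name|sort key]] and returns the names without the sort key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the extension's code base.
final class CategoryExtractionSketch {

    // Group 1 captures the category name; an optional "|sort key" is skipped.
    private static final Pattern CATEGORY_LINK = Pattern.compile(
            "\\[\\[\\s*Category\\s*:\\s*([^\\]|]+?)\\s*(?:\\|[^\\]]*)?\\]\\]",
            Pattern.CASE_INSENSITIVE);

    /** Returns the names of all categories linked in the given wikitext, in order. */
    static List<String> extractCategories(String wikitext) {
        List<String> categories = new ArrayList<>();
        Matcher matcher = CATEGORY_LINK.matcher(wikitext);
        while (matcher.find()) {
            categories.add(matcher.group(1));
        }
        return categories;
    }
}
```

Joining the result with String.join("#", ...) mirrors the .join('#') step in the GREL expression above.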

Development

Building from source

Run

mvn package

This creates a zip file in the target folder, which can then be installed in OpenRefine.

Developing it

To avoid having to unzip the extension in the corresponding directory every time you want to test it, you can also use another setup: simply create a symbolic link from your extensions folder in OpenRefine to the local copy of this repository. With this setup, you do not need to run mvn package when making changes to the extension, but you will still need to compile it with mvn compile if you are making changes to Java files, and restart OpenRefine if you make changes to any files.

Releasing it

Make sure you are on the master branch and it is up to date (git pull)
Open pom.xml and set the version to the desired version number, such as <version>0.1.0</version>
Commit and push those changes
Add a corresponding git tag, with git tag -a v0.1.0 -m "Version 0.1.0" (when working from GitHub Desktop, you can follow this process and manually add the v0.1.0 and Version 0.1.0 tags)
Push the tag to GitHub: git push --tags (in GitHub Desktop, just push again)
Create the zip file for the release: mvn package
Create a new release on GitHub at https://github.com/OpenRefine/CommonsExtension/releases/new, providing a release title (such as "Commons extension 0.1.0") and a description of the features in this release. Upload the zip file you generated at the previous step as an attachment (it can be found in the target subfolder of your local copy of the repository).
Open pom.xml and set the version to the expected next version number, followed by -SNAPSHOT. For instance, if you just released 0.1.0, you could set <version>0.1.1-SNAPSHOT</version>
Commit and push those changes.

commonsextension's Issues

New structure for category fetching

To simplify our code, I would propose the following architecture for the category fetching, based on Java iterators. We would need the following classes:

A class (say FileRecord) which would essentially represent the contents of a record in the project (although it would not yet be formatted as a list of rows). It would contain the attributes:
- a file name
- its corresponding mid
- the list of categories it belongs to
A class whose constructor takes a single category name as parameter and which implements the Iterator<FileRecord> interface: it iterates over the file names contained in that category. In each FileRecord the categories would be left empty as a first step. So really the only task of this class would be to make the HTTP requests to the Commons API with the appropriate paging.
A class which takes an Iterator<FileRecord> (an iterator over file names) as parameter, and implements Iterator<FileRecord> again: its task would be to fetch the categories each file belongs to, and store them in each FileRecord.
A class which takes an Iterator<FileRecord> and implements TableDataReader. Its task would be to convert each FileRecord to one or more rows (by spreading the categories down on blank rows as we are currently trying to do)

With all those building blocks, you could then combine them (chain them) all together into the importer.
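
As a rough illustration of how these building blocks could fit together, here is a minimal Java sketch. All names other than FileRecord are hypothetical, the HTTP calls are replaced by a lookup function passed in by the caller, and rows are plain lists rather than a TableDataReader, so this only shows the chaining idea, not a working importer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

final class FileRecordPipelineSketch {

    /** One project record: a file, its M-id and the categories it belongs to. */
    static final class FileRecord {
        final String fileName;
        final String mid;
        final List<String> categories;

        FileRecord(String fileName, String mid, List<String> categories) {
            this.fileName = fileName;
            this.mid = mid;
            this.categories = categories;
        }
    }

    /** Wraps an iterator over records and fills in the categories of each record. */
    static final class CategoryEnricher implements Iterator<FileRecord> {
        private final Iterator<FileRecord> source;
        // Assumption: stands in for the API request that fetches a file's categories.
        private final Function<String, List<String>> categoryLookup;

        CategoryEnricher(Iterator<FileRecord> source, Function<String, List<String>> categoryLookup) {
            this.source = source;
            this.categoryLookup = categoryLookup;
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public FileRecord next() {
            FileRecord record = source.next();
            return new FileRecord(record.fileName, record.mid, categoryLookup.apply(record.fileName));
        }
    }

    /** Spreads a record over rows: file name and M-id on the first row, one category per row. */
    static List<List<String>> toRows(FileRecord record) {
        List<List<String>> rows = new ArrayList<>();
        if (record.categories.isEmpty()) {
            rows.add(Arrays.asList(record.fileName, record.mid, null));
            return rows;
        }
        for (int i = 0; i < record.categories.size(); i++) {
            boolean first = (i == 0);
            rows.add(Arrays.asList(
                    first ? record.fileName : null,
                    first ? record.mid : null,
                    record.categories.get(i)));
        }
        return rows;
    }
}
```

Because every stage consumes and produces an Iterator<FileRecord>, stages can be added or swapped without touching the others.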

Bug: extractFromTemplate and value.extractCategories GREL functions produce empty columns

I have been trying the extractFromTemplate and value.extractCategories GREL functions in various projects. Both work well in the GREL preview dialog window:

But then after clicking OK, in the project itself, both produce an empty column. I haven't been able to get it to work in any project for now, but just for testing purposes, here's a project in which it went wrong:
Barbalissos.openrefine.tar.gz

Implement findTemplateValues function

We could have a findTemplateValues function (name to be improved) which would work like this:

first argument, mandatory: the wikitext to parse
second argument, mandatory: the name of the template to look for in the wikitext
third argument, mandatory too: the name of the template parameter to extract

It would return the list of values of the given parameter in the given template.

For instance, calling findTemplateValues(value, "foo", "bar") on a cell containing the following value:

{{some template|bar=test}}
{{foo|bar={{other template}}}}
{{foo| foo = not important| bar = second value }}

should return
["{{other template}}", "second value" ].

Extensive documentation about templates in Wikitext can be found here: https://en.wikipedia.org/wiki/Help:Template (but that is probably much more than you need)
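
To illustrate the expected behaviour, here is a minimal Java sketch of such a function. It is a hand-written scanner under simplifying assumptions, not a full Wikitext parser and not necessarily how the extension implements it: it only looks at top-level template calls (templates nested inside another template's parameters are not searched), and it treats the first = in a parameter as the separator between name and value. The helper names findClosing and splitTopLevel are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class TemplateValuesSketch {

    /** Returns the values of the named parameter in every top-level call of the template. */
    static List<String> findTemplateValues(String wikitext, String templateName, String paramName) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while ((i = wikitext.indexOf("{{", i)) >= 0) {
            int end = findClosing(wikitext, i);
            if (end < 0) {
                break; // unbalanced braces: stop rather than guess
            }
            List<String> parts = splitTopLevel(wikitext.substring(i + 2, end));
            if (parts.get(0).trim().equals(templateName)) {
                for (String part : parts.subList(1, parts.size())) {
                    int eq = part.indexOf('=');
                    if (eq >= 0 && part.substring(0, eq).trim().equals(paramName)) {
                        values.add(part.substring(eq + 1).trim());
                    }
                }
            }
            i = end + 2;
        }
        return values;
    }

    /** Returns the index of the "}}" closing the "{{" at position start, or -1 if there is none. */
    private static int findClosing(String text, int start) {
        int depth = 0;
        for (int i = start; i < text.length() - 1; i++) {
            if (text.startsWith("{{", i)) {
                depth++;
                i++;
            } else if (text.startsWith("}}", i)) {
                depth--;
                if (depth == 0) {
                    return i;
                }
                i++;
            }
        }
        return -1;
    }

    /** Splits a template body on "|", ignoring pipes inside nested {{...}} or [[...]]. */
    private static List<String> splitTopLevel(String body) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int partStart = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.startsWith("{{", i) || body.startsWith("[[", i)) {
                depth++;
                i++;
            } else if (body.startsWith("}}", i) || body.startsWith("]]", i)) {
                depth--;
                i++;
            } else if (body.charAt(i) == '|' && depth == 0) {
                parts.add(body.substring(partStart, i));
                partStart = i + 1;
            }
        }
        parts.add(body.substring(partStart));
        return parts;
    }
}
```

Counting brace depth rather than using a single regular expression is what lets the sketch return {{other template}} intact as a value.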

extractCategories fails on some example wikitext

Running the value.extractCategories() expression on the following cell value should give some categories as output, but it returns the empty list:

== {{int:filedesc}} ==
{{Information
|Description={{en|1=View of Earth taken during ISS Expedition 30.}}
|Source=[https://eol.jsc.nasa.gov/SearchPhotos/photo.pl?mission=ISS030&roll=E&frame=226922 JSC Gateway to Astronaut Photography of Earth]
|Date=2012-04-12 05:20:47
|Author=Earth Science and Remote Sensing Unit, NASA Johnson Space Center
|Permission=
|other_versions=
|other_fields=
{{InFi|name=Sun Azimuth|value=33°}}
{{InFi|name=Sun Elevatation|value=-34°}}
{{InFi|name=Altitude|value={{convert|211|nmi|km}}}}
{{InFi|name=Mission|value=ISS030}}
{{InFi|name=Roll|value=E}}
{{InFi|name=Frame|value=226922}}
{{InFi|name=Camera|value=NIKON D3S S/N: 2008336}}
{{InFi|name=Focal length|value=28 mm}}
}}

{{location}}

{{NASA-image|id=ISS030-E-226922|center=JSC}}
== {{int:license-header}} ==
{{PD-USGov-NASA}}

[[Category:ISS Expedition 30 Crew Earth Observations (dump)|226922]]
[[Category:Taken with Nikon D3s]]

I would be curious to know why, and if it can be fixed :)

Implement Commons categories (+ category depth) as 'starting point' for new OpenRefine projects

(This first was an issue in the OpenRefine/OpenRefine repository, but we have decided to implement this as part of the Commons extension.)

When editing batches of Wikimedia Commons files, regular Commons and GLAM contributors will typically take one or more Wikimedia Commons categories as input or 'starting point' in various tools. Examples include AC/DC, the ISA Tool, and VisualFileChange. It would be great if that would be possible in OpenRefine too (rather than asking users to start with a list of file names).

Alternatives considered

In the earlier integration scenarios, we kind of assumed that users would start off with lists of file names.

However, this does produce some extra hurdles to users (which may be especially difficult and annoying to newcomers). They would have to use other tools in order to get a list of file names on Wikimedia Commons, which complicates the workflow.

Typically, users would resort to

PetScan outputting csv, tsv, plain text or a PagePile (example query)
the Wikimedia Commons query service (example query) (much less frequently though, as WCQS is not yet very actively used)

These tools are not rocket science, but they're not super intuitive either. It's absolutely possible to help users figure this out through documentation and by providing sample queries that are very easy to modify. That said, it will be a smoother, more seamless and less frustrating experience if the Commons Extension facilitates direct usage of Commons categories.

Additional context

Typically, Wikimedia tools allow refined interaction with Commons categories, which we want to facilitate too:

Choosing/selecting categories at once
Specifying subcategory depth for each individual category

Various screenshots of how other Wikimedia tools do this:

The ISA Tool

PetScan

(link to this example)

Thumbnail previews of media files from URL, to be uploaded to a Wikibase (or Wikimedia Commons)

Scenario: User wants to use OpenRefine to upload new files to Wikimedia Commons, from the web. They select a set of URLs (or start a project from API, or similar).

We want to give users the option to toggle previews (thumbnails) of the media files, and the option to click the thumbnails to enlarge them. Such preview thumbnails are helpful during editing (e.g. to check if a certain thing is indeed depicted in a file, without having to preview the file via another application on the local drive itself).

Wireframes for this feature have been drawn by @lozanaross for another scenario (show thumbnails of files already on Wikimedia Commons), but I think the basic UX behavior is the same? OpenRefine/OpenRefine#5154 provides the technical basis for making this feature possible.

Most recent wireframes I found (v4, development version; shows files on Commons):

The small thumbnails: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/8c10c8d4-9ab6-4da2-adb6-9552f789593f/
Hover effect: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/e8aadeae-d6e6-414f-9852-26e85ce6964d/
Enlarged thumbnail: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/698c6f1c-d747-4adc-895b-142e21c2830c/

Extension name in project start UI should be new Wikimedia Commons (instead of Commons Extension)

Currently the Commons Extension is called 'Commons Extension' in the blue project start 'Get data from' menu.

To be consistent with other data sources / extensions, let's rename it to *new* Wikimedia Commons per @lozanaross suggestions.

Wireframes:

Issue reporting in Wikibase edit schema, focused on the Wikimedia Commons use case

When using OpenRefine's schema editor to prepare edits and uploads to Wikimedia Commons, users should get clear feedback (in the 'issues' tab) about what may go wrong in their edit/upload batches.

This feedback will in some cases be different from the generic feedback for any Wikibase (i.e. not just constraint related) and specific to Wikimedia Commons: think about feedback that concerns duplicate file names, file names with invalid characters...

@trnstlntk to write specifications for this (what potential problems do we need to cover here; draft copy for clear error messages?..) - deadline May 31, 2022
@wetneb to build this - deadline June 30, 2022
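
As an illustration of the kind of check this could involve, here is a small Java sketch that reports two of the problems mentioned above within a single batch. The class name, the method name and the message wording are hypothetical. The character list covers the characters MediaWiki rejects in page titles and is not meant to be a complete list of Commons file name rules. Detecting clashes with files that already exist on Commons would need an API lookup, which is left out here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FileNameIssuesSketch {

    // Characters that MediaWiki does not allow in page titles (not exhaustive for Commons).
    private static final String ILLEGAL_CHARACTERS = "#<>[]|{}";

    /** Returns one human-readable message per problem found in the batch of file names. */
    static List<String> findIssues(List<String> fileNames) {
        List<String> issues = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String fileName : fileNames) {
            for (char c : ILLEGAL_CHARACTERS.toCharArray()) {
                if (fileName.indexOf(c) >= 0) {
                    issues.add("Invalid character '" + c + "' in file name: " + fileName);
                }
            }
            // Underscores and spaces are equivalent in MediaWiki titles.
            String normalized = fileName.replace('_', ' ').trim();
            if (!seen.add(normalized)) {
                issues.add("Duplicate file name in this batch: " + fileName);
            }
        }
        return issues;
    }
}
```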

Build template support (minimum version) - w/o wikitext generation

From the roadmap documentation: this relates to building support for "schemas with holes" in OpenRefine (current schemas include the values, but we also need an option to preserve only the shape of a schema, without the metadata values). Schemas will be saved as JSON files in the extension repo, and users could potentially contribute templates by creating a pull request and following a review process towards merging.

Optimize fetching of related categories with "clcontinue"

Related-category fetching currently retrieves at most 500 related categories per API call. Using the clcontinue parameter will allow fetching all related categories for a given group of titles (up to 50 titles per API call).
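
The continuation pattern can be sketched in Java as follows. This is only an illustration under assumptions: the Page class and fetchAll method are hypothetical names, and the actual HTTP request is replaced by a function supplied by the caller. That function receives the clcontinue value to send (null for the first request) and returns the categories from one response together with the next clcontinue value, or null once the API no longer returns one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class ContinuationSketch {

    /** One API response: the categories it carried and the token for the next request. */
    static final class Page {
        final List<String> categories;
        final String clcontinue; // null when there is nothing more to fetch

        Page(List<String> categories, String clcontinue) {
            this.categories = categories;
            this.clcontinue = clcontinue;
        }
    }

    /** Keeps requesting pages until the response no longer carries a continuation token. */
    static List<String> fetchAll(Function<String, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        String token = null; // first request is sent without clcontinue
        do {
            Page page = fetchPage.apply(token);
            all.addAll(page.categories);
            token = page.clcontinue;
        } while (token != null);
        return all;
    }
}
```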

Thumbnail previews of media files available on Wikimedia Commons

Scenario: user wants to use OpenRefine to add structured data to existing files from Wikimedia Commons. They load a series of file paths from Wikimedia Commons and reconcile them with Wikimedia Commons.

Wireframes for this feature have been drawn by @lozanaross and OpenRefine/OpenRefine#5154 provides the technical basis for making this feature possible.

Most recent wireframes I found (v4, development version):

The small thumbnails: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/8c10c8d4-9ab6-4da2-adb6-9552f789593f/
Hover effect: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/e8aadeae-d6e6-414f-9852-26e85ce6964d/
Enlarged thumbnail: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/698c6f1c-d747-4adc-895b-142e21c2830c/

Display entire Commons category names in category autosuggestion

When I want to select a Commons category with a long name in the Commons Extension category selector screen, and when there are multiple categories that have the same beginning, it's currently not possible to distinguish between them during autosuggest. See screenshot:

In this example, we have for instance Category:Sculptures in Rotterdam-Noord / Category:Sculptures in Rotterdam-Zuid etc etc. It would be good to redesign the dropdown a bit so that the full category names become visible (even though they may sometimes be really long).
In the screenshot above, the category names (truncated) are actually repeated (bigger bold and smaller greyish), which may replicate other design patterns in OpenRefine (?) but perhaps it makes more sense to display the category name only once, but then fully.

For comparison, this is what autosuggest looks like in Wikimedia Commons' own search box

Thumbnail previews of media files from local drive, to be uploaded to a Wikibase (or Wikimedia Commons)

Scenario: User wants to use OpenRefine to upload new files to Wikimedia Commons, from harddrive. They select a set of files from their harddrive.

Most recent wireframes I found (v4, development version; shows files on Commons):

The small thumbnails: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/8c10c8d4-9ab6-4da2-adb6-9552f789593f/
Hover effect: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/e8aadeae-d6e6-414f-9852-26e85ce6964d/
Enlarged thumbnail: https://xd.adobe.com/view/df767737-c4ca-4d25-977e-7ad8ca3a6840-8a25/screen/698c6f1c-d747-4adc-895b-142e21c2830c/

Build the UI to fill out an SDC template

This relates to the schema tab UI that users get to fill out when they select a template. It goes beyond what is already supported in the generic WB extension UI. Pages 15-19 here: https://xd.adobe.com/view/d8b086f2-199a-4fcf-a1e8-8d939842b790-8de3/

Design workflows for Commons extension that don't start from a ready-made spreadsheet

Workflows starting from a URL, or from a local folder of files, or from a Commons category tree need additional design and development. Should be discussed at a team meeting, plus a survey should go out to users - to better understand what scenarios users need more support with.

Feature request: simple image upload from one or multiple Flickr URL(s)

Many Wikimedians upload (appropriately licensed) images from Flickr to Wikimedia Commons.

Upload from Flickr is supported by the (default) Wikimedia Commons UploadWizard (up to 500 files at once): https://commons.wikimedia.org/wiki/Special:UploadWizard

This is pretty barebones: it literally takes the photo description as it appears on Flickr and then still needs manual input for the license and other metadata.

Another frequently used tool is Flickr2Commons by Magnus Manske: https://flickr2commons.toolforge.org/#/

It offers a bit more flexibility and intelligence, but still uploads the descriptions in a quite barebones way.

Advantages of Flickr integration in OpenRefine would include:

Ability to parse and reconcile elements from the Flickr file descriptions
Addition of (more) refined and diverse structured data, file names, and Wikitext to the files upon upload

I am posting this after receiving a request for test uploading from the very awesome Biodiversity Heritage Library, which stores many files on Flickr and for whom such a transfer functionality (including the more advanced features that OpenRefine could offer) would be very helpful. Here's one example of an album in their vast Flickr repository. I can imagine more GLAMs are in this situation.

We can't do this before the October 2022 Wikimedia grant deadline, but it's something to keep an eye on for future development. It would be good to involve the Wikimedia Commons / OpenRefine user community to help us prioritize this feature request.

Respond to feedback on v2 of the wireframes and design necessary updates into a v3 file

Design adjustments needed based on feedback from Olaf here & from Sandra here.

GREL function for {{Creator:Peter Paul Rubens}} {{Institution:Rijksmuseum}} style templates

Wikimedia Commons Wikitext also often contains templates that are formed as {{Creator:Peter Paul Rubens}} or {{Institution:Rijksmuseum}} (note the : instead of a | character). Because such templates are very prevalent (the {{Creator}} and {{Institution}} ones are used on millions of files), it would be great to have a dedicated GREL syntax/function for these. See https://commons.wikimedia.org/wiki/Template:Creator and https://commons.wikimedia.org/wiki/Template:Institution for a bit more info about both templates.

In the {{Creator:Peter Paul Rubens}} and {{Institution:Rijksmuseum}} examples respectively, the OpenRefine end user will want to extract Peter Paul Rubens and Rijksmuseum.

This file has both templates, as just one example.
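
A possible shape for such a function, sketched in Java. The class and method names are hypothetical and the regular expression is an assumption about what is sufficient for the common case; it does not handle names that themselves contain nested templates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixedTemplateSketch {

    /**
     * Returns the part after the colon for every {{prefix:Name}} call in the wikitext,
     * for example "Peter Paul Rubens" for {{Creator:Peter Paul Rubens}}.
     */
    static List<String> extractPrefixedNames(String wikitext, String prefix) {
        Pattern pattern = Pattern.compile(
                "\\{\\{\\s*" + Pattern.quote(prefix) + "\\s*:\\s*([^{}|]+?)\\s*(?:\\|[^{}]*)?\\}\\}");
        List<String> names = new ArrayList<>();
        Matcher matcher = pattern.matcher(wikitext);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }
}
```

Calling extractPrefixedNames(wikitext, "Creator") and extractPrefixedNames(wikitext, "Institution") on the examples above would yield Peter Paul Rubens and Rijksmuseum respectively.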

Depth support for category fetching

We want to support fetching subcategories recursively up to some depth, as other tools such as PetScan do.

Here is a proposed architecture for this.

/**
 * Fetches a category recursively, up to the given depth, from the MediaWiki API.
 * The stream of FileRecords contains the filenames and mids, but not the related
 * categories (which must be fetched separately).
 * Set the depth to 0 to ignore subcategories.
 */
static Iterator<FileRecord> listCategoryMembers(String endpoint, String categoryName, int depth) {
    // TODO
    throw new UnsupportedOperationException("not implemented yet");
}

/**
 * Fetches the direct subcategories of a given category, from the MediaWiki API.
 * The supplied stream contains category names (TBD: with or without the `Category:` prefix?).
 */
static Iterator<String> fetchSubcategories(String endpoint, String categoryName) {
    // TODO
    throw new UnsupportedOperationException("not implemented yet");
}

/**
 * Fetches the files which are direct members of a given category, from the MediaWiki API.
 * The stream of FileRecords contains the filenames and mids, but not the related
 * categories (which must be fetched separately).
 */
static Iterator<FileRecord> fetchDirectFileMembers(String endpoint, String categoryName) {
    // TODO
    throw new UnsupportedOperationException("not implemented yet");
}

/**
 * Internal function used to iterate over the paginated results of the MediaWiki API
 * when fetching files or categories. This function is used both by fetchSubcategories and
 * by fetchDirectFileMembers.
 * The `subcategories` parameter can be set to true to fetch categories and false to fetch files.
 */
static Iterator<JsonNode> fetchCategoryMembers(String endpoint, String categoryName, boolean subcategories) {
    // TODO
    throw new UnsupportedOperationException("not implemented yet");
}

To migrate to this architecture, I propose the following steps:

the current FileFetcher class is adapted to implement Iterator<JsonNode> instead of Iterator<FileRecord>: it is no longer responsible for parsing each JSON result into a FileRecord. Furthermore, the FileFetcher constructor takes a new boolean parameter indicating whether it should fetch files or subcategories (it cannot do both).
the static method fetchCategoryMembers is a simple wrapper on top of FileFetcher
the removed parsing code is moved into fetchDirectFileMembers, which converts the Iterator<JsonNode> to an Iterator<FileRecord> by parsing each result
similarly, fetchSubcategories parses each result, but extracts only the category names, without pageids
finally, the listCategoryMembers method combines fetchSubcategories and fetchDirectFileMembers in a recursive algorithm which traverses categories up to a certain depth.
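
To make the depth semantics concrete, here is a minimal Java sketch of the traversal. It makes several assumptions for illustration: the two maps are in-memory stand-ins for what fetchSubcategories and fetchDirectFileMembers would return from the API, files are plain names rather than FileRecords, and the traversal goes level by level instead of recursing. Going level by level with a set of already visited categories is one way to cope with category loops, which can occur on Commons, while still reaching every category within the requested depth.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CategoryDepthSketch {

    // Hypothetical in-memory stand-ins for the two API-backed fetch methods.
    private final Map<String, List<String>> subcategories;
    private final Map<String, List<String>> directFiles;

    CategoryDepthSketch(Map<String, List<String>> subcategories, Map<String, List<String>> directFiles) {
        this.subcategories = subcategories;
        this.directFiles = directFiles;
    }

    /** Lists the files of a category and of its subcategories, down to the given depth. */
    List<String> listCategoryMembers(String rootCategory, int depth) {
        Set<String> files = new LinkedHashSet<>(); // keeps order, drops files seen twice
        Set<String> visited = new HashSet<>();
        visited.add(rootCategory);
        List<String> currentLevel = List.of(rootCategory);
        // Level 0 is the root category itself, so depth 0 ignores subcategories.
        for (int level = 0; level <= depth && !currentLevel.isEmpty(); level++) {
            List<String> nextLevel = new ArrayList<>();
            for (String category : currentLevel) {
                files.addAll(directFiles.getOrDefault(category, List.of()));
                for (String sub : subcategories.getOrDefault(category, List.of())) {
                    if (visited.add(sub)) {
                        nextLevel.add(sub);
                    }
                }
            }
            currentLevel = nextLevel;
        }
        return new ArrayList<>(files);
    }
}
```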

User testing in September

This issue serves as a deadline to complete and test a number of the added-value features needed to fulfil the additional WMF grant with an October deadline.

Design a consistent way of reporting fail scenarios during data upload / reconciliation to users

Related to OpenRefine/OpenRefine#4236

Onboarding - UI steps / documentation

Depending on how the rest of the development process goes over the summer, this issue will be reviewed again to determine whether it involves interventions at the UI level (e.g. a guided tour for new users via strategic pop-ups around the UI), or is purely a matter of documentation, e.g. video tutorials and written instructions.

Design a UI with which CommonsExtension users interact with Wikitext-specific GREL commands

@j-sal is building various pieces of very helpful GREL syntax that will make it easier for users of OpenRefine's Commons Extension to parse and process Wikitext.

It would be good to have a dedicated UI with which users can choose and test these various GREL commands; ideally without even having to know/memorize them or look them up. It would also be good to have some preview functionality of what each operation will do.

For inspiration / input for this task:

Sandra's wishlist of various things to parse from Wikitext: https://docs.google.com/document/d/1pFElwbRDBwXSZvuokd8OE1yWr2op1Zl2ckk0mg96OYQ/edit
GREL commands that are either already implemented or in progress: https://github.com/OpenRefine/CommonsExtension/issues?q=is%3Aissue

Register `ExtractFromTemplate()` in `controller.js`

New functions added to the Commons extension, such as the extractFromTemplate() GREL function for retrieving Template Values as specified in #2, need to be registered in the extension's controller.js module so that they are visible in OpenRefine.

Support for extracting positional template parameters

Sometimes we want to extract the values of template parameters which are positional (just designated by positions, not parameter names).

For instance:
{{location|60.165787|24.9460811}}
(taken from Sandra's parsing wishlist).

We might want to extract the first and second values in the template. One possible way would be to extend the functionality of the findTemplateValues() function that we already have, so that it accepts a number as third argument, like this:

findTemplateValues(wikitext, 'location', 1)

On the sample wikitext above, it would return ['60.165787'].

When running findTemplateValues(wikitext, 'location', 2) you would get ['24.9460811'].
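
A minimal Java sketch of this behaviour is given below. It is kept separate from any existing function and rests on simplifying assumptions: the class and method names are hypothetical, only template calls without nested templates are matched, and a parameter counts as positional when it contains no = sign. Positions are 1-based, as in the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PositionalParameterSketch {

    // Matches a template call without nested templates: {{name|param|param...}}
    private static final Pattern FLAT_TEMPLATE =
            Pattern.compile("\\{\\{([^{}|]+)((?:\\|[^{}|]*)*)\\}\\}");

    /** Returns the value at the given 1-based position in every call of the template. */
    static List<String> findPositionalValues(String wikitext, String templateName, int position) {
        List<String> values = new ArrayList<>();
        Matcher matcher = FLAT_TEMPLATE.matcher(wikitext);
        while (matcher.find()) {
            String params = matcher.group(2);
            if (!matcher.group(1).trim().equals(templateName) || params.isEmpty()) {
                continue;
            }
            int unnamedSeen = 0;
            // Drop the leading "|" and keep empty parameters so positions stay aligned.
            for (String param : params.substring(1).split("\\|", -1)) {
                if (param.contains("=")) {
                    continue; // named parameter, does not count as a position
                }
                unnamedSeen++;
                if (unnamedSeen == position) {
                    values.add(param.trim());
                    break;
                }
            }
        }
        return values;
    }
}
```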

Build dialog window that allows Commons Extension user to work with Commons-specific GREL commands

Dependent on issue #7

Currently based on the "Add column based on this column" dialog, but will most likely be its own separate, SDC-specific dialog.

Filenames already reconciled when starting an OpenRefine project from Commons categories

When users start an OpenRefine project using the Commons categories feature, it would be great if the file name column would already be reconciled against Wikimedia Commons.

Currently the project loads like this (checked M-id and categories options):

Where the user still needs to manually reconcile the file names.

It would be great if a project would start with the file names already reconciled like this:

Export to MediaWiki's Tabular JSON and upload to Wikimedia Commons

MediaWiki supports a tabular data format that can hold some metadata. It would make sense to ease the export to that format and upload to Wikimedia Commons.

https://www.mediawiki.org/wiki/Help:Tabular_Data

Write a Tabular JSON exporter
Optional: Add upload wizard for Wikimedia Commons as part of the Wikidata extension

extractFromTemplate is sensitive to newlines in the template name

This works:

extractFromTemplate("{{foo|bar=hello}}", "foo", "bar")

(it evaluates to: [ "hello" ])

This does not work:

extractFromTemplate("{{foo\n|bar=hello}}", "foo", "bar")

(it evaluates to [ ], and should evaluate to [ "hello" ]).
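
One way to make the comparison robust is to normalise template names before comparing them. The Java sketch below is only an assumption about a possible fix, with hypothetical names; it is not taken from the extension's code. Besides trimming whitespace such as the newline above, it treats underscores as spaces and ignores the case of the first letter, which is how MediaWiki treats page titles on Commons.

```java
final class TemplateNameSketch {

    /** Normalises a template name: underscores become spaces, whitespace is trimmed and collapsed. */
    static String normalize(String rawName) {
        String name = rawName.replace('_', ' ').trim().replaceAll("\\s+", " ");
        if (name.isEmpty()) {
            return name;
        }
        // The first letter of a page title is case-insensitive on Commons.
        return Character.toUpperCase(name.charAt(0)) + name.substring(1);
    }

    /** Tells whether two spellings refer to the same template. */
    static boolean sameTemplate(String first, String second) {
        return normalize(first).equals(normalize(second));
    }
}
```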

Review interface copy and tooltips in @Lozanaross wireframes dd Apr 26, 2022

During an OpenRefine/SDC workshop at De Krook in Ghent, April 26, 2022, @lozanaross presented a new version of Wikimedia Commons wireframes: https://xd.adobe.com/view/fdf5a12c-9c30-4449-9eba-d1ea8523dddb-a8a6/

Several of these wireframes need a review by @trnstlntk

General review of interface copy that is specific to Wikimedia Commons
Create draft texts for tooltips - (i) icons which are newly introduced here
For tooltips for structured data statements in the Schema dialog: investigate whether these are generically applicable to Commons or Wikidata (and hence be a Wikidata statement) or are really OR-specific (and hence need to be maintained outside Wikidata)

Make usage of 'category depth' input field a bit clearer by greying out (until category is entered) and perhaps by integrating a counter

To make the function of the 'category depth' input box a bit clearer, @lozanaross suggests it should be greyed out until the user picks a category, then it becomes white.

Additionally, it could include a counter (currently, users don't understand that they are supposed to type a number). If fetching only the current category means depth=0, perhaps the input box can be prefilled with a 0 by default, and when clicking the user gets a dropdown with more numbers to select?

Build thumbnail support for local and remote URL images

This builds on the work outlined in #34, but is less urgent.

Allow people to remove Commons categories from the selected list, while listing desired categories to retrieve files from

When I start the Commons extension, and I am listing the Commons categories I am interested in retrieving files from, I sometimes make mistakes. I may have listed several categories (and their depth) but then may want to remove one or more of them again (because I changed my mind, I mistyped, etc.)

E.g. in the example below:

Oops! I absolutely do NOT want to load files from the Category:Sculptures in Amsterdam. Why would I? Sculptures in Rotterdam and Delft are way cooler.

It would be great to add the option to remove categories here, by adding a cross at the end of each line (behind the category depth field). If the user clicks that cross, the line will be deleted and files from the category will not be retrieved anymore.

We already use such a 'removing cross' in various other places in OpenRefine's interface, e.g. in the selection of (and removal of) reconciliation services:

Selecting 'Start Over' button should clear previously set categories

In the parsing menu, when a user decides to 'Start Over', the category name(s) of the files to fetch should be empty, and previously set categories should be removed from the request sent to the backend, allowing for a new set of categories to be set.

Warn user that they are working with one or more file name(s) that already exist(s) on Wikimedia Commons

File names on Wikimedia Commons must be unique (two files can't have the same name).

The default Wikimedia Commons UploadWizard warns the user when they are naming a file the same as one that already exists.

It would be great if OpenRefine also warns uploaders of new Commons files if this happens. I can imagine this will be part of the 'Issues' tab when creating a schema for uploading files to Commons, see #22

Build the UI to pick a template

This relates specifically to the possibility to select templates from a dropdown, as well as possibility to save them (see latest version of the wires below):

Define behavior for when a Category does not have elements

Should a message be displayed?

Define behavior for when a manually set Category does not exist

Is a redirect possible?

Integrate thumbnail previews during the Wikimedia Commons batch file upload process

See https://commons.wikimedia.org/wiki/Commons_talk:OpenRefine

Request received in a conversation in the Wikimedia Commons Telegram channel. During a batch upload process of media files, it is extremely helpful if one can (easily) see thumbnail previews of the media files that are being uploaded.

Some existing Wikimedia Commons (batch) upload tools support this indeed (the default UploadWizard, for instance), others don't (Pattypan only shows previews of files and their infoboxes during the checking phase of the upload process, after all data has been prepared already).

OpenRefine is essentially a data-centric tool, so this may be a stretch, but it's good to have this request on the radar, as it makes a lot of sense IMO.

Documentation of features of the Commons Extension

Prepare documentation that will help guide users, stakeholders, and OSS community developers around the new SDC extension features.

Support the specific OpenRefine/SDC upload workflow from a IIIF endpoint

When talking to potential users of the Structured Data on Commons (SDC) batch upload functionalities for OpenRefine, we hear a lot about the use case of IIIF endpoints.

IIIF is the International Image Interoperability Framework. According to the framework's website it is "a set of open standards for delivering high-quality, attributed digital objects online at scale. It’s also an international community developing and implementing the IIIF APIs. IIIF is backed by a consortium of leading cultural institutions."

Many cultural institutions around the world present their files through a IIIF endpoint. This is indeed a sector-wide API standard.

Many IIIF endpoint managers are, or may be, interested in uploading files to Wikimedia Commons by leveraging this specific set of APIs.

In any endpoint, the source files to be uploaded to Wikimedia Commons can be called upon in a specific standardized way.
Metadata about the files (if present) can also be called upon in the same kind of standardized way.

OpenRefine users can use both of these specific API calls, during project creation and while wrangling data inside OpenRefine. But that's advanced stuff, and we can make that process easier.
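To make the "standardized way" concrete: the IIIF Image API addresses an image as `{base}/{region}/{size}/{rotation}/{quality}.{format}`, with technical image information at `{base}/info.json`. The sketch below builds such URLs from a base URI. The helper names and the example base URI are hypothetical; also note that the size keyword `max` is used by Image API 3.0 (older 2.x services commonly use `full`), and that descriptive metadata usually lives in Presentation API manifests rather than in `info.json`.

```javascript
// Strip trailing slashes so the parts can be joined predictably.
function trimBase(base) {
  return base.replace(/\/+$/, "");
}

// IIIF Image API pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
function iiifImageUrl(base, options = {}) {
  const {
    region = "full",
    size = "max", // Image API 3.0 keyword; 2.x services often expect "full"
    rotation = "0",
    quality = "default",
    format = "jpg",
  } = options;
  return `${trimBase(base)}/${region}/${size}/${rotation}/${quality}.${format}`;
}

// Technical information about the image service itself.
function iiifInfoUrl(base) {
  return trimBase(base) + "/info.json";
}

// Hypothetical base URI, for illustration only.
const base = "https://iiif.example.org/image/abc123/";
const fullImage = iiifImageUrl(base);
const thumbnail = iiifImageUrl(base, { size: "200," });
const info = iiifInfoUrl(base);
console.log(fullImage);
console.log(thumbnail);
console.log(info);
```

In OpenRefine, URLs like these could be assembled in a column and then used for fetching or uploading, which is the part that documentation or a dedicated wizard would make easier.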

We can tackle this in various ways:

Lightweight, documentation-focused approach: we don't build specific features for IIIF users but we document the process well for them;
And/or (perhaps at a later stage, if we see a lot of interest in this) we indeed create a specific IIIF-focused feature or wizard, probably to be used during project creation.

Rename 'Include nested category levels:' to 'Subcategory depth:'

During user testing, @lozanaross asked users (without giving them instructions) to guess what the subcategory depth input field meant. Many people thought it was a checkbox and most couldn't guess what it was supposed to do (especially people unfamiliar with other tools to work with Wikimedia Commons categories).

We suggest renaming the Include nested category levels: text in the interface to Subcategory depth:, which is a bit shorter and gives a clearer indication that a number needs to be entered in the input field.

User testing in June

This issue serves as a deadline to complete and test a number of the minimum requirement issues to fulfil the original WMF grant with what we promised for a June deadline.

Allow manual setting of categories

Allow for users to manually set a Category name that is not suggested by the suggest widget

Make it possible to Import IIIF collections

IIIF and the IIIF Presentation API are used by many GLAM institutions, and the ability to import records from IIIF Collections would greatly help reusers who wish to clean GLAM data, as well as users of the Commons extension.

Proposed solution

Given the collection root URL, an importer would traverse its content and fetch data from the various IIIF manifests in it.
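The traversal could look roughly like the sketch below, which assumes the Presentation API 3.0 shape where a Collection's `items` hold entries with an `id` and a `type` of either `Collection` or `Manifest`. The `load` callback is a hypothetical stand-in for an HTTP fetch plus JSON parse; it is kept synchronous here so the example stays self-contained, whereas a real importer would fetch asynchronously and handle errors and paging.

```javascript
// Walk a IIIF collection tree and return the ids of all manifests found.
// `load(url)` is a hypothetical callback returning the parsed JSON for a URL.
function collectManifestIds(collectionUrl, load, seen = new Set()) {
  // Guard against collections that reference each other in a loop.
  if (seen.has(collectionUrl)) {
    return [];
  }
  seen.add(collectionUrl);

  const collection = load(collectionUrl);
  const manifestIds = [];
  for (const item of collection.items || []) {
    if (item.type === "Manifest") {
      manifestIds.push(item.id);
    } else if (item.type === "Collection") {
      manifestIds.push(...collectManifestIds(item.id, load, seen));
    }
  }
  return manifestIds;
}

// In-memory stand-in for a IIIF server; URLs and contents are made up.
const fakeServer = {
  "https://example.org/iiif/root": {
    type: "Collection",
    items: [
      { id: "https://example.org/iiif/m1", type: "Manifest" },
      { id: "https://example.org/iiif/sub", type: "Collection" },
    ],
  },
  "https://example.org/iiif/sub": {
    type: "Collection",
    items: [
      { id: "https://example.org/iiif/m2", type: "Manifest" },
      // Deliberate cycle back to the root, to exercise the guard.
      { id: "https://example.org/iiif/root", type: "Collection" },
    ],
  },
};

const manifests = collectManifestIds(
  "https://example.org/iiif/root",
  (url) => fakeServer[url]
);
console.log(manifests);
```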

Additional context

https://iiif.io/api/presentation/3.0/#51-collection
#19
https://lbiiif.riksarkivet.se/collection/kartor-och-ritningar (example collection)

Category autocompletion while entering Commons categories

Refinement / specific sub-feature for #3.

Many Wikimedia Commons tools/interfaces allow users to enter / work with Commons categories. Usually, these tools or interfaces offer autocompletion of names of Commons categories. We will make Wikimedia Commons users in OpenRefine quite happy if the Commons Extension also offers this functionality!

Just for inspiration, showing what this looks like in various tools.

In the Wikimedia Commons UploadWizard

In the HotCat gadget (note the blue checkmark that appears when the user has selected a correct category name)

In the ISA Tool (the user types something without Category: but then Category: is being displayed)
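For reference, one way such suggestions can be obtained is the MediaWiki `opensearch` API module restricted to namespace 14 (Category). The sketch below is not a description of how the extension implements its suggest widget; the helper names are invented for the example. It builds the request URL and strips the `Category:` prefix from the returned titles, mirroring the ISA Tool behavior where the user types without the prefix.

```javascript
// Build an opensearch request limited to the Category namespace (14).
function buildCategorySuggestUrl(prefix, limit = 10) {
  const params = new URLSearchParams({
    action: "opensearch",
    format: "json",
    namespace: "14",
    search: prefix,
    limit: String(limit),
    origin: "*", // allows anonymous cross-origin requests from a browser
  });
  return "https://commons.wikimedia.org/w/api.php?" + params.toString();
}

// opensearch responses have the shape [query, [titles], [descriptions], [urls]].
function parseCategorySuggestions(response) {
  const titles =
    Array.isArray(response) && Array.isArray(response[1]) ? response[1] : [];
  return titles.map((title) => title.replace(/^Category:/, ""));
}

// Hand-written sample response (not real data).
const sampleSuggestResponse = [
  "Sculptures in",
  ["Category:Sculptures in Delft", "Category:Sculptures in Rotterdam"],
  ["", ""],
  [
    "https://commons.wikimedia.org/wiki/Category:Sculptures_in_Delft",
    "https://commons.wikimedia.org/wiki/Category:Sculptures_in_Rotterdam",
  ],
];

const suggestUrl = buildCategorySuggestUrl("Sculptures in");
const suggestions = parseCategorySuggestions(sampleSuggestResponse);
console.log(suggestUrl);
console.log(suggestions);
```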

Finalize basic Commons-specific schema template specifications: Information, Artwork, Art Photo, Book

We've continued thinking about Wikimedia Commons-specific template support inside OpenRefine's Wikibase schema builder.

We've come up with the concept of 'schema templates': basically, these are empty Wikibase schemas inside OpenRefine. For the Wikimedia Commons use case, we want to add a few default ones corresponding to frequently-used file information templates that are also Structured Data on Commons-driven (Information, Artwork, Art Photo, Book - for now). Users will be able to add their own.

Current wireframes by @lozanaross were based on a spreadsheet by Sandra, but need some tweaking and finalization (basically, there will be other custom statements that reflect current SDC modeling conventions). @trnstlntk will create these based on her knowledge of these modeling practices.

Update README.md with basic documentation on installation and functionalities of this extension

By the end of October it would be good to update this repo's README.md so that it is clear to laypeople who want to install and use the extension. Basic things to include:

Info on how to install the extension (can link to more specific info in our docs, but let's have some basic info here too)
Info and a few examples on the GREL commands we built (same)
Info and examples of the workflow of starting a project with Commons categories (same)

Decide upon, design and develop parsing options that appear after user has entered Commons categories

After a user has entered one or more Commons categories to start an OpenRefine project with, it makes sense to present them with some custom parsing options in the 'Configure Parsing Options' dialog window.

Several that I can think of off the top of my head, and that would make sense:

Reconcile file names (+ let users specify the language against which they want to reconcile - for possible data extension they may want to do later)
Display thumbnails of files? y/n
Display column with Wikitext? y/n
Display column with categories? y/n
Display one or more columns with some SDC in it already (user specifies the properties they are interested in)

I may be missing some obvious ones, and I can imagine that conversations with potential end users may give us more good ideas/suggestions.

As for building this, the existing Wikitext parsing options in OpenRefine can be used for inspiration (although that interface is not optimal).

In our March 17 team meeting we talked about this a bit. Some of this should be retrievable via API, bypassing the need for the end user to run the Commons reconciliation service, which would be a good thing!

Add extension tab in UI

The Commons extension needs to be accessible from the 'Create project'->'Get data from' options.

Follow documentation from the technical reference to implement the required .js files, using the 'Database' and the 'GData' extension files as examples.
