openlibrary-librarians's Introduction

openlibrary-librarians

Coordination between the OpenLibrary.org Librarian community

The purpose of this repository is to track and manage Open Library issues that require human review, cleanup, or discussion. This may include bulk edits via librarian bots but should not include issues with the Open Library website.

Types of issues appropriate for this repository include, but are not limited to:

  • Duplicate and conflated records (authors, works, editions)
  • Review of data quality
  • Review of questionable, inappropriate, or spammy edits.

Team lead: @seabelis

openlibrary-librarians's People

Contributors

cclauss, mekarpeles, seabelis

Forkers

seabelis cclauss

openlibrary-librarians's Issues

30 broken author names

This search returns 36 results: https://openlibrary.org/search/authors?q=kru
Most of them are corrupted names in which composed diacritics were replaced by spaces, e.g. "Gerd Kru ssmann" from the MARC record "Krüssmann, Gerd".

Most of the replacements should be obvious, but the MARC records can be consulted for confirmation.
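
As a rough sketch of how a candidate fix could be checked against the MARC heading, the snippet below compares the two names after folding away diacritics and whitespace. The function names are hypothetical, and it assumes the MARC name has the "Last, First" form shown above; it is not an existing Open Library tool.

```python
import unicodedata
from typing import Optional


def _fold(name: str) -> str:
    """Lowercase, drop combining marks, and remove all whitespace."""
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(no_marks.lower().split())


def marc_to_display(marc_name: str) -> str:
    """Turn 'Last, First' into 'First Last'; leave other forms untouched."""
    last, sep, first = marc_name.partition(",")
    return f"{first.strip()} {last.strip()}" if sep else marc_name.strip()


def suggest_fix(corrupted: str, marc_name: str) -> Optional[str]:
    """Return the MARC-derived name if it matches the corrupted one, else None."""
    candidate = marc_to_display(marc_name)
    return candidate if _fold(corrupted) == _fold(candidate) else None


print(suggest_fix("Gerd Kru ssmann", "Krüssmann, Gerd"))
```

A result of None means the two names do not line up under this simple comparison, so the record would still need a human look.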

More incorrect records beyond #1

Problem

The publish date is incorrect on many records beyond the ones dated 9999 (those were already sorted out). These additional records were found in the data dump, and I put all the ones I could find into one file.

Notes of document

  • this is an estimate of the records that should be changed, so some on here may be legitimate and some that need correction are not on this spreadsheet.
  • the first column is the OL ID; to make it complete, add OL in front of it and M behind it.
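
For anyone processing the file with a script, the ID rule above could look something like this. The function name is made up for illustration, and it assumes the first column holds only digits.

```python
def edition_olid(number) -> str:
    """Build a full edition ID by adding 'OL' in front and 'M' behind."""
    digits = str(number).strip()
    if not digits.isdigit():
        raise ValueError(f"expected a numeric ID, got {number!r}")
    return f"OL{digits}M"


print(edition_olid(12701896))
```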

Solutions

As mentioned in #1, the number of records that need a correction is almost half a million. There are two ways to approach it: by hand or by bot. Since I don't really know the answer, I'm opening this up so that it's known and there's a way to work on it. Not every record is incorrect, but each should be evaluated. Here's how far I got with the manual process, so feel free to improve it:

Bad work titles that include multi-copy packaging

Amazon-listed multiple copy packs with mangled titles and often-mangled authors:

https://openlibrary.org/search?q=title%3A+%22Copy+counter%22&mode=everything
https://openlibrary.org/search?q=title%3A+%22Copy+display%22&mode=everything
https://openlibrary.org/search?q=title%3A+%22Copy+dump%22&mode=everything

Other similar forms may be added to the issue.
Handling suggested:

  1. These “works” should first be reassigned to the correct author
  2. Packaging info should be moved to the edition’s physical description
  3. Edition titles should be stripped of packaging info and merged to the appropriate work
  4. Redundant now-void work should be deleted
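
To find candidates in a data dump rather than through the searches, a simple title check along these lines might help. It is only a sketch: the pattern covers just the three phrases from the searches above, and the function name is hypothetical.

```python
import re

# Phrases taken from the three searches above; other forms may need adding.
PACKAGING = re.compile(r"\bcopy\s+(counter|display|dump)\b", re.IGNORECASE)


def has_packaging_info(title: str) -> bool:
    """Return True if the title contains one of the known packaging phrases."""
    return PACKAGING.search(title) is not None
```

A title that matches still needs manual review, since the pattern cannot tell where the real title ends and the packaging info begins.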

Bad authors "Delete", "Duplicate", etc

A previous manual process for resolving duplicate author records, used by patrons without merge privileges, was to move all the works to a single record and then rename the empty records to something that flagged them as not to be used, such as "Delete Duplicate" or "Delete - see John Smith".

There are also a few records of similar forms which appear to have been imported directly, presumably from an ILS where they had been similarly flagged. Those records will probably need to be resolved individually using a process different than that below.

For records which were originally accurate, but duplicated, I believe the process should be to:

  • Revert to a version of the record with the original name
  • Merge with the master record (for this step you may need to either: a) wait for the renamed record to get reindexed, or b) manually add its OLID to the merge candidates URL)

You can see an example here: https://openlibrary.org/authors/OL6810374A?m=history

Here are some searches:

Report Duplicate Works Here

Please use one entry for a group of duplicates. Please include the complete links. It is not required to indicate the "best" record, but if you're interested...

  • The Work record that will be kept is the oldest record. It is possible to see the creation date at the bottom of the record; you can also tell by the Work ID as the lowest number is the oldest record.

  • Exceptions are made if a newer record has associated user lists; in this case that record will be kept in order to prevent the lists from breaking.

  • Prior to posting the duplicates, it is helpful if metadata is transferred from the duplicate records to the record that will be retained. If you're not sure which record should be kept, please ask before doing this, especially in the case of conflated records or if more than one of the records has user lists.
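
The rules above can be sketched in a few lines of code. This is an illustration only, not an existing tool: the function names are hypothetical, and the caller is assumed to know already which records have user lists.

```python
import re
from typing import Iterable, Optional

_WORK_ID = re.compile(r"OL(\d+)W")


def work_number(work_id: str) -> int:
    """Extract the numeric part of a Work ID such as 'OL8765841W' (or a URL containing one)."""
    match = _WORK_ID.search(work_id)
    if match is None:
        raise ValueError(f"no Work ID found in {work_id!r}")
    return int(match.group(1))


def pick_keeper(work_ids: Iterable[str], with_lists: Iterable[str] = ()) -> Optional[str]:
    """Return the record to keep, or None when a human should be asked first."""
    ids = list(work_ids)
    listed_set = set(with_lists)
    listed = [w for w in ids if w in listed_set]
    if len(listed) == 1:
        return listed[0]  # keep the record with user lists so the lists don't break
    if len(listed) > 1:
        return None       # more than one record has lists: ask before doing anything
    return min(ids, key=work_number)  # lowest number is the oldest record
```

Entries in with_lists are matched as exact strings, so they should be written the same way as in work_ids.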

Pro Tip! Sometimes the titles of works have been changed from one work to an entirely different work. This is usually the result or cause of conflation. You can check the original title of a work by looking at the first version of a record's history. To do this, go all the way to the bottom of the record and click the date of the bottom record (highlighted in grey).

Untangle two Giovanni Pico della Mirandola authors

These two authors, uncle and nephew, have been merged together:

along with a couple of symposia, which I believe should also be separate author records.

I've reverted the author records, but I'm not going to go through all 475 works to untangle them. It'll take some care to pick them all apart (and I wouldn't be surprised if some of them had incorrect authors when originally imported, since the two authors are very easy to confuse).

Conflation, Grundlagen und Gedanken

From a patron:

https://openlibrary.org/books/OL12701896M/Grundlagen_Und_Gedanken

I ran across this in an AUTHOR search for Max Frisch. I started trying to edit it working on the individual "editions" but it's such a huge mess that I gave up and suggest that someone look at WorldCat and start over.

  1. Grundlagen und Gedanken is a series title, not a work title, except it isn't even that. Rather it's the first few words of several different series titles referring to specific literary genres. So two of the three "editions" are in the Grundlagen series about drama and one is in the novel series.
    Since it's a publisher's series, there's no author, certainly not Frisch.

  2. These aren't "editions," they're three completely separate books that happen to have the same authors. One is about Frisch's play Andorra, one about his play Biedermann und die Brandstifter, and one about his novel Homo Faber.

A lack of definition for subject tags

Problem

It seems like the lack of a definition of what a 'subject' is creates various issues:

  1. People might add in tags that might not be true subjects
  2. An inconsistency forms between the OL and IA subjects: https://internetarchive.slack.com/archives/C0ETZV72L/p1586201847229900

Solutions

  • A community discussion about what a subject tag is, along with a consensus on what to use
  • provide instructions for subject tags when entering them in
  • possibly splitting the subject tag categories out further (than just people, places, etc.) to accommodate different subject tags that do belong, but just not with each other

A number of questions, mostly about translations

I've been using OL quite extensively for some research, and in a fit of possibly lockdown-related ADD have started to correct errors along the way. A few questions have cropped up that I could not find answers to. Please feel free to point me at any document I may have overlooked.

  • I have seen both the original-language title being the title of the work as well as the title of the English translation being used. Is there a preferred policy?

  • Many German books include "Roman" as a sort of subtitle, equivalent to the English "a novel". Should these be regarded as subtitles?

  • Subjects and, on occasion, other metadata such as pagination are sometimes in English and sometimes in the work's or edition's language. Which is preferred here?

  • Some language confusion regarding place names and "published in" metadata: Should these be recorded as "Milan, Italy" or "Milano, Italia" (or, for people who like to see the world burn, maybe "Milan, Italia")?

  • On that topic: US place names typically include the two-letter state abbreviation ("New York, NY"). Sometimes this is necessary to distinguish between, for example, the 43 different Springfields, so I believe changing these to just "Springfield, USA", as the instructions on the form field suggest, is inadvisable. I feel like "New York, NY" is probably universally understood to imply "USA". But at the same time, that assumption seems slightly arrogant and maybe US-centric. So: should it be "New York, NY, USA" or "New York, NY"?

  • There are some metadata fields where certain abbreviations are common, such as "1st ed.". Since these are just text fields, I guess it doesn't make much of a difference. But assume that I have absolutely no personal preference and want to add that information, which should I choose: "First edition", "1st ed.", or "1st edition"?

  • There seems to be a "genre" field for editions. This doesn't seem to be available in the edit form?

  • I've recently deleted "Referral IDs" from a few dozen links to Amazon. I have since noted that links to Amazon are generated automatically for editions that include ISBNs and shown in the table of editions. Considering manually-created links to Amazon could both be easily created for most editions as well as easily deleted, I am wondering which is preferred?

The following are observations/suggestions more than questions:

  • For translations, it is common practice in publishing to include a note such as "Translated from American English by ...". The dropdown under "languages", however, only offers English. Since the distinction seems to matter to the people doing the work, it might be advisable to respect that by adding appropriate entries. Besides American English, I remember seeing "South African English" and "Brazilian Portuguese", but the list is likely to be longer.

  • There's an open issue in this repository to report duplicate works. I have in the past used the "Link" metadata on duplicate works to link to the canonical work, thinking that it should be possible to programmatically identify and resolve such links in the future. So I want to suggest possibly advising users of such a workflow, as it seems slightly more streamlined than asking for and manually working through individual reports here on GitHub.

  • https://github.com/internetarchive/openlibrary/wiki/Library-Metadata-Standards advises sentence case for titles, using the somewhat confusing argument that it is easier to convert from there to title case than in the other direction. I've tried to follow those instructions, but, possibly because language rules are considered somewhat more binding in my native language, the part of my brain that likes to follow rules is getting rather confused.

Correct authors for "Michigan Historical Reprint Series"

https://openlibrary.org/authors/OL2982176A/Michigan_Historical_Reprint_Series

Although supposedly a scholarly publisher, this is just another publisher of out-of-copyright reprints with obfuscated provenance. We have almost 4600 works with many of the works having multiple editions -- some as many as 20.

This example shows the problem. We already have two (separate, unfortunately) work records for this work:
https://openlibrary.org/search?q=A+History+of+the+Feud+Between+the+Hill+and+Evans+Parties+of+Garrard+County&mode=everything
which we'll eventually be able to clean up by merging the authors and works, but the third has a bogus author and no way to reconcile it with the other two other than the title.

This "work": https://openlibrary.org/works/OL8765841W/Washington_observations
has 22 editions which appear to be various publications of the US Naval Observatory at Washington, but the records don't have enough information to be able to do anything useful with them.

I propose we delete everything from this "author." It will be too difficult to clean up, the intent was (is) spammy, and anything of value we have or can get from other sources.

There is an author named "Unknown" with 3,224 works

Relevant url?

https://openlibrary.org/authors/OL2624611A/Unknown
https://openlibrary.org/authors/OL2629888A/None

Expectation

These "authors" should not exist

Details

Searching for author:none, author:unknown, etc. will result in thousands of incorrectly edited works.

  • Logged in (Y/N)?
    N
  • Browser type/version?
    FF70.0
  • Operating system?
    Windows 10

Proposal & Constraints

Ideally, a bot can remove these bad authors from infobase.
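
As a starting point for such a bot, the detection step might be sketched as below. The set of names only covers the two examples given in this issue and would need extending; the function name is hypothetical, and nothing here touches infobase itself.

```python
# Placeholder names mentioned in this issue; extend as more are found.
PLACEHOLDER_NAMES = {"unknown", "none"}


def is_placeholder_author(name: str) -> bool:
    """Return True if the author name is just a placeholder such as 'Unknown'."""
    normalized = " ".join(name.split()).casefold()
    return normalized in PLACEHOLDER_NAMES
```

Only exact matches are flagged, so a real author whose name merely contains one of these words is left alone.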

Stakeholders

@hornc

What is and isn't a book seems incomplete

There are a couple of questions that the lists don't cover:

Where are we placing dissertations/theses?
What's the difference between a brochure and pamphlet?
Then there are also 'leaflets'. If there's a difference, it would be good to write it out. The reason is that I wouldn't consider every pamphlet to be a book, as some act like a brochure: https://www.differencebetween.com/difference-between-pamphlet-and-vs-brochure/ says instructionals that come with products are pamphlets. Now that wouldn't be a book, right?

There are likely more questions than these, but I am starting here.

Add what is and isn't a book in the ReadMe

I can't really tell the difference between a book and a document. Since this is the librarians' GitHub, I would say it's the perfect place to show librarians what is and isn't a book, especially since it isn't really written down in many places.

If it's too big for the ReadMe, at least make a separate document and put the link in the ReadMe.

In the meantime, would https://www1.maine.gov/dhhs/mecdc/population-health/odh/documents/tasty-treats-teeth.pdf be a book or a document? The government creates many documents, but it kind of looks like an e-book.
