Comments (17)

robertknight avatar robertknight commented on June 15, 2024

The format of the new selectors is described at https://github.com/hypothesis/client/blob/7267c198adbf31bcd0bf0065aa376b3a4bf2702e/src/types/api.ts#L75. Only the "url" field is currently marked as required.

For PDF/fixed-layout books, there is also a PageSelector selector. None of the information in that selector is available in the old API, so it would need to be looked up via the VS API.

from h.

robertknight avatar robertknight commented on June 15, 2024

Some notes on how many VitalSource annotations will need to be migrated and the books and Hypothesis groups they are associated with: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1664901048607019.

seanh avatar seanh commented on June 15, 2024

Are new VitalSource annotations with the old format still being created? If not then I wonder about doing a one-off DB migration to migrate all annotations to the new format in the DB. That's what we'd normally do if we wanted to migrate a bunch of data in the DB.

You'd then use the admin pages to reindex those annotations. There are already various admin pages in h to reindex all annotations of a user/group/etc. You may be able to use one of those, or you may need to add a new one.

I think it should also be possible to write a "migrate to the new VitalSource format" admin page if you want to do it that way. But this would be the first time we've written a Celery task to do a bulk migration on the DB; those have always been done using DB migrations in the past. The task would also need to schedule each annotation for reindexing after the annotation has been changed in the DB. And I suppose once you're finished, you'll delete the admin page?

Do we know what volume of annotations we're talking about here?

robertknight avatar robertknight commented on June 15, 2024

Do we know what volume of annotations we're talking about here?

Number of annotations: 12,656. (2023-01-24 update: 15,360)
h DB Query:

select count(*) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

Number of groups: 75 (2023-01-24 update: 101)
Query:

select count(distinct(annotation.groupid)) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

Number of users: 432 (2023-01-24 update: 726)
Query:

select count(distinct(annotation.userid)) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

Number of URLs (== number of distinct chapters/pages): (2023-01-24 update: 655)

select count(distinct(uri_normalized)) from document_uri where uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

robertknight avatar robertknight commented on June 15, 2024

Given that there are only a small number of groups, we could use the existing "Reindex all annotations in a group" facility in the search index management page at http://localhost:5000/admin/search to handle reindexing. It would be more convenient if we modified that form to support supplying a list of groups (eg. as a comma-separated list) to reindex.
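
As a rough illustration of the comma-separated input suggested above, the form handler could normalize the field like this. This is only a sketch: `parse_group_ids` is a hypothetical helper, not an existing function in h, and the group IDs in the usage example are made up.

```python
def parse_group_ids(raw: str) -> list[str]:
    """Split a comma-separated form value into a de-duplicated list of group IDs.

    Hypothetical helper: whitespace around entries is ignored, empty entries
    are dropped, and the original order is preserved.
    """
    group_ids: list[str] = []
    for item in raw.split(","):
        item = item.strip()
        if item and item not in group_ids:
            group_ids.append(item)
    return group_ids


# Example usage with made-up group IDs.
print(parse_group_ids(" abc123, def456 ,,abc123 "))
```

Each resulting ID could then be passed to whatever code the existing single-group reindex form already calls.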

robertknight avatar robertknight commented on June 15, 2024

As noted in the issue description, the migrated annotations should include some data which is not present in the original annotation:

// Fields not available in previous data. Need to be either omitted or found via lookup
cfi: "/4",
title: "Chapter 2"

The cfi field is used to sort annotations by chapter in the sidebar. Then annotations within each chapter are sorted by text position. The title field is used to display chapter headings in the sidebar.
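
To make the two-level ordering concrete, here is a minimal sketch. It is not the client's actual implementation: it assumes each annotation is a plain dict with hypothetical `cfi` and `start` (text position) keys, and it only handles CFIs made of simple numeric steps such as "/4" or "/6/2". Real EPUB CFIs can contain assertions, indirections and offsets that this ignores.

```python
import re


def cfi_sort_key(cfi: str) -> list[int]:
    """Turn a simple CFI such as "/6/2" into [6, 2] so steps compare numerically."""
    return [int(step) for step in re.findall(r"/(\d+)", cfi)]


def sort_annotations(annotations: list[dict]) -> list[dict]:
    """Order by chapter (CFI) first, then by text position within the chapter."""
    return sorted(annotations, key=lambda a: (cfi_sort_key(a["cfi"]), a["start"]))


annotations = [
    {"id": "c", "cfi": "/10", "start": 5},
    {"id": "a", "cfi": "/4", "start": 50},
    {"id": "b", "cfi": "/4", "start": 10},
]
print([a["id"] for a in sort_annotations(annotations)])
```

Comparing the steps as integers matters because a plain string sort would place "/10" before "/4".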

We could omit these fields and make the client dynamically look up the CFI and title that correspond to the path value, by querying the VitalSource reader. However this would mean that we'd be missing this information when presenting annotations outside of the reader.

To add this data during the migration, we have a couple of options:

  1. Generate a data set (eg. as a JSON file) of all the CFIs and chapter titles for all books annotated so far, add that to the h repo and use it locally during a migration
  2. Make HTTP requests to VitalSource's API in order to fetch the information during a migration. The LMS app has code that already uses the same API for use in the assignment picker

A total of 74 different books have been annotated so far.

Query:

select distinct(substring(uri_normalized, '/books/[0-9A-Z-]+')) from annotation join document_uri on annotation.document_id = document_uri.document_id where document_uri.uri_normalized like 'httpx://jigsaw.vitalsource.com/%';

robertknight avatar robertknight commented on June 15, 2024

I think it might be helpful to do this migration in several stages:

  1. Deploy hypothesis/client#5072, so we capture the new selectors for all new annotations
  2. Run a migration to backfill the new selectors for annotations created prior to (1). This can be done prior to enabling the book_as_single_document feature flag for everyone. This will require additional data beyond what is in the DB, per my previous comment.
  3. Enable the book_as_single_document feature flag for everyone.
  4. Run a final migration to update the annotation URLs from https://jigsaw.vitalsource.com/books/{id}/{suffix} to https://bookshelf.vitalsource.com/reader/books/{id}. This will not require any data beyond what is stored in the annotations.
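
The URL rewrite in the final step can be sketched as follows. This is an illustration only: `new_book_url` is a hypothetical name, and the pattern assumes the jigsaw.vitalsource.com host seen in the example URLs and queries in this thread.

```python
import re

# Old per-chapter/per-page URL: https://jigsaw.vitalsource.com/books/{id}/{suffix}
_OLD_URL = re.compile(r"^https://jigsaw\.vitalsource\.com/books/(?P<id>[^/]+)/.*$")


def new_book_url(old_url: str) -> str:
    """Map an old chapter/page URL to the single per-book reader URL.

    URLs that do not match the old pattern are returned unchanged.
    """
    match = _OLD_URL.match(old_url)
    if match is None:
        return old_url
    return f"https://bookshelf.vitalsource.com/reader/books/{match['id']}"


print(new_book_url("https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml"))
```

Only the book ID is carried over, which is why this step needs no data beyond the stored URL.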

robertknight avatar robertknight commented on June 15, 2024

Some notes on step (2) of the migration:

For each existing annotated VitalSource URL (example: "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml") we need to:

  1. Extract the book ID ("L-999-70049" here)
  2. Extract the content path ("epub/OPS/loc_002.xhtml" here)
  3. Look up the table of contents data ("TOC data") for the book via the VitalSource API. See LMS app for code that does this.
  4. Find the first entry with a matching path in the TOC data
  5. Generate an "EPUBContentSelector" selector in JSON format (see PR description)
  6. Find all annotations with the old URL
  7. Read the target_selectors data (this should be a JSON array) and append the "EPUBContentSelector" JSON from step (5)

robertknight avatar robertknight commented on June 15, 2024

I'm currently working on a script to gather the data needed for the backfilled EPUBContentSelector selectors. I encountered an issue with PDF-based books, as not all pages have an entry in the table of contents. See https://vitalsource.slack.com/archives/C01208U1A2F/p1671548778110049.

robertknight avatar robertknight commented on June 15, 2024

Using the above APIs I got a dump of the TOC and pages data for all the VS books annotated so far. See https://drive.google.com/file/d/16FMKv2VmKDnpZEzdA-3MTc4W22c1pPHB/view?usp=share_link (H internal only). This covers steps 1-3.

robertknight avatar robertknight commented on June 15, 2024

I have a first pass of a JSON file containing the data for the updates we'll need to apply: https://gist.github.com/robertknight/96a438e4869930d3e4fc285ca711d989 contains a mapping from the current URL of an annotation, to an object with url and selectors fields. The annotation's URL needs to be changed to the value in the "url" field, and the entries in selectors need to be added to the target_selectors field of the annotation, but only if there is not already an entry in that list with a matching type property.

The JSON output here was generated from an input list of current annotation URLs using this script.

This data is not final because there were some URLs in the input list for which I could not find the necessary entries in the VitalSource data, and I need to check some issues relating to the "title" field for some entries. These issues won't affect the structure of the data though.
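
The update rule described above (change the URL, and add each new selector only if no selector of the same type already exists) could look roughly like this. It is a sketch on plain dicts, not the migration code itself: the `target_uri` key and the `apply_update` name are assumptions, and the sample values are invented.

```python
def apply_update(annotation: dict, update: dict) -> dict:
    """Return a copy of the annotation with the new URL and any missing selectors.

    A selector from the update is skipped when the annotation already has a
    selector with the same "type", so re-running the update is harmless.
    """
    existing_types = {s.get("type") for s in annotation["target_selectors"]}
    new_selectors = [
        s for s in update["selectors"] if s.get("type") not in existing_types
    ]
    return {
        **annotation,
        "target_uri": update["url"],
        "target_selectors": annotation["target_selectors"] + new_selectors,
    }


# Invented sample data in the shape described above: old URL -> {url, selectors}.
updates = {
    "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml": {
        "url": "https://bookshelf.vitalsource.com/reader/books/L-999-70049",
        "selectors": [{"type": "EPUBContentSelector", "cfi": "/4", "title": "Chapter 2"}],
    }
}
annotation = {
    "target_uri": "https://jigsaw.vitalsource.com/books/L-999-70049/epub/OPS/loc_002.xhtml",
    "target_selectors": [{"type": "TextQuoteSelector", "exact": "some text"}],
}
migrated = apply_update(annotation, updates[annotation["target_uri"]])
print(migrated["target_uri"], len(migrated["target_selectors"]))
```

Because of the type check, applying the same update a second time leaves the selector list unchanged.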

robertknight avatar robertknight commented on June 15, 2024

I have updated the data at https://gist.github.com/robertknight/96a438e4869930d3e4fc285ca711d989 with document titles. When we migrate annotation URLs, we'll need to make sure document entries get created for the new URLs and have at least the titles set. The data now looks like:

[Image: VitalSource chapter info and book title]

robertknight avatar robertknight commented on June 15, 2024

There were a small number of annotated PDF page URLs which no longer appear in the page index for the book. I suspect what has happened is that the book has been updated or re-processed since it was originally annotated. We didn't record page numbers or CFIs at the time when these annotations were created, so we can't easily locate the correct page in the book. Fortunately for all new annotations that are created, we are capturing the CFI and page number.

Log output from https://github.com/hypothesis/vitalsource-url-migration/blob/main/gen_epub_selectors.py:

Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077445/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077446/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077447/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077448/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077449/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077450/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077451/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/24340/pages/691077452/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/9780133599145/pages/584498507/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/9780133599145/pages/584498510/content: Could not find CFI or title data for chapter
Error processing https://jigsaw.vitalsource.com/books/9780133599145/pages/584499351/content: Could not find CFI or title data for chapter

robertknight avatar robertknight commented on June 15, 2024

The latest version of the data that we'll need for the migration is now at https://github.com/hypothesis/vitalsource-url-migration/blob/main/vs-selectors.json. It has updated URLs and document (book) titles for all books. A small number of chapter/page URLs, mentioned in the previous comment, still had to be skipped.

robertknight avatar robertknight commented on June 15, 2024

Looking through a list of all the document titles that were fetched, I see there are some HTML entities and character references (for characters such as " and ') in them, which we'll need to convert to Unicode. I'll do that as part of an update to the script data.

"T. rex" and the Crater of Doom
2019 MyLab Management with Pearson eText for Fundamentals of Human Resource Management plus Third Party eText
A Christmas Carol
A Citizen’s Guide to the Political Psychology of Voting
A Tale of Two Cities
American Government
Anatomy and Physiology
Automating Inequality
Behavioral Neuroscience
Biology: A Global Approach, Enhanced eBook, Global Edition
Bookshelf Tutorial
Build and Program Your Own LEGO Mindstorms EV3 Robots
Cardiología
Chemical Process Safety
Children's Play
College Algebra Essentials
Concise Text of Neuroscience
Deep Learning with Python, Second Edition
Discovering Psychology
Diversity in America
Doing Visual Ethnography
EBOOK: Economics, 12e
Engineering Fluid Mechanics, Enhanced eText
Essentials of Marketing Research
Everything's An Argument with Readings
Everything's an Argument with Readings
Fundamentals of General, Organic, and Biological Chemistry (Subscription)
Give Me Liberty!: An American History (Seagull Sixth Edition)  (Vol. 2)
Great Expectations
Head First Mobile Web
Heart of Darkness (Fifth Edition)  (Norton Critical Editions)
How Music Works
International Economics (Subscription)
International Relations: A Very Short Introduction
Introducing Relativity
Kant: Groundwork of the Metaphysics of Morals
Listening Well
Macroeconomics
Marketing
Media/Society
Methods for Teaching Students with Autism Spectrum Disorders
Operations Management: Processes and Supply Chains
Paradise Lost
Personal Connections in the Digital Age
Personality Psychology: Domains of Knowledge About Human Nature
Philosophy in the United States
Physics for Engineers and Scientists (Third Edition)  (Vol. 2)
Politics and International Law
Popular Culture, Geopolitics, and Identity
Principles of Economics
Qualitative Research Design
Reason
Salt, Fat, Acid, Heat
Selling School: The Marketing of Public Education
Service Management: Operations, Strategy, Information Technology
Silencing the Past (20th anniversary edition)
Sustainability: A Comprehensive Foundation
Teaching through Text
Testing Hypothesis in Bookshelf Online
The Greek Plays
The Language of Confession, Interrogation, and Deception
The Learner-Centered Curriculum: Design and Implementation
The Life of Sir Thomas More
The Lost Boys of Zeta Psi: A Historical Archaeology of Masculinity at a University Fraternity
The Political Philosophy of AI
The Pragmatic Programmer
The Routledge Handbook of Social Work and Addictive Behaviors
The Shattering: America in the 1960s
The Soul of A New Machine
The Spirit of Laws
U.S. History
US: A Narrative History Volume 1: To 1877
Understanding Cisco Networking Technologies, Volume 1
Understanding World Regional Geography
University Physics for the Life Sciences (Subscription)
Vladimir Putin: Life Coach
Welcoming Young Children into the Museum
Writing about Writing
Wuthering Heights
Wyllie's Treatment of Epilepsy
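
For the entity conversion mentioned above, Python's standard `html.unescape` handles both named entities and numeric character references. The sketch below uses it on a few titles from this list; the escaped input forms are assumptions, since the raw fetched titles are not shown in this thread, and `clean_title` is a hypothetical helper name.

```python
import html


def clean_title(title: str) -> str:
    """Decode HTML entities and character references in a fetched title."""
    return html.unescape(title)


# Assumed raw forms; the actual escaped titles are not shown in this thread.
raw_titles = [
    "&quot;T. rex&quot; and the Crater of Doom",
    "Children&#x27;s Play",
    "Media/Society",  # Titles without entities pass through unchanged.
]
for raw in raw_titles:
    print(clean_title(raw))
```

Note that `html.unescape` decodes a single level only, so a double-escaped title would need a second pass.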

robertknight avatar robertknight commented on June 15, 2024

The migration has been initiated and is expected to complete in the next 20 minutes or so. Slack thread with operations analysis here: https://hypothes-is.slack.com/archives/C4K6M7P5E/p1674638705104229.

robertknight avatar robertknight commented on June 15, 2024

The bulk of the migration is complete. There were a total of 24 out of ~15,400 annotations that could not be migrated. See notes at https://hypothes-is.slack.com/archives/C4K6M7P5E/p1674643433767469?thread_ts=1674638705.104229&cid=C4K6M7P5E.
