Comments (2)

nwy commented on August 25, 2024

This bug seems to apply to other punctuation too, not just hyphens.

Example with a period, searching for "gener.il":

On the results page, you see hit highlights.
http://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1836&date2=1922&proxtext=gener.il&x=0&y=0&dateFilterType=yearRange&rows=20&searchType=basic

On the individual pages, no hit highlights . . .
http://chroniclingamerica.loc.gov/lccn/sn83030214/1912-06-02/ed-1/seq-37/#date1=1836&index=1&rows=20&words=gener+gener.il+il&searchType=basic&sequence=0&state=&date2=1922&proxtext=gener.il&y=18&x=9&dateFilterType=yearRange&page=1

from chronam.

keshavmagge commented on August 25, 2024

@dbrunton
Though we have problems with punctuation marks in OCR words, this issue was a different one. I'll try explaining it.

Borrowing the search text "coca cola" from @eikeon's comment, take a look at one of the search hits:
http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-23/ed-1/seq-43/#date1=1836&index=0&rows=20&words=Coca+Coca-Cola+Cola&searchType=basic&sequence=0&state=&date2=1922&proxtext=coca+cola&y=-220&x=-1136&dateFilterType=yearRange&page=1

The request parameter we care about here is 'words': words=Coca+Coca-Cola+Cola. A piece of javascript in page.js tries to find coordinates for, and highlight, one word at a time in the OCR. That is, it tries to find coordinates for Coca, then Coca-Cola, and finally Cola. Due to a bug in the javascript, if a word was not found in the OCR, it did not proceed to the next word; it bailed out completely.
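To make the 'words' parameter concrete, here is a minimal sketch of pulling the word list out of a URL fragment like the one above. The function name `wordsFromFragment` is hypothetical and this is not the actual page.js code; it only assumes the fragment is formatted like a query string, as in the links in this thread.

```javascript
// Hypothetical helper (not from page.js): extract the list of words to
// highlight from a URL fragment such as "#date1=1836&words=Coca+Coca-Cola+Cola".
function wordsFromFragment(fragment) {
  // Drop the leading "#" so the rest parses like a query string.
  const params = new URLSearchParams(fragment.replace(/^#/, ""));
  const raw = params.get("words"); // "+" is decoded to a space
  if (raw === null) {
    return [];
  }
  // Split on whitespace and drop empty entries.
  return raw.split(/\s+/).filter(Boolean);
}

const sampleWords = wordsFromFragment(
  "#date1=1836&index=0&rows=20&words=Coca+Coca-Cola+Cola&searchType=basic"
);
console.log(sampleWords);
```

The order of the resulting list matters for the bug described here, since the words are tried one at a time in that order.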

So (please pay attention to the 'words' request parameter):
http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-23/ed-1/seq-43/#date1=1836&index=0&rows=20&words=Coca-Cola+Coca+Cola&searchType=basic&sequence=0&state=&date2=1922&proxtext=coca+cola&y=-220&x=-1136&dateFilterType=yearRange&page=1
would work and
http://chroniclingamerica.loc.gov/lccn/sn83030214/1909-05-23/ed-1/seq-43/#date1=1836&index=0&rows=20&words=Coca+Coca-Cola+Cola&searchType=basic&sequence=0&state=&date2=1922&proxtext=coca+cola&y=-220&x=-1136&dateFilterType=yearRange&page=1
would not.

The fix would make the javascript skip over any word not found in the OCR and keep looking until we run out of 'words'.
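A rough sketch of the difference between the two behaviours, not the real page.js code. The coordinate table `ocrCoords`, the box format, and both function names are made up for illustration, and it assumes the OCR for this page only contains the hyphenated token "Coca-Cola".

```javascript
// Hypothetical OCR lookup: word -> list of [x, y, width, height] boxes.
// Assumption for illustration: only the hyphenated token is present.
const ocrCoords = new Map([["Coca-Cola", [[120, 340, 90, 18]]]]);

// Behaviour as described in the bug: stop at the first word with no coordinates.
function boxesBailOnMiss(words, coords) {
  const boxes = [];
  for (const word of words) {
    const hits = coords.get(word);
    if (!hits) {
      return boxes; // bails out; later words are never tried
    }
    boxes.push(...hits);
  }
  return boxes;
}

// Behaviour after the proposed fix: skip the missing word and keep going.
function boxesSkipMisses(words, coords) {
  const boxes = [];
  for (const word of words) {
    const hits = coords.get(word);
    if (!hits) {
      continue; // word not in the OCR; try the next one
    }
    boxes.push(...hits);
  }
  return boxes;
}

const failingOrder = ["Coca", "Coca-Cola", "Cola"];
const workingOrder = ["Coca-Cola", "Coca", "Cola"];

console.log(boxesBailOnMiss(failingOrder, ocrCoords).length); // nothing highlighted
console.log(boxesBailOnMiss(workingOrder, ocrCoords).length); // highlight found
console.log(boxesSkipMisses(failingOrder, ocrCoords).length); // highlight found
```

Under that assumption, the bail-out version only finds a highlight when the matching token happens to come first, which matches the two URLs above; the skipping version does not depend on the order.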

Did I explain that right? Does that make sense at all?
