zotero / translators


Zotero Translators

Home Page: http://www.zotero.org/support/dev/translators

JavaScript 99.94% Shell 0.06%

translators's People

Contributors

avram stakats fcheslack hektech jenshnielsen adam3smith rmzelle smjwsk aurimasv unhammer gracile-fr mjg renskis kevinreiss nchachereau pbnjay simonster agoldst staticd-growthecommons thomasxie scito juris-m wingfay federico-arias jeanfred karnesky heikojansen robinpaulson adreagui calbo acbergan vl2 jpwarren alexwatkins edsu bpw proximamonkey krevad ramblehead hktang piyushs texkyzqk fredriondet fraba otterfan fefe982 gaomx samuel38 andersjohansson o-zone jfbeatty gabstehr paregorios andreas-h driky mete0r lreznick hiekehuistra longmatthewh htpham karya0 dominik-k f-mb marclajoie123 magicmark nschneid hityjj twistedmove mba811 holocronweaver zuphilip sjewo mfenner baig proquest duweifu rm2342 nemobis jon-freed jkreft-usgs apcshields infolis liberalartist reubot gmcharlt ggiordano greenwicher jglev falaca tertiarysources dbs cjzeng shmakt sheepeeh jimapps jjweis hkurotaki franziskahorn eranroz mdlincoln

translators's Issues

Expand embedded metadata detection

The <link rel="alternate" /> syntax for providing alternate representations should be used when we look for embedded metadata. A recent discussion notes a site providing dissertations that we don't import correctly. In addition to Google/Highwire metadata which we're parsing, it includes such <link rel="alternate" /> references to structured descriptions:

<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=librismarcxml&pids=diva2:459013"
  rel="alternate" title="MARC-XML Representation" type="text/xml" />
<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=swepubmods&pids=diva2:459013"
  rel="alternate" title="MODS Representation" type="text/xml" />

I don't think we can expect to read these as-is, since the text/xml type is too vague, but we should look for known types for formats we do read, just like we do for intercepting RIS/BibTeX download. That means application/mods+xml for MODS, etc.
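A minimal sketch of the proposed check, assuming the page's link elements have already been reduced to plain objects with `href`, `rel`, and `type`. The function name and the type table are hypothetical; only `application/mods+xml` is named in the discussion above, and the other entries are assumptions about types we might want to recognize.

```javascript
// Hypothetical table of MIME types that identify a format we can import.
// Only "application/mods+xml" comes from the issue text; the rest are assumptions.
const KNOWN_ALTERNATE_TYPES = new Map([
  ["application/mods+xml", "MODS"],
  ["application/marcxml+xml", "MARCXML"],
  ["application/x-bibtex", "BibTeX"],
  ["application/x-research-info-systems", "RIS"]
]);

// Return the alternate links whose declared type is one we know how to read.
// Vague types such as "text/xml" are skipped on purpose.
function findImportableAlternates(links) {
  return links
    .filter(link => /(^|\s)alternate(\s|$)/i.test(link.rel || ""))
    .map(link => ({
      href: link.href,
      format: KNOWN_ALTERNATE_TYPES.get((link.type || "").trim().toLowerCase())
    }))
    .filter(candidate => candidate.format !== undefined);
}
```

With the DiVA example above, both links would be skipped because they declare `text/xml`; they would only be picked up if the site switched to a specific type such as `application/mods+xml`.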

The Times and Sunday Times - restricted access

Sydney Morning Herald - site changed; requires rewrite

We should probably write a translator for Fairfax Media, since there are several news sites that use this content delivery suite.

See http://www.fairfax.com.au/network-map.aspx for a list of sites

eLibrary.ru - creators are no longer scraped.

I'm hoping @AJLyon can have a look - the translator does some funky stuff with creators that I don't want to mess with w/o understanding it.

eMJA - broken after site-redesign. Needs complete re-write.

BIUM - Translator works; session timeouts --> no tests. Results are low quality

Berkeley Electronic Press - needs complete re-write after being taken over by de Gruyter

Sirsi eLibrary support

@usclibraries has prepared a site-agnostic version (8b921dca58871474568d55c825035126bc318c94) of Rice/Rutgers modifications of the Sirsi translator. The translator is intended to cover the most recent iteration of the SirsiDynix OPAC, called eLibrary.

It looks like it can be used as a drop-in replacement for Rice and Rutgers immediately, and it works for USC. The translator still needs some work, since we can probably tear out some of the old code from the original Sirsi translator and streamline this one. More importantly, it doesn't yet work with all eLibrary installations. Joyce at USC pointed out a list of Association of Research Libraries members using SirsiDynix (shown as "Unicorn" on the list): http://www.librarytechnology.org/arl.pl

Of these installations, the present translator definitely doesn't work with the Indiana catalog.

We need to review the rest of the major installations and see which ones we're missing, and see how we can shoehorn support for them into a unified translator.

KOBV - translator works; no permalinks so no working tests.

Ab Imperio - translator works, but tests fail - they're commented out

Create Labels

We should have some labels here - major, minor, error, new translator needed

MODS - translate "extent" element into numPages, at least for books

http://groups.google.com/group/zotero-dev/tree/browse_frm/thread/4c00e8ebacfcfcf1/7a5041c27d3389c5?rnum=1&_done=%2Fgroup%2Fzotero-dev%2Fbrowse_frm%2Fthread%2F4c00e8ebacfcfcf1%3F#doc_a1e870dfc15a8c8d

www.loc.gov/standards/mods/userguide/physicaldescription.html

Since this can contain other info for non-book item types we should check that this doesn't cause any problems.
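One way to keep non-book extents from causing problems is to map only values that clearly look like a page count, and only for books. The following is a rough sketch; `numPagesFromExtent` is a hypothetical helper, and the pattern for what counts as a page count is an assumption, not what the MODS translator currently does.

```javascript
// Hypothetical helper: derive numPages from a MODS <extent> string.
// Returns null unless the item is a book and the extent contains an
// obvious page count such as "345 p.", "345 pp." or "345 pages".
function numPagesFromExtent(extent, itemType) {
  if (itemType !== "book" || typeof extent !== "string") {
    return null;
  }
  const match = extent.match(/(\d+)\s*(?:p\.|pp\.|pages?\b)/i);
  return match ? match[1] : null;
}
```

Extents such as "1 videocassette (30 min.)" or "23 cm" yield null, so they would be left out of numPages instead of being copied in verbatim.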

Library Catalog (Aleph) - translator works; couldn't track down any Aleph catalog with permalinks --> no tests.

dLibra - no test for multiples but translator works

(the search result page is a bit odd & takes a long time to load, so this is probably to be expected).

Escaping of note content in RDF export

People in the TEI world have noticed that our RDF export makes a mess of HTML tags in item data:

<rdf:value>&lt;h6>1256 to 1272&lt;/h6>
&lt;p>&amp;nbsp;&lt;/p>
&lt;p>page 32&amp;nbsp; roll 1218a 1272 John the Clerk against William de Grendon regarding the warrant of 8 acres&lt;/p>
&lt;p>page 40 ditto&lt;/p>
&lt;p>&amp;nbsp;p108 roll 144 1269 Claim by Margery who was the wife of Henry of Ashbourne&amp;nbsp; re dower from various individuals including Stephen of Ireton the third part of an acre of meadow in Snelston, and ?( William de ) Hulton in Clifton .&amp;nbsp; William de hylton gives up dower amongst others.&amp;nbsp; Makes one wonder whether&amp;lt;per corresp='#williamofhultonclerk' role='m'&amp;gt;William de Hulton&amp;lt;/per&amp;gt; and William the clerk are the same person.&lt;/p>
&lt;p>page 109 ditto Roger is the son of Henry of Ashbourne and is in the custody of Margaret countess Derby&amp;nbsp; and lands in the custody of Edmund king's son&lt;/p>
&lt;p>page 9&amp;nbsp; and 10 1258 Information re Henry of Ashbourne.&amp;nbsp; Holds a court. Case of villeinage.&amp;nbsp; Confirms Henry heir of&amp;nbsp; Robert of Ashbourne.&amp;nbsp; Stephen of Ireton one of the pledges for Henry.&lt;/p>
</rdf:value>
</bib:Memo>

We are presumably doing the same with things like <i> in item titles. A proper solution to this, as suggested in the linked thread on eXist-TEIXML, is to namespace those tags. We would also need to replace non-XML entities like &nbsp;.

Unfortunately, this behavior has its roots in the underlying Tabulator RDF engine; I don't know how we'd convince it to handle this with namespacing.

I would like help on this, if we have anyone still on the team who has experience with the RDF engine.
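The entity part of the problem can be handled independently of the RDF engine. Below is a minimal sketch, assuming we pre-process the note HTML before serialization; `replaceNonXmlEntities` and its small lookup table are hypothetical, and a real version would need the full HTML entity list.

```javascript
// The five entities XML predefines; these are left alone.
const XML_PREDEFINED = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Deliberately tiny sample table (an assumption, not the full HTML entity list).
const HTML_ENTITY_CODE_POINTS = new Map([
  ["nbsp", 160],
  ["copy", 169],
  ["eacute", 233],
  ["mdash", 8212]
]);

// Replace known HTML-only named entities with numeric character references,
// which are valid in any XML document. Unknown names are left untouched.
function replaceNonXmlEntities(html) {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (whole, name) => {
    if (XML_PREDEFINED.has(name)) {
      return whole;
    }
    const codePoint = HTML_ENTITY_CODE_POINTS.get(name);
    return codePoint === undefined ? whole : `&#${codePoint};`;
  });
}
```

This does not address the namespacing of the tags themselves, which still depends on how the RDF engine serializes literal values.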

Winnipeg Free Press - completely broken after site update

MODS - needs to be rewritten with DOM Parser for Connector compatibility

AdvoCAT - catalog has been replaced by Voyager7 catalog that works great. Remove translator during next clean-up

Library Catalog (Voyager) - search results time out, so no tests for multiples, but they do work.

TV by the Numbers - site changed; requires complete rewrite

Huffington Post translator test seems to intermittently request the same page in an infinite loop

I had to disable a translator test in the Huffington Post translator because it was intermittently trying to fetch http://search.huffingtonpost.com/false an infinite number of times in both Gecko and Chrome. I don't actually know the framework well enough to understand why this is. @adam3smith, since it's your translator, can you take a look at this?

Washington Monthly - multiple item tests fail because single items require "defer"

Frontiers - works fine; test sometimes returns data mismatch with cryptic Param 0 error

American Institute of Aeronautics and Astronautics - broken after site-update; I don't see how this can be fixed

Works pretty well with DOIs, though, which are displayed for all items.

IEEE Xplore - translator works; Multiples test works on and off;

Emerald Publishing - translator works; multiples test fails only on server

Institute of Pure and Applied Physics - multiples work but no permalink --> no test

The Microfinance Gateway - broken after site update, requires complete rewrite

washingtonpost.com - certain articles are restricted access, so search test may fail at some point

3news.co.nz works but tests fail

The Globe and Mail - site updated, needs complete rewrite

Library Catalog (BiblioCommons) - translator broken after MARC display structure changed; no test for multiples

Preserve italics in BibTeX import/export

BibTeX \textit{ } can be mapped to our italics markup, possibly with round-trip support on export.
Per http://forums.zotero.org/discussion/19316
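A rough sketch of what the round trip could look like, assuming italics are stored as <i> tags in field values. Both function names are hypothetical, and the import pattern only handles \textit{ } groups without nested braces.

```javascript
// Import direction: \textit{...} becomes <i>...</i>.
// Groups containing nested braces are left unchanged (a known limitation of this sketch).
function bibtexItalicsToZotero(text) {
  return text.replace(/\\textit\{([^{}]*)\}/g, "<i>$1</i>");
}

// Export direction: <i>...</i> becomes \textit{...}.
function zoteroItalicsToBibtex(text) {
  return text.replace(/<i>(.*?)<\/i>/g, "\\textit{$1}");
}
```

For titles without nested braces, exporting an imported value returns the original BibTeX string, which is the round-trip property mentioned above.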

Pleade translator detects false positives

See this example. It looks like the problem is the Twitter tweet button's iframe URL, which contains both "ead.html" and "id=". Since this is all detectWeb checks, the Pleade translator comes up, even though the site has nothing to do with Pleade. Ideally, there would be something more unique that we could use to detect Pleade, but even if there isn't, we should be able to make the regexp stricter.
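As a sketch of a stricter check, the test below requires "ead.html" to be the final path segment and "id" to be a real query parameter, rather than matching both strings anywhere in the URL. `looksLikePleadeUrl` is a hypothetical helper, the exact URL shape of Pleade installations is an assumption, and it relies on the standard `URL` class being available in the environment where detection runs.

```javascript
// Hypothetical stricter detection: "ead.html" must end the path and
// "id" must be an actual query parameter of the page URL itself.
function looksLikePleadeUrl(pageUrl) {
  let parsed;
  try {
    parsed = new URL(pageUrl);
  } catch (e) {
    return false; // not an absolute URL we can parse
  }
  const pathMatches = /\/ead\.html$/.test(parsed.pathname);
  const hasId = parsed.searchParams.has("id");
  return pathMatches && hasId;
}
```

A tweet-button iframe URL that merely embeds an "ead.html?id=" address inside one of its own parameters fails this check, because its path and parameter names are different.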

The Hindu - site moved to thehindu.com and redesigned, needs complete rewrite

BioInfoBank - Translator works; seems to use some internal session ID so no tests.

The Open Library - broken after site update. Requires complete rewrite.

Google Scholar Error

The translator fails on
http://scholar.google.com/scholar?hl=en&q=smith&btnG=Search&as_sdt=0%2C22&as_ylo=&as_vis=0
which is one of the tests.
The reason is the link to the author biography between the third and fourth items, which has the same XPath as the article titles/links.

HighWire (1.0) multiple test failing

The multiple test fails on the server, but works here (I have full text access). Not sure if we should care about fixing this one; there aren't many HighWire 1.0 sites left.

MODS - incorrectly puts last name of personal authors in single field mode

See here:
http://groups.google.com/group/zotero-dev/tree/browse_frm/thread/4c00e8ebacfcfcf1/7a5041c27d3389c5?rnum=1&_done=%2Fgroup%2Fzotero-dev%2Fbrowse_frm%2Fthread%2F4c00e8ebacfcfcf1%3F#doc_a1e870dfc15a8c8d

"MODS allows for either breaking up parts of the name (given and family, for example) in different elements or enclosing the entire name in one element."
http://www.loc.gov/standards/mods/userguide/name.html#namepart

Currently the MODS translator only deals correctly with the first version.
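A minimal sketch of handling both variants, assuming each MODS name has already been reduced to a list of `{ type, text }` objects for its namePart elements. `creatorFromModsName` is hypothetical, and the splitting rules for the single-element case (comma first, otherwise last space) are an assumption that will not suit every name.

```javascript
// Hypothetical helper: build a two-field creator from MODS namePart data.
// nameParts: array of { type, text }, where type may be "given", "family",
// another value such as "date", or undefined for an untyped namePart.
function creatorFromModsName(nameParts, creatorType) {
  const textsOfType = type =>
    nameParts.filter(part => part.type === type).map(part => part.text.trim());

  const given = textsOfType("given");
  const family = textsOfType("family");
  if (given.length || family.length) {
    // First variant: name already broken into given and family parts.
    return { firstName: given.join(" "), lastName: family.join(" "), creatorType };
  }

  // Second variant: the entire name sits in one untyped namePart.
  const whole = textsOfType(undefined).join(" ");
  const comma = whole.indexOf(",");
  if (comma !== -1) {
    // "Family, Given" order
    return {
      firstName: whole.slice(comma + 1).trim(),
      lastName: whole.slice(0, comma).trim(),
      creatorType
    };
  }
  const lastSpace = whole.lastIndexOf(" ");
  if (lastSpace === -1) {
    return { firstName: "", lastName: whole, creatorType };
  }
  // "Given Family" order: treat the last word as the family name.
  return {
    firstName: whole.slice(0, lastSpace),
    lastName: whole.slice(lastSpace + 1),
    creatorType
  };
}
```

This would apply to personal names only; corporate names are a different case and would presumably stay in single-field mode.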

Cambridge Journals Online - Data mismatch only in server-run translator test (and for mysterious reasons). Test and translator work fine locally

World Shakespeare Bibliography Online - tests work but site has restricted access

US National Archives Research Catalog - uses sessions, cannot create tests; translator sort of works when browsing

Archeion - broken after site relaunch at www.archeion.ca (good export data)

Embedded metadata missing authors

The page at http://respiratory-research.com/content/11/1/133 has copious, nice RDF. In the old translator, we got the whole author list intact, as you can see in the test case for BioMed Central, which was made using Scaffold when BMC was able to get the complete author list by calling Embedded RDF. Something in the revised translator is limiting us to just the first two authors.

Hoping @simonster can take a look; I can try to work this out in several days (and the motivating issue, http://forums.zotero.org/discussion/17365, needs to be resolved even more promptly).