infrae / pyoai

The oaipmh module is a Python implementation of an "Open Archives Initiative Protocol for Metadata Harvesting"

Home Page: http://pypi.python.org/pypi/pyoai

License: Other

Python 99.89% Shell 0.11%

pyoai's Introduction

OAIPMH

The oaipmh module is a Python implementation of an "Open Archives Initiative Protocol for Metadata Harvesting" (version 2) client and server. The protocol is described here:

http://www.openarchives.org/OAI/openarchivesprotocol.html

Below is a simple implementation of an OAIPMH client:

>>> from oaipmh.client import Client
>>> from oaipmh.metadata import MetadataRegistry, oai_dc_reader
>>> URL = 'http://uni.edu/ir/oaipmh'
>>> registry = MetadataRegistry()
>>> registry.registerReader('oai_dc', oai_dc_reader)
>>> client = Client(URL, registry)
>>> for record in client.listRecords(metadataPrefix='oai_dc'):
...     print(record)

The pyoai package also contains a generic server implementation of the OAIPMH protocol; this is used as the foundation of the MOAI Server Platform.

pyoai's Issues

How to create readers for other OAI Metadata Schemas?

I am trying to harvest datasets and metadata from OAI servers.

I successfully retrieved oai_dc metadata using the available oai_dc_reader.

How can I create a new reader so that I can harvest GetRecord responses in other metadata schemas?

Here, I need metadata from DataCite.org, which uses oai_datacite.
So far I have created a MetadataReader using ordinary XPath expressions, but it fails to parse or retrieve any data.

oai_datacite_reader = MetadataReader(
fields={
    'title':       ('textList', '//resource/titles/title/text()'),
    'creator':     ('textList', '//resource/creator/creator/text()'),
    'subject':     ('textList', '//resource/subjects/subject/text()'),
    'description': ('textList', '//resource/descriptions/description/text()'),
    'publisher':   ('textList', '//resource/publisher/text()'),
    'contributor': ('textList', '//resource/contributors/contributor/text()'),
    'date':        ('textList', '//resource/dates/date/text()'),
    #'type':        ('textList', '//resource/type/text()'),
    'format':      ('textList', '//resource/format/text()'),
    'identifier':  ('textList', '//resource/identifier/text()'),
    #'source':      ('textList', '//resource/source/text()'),
    'language':    ('textList', '//resource/language/text()'),
    'relation':    ('textList', '//resource/relatedIdentifiers/relatedIdentifier/text()'),
    #'coverage':    ('textList', '//resource/coverage/text()'),
    'rights':      ('textList', '//resource/rights/text()'),
    'version':      ('textList', '//resource/version/text()'),
    'publicationYear': ('textList', '//resource/publicationYear/text()')
    },
    namespaces={'oai_datacite:' 'http://datacite.org/schema/kernel-4'}
)

All the fields are returned empty.
Result:

{"title": [], "creator": [], "subject": [], "description": [], "publisher": [], "contributor": [], "date": [], "format": [], "identifier": [], "language": [], "relation": [], "rights": [], "version": [], "publicationYear": []}

Please give guidance on how to define a metadata reader for a new format.

Thanks in advance
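
One likely cause, offered as a guess rather than a confirmed diagnosis: DataCite elements live in an XML namespace, so unprefixed XPath steps such as //resource/titles/title match nothing. Note also that the namespaces argument in the snippet above is written as a set holding one concatenated string rather than a prefix-to-URI dict. The sketch below shows the namespace effect with the standard library only. The sample payload and the prefix name dc4 are made up for illustration, and the assumption is that the same prefixes would then go into each XPath string passed to MetadataReader together with namespaces={'dc4': 'http://datacite.org/schema/kernel-4'}.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed DataCite payload; real records carry many more fields.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <titles><title>Example dataset</title></titles>
  <publisher>Example Repository</publisher>
</resource>"""

# The prefix name is arbitrary; only the URI has to match the document.
NS = {"dc4": "http://datacite.org/schema/kernel-4"}

root = ET.fromstring(SAMPLE)

# Unprefixed path: looks for elements in no namespace, so nothing matches.
unqualified = [e.text for e in root.findall("titles/title")]

# Prefixed path: every step is bound to the DataCite namespace.
qualified = [e.text for e in root.findall("dc4:titles/dc4:title", NS)]

print(unqualified)
print(qualified)
```

If the prefixed form works against a real record, the remaining paths (creators, subjects, and so on) would need the same treatment, checked against the element names the server actually returns.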

Loop over all records?

How do I loop over all records?
My script stops after 50 records.
How can I get the total number of records as well?

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://ws.pangaea.de/oai/provider?set=project4173'

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

for record in client.listRecords(metadataPrefix='oai_dc'):
    print(record)

There are 1501 records in the project.

$ oai-harvest --limit 10000 -p dif --set project4173 http://ws.pangaea.de/oai/provider

This command correctly harvests the metadata of all 1501 records.
I would like to do the same with oaipmh so that I can reformat the records and save them to a JSON file.

Any help welcomed.
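
A minimal sketch of the counting and JSON step, assuming each item yielded by listRecords is a (header, metadata, about) tuple, that header.identifier() returns the record identifier, and that metadata.getMap() returns a plain dict of fields (worth verifying against the installed pyoai version). The helper names records_to_dicts and dump_records are made up for this example. It may also be worth passing the set as a keyword argument, set='project4173', instead of embedding it in the URL query string; whether that is what stops the harvest at 50 records is only a guess.

```python
import json


def records_to_dicts(records):
    """Turn (header, metadata, about) tuples into JSON-friendly dicts."""
    rows = []
    for header, metadata, _about in records:
        rows.append({
            "identifier": header.identifier(),
            # Deleted records carry no metadata, so guard against None.
            "metadata": metadata.getMap() if metadata is not None else None,
        })
    return rows


def dump_records(records, path):
    """Write all records to a JSON file and return how many were written."""
    rows = records_to_dicts(records)
    with open(path, "w") as handle:
        json.dump(rows, handle, indent=2)
    return len(rows)
```

With a real client this would be called as dump_records(client.listRecords(metadataPrefix='oai_dc'), 'records.json'); the return value is the number of records the generator actually produced.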

remove interfaces.py (RFC)

interfaces.py declares interfaces (i.e. classes) which are meant to be used in the server part of pyoai.

There are some issues:

  1. All methods are missing the self argument.
  2. Interfaces in themselves are not Pythonic.
  3. There are mixins in common.py which seem to serve the same purpose.

While the interfaces are broken, they still have value as documentation and as an example for a server implementation.

My proposal is to move the comments from interfaces.py to the common mixins, maybe adding default implementations where it's sensible, and then remove interfaces.py.
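
To illustrate the first point and the proposed direction, here is a small sketch. The class and method names are invented for the example and are not copied from interfaces.py or common.py. It shows why a method declared without self cannot be called on an instance, and what a documented mixin with a default implementation might look like.

```python
class BrokenInterface:
    """Interface-style declaration: documents the method but omits self."""

    def identify():
        """Return repository information."""


class IdentifyMixin:
    """Mixin-style declaration: same documentation, plus a usable default."""

    def identify(self):
        """Return repository information."""
        raise NotImplementedError("subclasses must implement identify()")


try:
    BrokenInterface().identify()
except TypeError as exc:
    # Python passes the instance as the first argument, which the
    # declaration above does not accept.
    broken_error = str(exc)

print(broken_error)
```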

Python3 support not in PyPI

Commit beced90 fixed a Python3 syntax issue in client.py but that commit is not included in the current release 2.4.5 on PyPI.

Here's the error message:

  File "[...]/pyoai-2.4.5-py3.5.egg/oaipmh/client.py", line 40
    raise Error, "Non-standard granularity on server: %s" % granularity
SyntaxError: invalid syntax

A preliminary workaround is to use

pip install git+https://github.com/infrae/pyoai.git

(depending on the configuration, you have to use pip3 instead of pip)

Can you release a new version 2.4.6 in PyPI so this issue is resolved for Python3?
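
For reference, the SyntaxError comes from the Python 2-only form of the raise statement. The sketch below shows the call form that is valid in both Python 2 and 3. The Error class and the check_granularity function are stand-ins written for this example, not the code from client.py.

```python
class Error(Exception):
    """Stand-in for the client's error class (illustrative only)."""


def check_granularity(granularity):
    """Return granularity if it is one of the two OAI-PMH forms."""
    if granularity not in ("YYYY-MM-DD", "YYYY-MM-DDThh:mm:ssZ"):
        # Python 2 only:   raise Error, "message"
        # Python 2 and 3:  raise Error("message")
        raise Error("Non-standard granularity on server: %s" % granularity)
    return granularity
```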

504 Gateway Time-out

Hello,

I've been trying to call listRecords on more than 100 records, but I get a 504 Gateway Time-out. When I try this with fewer than 100 records, the server responds after a while. Is there any setting that tells the server to wait longer before returning this error?

(same happens with listIdentifiers)

The problem occurs only when I hit the URL endpoint (with curl or through the browser), but not when calling the function from the Python console:
oai.oaipmh_server.listRecords(metadataPrefix='myprefix')

Additionally, when hitting the endpoint, the process consumes nearly all of the VM's memory.

Thank you.

resumption token is not found when using listIdentifiers

When using listIdentifiers and the result is batched, the resumption token is not found. The evaluator strings for listRecords and listIdentifiers differ: listRecords uses a wildcard pattern (*), whereas listIdentifiers tries to match an exact string.

Error in makeRequestErrorHandling from a listRecords call with from_ parameter

This very simple code is requesting records from a figshare set:

#!/usr/bin/env python3

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

import datetime

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client('https://api.figshare.com/v2/oai', registry)

month_ago = datetime.datetime.now() - datetime.timedelta(days=30)
for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=month_ago):
  print(record[0].datestamp(), end=' ')
  print(record[1]['title'][0])

After finding several records, the code throws an exception with the following error:

Traceback (most recent call last):
  File "./toto.py", line 13, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=month_ago):
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 365, in ResumptionListGenerator
    result, token = nextBatch(token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 194, in nextBatch
    resumptionToken=token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 308, in makeRequestErrorHandling
    raise getattr(error, code[0].upper() + code[1:] + 'Error')(msg)
oaipmh.error.NoRecordsMatchError: The result in an empty list.

If I remove the from_ parameter from the listRecords call, it all works fine.
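
The traceback shows the exception being raised from nextBatch, that is, while following a resumption token. One possible client-side workaround is sketched below, under the assumption that a noRecordsMatch error in the middle of a harvest simply means there is nothing left to fetch. The helper name iterate_until_empty is made up for this example.

```python
def iterate_until_empty(records, empty_error):
    """Yield items from records, stopping quietly if empty_error is raised.

    records is any iterable; empty_error is the exception class that
    should be treated as "no more results" rather than as a failure.
    """
    iterator = iter(records)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            return
        except empty_error:
            # Assumption: the server reports an empty final batch as an error.
            return
        yield item
```

With the script above this would be used as iterate_until_empty(client.listRecords(...), NoRecordsMatchError), importing NoRecordsMatchError from oaipmh.error as named in the traceback. Records already yielded are kept; only the trailing error is swallowed.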

XML parse errors parsing metadata payload

I'm using oaipmh via oai-harvest.
When the program hits an exception from a parse error in the metadata payload, it quits and stops harvesting. (Typically this is not a problem with simple oai_dc metadata, but oai_ead has a lot more to get right, and I'm seeing this frequently on several ArchivesSpace OAI feeds.)
I have found that patching the Client.parse method to create an XMLParser with recover=True lets the harvest finish the entire feed:
etree.XML(xml, etree.XMLParser(recover=True))

You might consider adding that as an option when creating a Client, or as a subclass.

Port to using urllib3 or requests

Would it be beneficial to port pyoai to use urllib3 or requests? (After #15 is merged)

It would remove a lot of complex urllib/urllib2 HTTP code and cause fewer problems in future with oddball HTTP server configurations.

Only supports POST

I've come across an OAI-PMH repository that is incorrectly configured such that it only supports GET requests, not POST. Perhaps it would be wise to fall back to GET in case POST fails?
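
A rough sketch of what such a fallback could look like, using only the standard library. It does not reflect pyoai's internals; the function name fetch_with_get_fallback and the opener parameter are invented for the example, and retrying on any HTTP error is a simplifying assumption.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_with_get_fallback(base_url, params, opener=urlopen):
    """Try POST first; on an HTTP error, retry the same arguments as GET."""
    encoded = urlencode(params)
    # Supplying data makes urllib issue a POST request.
    post_request = Request(base_url, data=encoded.encode("ascii"))
    try:
        with opener(post_request) as response:
            return response.read()
    except HTTPError:
        # Without data, urllib issues a GET; arguments go in the query string.
        get_request = Request(base_url + "?" + encoded)
        with opener(get_request) as response:
            return response.read()
```

The opener parameter exists only so the function can be exercised without network access; in normal use it would be left at its default.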

Unable to harvest Biomedcentral's OAI PMH feed

Hi,
I am able to harvest most repositories with this code. Works great, thank you!

I am however unable to harvest

http://www.biomedcentral.com/oai/2.0/

This OAI-PMH feed works in the browser but throws the error below when running the Python code from the README. My guess is that BioMedCentral returns more records per resumption token than the average site (the list of DC records in the browser is really long) and the code is timing out. Any assistance would be greatly appreciated.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))    
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 269, in makeRequestErrorHandling
    xml = self.makeRequest(**kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 316, in makeRequest
    return retrieveFromUrlWaiting(request)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 362, in retrieveFromUrlWaiting
    raise Error, "Waited too often (more than %s times)" % wait_max
oaipmh.client.Error: Waited too often (more than 5 times)

Reading metadata results

I'm running this code:

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://revista-iberoamericana.pitt.edu/ojs/index.php/Iberoamericana/oai'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

for record in client.listRecords(metadataPrefix='oai_dc'):
    print(record)

I was expecting some kind of XML file or tuples of values, but the results look like this:

(<oaipmh.common.Header object at 0x00000251FAA16A20>, <oaipmh.common.Metadata object at 0x00000251FAA160B8>, None)
(<oaipmh.common.Header object at 0x00000251FA9DB5C0>, <oaipmh.common.Metadata object at 0x00000251FA9C6518>, None)
(<oaipmh.common.Header object at 0x00000251FA9DB0F0>, <oaipmh.common.Metadata object at 0x00000251FA9DB208>, None)

Could you tell me if I'm forgetting something?
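
Each printed item is a three-part tuple of header, metadata, and about, so the values have to be pulled out of the objects. The sketch below uses only accessors that appear in other reports on this page (identifier(), datestamp(), and indexing the metadata by field name). The helper name summarize is invented, and treating a missing metadata object as a deleted record is an assumption.

```python
def summarize(record):
    """Return a one-line description of a (header, metadata, about) tuple."""
    header, metadata, _about = record
    if metadata is None:
        # Assumption: records without metadata are deleted records.
        return "%s (no metadata)" % header.identifier()
    titles = metadata["title"]  # oai_dc fields are lists of strings
    title = titles[0] if titles else "(untitled)"
    return "%s  %s  %s" % (header.identifier(), header.datestamp(), title)
```

Inside the loop above, print(summarize(record)) would then replace print(record).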

The 2.5.1 version is not available in Pypi?

Hi, I am trying to install the latest version of pyoai (2.5.1), but when I run pip install pyoai==2.5.1 it fails with the following error:

ERROR: Could not find a version that satisfies the requirement pyoai==2.5.1 (from versions: 2.1.4, 2.2, 2.2.1, 2.3, 2.3.1, 2.4, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.5.0)
ERROR: No matching distribution found for pyoai==2.5.1

Running pip index versions pyoai doesn't show version 2.5.1 either:

WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
pyoai (2.5.0)
Available versions: 2.5.0, 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4, 2.3.1, 2.3, 2.2.1, 2.2, 2.1.4

The latest version is also not available on https://infrae.com/download/oaipmh

What am I missing?

RDF and ORE Support

Is there any plan to support metadata prefixes other than oai_dc? I've been working on adding support for RDF and ORE for a project. Maybe these could be included, if useful. If someone is interested, I could open a PR.

raise BadStatusLine(line) httplib.BadStatusLine: ''

I do not use Python frequently, but it appears the problem I am experiencing might be the result of a well-known issue. I am trying to harvest some content and run into the following error. Are you able to suggest a workaround?

Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 269, in makeRequestErrorHandling
    xml = self.makeRequest(**kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 316, in makeRequest
    return retrieveFromUrlWaiting(request)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 343, in retrieveFromUrlWaiting
    f = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

Limit request rate

As with any other bot, there should be a configurable delay of a few seconds between two requests.
We see our repository hammered with too many requests per second; pyoai is then treated as an unfriendly bot.
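
A minimal sketch of such a delay, written as a standalone helper rather than as a pyoai feature. The class name RequestThrottle and its parameters are invented for the example; where it would be hooked into the client is left open.

```python
import time


class RequestThrottle:
    """Enforce a minimum interval, in seconds, between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() immediately before each HTTP request would space requests at least min_interval seconds apart; the clock and sleep parameters are there so the behaviour can be checked without real waiting.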

Migrate from Travis to GitHub Actions

The Travis integration has been broken for a while, and repo admins are not available to fix it.

We should set up CI using GitHub Actions, as it only requires adding configuration files to the repo, which we can do.

from_ and until arguments throw error

I am trying to restrict the number of papers downloaded from PubMed using the from_ and until arguments, but that doesn't work no matter what I try. Without the from_/until arguments the code works fine. The error was attached as an image, which is not reproduced here.

I tried arxiv.org, but no success either. Here is a minimal example:

import datetime
from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

# specify from date
from_date ='2017-01-01'
from_date = datetime.datetime.strptime(from_date, "%Y-%m-%d")

# pubmed url
url = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(url, registry)
serverinfo = client.identify()
print(serverinfo.repositoryName())

ids = []
for record in client.listIdentifiers(metadataPrefix='oai_dc', from_=from_date):
    ids.append(record.identifier())

print('total # of ids:', len(ids))

Note also #7: I believe the from_ and until arguments need to be specified as datetimes, not strings.

The error indicates that the datetime is incorrectly formatted. It appears that PubMed does NOT support a time component. For example, this request works just fine without a time specified:

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=ListIdentifiers&from=2002-01-01&metadataPrefix=oai_dc

With time specified it fails with the same error as stated above:

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=ListIdentifiers&from=2002-01-01T00:00:00Z&metadataPrefix=pmc_fm
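
The two URLs differ only in the date granularity. OAI-PMH defines two forms, YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ, and a repository announces the finest one it supports in its Identify response. The sketch below shows the formatting difference; the function name format_oai_date is made up for the example and is not part of pyoai.

```python
from datetime import datetime


def format_oai_date(value, granularity="YYYY-MM-DDThh:mm:ssZ"):
    """Format a datetime for a from/until argument at the given granularity."""
    if granularity == "YYYY-MM-DD":
        return value.strftime("%Y-%m-%d")
    if granularity == "YYYY-MM-DDThh:mm:ssZ":
        return value.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError("Non-standard granularity: %s" % granularity)


print(format_oai_date(datetime(2002, 1, 1), "YYYY-MM-DD"))
print(format_oai_date(datetime(2002, 1, 1)))
```

The pyoai client appears to offer an updateGranularity() method that reads the server's Identify response and adjusts how dates are sent. That is stated from memory, so it is worth checking in the installed version before relying on it.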

pyoai health / maintenance status?

This project seems to have fallen into a state of semi-disrepair, no recent commits in master, PRs unmerged, and so on. That said, there are few Python OAI-PMH resources, so it looks like this one does still see some use, and could be important for the community.

Looking at forks of the project, I note that none is expanding dramatically on pyoai, more just little fixes and customisations. So by looking at the repo, there is also the chance that the project is simply stable and does what it needs to. Or, erm, it's essentially abandoned without clarification from maintainers.

Do the authors want to make a comment regarding current health of the project? If the project is no longer properly maintained, is there a successor project, advice, etc?

From what I can tell, one viable alternative is the sickle project, OAI-PMH for humans, also Python: https://github.com/mloesch/sickle. Have any users switched from one to the other? Any experiences they'd like to share? This project too has some concerns regarding maintenance status.

Would really appreciate a bit of clarification here, as it doesn't seem totally abandoned, but for example, there's a PR that's ready to go, but been waiting a year: #39 ...

Declared Python objects should override __repr__

When debugging, pyoai objects are displayed as <oaipmh.common.Header object at 0x7f024ea693d0>, which is not very helpful. Each class should override __repr__ to return a more useful representation.
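
As an illustration of the request, here is a sketch of what such a __repr__ could look like. The class below is a simplified stand-in written for this example; the attribute names are assumptions and the real oaipmh.common.Header may store its data differently.

```python
class Header:
    """Illustrative stand-in; the real oaipmh.common.Header has more fields."""

    def __init__(self, identifier, datestamp, deleted=False):
        self._identifier = identifier
        self._datestamp = datestamp
        self._deleted = deleted

    def __repr__(self):
        # Show the fields a developer most often needs while debugging.
        return "<%s identifier=%r datestamp=%r deleted=%r>" % (
            type(self).__name__,
            self._identifier,
            self._datestamp,
            self._deleted,
        )


print(repr(Header("oai:example.org:1", "2020-01-01")))
```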

cgi.parse_qs deprecated

Hello,
thank you for your library, it is very useful!

I am reporting an issue that I discovered recently after an Ubuntu upgrade from bionic to focal, which moves Python 3.6 to Python 3.8.
I installed pyoai, but there is a problem with oaipmh/server.py, which uses cgi.parse_qs:

AttributeError: module 'cgi' has no attribute 'parse_qs'

It seems that the method is deprecated - see eventlet/eventlet#580

Thanks for your work,
Gilles Landais
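
For context, cgi.parse_qs was deprecated for a long time and removed in Python 3.8; the same function is available as urllib.parse.parse_qs. A possible replacement that also keeps Python 2 working is sketched below. How it would be wired into oaipmh/server.py is not shown, and the sample query string is made up.

```python
try:
    from urllib.parse import parse_qs  # Python 3
except ImportError:
    from urlparse import parse_qs  # Python 2

# Each key maps to a list, because a query string may repeat a parameter.
args = parse_qs("verb=ListRecords&metadataPrefix=oai_dc")
print(args)
```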
