infrae / pyoai

The oaipmh module is a Python implementation of an "Open Archives Initiative Protocol for Metadata Harvesting"

Home Page: http://pypi.python.org/pypi/pyoai

License: Other

Python 99.89% Shell 0.11%

pyoai's Introduction

OAIPMH

The oaipmh module is a Python implementation of an "Open Archives Initiative Protocol for Metadata Harvesting" (version 2) client and server. The protocol is described here:

http://www.openarchives.org/OAI/openarchivesprotocol.html

Below is a simple implementation of an OAIPMH client:

>>> from oaipmh.client import Client
>>> from oaipmh.metadata import MetadataRegistry, oai_dc_reader

>>> URL = 'http://uni.edu/ir/oaipmh'

>>> registry = MetadataRegistry()
>>> registry.registerReader('oai_dc', oai_dc_reader)
>>> client = Client(URL, registry)

>>> for record in client.listRecords(metadataPrefix='oai_dc'):
...     print(record)

The pyoai package also contains a generic server implementation of the OAIPMH protocol, which is used as the foundation of the MOAI Server Platform.

pyoai's People


axiomsofchoice breyten ccare mhluongo bertrandbordage kiorky veriojon jordanreiter miku noisy gugek cogfor davidgillies jakke ulikoehler huanglz tpmccallum mwojnars neon-ninja metaodi tkurze andreyromanyukov dissemin rygbee arthurzenika jascoul adimascio hemenxyz alex-ip uudigitalhumanitieslab sdm7g jesusmtzs mitar rigelk asulibraries unt-libraries samuelstevens gsastry gustavofonseca bltravis ggoetzelmann marc-portier eudat-b2find neogeo-technologies datacite acz-unibi frubini rtorres1507 paulsamways ucals agerardin

pyoai's Issues

The 2.5.1 version is not available in Pypi?

Hi, I am trying to install the latest version of pyoai (2.5.1) but when I issue the pip install pyoai==2.5.1 command it throws the following exception:

ERROR: Could not find a version that satisfies the requirement pyoai==2.5.1 (from versions: 2.1.4, 2.2, 2.2.1, 2.3, 2.3.1, 2.4, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.5.0)
ERROR: No matching distribution found for pyoai==2.5.1

Running the command pip index versions pyoai doesn't show the 2.5.1 version either:

WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
pyoai (2.5.0)
Available versions: 2.5.0, 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4, 2.3.1, 2.3, 2.2.1, 2.2, 2.1.4

The latest version is also not available on https://infrae.com/download/oaipmh

What could I be missing, please?

debian packaging

Would anyone be interested in Debian packaging?

remove interfaces.py (RFC)

interfaces.py declares interfaces (i.e. classes) which are meant to be used in the server part of pyoai.

There are some issues:

All methods are missing the self argument.
Interfaces in themselves are not Pythonic.
There are mixins in common.py which seem to serve the same purpose.

While the interfaces are broken, they still have value as documentation and as an example for a server implementation.

My proposal is to move the comments from the interfaces to the common mixins, maybe adding default implementations where it's sensible, and then remove interfaces.py.

Port to using urllib3 or requests

Would it be beneficial to port pyoai to use urllib3 or requests? (After #15 is merged.)

It would remove a lot of complex urllib/urllib2 HTTP code and lead to fewer future problems caused by oddball HTTP server configurations.

Unable to harvest Biomedcentral's OAI PMH feed

Hi,
I am able to harvest most repositories with this code. Works great, thank you!

I am however unable to harvest

http://www.biomedcentral.com/oai/2.0/

This OAI PMH feed works in the browser but throws the error below when I run the Python code from the README file. My guess is that biomedcentral returns more per resumption token than the average site (the list of DC records in the browser seems really long) and the code is timing out. Any assistance would be greatly appreciated.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))    
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 269, in makeRequestErrorHandling
    xml = self.makeRequest(**kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 316, in makeRequest
    return retrieveFromUrlWaiting(request)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 362, in retrieveFromUrlWaiting
    raise Error, "Waited too often (more than %s times)" % wait_max
oaipmh.client.Error: Waited too often (more than 5 times)

How to create readers for other OAI Metadata Schemas?

I am trying to harvest datasets and metadata from OAI Servers.

I was able to retrieve oai_dc metadata using the available oai_dc_reader.

How can I create a new reader so that I can harvest GetRecord responses in other metadata schemas?

Here, I require metadata from DataCite.org, which uses oai_datacite.
So far I have created a MetadataReader using normal XML parser (XPath) syntax, but it fails to parse or retrieve any data.

oai_datacite_reader = MetadataReader(
fields={
    'title':       ('textList', '//resource/titles/title/text()'),
    'creator':     ('textList', '//resource/creator/creator/text()'),
    'subject':     ('textList', '//resource/subjects/subject/text()'),
    'description': ('textList', '//resource/descriptions/description/text()'),
    'publisher':   ('textList', '//resource/publisher/text()'),
    'contributor': ('textList', '//resource/contributors/contributor/text()'),
    'date':        ('textList', '//resource/dates/date/text()'),
    #'type':        ('textList', '//resource/type/text()'),
    'format':      ('textList', '//resource/format/text()'),
    'identifier':  ('textList', '//resource/identifier/text()'),
    #'source':      ('textList', '//resource/source/text()'),
    'language':    ('textList', '//resource/language/text()'),
    'relation':    ('textList', '//resource/relatedIdentifiers/relatedIdentifier/text()'),
    #'coverage':    ('textList', '//resource/coverage/text()'),
    'rights':      ('textList', '//resource/rights/text()'),
    'version':      ('textList', '//resource/version/text()'),
    'publicationYear': ('textList', '//resource/publicationYear/text()')
    },
    namespaces={'oai_datacite:' 'http://datacite.org/schema/kernel-4'}
)

All the fields are returned empty.
Result:

{"title": [], "creator": [], "subject": [], "description": [], "publisher": [], "contributor": [], "date": [], "format": [], "identifier": [], "language": [], "relation": [], "rights": [], "version": [], "publicationYear": []}

Please give guidance on how to define a new metadata reader.

Thanks in advance
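
One possible cause, offered as a guess rather than a confirmed diagnosis: DataCite elements live in the http://datacite.org/schema/kernel-4 namespace, so XPath steps written without a namespace prefix match nothing. In addition, the namespaces argument in the snippet above is a set holding one concatenated string, not a dict mapping a prefix to a URI. The sketch below uses only the standard library and a made-up sample record to show the difference; the prefix d and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed DataCite-style record (not real harvested data).
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <titles><title>Example dataset</title></titles>
  <publisher>Example Publisher</publisher>
</resource>"""

# Prefix-to-URI mapping; the prefix name "d" is arbitrary.
NS = {"d": "http://datacite.org/schema/kernel-4"}

root = ET.fromstring(SAMPLE)

# Without a prefix the path looks for elements in no namespace and finds none.
unprefixed = [e.text for e in root.findall("titles/title")]

# With the prefix bound to the DataCite namespace the title is found.
prefixed = [e.text for e in root.findall("d:titles/d:title", NS)]

print(unprefixed, prefixed)
```

Assuming MetadataReader evaluates its XPath expressions against the namespaces mapping it is given, the corresponding field entry would look like ('textList', '//d:resource/d:titles/d:title/text()') together with namespaces={'d': 'http://datacite.org/schema/kernel-4'}.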

Declared Python objects should override repr

When debugging, pyoai objects are displayed as <oaipmh.common.Header object at 0x7f024ea693d0>, which is not very helpful. Each class should override __repr__ to return a more useful representation.
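
A minimal sketch of what such an override could look like. The class below is a hypothetical stand-in written for illustration, not the actual oaipmh.common.Header code, and the attribute names are assumptions.

```python
class Header:
    """Hypothetical stand-in for a header object (not the library's code)."""

    def __init__(self, identifier, datestamp, deleted):
        self._identifier = identifier
        self._datestamp = datestamp
        self._deleted = deleted

    def __repr__(self):
        # Show the class name and the fields most useful when debugging.
        return "<{} identifier={!r} datestamp={!r} deleted={!r}>".format(
            type(self).__name__, self._identifier, self._datestamp, self._deleted
        )


header = Header("oai:example.org:1", "2020-01-01", False)
text = repr(header)
print(text)
```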

Reading metadata results

I'm running this code:

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://revista-iberoamericana.pitt.edu/ojs/index.php/Iberoamericana/oai'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

for record in client.listRecords(metadataPrefix='oai_dc'):
    print(record)
I was expecting some kind of XML or tuples of values, but the results look like this:

(<oaipmh.common.Header object at 0x00000251FAA16A20>, <oaipmh.common.Metadata object at 0x00000251FAA160B8>, None)
(<oaipmh.common.Header object at 0x00000251FA9DB5C0>, <oaipmh.common.Metadata object at 0x00000251FA9C6518>, None)
(<oaipmh.common.Header object at 0x00000251FA9DB0F0>, <oaipmh.common.Metadata object at 0x00000251FA9DB208>, None)

Could you tell me if I'm forgetting something?
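
Each result is a (header, metadata, about) tuple of objects, so the values have to be read through their accessor methods. The sketch below shows one way to flatten a record into a dict using header.identifier() and metadata.getMap(). The helper name summarize and the two Fake classes are hypothetical; the stand-ins are there only so the example runs without a network connection.

```python
def summarize(record):
    """Turn one (header, metadata, about) tuple into a plain dict."""
    header, metadata, _about = record
    return {
        "identifier": header.identifier(),
        # metadata may be None (for instance for deleted records), so guard first.
        "fields": metadata.getMap() if metadata is not None else {},
    }


# Hypothetical stand-ins that mimic only the two accessors used above.
class FakeHeader:
    def identifier(self):
        return "oai:example.org:1"


class FakeMetadata:
    def getMap(self):
        return {"title": ["Example title"]}


summary = summarize((FakeHeader(), FakeMetadata(), None))
deleted = summarize((FakeHeader(), None, None))
print(summary)
```

In the loop from the question this would be used as print(summarize(record)) in place of print(record).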

cgi.parse_qs deprecated

Hello,
thank you for your library, it is very useful!

I am reporting an issue that I discovered recently after an Ubuntu upgrade from bionic to focal, which moves from Python 3.6 to Python 3.8.
I installed pyoai, but there is a problem with oaipmh/server.py, which uses cgi.parse_qs:

AttributeError: module 'cgi' has no attribute 'parse_qs'

It seems that this deprecated function was removed in Python 3.8 - see eventlet/eventlet#580

Thanks for your development,
Gilles Landais
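
A possible fix, sketched under the assumption that server.py only needs the query string parsed into a dict of lists: import parse_qs from urllib.parse, with a fallback for Python 2. How this would be wired into server.py is not shown here.

```python
try:
    from urllib.parse import parse_qs  # Python 3
except ImportError:
    from urlparse import parse_qs  # Python 2

# Parse an OAI-PMH style query string into a dict of value lists.
params = parse_qs("verb=ListRecords&metadataPrefix=oai_dc")
print(params)
```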

504 Gateway Time-out

Hello,

I've been trying to call listRecords on more than 100 records, but I get a 504 Gateway Time-out. When I try this with fewer than 100 records, the server responds after a while. Is there any setting that tells the server to wait longer before returning this error?

(The same happens with listIdentifiers.)

The problem occurs only when I hit the URL endpoint (with curl or through the browser), but not when calling the function from the Python console:
oai.oaipmh_server.listRecords(metadataPrefix='myprefix')

Additionally, when hitting the endpoint, the process consumes nearly all of the VM's memory...

Thank you.

Python3 support not in PyPI

Commit beced90 fixed a Python3 syntax issue in client.py but that commit is not included in the current release 2.4.5 on PyPI.

Here's the error message:

  File "[...]/pyoai-2.4.5-py3.5.egg/oaipmh/client.py", line 40
    raise Error, "Non-standard granularity on server: %s" % granularity
SyntaxError: invalid syntax

A preliminary workaround is to use

pip install git+https://github.com/infrae/pyoai.git

(depending on the configuration, you have to use pip3 instead of pip)

Can you release a new version 2.4.6 in PyPI so this issue is resolved for Python3?
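
For reference, the failing line uses the Python 2-only raise statement form. The sketch below shows the form that works on both Python 2 and 3; the Error class and the check_granularity function are stand-ins for illustration, not the code from client.py.

```python
class Error(Exception):
    """Stand-in for the client's exception class."""


def check_granularity(granularity):
    # Python 2 only:   raise Error, "message" % value
    # Python 2 and 3:  raise Error("message" % value)
    if granularity not in ("YYYY-MM-DD", "YYYY-MM-DDThh:mm:ssZ"):
        raise Error("Non-standard granularity on server: %s" % granularity)
    return granularity


try:
    check_granularity("bogus")
    message = None
except Error as exc:
    message = str(exc)
print(message)
```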

2.4.5 is not tagged in git

https://pypi.python.org/pypi/pyoai/ says latest version of pyoai is 2.4.5 but the tag does not exist on the git repo.

There's no bdist_wheel version on pypi

Hi,

There is no bdist_wheel version on PyPI and no automatic version number management in the package.

Both are fixed in PR #27, please have a look.

Migrate from Travis to GitHub actions

The Travis integration has been broken for a while, and repo admins are not available to fix it.

We should set up CI using GitHub actions as it only requires configuring things by adding files to the repo, which we can do.

Error in makeRequestErrorHandling from a listRecords call with from_ parameter

This very simple code is requesting records from a figshare set:

#!/usr/bin/env python3

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

import datetime

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client('https://api.figshare.com/v2/oai', registry)

month_ago = datetime.datetime.now() - datetime.timedelta(days=30)
for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=month_ago):
  print(record[0].datestamp(), end=' ')
  print(record[1]['title'][0])

After finding several records, the code throws an exception with the following error:

Traceback (most recent call last):
  File "./toto.py", line 13, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc', set='portal_259', from_=month_ago):
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 365, in ResumptionListGenerator
    result, token = nextBatch(token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 194, in nextBatch
    resumptionToken=token)
  File "/Users/fx/anaconda3/lib/python3.6/site-packages/oaipmh/client.py", line 308, in makeRequestErrorHandling
    raise getattr(error, code[0].upper() + code[1:] + 'Error')(msg)
oaipmh.error.NoRecordsMatchError: The result in an empty list.

If I remove the from_ parameter from the listRecords call, it all works fine.

Keyword argument ``set`` in several methods clobbers Python ``set`` builtin

For example, interfaces.IOAI.listIdentifiers. Consider using setSpec?
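
A small illustration of the shadowing problem and of the suggested rename. Both function names are hypothetical and neither is pyoai's code; renaming a public keyword argument would also break existing callers unless the old name stays accepted.

```python
def list_identifiers_shadowing(metadataPrefix, set=None):
    # Inside this function the name `set` is the argument, not the builtin,
    # so it can no longer be called to build a set.
    return callable(set)


def list_identifiers_renamed(metadataPrefix, setSpec=None):
    # With a different parameter name the builtin stays usable.
    return sorted(set([metadataPrefix, setSpec]))


shadowed = list_identifiers_shadowing("oai_dc", set="portal_259")
renamed = list_identifiers_renamed("oai_dc", setSpec="portal_259")
print(shadowed, renamed)
```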

Loop over all records ?

How can I loop over all records?
My script stops after 50 records.
How can I get the number of records as well?

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://ws.pangaea.de/oai/provider?set=project4173'

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

record = client.listRecords(metadataPrefix='oai_dc')

for record in client.listRecords(metadataPrefix='oai_dc'):
    print(record)

There are 1501 records in the project.

$ oai-harvest --limit 10000 -p dif --set project4173 http://ws.pangaea.de/oai/provider

This command correctly harvests all metadata from the 1501 records.
I would like to do this from oaipmh so that I can save them, after some reformatting, into a JSON file.

Any help is welcome.

from_ and until arguments throw error

I am trying to restrict the number of papers to download from PubMed using the from_ and until arguments, but that doesn't work, no matter what I try. Without the from_/until arguments the code works fine. The error is

I tried arxiv.org, but no success either. Here is a minimal example:

import datetime
from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

# specify from date
from_date ='2017-01-01'
from_date = datetime.datetime.strptime(from_date, "%Y-%m-%d")

# pubmed url
url = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"

registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(url, registry)
serverinfo = client.identify()
print(serverinfo.repositoryName())

ids = []
for record in client.listIdentifiers(metadataPrefix='oai_dc', from_=from_date):
     
    ids.append(record.identifier())
    
print('total # of ids:', len(ids))

Note also #7: I believe the from_ and until arguments need to be specified as datetimes, not strings.

The error indicates that the datetime is wrongly formatted. It appears that PubMed DOES NOT support the time component. For example, this request works fine without a time specified:

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=ListIdentifiers&from=2002-01-01&metadataPrefix=oai_dc

With time specified it fails with the same error as stated above:

https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=ListIdentifiers&from=2002-01-01T00:00:00Z&metadataPrefix=pmc_fm
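
The two URLs differ only in the granularity of the from value. OAI-PMH defines two granularities, YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ, and a repository advertises the one it supports in its Identify response. The helper below is a hypothetical sketch of formatting a datetime accordingly; it is not pyoai's own code.

```python
import datetime


def format_oai_date(value, granularity="YYYY-MM-DDThh:mm:ssZ"):
    """Format a datetime using the granularity a repository reports."""
    if granularity == "YYYY-MM-DD":
        # Day granularity: drop the time component entirely.
        return value.strftime("%Y-%m-%d")
    # Seconds granularity: UTC timestamp with a trailing Z.
    return value.strftime("%Y-%m-%dT%H:%M:%SZ")


from_date = datetime.datetime(2002, 1, 1)
day = format_oai_date(from_date, "YYYY-MM-DD")
seconds = format_oai_date(from_date)
print(day, seconds)
```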

resumption token is not found when using listIdentifiers

When using listIdentifiers and the result is batched, the resumption token is not found. The evaluator strings for listRecords and listIdentifiers differ: listRecords uses pattern matching (*), whereas listIdentifiers tries to match an exact string.

pyoai health / maintenance status?

This project seems to have fallen into a state of semi-disrepair, no recent commits in master, PRs unmerged, and so on. That said, there are few Python OAI-PMH resources, so it looks like this one does still see some use, and could be important for the community.

Looking at forks of the project, I note that none is expanding dramatically on pyoai, more just little fixes and customisations. So by looking at the repo, there is also the chance that the project is simply stable and does what it needs to. Or, erm, it's essentially abandoned without clarification from maintainers.

Do the authors want to make a comment regarding current health of the project? If the project is no longer properly maintained, is there a successor project, advice, etc?

From what I can tell, one viable alternative is the sickle project, OAI-PMH for humans, also Python: https://github.com/mloesch/sickle. Have any users switched from one to the other? Any experiences they'd like to share? This project too has some concerns regarding maintenance status.

Would really appreciate a bit of clarification here, as it doesn't seem totally abandoned, but for example, there's a PR that's ready to go, but been waiting a year: #39 ...

Limit request rate

As with any other bot, there should be a configurable delay of a few seconds between two requests.
We see our repository hammered with too many requests per second, and pyoai is then considered an unfriendly bot.
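
One way such a delay could work, as a rough sketch rather than a proposed patch: a small limiter that enforces a minimum interval between calls. The RateLimiter class is hypothetical. It would need to be called before each HTTP request, for example from a Client subclass that calls wait() and then delegates to makeRequest, the method visible in the tracebacks above.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


limiter = RateLimiter(min_interval=0.01)
limiter.wait()  # first call returns immediately
limiter.wait()  # second call pauses until 0.01 s have passed
```

The clock and sleep parameters are injectable so that the behaviour can be checked without real waiting.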

RDF and ORE Support

Is there any plan to support other metadata prefixes apart from oai_dc? I've been working on adding support for RDF and ORE for a project. Maybe these could be included, if useful. If someone is interested, I could open a PR.

Create or modify the metadata fields

Hi there, I have already installed this package:

https://github.com/infrae/moai

in order to have a small Open Access server platform. This installation comes with a standard set of fields, and I want to change them to the other types of fields shown below:

Is there an easy way to do this?
Thanks

Regards

XML parse errors parsing metadata payload

I'm using oaipmh via oai-harvest.
When the program hits an exception from a parse error in the metadata payload, it quits and stops harvesting. (Typically this is not a problem with simple oai_dc metadata, but oai_ead has a lot more to get right, and I'm seeing this frequently on several ArchivesSpace OAI feeds.)
I have found that patching the Client.parse method to create an XMLParser with recover=True lets it finish harvesting the entire feed:
etree.XML(xml, etree.XMLParser(recover=True))

You might consider adding that as an option on creating Client, or as a subclass.

Fix CI

Our CI seems to be failing to build pull requests:
https://github.com/infrae/pyoai/runs/5054311221?check_suite_focus=true
This should be fixed.

raise BadStatusLine(line) httplib.BadStatusLine: ''

I do not use Python frequently but it appears the problem I am experiencing might be as a result of a well known issue. I am trying to harvest some content and run into the following error. Are you able to suggest a workaround for this?

Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 269, in makeRequestErrorHandling
    xml = self.makeRequest(**kw)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 316, in makeRequest
    return retrieveFromUrlWaiting(request)
  File "/usr/local/lib/python2.7/dist-packages/pyoai-2.4.4-py2.7.egg/oaipmh/client.py", line 343, in retrieveFromUrlWaiting
    f = urllib2.urlopen(request)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

Description in setup.py should be a single line

Newlines in setup.py description are disallowed and some versions of setuptools (e.g. v59.0.1) fail to install pyoai, giving a ValueError: Newlines are not allowed.
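
A possible way to derive a single-line description, sketched with a hypothetical helper; the sample text is the README's opening sentence broken over several lines for illustration. The multi-line text itself would typically go in long_description instead.

```python
def single_line(text):
    """Collapse all runs of whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


description = single_line(
    """The oaipmh module is a Python implementation of an
    "Open Archives Initiative Protocol for Metadata Harvesting"
    client and server."""
)
print(description)
```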

See pypa/setuptools#2870 (reverted in pypa/setuptools@68795af but is expected to break again later).

Only supports POST

I've come across an oaipmh repository that is incorrectly configured such that it only supports GET requests, not POST. Perhaps it would be wise to fall back to GET in case POST fails?
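
A rough sketch of such a fallback using only the standard library. The fetch function and its opener parameter are hypothetical and not part of pyoai, and the sketch assumes base_url carries no query string of its own.

```python
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch(base_url, params, opener=urlopen):
    """Try POST first; if that fails, retry the same parameters as a GET."""
    query = urlencode(params)
    try:
        # Supplying data makes urllib issue a POST request.
        response = opener(Request(base_url, data=query.encode("ascii")))
    except URLError:
        # HTTPError is a subclass of URLError, so HTTP error statuses
        # also end up here. Fall back to GET with a query string.
        response = opener(Request(base_url + "?" + query))
    return response.read()
```

Called as fetch('http://uni.edu/ir/oaipmh', {'verb': 'Identify'}), this would issue at most two requests. A real implementation would probably want to fall back only on specific status codes rather than on every error.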