namd / pypln.api

Python library to access PyPLN's API.

License: GNU General Public License v3.0

Makefile 0.78% Python 99.22%

pypln.api's People

turicas flavioamieiro pombredanne

pypln.api's Issues

Reuse the request when uploading multiple documents

The add_documents function should reuse the same request (connection) to speed up uploads.
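A minimal sketch of the idea, assuming the caller passes in a single `requests.Session` (which keeps the connection alive between requests). The function name, the `corpus` form field, and the URLs are hypothetical; only the `blob` file field is taken from the tutorial code quoted in the traceback further down.

```python
def upload_with_shared_session(session, documents_url, corpus_url, files):
    """Upload (filename, content) pairs through one shared session.

    `session` is expected to behave like requests.Session; reusing it
    avoids opening a new connection for every document.
    """
    responses = []
    for filename, content in files:
        response = session.post(
            documents_url,
            data={"corpus": corpus_url},  # hypothetical field name
            files={"blob": (filename, content)},
        )
        responses.append(response)
    return responses


# Possible usage (not executed here):
#   import requests
#   session = requests.Session()
#   session.auth = ("username", "myprecious")
#   upload_with_shared_session(session, "http://demo.pypln.org/documents/",
#                              corpus_url, [("a.txt", b"some text")])
```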

Release 0.2.0

Release new version with the new PyPLN API.

Check problem when uploading many documents

There is a problem when uploading many documents: if we try to upload, for example, 100 documents, some of them do not show up in the Web interface (sometimes 3 are lost, but the number varies). No exception is raised during the process.

There should be a test that verifies the documents were really uploaded.
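One way such a check could look, as a sketch: compare the filenames that were sent with the filenames the server lists afterwards. The helper name is hypothetical, and it assumes filenames are unique within the upload.

```python
def find_missing_documents(uploaded_names, listed_names):
    """Return the uploaded filenames that the server does not list.

    The order of `uploaded_names` is preserved so failures are easy to
    map back to the original upload sequence.
    """
    listed = set(listed_names)
    return [name for name in uploaded_names if name not in listed]
```

A test could upload N documents, list the corpus, and assert that this function returns an empty list.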

Should be able to list documents

List all documents owned by a user
List all documents in a corpus

add confirmation on corpus deletion via the web interface.

We should show an "Are you sure you want to delete corpus "x"?" modal window before proceeding with the delete request.

Should be able to delete corpora

Create a method Corpus.delete, with option on only deleting the corpus or also all documents inside it. Something like this:

def delete(self, delete_documents=False):
    if delete_documents:
        for document_url in self.documents:
            self.session.delete(document_url)
    return self.session.delete(self.url).ok

Get corpus object by corpus name

Currently if I have a corpus named "test" and I want to access its object, I need some code like this:

from pypln.api import PyPLN

pypln = PyPLN('http://demo.pypln.org', ('username', 'myprecious'))
corpora = pypln.corpora()
test_corpus = [corpus for corpus in corpora if corpus.name == 'test']

We should provide better methods of retrieving corpora (better if the API provides special methods for it).
Another possible helper method is something like "get_corpus_or_create", for example:

from pypln.api import PyPLN

pypln = PyPLN('http://demo.pypln.org', ('username', 'myprecious'))
test_corpus = pypln.get_corpus_or_create('test')

We could also see other helper methods in Django's ORM to inspire us and improve library usability.
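A rough sketch of how the proposed helper might behave, written as a standalone function. It relies on `pypln.corpora()` and `corpus.name` as used above; the `add_corpus(name=..., description=...)` call is an assumption about how a corpus gets created and may not match the library's actual signature.

```python
def get_corpus_or_create(pypln, name, description=""):
    """Return the first corpus called `name`, creating it if absent."""
    for corpus in pypln.corpora():
        if corpus.name == name:
            return corpus
    # Assumed creation method; adjust to whatever the library exposes.
    return pypln.add_corpus(name=name, description=description)
```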

Tutorial fails if PDF filename contains non-ascii characters

This was added as NAMD/pypln.web#93 but it really belongs to this repository. Below is the original text.

Sending Cancré et al._2000.pdf...

UnicodeDecodeError Traceback (most recent call last)
in ()
4 files={'blob':fp}
5 resp = requests.post('http://demo.pypln.org/documents/', data=data,
----> 6 files=files, auth=credentials)
7 print(resp.status_code)
8

/usr/local/lib/python2.7/dist-packages/requests/api.pyc in post(url, data, **kwargs)
86 """
87
---> 88 return request('post', url, data=data, **kwargs)
89
90

/usr/local/lib/python2.7/dist-packages/requests/api.pyc in request(method, url, **kwargs)
42
43 session = sessions.Session()
---> 44 return session.request(method=method, url=url, **kwargs)
45
46

/usr/local/lib/python2.7/dist-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert)
322
323 # Prepare the Request.
--> 324 prep = req.prepare()
325
326 # Send the request.

/usr/local/lib/python2.7/dist-packages/requests/models.pyc in prepare(self)
223 p.prepare_headers(self.headers)
224 p.prepare_cookies(self.cookies)
--> 225 p.prepare_body(self.data, self.files)
226 p.prepare_auth(self.auth, self.url)
227 # Note that prepare_auth must be last to enable authentication schemes

/usr/local/lib/python2.7/dist-packages/requests/models.pyc in prepare_body(self, data, files)
383 # Multi-part file uploads.
384 if files:
--> 385 (body, content_type) = self._encode_files(files, data)
386 else:
387 if data:

/usr/local/lib/python2.7/dist-packages/requests/models.pyc in _encode_files(files, data)
131 new_fields.append((k, new_v))
132
--> 133 body, content_type = encode_multipart_formdata(new_fields)
134
135 return body, content_type

/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/filepost.pyc in encode_multipart_formdata(fields, boundary)
73 content_type = get_content_type(filename)
74 writer(body).write('Content-Disposition: form-data; name="%s"; '
---> 75 'filename="%s"\r\n' % (fieldname, filename))
76 body.write(b('Content-Type: %s\r\n\r\n' %
77 (content_type,)))

/usr/lib/python2.7/codecs.pyc in write(self, object)
349 """ Writes the object's contents encoded to self.stream.
350 """
--> 351 data, consumed = self.encode(object, self.errors)
352 self.stream.write(data)
353

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 60: ordinal not in range(128)
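Until the encoding problem itself is fixed, one possible client-side workaround is to send an ASCII-only filename. The sketch below (Python 3, hypothetical helper name) strips accents by Unicode decomposition; it does not address the underlying bug shown in the traceback.

```python
import unicodedata


def ascii_safe_filename(filename):
    """Drop accents and any other non-ASCII characters from a filename."""
    # NFKD splits "é" into "e" plus a combining accent, which the
    # ASCII encoding step then discards.
    decomposed = unicodedata.normalize("NFKD", filename)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With requests, the explicit name can be passed as a tuple, for example `files={'blob': (ascii_safe_filename(name), fp)}`.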

registering another account with the same email

When you try to register another user with an email already associated with another account, you get no error or warning, but the account does not get created. At least, I could not log in to it after receiving the confirmation email.

Add missing docstrings

In the module, its classes, and their methods.

fetching multiple documents from a single request

The overhead of making a request for each document one wants to download from a large corpus is too large to be acceptable. A batch download should be available which requires a single request. The set of documents could be returned as a tar-ball.

This issue supersedes #29
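If the batch endpoint returned a tarball, the client side could unpack it in memory along these lines. This is only a sketch: the endpoint does not exist yet, and the function name and the mapping of member names to document contents are assumptions.

```python
import io
import tarfile


def documents_from_tarball(payload):
    """Map each regular file in a tar archive (given as bytes) to its content."""
    documents = {}
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as archive:
        for member in archive.getmembers():
            if member.isfile():
                documents[member.name] = archive.extractfile(member).read()
    return documents
```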

Create documentation (tutorial) and reference (methods and docstrings)

We need Sphinx documentation for the API ASAP!

I am marking this as a bug because an API without docs is pretty much useless.

Should be able to get document visualizations data

In the Web interface we have the option to download visualizations as CSV, TXT, etc. This API should be able to get this data so users can work programmatically with the data processed by the pipeline.

change default behavior of add_documents

Currently corpus.add_documents returns two lists: [documents which were successfully added] and [documents which failed].

The problem with this is that we lose the positional information about which documents were added and which weren't.

For example, if I send a list of documents, I want to know which ones in my original sequence succeeded. So I suggest that the first list be composed of Document objects or None, position by position, according to whether each document was added successfully.
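The suggested behavior could be sketched as follows. The helper name is hypothetical, and it assumes the single-document upload callable raises an exception on failure, which may not be how the library signals errors.

```python
def add_documents_keeping_positions(add_document, documents):
    """Return one entry per input document: the result, or None on failure."""
    results = []
    for document in documents:
        try:
            results.append(add_document(document))
        except Exception:
            # Broad on purpose for the sketch: any failure becomes None
            # at the same index as the document that failed.
            results.append(None)
    return results
```

The index of each `None` then identifies exactly which input document failed.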

Get Document object by document filename or slug

Create README with a little tutorial

Option to use auth token instead of username/password

Since pypln.web now has support for authentication using token (instead of username/password), this feature should be added also to pypln.api.
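As an illustration only: if pypln.web uses the `Authorization: Token <key>` header scheme (the convention of Django REST framework's token authentication, assumed here rather than confirmed), the client could build its headers like this. The helper name is hypothetical.

```python
def token_auth_headers(token):
    """Build request headers for token-based authentication."""
    # Assumed header format; check what pypln.web actually expects.
    return {"Authorization": "Token {}".format(token)}
```

These headers could be set once on a shared session, for example with `session.headers.update(token_auth_headers(token))`, instead of passing a username and password.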

Create CLI tool

It'd be nice to interact with the API using the command line (without needing to use curl or wget directly).
I've started a little script that can evolve to something we can ship with the library:

#!/usr/bin/env python
# coding: utf-8

import argparse
import sys

from pypln.api import PyPLN, Document


API_BASE = 'http://demo.pypln.org/'

def print_document(document):
    print('Filename: {}\n  {} bytes\n  URL: {}\n  Properties: {}'
            .format(document.blob, document.size, document.url,
                    document.properties_url))

def main():
    args = argparse.ArgumentParser()
    args.add_argument('username')
    args.add_argument('password')
    args.add_argument('--list-corpora', action='store_true')
    args.add_argument('--list-documents', action='store_true')
    args.add_argument('--list-documents-from-corpus', type=unicode)
    argv = args.parse_args()

    username = argv.username
    password = argv.password
    credentials = (username, password)

    pypln = PyPLN(API_BASE, credentials)
    if argv.list_corpora:
        for corpus in pypln.corpora():
            print('Corpus name: {}, {} documents'
                    .format(corpus.name, len(corpus.documents)))
    elif argv.list_documents:
        for document in pypln.documents():
            print_document(document)
    elif argv.list_documents_from_corpus:
        corpus_name = argv.list_documents_from_corpus
        found_corpora = [corpus for corpus in pypln.corpora()
                         if corpus.name == corpus_name]
        if not len(found_corpora):
            sys.stderr.write('ERROR: corpus "{}" not found.\n'
                    .format(corpus_name))
            sys.exit(2)
        else:
            corpus = found_corpora[0]
            print('Retrieving documents from corpus "{}" ({} found)...\n'
                    .format(corpus_name, len(corpus.documents)))
            for document_url in corpus.documents:
                document = Document.from_url(document_url, credentials)
                print_document(document)
    else:
        sys.stderr.write('ERROR: you should choose one option.\n')
        sys.exit(1)


if __name__ == '__main__':
    main()

Release version 0.1.0

Corpus objects missing 'url' when returned by PyPLN.corpora() method

Self-explanatory.

Get Corpus object by corpus slug or name

Should raise exception (RuntimeError?) if not logged in

Currently, if you do not call pypln_object.login(...) before calling its methods (like add_document), no explicit exception is raised saying that the session is not logged in.
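One way to get that behavior is a small decorator that guards each method. This is a sketch with a stand-in `Client` class; the `logged_in` attribute and all names here are hypothetical, not the library's.

```python
import functools


def requires_login(method):
    """Raise RuntimeError when the method is called before login()."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not getattr(self, "logged_in", False):
            raise RuntimeError(
                "Not logged in: call login() before {}()".format(method.__name__)
            )
        return method(self, *args, **kwargs)
    return wrapper


class Client:
    """Stand-in for the real API object, used only for illustration."""

    def __init__(self):
        self.logged_in = False

    def login(self, username, password):
        self.logged_in = True

    @requires_login
    def add_document(self, text):
        return text
```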

Create setup.py

The setup.py has to put the package in the same namespace as pypln.{backend,web}, and pypln/__init__.py has to include:

import pkg_resources
pkg_resources.declare_namespace(__name__)

Should be able to delete documents

Create a method Document.delete -- a very simple one, something like this:

def delete(self):
    return self.session.delete(self.url).ok

Should be able to search

The PyPLN class should have methods for searching (full-text search) documents (in global namespace and in a specific corpus).

Corpus.add_document(s) should return `Document` object(s)

Currently, Corpus.add_document (and so Corpus.add_documents, since it uses the former) returns a dictionary with the document data. It should return the Document object representing this document instead.

500 error when redirecting to login page after resetting the password

After resetting the password, if you click the link to login, you get a 500 error.

Fix MANIFEST

I forgot to add README.markdown and CHANGELOG.markdown to MANIFEST.in and remove old files.

Fetching all the properties at once

Currently, when we want to access the properties of a PyPLN document, we need to fetch each one in a separate request. This is impractical.

There should be a different URL to fetch the properties data.

For example, today when we point the browser to a document properties URL such as http://fgv.pypln.org/documents/57894/properties/, we get back a JSON object with an array of property URLs for that document:

{
    "properties": [
        "http://fgv.pypln.org/documents/57894/properties/average_sentence_length/",
        "http://fgv.pypln.org/documents/57894/properties/average_sentence_repertoire/",
        "http://fgv.pypln.org/documents/57894/properties/contents/",
        "http://fgv.pypln.org/documents/57894/properties/file_id/",
        "http://fgv.pypln.org/documents/57894/properties/file_metadata/",
        "http://fgv.pypln.org/documents/57894/properties/filename/",
        "http://fgv.pypln.org/documents/57894/properties/forced_decoding/",
        "http://fgv.pypln.org/documents/57894/properties/freqdist/",
        "http://fgv.pypln.org/documents/57894/properties/language/",
        "http://fgv.pypln.org/documents/57894/properties/lemmas/",
        "http://fgv.pypln.org/documents/57894/properties/length/",
        "http://fgv.pypln.org/documents/57894/properties/md5/",
        "http://fgv.pypln.org/documents/57894/properties/mimetype/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_1/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_2/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_3/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_4/",
        "http://fgv.pypln.org/documents/57894/properties/noun_phrases/",
        "http://fgv.pypln.org/documents/57894/properties/palavras_raw/",
        "http://fgv.pypln.org/documents/57894/properties/palavras_raw_ran/",
        "http://fgv.pypln.org/documents/57894/properties/pos/",
        "http://fgv.pypln.org/documents/57894/properties/repertoire/",
        "http://fgv.pypln.org/documents/57894/properties/semantic_tags/",
        "http://fgv.pypln.org/documents/57894/properties/sentences/",
        "http://fgv.pypln.org/documents/57894/properties/tagset/",
        "http://fgv.pypln.org/documents/57894/properties/text/",
        "http://fgv.pypln.org/documents/57894/properties/tokens/",
        "http://fgv.pypln.org/documents/57894/properties/upload_date/",
        "http://fgv.pypln.org/documents/57894/properties/wordcloud/"
    ]
}

I propose we add a new API endpoint which could have the form of either:

http://fgv.pypln.org/documents/57894/properties_data/

http://fgv.pypln.org/documents/57894/properties/gzip

which would return a gzipped JSON with all the data.
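On the client side, handling such a response could be as simple as the sketch below. It assumes the body is gzip-compressed, UTF-8 encoded JSON; the function name is hypothetical and the endpoint is only a proposal.

```python
import gzip
import json


def decode_properties(payload):
    """Decode a gzipped JSON payload (bytes) into Python data."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

If the server instead used HTTP-level compression (`Content-Encoding: gzip`), requests would decompress the body transparently and only the JSON step would be needed.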

Should `Corpus.documents` return `Document` objects?

As the idea of the package is to use PyPLN in a pythonic way, maybe the best thing to do with the documents attribute on a Corpus object is to return a list of Document objects. I don't know whether this would hurt performance, since:

A Corpus could have a huge number of documents and a little string with document URL is way lighter than a Document object; and
The method Document.from_url requires one request being made to create the object from the URL.

A possible approach would be to create a LazyDocument class that stores only its URL and, the first time the user tries to access an attribute, makes the request as Document.from_url does (this class should have the same methods and attributes as Document).
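A minimal sketch of the lazy approach, assuming a `loader` callable that plays the role of Document.from_url (for instance `lambda url: Document.from_url(url, credentials)`). The class body is illustrative, not the library's implementation.

```python
class LazyDocument:
    """Hold only a URL and load the real document on first attribute access."""

    def __init__(self, url, loader):
        self.url = url
        self._loader = loader
        self._document = None

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so `url` and the
        # private attributes set in __init__ never trigger a request.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._document is None:
            self._document = self._loader(self.url)
        return getattr(self._document, name)
```

Reading `url` costs nothing; the first access to any other attribute makes one request, and later accesses reuse the loaded document.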

Retrieve all documents

Should be able to retrieve all documents when HTTP server returns paginated results (currently we have 100 results per page).
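Following pagination links could look roughly like this. The sketch assumes each page is a JSON object with a `results` list and a `next` URL that is null on the last page (the shape Django REST framework commonly produces); the real response format may differ. `fetch_page` is a hypothetical callable that takes a URL and returns the parsed page.

```python
def iter_all_results(fetch_page, first_url):
    """Yield every item across all pages, following "next" links."""
    url = first_url
    while url:
        page = fetch_page(url)
        for item in page["results"]:
            yield item
        # A missing or null "next" ends the loop.
        url = page.get("next")
```

With requests, `fetch_page` could be `lambda url: session.get(url).json()`.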