pypln.api's People

Contributors

fccoelho, flavioamieiro, israelst, turicas

pypln.api's Issues

Check problem when uploading many documents

Uploading many documents at once is unreliable: if we upload, for example, 100 documents, some of them never appear in the web interface (sometimes 3 are lost, but the number varies). No exception is raised during the process.

There should be a test that verifies the documents were really uploaded.
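
A minimal sketch of the comparison such a test could make, assuming we can collect the filenames we sent and the filenames the server lists afterwards. The helper name find_missing_uploads is hypothetical and not part of pypln.api; how the listing is obtained is left out on purpose.

```python
from collections import Counter


def find_missing_uploads(sent_filenames, listed_filenames):
    """Return filenames that were sent but are absent from the server listing.

    Counter subtraction keeps duplicates: sending the same name twice and
    seeing it listed once reports one missing copy.
    """
    missing = Counter(sent_filenames) - Counter(listed_filenames)
    return sorted(missing.elements())


# Simulate the server silently dropping three out of 100 documents.
sent = ['doc-{}.txt'.format(i) for i in range(100)]
dropped = {'doc-3.txt', 'doc-41.txt', 'doc-77.txt'}
listed = [name for name in sent if name not in dropped]
print(find_missing_uploads(sent, listed))
```

A test could then simply assert that the returned list is empty after an upload.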

Should be able to delete corpora

Create a method Corpus.delete with an option to delete only the corpus or also all the documents inside it. Something like this:

def delete(self, delete_documents=False):
    if delete_documents:
        for document_url in self.documents:
            self.session.delete(document_url)
    return self.session.delete(self.url).ok

Get corpus object by corpus name

Currently if I have a corpus named "test" and I want to access its object, I need some code like this:

from pypln.api import PyPLN

pypln = PyPLN('http://demo.pypln.org', ('username', 'myprecious'))
corpora = pypln.corpora()
test_corpus = [corpus for corpus in corpora if corpus.name == 'test']

We should provide better ways of retrieving corpora (ideally the API itself would provide dedicated methods for this).
Another possible helper method is something like "get_corpus_or_create", for example:

from pypln.api import PyPLN

pypln = PyPLN('http://demo.pypln.org', ('username', 'myprecious'))
test_corpus = pypln.get_corpus_or_create('test')

We could also look at the helper methods in Django's ORM for inspiration to improve the library's usability.
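
A rough sketch of the lookup logic behind such helpers, written as plain functions so it does not depend on the real client. The function names, the create callback and the FakeCorpus stand-in are all assumptions for illustration; the only thing taken from the library is that corpus objects have a name attribute.

```python
from collections import namedtuple


def get_corpus(corpora, name):
    """Return the first corpus whose name matches, or None if there is none."""
    return next((corpus for corpus in corpora if corpus.name == name), None)


def get_corpus_or_create(corpora, name, create):
    """Return the corpus called `name`, calling create(name) if it is missing."""
    corpus = get_corpus(corpora, name)
    return corpus if corpus is not None else create(name)


# Stand-in for the real Corpus class, for demonstration only.
FakeCorpus = namedtuple('FakeCorpus', ['name'])

existing = [FakeCorpus('test'), FakeCorpus('papers')]
print(get_corpus_or_create(existing, 'test', FakeCorpus))
print(get_corpus_or_create(existing, 'new-corpus', FakeCorpus))
```

With the real client, the corpora argument would be the result of pypln.corpora() and create would be whatever call the library offers for adding a corpus.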

Tutorial fails if PDF filename contains non-ascii characters

This was originally filed as NAMD/pypln.web#93, but it really belongs in this repository. Below is the original text.

Sending Cancré et al._2000.pdf...

UnicodeDecodeError Traceback (most recent call last)
in ()
4 files={'blob':fp}
5 resp = requests.post('http://demo.pypln.org/documents/', data=data,
----> 6 files=files, auth=credentials)
7 print(resp.status_code)
8

/usr/local/lib/python2.7/dist-packages/requests/api.pyc in post(url, data, **kwargs)
86 """
87
---> 88 return request('post', url, data=data, **kwargs)
89
90

/usr/local/lib/python2.7/dist-packages/requests/api.pyc in request(method, url, **kwargs)
42
43 session = sessions.Session()
---> 44 return session.request(method=method, url=url, **kwargs)
45
46

/usr/local/lib/python2.7/dist-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert)
322
323 # Prepare the Request.
--> 324 prep = req.prepare()
325
326 # Send the request.

/usr/local/lib/python2.7/dist-packages/requests/models.pyc in prepare(self)
223 p.prepare_headers(self.headers)
224 p.prepare_cookies(self.cookies)
--> 225 p.prepare_body(self.data, self.files)
226 p.prepare_auth(self.auth, self.url)
227 # Note that prepare_auth must be last to enable authentication schemes

/usr/local/lib/python2.7/dist-packages/requests/models.pyc in prepare_body(self, data, files)
383 # Multi-part file uploads.
384 if files:
--> 385 (body, content_type) = self._encode_files(files, data)
386 else:
387 if data:

/usr/local/lib/python2.7/dist-packages/requests/models.pyc in _encode_files(files, data)
131 new_fields.append((k, new_v))
132
--> 133 body, content_type = encode_multipart_formdata(new_fields)
134
135 return body, content_type

/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/filepost.pyc in encode_multipart_formdata(fields, boundary)
73 content_type = get_content_type(filename)
74 writer(body).write('Content-Disposition: form-data; name="%s"; '
---> 75 'filename="%s"\r\n' % (fieldname, filename))
76 body.write(b('Content-Type: %s\r\n\r\n' %
77 (content_type,)))

/usr/lib/python2.7/codecs.pyc in write(self, object)
349 """ Writes the object's contents encoded to self.stream.
350 """
--> 351 data, consumed = self.encode(object, self.errors)
352 self.stream.write(data)
353

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 60: ordinal not in range(128)
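
Until this is fixed, one possible workaround (not a fix for the underlying bug, and not tested against the tutorial) is to send an ASCII-only filename. The sketch below strips accents with the standard unicodedata module; the helper name ascii_filename is hypothetical. Note that characters with no ASCII decomposition are dropped entirely.

```python
import unicodedata


def ascii_filename(filename):
    """Strip accents so the name can be encoded as plain ASCII."""
    # NFKD splits 'é' into 'e' plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize('NFKD', filename)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(ascii_filename(u'Cancr\u00e9 et al._2000.pdf'))  # Cancre et al._2000.pdf
```

Since requests accepts a (filename, fileobj) tuple in the files argument, the safe name could presumably be passed as files={'blob': (ascii_filename(name), fp)}.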

registering another account with the same email

When you try to register another user with an email already associated with an existing account, you get no error or warning, but the account does not get created. At least, I could not log in to it after receiving the confirmation email.

fetching multiple documents from a single request

Making one request per document is too much overhead when downloading from a large corpus. A batch download that needs only a single request should be available. The set of documents could be returned as a tarball.

This issue supersedes #29
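
To illustrate the client side of the tarball idea, here is a small sketch that unpacks an in-memory tar archive into a dict of filename to bytes. No such endpoint exists yet, so pack_documents only simulates what a server response might contain; both function names are hypothetical.

```python
import io
import tarfile


def unpack_documents(tar_bytes):
    """Map each regular file in a (possibly compressed) tar archive to its bytes."""
    documents = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode='r:*') as archive:
        for member in archive.getmembers():
            if member.isfile():
                documents[member.name] = archive.extractfile(member).read()
    return documents


def pack_documents(documents):
    """Build a gzipped tarball in memory; stands in for the server response."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
        for name, content in documents.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


payload = pack_documents({'a.txt': b'first document', 'b.txt': b'second document'})
print(sorted(unpack_documents(payload)))
```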

change default behavior of add_documents

Currently corpus.add_documents returns two lists: [documents which were successfully added] and [documents which failed].

The problem with this is that we lose the positional information about which documents were added and which weren't.

For example, if I am sending a list of documents, I want to know which ones in my original sequence succeeded. So I suggest that the first list be composed of Document objects or None, according to whether each addition succeeded.
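
A minimal sketch of the suggested behaviour, assuming a per-document callable (here called add_one, a hypothetical name) that returns the created object or raises on failure. What the real library raises on a failed upload is not specified here, so the sketch catches any Exception.

```python
def add_documents_positional(items, add_one):
    """Return one entry per input item: the created object, or None on failure."""
    results = []
    for item in items:
        try:
            results.append(add_one(item))
        except Exception:
            # Keep the slot so positions line up with the input sequence.
            results.append(None)
    return results


def fake_add(text):
    """Stand-in uploader for demonstration: rejects empty documents."""
    if not text:
        raise ValueError('empty document')
    return 'Document({!r})'.format(text)


print(add_documents_positional(['first', '', 'third'], fake_add))
```

The caller can then zip the result with the original sequence to see exactly which inputs failed.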

Create CLI tool

It'd be nice to interact with the API from the command line (without needing to use curl or wget directly).
I've started a little script that can evolve into something we can ship with the library:

#!/usr/bin/env python
# coding: utf-8

import argparse
import sys

from pypln.api import PyPLN, Document


API_BASE = 'http://demo.pypln.org/'

def print_document(document):
    print('Filename: {}\n  {} bytes\n  URL: {}\n  Properties: {}'
            .format(document.blob, document.size, document.url,
                    document.properties_url))

def main():
    args = argparse.ArgumentParser()
    args.add_argument('username')
    args.add_argument('password')
    args.add_argument('--list-corpora', action='store_true')
    args.add_argument('--list-documents', action='store_true')
    args.add_argument('--list-documents-from-corpus', type=unicode)
    argv = args.parse_args()

    username = argv.username
    password = argv.password
    credentials = (username, password)

    pypln = PyPLN(API_BASE, credentials)
    if argv.list_corpora:
        for corpus in pypln.corpora():
            print('Corpus name: {}, {} documents'
                    .format(corpus.name, len(corpus.documents)))
    elif argv.list_documents:
        for document in pypln.documents():
            print_document(document)
    elif argv.list_documents_from_corpus:
        corpus_name = argv.list_documents_from_corpus
        found_corpora = [corpus for corpus in pypln.corpora()
                         if corpus.name == corpus_name]
        if not found_corpora:
            sys.stderr.write('ERROR: corpus "{}" not found.\n'
                    .format(corpus_name))
            sys.exit(2)
        else:
            corpus = found_corpora[0]
            print('Retrieving documents from corpus "{}" ({} found)...\n'
                    .format(corpus_name, len(corpus.documents)))
            for document_url in corpus.documents:
                document = Document.from_url(document_url, credentials)
                print_document(document)
    else:
        sys.stderr.write('ERROR: you should choose one option.\n')
        sys.exit(1)


if __name__ == '__main__':
    main()

Create setup.py

The setup.py has to use the same namespace as pypln.{backend,web}, and pypln/__init__.py has to include:

import pkg_resources
pkg_resources.declare_namespace(__name__)

Should be able to search

The PyPLN class should have methods for full-text search of documents (both in the global namespace and within a specific corpus).

Fix MANIFEST

I forgot to add README.markdown and CHANGELOG.markdown to MANIFEST.in and remove old files.

Fetching all the properties at once

Currently, when we want to access the properties of a PyPLN document, we need to fetch each one in a separate request. This is impractical.

There should be a different URL to fetch all the properties data.

For example, today when we point the browser to a document's properties URL, such as http://fgv.pypln.org/documents/57894/properties/, we get back a JSON object with an array of property URLs for that document:

{
    "properties": [
        "http://fgv.pypln.org/documents/57894/properties/average_sentence_length/",
        "http://fgv.pypln.org/documents/57894/properties/average_sentence_repertoire/",
        "http://fgv.pypln.org/documents/57894/properties/contents/",
        "http://fgv.pypln.org/documents/57894/properties/file_id/",
        "http://fgv.pypln.org/documents/57894/properties/file_metadata/",
        "http://fgv.pypln.org/documents/57894/properties/filename/",
        "http://fgv.pypln.org/documents/57894/properties/forced_decoding/",
        "http://fgv.pypln.org/documents/57894/properties/freqdist/",
        "http://fgv.pypln.org/documents/57894/properties/language/",
        "http://fgv.pypln.org/documents/57894/properties/lemmas/",
        "http://fgv.pypln.org/documents/57894/properties/length/",
        "http://fgv.pypln.org/documents/57894/properties/md5/",
        "http://fgv.pypln.org/documents/57894/properties/mimetype/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_1/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_2/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_3/",
        "http://fgv.pypln.org/documents/57894/properties/momentum_4/",
        "http://fgv.pypln.org/documents/57894/properties/noun_phrases/",
        "http://fgv.pypln.org/documents/57894/properties/palavras_raw/",
        "http://fgv.pypln.org/documents/57894/properties/palavras_raw_ran/",
        "http://fgv.pypln.org/documents/57894/properties/pos/",
        "http://fgv.pypln.org/documents/57894/properties/repertoire/",
        "http://fgv.pypln.org/documents/57894/properties/semantic_tags/",
        "http://fgv.pypln.org/documents/57894/properties/sentences/",
        "http://fgv.pypln.org/documents/57894/properties/tagset/",
        "http://fgv.pypln.org/documents/57894/properties/text/",
        "http://fgv.pypln.org/documents/57894/properties/tokens/",
        "http://fgv.pypln.org/documents/57894/properties/upload_date/",
        "http://fgv.pypln.org/documents/57894/properties/wordcloud/"
    ]
}

I propose we add a new API endpoint which could have the form of either:

http://fgv.pypln.org/documents/57894/properties_data/

or

http://fgv.pypln.org/documents/57894/properties/gzip

which would return a gzipped JSON with all the data.
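
On the client side, decoding such a response would be cheap. The sketch below assumes the proposed endpoint returns a gzip-compressed, UTF-8 encoded JSON object mapping property names to values; that payload shape and the function names are assumptions, since the endpoint does not exist yet.

```python
import gzip
import json


def decode_properties(payload):
    """Decompress a gzipped JSON payload and parse it into a dict."""
    return json.loads(gzip.decompress(payload).decode('utf-8'))


def encode_properties(properties):
    """Inverse of decode_properties; simulates what the server would send."""
    return gzip.compress(json.dumps(properties).encode('utf-8'))


payload = encode_properties({'language': 'pt', 'length': 1234})
print(decode_properties(payload))
```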

Should `Corpus.documents` return `Document` objects?

As the idea of the package is to use PyPLN in a Pythonic way, maybe the best thing to do with the documents attribute on Corpus objects is to return a list of Document objects. I don't know whether this would hurt performance, since:

  • A Corpus could have a huge number of documents, and a short string with the document URL is much lighter than a Document object; and
  • The method Document.from_url requires one request to create the object from the URL.

A possible approach would be to create a LazyDocument class that stores only its URL and, the first time the user tries to access an attribute, makes the request as Document.from_url does (this class should have the same methods and attributes as Document).
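
A rough sketch of how such a LazyDocument could work, assuming the loader is any callable that turns a URL into a full document object (in the real library this would presumably wrap Document.from_url plus credentials). The FakeDocument stand-in and the loader argument are illustrative only.

```python
from collections import namedtuple


class LazyDocument(object):
    """Holds only a URL; loads the real document on first attribute access."""

    def __init__(self, url, loader):
        self.url = url
        self._loader = loader
        self._document = None

    def __getattr__(self, name):
        # __getattr__ only runs for attributes not found on the instance,
        # so url, _loader and _document never trigger a request.
        if name.startswith('_'):
            raise AttributeError(name)
        if self._document is None:
            self._document = self._loader(self.url)
        return getattr(self._document, name)


# Stand-in for the real Document class, for demonstration only.
FakeDocument = namedtuple('FakeDocument', ['blob', 'size'])
requests_made = []


def fake_loader(url):
    requests_made.append(url)
    return FakeDocument(blob='paper.pdf', size=2048)


lazy = LazyDocument('http://demo.pypln.org/documents/1/', fake_loader)
print(lazy.url, len(requests_made))   # no request made yet
print(lazy.size, lazy.blob, len(requests_made))  # exactly one request
```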

Retrieve all documents

Should be able to retrieve all documents when HTTP server returns paginated results (currently we have 100 results per page).
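
A minimal sketch of following pagination, assuming each page is a JSON object with a 'results' list and a 'next' URL that is null on the last page. That layout is an assumption about the server's response, and fetch_page is a hypothetical callable that would wrap the actual HTTP request.

```python
def iter_all_results(fetch_page, first_url):
    """Yield every item across all pages, following 'next' links until exhausted."""
    url = first_url
    while url:
        page = fetch_page(url)
        for item in page['results']:
            yield item
        url = page.get('next')


# Fake two-page server response for demonstration.
pages = {
    '/documents/': {'results': ['doc1', 'doc2'], 'next': '/documents/?page=2'},
    '/documents/?page=2': {'results': ['doc3'], 'next': None},
}

print(list(iter_all_results(pages.__getitem__, '/documents/')))
```

Using a generator keeps memory use flat even when a corpus spans many pages.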
