google / budou

Budou is an automatic organizer tool for beautiful line breaking in CJK (Chinese, Japanese, and Korean).

License: Apache License 2.0

Python 99.22% Makefile 0.78%
natural-language-processing python web-development cjk

budou's Issues

Use pickle for caching

The current cache uses the shelve module, which makes unit testing tricky because the suffix of the cache file name may differ by environment. Changing the cache format to pickle would improve the portability of cache files and simplify unit testing.
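
A minimal sketch of what a pickle-backed cache could look like, assuming a single cache file holding one dictionary. The class name PickleCache and its get/set methods are hypothetical and are not taken from budou's code:

```python
import os
import pickle


class PickleCache:
    """Hypothetical key-value cache stored in one pickle file."""

    def __init__(self, filename):
        self.filename = filename

    def _load(self):
        # A missing file is treated as an empty cache.
        if not os.path.exists(self.filename):
            return {}
        with open(self.filename, 'rb') as f:
            return pickle.load(f)

    def get(self, key):
        """Return the cached value, or None when the key is absent."""
        return self._load().get(key)

    def set(self, key, value):
        """Store the value and rewrite the whole cache file."""
        data = self._load()
        data[key] = value
        with open(self.filename, 'wb') as f:
            pickle.dump(data, f)
```

Because the file name is used exactly as given, a test can point the cache at a temporary path and check for that exact file, which is the portability benefit described above.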

Non-breaking space character (\u00A0) causes AssertionError

Here is the problem string: Chatbot\u00a0\u2013

Traceback (most recent call last):
  File "<console>", line 5, in <module>
  File "/usr/local/lib/python3.6/site-packages/budou/parser.py", line 78, in parse
    chunks = self.segmenter.segment(source, language)
  File "/usr/local/lib/python3.6/site-packages/budou/tinysegmentersegmenter.py", line 94, in segment
    assert source[seek] == ' '
AssertionError

The assertion that fails expects an ASCII space: assert source[seek] == ' '
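
One possible way to avoid the failure, sketched under the assumption that the segmenter walks through the source string and lines tokens up with it. The helper align_tokens is hypothetical and is not budou's actual code; it accepts any Unicode whitespace between tokens rather than asserting an ASCII space:

```python
def align_tokens(source, tokens):
    """Return (token, has_space_before) pairs by locating tokens in source.

    Hypothetical helper: any Unicode whitespace, including U+00A0,
    counts as a separator between tokens.
    """
    seek = 0
    aligned = []
    for token in tokens:
        has_space = False
        # str.isspace() is True for '\u00a0' as well as for ' '.
        while seek < len(source) and source[seek].isspace():
            has_space = True
            seek += 1
        if not source.startswith(token, seek):
            raise ValueError('token %r not found at offset %d' % (token, seek))
        aligned.append((token, has_space))
        seek += len(token)
    return aligned
```

With the problem string from this report, align_tokens('Chatbot\u00a0\u2013', ['Chatbot', '\u2013']) treats the non-breaking space as the separator before the dash instead of raising.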

Handle proper nouns

Proper nouns (固有名詞) are sometimes split across several chunks, but ideally each should be wrapped in a single chunk. Possible solutions would be:

  • Use "entity" property in Natural Language API's response to force every entity to be wrapped in one chunk.
  • Allow users to put a list of proper nouns (maybe .csv file) to wrap as one chunk.
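
The second option could be sketched as a post-processing step over the chunk list. This is only an illustration: merge_proper_nouns is a hypothetical helper, chunks are assumed to be plain strings, and a proper noun is merged only when it lines up exactly with chunk boundaries:

```python
def merge_proper_nouns(chunks, proper_nouns):
    """Merge consecutive chunks whose concatenation is a listed proper noun."""
    nouns = set(proper_nouns)
    result = []
    i = 0
    while i < len(chunks):
        merged = None
        # Try the longest run of chunks first so that longer names win.
        for j in range(len(chunks), i + 1, -1):
            candidate = ''.join(chunks[i:j])
            if candidate in nouns:
                merged = (candidate, j)
                break
        if merged:
            result.append(merged[0])
            i = merged[1]
        else:
            result.append(chunks[i])
            i += 1
    return result
```

The list of proper nouns could come from a user-supplied file, as suggested above; reading that file is left out of the sketch.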

Resolve html5lib's DeprecationWarning

The current implementation keeps returning the warning below.

DeprecationWarning: This method will be removed in future versions.  Use 'list(elem)' or iteration over elem instead.
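
This message matches the one that xml.etree.ElementTree emits for Element.getchildren(). The issue does not say whether the call lives in budou or inside html5lib, so the following is only a sketch of the replacement that the warning itself recommends; child_tags is a hypothetical helper:

```python
import xml.etree.ElementTree as ET


def child_tags(elem):
    """Return the tag names of the direct children of elem."""
    # list(elem) replaces the deprecated elem.getchildren().
    return [child.tag for child in list(elem)]


root = ET.fromstring('<p><span>a</span><br/></p>')
tags = child_tags(root)
```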

Documentation enhancement

The following items should be covered.

  • Custom filter integration (e.g. Flask and Django)
  • Accessibility enhancement (aria-describedby)

Span, Zero-width space, or wbr elements?

I just learned about this nice work. It is particularly useful for people with dyslexia.

In a meeting of the Japanese DAISY project for textbooks, we discussed how hints for line breaking should be represented. The use of span elements was suggested, but people do not want to use span elements for this purpose because DAISY textbooks already use span elements too heavily for multimedia synchronization. Thus, Keio Advanced Publishing Laboratory is inclined to adopt zero-width spaces or wbr elements. Florian's personal draft is based on this assumption. See w3c/jlreq#17
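
To make the alternatives concrete, here is a small sketch that joins already-segmented chunks with a zero-width space (U+200B) or a wbr element instead of wrapping each chunk in a span. The function join_chunks is hypothetical and not part of budou:

```python
import html

ZERO_WIDTH_SPACE = '\u200b'


def join_chunks(chunks, separator=ZERO_WIDTH_SPACE):
    """Join HTML-escaped chunks with a break-opportunity marker."""
    return separator.join(html.escape(chunk) for chunk in chunks)


with_zwsp = join_chunks(['今日は', '晴れ。'])
with_wbr = join_chunks(['今日は', '晴れ。'], separator='<wbr>')
```

Both markers only add places where a line may break. Suppressing breaks inside a chunk would presumably still need styling on the container, for example the CSS declaration word-break: keep-all.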

Using `budou` name in Node.js port

I've been working on a complete port of your awesome library to Node.js, budou-node.

I would like to use the name budou in the npm package. I thought I would check in to see if you had any issues with that? Happy to make changes ✌️

Minor mistake in README

In this section of the README, the code for Traditional Chinese should be zh-Hant; zh-Hans is for Simplified Chinese.

Can't install budou using pip

Quick isolated case:

virtualenv venv
source venv/bin/activate
pip install budou

I get the following error traceback:

Collecting budou
  Using cached budou-0.6.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 31, in <module>
        install_requires=read_file('requirements.txt').splitlines(),
      File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 19, in read_file
        with open(os.path.join(os.path.dirname(__file__), name), 'r') as f:
    IOError: [Errno 2] No such file or directory: '/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/requirements.txt'

    ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou

Caching feature improvement

The current caching feature stores its data in a shelve file, but this approach does not work with some serverless architectures, such as the AppEngine environment, which may launch multiple instances for front-end serving. To enable caching on PaaS services, it would be better to rebuild the caching feature around the factory method pattern and let each platform use its own specialized implementation.
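
A rough sketch of the factory idea, assuming a small abstract interface with one implementation per platform. All names here (Cache, InMemoryCache, register_backend, load_cache) are hypothetical, and the only backend shown is an in-process dictionary; a platform-specific backend would be registered the same way:

```python
import abc


class Cache(abc.ABC):
    """Interface that every platform-specific cache would implement."""

    @abc.abstractmethod
    def get(self, key):
        """Return the cached value, or None when absent."""

    @abc.abstractmethod
    def set(self, key, value):
        """Store a value under the key."""


class InMemoryCache(Cache):
    """Simplest backend: a dictionary that lives in the current process."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


_BACKENDS = {'memory': InMemoryCache}


def register_backend(name, cls):
    """Let a platform plug in its own Cache subclass."""
    _BACKENDS[name] = cls


def load_cache(name='memory'):
    """Factory: build the cache registered under the given name."""
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError('unknown cache backend: %r' % name)
```

The caller only depends on get and set, so swapping the shelve file for a shared store on a multi-instance platform would not change the parsing code.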

Chinese language name

Natural Language API accepts 'zh' and 'zh-Hant' as supported languages, but the current implementation may pass 'zh', 'zh-TW', 'zh-CN', or 'zh-HK' to the API. They need to be aligned.
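
The alignment could be handled by a small normalization step before the API call. The mapping below is an assumption made for illustration (zh-TW and zh-HK treated as Traditional Chinese, every other zh variant collapsed to 'zh'), and normalize_language is a hypothetical helper:

```python
def normalize_language(code):
    """Map a Chinese locale code to 'zh' or 'zh-Hant'; pass others through."""
    # Assumption: these locales should be sent as Traditional Chinese.
    traditional = {'zh-tw', 'zh-hk', 'zh-hant'}
    lowered = code.lower()
    if lowered in traditional:
        return 'zh-Hant'
    if lowered == 'zh' or lowered.startswith('zh-'):
        return 'zh'
    return code
```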

Accessibility Improvement

Some screen readers read Budou-enabled paragraphs chunk by chunk, which slows down reading. We may want to make the attributes of each SPAN tag configurable so that users can add ARIA attributes to control a screen reader's behavior.

Here's an example which controls screen reading properly.

<p id="description" aria-label="やりたいことのそばにいる Android">
  <span class="ww" aria-describedby="description">やりたい</span>
  <span class="ww" aria-describedby="description">ことの</span>
  <span class="ww" aria-describedby="description">そばに</span>
  <span class="ww" aria-describedby="description">いる Android</span>
</p>
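
Configurable attributes could be sketched as follows, assuming the chunks are plain strings and the attributes are passed as a dictionary. The function wrap_chunks is hypothetical and only shows the shape of such an option:

```python
import html


def wrap_chunks(chunks, attributes):
    """Wrap each chunk in a span carrying the given attributes."""
    attrs = ''.join(
        ' %s="%s"' % (name, html.escape(value, quote=True))
        for name, value in attributes.items())
    return ''.join(
        '<span%s>%s</span>' % (attrs, html.escape(chunk))
        for chunk in chunks)


spans = wrap_chunks(
    ['やりたい', 'ことの'],
    {'class': 'ww', 'aria-describedby': 'description'})
```

The surrounding p element with its id and aria-label, as in the example above, would still be written by the page author.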

Set maximum length for the chunk

Some Japanese katakana terms are too long to fit on one line, which may cause layout degradation. Setting a maximum length for each chunk would be one solution to this issue.
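
A naive sketch of such a limit, assuming chunks are plain strings and length is counted in characters. The helper split_long_chunks is hypothetical; it cuts at fixed offsets, so it can split just before a small kana, which a real implementation would probably want to avoid:

```python
def split_long_chunks(chunks, max_length):
    """Split every chunk longer than max_length into smaller pieces."""
    if max_length < 1:
        raise ValueError('max_length must be at least 1')
    result = []
    for chunk in chunks:
        # Step through the chunk in slices of at most max_length characters.
        for start in range(0, len(chunk), max_length):
            result.append(chunk[start:start + max_length])
    return result
```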

budou.py returns an error when the input text is recognized as 'zh'

I got an HttpError when I passed in text that is recognized as 'zh'.
Budou must handle CJK texts...

For example:
result = parser.parse(u'再会', 'wordwrap')
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 113, in parse
    chunks = self._get_source_chunks(input_text)
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 178, in _get_source_chunks
    tokens = self._get_annotations(input_text)
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 150, in _get_annotations
    response = request.execute()
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 137, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 838, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://language.googleapis.com/v1beta1/documents:annotateText?alt=json returned "The language zh is not supported for syntax analysis.">

Copy icon removes spaces (breaks Korean)

The copy icon on the tool removes spaces from the text, which effectively breaks Korean text.

Copy pasting manually:
<span><span class="ww">취소에</span> <span class="ww">대해</span> <span class="ww">궁금한</span> <span class="ww">점이</span> <span class="ww">있으면</span> <span class="ww">가족</span> <span class="ww">그룹</span> <span class="ww">관리자에게</span> <span class="ww">문의하세요.</span></span>

Using the copy button (spaces removed):
<span><span class="ww">취소에</span><span class="ww">대해</span><span class="ww">궁금한</span><span class="ww">점이</span><span class="ww">있으면</span><span class="ww">가족</span><span class="ww">그룹</span><span class="ww">관리자에게</span><span class="ww">문의하세요.</span></span>

[run error] set GOOGLE_APPLICATION_CREDENTIALS

I installed budou using pip and ran an example: budou 안녕하세요
But I got this error the first time I ran the program:
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://developers.google.com/accounts/docs/application-default-credentials.
Is there anything more to do than just 'pip install budou'?
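
As the error message says, the GOOGLE_APPLICATION_CREDENTIALS environment variable has to point at a credentials file before the program runs. A small pre-flight check could look like the sketch below; credentials_configured is a hypothetical helper, and it only verifies that the variable names an existing file, not that the credentials are valid:

```python
import os


def credentials_configured(environ=os.environ):
    """Return True when GOOGLE_APPLICATION_CREDENTIALS names an existing file."""
    path = environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    return bool(path) and os.path.isfile(path)
```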

Process brackets properly

The current implementation concatenates all PUNCT marks (、。「」 etc.) to the previous chunk, but this is not appropriate for some marks, such as opening brackets.
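
One way to treat opening brackets differently, sketched under the assumption that tokens arrive as (text, is_punct) pairs. The function attach_punctuation and the OPEN_BRACKETS set are hypothetical; opening brackets are held back and prepended to the next chunk, while other punctuation still joins the previous chunk:

```python
OPEN_BRACKETS = set('「『（［｛〈《【〔([{')


def attach_punctuation(tokens):
    """Group (text, is_punct) tokens into chunks."""
    chunks = []
    prefix = ''  # opening brackets waiting for the next chunk
    for text, is_punct in tokens:
        if is_punct and text in OPEN_BRACKETS:
            prefix += text
        elif is_punct and chunks and not prefix:
            # Closing brackets, commas, periods: join the previous chunk.
            chunks[-1] += text
        else:
            chunks.append(prefix + text)
            prefix = ''
    if prefix:
        # Input ended right after an opening bracket.
        chunks.append(prefix)
    return chunks
```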

WARNING:googleapiclient.discovery_cache:file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth

When upgrading google-api-python-client and oauth2client, there is this warning from the GCP cloud:

"WARNING:googleapiclient.discovery_cache:file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth"

It would be great to be able to specify whether the build call in this line should use the cache or not:

service = googleapiclient.discovery.build(

I was thinking something like this:

class NLAPISegmenter(Segmenter):
  ...
  def __init__(self, cache_filename, credentials_path, use_entity, use_cache, cache_discovery):
    ...
    self._authenticate(cache_discovery)
  ...

  def _authenticate(self, cache_discovery):
    ...
    service = googleapiclient.discovery.build(
        'language', 'v1beta2', http=authed_http, cache_discovery=cache_discovery)

English characters should be ignored

The output of
Google XXXX YYYY へこんにちは
should be
Google XXXX <span>YYYY へ</span><span>こんにちは</span>
instead of
<span>Google </span><span>XXXX </span><span>YYYY へ</span><span>こんにちは</span>
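
The requested behavior could be sketched as wrapping only the chunks that contain at least one CJK character. The function wrap_cjk_chunks and the character ranges in CJK_PATTERN are assumptions made for illustration, not budou's implementation:

```python
import html
import re

# Hiragana, katakana, CJK unified ideographs, and Hangul syllables.
CJK_PATTERN = re.compile('[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3]')


def wrap_cjk_chunks(chunks):
    """Wrap chunks containing CJK characters in span tags; leave others bare."""
    parts = []
    for chunk in chunks:
        escaped = html.escape(chunk)
        if CJK_PATTERN.search(chunk):
            parts.append('<span>%s</span>' % escaped)
        else:
            parts.append(escaped)
    return ''.join(parts)


output = wrap_cjk_chunks(['Google ', 'XXXX ', 'YYYY へ', 'こんにちは'])
```

For the chunk list shown in the usage line, the result matches the desired output given in this issue.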