google / budou

Budou is an automatic organizer tool for beautiful line breaking in CJK (Chinese, Japanese, and Korean).

License: Apache License 2.0

Python 99.22% Makefile 0.78%

natural-language-processing python web-development cjk

budou's Issues

Use pickle for caching

The current cache uses the shelve module, which makes unit testing tricky because the suffix of the cache file name may differ by environment. Changing the cache format to pickle would make cache files more portable and simplify unit testing.
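A minimal sketch of what a pickle-backed cache could look like. The class name `PickleCache`, its `get`/`set` methods, and the cache key format are hypothetical and are not taken from budou's code. The point is only that a single pickle file has a predictable name, unlike shelve, which may append a backend-specific suffix.

```python
import os
import pickle
import tempfile


class PickleCache:
    """Hypothetical file-backed cache; not budou's actual implementation."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        # A missing file simply means nothing has been cached yet.
        if not os.path.exists(self.path):
            return {}
        with open(self.path, 'rb') as f:
            return pickle.load(f)

    def get(self, key):
        return self._load().get(key)

    def set(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, 'wb') as f:
            pickle.dump(data, f)


# The file name is exactly what we pass in, which keeps tests deterministic.
cache_path = os.path.join(tempfile.mkdtemp(), 'budou-cache.pickle')
cache = PickleCache(cache_path)
cache.set('ja:今日は天気です。', ['今日は', '天気です。'])
```

In a unit test, the cache file can then be located and removed by its exact path.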

"？" is not included in a chunk

Consider using cElementTree

Particles should be concatenated with the previous word even if there's a space in between.

input: "Budou はいいぞ．"
expected: Budou はいいぞ．
actual: Budou はいいぞ．

Non-breaking space character (\u00A0) causes AssertionError

Here is the problem string: Chatbot\u00a0\u2013

Traceback (most recent call last):
  File "<console>", line 5, in <module>
  File "/usr/local/lib/python3.6/site-packages/budou/parser.py", line 78, in parse
    chunks = self.segmenter.segment(source, language)
  File "/usr/local/lib/python3.6/site-packages/budou/tinysegmentersegmenter.py", line 94, in segment
    assert source[seek] == ' '
AssertionError

budou/budou/tinysegmentersegmenter.py

Line 94 in 87d9b81

assert source[seek] == ' '
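One possible direction, sketched under assumptions: instead of asserting that the skipped character is an ASCII space, skip any character for which `str.isspace()` is true, which includes U+00A0. The function `align_tokens` and its return shape are hypothetical and do not mirror the real segmenter code.

```python
def align_tokens(source, tokens):
    """Pair each token with the whitespace that precedes it in source.

    Hypothetical helper: tokens are assumed to appear in source in order,
    separated only by whitespace.
    """
    seek = 0
    aligned = []
    for token in tokens:
        start = seek
        # str.isspace() is True for ' ', '\t', '\u00a0', '\u3000', and so on.
        while seek < len(source) and source[seek].isspace():
            seek += 1
        gap = source[start:seek]
        if not source.startswith(token, seek):
            raise ValueError('Token %r not found at offset %d' % (token, seek))
        aligned.append((gap, token))
        seek += len(token)
    return aligned


aligned = align_tokens('Chatbot\u00a0\u2013', ['Chatbot', '\u2013'])
```

Raising `ValueError` with the offending token also gives a more useful message than a bare `AssertionError`.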

Adding Chinese support

https://cloud.google.com/natural-language/docs/languages

Handle proper nouns

Proper nouns (固有名詞) are sometimes split into multiple chunks, but ideally each should be wrapped in a single chunk. Possible solutions would be:

Use "entity" property in Natural Language API's response to force every entity to be wrapped in one chunk.
Allow users to put a list of proper nouns (maybe .csv file) to wrap as one chunk.
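The second option could be sketched roughly as follows. The function name `merge_proper_nouns` and the example data are hypothetical. This version only merges a proper noun when it lines up exactly with chunk boundaries, which is an assumption and not a complete solution.

```python
def merge_proper_nouns(chunks, proper_nouns):
    """Merge runs of adjacent chunks whose concatenation is a known proper noun."""
    result = []
    i = 0
    while i < len(chunks):
        merged_end = None
        # Try the longest run first so longer names win over shorter ones.
        for j in range(len(chunks), i + 1, -1):
            if ''.join(chunks[i:j]) in proper_nouns:
                merged_end = j
                break
        if merged_end is not None:
            result.append(''.join(chunks[i:merged_end]))
            i = merged_end
        else:
            result.append(chunks[i])
            i += 1
    return result


# The set could be loaded from a user-supplied file such as a .csv.
nouns = {'六本木ヒルズ'}
merged = merge_proper_nouns(['六本木', 'ヒルズ', 'に', '行く'], nouns)
```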

Resolve html5lib's DeprecationWarning

The current implementation keeps returning the warning below.

DeprecationWarning: This method will be removed in future versions.  Use 'list(elem)' or iteration over elem instead.
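For reference, the replacement the warning suggests looks like this with the standard library's ElementTree. The sample markup is made up for illustration, and where exactly the deprecated call is made is not shown here.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<span><span class="ww">今日は</span><span class="ww">天気です。</span></span>')

# Instead of root.getchildren(), build a list from the element...
children = list(root)

# ...or iterate over the element directly.
texts = [child.text for child in root]
```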

Documentation enhancement

The following items should be covered:

Custom filter integration (e.g. Flask and Django)
Accessibility enhancement (aria-describedby)

`aria-describedby` is stripped by `_html_serialize`

When trying to set up a11y with budou, the aria-describedby attribute gets stripped by html5lib on this line: https://github.com/google/budou/blob/master/budou/budou.py#L439

Span, Zero-width space, or wbr elements?

I just learned about this nice work. It is particularly useful for people with dyslexia.

In a meeting of the Japanese DAISY project for textbooks, we discussed how hints for line breaking should be represented. The use of span elements was suggested, but people do not want to use span elements for this purpose because DAISY textbooks already use them too heavily for multimedia synchronization. Thus, Keio Advanced Publishing Laboratory is inclined to adopt the zero-width space or wbr elements. Florian's personal draft is based on this assumption. See w3c/jlreq#17

Missing dependency: shelve

Hi @tushuhei -- we're finally getting around to importing budou into https://github.com/grow/grow and noticed that the shelve dependency isn't enumerated in this project's setup.py and so we're getting "ImportError: No module named shelve". Is the correct fix to add it to setup.py?

Using `budou` name in Node.js port

I've been working on a complete port of your awesome library to Node.js, budou-node.

I would like to use the name budou in the npm package. I thought I would check in to see if you had any issues with that? Happy to make changes ✌️

Minor mistake in README

In this section in README, code for Traditional Chinese should be zh-Hant while zh-Hans is for Simplified Chinese.

Some characters are not recognized as non-CJK characters to skip

example
input: 今日は [@foo]tushuhei.com/hoge[/@foo]天気です。
output:

<span><span class="ww">今日は [@</span>foo]tushuhei.com/hoge[/@foo]<span class="ww">天気です。</span></span>

It seems that characters like [ and @ are included in a chunk by mistake.

Can't install budou using pip

Quick isolated case:

virtualenv venv
source venv/bin/activate
pip install budou

I get error traceback:

Collecting budou                                   
  Using cached budou-0.6.0.tar.gz                  
    Complete output from command python setup.py egg_info:                                             
    Traceback (most recent call last):             
      File "<string>", line 20, in <module>        
      File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 31, in <module>                                                                                   
        install_requires=read_file('requirements.txt').splitlines(),                                   
      File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 19, in read_file                                                                                  
        with open(os.path.join(os.path.dirname(__file__), name), 'r') as f:                            
    IOError: [Errno 2] No such file or directory: '/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/requirements.txt'                                                          
                                                   
    ----------------------------------------       
    Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou

Add Jieba support

Add a Jieba backend as another segmenter option for Chinese.

Caching feature improvement

The current caching feature stores its data in a shelve file, but this approach does not work with some serverless architectures such as the App Engine environment, which may launch multiple instances for front-end serving. To enable caching on PaaS services, it would be better to restructure the caching feature with the factory method pattern and let each platform use its own specialized implementation.
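A rough sketch of the factory method idea, with every name (`Cache`, `InMemoryCache`, `NullCache`, `load_cache`) hypothetical. A platform-specific backend, for example one backed by a store shared across instances, would be another subclass registered in the same table.

```python
import abc


class Cache(abc.ABC):
    """Hypothetical cache interface; each platform supplies its own subclass."""

    @abc.abstractmethod
    def get(self, key):
        """Return the cached value for key, or None when absent."""

    @abc.abstractmethod
    def set(self, key, value):
        """Store value under key."""


class InMemoryCache(Cache):
    """Per-process cache; fine for tests, not shared between instances."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class NullCache(Cache):
    """Backend that disables caching entirely."""

    def get(self, key):
        return None

    def set(self, key, value):
        pass


_BACKENDS = {'memory': InMemoryCache, 'none': NullCache}


def load_cache(backend='memory'):
    """Factory: pick a cache implementation by name."""
    try:
        return _BACKENDS[backend]()
    except KeyError:
        raise ValueError('Unknown cache backend: {!r}'.format(backend))


cache = load_cache('memory')
cache.set('key', ['chunk'])
```

The parser would then depend only on the `Cache` interface and never on a specific storage mechanism.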

Compatibility with grow

When I try to use Budou with Grow via https://github.com/grow/grow-ext-budou, grow run fails with ImportError: cannot import name api. This looks like an architectural issue in Budou.

Chinese language name

Natural Language API accepts 'zh' and 'zh-Hant' as supported languages, but the current implementation may pass 'zh', 'zh-TW', 'zh-CN', or 'zh-HK' to the API. They need to be aligned.
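The alignment could be done with a small normalization step before calling the API. The function name `normalize_language` is hypothetical, and the specific mapping of regional codes (for instance 'zh-HK' to 'zh-Hant' and 'zh-CN' to 'zh') is an assumption for illustration, not something confirmed by the API documentation quoted here.

```python
# Assumed mapping from codes budou may produce to the two codes the API accepts.
_CHINESE_CODES = {
    'zh': 'zh',
    'zh-cn': 'zh',
    'zh-hans': 'zh',
    'zh-tw': 'zh-Hant',
    'zh-hk': 'zh-Hant',
    'zh-hant': 'zh-Hant',
}


def normalize_language(code):
    """Map Chinese language codes to 'zh' or 'zh-Hant'; pass others through."""
    return _CHINESE_CODES.get(code.lower(), code)


normalized = [normalize_language(c) for c in ('zh', 'zh-TW', 'zh-CN', 'zh-HK', 'ja')]
```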

Accessibility Improvement

Some screen reader programs read Budou-enabled paragraphs chunk by chunk, which slows down reading. We may want to add the capability to configure the attributes of each SPAN tag so that users can add ARIA attributes to control a screen reader's behavior.

Here's an example which controls screen reading properly.

<p id="description" aria-label="やりたいことのそばにいる Android">
  <span class="ww" aria-describedby="description">やりたい</span>
  <span class="ww" aria-describedby="description">ことの</span>
  <span class="ww" aria-describedby="description">そばに</span>
  <span class="ww" aria-describedby="description">いる Android</span>
</p>
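A sketch of how configurable attributes might be applied when wrapping chunks. The function `wrap_chunks` and its `attributes` parameter are hypothetical and are not part of budou's API; only `html.escape` from the standard library is relied on.

```python
from html import escape


def wrap_chunks(chunks, attributes):
    """Wrap each chunk in a SPAN carrying the given attributes."""
    # Escape names and values so arbitrary attribute values stay well-formed.
    attrs = ''.join(
        ' {}="{}"'.format(escape(name), escape(value))
        for name, value in attributes.items())
    return ''.join(
        '<span{}>{}</span>'.format(attrs, escape(chunk)) for chunk in chunks)


html = wrap_chunks(
    ['やりたい', 'ことの'],
    {'class': 'ww', 'aria-describedby': 'description'})
```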

Set maximum length for the chunk

Some Japanese katakana terms are too long to fit on one line, which may cause layout degradation. Setting a maximum length for each chunk would be a solution to this issue.
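A naive sketch of the idea, assuming a hypothetical helper `split_long_chunks` that cuts any chunk longer than `max_length` characters at fixed offsets. A real implementation would probably want smarter break points, for example avoiding a break before small kana or the long vowel mark.

```python
def split_long_chunks(chunks, max_length):
    """Split every chunk into pieces of at most max_length characters."""
    if max_length < 1:
        raise ValueError('max_length must be at least 1')
    result = []
    for chunk in chunks:
        # Fixed-width slicing; empty chunks produce no pieces.
        for start in range(0, len(chunk), max_length):
            result.append(chunk[start:start + max_length])
    return result


pieces = split_long_chunks(['インターナショナライゼーション', 'です'], 8)
```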

budou.py returns an error when the input text is recognized as 'zh'

I got an HttpError when I input text that is recognized as 'zh'.
Budou must handle CJK texts...

For example:
result = parser.parse(u'再会', 'wordwrap')
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 113, in parse
    chunks = self._get_source_chunks(input_text)
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 178, in _get_source_chunks
    tokens = self._get_annotations(input_text)
  File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 150, in _get_annotations
    response = request.execute()
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 137, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 838, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://language.googleapis.com/v1beta1/documents:annotateText?alt=json returned "The language zh is not supported for syntax analysis.">

Copy icon removes spaces (breaks Korean)

The copy icon on the tool removes spaces from the text, which effectively breaks Korean text.

Copy pasting manually:
취소에 대해 궁금한 점이 있으면 가족 그룹 관리자에게 문의하세요.

Using the copy button (spaces removed):
취소에대해궁금한점이있으면가족그룹관리자에게문의하세요.

[run error] set GOOGLE_APPLICATION_CREDENTIALS

I installed budou using pip and ran an example like 'budou 안녕하세요'.
But I got this error the first time I ran the program.
:: [google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://developers.google.com/accounts/docs/application-default-credentials.]
Is there anything more to do than just 'pip install budou'?
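As the error message says, the GOOGLE_APPLICATION_CREDENTIALS environment variable is expected to point at a credentials file. A small pre-flight check might look like the sketch below; the helper name `credentials_configured` is hypothetical and not part of budou or google-auth.

```python
import os


def credentials_configured(environ=os.environ):
    """Return True if GOOGLE_APPLICATION_CREDENTIALS names an existing file."""
    path = environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    return bool(path) and os.path.isfile(path)


if not credentials_configured():
    print('Set GOOGLE_APPLICATION_CREDENTIALS to the path of a '
          'service account key file before running budou.')
```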

Improper grouping of comma (読点)

In this case, the comma (読点) should be grouped with the preceding word (最終日).

Process brackets properly

The current implementation concatenates all PUNCT marks (、。「」 etc.) to the previous chunk, but this is not appropriate for some marks, such as opening brackets.
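One way to handle this, as a sketch only: attach opening brackets to the following chunk and all other punctuation to the previous chunk. The function `attach_punctuation`, the `(text, is_punct)` token shape, and the bracket set are assumptions for illustration and do not reflect budou's internal data structures.

```python
OPEN_BRACKETS = set('「『（［｛〈《【〔([{')


def attach_punctuation(tokens):
    """Build chunks from (text, is_punct) pairs.

    Opening brackets are held back and prefixed to the next chunk;
    other punctuation is appended to the previous chunk.
    """
    chunks = []
    pending = ''  # opening brackets waiting for the next token
    for text, is_punct in tokens:
        if is_punct and text in OPEN_BRACKETS:
            pending += text
        elif is_punct and chunks and not pending:
            chunks[-1] += text
        else:
            chunks.append(pending + text)
            pending = ''
    if pending:
        # Trailing opening bracket with nothing after it.
        chunks.append(pending)
    return chunks


tokens = [('彼は', False), ('「', True), ('はい', False), ('」', True),
          ('と', False), ('言った', False), ('。', True)]
chunks = attach_punctuation(tokens)
```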

WARNING:googleapiclient.discovery_cache:file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth

When upgrading google-api-python-client and oauth2client, there is this warning from the GCP cloud:

"WARNING:googleapiclient.discovery_cache:file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth"

It would be great to be able to specify whether the build call on this line should use the cache or not:

budou/budou/nlapisegmenter.py

Line 141 in ddfe8d2

service = googleapiclient.discovery.build(

I was thinking something like this:

class NLAPISegmenter(Segmenter):
  ...
  def __init__(self, cache_filename, credentials_path, use_entity, use_cache, cache_discovery):
    ...
    self._authenticate(cache_discovery)
  ...

  def _authenticate(self, cache_discovery):
    ...
    service = googleapiclient.discovery.build(
        'language', 'v1beta2', http=authed_http, cache_discovery=cache_discovery)

English characters should be ignored

The output of
Google XXXX YYYY へこんにちは
should be
Google XXXX YYYY へこんにちは
instead of
Google XXXX YYYY へこんにちは
