google / budou
Budou is an automatic organizer tool for beautiful line breaking in CJK (Chinese, Japanese, and Korean).
License: Apache License 2.0
The current cache uses the shelve module, which makes unit testing tricky because the suffix of the cache file name may differ by environment. Changing the cache format to pickle would make cache files more portable and simplify unit testing.
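A minimal sketch of what a pickle-backed cache could look like. The class name `PickleCache`, its `get`/`set` methods, and the key format are hypothetical and not taken from Budou's code; the point is only that the file on disk has exactly the name that was passed in, so a test can assert on it.

```python
import os
import pickle
import tempfile


class PickleCache:
    """Hypothetical single-file cache; not Budou's actual implementation."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        # A missing file simply means nothing has been cached yet.
        if not os.path.exists(self.path):
            return {}
        with open(self.path, 'rb') as f:
            return pickle.load(f)

    def get(self, key):
        return self._load().get(key)

    def set(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, 'wb') as f:
            pickle.dump(data, f)


# Usage: the cache file name is fully determined by the caller.
cache_path = os.path.join(tempfile.mkdtemp(), 'budou-cache.pickle')
cache = PickleCache(cache_path)
cache.set('ja:今日は', ['今日は'])
```

Rewriting the whole file on every `set` is wasteful for large caches; it is kept here only to keep the sketch short.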
input: "Budou はいいぞ."
expected: <span class="ww">Budou は</span><span class="ww">いいぞ.</span>
actual: <span class="ww">Budou</span> は<span class="ww">いいぞ.</span>
Here is the problem string: Chatbot\u00a0\u2013
Traceback (most recent call last):
File "<console>", line 5, in <module>
File "/usr/local/lib/python3.6/site-packages/budou/parser.py", line 78, in parse
chunks = self.segmenter.segment(source, language)
File "/usr/local/lib/python3.6/site-packages/budou/tinysegmentersegmenter.py", line 94, in segment
assert source[seek] == ' '
AssertionError
budou/budou/tinysegmentersegmenter.py
Line 94 in 87d9b81
Proper nouns (固有名詞) are sometimes split into several chunks; ideally each should be wrapped in a single chunk. Possible solutions would be:
The current implementation keeps emitting the warning below.
DeprecationWarning: This method will be removed in future versions. Use 'list(elem)' or iteration over elem instead.
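A small sketch of the replacement the warning suggests, assuming the warning is triggered by a call to `getchildren()` on an ElementTree-style element. The markup below is made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<p><span>a</span><span>b</span></p>')

# Instead of the deprecated root.getchildren():
children = list(root)

# Or iterate over the element directly.
texts = [child.text for child in root]
```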
The following items should be covered.
When trying to set up a11y with budou, the aria-describedby attribute is stripped by html5lib at https://github.com/google/budou/blob/master/budou/budou.py#L439
I just learned about this nice work. It is particularly useful for people with dyslexia.
In a meeting of the Japanese DAISY project for textbooks, we discussed how hints for line breaking should be represented. The use of span elements was suggested. But people do not want to use span elements for this purpose, because DAISY textbooks already use span elements too heavily for multimedia synchronization. Thus, Keio Advanced Publishing Laboratory is inclined to adopt the zero-width space or wbr elements. Florian's personal draft is based on this assumption. See w3c/jlreq#17
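As a rough illustration of the alternative discussed above, the sketch below joins already-segmented chunks with either a zero-width space (U+200B) or a `<wbr>` element instead of wrapping each chunk in a span. The function name `join_with_break_hints` and its `use_wbr` flag are hypothetical and not part of Budou.

```python
import html

ZERO_WIDTH_SPACE = '\u200b'


def join_with_break_hints(chunks, use_wbr=False):
    """Joins chunks with a break hint between them (hypothetical helper)."""
    # Escape each chunk so the result is safe to embed in HTML.
    escaped = [html.escape(chunk) for chunk in chunks]
    separator = '<wbr>' if use_wbr else ZERO_WIDTH_SPACE
    return separator.join(escaped)


with_zwsp = join_with_break_hints(['今日は', '晴れ'])
with_wbr = join_with_break_hints(['今日は', '晴れ'], use_wbr=True)
```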
Hi @tushuhei -- we're finally getting around to importing budou into https://github.com/grow/grow and noticed that the shelve dependency isn't enumerated in this project's setup.py, and so we're getting "ImportError: No module named shelve". Is the correct fix to add it to setup.py?
I've been working on a complete port of your awesome library to Node.js, budou-node.
I would like to use the name budou in the npm package. I thought I would check in to see if you had any issues with that? Happy to make changes ✌️
In this section of the README, the code for Traditional Chinese should be zh-Hant, while zh-Hans is for Simplified Chinese.
Quick isolated case:
virtualenv venv
source venv/bin/activate
pip install budou
I get this error traceback:
Collecting budou
Using cached budou-0.6.0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 31, in <module>
install_requires=read_file('requirements.txt').splitlines(),
File "/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/setup.py", line 19, in read_file
with open(os.path.join(os.path.dirname(__file__), name), 'r') as f:
IOError: [Errno 2] No such file or directory: '/private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou/requirements.txt'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/01/w01_1_fx077_0zpzqxm2092m0000gn/T/pip-build-9UqicV/budou
Add Jieba backend segmenter to add another segmenter option for Chinese.
The current caching feature stores its data in a shelve file, but this approach does not work with some serverless architectures such as the App Engine environment, which may launch multiple instances for front-end serving. To enable caching on PaaS services, it would be better to restructure the caching feature around the factory method pattern and let each platform use its own specialized implementation.
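One way the factory approach could look, as a sketch only. All names here (`Cache`, `LocalCache`, `SharedCache`, `create_cache`) are hypothetical, and the shared client is assumed to be any object exposing `get` and `set`, such as a memcache-style service shared by all instances.

```python
from abc import ABC, abstractmethod


class Cache(ABC):
    """Interface every platform-specific cache would implement."""

    @abstractmethod
    def get(self, key):
        ...

    @abstractmethod
    def set(self, key, value):
        ...


class LocalCache(Cache):
    """Per-process cache; fine for a single long-running instance."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class SharedCache(Cache):
    """Delegates to an external store shared across instances."""

    def __init__(self, client):
        self._client = client

    def get(self, key):
        return self._client.get(key)

    def set(self, key, value):
        self._client.set(key, value)


def create_cache(shared_client=None):
    """Factory: picks the implementation suited to the environment."""
    if shared_client is not None:
        return SharedCache(shared_client)
    return LocalCache()


cache = create_cache()
cache.set('key', ['chunk'])
```

Callers only depend on `Cache`, so a platform can supply its own implementation without touching the parser.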
When I try to use Budou with Grow using https://github.com/grow/grow-ext-budou, grow run fails with ImportError: cannot import name api. This looks like an issue with Budou's architecture.
The Natural Language API accepts 'zh' and 'zh-Hant' as supported languages, but the current implementation may pass 'zh', 'zh-TW', 'zh-CN', or 'zh-HK' to the API. They need to be aligned.
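A possible normalization step, sketched under assumptions: the helper name `normalize_language` is hypothetical, and the mapping (zh-TW and zh-HK to 'zh-Hant', zh-CN to 'zh') is a guess at the intended alignment rather than something confirmed by the API documentation.

```python
# Assumed mapping from commonly passed tags to the two accepted codes.
LANGUAGE_ALIASES = {
    'zh': 'zh',
    'zh-cn': 'zh',
    'zh-hans': 'zh',
    'zh-tw': 'zh-Hant',
    'zh-hk': 'zh-Hant',
    'zh-hant': 'zh-Hant',
}


def normalize_language(code):
    """Returns the assumed API code for a tag; other tags pass through."""
    return LANGUAGE_ALIASES.get(code.lower(), code)


normalized = [normalize_language(c) for c in ('zh', 'zh-TW', 'zh-CN', 'zh-HK')]
```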
Some screen reader programs read Budou-enabled paragraphs chunk by chunk, which slows down their reading. We may want to add the capability to configure the attributes of each SPAN tag so that users can add ARIA attributes to control a screen reader's behavior.
Here's an example which controls screen reading properly.
<p id="description" aria-label="やりたいことのそばにいる Android">
<span class="ww" aria-describedby="description">やりたい</span>
<span class="ww" aria-describedby="description">ことの</span>
<span class="ww" aria-describedby="description">そばに</span>
<span class="ww" aria-describedby="description">いる Android</span>
</p>
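A sketch of how configurable span attributes might be rendered. The function `wrap_chunks` and its `attributes` parameter are hypothetical, not Budou's API; attribute values and chunk text are escaped with the standard library's `html.escape`.

```python
import html


def wrap_chunks(chunks, attributes):
    """Wraps each chunk in a span carrying the given attributes."""
    # Sort attribute names so the output is deterministic.
    attrs = ''.join(
        ' {}="{}"'.format(name, html.escape(str(value), quote=True))
        for name, value in sorted(attributes.items()))
    return ''.join(
        '<span{}>{}</span>'.format(attrs, html.escape(chunk))
        for chunk in chunks)


markup = wrap_chunks(
    ['やりたい', 'ことの'],
    {'class': 'ww', 'aria-describedby': 'description'})
```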
Some Japanese katakana terms are too long to fit on one line, which may cause layout degradation. Setting a maximum length for each chunk would be a solution for this issue.
Got an HttpError when I input text that is recognized as 'zh'.
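A naive sketch of a maximum chunk length, assuming chunks are plain strings. The helper `split_long_chunks` is hypothetical. It cuts purely by character count, so it can break before a small kana or a long-vowel mark; a real implementation would likely need to avoid that.

```python
def split_long_chunks(chunks, max_length):
    """Splits any chunk longer than max_length into fixed-size pieces."""
    if max_length < 1:
        raise ValueError('max_length must be at least 1')
    result = []
    for chunk in chunks:
        # Step through the chunk in windows of max_length characters.
        for start in range(0, len(chunk), max_length):
            result.append(chunk[start:start + max_length])
    return result


pieces = split_long_chunks(['インターナショナル', 'です'], 4)
```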
Budou must handle CJK texts...
For example:
result = parser.parse(u'再会', 'wordwrap')
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 113, in parse
chunks = self._get_source_chunks(input_text)
File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 178, in _get_source_chunks
tokens = self._get_annotations(input_text)
File "/usr/local/lib/python2.7/dist-packages/budou/budou.py", line 150, in _get_annotations
response = request.execute()
File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 137, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 838, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://language.googleapis.com/v1beta1/documents:annotateText?alt=json returned "The language zh is not supported for syntax analysis.">
The copy icon on the tool removes spaces from the text, which effectively breaks Korean text.
Copy pasting manually:
<span><span class="ww">취소에</span> <span class="ww">대해</span> <span class="ww">궁금한</span> <span class="ww">점이</span> <span class="ww">있으면</span> <span class="ww">가족</span> <span class="ww">그룹</span> <span class="ww">관리자에게</span> <span class="ww">문의하세요.</span></span>
Using the copy button (spaces removed):
<span><span class="ww">취소에</span><span class="ww">대해</span><span class="ww">궁금한</span><span class="ww">점이</span><span class="ww">있으면</span><span class="ww">가족</span><span class="ww">그룹</span><span class="ww">관리자에게</span><span class="ww">문의하세요.</span></span>
I installed budou using pip and ran an example like 'budou 안녕하세요'.
But I got this error the first time I ran the program.
:: [google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://developers.google.com/accounts/docs/application-default-credentials.]
Are there more steps required than just 'pip install budou'?
The current implementation concatenates all PUNCT marks (、。「」 etc.) to the previous chunk, but this is not appropriate for some marks, such as opening brackets.
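A sketch of the distinction described above, assuming tokens arrive as (text, is_punct) pairs. The function `attach_punctuation` and the set of opening brackets are illustrative and not taken from Budou: opening brackets are held back and prepended to the next chunk, while other punctuation is appended to the previous one.

```python
OPEN_BRACKETS = set('「『（［｛〈《【〔([{')


def attach_punctuation(tokens):
    """Merges punctuation tokens into neighboring chunks."""
    chunks = []
    pending_prefix = ''
    for text, is_punct in tokens:
        if is_punct and text in OPEN_BRACKETS:
            # Opening brackets belong to the chunk that follows.
            pending_prefix += text
        elif is_punct and chunks and not pending_prefix:
            # Other punctuation attaches to the previous chunk.
            chunks[-1] += text
        else:
            chunks.append(pending_prefix + text)
            pending_prefix = ''
    if pending_prefix:
        # Trailing opening bracket with nothing after it.
        chunks.append(pending_prefix)
    return chunks


tokens = [('彼は', False), ('「', True), ('はい', False), ('」', True),
          ('と', False), ('言った', False), ('。', True)]
merged = attach_punctuation(tokens)
```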
When upgrading google-api-python-client and oauth2client, there is this warning from GCP:
"WARNING:googleapiclient.discovery_cache:file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth"
It would be great to be able to specify whether the build call in this line should use the cache or not:
Line 141 in ddfe8d2
I was thinking of something like this:
class NLAPISegmenter(Segmenter):
  ...
  def __init__(self, cache_filename, credentials_path, use_entity, use_cache, cache_discovery):
    ...
    self._authenticate(cache_discovery)
    ...
  def _authenticate(self, cache_discovery):
    ...
    service = googleapiclient.discovery.build(
        'language', 'v1beta2', http=authed_http, cache_discovery=cache_discovery)
The output of
Google XXXX YYYY へこんにちは
should be
Google XXXX <span>YYYY へ</span><span>こんにちは</span>
instead of
<span>Google </span><span>XXXX </span><span>YYYY へ</span><span>こんにちは</span>