Comments (2)
I found that this issue occurs when the Natural Language API detects Japanese text as Chinese, and that it can be avoided by setting the language parameter explicitly. I wrote some code to pass a language parameter to the API. May I have permission to open a pull request? :)
Diff:
diff --git a/budou/budou.py b/budou/budou.py
index d9d3e53..df802d6 100644
--- a/budou/budou.py
+++ b/budou/budou.py
@@ -90,13 +90,15 @@ class Budou(object):
service = discovery.build('language', 'v1beta1', http=http)
return cls(service)
- def parse(self, source, classname=DEFAULT_CLASS_NAME, use_cache=True):
+ def parse(self, source, classname=DEFAULT_CLASS_NAME, use_cache=True,
+ language=''):
"""Parses input HTML code into word chunks and organized code.
Args:
source: HTML code to be processed (unicode).
classname: A class name of each word chunk in the HTML code (string).
user_cache: Whether to use cache (boolean).
+ language: A language used to parse text (string).
Returns:
A dictionary with the list of word chunks and organized HTML code.
@@ -110,7 +112,7 @@ class Budou(object):
source = self._preprocess(source)
dom = html.fragment_fromstring(source, create_parent='body')
input_text = dom.text_content()
- chunks = self._get_source_chunks(input_text)
+ chunks = self._get_source_chunks(input_text, language)
chunks = self._concatenate_punctuations(chunks)
chunks = self._concatenate_by_label(chunks, True)
chunks = self._concatenate_by_label(chunks, False)
@@ -132,7 +134,7 @@ class Budou(object):
return hashlib.md5(key_source.encode('utf8')).hexdigest()
- def _get_annotations(self, text, encoding='UTF32'):
+ def _get_annotations(self, text, language='', encoding='UTF32'):
"""Returns the list of annotations from the given text."""
body = {
'document': {
@@ -145,6 +147,9 @@ class Budou(object):
'encodingType': encoding,
}
+ if language:
+ body['document']['language'] = language
+
request = self.service.documents().annotateText(body=body)
response = request.execute()
return response.get('tokens', [])
@@ -163,18 +168,19 @@ class Budou(object):
source = re.sub(r'\s\s+', u' ', source)
return source
- def _get_source_chunks(self, input_text):
+ def _get_source_chunks(self, input_text, language=''):
"""Returns the words chunks.
Args:
input_text: An input text to annotate (unicode).
+ language: A language used to parse text (string).
Returns:
A list of word chunk objects (list).
"""
chunks = []
sentence_length = 0
- tokens = self._get_annotations(input_text)
+ tokens = self._get_annotations(input_text, language)
for token in tokens:
word = token['text']['content']
begin_offset = token['text']['beginOffset']
Result:
>>> import budou
>>> parser = budou.authenticate('xxxxx.json')
>>> parser.parse(u'再会')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "build/bdist.macosx-10.6-x86_64/egg/budou/budou.py", line 114, in parse
File "build/bdist.macosx-10.6-x86_64/egg/budou/budou.py", line 183, in _get_source_chunks
File "build/bdist.macosx-10.6-x86_64/egg/budou/budou.py", line 154, in _get_annotations
File "/Users/yaboo/resources/anaconda/lib/python2.7/site-packages/oauth2client/util.py", line 137, in positional_wrapper
return wrapped(*args, **kwargs)
File "/Users/yaboo/resources/anaconda/lib/python2.7/site-packages/googleapiclient/http.py", line 838, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://language.googleapis.com/v1beta1/documents:annotateText?alt=json returned "The language zh is not supported for syntax analysis.">
>>> parser.parse(u'再会', language='ja')
{'chunks': [Chunk(word=u'\u518d\u4f1a', pos=u'NOUN', label=u'ROOT', forward=False)], 'html_code': u'<span class="ww">\u518d\u4f1a</span>'}
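For reference, the core of the change can be sketched outside of Budou as a small standalone function. This is only an illustration: `build_annotate_body` is a hypothetical helper name, and the `type`, `content`, and `features` fields are assumptions about the parts of the request body that the diff elides. Only `encodingType` and the conditional `language` key are taken from the diff itself.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def build_annotate_body(text, language='', encoding='UTF32'):
    """Builds an annotateText request body (hypothetical helper, not Budou API)."""
    body = {
        'document': {
            'type': 'PLAIN_TEXT',  # assumed: this field is elided in the diff
            'content': text,       # assumed: this field is elided in the diff
        },
        'features': {'extractSyntax': True},  # assumed: elided in the diff
        'encodingType': encoding,
    }
    # As in the diff: send a language hint only when the caller supplies one,
    # so the API's automatic language detection remains the default.
    if language:
        body['document']['language'] = language
    return body


auto_detected = build_annotate_body(u'再会')
hinted = build_annotate_body(u'再会', language='ja')
print('language' in auto_detected['document'])
print(hinted['document']['language'])
```

Using an empty string as the default keeps existing callers unchanged, since the `language` key is simply left out of the request unless a caller asks for it.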
Thank you for your contribution. I merged your change, which adds the language parameter.