
Comments (2)

yaboo-oyabu commented on July 28, 2024

I found that this issue occurs when the Natural Language API misidentifies Japanese text as Chinese, and that it can be avoided by setting the language parameter explicitly. I wrote a patch that passes a language parameter through to the API. Could I have permission to open a pull request? :)

Diff:

diff --git a/budou/budou.py b/budou/budou.py
index d9d3e53..df802d6 100644
--- a/budou/budou.py
+++ b/budou/budou.py
@@ -90,13 +90,15 @@ class Budou(object):
     service = discovery.build('language', 'v1beta1', http=http)
     return cls(service)

-  def parse(self, source, classname=DEFAULT_CLASS_NAME, use_cache=True):
+  def parse(self, source, classname=DEFAULT_CLASS_NAME, use_cache=True,
+            language=''):
     """Parses input HTML code into word chunks and organized code.

     Args:
       source: HTML code to be processed (unicode).
       classname: A class name of each word chunk in the HTML code (string).
       user_cache: Whether to use cache (boolean).
+      language: A language used to parse text (string).

     Returns:
       A dictionary with the list of word chunks and organized HTML code.
@@ -110,7 +112,7 @@ class Budou(object):
     source = self._preprocess(source)
     dom = html.fragment_fromstring(source, create_parent='body')
     input_text = dom.text_content()
-    chunks = self._get_source_chunks(input_text)
+    chunks = self._get_source_chunks(input_text, language)
     chunks = self._concatenate_punctuations(chunks)
     chunks = self._concatenate_by_label(chunks, True)
     chunks = self._concatenate_by_label(chunks, False)
@@ -132,7 +134,7 @@ class Budou(object):
     return hashlib.md5(key_source.encode('utf8')).hexdigest()


-  def _get_annotations(self, text, encoding='UTF32'):
+  def _get_annotations(self, text, language='', encoding='UTF32'):
     """Returns the list of annotations from the given text."""
     body = {
         'document': {
@@ -145,6 +147,9 @@ class Budou(object):
         'encodingType': encoding,
     }

+    if language:
+        body['document']['language'] = language
+
     request = self.service.documents().annotateText(body=body)
     response = request.execute()
     return response.get('tokens', [])
@@ -163,18 +168,19 @@ class Budou(object):
     source = re.sub(r'\s\s+', u' ', source)
     return source

-  def _get_source_chunks(self, input_text):
+  def _get_source_chunks(self, input_text, language=''):
     """Returns the words chunks.

     Args:
       input_text: An input text to annotate (unicode).
+      language: A language used to parse text (string).

     Returns:
       A list of word chunk objects (list).
     """
     chunks = []
     sentence_length = 0
-    tokens = self._get_annotations(input_text)
+    tokens = self._get_annotations(input_text, language)
     for token in tokens:
       word = token['text']['content']
       begin_offset = token['text']['beginOffset']

Result:

>>> import budou
>>> parser = budou.authenticate('xxxxx.json')
>>> parser.parse(u'再会')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.6-x86_64/egg/budou/budou.py", line 114, in parse
  File "build/bdist.macosx-10.6-x86_64/egg/budou/budou.py", line 183, in _get_source_chunks
  File "build/bdist.macosx-10.6-x86_64/egg/budou/budou.py", line 154, in _get_annotations
  File "/Users/yaboo/resources/anaconda/lib/python2.7/site-packages/oauth2client/util.py", line 137, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/Users/yaboo/resources/anaconda/lib/python2.7/site-packages/googleapiclient/http.py", line 838, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://language.googleapis.com/v1beta1/documents:annotateText?alt=json returned "The language zh is not supported for syntax analysis.">
>>> parser.parse(u'再会', language='ja')
{'chunks': [Chunk(word=u'\u518d\u4f1a', pos=u'NOUN', label=u'ROOT', forward=False)], 'html_code': u'<span class="ww">\u518d\u4f1a</span>'}
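For anyone reading without the full file at hand, here is a minimal standalone sketch of the idea behind the `_get_annotations` change: the `language` key is added to the request document only when the caller supplies one, so the API falls back to auto-detection otherwise. The helper name `build_annotate_body` is hypothetical, and the `type`, `content`, and `features` fields are my assumption about the part of the request body that the diff elides; only `encodingType` and the conditional `language` assignment come from the diff itself.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper, not part of budou: it only illustrates how the
# optional language hint ends up in the annotateText request body.


def build_annotate_body(text, language='', encoding='UTF32'):
    """Builds a request body for documents().annotateText (sketch)."""
    body = {
        # Assumed fields: the diff does not show the contents of 'document'
        # or the 'features' entry, so these are illustrative only.
        'document': {
            'type': 'PLAIN_TEXT',
            'content': text,
        },
        'features': {'extractSyntax': True},
        'encodingType': encoding,
    }
    # As in the diff: pass a language only when one was given, so an empty
    # string keeps the API's automatic language detection.
    if language:
        body['document']['language'] = language
    return body


auto_body = build_annotate_body(u'再会')
ja_body = build_annotate_body(u'再会', language='ja')
```

With no language given, the document carries no `language` key and detection is left to the API, which is the case that fails for `再会`; with `language='ja'` the hint is sent explicitly.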


tushuhei commented on July 28, 2024

Thank you for your contribution. I merged your change, so the language parameter can now be specified.
