soimort / python-romkan

A Romaji/Kana conversion library for Python

Home Page: http://www.soimort.org/python-romkan/

License: Other

Python 100.00%

python-romkan's Introduction

python-romkan

python-romkan is a Romaji/Kana conversion library for Python, which is used to convert a Japanese Romaji (ローマ字) string to a Japanese Kana (仮名) string or vice versa.

It is the Pythonic port of Ruby/Romkan, originally authored by Satoru Takabayashi and ported by Masato Hagiwara.

python-romkan works on Python 2 and Python 3 (fully tested on Python 2.6, 2.7, 3.2, 3.3 and PyPy). It handles both Katakana (片仮名) and Hiragana (平仮名) with the Hepburn (ヘボン式) romanization system, as well as the modern Kunrei-shiki (訓令式) romanization system.

Project homepage: http://www.soimort.org/python-romkan

Fork me on GitHub: https://github.com/soimort/python-romkan

Installation

1. Install via Pip:

$ pip install romkan

2. Install via EasyInstall:

$ easy_install romkan

3. Install from Git:

$ git clone https://github.com/soimort/python-romkan.git
$ cd python-romkan
$ python setup.py install

Usage

Python 3.x:

$ python
>>> import romkan
>>> print(romkan.to_roma("にんじゃ"))
ninja
>>> print(romkan.to_hepburn("にんじゃ"))
ninja
>>> print(romkan.to_kunrei("にんじゃ"))
ninzya
>>> print(romkan.to_hiragana("ninja"))
にんじゃ
>>> print(romkan.to_katakana("ninja"))
ニンジャ

Python 2.x:

$ python2
>>> import romkan
>>> print romkan.to_roma(u"にんじゃ")
ninja
>>> print romkan.to_hepburn(u"にんじゃ")
ninja
>>> print romkan.to_kunrei(u"にんじゃ")
ninzya
>>> print romkan.to_hiragana("ninja")
にんじゃ
>>> print romkan.to_katakana("ninja")
ニンジャ

API Reference

to_katakana(string)

Convert a Romaji (ローマ字) to a Katakana (片仮名).

to_hiragana(string)

Convert a Romaji (ローマ字) to a Hiragana (平仮名).

to_kana(string)

Convert a Romaji (ローマ字) to a Katakana (片仮名). (same as to_katakana)

to_hepburn(string)

Convert a Kana (仮名) or a Kunrei-shiki Romaji (訓令式ローマ字) to a Hepburn Romaji (ヘボン式ローマ字).

to_kunrei(string)

Convert a Kana (仮名) or a Hepburn Romaji (ヘボン式ローマ字) to a Kunrei-shiki Romaji (訓令式ローマ字).

to_roma(string)

Convert a Kana (仮名) to a Hepburn Romaji (ヘボン式ローマ字).

License

python-romkan is licensed under the BSD license.

python-romkan's Issues

fails on double 'n'

In both traditional and modified Hepburn romanization, "annai" should be rendered as あんない, but romkan renders it as あんあい.

>>> import romkan
>>> romkan.to_hiragana("annai")
'あんあい'
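To make the expected reading concrete, here is a minimal sketch of greedy longest-match conversion over a toy table. The table and the function name `toy_to_hiragana` are hypothetical and are not romkan's actual implementation; they only show that treating the first "n" as ん and the following "na" as な yields あんない.

```python
# Toy table: a tiny, hypothetical subset of Hepburn romaji, not romkan's real table.
TOY_TABLE = {"a": "あ", "i": "い", "na": "な", "n": "ん"}


def toy_to_hiragana(text):
    """Convert romaji to hiragana by greedy longest-match over TOY_TABLE."""
    longest = max(len(key) for key in TOY_TABLE)
    out = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(longest, 0, -1):
            chunk = text[pos:pos + size]
            if chunk in TOY_TABLE:
                out.append(TOY_TABLE[chunk])
                pos += len(chunk)
                break
        else:
            # No table entry matched: pass the character through unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(toy_to_hiragana("annai"))
```

With this toy table, "annai" is split as a + n + na + i, which is the reading the report expects.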

Failed to install when default encoding is not UTF-8

Traceback (most recent call last):
  File "setup.py", line 12, in <module>
    README = open(os.path.join(here, 'README.rst')).read()
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 181: ordinal not in range(128)
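The traceback shows `open()` falling back to the locale's default codec. A minimal sketch of a locale-independent read in Python 3 follows; the helper name `read_text_utf8` and the temporary demo file are hypothetical, and this is not a patch taken from the project.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of the locale's default encoding."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Demo with a throwaway file standing in for README.rst.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "README.rst")
sample = "Romaji (ローマ字)"
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(sample)

print(read_text_utf8(demo_path) == sample)
```

On Python 2.6 and 2.7 the built-in `open()` has no `encoding` parameter; `io.open()` accepts the same argument there.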

Exceptions for "kannji" and "kannzi"

Test cases for the following functions currently check the output of "kannji" and "kannzi":

to_katakana(...) - Convert a Romaji (ローマ字) to a Katakana (片仮名).
to_kana(...) - Convert a Romaji (ローマ字) to a Katakana (片仮名). (same as to_katakana)
to_hiragana(...) - Convert a Romaji (ローマ字) to a Hiragana (平仮名).
to_hepburn(...) - Convert a Kana (仮名) or a Kunrei-shiki Romaji (訓令式ローマ字) to a Hepburn Romaji (ヘボン式ローマ字).

These two forms ("kannji" and "kannzi") are valid in neither the Hepburn nor the Kunrei romanization scheme, so they cannot be considered valid "Romaji" input. When given these inputs, instead of returning values ("カンジ", "かんじ", and "kanji"), the four functions above should raise an exception.

If we wanted to preserve the ability to convert from these forms, support for a third form of romanization should be added - "Wāpuro rōmaji" (word processor romaji):
http://en.wikipedia.org/wiki/W%C4%81puro_r%C5%8Dmaji
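One way the proposed strictness could look is sketched below. The function `require_strict_romaji` and its rule are assumptions made for illustration: "nn" directly before a consonant other than "n" or "y" is treated as wāpuro-style input and rejected. romkan does not provide this function.

```python
import re

# Heuristic (an assumption, not romkan's behaviour): "nn" directly before a
# consonant other than "n" or "y" is treated as wapuro-style input.
_WAPURO_NN = re.compile(r"nn(?=[bcdfghjkmprstwz])")


def require_strict_romaji(text):
    """Return text unchanged, or raise ValueError on wapuro-style 'nn'."""
    match = _WAPURO_NN.search(text.lower())
    if match:
        raise ValueError(
            "wapuro-style 'nn' at position %d in %r" % (match.start(), text)
        )
    return text


for word in ("kanji", "annai", "kannji", "kannzi"):
    try:
        require_strict_romaji(word)
        print(word, "accepted")
    except ValueError as error:
        print(word, "rejected:", error)
```

The check leaves "annai" alone because the second "n" there begins the syllable "na"; a real implementation would need a fuller rule set than this single pattern.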

Macrons for some long vowels

The Hepburn romaniser seems to follow the Modified Hepburn version.
http://en.wikipedia.org/wiki/Hepburn_romanization

e.g. It produces 'shinpai' instead of 'shimpai'.

Modified Hepburn has complex rules for placing macrons to indicate that some vowels are long, which the romaniser does not currently do.

For example:

Expected results:

ちゅうい  chūi
みずうみ  mizuumi

Actual results:

ちゅうい  chuui
みずうみ  mizuumi

Another example is with 'tōkyō', which currently gets output as 'toukyou'.
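The sketch below shows why a plain string substitution is not enough. The helper `naive_macrons` and its replacement list are hypothetical; it fixes 'toukyou' and 'chuui' but also rewrites 'mizuumi', which the expected results above leave unchanged.

```python
# Naive rule set (an assumption for illustration): collapse doubled or
# "ou" vowels into a macron without looking at word structure.
NAIVE_MACRONS = [("ou", "ō"), ("oo", "ō"), ("uu", "ū")]


def naive_macrons(romaji):
    """Apply every replacement in NAIVE_MACRONS, in order."""
    for plain, long_vowel in NAIVE_MACRONS:
        romaji = romaji.replace(plain, long_vowel)
    return romaji


for word in ("toukyou", "chuui", "mizuumi"):
    print(word, "->", naive_macrons(word))
```

Because the rule cannot tell the two cases apart, 'mizuumi' comes out as 'mizūmi'. A correct romaniser would need more information than the romaji string alone.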

Add reversible romanization method

Neither of the current romanization methods is fully reversible.

Effort seems to have been put towards making to_kunrei reversible, even though this forces it to not strictly follow the Kunrei scheme:
to_kunrei:

ち -> ti
てぃ  -> texi (should also be 'ti')

(See "ティーム" vs. "チーム", http://en.wikipedia.org/wiki/Kunrei-shiki_romanization)

A reversible method is a useful thing to have, however:

The current function probably should be named differently because it doesn't follow the Kunrei scheme exactly (how about "reversible", or "romkan"?)
The current to_kunrei function is nearly there, but still not perfectly reversible:

to_kunrei:

ぢ -> dyi
でぃ  -> dyi

The Hepburn function is also not reversible, but this is a known property of the scheme.

to_hepburn:

ず -> zu
づ -> zu
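A mapping is reversible only if no two kana share a romanization. The sketch below checks that property for a handful of pairs copied from this report; the helper name `find_collisions` is hypothetical and the dictionaries are not romkan's real tables.

```python
from collections import defaultdict


def find_collisions(kana_to_romaji):
    """Return {romaji: [kana, ...]} for every romaji produced by 2+ kana."""
    by_romaji = defaultdict(list)
    for kana, romaji in kana_to_romaji.items():
        by_romaji[romaji].append(kana)
    return {
        romaji: sorted(kana_list)
        for romaji, kana_list in by_romaji.items()
        if len(kana_list) > 1
    }


# Pairs as reported above, not read from the library.
reported_hepburn = {"ず": "zu", "づ": "zu"}
reported_kunrei = {"ち": "ti", "てぃ": "texi", "ぢ": "dyi", "でぃ": "dyi"}

print(find_collisions(reported_hepburn))
print(find_collisions(reported_kunrei))
```

An empty result over the complete table would mean the forward mapping can be inverted without ambiguity.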

Support for Python 2.5

Install of romkan 0.2.1 fails

pip install romkan failed on Windows 10 with Python 3.9.6, with the following error message:

    README = open(os.path.join(here, 'README.rst')).read()
  File "C:\Python396\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x83 in position 182

I could work around this issue by setting the PYTHONUTF8=1 environment variable.
(In the code, encoding="utf-8" should be added to the open() call.)

Also, there was a deprecation notice:
DeprecationWarning: the imp module is deprecated in favour of importlib

to_kana() doesn't consistently return Hepburn or Kunrei

Hello,

I have already reported a couple of other issues and a PR, but I haven't yet even taken the time to thank you for this neat package... Thank you!!

I am opening this issue because I am a bit confused with which inverse romanization I should expect to_kana(str) to return.

These lines suggest that your intent was for it to return the Hepburn version if possible, otherwise the Kunrei version:
https://github.com/soimort/python-romkan/blob/master/src/romkan/common.py#L373-376

Later however, ROMKAN.update( {"ti": "チ"} ) explicitly prescribes Kunrei over Hepburn:
https://github.com/soimort/python-romkan/blob/master/src/romkan/common.py#L382-383
( "チ" is Kunrei, "ティ" is Hepburn)

What is the rationale behind this?
Is the intent to emulate keyboard input method ("wapuro" style) inverse romanization?

Thanks!
Baptiste

Romkan repeatedly fails to convert romaji "fu" into Kana at all.

This appears to be a fundamental failing within the way "fu" is specifically handled.

I'll be trying to work out an alternate "fix" so that the module works as intended.

importlib load_source Error while running setup.py

Hi
I am facing this error after running the setup.py file. Please help

$ python setup.py install
Traceback (most recent call last):
  File "C:\Users\lahhe\Documents\python-romkan-master\setup.py", line 17, in <module>
    VERSION = imp.load_source('version', os.path.join(here, 'src/%s/version.py' % PACKAGE_NAME)).version
              ^^^^^^^^^^^^^^^
AttributeError: module 'importlib' has no attribute 'load_source'
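`importlib` has no drop-in `load_source`, and the `imp` module was removed in Python 3.12. A minimal sketch of an equivalent built from `importlib.util` follows; the helper name `load_source`, the temporary `version.py`, and its `__version__` attribute are assumptions for the demo, not the project's actual files.

```python
import importlib.util
import os
import tempfile


def load_source(module_name, path):
    """Load a module from a file path, similar in spirit to imp.load_source."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway version.py (hypothetical content).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "version.py")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write('__version__ = "0.0.0"\n')

version_module = load_source("version", demo_path)
print(version_module.__version__)
```

Unlike `imp.load_source`, this sketch does not register the module in `sys.modules`, which is usually fine for reading a version string in a setup script.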