dahlia / seonbi

SmartyPants for Korean language

Home Page: https://hackage.haskell.org/package/seonbi

License: GNU Lesser General Public License v2.1

seonbi's Introduction

Seonbi: SmartyPants for Korean language

(TL;DR: See the demo web app.)

Seonbi (선비) is an HTML preprocessor that makes typographic adjustments to an HTML document so that the result uses accurate punctuation according to modern Korean orthography. (It's similar to what SmartyPants does for text written in English.)

It also transforms ko-Kore text (國漢文混用; Korean mixed script) into ko-Hang text (한글전용; Hangul-only script).

Seonbi provides a Haskell library, a CLI, and an HTTP API; any of them can perform the following transformations:

  • All hanja words (e.g., 漢字) into corresponding hangul-only words (e.g., 한자)
  • Straight quotes and apostrophes (" & ') into curly quotes HTML entities (“, ”, ‘, & ’)
  • Three consecutive periods (... or 。。。) into an ellipsis entity (…)
  • Classical (Chinese-style) stops (。 & 、) into modern (English-style) stops (. & ,)
  • Pairs of less-than and greater-than inequality symbols (< & >) into pairs of proper angle quotes (〈 & 〉)
  • Pairs of two consecutive inequality symbols (<< & >>) into pairs of proper double angle quotes (《 & 》)
  • A hyphen (-) or hangul vowel eu (ㅡ) surrounded by spaces, or two/three consecutive hyphens (-- or ---) into a proper em dash (—)
  • A less-than inequality symbol followed by a hyphen or an equality symbol (<-, <=) into arrows to the left (←, ⇐)
  • A hyphen or an equality symbol followed by a greater-than inequality symbol (->, =>) into arrows to the right (→, ⇒)
  • A hyphen or an equality symbol wrapped by inequality symbols (<->, <=>) into bi-directional arrows (↔, ⇔)

Each transformation can be turned on and off individually, and some transformations have many options.
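To make the substitution rules above concrete, here is a minimal Python sketch of three of them (ellipsis, arrows, and stops) applied to plain text. It is only an illustration: the `adjust` function and the `ARROWS` table are hypothetical names, it ignores HTML structure entirely, and it is not how Seonbi itself is implemented.

```python
import re

# Illustrative subset of the rules listed above; not Seonbi's implementation.
# Longer patterns come first so "<->" is not consumed as "<-" followed by ">".
ARROWS = [
    ("<->", "↔"), ("<=>", "⇔"),
    ("<-", "←"), ("<=", "⇐"),
    ("->", "→"), ("=>", "⇒"),
]


def adjust(text: str) -> str:
    # Three consecutive periods (ASCII or classical) become one ellipsis.
    text = re.sub(r"\.\.\.|。。。", "…", text)
    # Arrow-like sequences become arrow characters.
    for source, target in ARROWS:
        text = text.replace(source, target)
    # Remaining classical stops become modern stops.
    return text.replace("。", ".").replace("、", ",")
```

The ellipsis rule has to run before the stop rule; otherwise 。。。 would already have been rewritten into three periods by the time the ellipsis pattern is tried, which still works for this sketch but makes the order of rules harder to reason about.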

All transformations work with both plain text and rich text trees. Similarly to SmartyPants, Seonbi does not modify characters within sensitive HTML elements such as <pre>, <code>, <script>, and <kbd>. Chinese/Japanese stops or hanzi/kanji characters inside elements with lang="zh" or lang="ja" are never transformed. [1]
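The idea of leaving sensitive elements alone can be sketched with Python's standard `html.parser`. The `TextCollector` class and `classify` function below are hypothetical names for illustration; the sketch only marks which text runs would be eligible for transformation and assumes the four elements named above are the whole list, which may not match Seonbi's actual set.

```python
from html.parser import HTMLParser

# Elements named in the paragraph above; the real list may be longer.
SKIPPED = {"pre", "code", "script", "kbd"}


class TextCollector(HTMLParser):
    """Collects (text, transformable) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of currently open SKIPPED elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text is transformable only when no SKIPPED element is open.
        self.chunks.append((data, self.depth == 0))


def classify(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For example, `classify('<p>a -> b <code>x -> y</code></p>')` yields one transformable run (`'a -> b '`) and one protected run (`'x -> y'`).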

End-user apps

Technically, Seonbi is exposed as a software component, i.e., an API (application programming interface), to be used as a module of other software.

However, since these official interfaces are meant for machines rather than humans, they are not easy to use for end users who have no programming experience. For such end users, here's the list of end-user apps:

Installation

Seonbi provides official executable binaries for Linux (x86_64), macOS (Apple silicon & Intel), and Windows (64-bit). You can download them from the releases page.

If you prefer Scoop on Windows, use the official Seonbi bucket:

scoop bucket add seonbi https://github.com/dahlia/seonbi.git
scoop install seonbi

It is also distributed as a Docker image:

$ echo '訓民正音' | docker run -i dahlia/seonbi:latest seonbi
훈민정음

If you'd like to use it on GitHub Actions, there is the dahlia/seonbi/setup action:

- uses: dahlia/seonbi/setup
- run: seonbi -o output.html input.html

If you want to use it as a Haskell library, install the seonbi package using Stack or Cabal.

CLI

The seonbi command reads the input HTML from standard input and writes the transformed HTML to standard output:

seonbi < input.html > output.html

You can pass a filename as an argument instead (the default is -, which means standard input):

seonbi input.html > output.html

There is an -o/--output option as well:

seonbi -o output.html input.html

Although it automatically detects the text encoding of the input file, you can specify it explicitly with -e/--encoding:

seonbi -e euc-kr -o output.html input.html

Although there are several style options, e.g., -q/--quote, -c/--cite, -r/--render-hanja, in most cases, giving -p/--preset is enough:

echo '平壤 冷麵' | seonbi -p ko-kr  # 평양 냉면
echo '平壤 冷麵' | seonbi -p ko-kp  # 평양 랭면

Read -h/--help for details:

seonbi --help
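If you want to drive the CLI from a script, a thin wrapper is usually enough. The following Python sketch uses only the options shown above (-p, -e, -o); `build_command` and `convert` are hypothetical helper names, and `convert` assumes that the seonbi executable is on PATH and that input and output are UTF-8.

```python
import subprocess


def build_command(input_path, output_path=None, preset=None, encoding=None):
    """Builds an argv list using only the options documented above."""
    argv = ["seonbi"]
    if preset:
        argv += ["-p", preset]
    if encoding:
        argv += ["-e", encoding]
    if output_path:
        argv += ["-o", output_path]
    argv.append(input_path)
    return argv


def convert(text, preset="ko-kr"):
    """Pipes text through seonbi via standard input and output.

    Assumes the seonbi executable is installed and that it reads and
    writes UTF-8 here; neither is checked by this sketch.
    """
    result = subprocess.run(
        ["seonbi", "-p", preset],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

`build_command("input.html", "output.html", encoding="euc-kr")` produces the same argument list as the -e example above.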

HTTP API

The seonbi-api command starts an HTTP server that takes POST requests containing an HTML source and transformation options, and responds with the transformed HTML. You can set the hostname and port number with the -H/--host and -p/--port options:

seonbi-api -H 0.0.0.0 -p 3800

The following is an example request:

POST / HTTP/1.1
Content-Type: application/json
Host: localhost:3800

{
  "preset": "ko-kr",
  "contentType": "text/html",
  "sourceHtml": "<p>하늘과 바람과 별과 詩</p>"
}

The HTTP API server would respond like this:

HTTP/1.1 200 OK
Content-Type: application/json
Server: Seonbi/0.3.0

{
  "success": true,
  "contentType": "text/html",
  "resultHtml": "<p>하늘과 바람과 별과 시</p>"
}
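A client for this exchange can be sketched with Python's standard library alone. The sketch below relies only on the fields visible in the example request and response (preset, contentType, sourceHtml, success, resultHtml); the function names and the default URL are assumptions, and the shape of a failure response is not documented here, so the sketch simply raises with the whole decoded body.

```python
import json
import urllib.request


def build_request(source_html, preset="ko-kr", url="http://localhost:3800/"):
    """Builds a POST request mirroring the example request above."""
    payload = {
        "preset": preset,
        "contentType": "text/html",
        "sourceHtml": source_html,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body):
    """Extracts resultHtml from a response body like the example above."""
    result = json.loads(body)
    if not result.get("success"):
        # The failure format is an unknown here, so report everything.
        raise RuntimeError(f"seonbi-api reported failure: {result!r}")
    return result["resultHtml"]


def transform(source_html, **kwargs):
    """Sends the request; requires a running seonbi-api server."""
    with urllib.request.urlopen(build_request(source_html, **kwargs)) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Splitting request building and response parsing from the network call keeps both halves easy to check without a server.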

If a web app needs to use the HTTP API server, CORS should be configured through the --allow-origin/-o option:

seonbi-api -o https://example.com

To learn more about parameters interactively, try the demo web app.

Haskell API

All functions and types live in the Text.Seonbi module and its submodules. The highest-level API is the Text.Seonbi.Facade module.

See also the API docs or Hackage.

Deno API

There is a simple client library for Deno as well. See also the scripts/deno/ directory.

License

Distributed under LGPL 2.1 or later.

Etymology

Seonbi (선비) refers to a classical scholar of the Joseon period (14c–19c). Today, on the Korean internet, there is a meme that calls a person who feels morally superior or is elitist a seonbi. So seonbi and smarty pants have some things in common.

Footnotes

  1. Technically, only Korean content and language-unspecified elements are transformed. Elements whose lang attribute has a language tag referring to any Korean language are treated as Korean content, e.g., ko, ko-Hang, kor-KP, kor-Kore.

seonbi's People

Contributors

dahlia, difro, dolsup, item4, moreal, suminb

seonbi's Issues

Initial Sound Law (頭音法則) is applied to non-initial hanja

Apparently the Initial Sound Law (頭音法則) is applied to non-initial hanja characters of Sino-Korean words that are not listed in the dictionary:

$ echo 可利 | seonbi --preset=ko-kr
가이
$ echo 營利 | seonbi --preset=ko-kr
영리

However, it works correctly when the dictionary is disabled entirely:

$ echo 可利 | seonbi --no-kr-stdict
가리
$ echo 營利 | seonbi --no-kr-stdict
영리

I guess it looks up each single character in the dictionary instead of using Unicode's kHangul property.
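The expected behavior can be sketched in Python: take each character's base reading and apply the Initial Sound Law to the first syllable only. The `readings` table below is a hypothetical stand-in for per-character readings, and the rule implemented is just the basic one (ㄹ and ㄴ before ㅑ, ㅕ, ㅖ, ㅛ, ㅠ, ㅣ become ㅇ; ㄹ before other vowels becomes ㄴ); exceptions in the real orthography are not covered, and this is not Seonbi's code.

```python
BASE = 0xAC00                       # first precomposed hangul syllable (가)
RIEUL, NIEUN, IEUNG = 5, 2, 11      # initial-consonant indices of ㄹ, ㄴ, ㅇ
Y_VOWELS = {2, 6, 7, 12, 17, 20}    # vowel indices of ㅑ ㅕ ㅖ ㅛ ㅠ ㅣ


def initial_sound(syllable):
    """Applies the basic Initial Sound Law to one hangul syllable."""
    index = ord(syllable) - BASE
    if not 0 <= index < 11172:
        return syllable             # not a precomposed hangul syllable
    initial, rest = divmod(index, 21 * 28)
    vowel = rest // 28
    if initial == RIEUL:
        initial = IEUNG if vowel in Y_VOWELS else NIEUN
    elif initial == NIEUN and vowel in Y_VOWELS:
        initial = IEUNG
    return chr(BASE + initial * 21 * 28 + rest)


def read_word(word, readings):
    """Reads a hanja word; only the first syllable gets the law applied."""
    syllables = [readings[ch] for ch in word]
    return initial_sound(syllables[0]) + "".join(syllables[1:])
```

With `readings = {"可": "가", "營": "영", "利": "리", "益": "익"}`, `read_word("可利", readings)` gives 가리 while `read_word("利益", readings)` gives 이익, which is the distinction the bug report is about.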

Each group of tabs in the demo app should be orthogonal

Currently the demo app has two groups of tabs, and there is a bug: selecting a tab in one group also selects the first tab of the other group. It seems that a change in one tab group resets the state of the other. The two groups of tabs should be independent of each other.

Prevent non-Korean contents from being processed

An HTML element can have a lang attribute that indicates the language of its content. If an element's content is guaranteed not to be Korean, i.e., it has a lang attribute and the language tag does not start with ko, Seonbi should not process the element and its contents.

For example, the following input:

<p>小說 <cite lang="ja">徳川家康</cite>는 韓國에서 <cite>大望</cite>이라는
海賊版으로 옮겨진 것이 더 有名했다.</p>

… should be processed like:

<p>소설 <cite lang="ja">徳川家康</cite>는 한국에서 <cite>대망</cite>이라는
해적판으로 옮겨진 것이 더 유명했다.</p>
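One way to get this behavior is to track the effective language of each open element and leave text untouched whenever that language is declared and is not Korean. The Python sketch below does this with the standard `html.parser`; `LangAwareRewriter`, `rewrite`, and `is_korean` are hypothetical names, the `transform` argument stands in for the real hanja conversion, and comments and doctypes are dropped for brevity.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def is_korean(lang):
    """True when no language is declared or the primary subtag is Korean."""
    if lang is None:
        return True
    return lang.split("-")[0].lower() in {"ko", "kor"}


class LangAwareRewriter(HTMLParser):
    def __init__(self, transform):
        super().__init__(convert_charrefs=False)
        self.transform = transform
        self.langs = [None]     # effective language of each open element
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            # A missing lang attribute inherits the parent's language.
            self.langs.append(dict(attrs).get("lang", self.langs[-1]))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag not in VOID and len(self.langs) > 1:
            self.langs.pop()

    def handle_data(self, data):
        korean = is_korean(self.langs[-1])
        self.out.append(self.transform(data) if korean else data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def rewrite(html, transform):
    parser = LangAwareRewriter(transform)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Given a `transform` that knows the words in the example, `rewrite` leaves the content of `<cite lang="ja">` as it is and converts the rest, matching the expected output above.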

Allow different `HanjaWordRenderer`s to be applied for different dictionaries

User custom dictionaries can be used for irregular readings such as place names or names of people (e.g., 차이잉원 instead of 채영문 for 蔡英文). For less familiar names, an author may want to show the original names in Chinese characters to make them clear. However, they may not want to show the Chinese characters of every Sino-Korean word.

To keep a balance between clarity and conciseness, Seonbi should allow different HanjaWordRenderers to be applied for vocabularies from different dictionaries.

Standard Korean Language Dictionary is not included if flag(static) is turned on

The Standard Korean Language Dictionary (data/ko-kr-stdict.tsv) is not included if flag(static) is turned on. This is because Text.Seonbi.Facade.southKoreanDictionary loads it at runtime using Cabal's data-files metadata, but these data files are not included in the executable binary.

This needs to be fixed so that, if flag(static) is turned on at compile time, the code that loads the data file uses file-embed instead of Cabal's data-files metadata. Note that file-embed embeds data files into the executable binary.

Packaging for various platforms

For better access, it would be good to have platform-specific packages:

  • Chocolatey (or Scoop) for Windows
  • Homebrew (or Homebrew Cask) for macOS
  • Snap (or Flatpak) for Linux
  • Action(s) for GitHub Actions (e.g., setup-seonbi)

These don't all have to be made at once. Before starting on one, file a dedicated issue for it.

Bundle required DLLs for Windows binaries

The executable binaries for Windows that are built automatically (through GitHub Actions) neither bundle the required DLL files (e.g., libstdc++-6.dll) nor are statically linked. We should bundle the DLLs or statically link the required libraries into the .exe files.

Keep dyno awake with Uptime Robot

Summary

  • If an app has a free web dyno, and that dyno receives no web traffic in a 30-minute period, it will sleep.
  • Sometimes the Seonbi API hangs or is too slow because the dyno is sleeping.
    • This also affects calls to the Seonbi API from iOS Shortcuts.
  • With Uptime Robot, you can send HTTP requests every five minutes, and it is free.
    • This keeps the dyno awake.

Reference

Pandoc JSON filter

Making Seonbi support every markup language in the world is impractical. Instead, it's much more efficient to let it work as a Pandoc JSON filter, so that the following usage is possible:

pandoc -f asciidoc -t json ko-Kore.adoc | seonbi-pandoc | pandoc -f json -t asciidoc -o ko-Hang.adoc

… or in short:

pandoc --filter seonbi-pandoc -f asciidoc -t asciidoc -o ko-Hang.adoc ko-Kore.adoc 
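The core of such a filter is a walk over Pandoc's JSON AST that rewrites the text of `Str` nodes and leaves everything else, including `Code` and `CodeBlock`, alone. The Python sketch below shows that walk; `map_str` and `run_filter` are hypothetical names and `transform` stands in for the real Seonbi transformation. Note that Pandoc splits running text into separate `Str` and `Space` nodes, so rules that depend on neighboring words (such as a hyphen surrounded by spaces) would need more context than this per-node sketch provides.

```python
import json


def map_str(node, transform):
    """Recursively applies transform to the text of every Str node."""
    if isinstance(node, list):
        return [map_str(child, transform) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "Str" and isinstance(node.get("c"), str):
            return {"t": "Str", "c": transform(node["c"])}
        return {key: map_str(value, transform) for key, value in node.items()}
    return node     # plain strings and numbers (e.g. code text) are kept


def run_filter(document_json, transform):
    """Takes and returns a Pandoc JSON document as a string."""
    document = json.loads(document_json)
    return json.dumps(map_str(document, transform), ensure_ascii=False)
```

Code text is stored as a bare string inside the `Code` node rather than as a `Str` node, which is why it passes through unchanged.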

If there are multiple readings for a word, prefer the longest definition

There are Sino-Korean words that have more than one hangul reading in the Standard Korean Language Dictionary (標準國語大辭典). If a word has multiple readings, the most widely used reading should be chosen. To approximate how widely a reading is used, we could use a heuristic that assumes the reading with the longest definition is the most widely used one. For example, there are two readings for 乾燥:

간조2 (乾燥)

[명사]

  1. ‘건조4’의 원말.

건조4 (乾燥)

[명사]

  1. 말라서 습기가 없음.
  2. 물기나 습기가 말라서 없어짐. 또는 물기나 습기를 말려서 없앰.
  3. 분위기, 정신, 표현, 환경 따위가 여유나 윤기 없이 딱딱함.

Since the definition of 건조4 is longer than that of 간조2, we could choose the reading 건조 for 乾燥.
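The heuristic itself is a one-liner once the entries are grouped by reading. A minimal Python sketch, using the two entries quoted above as data (`choose_reading` is a hypothetical name, and measuring length as the total number of characters across all senses is one possible interpretation of "longest definition"):

```python
def choose_reading(readings):
    """Picks the reading whose definitions are longest in total.

    readings maps each hangul reading to its list of definitions.
    """
    return max(readings, key=lambda r: sum(len(d) for d in readings[r]))


ENTRIES = {
    "간조": ["‘건조4’의 원말."],
    "건조": [
        "말라서 습기가 없음.",
        "물기나 습기가 말라서 없어짐. 또는 물기나 습기를 말려서 없앰.",
        "분위기, 정신, 표현, 환경 따위가 여유나 윤기 없이 딱딱함.",
    ],
}
```

`choose_reading(ENTRIES)` returns 건조. When two readings tie, `max` keeps whichever comes first in the mapping.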

Plain text mode besides (X)HTML mode

Seonbi currently provides two modes for input/output formatting:

  • HTML
  • XHTML

However, sometimes we need to apply adjustments to text before it's compiled to HTML, or to text that won't be compiled to HTML at all. For such cases, it would be good to have one more mode: plain text.

Fortunately, it apparently is not that difficult to implement:

  1. Escape the whole input to HTML character entities (or simply put it to a CDATA section).
  2. Make adjustments on it.
  3. Unescape the whole output.
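The three steps above can be sketched in Python with the standard `html` module. `adjust_plain_text` is a hypothetical name and `adjust_html` stands in for the existing HTML-mode transformation. Quotes are deliberately left unescaped in step 1 (`quote=False`), on the assumption that the quote rules need to see the straight quotes in the text.

```python
import html


def adjust_plain_text(text, adjust_html):
    """Runs an HTML-mode adjustment over plain text.

    adjust_html is assumed to take and return an HTML string.
    """
    escaped = html.escape(text, quote=False)    # 1. escape &, <, >
    adjusted = adjust_html(escaped)             # 2. make adjustments
    return html.unescape(adjusted)              # 3. unescape the output
```

Step 3 also turns any entities produced by the adjustments (for example `&hellip;`) back into plain characters, which is what a plain text mode would want.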

`softprops/action-gh-release@v1` does not work as expected (at least with 0.3-maintenance)

See also: https://github.com/dahlia/seonbi/actions/runs/3942160465/jobs/6750632265

2023-01-18T00:39:09.4401113Z ##[group]Run softprops/action-gh-release@v1
2023-01-18T00:39:09.4401351Z with:
2023-01-18T00:39:09.4401674Z   token: ***
2023-01-18T00:39:09.4401866Z   name: Seonbi 0.3.4
2023-01-18T00:39:09.4402071Z   files: /tmp/sdist/* /tmp/dists/*
2023-01-18T00:39:09.4402283Z ##[endgroup]
2023-01-18T00:39:09.5278174Z 🤔 Pattern '/tmp/sdist/* /tmp/dists/*' does not match any files.
2023-01-18T00:39:09.9290528Z 🤔 /tmp/sdist/* /tmp/dists/* not include valid file.
2023-01-18T00:39:09.9299664Z 🎉 Release ready at https://github.com/dahlia/seonbi/releases/tag/0.3.4

The following issue is apparently relevant: softprops/action-gh-release#280

Homoglyph punctuations

South Korean official documents tend to confuse homoglyphs, such as U+318D HANGUL LETTER ARAEA (ㆍ) and U+00B7 MIDDLE DOT (·). It would be good if Seonbi could recognize punctuation written with homoglyphs.

  • U+318D HANGUL LETTER ARAEA (ㆍ) → U+00B7 MIDDLE DOT (·)

Normalize variant characters to orthodox characters

Sometimes Korean texts contain variant characters (異體字; e.g., 俗字, 略字), or directly quote Simplified Chinese characters (簡化字) or Japanese shinjitai (新字體). As dictionaries tend to index words in a single writing system, these characters need to be normalized to orthodox characters (康熙字典體) so that they can be matched against dictionary indices.

E.g., *** or 毛沢東 → 毛澤東.
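As a rough illustration, the normalization could start as a character-level translation table applied before the dictionary lookup. The table below is a tiny hypothetical sample, not real data from Seonbi. A one-to-one table is also only an approximation, since some simplified characters correspond to more than one orthodox character depending on the word.

```python
# Tiny hypothetical sample; a real table would be much larger and
# would need to handle one-to-many correspondences.
VARIANTS = str.maketrans({
    "沢": "澤",   # Japanese shinjitai
    "泽": "澤",   # Simplified Chinese
    "东": "東",   # Simplified Chinese
})


def normalize(text):
    """Replaces known variant characters with their orthodox forms."""
    return text.translate(VARIANTS)
```

After normalization, `normalize("毛沢東")` is 毛澤東, which can then be looked up in a dictionary indexed by orthodox forms.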
