Export UNIHAN's database to CSV, JSON, or YAML

Home Page: https://unihan-etl.git-pull.com

License: MIT License

unihan cjk chinese japanese korean chinese-dictionary chinese-words unihan-database dictionary unicode

unihan-etl's Introduction

unihan-etl

An ETL tool for the Unicode Han Unification (UNIHAN) database releases. unihan-etl is designed to fetch (download), unpack (unzip), and convert the database from the Unicode website into either a flattened, tabular format or a structured, hierarchical format.

unihan-etl serves dual purposes: as a Python library offering an API for accessing data as Python objects, and as a command-line interface (CLI) for exporting data into CSV, JSON, or YAML formats.

This tool is a component of the cihai suite of CJK-related projects. For a similar tool, see libUnihan.

As of v0.31.0, unihan-etl is compatible with UNIHAN Version 15.1.0 (released on 2023-09-01, revision 35).

The UNIHAN database

The UNIHAN database organizes data across multiple files, exemplified below:

U+3400	kCantonese		jau1
U+3400	kDefinition		(same as U+4E18 丘) hillock or mound
U+3400	kMandarin		qiū
U+3401	kCantonese		tim2
U+3401	kDefinition		to lick; to taste, a mat, bamboo bark
U+3401	kHanyuPinyin		10019.020:tiàn
U+3401	kMandarin		tiàn

Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to entries in the Hànyǔ Dà Zìdiǎn dictionary, where 10019.020:tiàn represents one entry. Further variations complicate parsing:

U+5EFE	kHanyuPinyin		10513.110,10514.010,10514.020:gǒng
U+5364	kHanyuPinyin		10093.130:xī,lǔ 74609.020:lǔ,xī

kHanyuPinyin supports multiple entries delimited by spaces. A ":" (colon) separates locations in the work from pinyin readings, and a "," (comma) separates multiple locations or readings. This is just one of about 90 fields in the database.
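For illustration, a minimal standalone sketch of that parsing logic (a hypothetical helper, not unihan-etl's actual parser):

def parse_khanyupinyin(value: str) -> list[dict]:
    """Split a kHanyuPinyin value into location/reading entries."""
    entries = []
    for entry in value.split():  # multiple entries are space-delimited
        locations, _, readings = entry.partition(":")  # ":" splits locations from readings
        entries.append({
            "locations": locations.split(","),  # "," separates multiple locations
            "readings": readings.split(","),    # ...and multiple readings
        })
    return entries

parse_khanyupinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")
# -> [{'locations': ['10093.130'], 'readings': ['xī', 'lǔ']},
#     {'locations': ['74609.020'], 'readings': ['lǔ', 'xī']}]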

Tabular, "Flat" output

CSV (default)

$ unihan-etl
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn

With $ unihan-etl -F yaml --no-expand:

- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

To preview in the CLI, try tabview or csvlens.

JSON

$ unihan-etl -F json --no-expand
[
  {
    "char": "",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]

Filter via the CLI with jq.
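For example, a hypothetical query that pulls one codepoint's definition from an exported unihan.json (path per the Code layout section below):

$ jq '.[] | select(.ucn == "U+3401") | .kDefinition' unihan.json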

YAML

$ unihan-etl -F yaml --no-expand
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

Filter via the CLI with yq.
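For example, with the Go-based yq (v4 syntax; a hypothetical query against the exported file):

$ yq '.[] | select(.ucn == "U+3401")' unihan.yaml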

"Structured" output

Codepoints can pack in far more detail, so unihan-etl carefully expands these values in a uniform manner. Empty values are pruned.

To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.

Why not CSV?

CSV is only suitable for storing flat, table-like information. Formats such as JSON and YAML support nested key-value pairs and hierarchical entries.

JSON

$ unihan-etl -F json
[
  {
    "char": "",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]

YAML

$ unihan-etl -F yaml
- char: 㐀
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401

Features

  • automatically downloads UNIHAN from the internet
  • strives for accuracy with the specifications described in UNIHAN's database design
  • export to JSON, CSV and YAML (requires pyyaml) via -F
  • configurable to export specific fields via -f
  • accounts for encoding conflicts due to the Unicode-heavy content
  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
  • core component and dependency of cihai, a CJK library
  • data package support
  • expansion of multi-value delimited fields in YAML, JSON and Python dictionaries
  • supports Python >= 3.7 and PyPy

If you encounter a problem or have a question, please create an issue.

Installation

To download and build your own UNIHAN export:

$ pip install --user unihan-etl

or with pipx:

$ pipx install unihan-etl

Developmental releases

pip:

$ pip install --user --upgrade --pre unihan-etl

pipx:

$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
# Usage: unihan-etl@next

Usage

unihan-etl offers customizable builds via its command-line arguments.

See unihan-etl CLI arguments for information on how you can specify fields, files, download URLs, and the output destination.

To output CSV, the default format:

$ unihan-etl

To output JSON:

$ unihan-etl -F json

To output YAML:

$ pip install --user pyyaml
$ unihan-etl -F yaml

To only output the kDefinition field in a CSV:

$ unihan-etl -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-etl -f kCantonese kDefinition

To output to a custom file:

$ unihan-etl --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-etl --destination ./exported.{ext}

See unihan-etl CLI arguments for advanced usage examples.

Code layout

# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  core.py       # argparse, download, extract, transform UNIHAN's data
  options.py    # configuration object
  constants.py  # immutable data vars (field-to-filename mappings, etc.)
  expansion.py  # extracting details baked inside of fields
  types.py      # type annotations
  util.py       # utility / helper functions

# test suite
tests/*

API

The package is Python under the hood; you can utilize its full API. Example:

>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')
True
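
A minimal sketch of a full fetch-and-export run, assuming the from_cli constructor that appears in the issue reports further below (the exact flag combination here is illustrative, not the definitive API):

from unihan_etl.core import Packager

# Configure the packager just as the CLI would (illustrative flags)
pkgr = Packager.from_cli(["-F", "json", "-f", "kDefinition"])
pkgr.download()  # fetch and unpack Unihan.zip into the cache dir
pkgr.export()    # write unihan.json to the XDG data dir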

Developing

$ git clone https://github.com/cihai/unihan-etl.git
$ cd unihan-etl

Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).


unihan-etl's People

Contributors

dependabot-preview[bot], pre-commit-ci[bot], pyup-bot, tony


unihan-etl's Issues

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

Update UNIHAN 11.0.0 -> 15.1.0 (2019 -> 2023)

As of 86865e0 we support UNIHAN rev 25 via Unicode 11.0.0 (https://www.unicode.org/reports/tr38/tr38-25.html#History)


Changes to UNIHAN

  • 15.1.0: added kJapanese, kMojiJoho, kSMSZD2003Index, kSMSZD2003Readings, kVietnameseNumeric, kZhuangNumeric; removed kHKSCS, kIRGDaiKanwaZiten, kKPS0, kKPS1, kKSC0, kKSC1, kRSKangXi
  • 15.0.0: added kAlternateTotalStrokes
  • 14.0.0: added kStrange
  • 13.0.0: added kIRG_SSource, kIRG_UKSource, kSpoofingVariant, kTGHZ2013, kUnihanCore2020; removed kRSJapanese, kRSKanWa, kRSKorean
  • 12.0.0: removed kDefaultSortKey (private property)
  • 11.0.0: added kJinmeiyoKanji, kJoyoKanji, kKoreanEducationHanja, kKoreanName, kTGH

Next generation (needs funding and time)

P.S. I am out of contact with anyone from UNIHAN. Is someone else already working on the same effort as me? Can this effort be shared in any way?

This project can do much more to unlock the breadth and depth of UNIHAN:

  • Sustainability (from an informatic standpoint)

  • Correctness

    Digging deeper into the database design, more needs to be done to ensure extraction and interrelation are handled in a structured and detailed way.

  • Documentation

  • Typings

  • Speed and performance

  • Potentially

    • Checking and code generation

      Perhaps https://www.unicode.org/reports/tr38/ can be crawled and used to verify correctness and, to an extent, to generate code in the future.

    • Cross-language compatibility

    • Language-based speedups, e.g. Rust-based JSON / YAML / CSV parsing. Perhaps the whole core could be a Rust-based package with bindings to other languages.

UNIHAN can be made even more accessible to the masses - I am the one who can make this happen, but it would take time and, above all, funding. This would need to be the full focus of my free time outside of work for months, or even longer.

Support for Unicode 10.0 (2017-06-15)

https://www.unicode.org/reports/tr38/tr38-23.html

https://www.unicode.org/reports/tr38/tr38-23.html#History

Modifications

Revision 23

  • Reissued for Unicode 10.0.0.
  • Updated the regular expressions for the kIICore, kIRG_GSource, kIRG_HSource, kIRG_JSource, kIRG_KSource, and kRSKangXi fields.
  • Updated terminology to reflect the difference between the IRG's U-source and the set of characters submitted by the UTC (now referred to as the "UTC-source").
  • Added references to CJK Unified Ideographs Extension F block.
  • Revision 22 being a proposed update, only changes between revisions 21 and 23 are noted here.

Revision 21

  • Reissued for Unicode 9.0.0.
  • Updated the descriptions and regular expressions for the kIRG_GSource and kIRG_JSource fields.
  • Updated the regular expression for the kHangul field.
  • Updated the description of the kKorean field to clarify its current status.
  • Updated the description and regular expression for the kRSUnicode field to allow for negative residual stroke values.

Clean bad data?

Hi,

I have found an instance of bad data in the database. I guess there could be more. Should the UniHan data be automatically cleaned before importing?

from cihai.core import Cihai
from cihai.bootstrap import bootstrap_unihan

# Build the UNIHAN database on first run
cihan = Cihai()
if not cihan.is_bootstrapped:
    bootstrap_unihan(cihan.metadata)

cihan.reflect_db()

# Look up a character and print a field containing malformed data
c = cihan.lookup_char('任').first()
print(c.kZVariant)
$ python bad_data.py
U+4EFC<kHKGlyph

option to save as a dictionary instead of a list

I have found that I always need to convert the data into a dictionary (instead of the default list) when I'm using it. Because of this, I decided to always store the file in dictionary format. My method for doing so is a bit hacky, and it would be great to have a --structure <dict|list> or even --dictionary parameter to do this within unihan_etl.

Here's my current code. It relies on the undocumented python formatting option:

from unihan_etl.process import Packager as unihan_packager
from unihan_etl.process import export_json

def unihan_download(unihan_file):
    # destination argument is required even though the packager will not write the file
    p = unihan_packager.from_cli(["-F", "json", "--destination", unihan_file])
    p.download()
    # instruct packager to return data instead of writing to file
    p.options["format"] = "python"
    unihan = p.export()

    # convert from list to dictionary
    unihan_dict = {entry["char"]: entry for entry in unihan}

    export_json(unihan_dict, unihan_file)

kRSUnicode bug

That was a really fast response :D

This is actually my bad; the latest unihan_etl already has a fix for this in place, and I mistakenly thought I had updated.

The issue is a typo in the kRSUnicode field for 亀: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E4%BA%80. It has two apostrophes, which does not follow the syntax specified in the standard. unihan_etl has already updated its parsing to allow the second apostrophe.

I did have to update my code for some unihan_etl changes, but nothing crazy.

See also: #233 (comment)

datapackage question

Hi, thank you for your project! It's wonderful and easy to use.

Just a question for you: I see from your old TODOs and from your comments in the cihai project that you originally planned to make this into a data package. I'm also interested in the idea of datapackages, but haven't completely wrapped my head around how they work. Did you abandon the idea of creating a datapackage because it didn't seem worthwhile, or because there was some specific shortcoming? The ecosystem seems to have evolved quite a bit recently. I would love to start working on CJKV data package work, but I want to hear someone else's experiences with it first before I commit to starting.

Thanks for your time! (You can close this if you don't feel like answering; it's not really an "issue", per se).

Node wrapper?

Hi,

Looks like an excellent project! A potentially great companion to https://github.com/mathiasbynens/node-unicode-data for us Node users...

I am intending to look into making a Node wrapper for your library that can be used with npm.

If I get to this, would you be interested in including it within your repository if it ends up being fairly lightweight? For now, I'm planning to use "node-unihan-etl" as the npm name in case you decide to claim "unihan-etl", but let me know if you'd be OK with me going ahead with the latter.

Find out what to do with `zhon`

Python 3.10+ is going to complain about these escape sequences

See tsroten/zhon#34

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40: DeprecationWarning: invalid escape sequence '\]'
    non_stops = """"#$%&'()*+,-/:;<=>@[\]^_`{|}~"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153: DeprecationWarning: invalid escape sequence '\]'
    """[%(stops)s]['"\]\}\)]*"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154: DeprecationWarning: invalid escape sequence '\-'
    ) % {'word': word, 'non_stops': non_stops.replace('-', '\-'),

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

P.S. @tsroten - thank you for the package!
