Export UNIHAN's database to CSV, JSON, or YAML

Home Page: https://unihan-etl.git-pull.com

License: MIT License

unihan cjk chinese japanese korean chinese-dictionary chinese-words unihan-database dictionary unicode

unihan-etl's Introduction

unihan-etl

An ETL tool for the Unicode Han Unification (UNIHAN) database releases. unihan-etl is designed to fetch (download), unpack (unzip), and convert the database from the Unicode website into either a flattened, tabular format or a structured, hierarchical format.

unihan-etl serves dual purposes: as a Python library offering an API for accessing data as Python objects, and as a command-line interface (CLI) for exporting data into CSV, JSON, or YAML formats.

This tool is a component of the cihai suite of CJK-related projects. For a similar tool, see libUnihan.

As of v0.31.0, unihan-etl is compatible with UNIHAN Version 15.1.0 (released on 2023-09-01, revision 35).

The UNIHAN database

The UNIHAN database organizes data across multiple files, exemplified below:

U+3400	kCantonese		jau1
U+3400	kDefinition		(same as U+4E18 丘) hillock or mound
U+3400	kMandarin		qiū
U+3401	kCantonese		tim2
U+3401	kDefinition		to lick; to taste, a mat, bamboo bark
U+3401	kHanyuPinyin		10019.020:tiàn
U+3401	kMandarin		tiàn

Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to entries in the Hànyǔ Dà Zìdiǎn dictionary, where 10019.020:tiàn represents one entry. Further variations complicate parsing:

U+5EFE	kHanyuPinyin		10513.110,10514.010,10514.020:gǒng
U+5364	kHanyuPinyin		10093.130:xī,lǔ 74609.020:lǔ,xī

kHanyuPinyin supports multiple entries delimited by spaces. A ":" (colon) separates locations in the work from pinyin readings, and a "," (comma) separates multiple locations or readings. This is just one of about 90 fields in the database.
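For illustration, a minimal standalone sketch of that parsing logic (a hypothetical helper, not unihan-etl's actual parser):

def parse_khanyupinyin(value: str) -> list[dict]:
    """Split a kHanyuPinyin value into location/reading entries."""
    entries = []
    for entry in value.split():  # multiple entries are space-delimited
        locations, _, readings = entry.partition(":")  # ":" splits locations from readings
        entries.append({
            "locations": locations.split(","),  # "," separates multiple locations
            "readings": readings.split(","),    # ...and multiple readings
        })
    return entries

parse_khanyupinyin("10093.130:xī,lǔ 74609.020:lǔ,xī")
# -> [{'locations': ['10093.130'], 'readings': ['xī', 'lǔ']},
#     {'locations': ['74609.020'], 'readings': ['lǔ', 'xī']}]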

Tabular, "Flat" output

CSV (default)

$ unihan-etl
char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn

With $ unihan-etl -F yaml --no-expand:

- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

To preview in the CLI, try tabview or csvlens.

JSON

$ unihan-etl -F json --no-expand
[
  {
    "char": "",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]

Filter via the CLI with jq.
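For example, a hypothetical query that pulls one codepoint's definition from an exported unihan.json (path per the Code layout section below):

$ jq '.[] | select(.ucn == "U+3401") | .kDefinition' unihan.json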

YAML

$ unihan-etl -F yaml --no-expand
- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

Filter via the CLI with yq.
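For example, with the Go-based yq (v4 syntax; a hypothetical query against the exported file):

$ yq '.[] | select(.ucn == "U+3401")' unihan.yaml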

"Structured" output

Codepoints can pack in far more detail, so unihan-etl carefully expands these values in a uniform manner. Empty values are pruned.

To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.

Why not CSV?

CSV is only suitable for storing flat, table-like information. Formats such as JSON and YAML support nested key-value pairs and hierarchical entries.

JSON

$ unihan-etl -F json
[
  {
    "char": "",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]

YAML

$ unihan-etl -F yaml
- char: 㐀
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401

Features

  • automatically downloads UNIHAN from the internet
  • strives for accuracy with the specifications described in UNIHAN's database design
  • export to JSON, CSV and YAML (requires pyyaml) via -F
  • configurable to export specific fields via -f
  • accounts for encoding conflicts due to the Unicode-heavy content
  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
  • core component and dependency of cihai, a CJK library
  • data package support
  • expansion of multi-value delimited fields in YAML, JSON and Python dictionaries
  • supports Python >= 3.7 and PyPy

If you encounter a problem or have a question, please create an issue.

Installation

To download and build your own UNIHAN export:

$ pip install --user unihan-etl

or with pipx:

$ pipx install unihan-etl

Developmental releases

pip:

$ pip install --user --upgrade --pre unihan-etl

pipx:

$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
# Usage: unihan-etl@next

Usage

unihan-etl offers customizable builds via its command-line arguments.

See unihan-etl CLI arguments for information on how you can specify fields, files, download URLs, and the output destination.

To output CSV, the default format:

$ unihan-etl

To output JSON:

$ unihan-etl -F json

To output YAML:

$ pip install --user pyyaml
$ unihan-etl -F yaml

To only output the kDefinition field in a CSV:

$ unihan-etl -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-etl -f kCantonese kDefinition

To output to a custom file:

$ unihan-etl --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-etl --destination ./exported.{ext}

See unihan-etl CLI arguments for advanced usage examples.

Code layout

# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  core.py       # argparse, download, extract, transform UNIHAN's data
  options.py    # configuration object
  constants.py  # immutable data vars (field-to-filename mappings, etc.)
  expansion.py  # extracting details baked inside of fields
  types.py      # type annotations
  util.py       # utility / helper functions

# test suite
tests/*

API

The package is Python under the hood; you can utilize its full API. Example:

>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')
True
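
A minimal sketch of a full fetch-and-export run, assuming the from_cli constructor that appears in the issue reports further below (the exact flag combination here is illustrative, not the definitive API):

from unihan_etl.core import Packager

# Configure the packager just as the CLI would (illustrative flags)
pkgr = Packager.from_cli(["-F", "json", "-f", "kDefinition"])
pkgr.download()  # fetch and unpack Unihan.zip into the cache dir
pkgr.export()    # write unihan.json to the XDG data dir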

Developing

$ git clone https://github.com/cihai/unihan-etl.git
$ cd unihan-etl

Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).


unihan-etl's People

Contributors

dependabot-preview[bot], pre-commit-ci[bot], pyup-bot, tony


unihan-etl's Issues

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

Update UNIHAN 11.0.0 -> 15.1.0 (2019 -> 2023)

As of 86865e0 we support UNIHAN rev 25 via Unicode 11.0.0 (https://www.unicode.org/reports/tr38/tr38-25.html#History)


Changes to UNIHAN

  • 15.1.0: added kJapanese, kMojiJoho, kSMSZD2003Index, kSMSZD2003Readings, kVietnameseNumeric, kZhuangNumeric; removed kHKSCS, kIRGDaiKanwaZiten, kKPS0, kKPS1, kKSC0, kKSC1, kRSKangXi
  • 15.0.0: added kAlternateTotalStrokes
  • 14.0.0: added kStrange
  • 13.0.0: added kIRG_SSource, kIRG_UKSource, kSpoofingVariant, kTGHZ2013, kUnihanCore2020; removed kRSJapanese, kRSKanWa, kRSKorean
  • 12.0.0: removed kDefaultSortKey (private property)
  • 11.0.0: added kJinmeiyoKanji, kJoyoKanji, kKoreanEducationHanja, kKoreanName, kTGH

Next generation (needs funding and time)

P.S. I am out of contact with anyone from UNIHAN. Is someone else already working on the same effort as me? Can this effort be shared in any way?

This project can do much more to unlock the breadth and depth of UNIHAN:

  • Sustainability (from an informatic standpoint)

  • Correctness

    Digging deeper into the database design, more needs to be done to ensure extraction and interrelation are handled in a structured and detailed way.

  • Documentation

  • Typings

  • Speed and performance

  • Potentially

    • Checking and code generation

      Perhaps https://www.unicode.org/reports/tr38/ can be crawled and used to verify correctness and, to an extent, to generate code in the future.

    • Cross-language compatibility

    • Language-based speedups, e.g. Rust-based JSON / YAML / CSV parsing. Perhaps the whole core could be a Rust-based package with bindings to other languages.

UNIHAN can be made even more accessible to the masses - I am the one who can make this happen, but it would take time and, above all, funding. This would need to be the full focus of my free time outside of work for months, or even longer.

Support for Unicode 10.0 (2017-06-15)

https://www.unicode.org/reports/tr38/tr38-23.html

https://www.unicode.org/reports/tr38/tr38-23.html#History

Modifications

Revision 23

  • Reissued for Unicode 10.0.0.
  • Updated the regular expressions for the kIICore, kIRG_GSource, kIRG_HSource, kIRG_JSource, kIRG_KSource, and kRSKangXi fields.
  • Updated terminology to reflect the difference between the IRG's U-source and the set of characters submitted by the UTC (now referred to as the "UTC-source").
  • Added references to CJK Unified Ideographs Extension F block.
  • Revision 22 being a proposed update, only changes between revisions 21 and 23 are noted here.

Revision 21

  • Reissued for Unicode 9.0.0.
  • Updated the descriptions and regular expressions for the kIRG_GSource and kIRG_JSource fields.
  • Updated the regular expression for the kHangul field.
  • Updated the description of the kKorean field to clarify its current status.
  • Updated the description and regular expression for the kRSUnicode field to allow for negative residual stroke values.

Clean bad data?

Hi,

I have found an instance of bad data in the database. I guess there could be more. Should the UniHan data be automatically cleaned before importing?

from cihai.core import Cihai
from cihai.bootstrap import bootstrap_unihan

# Build the UNIHAN database on first run
cihan = Cihai()
if not cihan.is_bootstrapped:
    bootstrap_unihan(cihan.metadata)

cihan.reflect_db()

# Look up a character and print a field containing malformed data
c = cihan.lookup_char('任').first()
print(c.kZVariant)
$ python bad_data.py
U+4EFC<kHKGlyph

option to save as a dictionary instead of a list

I have found that I always need to convert the data into a dictionary (instead of the default list) when I'm using it. Because of this, I decided to always store the file in dictionary format. My method for doing so is a bit hacky, and it would be great to have a --structure <dict|list> or even --dictionary parameter to do this within unihan_etl.

Here's my current code. It relies on the undocumented python formatting option:

from unihan_etl.process import Packager as unihan_packager
from unihan_etl.process import export_json

def unihan_download(unihan_file):
    # destination argument is required even though the packager will not write the file
    p = unihan_packager.from_cli(["-F", "json", "--destination", unihan_file])
    p.download()
    # instruct packager to return data instead of writing to file
    p.options["format"] = "python"
    unihan = p.export()

    # convert from list to dictionary
    unihan_dict = {entry["char"]: entry for entry in unihan}

    export_json(unihan_dict, unihan_file)

kRSUnicode bug

That was a really fast response :D

This is actually my bad; the latest unihan_etl already has a fix for this in place, and I mistakenly thought I had updated.

The issue is a typo in the kRSUnicode field for 亀: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%E4%BA%80. It has two apostrophes, which does not follow the syntax specified in the standard. unihan_etl has already updated its parsing to allow the second apostrophe.

I did have to update my code for some unihan_etl changes, but nothing crazy.

See also: #233 (comment)

datapackage question

Hi, thank you for your project! It's wonderful and easy to use.

Just a question for you: I see from your old TODOs and from your comments in the cihai project that you originally planned to make this into a data package. I'm also interested in the idea of datapackages, but haven't completely wrapped my head around how they work. Did you abandon the idea of creating a datapackage because it didn't seem worthwhile, or because there was some specific shortcoming? The ecosystem seems to have evolved quite a bit recently. I would love to start working on CJKV data package work, but I want to hear someone else's experiences with it first before I commit to starting.

Thanks for your time! (You can close this if you don't feel like answering; it's not really an "issue", per se).

Node wrapper?

Hi,

Looks like an excellent project! A potentially great companion to https://github.com/mathiasbynens/node-unicode-data for us Node users...

I am intending to look into making a Node wrapper for your library that can be used with npm.

If I get to this, would you be interested in including it within your repository if it ends up being fairly lightweight? For now, I'm planning to use "node-unihan-etl" as the npm name in case you decide to claim "unihan-etl", but let me know if you'd be OK with me going ahead with the latter.

Find out what to do with `zhon`

Python 3.10+ is going to complain about these escape sequences

See tsroten/zhon#34

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:40: DeprecationWarning: invalid escape sequence '\]'
    non_stops = """"#$%&'()*+,-/:;<=>@[\]^_`{|}~"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:153: DeprecationWarning: invalid escape sequence '\]'
    """[%(stops)s]['"\]\}\)]*"""

~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154
  ~/.cache/.../lib/python3.10/site-packages/zhon/pinyin.py:154: DeprecationWarning: invalid escape sequence '\-'
    ) % {'word': word, 'non_stops': non_stops.replace('-', '\-'),

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

P.S. @tsroten - thank you for the package!
