pzhaonet / pinyin
An R package for converting Chinese characters into pinyin
License: MIT License
I think I've learned enough to make another PR to improve this package. I'll draft a new PR when I have time.
Things I aim to achieve in the PR:
A new function to convert Chinese (汉语) to pinyin with speed in mind.
Lazy-load the default dictionary data in the package, so that users can call the function directly after installation without having to load the dictionary manually.
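One way the speed goal might be approached is a vectorized lookup against a named character vector, so that a whole string is converted with one subsetting operation rather than a per-character loop. The sketch below is only an illustration under that assumption: `toy_dic` and `fast_py()` are hypothetical names, and the four-entry dictionary stands in for the package's real data.

```r
# Hypothetical toy dictionary: names are characters, values are toneless pinyin.
toy_dic <- c("汉" = "han", "语" = "yu", "拼" = "pin", "音" = "yin")

# Hypothetical helper: convert one string with a single vectorized lookup.
fast_py <- function(x, dic, sep = "_") {
  chars <- strsplit(x, "")[[1]]      # split the string into single characters
  out <- unname(dic[chars])          # look up all characters at once
  unknown <- is.na(out)
  out[unknown] <- chars[unknown]     # keep characters missing from the dictionary
  paste(out, collapse = sep)
}

fast_py("汉语拼音", toy_dic)
#> [1] "han_yu_pin_yin"
```

For the second goal, data shipped in a package's `data/` directory is lazy-loaded when `LazyData: true` is set in the DESCRIPTION file; whether that fits this package's dictionary format is not confirmed here.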
pinyin::py('幸')
#> 幸
#> "niè"
Created on 2019-03-25 by the reprex package (v0.2.1)
devtools::session_info()
#> - Session info ----------------------------------------------------------
#> setting value
#> version R version 3.5.3 (2019-03-11)
#> os Windows 7 x64 SP 1
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Chinese (Simplified)_People's Republic of China.936
#> ctype Chinese (Simplified)_People's Republic of China.936
#> tz Asia/Taipei
#> date 2019-03-25
#>
#> - Packages --------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.2)
#> backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.2)
#> callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.2)
#> cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.2)
#> data.table 1.11.8 2018-09-30 [1] CRAN (R 3.5.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.2)
#> devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.2)
#> digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.2)
#> evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.2)
#> fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.2)
#> glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.2)
#> highr 0.7 2018-06-09 [1] CRAN (R 3.5.2)
#> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.2)
#> knitr 1.21 2018-12-10 [1] CRAN (R 3.5.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.2)
#> pinyin 1.1.5 2018-12-17 [1] CRAN (R 3.5.2)
#> pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.2)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.2)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.2)
#> processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.2)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.2)
#> R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.2)
#> Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.2)
#> remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.2)
#> rlang 0.3.1 2019-01-08 [1] CRAN (R 3.5.2)
#> rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
#> splitstackshape 1.4.6 2018-07-23 [1] CRAN (R 3.5.2)
#> stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.2)
#> stringr 1.3.1 2018-05-10 [1] CRAN (R 3.5.2)
#> testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.2)
#> usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.2)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.2)
#> xfun 0.4 2018-10-23 [1] CRAN (R 3.5.2)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.2)
#>
#> [1] C:/Users/lijiaxiang/Documents/R/win-library/3.5
#> [2] C:/Program Files/R/R-3.5.3/library
榊 弐 * 荘 碖 畑 怾 鬷 丒 苶 麿 饹 昻 肀 雸 嚒 渋 掻 戦 噷 抜 験 鞆 嗮 笹 淪 発 歯 呠 壖 訳 杦 嬢 気 臓 囖 袮 騨 粫 広 敻 続 営 対 栃 歳 弾 乄 夐 啽 読 莻 逤 襨 脦 犠 塩 灐 捘 匁 頩 潠 伈 郉 諳 櫉 糀 壊 垰 渇 嚰 堧 灀 敨 粌 桟 稕 琣 膗 眰 埨 塀 韖 扖 獣 渓 噺 诇 転 辺 鶻 婲 謉 餎 嫾 簗 売 腉 脌 餪 齵 朩 簱 苆 栄 捼 濏 蹹 乭 妵 桜 礖 炞 峅 垪 賶 襙 椙 鯳 畳 聣 鈄 乻 旕 廃 瓰 糓 靎 靏 鵆 袸 唜 遖 掜 拰 繷 啂 夞 喸 溌 兺 褄 囕 芿 杁 鳰 圸 籂 瓧 琑 溹 粏 畓 斢 瓲 杤 閕 圷 顕 壱 珱 愥 凪 俧 専 沝 訰 枠
As below:
py(charfq_na_save$character, dic = pydic(dic = c("pinyin2")))
榊 弐 * 荘 碖 畑 怾 鬷 丒 苶 麿 饹 昻 肀 雸 嚒 渋 掻 戦 噷 抜 験 鞆 嗮
"榊" "弐" "*" "荘" "碖" "畑" "怾" "鬷" "丒" "苶" "麿" "饹" "昻" "肀" "雸" "嚒" "渋" "掻" "戦" "噷" "抜" "験" "鞆" "嗮"
笹 淪 発 歯 呠 壖 訳 杦 嬢 気 臓 囖 袮 騨 粫 広 敻 続 営 対 栃 歳 弾 乄
"笹" "淪" "発" "歯" "呠" "壖" "訳" "杦" "嬢" "気" "臓" "囖" "袮" "騨" "粫" "広" "敻" "続" "営" "対" "栃" "歳" "弾" "乄"
夐 啽 読 莻 逤 襨 脦 犠 塩 灐 捘 匁 頩 潠 伈 郉 諳 櫉 糀 壊 垰 渇 嚰 堧
"夐" "啽" "読" "莻" "逤" "襨" "脦" "犠" "塩" "灐" "捘" "匁" "頩" "潠" "伈" "郉" "諳" "櫉" "糀" "壊" "垰" "渇" "嚰" "堧"
灀 敨 粌 桟 稕 琣 膗 眰 埨 塀 韖 扖 獣 渓 噺 诇 転 辺 鶻 婲 謉 餎 嫾 簗
"灀" "敨" "粌" "桟" "稕" "琣" "膗" "眰" "埨" "塀" "韖" "扖" "獣" "渓" "噺" "诇" "転" "辺" "鶻" "婲" "謉" "餎" "嫾" "簗"
売 腉 脌 餪 齵 朩 簱 苆 栄 捼 濏 蹹 乭 妵 桜 礖 炞 峅 垪 賶 襙 椙 鯳 畳
"売" "腉" "脌" "餪" "齵" "朩" "簱" "苆" "栄" "捼" "濏" "蹹" "乭" "妵" "桜" "礖" "炞" "峅" "垪" "賶" "襙" "椙" "鯳" "畳"
聣 鈄 乻 旕 廃 瓰 糓 靎 靏 鵆 袸 唜 遖 掜 拰 繷 啂 夞 喸 溌 兺 褄 囕 芿
"聣" "鈄" "乻" "旕" "廃" "瓰" "糓" "靎" "靏" "鵆" "袸" "唜" "遖" "掜" "拰" "繷" "啂" "夞" "喸" "溌" "兺" "褄" "囕" "芿"
杁 鳰 圸 籂 瓧 琑 溹 粏 畓 斢 瓲 杤 閕 圷 顕 壱 珱 愥 凪 俧 専 沝 訰 枠
"杁" "鳰" "圸" "籂" "瓧" "琑" "溹" "粏" "畓" "斢" "瓲" "杤" "閕" "圷" "顕" "壱" "珱" "愥" "凪" "俧" "専" "沝" "訰" "枠"
py(charfq_na_save$character)
榊 弐 * 荘 碖 畑 怾 鬷 丒 苶 麿 饹 昻
"shen" "èr" "cào" "zhuānɡ" "lún" "tián" "zhǐ" "zěnɡ" "chǒu" "nié" "mo" "le" "ánɡ"
肀 雸 嚒 渋 掻 戦 噷 抜 験 鞆 嗮 笹 淪
"yù" "án" "me" "se" "sāo" "zhàn" "hēn" "bá" "yǎn" "binɡ" "sài" "ti" "ɡuān"
発 歯 呠 壖 訳 杦 嬢 気 臓 囖 袮 騨 粫
"fā" "chǐ" "pěn" "ruán" "yì" "jiu" "niánɡ" "qì" "zànɡ" "luō" "ni" "tuó" "ér"
広 敻 続 営 対 栃 歳 弾 乄 夐 啽 読 莻
"ɡuǎnɡ" "xiòng" "xu" "yíng" "duì" "li" "suì" "dàn" "wǔ" "xiòng" "ān" "dú" "ɡònɡ"
逤 襨 脦 犠 塩 灐 捘 匁 頩 潠 伈 郉 諳
"suò" "duì" "de" "xi" "yán" "ying" "zùn" "wén" "pīn" "sùn" "lǐn" "xíng" "ān"
櫉 糀 壊 垰 渇 嚰 堧 灀 敨 粌 桟 稕 琣
"chú" "huɑ" "huài" "kɑ" "kě" "mó" "nuò" "shuànɡ" "tǒu" "yin" "zhàn" "zhǔn" "běnɡ"
膗 眰 埨 塀 韖 扖 獣 渓 噺 诇 転 辺 鶻
"chuái" "diè" "lǔn" "pin" "rǒu" "ru" "shou" "xi" "xin" "xiòng" "zhuǎn" "biān" "ɡú"
婲 謉 餎 嫾 簗 売 腉 脌 餪 齵 朩 簱 苆
"huɑ" "duǐ" "le" "lián" "liɑnɡ" "mài" "nái" "nin" "nuǎn" "óu" "děnɡ" "qi" "qie"
栄 捼 濏 蹹 乭 妵 桜 礖 炞 峅 垪 賶 襙
"róng" "ré" "se" "tá" "shí" "tǒu" "yīng" "yù" "biɑn" "biɑn" "binɡ" "cànɡ" "cào"
椙 鯳 畳 聣 鈄 乻 旕 廃 瓰 糓 靎 靏 鵆
"chɑnɡ" "di" "dié" "ní" "dǒu" "yú" "yú" "fèi" "fēn" "ɡǔ" "hè" "hè" "henɡ"
袸 唜 遖 掜 拰 繷 啂 夞 喸 溌 兺 褄 囕
"jiàn" "mò" "nɑn" "nái" "nǐn" "nǒnɡ" "ɡòu" "wài" "bǔ" "pō" "fēn" "qi" "lǎn"
芿 杁 鳰 圸 籂 瓧 琑 溹 粏 畓 斢 瓲 杤
"rènɡ" "ru" "ru" "shɑn" "shi" "shí" "suo" "sè" "tɑi" "duō" "tiǎo" "wɑ" "wɑn"
閕 圷 顕 壱 珱 愥 凪 俧 専 沝 訰 枠
"xiā" "xiɑ" "xiǎn" "yī" "ying" "ying" "zhi" "zhi" "zhuān" "zhuǐ" "zhūn" "zui"
py("超市商城", dic = pydic(method = "toneless", dic = "pinyin"))
# 超市商城
# "chao_fu_shang_cheng"
py("银行", dic = pydic(method = "toneless", multi = FALSE, dic = "pinyin"))
# 银行
#"yin_xing/hang/hang/heng"
py("银行", dic = pydic(method = "toneless", multi = TRUE, dic = "pinyin"))
# 银行
#"yin_xing/hang/hang/heng"
py("公园", dic = pydic(method = "toneless", dic = "pinyin"))
# 公园
#"gong_wan"
Hi Peng,
When I try to convert the Chinese characters "大同", the number 5 appears unexpectedly. Is this a bug, or is it intentional?
> devtools::install_github('pzhaonet/pinyin')
> require('pinyin')
> mypy = pydic(method = 'toneless', dic = 'pinyin2')
> py("大同市", dic = mypy, sep = '')
大同
"datong5"
Thank you!
The pinyin for "更" (geng4/geng1) is correct in "pinyin" but not in "pinyin2":
library(pinyin)
suppressWarnings(py("更", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin")))
#> 更
#> "geng4/geng1"
py("更", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin2"))
#> 更
#> "geng4"
Created on 2021-05-17 by the reprex package (v2.0.0)
The pinyin for "迹" (ji4) is correct in "pinyin2" but not in "pinyin":
library(pinyin)
suppressWarnings(py("迹", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin")))
#> 迹
#> "ji1"
py("迹", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin2"))
#> 迹
#> "ji4"
Created on 2021-05-17 by the reprex package (v2.0.0)
So it seems that neither of these two dictionaries is error-free.
I.e., Mandarin Phonetic Symbols I (國語注音符號第一式)
Hello Zhao,
Can pinyin return the first letter of each pinyin syllable for a string of Chinese characters?
For example:
Qingdao --> qd
wang, xiaoer --> w, xe
thanks.
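A possible workaround, sketched here in base R, is to post-process toneless pinyin that has already been produced with a separator (such as the "_" seen in the outputs on this page) and keep only the first letter of each syllable. `py_initials()` is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: reduce separator-delimited toneless pinyin to initials.
py_initials <- function(pinyin, sep = "_") {
  syllables <- strsplit(pinyin, sep, fixed = TRUE)  # one vector of syllables per input
  vapply(syllables,
         function(s) paste(substr(s, 1, 1), collapse = ""),
         character(1))
}

py_initials(c("qing_dao", "xiao_er"))
#> [1] "qd" "xe"
```

The input would presumably come from a call such as `py(x, dic = pydic(method = "toneless"))`; tone-marked output would need its diacritics stripped first.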
Thanks for developing such a useful package!
But I found that 广西 is converted to "anxi",
and 鸟 is converted to "diao".
library(pinyin)
mypy <- pydic()
py("广西", sep = "", dic = mypy) # convert
广西
"ānxī"
py("春眠不觉晓,处处闻啼鸟", dic = mypy) # convert
春眠不觉晓,处处闻啼鸟
"chūn_mián_bú_jiào_xiǎo_,_chǔ_chǔ_wén_tí_diǎo"
I am not sure whether this is due to my Windows locale:
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sp_1.3-2 shinyBS_0.61 shiny_1.4.0 maps_3.3.0 chinamap_0.2.0 plotly_4.9.1 lubridate_1.7.4
[8] forecast_8.11 forcats_0.4.0 ggrepel_0.8.1 tidyr_1.0.2 dplyr_0.8.4 nCov2019_0.0.6 pinyin_1.1.7
[15] Hmisc_4.3-1 ggplot2_3.2.1 Formula_1.2-3 survival_3.1-8 lattice_0.20-38
loaded via a namespace (and not attached):
[1] nlme_3.1-142 sf_0.8-1 xts_0.12-0 RColorBrewer_1.1-2 httr_1.4.1
[6] tools_3.6.2 backports_1.1.5 R6_2.4.1 rpart_4.1-15 KernSmooth_2.23-16
[11] DBI_1.1.0 splitstackshape_1.4.8 lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12
[16] withr_2.1.2 tidyselect_1.0.0 gridExtra_2.3 curl_4.3 compiler_3.6.2
[21] htmlTable_1.13.3 labeling_0.3 tseries_0.10-47 scales_1.1.0 checkmate_2.0.0
[26] lmtest_0.9-37 fracdiff_1.5-1 classInt_0.4-2 quadprog_1.5-8 stringr_1.4.0
[31] digest_0.6.23 foreign_0.8-72 base64enc_0.1-3 jpeg_0.1-8.1 pkgconfig_2.0.3
[36] htmltools_0.4.0 fastmap_1.0.1 htmlwidgets_1.5.1 rlang_0.4.4 TTR_0.23-6
[41] rstudioapi_0.11 quantmod_0.4-15 farver_2.0.3 zoo_1.8-7 jsonlite_1.6.1
[46] acepack_1.4.1 magrittr_1.5 Matrix_1.2-18 Rcpp_1.0.3 munsell_0.5.0
[51] lifecycle_0.1.0 stringi_1.4.5 grid_3.6.2 parallel_3.6.2 promises_1.1.0
[56] crayon_1.3.4 splines_3.6.2 knitr_1.28 pillar_1.4.3 urca_1.3-0
[61] glue_1.3.1 latticeExtra_0.6-29 remotes_2.1.0 data.table_1.12.8 png_0.1-7
[66] vctrs_0.2.2 httpuv_1.5.2 gtable_0.3.0 purrr_0.3.3 assertthat_0.2.1
[71] xfun_0.12 mime_0.9 xtable_1.8-4 e1071_1.7-3 later_1.0.0
[76] class_7.3-15 viridisLite_0.3.0 timeDate_3043.102 tibble_2.1.3 units_0.6-5
[81] cluster_2.1.0 ellipsis_0.3.0
Just as follows:
pinyin("归还", sep = " ")
The result is "ɡuī fú".
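Thank you for developing this tool. It works very well, and I plan to recommend it to my manager.
I found one problem: how should polyphonic characters (多音字) be converted? Take your own example:
library('pinyin')
mypy <- pydic() # load the default dictionary
py("春眠不觉晓,处处闻啼鸟", dic = mypy) # convert
The result is:
"chūn_mián_bú_jiào_xiǎo_,_chǔ_chǔ_wén_tí_diǎo"
觉 is a polyphonic character and should be read jué here. How can this be solved? Thanks!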
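A character-by-character lookup cannot choose between readings, so one common approach is to consult a phrase dictionary first and fall back to single characters. The sketch below illustrates that idea in base R; `char_dic`, `phrase_dic`, and `py_with_phrases()` are hypothetical, and the tiny tables are not the package's data.

```r
# Hypothetical toy dictionaries (default readings and phrase-level overrides).
char_dic   <- c("不" = "bù", "觉" = "jiào", "晓" = "xiǎo", "睡" = "shuì")
phrase_dic <- list("不觉" = c("bù", "jué"))

# Hypothetical helper: match known phrases (longest first), else single characters.
py_with_phrases <- function(x, char_dic, phrase_dic, sep = "_") {
  chars <- strsplit(x, "")[[1]]
  phrases <- names(phrase_dic)
  phrases <- phrases[order(-nchar(phrases))]
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    matched <- FALSE
    for (p in phrases) {
      n <- nchar(p)
      if (i + n - 1 <= length(chars) &&
          paste(chars[i:(i + n - 1)], collapse = "") == p) {
        out <- c(out, phrase_dic[[p]])   # use the phrase-level reading
        i <- i + n
        matched <- TRUE
        break
      }
    }
    if (!matched) {
      reading <- unname(char_dic[chars[i]])
      out <- c(out, if (is.na(reading)) chars[i] else reading)
      i <- i + 1
    }
  }
  paste(out, collapse = sep)
}

py_with_phrases("不觉晓", char_dic, phrase_dic)
#> [1] "bù_jué_xiǎo"
py_with_phrases("睡觉", char_dic, phrase_dic)
#> [1] "shuì_jiào"
```

The quality of such an approach depends entirely on the phrase table, which would have to be supplied separately.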
The py() function does not seem to work when I pass it a vector of Chinese strings. For example:
> library('pinyin')
> mypy = pydic(method = 'toneless')
> py(c("我", "一定", "是个", "天才"), dic = mypy)
[1] "wo"
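Until vector input is handled, a thin wrapper that applies a single-string converter to each element is one way around this. The sketch below is generic base R: `py_each()` is a hypothetical helper, and `toy_convert()` is a stand-in so the example runs without the pinyin package.

```r
# Hypothetical wrapper: apply a single-string converter to each element of x.
py_each <- function(x, convert_one) {
  vapply(x, function(s) as.character(convert_one(s)), character(1),
         USE.NAMES = FALSE)
}

# Stand-in converter used only to make this sketch self-contained.
toy_convert <- function(s) toupper(s)

py_each(c("a", "bc"), toy_convert)
#> [1] "A"  "BC"
```

With the package loaded, the call would presumably look like `py_each(x, function(s) py(s, dic = mypy))`, assuming `py()` returns a length-one character vector for a single string.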
Sometimes, I have several columns in a data.frame that need to be converted into English letters.
I wrote a small function that makes this work; it depends on a couple of functions from the tidyverse (purrr's map() and the %>% pipe).
> testd = data.frame(stringsAsFactors=FALSE,
x1 = c('我', '一定', '是个', '天才'),
x2 = c('我', '确', '是个', '天才'))
> print(testd)
x1 x2
1 我 我
2 一定 确
3 是个 是个
4 天才 天才
> require(tidyverse)
> conv_py = function(data, var_name){
+ for(i in var_name){
+ data[[i]] = map(data[[i]], function(x){py(x, dic = mypy)}) %>%
+ gsub("_", "", .) %>%
+ unlist()
+ }
+ return(data)
+ }
> conv_py(testd, c("x1", "x2"))
x1 x2
1 wo wo
2 yiding que
3 shigan shigan
4 tiancai tiancai
But there seems to be an obvious bug here: "是个" has been parsed into "shigan", which cannot be correct.
In summary: please consider adding conv_py() or a similar function to your updated package. I find that converting a vector of Chinese characters is a very common problem in data manipulation.
A basic app that I may use for other purposes in the future, but you might be interested since you have a to-do for a Shiny app: https://github.com/boltomli/MyShinyApps/tree/master/hanying