
pinyin's Introduction

pinyin: an R package for converting Chinese characters into pinyin, and more


Introduction

This is an R package for converting Chinese characters into pinyin, four-corner codes, five-stroke codes, and more.

A brief introduction to pinyin can be found on Wikipedia:

Pinyin, or Hànyǔ Pīnyīn, is the official romanization system for Standard Chinese in mainland China, Malaysia, Singapore, and Taiwan. It is often used to teach Standard Chinese, which is normally written using Chinese characters. The system includes four diacritics denoting tones. Pinyin without tone marks is used to spell Chinese names and words in languages written with the Latin alphabet, and also in certain computer input methods to enter Chinese characters.

The pinyin system was developed in the 1950s by many linguists, including Zhou Youguang, based on earlier forms of romanization of Chinese. It was published by the Chinese government in 1958 and revised several times. The International Organization for Standardization (ISO) adopted pinyin as an international standard in 1982, followed by the United Nations in 1986. The system was adopted as the official standard in Taiwan in 2009, where it is used for romanization alone (in part to make areas more English-friendly) rather than for educational and computer-input purposes.

Since this package deals with Chinese characters, the original instructions were written in Chinese for Chinese-speaking users. Users who do not speak Chinese and want to use this package are welcome to contact me by email, although the R code in this document is largely self-explanatory.

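This R package, bluntly named pinyin after what it does, converts Chinese characters into pinyin. Since v1.1.3 it can also convert Chinese characters into four-corner codes or Wubi (five-stroke) codes. Since v1.1.4, users can specify their own dictionaries and convert however they like.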

Installation 安装方法 [ān zhuānɡ fānɡ fǎ]

install.packages('pinyin')
# or
devtools::install_github("pzhaonet/pinyin")

Some warnings about the locale may appear during installation. They look alarming but can be ignored.

Functions 函数 [hán shù]

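The usage of each function is of course covered in the help pages. Unfortunately, it seems the help pages cannot contain Chinese, so a package for processing Chinese cannot use Chinese in its own help text and examples, which is a pity. Fortunately, the functions can be explained here.

pinyin 1.1.4 contains 3 main functions:

  • pydic() loads a built-in pinyin dictionary (covering pinyin, four-corner, and Wubi).

  • If the built-in dictionaries do not meet your needs, load_dic() loads a custom dictionary. Three custom dictionaries are provided: four-corner, Wubi 86, and Wubi 98. Users can also make their own dictionaries by following the format of these.

  • py() converts the given characters into the corresponding pinyin, four-corner, or Wubi codes by looking them up in the loaded dictionary.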

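When loading a pinyin dictionary with pydic(), the following arguments are available:

  • convert to standard full pinyin (default, method = 'quanpin'), or
  • mark tones with digits (method = 'tone'), or
  • omit tones (method = 'toneless');
  • keep only the first letter of each character's pinyin (only_first_letter = TRUE);
  • choose whether to show all readings of a polyphonic character (multi = FALSE);
  • choose between two dictionaries (dic = c("pinyin", "pinyin2")).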

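When converting Chinese characters with py(),

  • the separator between the pinyin of adjacent characters can be customized (sep = '_');
  • if the string contains non-Chinese characters, they can either be kept as they are (nonezh_replace = NULL) or replaced with a specified character (e.g. nonezh_replace = '-').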
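
The options described above can be combined as in the following sketch. It assumes the pinyin package is installed and uses only arguments named in this README (method, only_first_letter, sep, dic). Outputs are not shown because they depend on the dictionary shipped with the installed version.

```r
# Sketch only: the conversions run when the pinyin package is available.
results <- NULL
if (requireNamespace("pinyin", quietly = TRUE)) {
  text <- "春眠不觉晓"

  # One dictionary per output style
  dic_tone     <- pinyin::pydic(method = "tone")      # tones as digits
  dic_toneless <- pinyin::pydic(method = "toneless")  # no tones
  dic_initials <- pinyin::pydic(method = "toneless", only_first_letter = TRUE)

  # The same text converted with each dictionary and a chosen separator
  results <- c(
    tone     = unname(pinyin::py(text, sep = " ", dic = dic_tone)),
    toneless = unname(pinyin::py(text, sep = " ", dic = dic_toneless)),
    initials = unname(pinyin::py(text, sep = "",  dic = dic_initials))
  )
  print(results)
}
```

Loading each dictionary once and reusing it keeps repeated py() calls from reloading the dictionary every time.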

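When loading a custom dictionary with load_dic(), three dictionaries are currently available (new dictionaries are welcome): four-corner, Wubi 86, and Wubi 98.

In addition, there are 3 custom functions that extend py() and serve as examples:

  • file.rename2py() renames files, converting the Chinese characters in the file names according to the loaded dictionary.
  • file2py() converts all Chinese characters in one or more text files in a given folder according to the loaded dictionary.
  • bookdown2py() is designed specifically for the bookdown package. It automatically adds a corresponding character ID such as {#biaotipinyin} to each Chinese chapter title, which avoids garbled file names when generating HTML files and solves the problem of titles that mix Chinese and English. This can of course be done by hand, but doing so is no fun at all.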

Examples 示例 [shì lì]

Convert with the default arguments:

library('pinyin')
mypy <- pydic() # load the default dictionary
py("春眠不觉晓,处处闻啼鸟", dic = mypy) # convert

dic_sj <- 'https://github.com/pzhaonet/pinyin/raw/master/inst2/sijiao.txt' # URL of a custom dictionary
mysj <- load_dic(dic_file = dic_sj) # load the custom dictionary
py("春眠不觉晓,处处闻啼鸟", sep = '_', dic = mysj) # convert

Updates

  • 2018-12-16. version 1.1.5. Compatible with the deprecated function pinyin(). Support vector calculation.
  • 2018-10-15. version 1.1.4. Support self-defined dictionaries.
  • 2018-10-09. version 1.1.3. Faster. Users can preload the library. A simple library was added. Four-corner codes are supported. A co-author joined.
  • 2018-01-17. version 1.1.1. Remove the vignettes.
  • 2017-06-19. Version 1.1.0. On CRAN. Fixed some bugs.
  • 2017-06-19. Version 1.0.2. Released on CRAN!
  • 2017-06-01. Version 1.0.0. zh2py() has been removed. Now the main function is pinyin(). Submitted to CRAN!
  • 2017-05-29. Version 0.2.0. zh2py(multi = TRUE) to display multiple pronunciations of a Chinese character.
  • 2017-05-29. Version 0.1.0. A new function file2py() was created according to Dong's comment.
  • 2017-05-26. Version 0.0.0. Preliminary.


pinyin's People

Contributors

pzhaonet, qu-cheng, tcgriffith


pinyin's Issues

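Polyphonic characters are a hard problem

Thank you for developing this tool. It is very handy and I plan to recommend it to my manager.
I found one problem: how are polyphonic characters converted? Take your own example:

library('pinyin')
mypy <- pydic() # load the default dictionary
py("春眠不觉晓,处处闻啼鸟", dic = mypy) # convert

The result is:
"chūn_mián_bú_jiào_xiǎo_,_chǔ_chǔ_wén_tí_diǎo"

觉 is a polyphonic character and should be read jué here. How can this be solved? Thanks!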
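
One possible way to handle a case like this is to apply phrase-level overrides on top of a character-by-character lookup. The sketch below is only an illustration: char_dic, phrase_dic, and to_pinyin() are hypothetical names with toy data, and none of them is part of the pinyin package.

```r
# Toy character dictionary mirroring the readings reported above (assumed data).
char_dic <- c("春" = "chūn", "眠" = "mián", "不" = "bú", "觉" = "jiào", "晓" = "xiǎo")

# Hypothetical phrase table: one reading per character of each phrase.
phrase_dic <- list("不觉" = c("bù", "jué"))

to_pinyin <- function(text, char_dic, phrase_dic, sep = "_") {
  chars <- strsplit(text, "")[[1]]

  # Step 1: character-level lookup; unknown characters stay unchanged.
  out <- chars
  known <- chars %in% names(char_dic)
  out[known] <- char_dic[chars[known]]

  # Step 2: overwrite the syllables wherever a known phrase occurs.
  for (phrase in names(phrase_dic)) {
    n <- nchar(phrase)
    starts <- gregexpr(phrase, text, fixed = TRUE)[[1]]
    for (s in starts[starts > 0]) {
      out[s:(s + n - 1)] <- phrase_dic[[phrase]]
    }
  }
  paste(out, collapse = sep)
}

result <- to_pinyin("春眠不觉晓", char_dic, phrase_dic)
print(result)
```

The phrase table has to be maintained by hand, so this only helps for words known in advance. General disambiguation would need word segmentation, which this sketch does not attempt.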

Output error: the Chinese character `幸` returns incorrect pinyin.

pinyin::py('幸')
#>    幸 
#> "niè"

Created on 2019-03-25 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                                              
#>  version  R version 3.5.3 (2019-03-11)                       
#>  os       Windows 7 x64 SP 1                                 
#>  system   x86_64, mingw32                                    
#>  ui       RTerm                                              
#>  language (EN)                                               
#>  collate  Chinese (Simplified)_People's Republic of China.936
#>  ctype    Chinese (Simplified)_People's Republic of China.936
#>  tz       Asia/Taipei                                        
#>  date     2019-03-25                                         
#> 
#> - Packages --------------------------------------------------------------
#>  package         * version date       lib source        
#>  assertthat        0.2.0   2017-04-11 [1] CRAN (R 3.5.2)
#>  backports         1.1.3   2018-12-14 [1] CRAN (R 3.5.2)
#>  callr             3.1.1   2018-12-21 [1] CRAN (R 3.5.2)
#>  cli               1.0.1   2018-09-25 [1] CRAN (R 3.5.2)
#>  crayon            1.3.4   2017-09-16 [1] CRAN (R 3.5.2)
#>  data.table        1.11.8  2018-09-30 [1] CRAN (R 3.5.2)
#>  desc              1.2.0   2018-05-01 [1] CRAN (R 3.5.2)
#>  devtools          2.0.1   2018-10-26 [1] CRAN (R 3.5.2)
#>  digest            0.6.18  2018-10-10 [1] CRAN (R 3.5.2)
#>  evaluate          0.12    2018-10-09 [1] CRAN (R 3.5.2)
#>  fs                1.2.6   2018-08-23 [1] CRAN (R 3.5.2)
#>  glue              1.3.0   2018-07-17 [1] CRAN (R 3.5.2)
#>  highr             0.7     2018-06-09 [1] CRAN (R 3.5.2)
#>  htmltools         0.3.6   2017-04-28 [1] CRAN (R 3.5.2)
#>  knitr             1.21    2018-12-10 [1] CRAN (R 3.5.2)
#>  magrittr          1.5     2014-11-22 [1] CRAN (R 3.5.2)
#>  memoise           1.1.0   2017-04-21 [1] CRAN (R 3.5.2)
#>  pinyin            1.1.5   2018-12-17 [1] CRAN (R 3.5.2)
#>  pkgbuild          1.0.2   2018-10-16 [1] CRAN (R 3.5.2)
#>  pkgload           1.0.2   2018-10-29 [1] CRAN (R 3.5.2)
#>  prettyunits       1.0.2   2015-07-13 [1] CRAN (R 3.5.2)
#>  processx          3.2.1   2018-12-05 [1] CRAN (R 3.5.2)
#>  ps                1.3.0   2018-12-21 [1] CRAN (R 3.5.2)
#>  R6                2.3.0   2018-10-04 [1] CRAN (R 3.5.2)
#>  Rcpp              1.0.0   2018-11-07 [1] CRAN (R 3.5.2)
#>  remotes           2.0.2   2018-10-30 [1] CRAN (R 3.5.2)
#>  rlang             0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown         1.11    2018-12-08 [1] CRAN (R 3.5.3)
#>  rprojroot         1.3-2   2018-01-03 [1] CRAN (R 3.5.2)
#>  sessioninfo       1.1.1   2018-11-05 [1] CRAN (R 3.5.2)
#>  splitstackshape   1.4.6   2018-07-23 [1] CRAN (R 3.5.2)
#>  stringi           1.2.4   2018-07-20 [1] CRAN (R 3.5.2)
#>  stringr           1.3.1   2018-05-10 [1] CRAN (R 3.5.2)
#>  testthat          2.0.1   2018-10-13 [1] CRAN (R 3.5.2)
#>  usethis           1.4.0   2018-08-14 [1] CRAN (R 3.5.2)
#>  withr             2.1.2   2018-03-15 [1] CRAN (R 3.5.2)
#>  xfun              0.4     2018-10-23 [1] CRAN (R 3.5.2)
#>  yaml              2.2.0   2018-07-25 [1] CRAN (R 3.5.2)
#> 
#> [1] C:/Users/lijiaxiang/Documents/R/win-library/3.5
#> [2] C:/Program Files/R/R-3.5.3/library

report some bugs

py("超市商城", dic = pydic(method = "toneless", dic = "pinyin"))
#             超市商城 
# "chao_fu_shang_cheng"
py("银行", dic = pydic(method = "toneless", multi = FALSE, dic = "pinyin"))
#                     银行 
#"yin_xing/hang/hang/heng" 

py("银行", dic = pydic(method = "toneless", multi = TRUE, dic = "pinyin"))
#                     银行 
#"yin_xing/hang/hang/heng" 
py("公园", dic = pydic(method = "toneless", dic = "pinyin"))
#      公园 
#"gong_wan"
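
Until the multi argument behaves as expected, a result such as the one above can be post-processed in base R. The helpers below are a hypothetical sketch, not part of the package: dedupe_readings() drops repeated readings, and first_reading() keeps only the first reading listed for each syllable. The first listed reading is not necessarily the correct one (银行 should be yin_hang).

```r
# Remove duplicated readings within each syllable, keeping their order.
dedupe_readings <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(syllables) {
    readings <- strsplit(syllables, "/", fixed = TRUE)
    cleaned <- vapply(readings, function(r) paste(unique(r), collapse = "/"),
                      character(1))
    paste(cleaned, collapse = sep)
  }, character(1))
}

# Keep only the first listed reading of each syllable.
first_reading <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(syllables) {
    paste(sub("/.*$", "", syllables), collapse = sep)
  }, character(1))
}

deduped <- dedupe_readings("yin_xing/hang/hang/heng")
first   <- first_reading("yin_xing/hang/hang/heng")
print(c(deduped, first))
```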

rework on this package

I think I've learned enough to make another PR to improve this package. I'll draft a new PR when I have time.

Things I aim to achieve in the PR:

  • A new function to convert 汉语 to pinyin with speed in mind.

  • Lazy-load the default dictionary data in the package, so that users can call the function directly after installation without having to load the dictionary manually.

Accuracy of pinyin

Thanks for developing such a useful package!
However, I found that 广西 is converted to "anxi",
and 鸟 is converted to "diao".

library(pinyin)
mypy <- pydic()
py("广西", sep = "", dic = mypy) # convert

广西
"ānxī"

py("春眠不觉晓,处处闻啼鸟", dic = mypy) # convert

春眠不觉晓,处处闻啼鸟
"chūn_mián_bú_jiào_xiǎo_,_chǔ_chǔ_wén_tí_diǎo"

I am not sure whether this is due to my Windows locale:
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] sp_1.3-2 shinyBS_0.61 shiny_1.4.0 maps_3.3.0 chinamap_0.2.0 plotly_4.9.1 lubridate_1.7.4
[8] forecast_8.11 forcats_0.4.0 ggrepel_0.8.1 tidyr_1.0.2 dplyr_0.8.4 nCov2019_0.0.6 pinyin_1.1.7
[15] Hmisc_4.3-1 ggplot2_3.2.1 Formula_1.2-3 survival_3.1-8 lattice_0.20-38

loaded via a namespace (and not attached):
[1] nlme_3.1-142 sf_0.8-1 xts_0.12-0 RColorBrewer_1.1-2 httr_1.4.1
[6] tools_3.6.2 backports_1.1.5 R6_2.4.1 rpart_4.1-15 KernSmooth_2.23-16
[11] DBI_1.1.0 splitstackshape_1.4.8 lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12
[16] withr_2.1.2 tidyselect_1.0.0 gridExtra_2.3 curl_4.3 compiler_3.6.2
[21] htmlTable_1.13.3 labeling_0.3 tseries_0.10-47 scales_1.1.0 checkmate_2.0.0
[26] lmtest_0.9-37 fracdiff_1.5-1 classInt_0.4-2 quadprog_1.5-8 stringr_1.4.0
[31] digest_0.6.23 foreign_0.8-72 base64enc_0.1-3 jpeg_0.1-8.1 pkgconfig_2.0.3
[36] htmltools_0.4.0 fastmap_1.0.1 htmlwidgets_1.5.1 rlang_0.4.4 TTR_0.23-6
[41] rstudioapi_0.11 quantmod_0.4-15 farver_2.0.3 zoo_1.8-7 jsonlite_1.6.1
[46] acepack_1.4.1 magrittr_1.5 Matrix_1.2-18 Rcpp_1.0.3 munsell_0.5.0
[51] lifecycle_0.1.0 stringi_1.4.5 grid_3.6.2 parallel_3.6.2 promises_1.1.0
[56] crayon_1.3.4 splines_3.6.2 knitr_1.28 pillar_1.4.3 urca_1.3-0
[61] glue_1.3.1 latticeExtra_0.6-29 remotes_2.1.0 data.table_1.12.8 png_0.1-7
[66] vctrs_0.2.2 httpuv_1.5.2 gtable_0.3.0 purrr_0.3.3 assertthat_0.2.1
[71] xfun_0.12 mime_0.9 xtable_1.8-4 e1071_1.7-3 later_1.0.0
[76] class_7.3-15 viridisLite_0.3.0 timeDate_3043.102 tibble_2.1.3 units_0.6-5
[81] cluster_2.1.0 ellipsis_0.3.0

how to get the first letter of the pinyin

Hello Zhao,
can pinyin return the first letter of each syllable for a string of Chinese characters?
For example:
Qingdao --> qd
wang, xiaoer --> w, xe

thanks.
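
According to the README, pydic() accepts only_first_letter = TRUE for this purpose. As an alternative that works on any already converted string, the hypothetical helper below (not part of the package) takes the first letter of each syllable in base R, assuming the syllables are joined by a known separator.

```r
# Build initials from separator-joined pinyin, e.g. "qing_dao" becomes "qd".
initials <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(syllables) {
    paste(substr(syllables, 1, 1), collapse = "")
  }, character(1))
}

abbreviations <- initials(c("qing_dao", "wang_xiao_er"))
print(abbreviations)
```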

unexpected number when converting "大同市"

Hi Peng,

When I try to convert the Chinese characters "大同", the number 5 unexpectedly appears in the result. Is this a bug, or is it intentional?

> devtools::install_github('pzhaonet/pinyin')
> require('pinyin')
> mypy = pydic(method = 'toneless', dic = 'pinyin2')
> py("大同市",  dic = mypy, sep = '')
     大同 
"datong5"

Thank you!
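
Whatever the cause of the stray digit, a toneless result can be cleaned up after conversion. This is a workaround sketch in base R with a hypothetical helper name, not a fix for the dictionary itself.

```r
# Remove any digits left in a toneless pinyin string.
strip_tone_digits <- function(x) gsub("[0-9]", "", x)

cleaned <- strip_tone_digits("datong5")
print(cleaned)
```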

missing characters in dictionary pinyin2

榊 弐 * 荘 碖 畑 怾 鬷 丒 苶 麿 饹 昻 肀 雸 嚒 渋 掻 戦 噷 抜 験 鞆 嗮 笹 淪 発 歯 呠 壖 訳 杦 嬢 気 臓 囖 袮 騨 粫 広 敻 続 営 対 栃 歳 弾 乄 夐 啽 読 莻 逤 襨 脦 犠 塩 灐 捘 匁 頩 潠 伈 郉 諳 櫉 糀 壊 垰 渇 嚰 堧 灀 敨 粌 桟 稕 琣 膗 眰 埨 塀 韖 扖 獣 渓 噺 诇 転 辺 鶻 婲 謉 餎 嫾 簗 売 腉 脌 餪 齵 朩 簱 苆 栄 捼 濏 蹹 乭 妵 桜 礖 炞 峅 垪 賶 襙 椙 鯳 畳 聣 鈄 乻 旕 廃 瓰 糓 靎 靏 鵆 袸 唜 遖 掜 拰 繷 啂 夞 喸 溌 兺 褄 囕 芿 杁 鳰 圸 籂 瓧 琑 溹 粏 畓 斢 瓲 杤 閕 圷 顕 壱 珱 愥 凪 俧 専 沝 訰 枠

As below:

py(charfq_na_save$character, dic = pydic(dic = c("pinyin2")))
  榊   弐   *   荘   碖   畑   怾   鬷   丒   苶   麿   饹   昻   肀   雸   嚒   渋   掻   戦   噷   抜   験   鞆   嗮 
"榊" "弐" "*" "荘" "碖" "畑" "怾" "鬷" "丒" "苶" "麿" "饹" "昻" "肀" "雸" "嚒" "渋" "掻" "戦" "噷" "抜" "験" "鞆" "嗮" 
  笹   淪   発   歯   呠   壖   訳   杦   嬢   気   臓   囖   袮   騨   粫   広   敻   続   営   対   栃   歳   弾   乄 
"笹" "淪" "発" "歯" "呠" "壖" "訳" "杦" "嬢" "気" "臓" "囖" "袮" "騨" "粫" "広" "敻" "続" "営" "対" "栃" "歳" "弾" "乄" 
  夐   啽   読   莻   逤   襨   脦   犠   塩   灐   捘   匁   頩   潠   伈   郉   諳   櫉   糀   壊   垰   渇   嚰   堧 
"夐" "啽" "読" "莻" "逤" "襨" "脦" "犠" "塩" "灐" "捘" "匁" "頩" "潠" "伈" "郉" "諳" "櫉" "糀" "壊" "垰" "渇" "嚰" "堧" 
  灀   敨   粌   桟   稕   琣   膗   眰   埨   塀   韖   扖   獣   渓   噺   诇   転   辺   鶻   婲   謉   餎   嫾   簗 
"灀" "敨" "粌" "桟" "稕" "琣" "膗" "眰" "埨" "塀" "韖" "扖" "獣" "渓" "噺" "诇" "転" "辺" "鶻" "婲" "謉" "餎" "嫾" "簗" 
  売   腉   脌   餪   齵   朩   簱   苆   栄   捼   濏   蹹   乭   妵   桜   礖   炞   峅   垪   賶   襙   椙   鯳   畳 
"売" "腉" "脌" "餪" "齵" "朩" "簱" "苆" "栄" "捼" "濏" "蹹" "乭" "妵" "桜" "礖" "炞" "峅" "垪" "賶" "襙" "椙" "鯳" "畳" 
  聣   鈄   乻   旕   廃   瓰   糓   靎   靏   鵆   袸   唜   遖   掜   拰   繷   啂   夞   喸   溌   兺   褄   囕   芿 
"聣" "鈄" "乻" "旕" "廃" "瓰" "糓" "靎" "靏" "鵆" "袸" "唜" "遖" "掜" "拰" "繷" "啂" "夞" "喸" "溌" "兺" "褄" "囕" "芿" 
  杁   鳰   圸   籂   瓧   琑   溹   粏   畓   斢   瓲   杤   閕   圷   顕   壱   珱   愥   凪   俧   専   沝   訰   枠 
"杁" "鳰" "圸" "籂" "瓧" "琑" "溹" "粏" "畓" "斢" "瓲" "杤" "閕" "圷" "顕" "壱" "珱" "愥" "凪" "俧" "専" "沝" "訰" "枠" 
py(charfq_na_save$character)
      榊       弐       *       荘       碖       畑       怾       鬷       丒       苶       麿       饹       昻 
  "shen"     "èr"    "cào" "zhuānɡ"    "lún"   "tián"    "zhǐ"   "zěnɡ"   "chǒu"    "nié"     "mo"     "le"    "ánɡ" 
      肀       雸       嚒       渋       掻       戦       噷       抜       験       鞆       嗮       笹       淪 
    "yù"     "án"     "me"     "se"    "sāo"   "zhàn"    "hēn"     "bá"    "yǎn"   "binɡ"    "sài"     "ti"   "ɡuān" 
      発       歯       呠       壖       訳       杦       嬢       気       臓       囖       袮       騨       粫 
    "fā"    "chǐ"    "pěn"   "ruán"     "yì"    "jiu"  "niánɡ"     "qì"   "zànɡ"    "luō"     "ni"    "tuó"     "ér" 
      広       敻       続       営       対       栃       歳       弾       乄       夐       啽       読       莻 
 "ɡuǎnɡ"  "xiòng"     "xu"   "yíng"    "duì"     "li"    "suì"    "dàn"     "wǔ"  "xiòng"     "ān"     "dú"   "ɡònɡ" 
      逤       襨       脦       犠       塩       灐       捘       匁       頩       潠       伈       郉       諳 
   "suò"    "duì"     "de"     "xi"    "yán"   "ying"    "zùn"    "wén"    "pīn"    "sùn"    "lǐn"   "xíng"     "ān" 
      櫉       糀       壊       垰       渇       嚰       堧       灀       敨       粌       桟       稕       琣 
   "chú"    "huɑ"   "huài"     "kɑ"     "kě"     "mó"    "nuò" "shuànɡ"    "tǒu"    "yin"   "zhàn"   "zhǔn"   "běnɡ" 
      膗       眰       埨       塀       韖       扖       獣       渓       噺       诇       転       辺       鶻 
 "chuái"    "diè"    "lǔn"    "pin"    "rǒu"     "ru"   "shou"     "xi"    "xin"  "xiòng"  "zhuǎn"   "biān"     "ɡú" 
      婲       謉       餎       嫾       簗       売       腉       脌       餪       齵       朩       簱       苆 
   "huɑ"    "duǐ"     "le"   "lián"  "liɑnɡ"    "mài"    "nái"    "nin"   "nuǎn"     "óu"   "děnɡ"     "qi"    "qie" 
      栄       捼       濏       蹹       乭       妵       桜       礖       炞       峅       垪       賶       襙 
  "róng"     "ré"     "se"     "tá"    "shí"    "tǒu"   "yīng"     "yù"   "biɑn"   "biɑn"   "binɡ"   "cànɡ"    "cào" 
      椙       鯳       畳       聣       鈄       乻       旕       廃       瓰       糓       靎       靏       鵆 
 "chɑnɡ"     "di"    "dié"     "ní"    "dǒu"     "yú"     "yú"    "fèi"    "fēn"     "ɡǔ"     "hè"     "hè"   "henɡ" 
      袸       唜       遖       掜       拰       繷       啂       夞       喸       溌       兺       褄       囕 
  "jiàn"     "mò"    "nɑn"    "nái"    "nǐn"   "nǒnɡ"    "ɡòu"    "wài"     "bǔ"     "pō"    "fēn"     "qi"    "lǎn" 
      芿       杁       鳰       圸       籂       瓧       琑       溹       粏       畓       斢       瓲       杤 
  "rènɡ"     "ru"     "ru"   "shɑn"    "shi"    "shí"    "suo"     "sè"    "tɑi"    "duō"   "tiǎo"     "wɑ"    "wɑn" 
      閕       圷       顕       壱       珱       愥       凪       俧       専       沝       訰       枠 
   "xiā"    "xiɑ"   "xiǎn"     "yī"   "ying"   "ying"    "zhi"    "zhi"  "zhuān"   "zhuǐ"   "zhūn"    "zui"

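Since characters missing from one dictionary come back unchanged, one way to cover the gaps is to consult a second dictionary only for those characters. The sketch below shows the idea with plain named character vectors. lookup_with_fallback() and the toy entries are hypothetical and do not reflect the package's internal data format.

```r
# Look up each character in `primary`, then in `fallback`; keep it unchanged
# if neither dictionary has an entry.
lookup_with_fallback <- function(chars, primary, fallback) {
  out <- unname(primary[chars])

  # Fill gaps from the fallback dictionary.
  missing <- is.na(out)
  out[missing] <- unname(fallback[chars[missing]])

  # Anything still missing stays as the original character.
  still_missing <- is.na(out)
  out[still_missing] <- chars[still_missing]

  names(out) <- chars
  out
}

# Toy dictionaries (assumed data for illustration only).
primary  <- c("春" = "chūn")
fallback <- c("春" = "chun", "広" = "guǎng")

looked_up <- lookup_with_fallback(c("春", "広", "*"), primary, fallback)
print(looked_up)
```

With the package itself, the same idea would mean converting with one dictionary and re-running only the unchanged elements with the other.
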
a vector of Mandarin Chinese strings into pinyin

The py() function does not seem to work when given a vector of Chinese strings. For example:

> library('pinyin')
> mypy = pydic(method = 'toneless')
> py(c("我", "一定", "是个", "天才"),  dic = mypy)
[1] "wo"

Sometimes, I have several columns in a data.frame that need to be converted into English letters.
I wrote a small function that makes it work, which depends on a couple of functions from the tidyverse (map() and the %>% pipe).

> testd = data.frame(stringsAsFactors=FALSE,
          x1 = c('我', '一定', '是个', '天才'),
          x2 = c('我', '确', '是个', '天才'))
> print(testd)
    x1   x2
1   我   我
2 一定   确
3 是个 是个
4 天才 天才
> require(tidyverse)
> conv_py = function(data, var_name){
+   for(i in var_name){
+     data[[i]] = map(data[[i]], function(x){py(x, dic = mypy)}) %>%
+       gsub("_", "", .) %>%
+       unlist()
+   }
+   return(data)
+ }

> conv_py(testd, c("x1", "x2"))
       x1      x2
1      wo      wo
2  yiding     que
3  shigan  shigan
4 tiancai tiancai

But there seems to be an obvious bug here: "是个" has been parsed into "shigan", which cannot be correct.

In summary:

  • Consider adding conv_py() or a similar function to the updated package. Converting a vector of Chinese strings is a very common task in data manipulation.
  • Fix the obvious bug that turns "是个" into "shigan". This is probably not your fault; I guess it comes from an error in the dictionary.
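
The column-wise conversion requested here can also be written without tidyverse dependencies. The sketch below uses hypothetical names (convert_columns(), toy_dic, toy_convert()) and takes the converter as an argument, so it runs with a stand-in dictionary and does not depend on the package.

```r
# Apply a string converter to selected columns of a data frame.
convert_columns <- function(data, cols, convert) {
  for (col in cols) {
    data[[col]] <- vapply(data[[col]], convert, character(1), USE.NAMES = FALSE)
  }
  data
}

# Stand-in converter with toy data so the sketch runs on its own.
toy_dic <- c("我" = "wo", "一" = "yi", "定" = "ding", "确" = "que")
toy_convert <- function(s) paste(toy_dic[strsplit(s, "")[[1]]], collapse = "")

testd <- data.frame(x1 = c("我", "一定"), x2 = c("我", "确"),
                    stringsAsFactors = FALSE)
converted <- convert_columns(testd, c("x1", "x2"), toy_convert)
print(converted)
```

With the package loaded, the converter could be something like function(s) unname(py(s, sep = "", dic = mypy)), assuming py() returns one string per input string.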

Better dictionary

The pinyin for "更" (geng4/geng1) is correct in "pinyin" but not in "pinyin2":

library(pinyin)
suppressWarnings(py("更", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin")))
#>            更 
#> "geng4/geng1"
py("更", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin2"))
#>      更 
#> "geng4"

Created on 2021-05-17 by the reprex package (v2.0.0)

The pinyin for "迹" (ji4) is correct in "pinyin2" but not in "pinyin":

library(pinyin)
suppressWarnings(py("迹", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin")))
#>    迹 
#> "ji1"
py("迹", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin2"))
#>    迹 
#> "ji4"

Created on 2021-05-17 by the reprex package (v2.0.0)

So it seems that neither of these two dictionaries is error-free.
