
pinyin's Issues

rework on this package

I think I've learned enough to make another PR to improve this package. I'll draft a new PR when I have time.

Things I aim to achieve in the PR:

  • A new function to convert Chinese (汉语) to pinyin, written with speed in mind.

  • Lazy-load the default dictionary data in the package, so that users can call the function directly after installation without having to load the dictionary manually.

Output error: the Chinese character `幸` returns incorrect pinyin.

pinyin::py('幸')
#>    幸 
#> "niè"

Created on 2019-03-25 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                                              
#>  version  R version 3.5.3 (2019-03-11)                       
#>  os       Windows 7 x64 SP 1                                 
#>  system   x86_64, mingw32                                    
#>  ui       RTerm                                              
#>  language (EN)                                               
#>  collate  Chinese (Simplified)_People's Republic of China.936
#>  ctype    Chinese (Simplified)_People's Republic of China.936
#>  tz       Asia/Taipei                                        
#>  date     2019-03-25                                         
#> 
#> - Packages --------------------------------------------------------------
#>  package         * version date       lib source        
#>  assertthat        0.2.0   2017-04-11 [1] CRAN (R 3.5.2)
#>  backports         1.1.3   2018-12-14 [1] CRAN (R 3.5.2)
#>  callr             3.1.1   2018-12-21 [1] CRAN (R 3.5.2)
#>  cli               1.0.1   2018-09-25 [1] CRAN (R 3.5.2)
#>  crayon            1.3.4   2017-09-16 [1] CRAN (R 3.5.2)
#>  data.table        1.11.8  2018-09-30 [1] CRAN (R 3.5.2)
#>  desc              1.2.0   2018-05-01 [1] CRAN (R 3.5.2)
#>  devtools          2.0.1   2018-10-26 [1] CRAN (R 3.5.2)
#>  digest            0.6.18  2018-10-10 [1] CRAN (R 3.5.2)
#>  evaluate          0.12    2018-10-09 [1] CRAN (R 3.5.2)
#>  fs                1.2.6   2018-08-23 [1] CRAN (R 3.5.2)
#>  glue              1.3.0   2018-07-17 [1] CRAN (R 3.5.2)
#>  highr             0.7     2018-06-09 [1] CRAN (R 3.5.2)
#>  htmltools         0.3.6   2017-04-28 [1] CRAN (R 3.5.2)
#>  knitr             1.21    2018-12-10 [1] CRAN (R 3.5.2)
#>  magrittr          1.5     2014-11-22 [1] CRAN (R 3.5.2)
#>  memoise           1.1.0   2017-04-21 [1] CRAN (R 3.5.2)
#>  pinyin            1.1.5   2018-12-17 [1] CRAN (R 3.5.2)
#>  pkgbuild          1.0.2   2018-10-16 [1] CRAN (R 3.5.2)
#>  pkgload           1.0.2   2018-10-29 [1] CRAN (R 3.5.2)
#>  prettyunits       1.0.2   2015-07-13 [1] CRAN (R 3.5.2)
#>  processx          3.2.1   2018-12-05 [1] CRAN (R 3.5.2)
#>  ps                1.3.0   2018-12-21 [1] CRAN (R 3.5.2)
#>  R6                2.3.0   2018-10-04 [1] CRAN (R 3.5.2)
#>  Rcpp              1.0.0   2018-11-07 [1] CRAN (R 3.5.2)
#>  remotes           2.0.2   2018-10-30 [1] CRAN (R 3.5.2)
#>  rlang             0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown         1.11    2018-12-08 [1] CRAN (R 3.5.3)
#>  rprojroot         1.3-2   2018-01-03 [1] CRAN (R 3.5.2)
#>  sessioninfo       1.1.1   2018-11-05 [1] CRAN (R 3.5.2)
#>  splitstackshape   1.4.6   2018-07-23 [1] CRAN (R 3.5.2)
#>  stringi           1.2.4   2018-07-20 [1] CRAN (R 3.5.2)
#>  stringr           1.3.1   2018-05-10 [1] CRAN (R 3.5.2)
#>  testthat          2.0.1   2018-10-13 [1] CRAN (R 3.5.2)
#>  usethis           1.4.0   2018-08-14 [1] CRAN (R 3.5.2)
#>  withr             2.1.2   2018-03-15 [1] CRAN (R 3.5.2)
#>  xfun              0.4     2018-10-23 [1] CRAN (R 3.5.2)
#>  yaml              2.2.0   2018-07-25 [1] CRAN (R 3.5.2)
#> 
#> [1] C:/Users/lijiaxiang/Documents/R/win-library/3.5
#> [2] C:/Program Files/R/R-3.5.3/library

missing characters in dictionary pinyin2

榊 弐 * 荘 碖 畑 怾 鬷 丒 苶 麿 饹 昻 肀 雸 嚒 渋 掻 戦 噷 抜 験 鞆 嗮 笹 淪 発 歯 呠 壖 訳 杦 嬢 気 臓 囖 袮 騨 粫 広 敻 続 営 対 栃 歳 弾 乄 夐 啽 読 莻 逤 襨 脦 犠 塩 灐 捘 匁 頩 潠 伈 郉 諳 櫉 糀 壊 垰 渇 嚰 堧 灀 敨 粌 桟 稕 琣 膗 眰 埨 塀 韖 扖 獣 渓 噺 诇 転 辺 鶻 婲 謉 餎 嫾 簗 売 腉 脌 餪 齵 朩 簱 苆 栄 捼 濏 蹹 乭 妵 桜 礖 炞 峅 垪 賶 襙 椙 鯳 畳 聣 鈄 乻 旕 廃 瓰 糓 靎 靏 鵆 袸 唜 遖 掜 拰 繷 啂 夞 喸 溌 兺 褄 囕 芿 杁 鳰 圸 籂 瓧 琑 溹 粏 畓 斢 瓲 杤 閕 圷 顕 壱 珱 愥 凪 俧 専 沝 訰 枠

As below:

py(charfq_na_save$character, dic = pydic(dic = c("pinyin2")))
  榊   弐   *   荘   碖   畑   怾   鬷   丒   苶   麿   饹   昻   肀   雸   嚒   渋   掻   戦   噷   抜   験   鞆   嗮 
"榊" "弐" "*" "荘" "碖" "畑" "怾" "鬷" "丒" "苶" "麿" "饹" "昻" "肀" "雸" "嚒" "渋" "掻" "戦" "噷" "抜" "験" "鞆" "嗮" 
  笹   淪   発   歯   呠   壖   訳   杦   嬢   気   臓   囖   袮   騨   粫   広   敻   続   営   対   栃   歳   弾   乄 
"笹" "淪" "発" "歯" "呠" "壖" "訳" "杦" "嬢" "気" "臓" "囖" "袮" "騨" "粫" "広" "敻" "続" "営" "対" "栃" "歳" "弾" "乄" 
  夐   啽   読   莻   逤   襨   脦   犠   塩   灐   捘   匁   頩   潠   伈   郉   諳   櫉   糀   壊   垰   渇   嚰   堧 
"夐" "啽" "読" "莻" "逤" "襨" "脦" "犠" "塩" "灐" "捘" "匁" "頩" "潠" "伈" "郉" "諳" "櫉" "糀" "壊" "垰" "渇" "嚰" "堧" 
  灀   敨   粌   桟   稕   琣   膗   眰   埨   塀   韖   扖   獣   渓   噺   诇   転   辺   鶻   婲   謉   餎   嫾   簗 
"灀" "敨" "粌" "桟" "稕" "琣" "膗" "眰" "埨" "塀" "韖" "扖" "獣" "渓" "噺" "诇" "転" "辺" "鶻" "婲" "謉" "餎" "嫾" "簗" 
  売   腉   脌   餪   齵   朩   簱   苆   栄   捼   濏   蹹   乭   妵   桜   礖   炞   峅   垪   賶   襙   椙   鯳   畳 
"売" "腉" "脌" "餪" "齵" "朩" "簱" "苆" "栄" "捼" "濏" "蹹" "乭" "妵" "桜" "礖" "炞" "峅" "垪" "賶" "襙" "椙" "鯳" "畳" 
  聣   鈄   乻   旕   廃   瓰   糓   靎   靏   鵆   袸   唜   遖   掜   拰   繷   啂   夞   喸   溌   兺   褄   囕   芿 
"聣" "鈄" "乻" "旕" "廃" "瓰" "糓" "靎" "靏" "鵆" "袸" "唜" "遖" "掜" "拰" "繷" "啂" "夞" "喸" "溌" "兺" "褄" "囕" "芿" 
  杁   鳰   圸   籂   瓧   琑   溹   粏   畓   斢   瓲   杤   閕   圷   顕   壱   珱   愥   凪   俧   専   沝   訰   枠 
"杁" "鳰" "圸" "籂" "瓧" "琑" "溹" "粏" "畓" "斢" "瓲" "杤" "閕" "圷" "顕" "壱" "珱" "愥" "凪" "俧" "専" "沝" "訰" "枠" 
py(charfq_na_save$character)
      榊       弐       *       荘       碖       畑       怾       鬷       丒       苶       麿       饹       昻 
  "shen"     "èr"    "cào" "zhuānɡ"    "lún"   "tián"    "zhǐ"   "zěnɡ"   "chǒu"    "nié"     "mo"     "le"    "ánɡ" 
      肀       雸       嚒       渋       掻       戦       噷       抜       験       鞆       嗮       笹       淪 
    "yù"     "án"     "me"     "se"    "sāo"   "zhàn"    "hēn"     "bá"    "yǎn"   "binɡ"    "sài"     "ti"   "ɡuān" 
      発       歯       呠       壖       訳       杦       嬢       気       臓       囖       袮       騨       粫 
    "fā"    "chǐ"    "pěn"   "ruán"     "yì"    "jiu"  "niánɡ"     "qì"   "zànɡ"    "luō"     "ni"    "tuó"     "ér" 
      広       敻       続       営       対       栃       歳       弾       乄       夐       啽       読       莻 
 "ɡuǎnɡ"  "xiòng"     "xu"   "yíng"    "duì"     "li"    "suì"    "dàn"     "wǔ"  "xiòng"     "ān"     "dú"   "ɡònɡ" 
      逤       襨       脦       犠       塩       灐       捘       匁       頩       潠       伈       郉       諳 
   "suò"    "duì"     "de"     "xi"    "yán"   "ying"    "zùn"    "wén"    "pīn"    "sùn"    "lǐn"   "xíng"     "ān" 
      櫉       糀       壊       垰       渇       嚰       堧       灀       敨       粌       桟       稕       琣 
   "chú"    "huɑ"   "huài"     "kɑ"     "kě"     "mó"    "nuò" "shuànɡ"    "tǒu"    "yin"   "zhàn"   "zhǔn"   "běnɡ" 
      膗       眰       埨       塀       韖       扖       獣       渓       噺       诇       転       辺       鶻 
 "chuái"    "diè"    "lǔn"    "pin"    "rǒu"     "ru"   "shou"     "xi"    "xin"  "xiòng"  "zhuǎn"   "biān"     "ɡú" 
      婲       謉       餎       嫾       簗       売       腉       脌       餪       齵       朩       簱       苆 
   "huɑ"    "duǐ"     "le"   "lián"  "liɑnɡ"    "mài"    "nái"    "nin"   "nuǎn"     "óu"   "děnɡ"     "qi"    "qie" 
      栄       捼       濏       蹹       乭       妵       桜       礖       炞       峅       垪       賶       襙 
  "róng"     "ré"     "se"     "tá"    "shí"    "tǒu"   "yīng"     "yù"   "biɑn"   "biɑn"   "binɡ"   "cànɡ"    "cào" 
      椙       鯳       畳       聣       鈄       乻       旕       廃       瓰       糓       靎       靏       鵆 
 "chɑnɡ"     "di"    "dié"     "ní"    "dǒu"     "yú"     "yú"    "fèi"    "fēn"     "ɡǔ"     "hè"     "hè"   "henɡ" 
      袸       唜       遖       掜       拰       繷       啂       夞       喸       溌       兺       褄       囕 
  "jiàn"     "mò"    "nɑn"    "nái"    "nǐn"   "nǒnɡ"    "ɡòu"    "wài"     "bǔ"     "pō"    "fēn"     "qi"    "lǎn" 
      芿       杁       鳰       圸       籂       瓧       琑       溹       粏       畓       斢       瓲       杤 
  "rènɡ"     "ru"     "ru"   "shɑn"    "shi"    "shí"    "suo"     "sè"    "tɑi"    "duō"   "tiǎo"     "wɑ"    "wɑn" 
      閕       圷       顕       壱       珱       愥       凪       俧       専       沝       訰       枠 
   "xiā"    "xiɑ"   "xiǎn"     "yī"   "ying"   "ying"    "zhi"    "zhi"  "zhuān"   "zhuǐ"   "zhūn"    "zui"

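In the first call above, characters missing from the dictionary appear to come back unchanged, so each value equals its own name. A minimal sketch of how such entries could be listed automatically, assuming a named character vector shaped like the py() output; `unconverted()` is a hypothetical helper, not part of the package, and the sample values are illustrative:

```r
# Hypothetical helper: return the names of entries whose value is
# identical to the input character, i.e. that were not converted.
unconverted <- function(res) {
  names(res)[unname(res) == names(res)]
}

# Toy result mimicking the shape of py() output (illustrative values).
res <- setNames(
  c("\u698a", "\u5f10", "tian"),   # values: the first two came back unchanged
  c("\u698a", "\u5f10", "\u7551")  # names:  榊 弐 畑
)

unconverted(res)
```

Note that non-Chinese symbols such as `*` would also be reported, since they pass through unchanged as well.
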
report some bugs

py("超市商城", dic = pydic(method = "toneless", dic = "pinyin"))
#             超市商城 
# "chao_fu_shang_cheng"
py("银行", dic = pydic(method = "toneless", multi = FALSE, dic = "pinyin"))
#                     银行 
#"yin_xing/hang/hang/heng" 

py("银行", dic = pydic(method = "toneless", multi = TRUE, dic = "pinyin"))
#                     银行 
#"yin_xing/hang/hang/heng" 
py("公园", dic = pydic(method = "toneless", dic = "pinyin"))
#      公园 
#"gong_wan"

unexpected number when converting "大同市"

Hi Peng,

When I try to convert the Chinese characters "大同市", the number 5 appears unexpectedly. Is this a bug, or is it intentional?

> devtools::install_github('pzhaonet/pinyin')
> require('pinyin')
> mypy = pydic(method = 'toneless', dic = 'pinyin2')
> py("大同市",  dic = mypy, sep = '')
     大同 
"datong5"

Thank you!

Better dictionary

The pinyin for "更" (geng4/geng1) is correct in "pinyin" but not in "pinyin2":

library(pinyin)
suppressWarnings(py("更", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin")))
#>            更 
#> "geng4/geng1"
py("更", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin2"))
#>      更 
#> "geng4"

Created on 2021-05-17 by the reprex package (v2.0.0)

The pinyin for "迹" (ji4) is correct in "pinyin2" but not in "pinyin":

library(pinyin)
suppressWarnings(py("迹", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin")))
#>    迹 
#> "ji1"
py("迹", dic = pydic(method = "tone", multi = TRUE, dic = "pinyin2"))
#>    迹 
#> "ji4"

Created on 2021-05-17 by the reprex package (v2.0.0)

So it seems that neither of these two dictionaries is error-free.
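
Disagreements like these could be found systematically by converting the same characters with both dictionaries and keeping the entries that differ. A rough sketch, using plain character vectors in place of real dictionary output; `compare_dics()` is a hypothetical helper, the first two rows repeat the results reported above, and the third row is an illustrative entry on which both agree:

```r
# Hypothetical helper: list the characters whose readings differ
# between two parallel result vectors.
compare_dics <- function(chars, a, b) {
  differs <- a != b
  data.frame(char = chars[differs], a = a[differs], b = b[differs],
             stringsAsFactors = FALSE)
}

chars   <- c("\u66f4", "\u8ff9", "\u6211")   # 更 迹 我
pinyin  <- c("geng4/geng1", "ji1", "wo3")    # readings from "pinyin"
pinyin2 <- c("geng4", "ji4", "wo3")          # readings from "pinyin2"

compare_dics(chars, pinyin, pinyin2)
```

The resulting table only shows where the dictionaries disagree; deciding which reading is right still needs a trusted reference.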

how to get the first letter of the pinyin

Hello Zhao,
Can pinyin return just the first letter of each syllable for a string of Chinese characters?
For example:
Qingdao --> qd
wang, xiaoer --> w, xe

Thanks.
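
One way to get initials is to post-process toneless output that uses the default "_" separator. This is a minimal base-R sketch, not the package's own method; `first_letters()` is a hypothetical helper. If your installed version of pydic() has an `only_first_letter` argument (check `?pydic`), that may be the more direct route.

```r
# Hypothetical helper: reduce separator-delimited pinyin to initials.
first_letters <- function(x, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE), function(parts) {
    parts <- parts[nzchar(parts)]              # drop empty pieces
    paste(substr(parts, 1, 1), collapse = "")  # first letter of each syllable
  }, character(1), USE.NAMES = FALSE)
}

# Input strings stand in for py() output with sep = "_".
first_letters(c("qing_dao", "wang_xiao_er"))
```

The helper is vectorised over its input, so it can be applied to a whole column of converted strings at once.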

Accuracy of pinyin

Thanks for developing such a useful package!
However, I found that 广西 is converted to "anxi",
and "鸟" is converted to "diao".

library(pinyin)
mypy <- pydic()
py("广西", sep = "", dic = mypy) # convert

广西
"ānxī"

py("春眠不觉晓,处处闻啼鸟", dic = mypy) # convert

春眠不觉晓,处处闻啼鸟
"chūn_mián_bú_jiào_xiǎo_,_chǔ_chǔ_wén_tí_diǎo"

I am not sure whether this is due to my Windows locale:
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] sp_1.3-2 shinyBS_0.61 shiny_1.4.0 maps_3.3.0 chinamap_0.2.0 plotly_4.9.1 lubridate_1.7.4
[8] forecast_8.11 forcats_0.4.0 ggrepel_0.8.1 tidyr_1.0.2 dplyr_0.8.4 nCov2019_0.0.6 pinyin_1.1.7
[15] Hmisc_4.3-1 ggplot2_3.2.1 Formula_1.2-3 survival_3.1-8 lattice_0.20-38

loaded via a namespace (and not attached):
[1] nlme_3.1-142 sf_0.8-1 xts_0.12-0 RColorBrewer_1.1-2 httr_1.4.1
[6] tools_3.6.2 backports_1.1.5 R6_2.4.1 rpart_4.1-15 KernSmooth_2.23-16
[11] DBI_1.1.0 splitstackshape_1.4.8 lazyeval_0.2.2 colorspace_1.4-1 nnet_7.3-12
[16] withr_2.1.2 tidyselect_1.0.0 gridExtra_2.3 curl_4.3 compiler_3.6.2
[21] htmlTable_1.13.3 labeling_0.3 tseries_0.10-47 scales_1.1.0 checkmate_2.0.0
[26] lmtest_0.9-37 fracdiff_1.5-1 classInt_0.4-2 quadprog_1.5-8 stringr_1.4.0
[31] digest_0.6.23 foreign_0.8-72 base64enc_0.1-3 jpeg_0.1-8.1 pkgconfig_2.0.3
[36] htmltools_0.4.0 fastmap_1.0.1 htmlwidgets_1.5.1 rlang_0.4.4 TTR_0.23-6
[41] rstudioapi_0.11 quantmod_0.4-15 farver_2.0.3 zoo_1.8-7 jsonlite_1.6.1
[46] acepack_1.4.1 magrittr_1.5 Matrix_1.2-18 Rcpp_1.0.3 munsell_0.5.0
[51] lifecycle_0.1.0 stringi_1.4.5 grid_3.6.2 parallel_3.6.2 promises_1.1.0
[56] crayon_1.3.4 splines_3.6.2 knitr_1.28 pillar_1.4.3 urca_1.3-0
[61] glue_1.3.1 latticeExtra_0.6-29 remotes_2.1.0 data.table_1.12.8 png_0.1-7
[66] vctrs_0.2.2 httpuv_1.5.2 gtable_0.3.0 purrr_0.3.3 assertthat_0.2.1
[71] xfun_0.12 mime_0.9 xtable_1.8-4 e1071_1.7-3 later_1.0.0
[76] class_7.3-15 viridisLite_0.3.0 timeDate_3043.102 tibble_2.1.3 units_0.6-5
[81] cluster_2.1.0 ellipsis_0.3.0

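Polyphonic characters (多音字) are a hard problem

Thank you for developing this tool. It is very easy to use, and I plan to recommend it to my manager.
I found one problem: how are polyphonic characters converted? Take your own example:

library('pinyin')
mypy <- pydic() # load the default dictionary
py("春眠不觉晓,处处闻啼鸟", dic = mypy) # convert

The result is:
"chūn_mián_bú_jiào_xiǎo_,_chǔ_chǔ_wén_tí_diǎo"

觉 is a polyphonic character and should be read jué here. How can this be solved? Thanks!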

converting a vector of Mandarin Chinese strings into pinyin

The py() function does not seem to work when I pass it a vector of Chinese strings. For example:

> library('pinyin')
> mypy = pydic(method = 'toneless')
> py(c("我", "一定", "是个", "天才"),  dic = mypy)
[1] "wo"

Sometimes, I have several columns in a data.frame that need to be converted into English letters.
I wrote a small function that can make it work, which depends on a couple of functions from dplyr.

> testd = data.frame(stringsAsFactors=FALSE,
          x1 = c('我', '一定', '是个', '天才'),
          x2 = c('我', '确', '是个', '天才'))
> print(testd)
    x1   x2
1   我   我
2 一定   确
3 是个 是个
4 天才 天才
> require(tidyverse)
> conv_py = function(data, var_name){
+   for(i in var_name){
+     data[[i]] = map(data[[i]], function(x){py(x, dic = mypy)}) %>%
+       gsub("_", "", .) %>%
+       unlist()
+   }
+   return(data)
+ }

> conv_py(testd, c("x1", "x2"))
       x1      x2
1      wo      wo
2  yiding     que
3  shigan  shigan
4 tiancai tiancai

But there seems to be an obvious bug here: "是个" has been parsed into "shigan", which cannot be correct.

In summary:

  • Consider adding conv_py() or a similar function to the updated package. I find that converting a vector of Chinese strings is a very common task in data manipulation.
  • Fix the obvious bug that turns "是个" into "shigan". This is probably not your fault; I guess it comes from a problem in the dictionary.
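
The vector conversion requested above can also be sketched in base R, without dplyr or purrr: wrap any single-string converter in vapply(). In the sketch below, `convert_one()` and its tiny lookup table are stand-ins for py() and a real dictionary (both hypothetical, for illustration only), so the example runs without the package installed.

```r
# Toy lookup table standing in for a real dictionary (illustrative only).
keys   <- c("\u6211", "\u4e00", "\u5b9a", "\u662f", "\u4e2a")  # 我 一 定 是 个
values <- c("wo", "yi", "ding", "shi", "ge")

# Convert one string: split it into characters, look each one up, and
# leave characters that are not in the table unchanged.
convert_one <- function(s, sep = "") {
  chars <- strsplit(s, "")[[1]]
  hit <- match(chars, keys)
  paste(ifelse(is.na(hit), chars, values[hit]), collapse = sep)
}

# Vectorised wrapper: apply a single-string converter to every element.
convert_vec <- function(x, convert = convert_one, ...) {
  vapply(x, convert, character(1), ..., USE.NAMES = FALSE)
}

convert_vec(c("\u6211", "\u4e00\u5b9a", "\u662f\u4e2a"))  # 我, 一定, 是个
```

With the package itself, `convert` could presumably be something like `function(s) py(s, dic = mypy)`, as in conv_py() above. A data.frame column would then be converted with `data[[i]] <- convert_vec(data[[i]], ...)`.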
