haven-jeon / KoNLP
R package for Korean NLP
Home Page: http://cran.r-project.org/web/packages/KoNLP/index.html
Users can add extra terms to the user dictionary, so a reload function is needed that refreshes the dictionary without a full reload via "library(KoNLP)".
When running library(KoNLP), I get:
R Session Aborted
R encountered a fatal error.
version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer
######################################
Even after completely removing R, RStudio, and all installed packages and reinstalling, the same error occurs.
Hi,
I am getting this kind of error while processing a huge vector (1 million texts):
java.lang.ArrayIndexOutOfBoundsException: 10000
at kr.ac.kaist.swrc.jhannanum.plugin.MajorPlugin.PosTagger.HmmPosTagger.KoNLPHMMTagger.new_mnode(KoNLPHMMTagger.java:349)
at kr.ac.kaist.swrc.jhannanum.plugin.MajorPlugin.PosTagger.HmmPosTagger.KoNLPHMMTagger.tagPOS(KoNLPHMMTagger.java:140)
at kr.ac.kaist.swrc.jhannanum.hannanum.Workflow.analyzeInSingleThread(Workflow.java:870)
at kr.ac.kaist.swrc.jhannanum.hannanum.Workflow.analyze(Workflow.java:534)
at kr.pe.freesearch.jhannanum.comm.HannanumInterface.extractNoun(HannanumInterface.java:141)
Is there any way to fix this problem?
KAIST :
https://github.com/haven-jeon/KoNLP/wiki/KoNLP-examples
A function is needed to import entries from the Sejong dictionary into the Hannanum analyzer dictionary.
The documentation is currently based on Roxygen;
it needs to be converted to Roxygen2 Rd.
is.jamo() requires UTF-8 input, so it needs to check whether the input is actually UTF-8.
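A minimal sketch of such a guard, using base R's validUTF8() (available since R 3.3); assert_utf8 is an illustrative helper name, not part of KoNLP:

```r
# Reject input that is not valid UTF-8 before any jamo checks.
# assert_utf8 is a hypothetical helper, not a KoNLP function.
assert_utf8 <- function(x) {
  if (!all(validUTF8(x))) {
    stop("is.jamo() requires UTF-8 encoded input")
  }
  enc2utf8(x)  # normalize the declared encoding to UTF-8
}
```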
Felix Song
Thank you for creating the KoNLP package. :)
A question came up while using KoNLP for text-mining work, so I am writing it here.
Thank you for reading!
> SimplePos09('공보관통상진흥국장전자공업국장무역조사실장제 1차관보')
java.lang.ArrayIndexOutOfBoundsException
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
a character vector argument expected
Hello. I was impressed by the recent update and am using it very well. Thank you once again.
In my case I mostly handle reports from government agencies and research institutes, and morphological analysis on such material often runs into problems. Many of these reports are written in outline style and omit terminal punctuation. When I extract nouns with extractNoun from outline-style content vectors that lack punctuation, it often returns wrong results.
Analyzing the symptom pattern: when punctuation is present, or when the morphemes of an eojeol (word segment) are registered in the dictionary, the correct result is extracted. But a final eojeol that is not in the dictionary is forcibly extracted with its last syllable split off. For example, "힣힣힣" is split into "힣힣" and "힣".
In my case I work around this by appending punctuation to the end of each vector element, but with big data this adds meaningless extra load.
Could the cause be identified and fixed?
useNIADic()
txt.vt0 <- "저는 유능한 돌팔이입니다."
txt.vt1 <- "나는 유능한 연구원"
txt.vt2 <- "그동안 많은 일들을 해왔습니다."
txt.vt3 <- "그동안 많은 일들을 해왔음."
txt.vt4 <- "그동안 많은 일들을 해왔음"
txt.vt5 <- "그동안 많은 일들을 힣힣힣"
txt.vt6 <- "그동안 많은 일들을 힣힣힣."
extractNoun(txt.vt0)
[1] "저" "유능" "한" "돌팔이"
extractNoun(txt.vt1)
[1] "나" "유능" "한" "연구"
extractNoun(txt.vt2)
[1] "그동안" "일" "들"
extractNoun(txt.vt3)
[1] "그동안" "일" "들"
extractNoun(txt.vt4)
[1] "그동안" "일" "들" "해왔" "음"
extractNoun(txt.vt5)
[1] "그동안" "일" "들" "힣힣" "힣"
extractNoun(txt.vt6)
[1] "그동안" "일" "들" "힣힣힣"
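The reporter's workaround (appending punctuation before calling extractNoun) can be sketched as a small helper; ensure_terminal_punct is an illustrative name, not part of KoNLP:

```r
# Append a period to elements that lack terminal punctuation, so the
# tagger treats the final eojeol as sentence-final (hypothetical helper).
ensure_terminal_punct <- function(x) {
  ifelse(grepl("[.!?]$", x), x, paste0(x, "."))
}
```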
Hello.
I am really enjoying the package you made. ^^
I built a Shiny app that generates a word cloud and tried to deploy it to shinyapps.io
so my team could use it together, but deployment fails with the error below.
I asked RStudio about it, and they replied that the package developer needs to fix the error.
I would greatly appreciate it if you could take a look. ^^
(The error message is at the very bottom.)
building: Building package: KoNLP
[2016-03-27T00:46:29.686536819+0000] Execute script: packages/build/rJava.sh
trying to compile and link a JNI program
detected JNI cpp flags : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux
detected JNI linker flags : -L$(JAVA_HOME)/lib/amd64/server -ljvm
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/usr/lib/jvm/java-8-openjdk-amd64/jre/../include -I/usr/lib/jvm/java-8-openjdk-amd64/jre/../include/linux -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c conftest.c -o conftest.o
gcc -std=gnu99 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o conftest.so conftest.o -L/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server -ljvm -L/usr/lib/R/lib -lR
JAVA_HOME : /usr/lib/jvm/java-8-openjdk-amd64/jre
Java library path: $(JAVA_HOME)/lib/amd64/server
JNI cpp flags : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux
JNI linker flags : -L$(JAVA_HOME)/lib/amd64/server -ljvm
Updating Java configuration in /usr/lib/R
Done.
[2016-03-27T00:46:51.509207884+0000] Execute script: packages/build/tm.sh
When using KoNLP on a low-memory system:
library(KoNLP)
Loading package rJava
Loading package bitops
Loading package Sejong
Successfully Loaded Sejong Package.
Java initialized.
Error : .onLoad failed in loadNamespace() for 'KoNLP', details:
call: .jinit(parameters = c("-Dfile.encoding=UTF-8", getOption("java.parameters")))
error: Cannot create Java virtual machine (-1)
Error: package/namespace load failed for 'KoNLP'
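On a low-memory system, one possible workaround is to request a smaller JVM heap before the package loads; whether this takes effect depends on the KoNLP version reading getOption("java.parameters") during .jinit(), as the error above suggests:

```r
# Workaround sketch: ask for a smaller JVM heap so .jinit() can create
# the VM on a low-memory system (256m is illustrative, tune as needed).
options(java.parameters = "-Xmx256m")
library(KoNLP)
```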
For easy testing, the package needs an open corpus in the package data directory.
unz(get("SejongDicsZip", envir=KoNLP:::.KoNLPEnv), relpath, encoding=localCharset)
Hannanum-related functions (extractNoun, Simple*) need pre-processing, such as gsub("[[:space:]]", " ", sentence), before the input reaches them, to avoid sequences like "\t\t\t\n\t\r\n" in the sentence.
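A quick illustration of that pre-processing step; clean_ws is an illustrative helper name (using "+" to collapse whitespace runs into a single space, plus trimming):

```r
# Normalize runs of tabs/newlines/spaces to single spaces and trim ends,
# so sequences like "\t\t\t\n\t\r\n" never reach the Hannanum functions.
clean_ws <- function(sentence) {
  trimws(gsub("[[:space:]]+", " ", sentence))
}
```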
Hello.
First, thank you for your hard work creating this gem of a package.
I have been using this amazing package happily on other systems, and not being able to enjoy it here is so frustrating that I am filing this issue.
After loading the KoNLP library, calling useSejongDic() or useNIADic() keeps failing, so I tried to install NIADic, but that also keeps failing.
I am sharing my system information and the error messages below. I wonder if there is a solution.
#coding:cp949
Sys.setlocale("LC_ALL","Korean")
[1] "LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"
Sys.info()
sysname release version nodename
"Windows" "10 x64" "build 14393" "DESKTOP-JCEI8IR"
machine login user effective_user
"x86-64" "JooYoung" "JooYoung" "JooYoung"
devtools::install_github('haven-jeon/NIADic/NIADic', build_vignettes = TRUE)
Downloading GitHub repo haven-jeon/NIADic@master
from URL https://api.github.com/repos/haven-jeon/NIADic/zipball/master
Installing NIADic
"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file --no-environ --no-save
--no-restore --quiet CMD build
"C:\Users\JooYoung\AppData\Local\Temp\RtmpGEbZRb\devtools22017e25f40\haven-jeon-NIADic-5ef8093\NIADic"
--no-resave-data --no-manual
library(KoNLP)
Checking user defined dictionary!
useNIADic()
Backup was just finished!
Downloading package from url: https://github.com/haven-jeon/NIADic/releases/download/0.0.1/NIADic_0.0.1.tar.gz
/usr/bin/tar: Cannot connect to C: resolve failed
/usr/bin/tar: Cannot connect to C: resolve failed
Installation failed: argument is of length zero
Error in tryCatch({ : can't install NIADic package!
Please refer 'https://github.com/haven-jeon/NIADic' to install.
Calls: useNIADic -> buildDictionary -> install_NIADic -> tryCatch
In addition: Warning messages:
1: running command 'tar.exe -xf "C:\Users\JooYoung\AppData\Local\Temp\RtmpGEbZRb\file22014f65c47.tar.gz" -C "C:/Users/JooYoung/AppData/Local/Temp/RtmpGEbZRb/devtools22025cd2c59"' had status 128
2: In utils::untar(src, exdir = target, compressed = "gzip") :
'tar.exe -xf "C:\Users\JooYoung\AppData\Local\Temp\RtmpGEbZRb\file22014f65c47.tar.gz" -C "C:/Users/JooYoung/AppData/Local/Temp/RtmpGEbZRb/devtools22025cd2c59"' returned error code 128
3: running command 'tar.exe -tf "C:\Users\JooYoung\AppData\Local\Temp\RtmpGEbZRb\file22014f65c47.tar.gz"' had status 128
4: In min(slashes) : no non-missing arguments to min; returning Inf
Execution halted
Fix the state where phase 3 waits indefinitely for data to arrive in the queue.
haven-jeon/HanNanum-Analyzer#5
haven-jeon/HanNanum-Analyzer@56de689
Since KoNLP runs single-threaded, a blocking queue is effectively unnecessary, but resolve the problem while keeping the current structure.
Needs more detailed documentation.
If the input file's encoding is incompatible with the R session's encoding, the following error is shown:
> f <- file("TextData.txt", blocking=F)
> txtLines <- readLines(f)
> nouns <- sapply(txtLines, extractNoun, USE.NAMES=F)
Error in nchar(sentence) : '1' is an invalid multibyte character
So, an encoding-detection function is needed.
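A sketch of such an encoding-detection fallback, assuming the non-UTF-8 lines are CP949 (a common encoding for Korean Windows files, an assumption here); read_korean_lines is an illustrative name:

```r
# Read lines and re-encode anything that is not valid UTF-8, assuming the
# fallback encoding is CP949 (hypothetical helper, not a KoNLP function).
read_korean_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  bad <- !validUTF8(lines)
  if (any(bad)) {
    lines[bad] <- iconv(lines[bad], from = "CP949", to = "UTF-8")
  }
  lines
}
```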
preprocessing <- function(inputs){
if(!is.character(inputs)) {
warning("Input must be legitimate character!")
return(FALSE)
}
newInput <- gsub("[[:space:]]", " ", inputs)
newInput <- gsub("[[:space:]]+$", "", newInput)
newInput <- gsub("^[[:space:]]+", "", newInput)
if((nchar(newInput) == 0) |
(nchar(newInput) > 20 & length(strsplit(newInput, " ")[[1]]) <= 1)){
warning(sprintf("It's not kind of right sentence : '%s'", inputs))
return(FALSE)
}
return(newInput)
}
ex_str_A = '가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다'
If the input is ex_str_A, the function returns FALSE.
ex_str_B = '하하 호호 가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다가나다'
But if the input is ex_str_B, it returns ex_str_B rather than FALSE,
because nchar(newInput) > 20 is TRUE but length(strsplit(newInput, " ")[[1]]) <= 1 is FALSE.
So if SimplePos09 receives ex_str_B, it could cause a problem (possibly related to memory use by HannanumObj).
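One possible tightening of the guard, sketched below: reject inputs containing any oversized token, not only inputs that are a single oversized token. has_oversized_token is an illustrative helper, and the limit of 20 mirrors the existing check:

```r
# TRUE when any whitespace-separated token exceeds the character limit,
# catching cases like ex_str_B that slip past the current condition.
has_oversized_token <- function(x, limit = 20) {
  any(nchar(strsplit(x, " ", fixed = TRUE)[[1]]) > limit)
}
```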
A keystroke-conversion option is required.
Dear maintainers,
This concerns the packages
ALKr Amelia CSS CVST CheckDigit ChemoSpec ConjointChecks DataCombine
DataFrameConstr Delaporte DiscriMiner EpiContactTrace GOsummaries
Grid2Polygons HLMdiag HiveR IgorR KoNLP MRMR MicroStrategyR Morpho
MplusAutomation NISTnls NMF OLScurve OpenMPController OpenRepGrid
PRISMA PivotalR PropCIs Quandl R1magic RAppArmor RMendeley RMessenger
RSurvey RSvgDevice RTConnect RcmdrPlugin.MA RcmdrPlugin.MPAStats
Rjpstatdb Rook Rttf2pt1 SGP SGPdata Sejong SharpeR SuperLearner
TimeProjection TriMatch WDI XLConnect agridat backtest bagRboostR
bigrf biom bisectr bitops bmp causalsens cheddar clusteval
coarseDataTools coloc countrycode crs cumplyr cvxclustr d3Network
darch datacheck decctools devtools df2json dostats downloader dvn
eeptools events expoRkit expoTree extrafont extrafontdb faoutlier
fdasrvf fontcm forecast formatR futile.logger futile.paradigm gamlr
gazetools ggdendro ggmcmc ggplot2 ggsubplot gitter govdat growthmodels
hSDM harvestr highfrequency hts httpuv httr hysteresis imputation
installr investr ipfp jtrans kelvin kitagawa knitcitations knitr
knitrBootstrap l2boost labeledLoop lambda.r lisrelToR makeR mchof mirt
mmand mmod multilevelPSA mvc myepisodes needy networkTomography ngramr
np npRmpi nsprcomp opencpu.encode pROC pander pathdiagram pavo pbdBASE
pbdDMAT pdist phcfM pheatmap pheno2geno pitchRx plsdepot plspm plyr
pnn poppr portfolio portfolioSim profanal profr prospectr pumilioR
qdap questionr rAltmetric rImpactStory rdatamarket rdryad readbitmap
rebird rentrez repmis reports reshape2 restorepoint rfigshare
rfishbase rfisheries rgbif rio robustlmm ropensnp roxygen2 rplos
rspear rvertnet scales seacarb sig simPH simboot smss snpStatsWriter
sparsediscrim splitstackshape spsmooth sqlshare sss stringr structSSI
stylo surveydata taxize tbdiag tempdisagg tester testthat treebase
trip tripEstimation trueskill turner twitteR wethepeople zendeskR
maintained by one of you.
These contain a top-level README.md file, which is now used to generate
a corresponding README.html file on the CRAN package web pages.
Please check whether your README.md file is in fact appropriate for this
(e.g., not assuming that the content will only be accessed from the
GitHub project page): if not, please use .Rbuildignore to have README.md
excluded from the versions for publication on CRAN.
Best
OutofMemory issue on mac osx with R 3.0.x
addTermsToDictionary(c("감자", "ncn", "고구마", "ncn"))
For the large dictionary, the CRAN admins request splitting it into two packages.
Any user can privately add terms to and delete terms from the user dictionary.
Hello.
First, thank you for sharing KoNLP.
However, running SimplePos22 and MorphAnalyzer from the KoNLP package produces the error messages below. (SimplePos09 returns results normally.)
Error in .jcall(get("HannanumObj", envir = KoNLP:::.KoNLPEnv), "S", "SimplePos22", :
java.lang.OutOfMemoryError: Java heap space
Error in .jcall(get("HannanumObj", envir = KoNLP:::.KoNLPEnv), "S", "MorphAnalyzer", :
java.lang.OutOfMemoryError: Java heap space
Could you tell me what is causing the error?
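For Java heap-space errors like these, a possible workaround is to raise the JVM heap ceiling before loading KoNLP (2g is illustrative); whether this takes effect depends on the KoNLP version reading getOption("java.parameters") during .jinit():

```r
# Workaround sketch: raise the JVM heap before the namespace loads, so
# SimplePos22/MorphAnalyzer have more room than the default allocation.
options(java.parameters = "-Xmx2g")
library(KoNLP)
```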
library(KoNLP)
Loading package rJava
Loading package bitops
Error : .onLoad failed in loadNamespace() for 'KoNLP', details:
call: .jinit(parameters = c("-Dfile.encoding=UTF-8", "-Xmx1024m"))
error: Cannot create Java virtual machine (-4)
Error: package/namespace load failed for 'KoNLP'
R > KoNLP::statDic()
$summary
word tag
Length:390143 ncn :269992
Class :character pvg : 66343
Mode :character pad : 22337
mag : 22245
mmd : 3508
ii : 1594
(Other): 4124
$head
word tag
1 가게문 ncn
2 가겟방 ncn
3 가격결정론 ncn
4 가급 ncn
5 가나무 ncn
6 가는소금 ncn
$tail
word tag
390138 힝힝대다 pvg
390139 힝힝하다 pvg
390140 힝힝하다 pad
390141 힠 ncn
390142 힡트리다 pvg
390143 전작권 ncn
In the GitHub wiki, the example shows the following:
extractNoun("롯데마트가 판매하고 있는 흑마늘 양념 치킨이 논란이 되고 있다.")
[1] "롯데마트" "판매" "흑마늘" "양념" "치킨" "논란"
But with the CRAN package, the result differs:
extractNoun("롯데마트가 판매하고 있는 흑마늘 양념 치킨이 논란이 되고 있다.")
[1] "롯데마트가" "판매" "흑마늘" "양념" "치킨" "논란"
The preprocessing function needs to be fixed.
for performance improvement