This package, RcppMeCab, is an Rcpp wrapper for the part-of-speech morphological analyzer MeCab. It supports native UTF-8 encoding in C++ code and the CJK (Chinese, Japanese, and Korean) MeCab libraries. This package fully utilizes the power Rcpp brings to R computation to analyze texts faster.
First, install MeCab for your language of choice.
Japanese: MeCab from GitHub
Korean: MeCab-Ko from the Bitbucket repository
Chinese: MeCab and MeCab Chinese Dic from MeCab-Chinese
Second, you can install RcppMeCab from CRAN with:
install.packages("RcppMeCab") # build from source
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab") # install development version
Set the language you want to use for the analysis with the environment variable MECAB_LANG. The default value is ko. If you want to analyze Japanese or Chinese, set it to ja before installing the package.
install.packages("RcppMeCab") # for installing Korean version
# or, install for Japanese
Sys.setenv(MECAB_LANG = 'ja') # for installing Japanese developmental version
install.packages("RcppMeCab", type="source") # build from source
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab") # install development version
For analysis, you also need the MeCab binary and dictionary.
For Korean:
Install the mecab-ko-msvc and mecab-ko-dic-msvc builds that match your 32-bit or 64-bit Windows version in C:\mecab.
Provide the directory location to the RcppMeCab functions.
For Japanese:
Install the mecab binary. Provide the directory location to the RcppMeCab functions. For example:
pos(sentence, sys_dic = "C:/PROGRA~2/mecab/dic/ipadic")
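A mistyped dictionary path is an easy mistake at this step, so it may help to validate the directory before handing it to `pos()`. The sketch below is only an illustration: `check_sys_dic` is a hypothetical helper, not part of RcppMeCab, and it assumes that a usable system dictionary directory contains a `dicrc` file.

```r
# Hypothetical helper (not part of RcppMeCab): validate a dictionary
# directory before passing it to pos() as sys_dic.
check_sys_dic <- function(sys_dic) {
  if (!dir.exists(sys_dic)) {
    stop("Dictionary directory does not exist: ", sys_dic)
  }
  # Assumption: a usable system dictionary directory contains a 'dicrc' file.
  if (!file.exists(file.path(sys_dic, "dicrc"))) {
    stop("No 'dicrc' file found in: ", sys_dic)
  }
  # Return a normalized path with forward slashes, as in the example above.
  normalizePath(sys_dic, winslash = "/")
}

# Usage (requires RcppMeCab and an installed dictionary):
# pos(sentence, sys_dic = check_sys_dic("C:/PROGRA~2/mecab/dic/ipadic"))
```

Failing early with a readable message is a design choice here; the helper does nothing beyond checking the path.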
This package has the pos and posParallel functions.
pos(sentence) # returns a list; each sentence appears as a name of the list
pos(sentence, join = FALSE) # yields morphemes only (tags are given as the vector names)
pos(sentence, format = "data.frame") # returns the result as a data frame
pos(sentence, user_dic = user_dic) # uses a compiled user dictionary
posParallel(sentence, user_dic = user_dic) # parallelized version; uses more memory, but is much faster than a single-threaded loop
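Since `join = FALSE` is described as returning morphemes with the tags as vector names, a common follow-up is to keep only tokens with certain tags. The sketch below runs on hand-written mock data rather than real analyzer output, so the tokens and tags are placeholders; `keep_tags` is a hypothetical helper, not part of RcppMeCab.

```r
# Hand-written stand-in for pos(sentence, join = FALSE): a list with one
# named character vector per sentence (names = tags, values = morphemes).
# The tokens and tags below are placeholders, not real analyzer output.
mock_result <- list(
  "first sentence"  = c(NNG = "alpha", JKS = "beta", NNG = "gamma"),
  "second sentence" = c(VV = "delta")
)

# Hypothetical helper: keep only the morphemes whose tag is in `tags`.
keep_tags <- function(result, tags) {
  lapply(result, function(tokens) unname(tokens[names(tokens) %in% tags]))
}

nouns <- keep_tags(mock_result, "NNG")
print(nouns)
```

With real data, `mock_result` would be replaced by the value returned from `pos(sentence, join = FALSE)`.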
sys_dic: the directory in which the dicrc file is located. The default value is "", or you can set your own default using options(mecabSysDic = "").
user_dic: a user dictionary file compiled by mecab_dict_index. The default value is also "".
The MeCab API has DictionaryCompiler, but it contains die(). Hence, calling it from Rcpp crashes the entire R session. It will not be included in the RcppMeCab functions.
For Japanese, please refer to MeCab.
You need a model_file if you want the library to estimate costs automatically (model.bin in mecab-ko-dic).
You need the entire mecab-ko-dic source if you want to compile a Korean user dictionary. The user dictionary should be prepared as a CSV file. The CSV structure is described in the Japanese and Korean documentation.
Compile:
$ /usr/local/libexec/mecab/mecab-dict-index -m <model_file> -d <mecab_dic_location> -u <user_dictionary_file_name> -f <CSV file charset> -t <original dictionary charset> <target_csv>
# example
$ /usr/local/libexec/mecab/mecab-dict-index -m /usr/local/lib/mecab/dic/mecab-ko-dic/model.bin -d ~/mecab-ko-dic-2.0.3-20170922 -u userdic.dic -f utf8 -t utf8 ~/person.csv
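If you prefer to drive the compilation from an R script, the same command line can be assembled as an argument vector. This is a minimal sketch, not a feature of RcppMeCab: `build_dict_index_args` is a hypothetical helper, the paths are the ones from the example above, and the actual `system2()` call is left commented out because it needs those files to exist on your machine.

```r
# Hypothetical helper: assemble the arguments for mecab-dict-index.
# path.expand() resolves a leading "~" so the paths survive quoting.
build_dict_index_args <- function(model_file, dic_dir, out_file, csv_file,
                                  csv_charset = "utf8", dic_charset = "utf8") {
  c(
    "-m", path.expand(model_file),
    "-d", path.expand(dic_dir),
    "-u", out_file,
    "-f", csv_charset,
    "-t", dic_charset,
    path.expand(csv_file)
  )
}

args <- build_dict_index_args(
  model_file = "/usr/local/lib/mecab/dic/mecab-ko-dic/model.bin",
  dic_dir    = "~/mecab-ko-dic-2.0.3-20170922",
  out_file   = "userdic.dic",
  csv_file   = "~/person.csv"
)
print(args)

# To actually run the compiler (paths must exist on your machine):
# system2("/usr/local/libexec/mecab/mecab-dict-index", shQuote(args))
# Then pass the compiled file to the analyzer:
# pos(sentence, user_dic = "userdic.dic")
```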
mecab-ko-msvc has mecab-dict-index.exe.
The MeCab binary version also has mecab-dict-index.exe.
You can use it in the same way as the Linux binary to compile the dictionary.
Junhewk Kim (junhewk.kim@gmail.com)