The former `kgram_freqs` class is now called `sbo_kgram_freqs`. The constructor `kgram_freqs()` is still available as an alias for `sbo_kgram_freqs()`.
The former `sbo_preds` class has been replaced by two classes:
- `sbo_predictor`: for interactive use
- `sbo_predtable`: for storing text predictors out of memory (e.g. `save()` to file)

`sbo_predictor` and `sbo_predtable` objects are obtained through the homonymous constructors, which are now S3 generics accepting `character` input, as well as `sbo_kgram_freqs` objects and, for the `sbo_predictor()` constructor, `sbo_predtable` objects. In particular, this makes it possible to train a text predictor directly, without storing the intermediate `sbo_dictionary` and `kgram_freqs` objects.
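As a rough sketch of this workflow: the argument values below are illustrative only, and `twitter_train` is assumed to be the example corpus bundled with `sbo`. A predictor can be trained directly from a character corpus, or a prediction table can be built, saved, and later turned back into a predictor.

```r
library(sbo)

# Train a predictor directly from a character vector; the intermediate
# dictionary and k-gram frequency objects are never stored by the user.
p <- sbo_predictor(
  twitter_train,                  # example corpus (assumed to ship with sbo)
  N = 3,                          # order of the k-gram model
  dict = target ~ 0.75,           # dictionary covering 75% of the corpus
  .preprocess = sbo::preprocess,  # default text preprocessing
  EOS = ".?!:;"                   # end-of-sentence characters
)
preds <- predict(p, "i love")     # next-word predictions for the input

# Alternatively, build a prediction table that can be saved to file ...
tbl <- sbo_predtable(
  twitter_train, N = 3, dict = target ~ 0.75,
  .preprocess = sbo::preprocess, EOS = ".?!:;"
)
path <- tempfile(fileext = ".rda")
save(tbl, file = path)

# ... and restore it later as an in-memory predictor.
load(path)
p2 <- sbo_predictor(tbl)
```

The table is the object meant for storage; the predictor is rebuilt from it whenever predictions are needed in a session.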
The behaviour of the `dict` argument in `kgram_freqs()` and `kgram_freqs_fast()` has changed: it now accepts either an `sbo_dictionary`, a `character` vector, or a `formula` (see also ‘New features’).
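The three accepted forms of `dict` might look as follows. This is a sketch under assumptions: the corpus is the `twitter_train` example data assumed to ship with `sbo`, the word list and sizes are arbitrary, and the formula is assumed to take the form `max_size ~ value` or `target ~ value`.

```r
library(sbo)

# (1) A formula: fix the dictionary size, or a target coverage fraction.
f_size <- kgram_freqs(twitter_train, N = 2, dict = max_size ~ 500,
                      .preprocess = sbo::preprocess, EOS = ".?!:;")
f_cov  <- kgram_freqs(twitter_train, N = 2, dict = target ~ 0.5,
                      .preprocess = sbo::preprocess, EOS = ".?!:;")

# (2) A character vector listing the dictionary words explicitly
#     (an arbitrary illustrative word list).
f_chr <- kgram_freqs(twitter_train, N = 2,
                     dict = c("i", "you", "love", "the"),
                     .preprocess = sbo::preprocess, EOS = ".?!:;")

# (3) A prebuilt sbo_dictionary object.
d <- sbo_dictionary(twitter_train, max_size = 500,
                    .preprocess = sbo::preprocess, EOS = ".?!:;")
f_dict <- kgram_freqs(twitter_train, N = 2, dict = d,
                      .preprocess = sbo::preprocess, EOS = ".?!:;")
```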
The `sbo_predictor` implementation dramatically improves the speed of `predict()` (by a factor of 10). A single call to `predict()` now allocates a few kB of RAM (whereas it previously allocated a few MB, cf. issue #10).
Metadata of `sbo_kgram_freqs` and `sbo_pred*` objects is now stored via attributes (#11).
- `sbo_dictionary`.
- `word_coverage` with generic constructors and a preconfigured `plot()` method.
- `kgram_freqs()` and `sbo_pred*()` can now also be built with a fixed target coverage fraction of the training corpus.
- `prune()` generic function for reducing the k-gram order of `kgram_freqs` and `sbo_predtable` objects.
- `summary()` methods for `sbo_kgram_freqs` and `sbo_pred*` objects; correspondingly, the output of `print()` has been simplified considerably (#5).
- `sbo_kgram_freqs`, `sbo_dictionary`, `sbo_predictor` and `sbo_predtable` can be constructed either through the homonymous constructors, or through the aliases `kgram_freqs()`, `dictionary()`, `predictor()`, `predtable()`.

`sbo` now has `SystemRequirements: C++11`, for correct integration with C++11 code (in particular `std::unordered_map`).
Model training (with `sbo_predictor()`) is now considerably faster, due to optimizations in the algorithm for building Stupid Back-Off prediction tables.
The Stupid Back-Off algorithm is now thoroughly tested, and small inconsistencies between the `predict.kgram_freqs()` and `predict.sbo_predictor()` methods have been fixed, including:
- Proper handling of unknown words
- Consistent handling of ties in prediction probabilities.

Model evaluation in `eval_sbo_predictor()` is now carried out by sampling a single sentence from each document in the test corpus.
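A minimal evaluation sketch follows. It assumes that `twitter_predtable` and `twitter_test` are example datasets bundled with `sbo`, and that the returned table has a logical `correct` column and a `true` column in which sentence ends appear as `"<EOS>"`; the seed and sample size are arbitrary.

```r
library(sbo)

# Build a predictor from a prebuilt example table (assumed to ship with sbo).
p <- sbo_predictor(twitter_predtable)

# Evaluate on a random subset of the test corpus; one sentence is
# sampled from each test document.
set.seed(840)
test <- sample(twitter_test, 500)
ev <- eval_sbo_predictor(p, test)

# Fraction of cases in which one of the predictions was correct.
acc_all <- mean(ev$correct)

# The same, restricted to in-sentence predictions.
acc_in_sentence <- mean(ev$correct[ev$true != "<EOS>"])
```

Because sentences are sampled at random, fixing the seed keeps the reported accuracy reproducible between runs.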
Removed unnecessary dependencies from the `Depends` and `Imports` package fields.
- `erase` argument in `preprocess()` and `kgram_freqs_fast()`, cf. issue #17.
- `kgramFreqs` class, as per §1.6.4 of the “Writing R extensions” guide.
- `kgram_freqs_fast()` for fast and memory-efficient k-gram tokenization using the default text preprocessing utility.
- `kgram_freqs()`, `get_word_freqs()`, `preprocess()`, and `predict.sbo_preds()` have been entirely rewritten in C++.
- `tokenize_sentences()` function for sentence-level tokenization.
- `kgram_freqs()` now accepts any user-defined single-character EOS token, through the `EOS` argument.
- `preproc` argument to `kgram_freqs()` and `get_word_freqs()`, for custom training corpus preprocessing.
- The `dict` argument of `kgram_freqs()` now also accepts numeric values, allowing a dictionary to be built directly from the training corpus.
- `predict` method for the `sbo_kgram_freqs` class.