jonathanbratt / rbert
Implementation of BERT in R
License: Apache License 2.0
run_classifier.R is mostly about fine-tuning BERT with a classifier head. RBERT is not quite working at this level yet; there are almost certainly bugs that would prevent fine-tuning from working in its present state.
Hi,
I am trying to run RBERT with TensorFlow on a small dataset. I have installed TensorFlow in a miniconda environment. Below is the code, which throws the error:
Sys.setenv(RETICULATE_PYTHON = "/Users/applemacbookpro/opt/miniconda3/envs/tensorflowa/bin/python")
#Make virtual environment in anaconda
reticulate::conda_list()[[1]][8] %>%
reticulate::use_condaenv(required = TRUE)
#Load the libraries
library(keras)
library(tidyverse)
library(stringr)
library(tidytext)
library(caret)
library(dplyr)
library(tm)
library(RBERT)
library(tensorflow)
library(reticulate)
#Install RBERT
devtools::install("/Users/applemacbookpro/Downloads/RBERT")
#Initiate BERT
BERT_PRETRAINED_DIR <- RBERT::download_BERT_checkpoint(model = "bert_base_uncased")
#Extract tokenized words from agency trainset
BERT_feats <- extract_features(
  examples = agency_trainset$agency,
  ckpt_dir = BERT_PRETRAINED_DIR,
  layer_indexes = 1:12
)
Error in py_call_impl(callable, dots$args, dots$keywords) :
RuntimeError: Evaluation error: ValueError: Tried to convert 'size' to a tensor and failed. Error: Cannot convert a partially known TensorShape to a Tensor: (128, ?).
Traceback:
stop(structure(list(message = "RuntimeError: Evaluation error: ValueError: Tried to convert 'size' to a tensor and failed. Error: Cannot convert a partially known TensorShape to a Tensor: (128, ?).",
call = py_call_impl(callable, dots$args, dots$keywords),
cppstack = structure(list(file = "", line = -1L, stack = c("1 reticulate.so 0x000000010773d3de _ZN4Rcpp9exceptionC2EPKcb + 222",
"2 reticulate.so 0x0000000107746245 _ZN4Rcpp4stopERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEE + 53", ...
13. python_function at call.py#21
12. fn at <string>#4
11. _call_model_fn at tpu_estimator.py#1524
10. call_without_tpu at tpu_estimator.py#1250
9. _model_fn at tpu_estimator.py#2470
8. _call_model_fn at estimator.py#1169
7. _call_model_fn at tpu_estimator.py#2186
6. predict at estimator.py#551
5. predict at tpu_estimator.py#2431
4. raise_errors at error_handling.py#128
3. predict at tpu_estimator.py#2437
2. result_iterator$`next`()
1. extract_features(examples = agency_trainset$agency, ckpt_dir = BERT_PRETRAINED_DIR, layer_indexes = 1:12)
Some tests require checkpoint files to be available. Start tests by downloading a checkpoint, and don't clean up till end of tests.
This takes priority over fixing for TF 1.14.
As an RBERT dev, I'd like the wordpiece tokenizer to be in its own, independently optimizable package, so that I don't have to think about it.
I would like to retrieve BERT word embeddings for entire paragraphs of more than 2 sentences. However, in the example below, where I have added a third sentence to the first "paragraph", I get the error in the title.
Does this mean that the third sentence has not been processed and has not contributed to the overall word embedding?
And what do I need to do to get a word embedding representing all three (and more) sentences in the first list?
text_to_process3 <- list(c("Impulse is equal to the change in momentum.",
"Changing momentum requires an impulse.", "A third sentence give a warning."),
c("An impulse is like a push.",
"Impulse is force times time."))
text_to_process3
BERT_feats <- extract_features(
examples = make_examples_simple(text_to_process3),
ckpt_dir = BERT_PRETRAINED_DIR,
layer_indexes = 1:12,
batch_size = 2L
)
Thank you for an awesome package!
As an RBERT user, it'd be convenient to be able to send text directly into extract_features, without having to first turn it into "examples." I.e., let's try to keep the example-ification internal. Hoping this won't be TOO difficult...
As an RBERT user, I'd like the tokenizer to be as fast as it can be, so that I don't have to wait for this step more than is absolutely necessary.
First thing to check: does keras::text_tokenizer (and friends) do what we need? If so, we should be able to save_text_tokenizer() when the model is downloaded for #51.
The tokenizer for a given model is deterministic (it only depends on the vocab file + whether it's cased). Producing the tokenizer takes 100x as long as loading a pre-processed tokenizer (about 4 s vs 40 ms for bert_base_uncased).
Save the tokenizer as part of the download process. If a model has a vocab but not a tokenizer, save a tokenizer once and then use it going forward (for backward compatibility with things that are already downloaded).
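The caching flow described above could look something like this sketch (make_tokenizer here is a stand-in for the real FullTokenizer constructor; the timing comments come from the estimate above for bert_base_uncased):

```r
# Sketch of the proposed tokenizer caching (hypothetical helper names;
# make_tokenizer stands in for RBERT's actual tokenizer constructor).
cached_tokenizer <- function(ckpt_dir, make_tokenizer) {
  cache_file <- file.path(ckpt_dir, "tokenizer.rds")
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # fast path: ~40 ms
  }
  # Slow path (~4 s): build from the vocab, then backfill the cache so
  # already-downloaded models pick up the tokenizer on first use.
  tokenizer <- make_tokenizer(file.path(ckpt_dir, "vocab.txt"))
  saveRDS(tokenizer, cache_file)
  tokenizer
}
```

The backfill branch gives the backward compatibility mentioned above: a model directory with a vocab but no saved tokenizer pays the slow path exactly once.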
On running the package for the first time, I am getting the below error message. Please can you offer some guidance? Thanks.
'rDownloading' is not recognized as an internal or external command,
operable program or batch file.
The layer_indexes parameter in extract_features was previously required to be a list, but currently must be an integer vector. We should allow either, to avoid breaking changes.
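A small shim (a hypothetical helper, not current RBERT API) could normalize both forms before use:

```r
# Accept layer_indexes as either a list (old behavior) or an integer
# vector (current behavior), returning an integer vector either way.
normalize_layer_indexes <- function(layer_indexes) {
  if (is.list(layer_indexes)) {
    layer_indexes <- unlist(layer_indexes)
  }
  as.integer(layer_indexes)
}
```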
We had to specify TF 1.11.0 to get the tests in test_modeling.R and test_run_classifier.R to work. What is it about those tests that's 1.11-specific?
Might be Windows-specific. Downloading "bert_large_uncased_wwm" failed (at the actual download.file step). Removing method = "libcurl" fixed it. I don't remember why we specify the method; need to try on different OSs. If the default ("auto") works everywhere, let's just do that.
extract_features, download_BERT_checkpoint, and probably some other functions use the model parameter, with a hard-coded list of models. Investigate listing those models in one place and automatically updating the formals of those functions. Ideally each function should still list them all in its documentation (and in RStudio autocomplete), so it'll take at least a little bit of fanciness. I'm pretty certain it's possible, though.
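One possible shape for this, sketched with stub functions (the model list below is illustrative, not exhaustive, and these are not the real RBERT definitions): keep a single character vector and either use it as the default for match.arg, or patch formals() programmatically.

```r
# Single source of truth for model names (illustrative subset).
.known_models <- c("bert_base_uncased", "bert_base_cased", "bert_large_uncased")

# Option 1: use the vector as the default, so match.arg validates it and
# the choices show up in documentation and autocomplete.
download_checkpoint_stub <- function(model = .known_models) {
  match.arg(model)
}

# Option 2: patch the formals of an existing function after the fact.
extract_features_stub <- function(model) model
formals(extract_features_stub)$model <- .known_models
```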
It doesn't complain right away, but if you run enough models, you get a message like:
WARNING:tensorflow:5 out of the last 6 calls to <function Model.make_predict_function..predict_function at 0x7fa8913872f0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
(@test-download_checkpoint.R#91) Error: testthat unit tests failed
@jonthegeek already has a fix for this, I believe.
In tokenizer and its methods, make the first argument (a) universal between the methods, and (b) less generic than "x".
Rather than make the user specify whether the vocabulary is cased, we should be able to infer this from the vocabulary itself with a very high degree of confidence.
The place to do this is probably in FullTokenizer (so anything upstream of that would lose do_lower_case as a parameter).
For example, we could add:
do_lower_case <- !any(grepl("^[A-Z]", inv_vocab))
after the line
inv_vocab <- names(vocab)
in tokenization.R.
The above assumes that a vocabulary is cased iff it contains at least one token that begins with an uppercase letter. This ensures that we skip any special tokens like [SEP] or [CLS]. Technically, somebody could perversely construct a vocab that is cased, but no tokens start with capitals. The above code would classify such a vocabulary as "uncased", though I believe the correct classification in that case (heh) would be "WTF".
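The heuristic is easy to check on toy vocabularies (toy data, not real BERT vocab files; note that special tokens like [CLS] and [SEP] start with "[", so the pattern skips them as intended):

```r
# A vocabulary is treated as cased iff at least one token starts with an
# uppercase letter. Bracketed special tokens never match "^[A-Z]".
is_cased_vocab <- function(inv_vocab) {
  any(grepl("^[A-Z]", inv_vocab))
}

cased_vocab   <- c("[CLS]", "[SEP]", "the", "The", "##ing")
uncased_vocab <- c("[CLS]", "[SEP]", "the", "##ing")
```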
extract_features should be able to figure out the vocab, config, and checkpoint file paths, given the directory path.
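A minimal sketch of that inference (hypothetical helper; the file names assumed here follow Google's released BERT checkpoints):

```r
# Given a checkpoint directory, build the three standard file paths.
locate_checkpoint_files <- function(ckpt_dir) {
  list(
    vocab_file       = file.path(ckpt_dir, "vocab.txt"),
    bert_config_file = file.path(ckpt_dir, "bert_config.json"),
    init_checkpoint  = file.path(ckpt_dir, "bert_model.ckpt")
  )
}
```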
For completeness, would be good to return the bare token embeddings before any transformer layers along with the layer outputs.
Doesn't have to be perfect yet, but let's get a pass!
Sometimes it is useful to know the tokenization for a particular piece of text without running the whole BERT model. This is possible, but currently not very easy to do in RBERT. Would be nice to have a function something like tokenize_text("some text to tokenize", ckpt_dir).
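For reference, the greedy longest-match wordpiece step such a function would expose is small enough to sketch in pure R (toy vocab; this is an illustration of the algorithm, not RBERT's actual implementation):

```r
# Greedy longest-match wordpiece tokenization of a single
# whitespace-split word against a vocabulary of pieces.
wordpiece_one <- function(word, vocab) {
  pieces <- character(0)
  start <- 1
  while (start <= nchar(word)) {
    end <- nchar(word)
    piece <- NULL
    while (end >= start) {
      cand <- substr(word, start, end)
      if (start > 1) cand <- paste0("##", cand)  # continuation marker
      if (cand %in% vocab) { piece <- cand; break }
      end <- end - 1
    }
    if (is.null(piece)) return("[UNK]")  # no piece matches: unknown token
    pieces <- c(pieces, piece)
    start <- end + 1
  }
  pieces
}
```

With a vocab containing "im" and "##pulse", wordpiece_one("impulse", vocab) splits the word into those two pieces.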
Out of habit, a lot of things that should be in \code{} in our docs are instead in backticks (e.g., line 21 of tokenization.R). Search through and replace those with \code{} (but be sure to check manually, because there may be legitimate backticks within the code itself).
As an RBERT user, I'd like to extract features from text without having to overthink the model backend. Allow the user to specify a "model" as in download_BERT_checkpoint, and then infer everything else from that.
Note that this will change the order of the arguments again (adding "model" as the second argument).
If the user sends us the same text 100 times, we shouldn't take the time to run BERT on it 100 times. Uniquify, then join at the end to get back the full list.
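The uniquify-then-join step could look like this sketch (run_bert here stands in for the expensive feature-extraction step; hypothetical helper name):

```r
# Run the expensive step once per unique input, then expand the results
# back out to match the original (possibly duplicated) input order.
extract_unique <- function(texts, run_bert) {
  u <- unique(texts)
  feats <- run_bert(u)      # expensive: one call per unique text
  feats[match(texts, u)]    # join back to the full input
}
```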
Somewhere we should add documentation about the available models. Maybe make a data object of names and URLs (and use that in .get_model_url), or just include the info in download_BERT_checkpoint.
Currently, we test with BERT_base, but we may as well use the smallest available model.
As an RBERT user, I'd like downloaded checkpoints to be "stable," so that I can confidently use them for future data prep.
Work out a way to do these things:
Probably just modify make_examples_simple to accept two-sequence examples.
I'm getting this when doing the example in the README. Any ideas or advice you can share on this? I'm running this locally on a Windows machine, so no TPU here.
> BERT_feats <- extract_features(
+ examples = make_examples_simple(text_to_process2),
+ vocab_file = vocab_file,
+ bert_config_file = bert_config_file,
+ init_checkpoint = init_checkpoint,
+ layer_indexes = 1:12,
+ batch_size = 2L
+ )
Error in py_get_attr_impl(x, name, silent) :
AttributeError: module 'tensorflow.contrib.tpu' has no attribute 'InputPipelineConfig'
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RBERT_0.1.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 compiler_3.5.2 prettyunits_1.0.2 base64enc_0.1-3 remotes_2.0.2
[6] tools_3.5.2 testthat_2.0.1 digest_0.6.20 pkgbuild_1.0.2 pkgload_1.0.2
[11] jsonlite_1.6 memoise_1.1.0 debugme_1.1.0 lattice_0.20-38 rlang_0.4.0
[16] Matrix_1.2-15 cli_1.1.0 rstudioapi_0.9.0 curl_3.3 yaml_2.2.0
[21] xfun_0.9 withr_2.1.2 knitr_1.24 desc_1.2.0 fs_1.3.1
[26] devtools_2.0.1 rprojroot_1.3-2 grid_3.5.2 reticulate_1.12 glue_1.3.1
[31] R6_2.4.0 sessioninfo_1.1.1 whisker_0.4 callr_2.0.2 purrr_0.3.2
[36] magrittr_1.5 backports_1.1.4 tfruns_1.4 usethis_1.4.0 assertthat_0.2.1
[41] tensorflow_1.14.0 crayon_1.3.4
Reprex below.
library(RBERT)
# |- Python ----
reticulate::use_condaenv("r-reticulate")
#> Warning in normalizePath(path.expand(path), winslash, mustWork):
#> path[1]="C:\Users\leungi\AppData\Local\Continuum\anaconda3\envs\fenics/
#> python.exe": The system cannot find the file specified
reticulate::py_config()
#> python: C:\Users\leungi\AppData\Local\Continuum\anaconda3\envs\r-reticulate\python.exe
#> libpython: C:/Users/leungi/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python36.dll
#> pythonhome: C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\envs\R-RETI~1
#> version: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 14:00:49) [MSC v.1915 64 bit (AMD64)]
#> Architecture: 64bit
#> numpy: C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\envs\R-RETI~1\lib\site-packages\numpy
#> numpy_version: 1.17.2
#>
#> python versions found:
#> C:\Users\leungi\AppData\Local\Continuum\anaconda3\envs\r-reticulate\python.exe
#> C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\envs\R-RETI~1\python.exe
#> C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\python.exe
#> C:\PROGRA~2\MIB055~1\Shared\PYTHON~1\\python.exe
#> C:\Users\leungi\AppData\Local\CONTIN~1\MINICO~1\python.exe
#> C:\Users\leungi\AppData\Local\Continuum\miniconda3\python.exe
#> C:\Users\leungi\AppData\Local\Continuum\miniconda3\envs\r-tensorflow\python.exe
#> C:\Users\leungi\AppData\Local\conda\conda\envs\tensorflow_env\python.exe
# |- model ----
# path to downloaded BERT checkpoint
BERT_PRETRAINED_DIR <- file.path(
"output_data/",
"BERT_checkpoints",
"uncased_L-12_H-768_A-12"
)
if (!dir.exists(BERT_PRETRAINED_DIR)) {
# Download pre-trained BERT model.
RBERT::download_BERT_checkpoint(
model = "bert_base_uncased",
destination = "output_data/"
)
}
#> Warning in dir.create(checkpoint_dir): cannot create dir 'output_data\
#> \BERT_checkpoints', reason 'No such file or directory'
#> [1] "C:\\Users\\leungi\\AppData\\Local\\Temp\\RtmpMhv1E0\\reprex243242916591\\output_data\\BERT_checkpoints\\uncased_L-12_H-768_A-12"
vocab_file <- file.path(BERT_PRETRAINED_DIR, "vocab.txt")
init_checkpoint <- file.path(BERT_PRETRAINED_DIR, "bert_model.ckpt")
bert_config_file <- file.path(BERT_PRETRAINED_DIR, "bert_config.json")
# |- analyze ----
text_to_process <- c(
"Impulse is equal to the change in momentum.",
"Changing momentum requires an impulse.",
"An impulse is like a push.",
"Impulse is force times time."
)
BERT_feats <- RBERT::extract_features(
examples = RBERT::make_examples_simple(text_to_process),
vocab_file = vocab_file,
bert_config_file = bert_config_file,
init_checkpoint = init_checkpoint,
layer_indexes = as.list(1:12),
batch_size = 2L
)
#> [1] "*** Example ***"
#> [1] "unique_id: 1"
#> [1] "tokens:"
#> [1] "[CLS] impulse is equal to the change in momentum . [SEP]"
#> [1] "input_ids:"
#> [1] "101 14982 2003 5020 2000 1996 2689 1999 11071 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "*** Example ***"
#> [1] "unique_id: 2"
#> [1] "tokens:"
#> [1] "[CLS] changing momentum requires an impulse . [SEP]"
#> [1] "input_ids:"
#> [1] "101 5278 11071 5942 2019 14982 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "*** Example ***"
#> [1] "unique_id: 3"
#> [1] "tokens:"
#> [1] "[CLS] an impulse is like a push . [SEP]"
#> [1] "input_ids:"
#> [1] "101 2019 14982 2003 2066 1037 5245 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "*** Example ***"
#> [1] "unique_id: 4"
#> [1] "tokens:"
#> [1] "[CLS] impulse is force times time . [SEP]"
#> [1] "input_ids:"
#> [1] "101 14982 2003 2486 2335 2051 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
BERT_feats
#> $layer_outputs
#> list()
#>
#> $attention_probs
#> list()
Session Info:
devtools::session_info()
#> - Session info ----------------------------------------------------------
#> setting value
#> version R version 3.6.0 (2019-04-26)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/Chicago
#> date 2019-09-07
#>
#> - Packages --------------------------------------------------------------
#> package * version date lib
#> assertthat 0.2.1 2019-03-21 [1]
#> backports 1.1.4 2019-04-10 [1]
#> base64enc 0.1-3 2015-07-28 [1]
#> callr 3.3.1 2019-07-18 [1]
#> cli 1.1.0 2019-03-19 [1]
#> crayon 1.3.4 2017-09-16 [1]
#> desc 1.2.0 2018-05-01 [1]
#> devtools 2.1.0 2019-07-06 [1]
#> digest 0.6.20 2019-07-04 [1]
#> evaluate 0.14 2019-05-28 [1]
#> fs 1.3.1 2019-05-06 [1]
#> glue 1.3.1 2019-03-12 [1]
#> highr 0.8 2019-03-20 [1]
#> htmltools 0.3.6 2017-04-28 [1]
#> jsonlite 1.6 2018-12-07 [1]
#> knitr 1.24 2019-08-08 [1]
#> magrittr 1.5 2014-11-22 [1]
#> memoise 1.1.0 2017-04-21 [1]
#> pkgbuild 1.0.3 2019-03-20 [1]
#> pkgload 1.0.2 2018-10-29 [1]
#> prettyunits 1.0.2 2015-07-13 [1]
#> processx 3.4.1 2019-07-18 [1]
#> ps 1.3.0 2018-12-21 [1]
#> R6 2.4.0 2019-02-14 [1]
#> RBERT * 0.1.0 2019-09-07 [1]
#> Rcpp 1.0.2 2019-07-25 [1]
#> remotes 2.1.0 2019-06-24 [1]
#> reticulate 1.13.0-9000 2019-09-07 [1]
#> rlang 0.4.0 2019-06-25 [1]
#> rmarkdown 1.14 2019-07-12 [1]
#> rprojroot 1.3-2 2018-01-03 [1]
#> sessioninfo 1.1.1 2018-11-05 [1]
#> stringi 1.4.3 2019-03-12 [1]
#> stringr 1.4.0 2019-02-10 [1]
#> tensorflow * 1.14.0.9000 2019-09-07 [1]
#> testthat 2.2.1 2019-07-25 [1]
#> tfruns 1.4 2018-08-25 [1]
#> usethis 1.5.1 2019-07-04 [1]
#> whisker 0.4 2019-08-28 [1]
#> withr 2.1.2 2018-03-15 [1]
#> xfun 0.8 2019-06-25 [1]
#> yaml 2.2.0 2018-07-25 [1]
#> source
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> Github (jonathanbratt/RBERT@8cf3b21)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> Github (rstudio/reticulate@5e0df26)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.1)
#> Github (rstudio/tensorflow@5185c97)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.0)
#>
#> [1] C:/Data/R/R-3.6.0/library
Currently, I attach tt_ids as an attribute to the tokenized input in tokenize_input. It feels like a misuse of attributes, but I also don't want to, say, pass the tt_ids as part of a list along with the tokenized input. We could just figure out the tt_ids later, and calculate them when we need them to run the model?
Update the "Introduction to RBERT" vignette to use a temp dir, so the code can be run as-is by someone following along. Also... delete vignettes/RBERT_intro.R? I'm not sure what that file is for.
Technically could be a separate issue, but maybe do the same /shared/ --> tmp fix in all the examples?
The filename says it all.
These were hastily written to get the branch into a working state, and should be refactored with more attention paid to speed and safety.
We really need a tiny checkpoint for tests. We currently include the smallest one we can (bert_base_uncased) via git-lfs, but I'd definitely like that to be smaller. Including it allows tests to run waaaaaaaaaaaay faster, though.
I think there are still a few tricks we could do to speed this up. Since this is the workhorse of the package, let's do everything we can to get there. Might need to look into base and/or data.table to replace some of the tidyr stuff I have in there.
I also want to look one more time at whether I'm "thinking" about the raw output from tensorflow correctly. In theory it should be sending back a max_seq_length x embedding_size matrix for each layer for each example. That shouldn't be TOO hard to tibble-ize. If the slowness is inside the actual BERT work, OK, we can't do anything about it... but I think a lot of it is still in this function.
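Under that assumption, the tibble-izing step for one example is a straightforward flatten (sketched in base R with toy sizes; column names are illustrative):

```r
# Flatten a list of per-layer matrices (max_seq_length x embedding_size)
# into one long data.frame with layer/token/dimension indices.
flatten_layers <- function(layers) {
  do.call(rbind, lapply(seq_along(layers), function(l) {
    m <- layers[[l]]
    data.frame(
      layer_index = l,
      token_index = rep(seq_len(nrow(m)), times = ncol(m)),
      dim         = rep(seq_len(ncol(m)), each = nrow(m)),
      value       = as.vector(m)  # column-major, matching the index columns
    )
  }))
}
```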
RBERTviz still has an attention processing step before visualization. The attention_arrays output of extract_features is purely for that use case, so just build it in.
Hi RBERT team,
Thank you all for putting this together. I'm working on transitioning to using BERT as my preferred set of embeddings, but since I program almost exclusively in R, I figured I was out of luck. Everything in this package seems tremendous, and I thank you for putting in such hard work.
As it stands, I'm having issues installing the package on my computer. When I run:
devtools::install_github(
"jonathanbratt/RBERT",
build_vignettes = TRUE
)
I'm asked to update several packages. Fine, not a big deal.
But every time we get past that, I receive this error:
"Error: Failed to install 'RBERT' from GitHub:
'local_makevars' is not an exported object from 'namespace:withr'"
I've updated withr to try and fix things (to no avail). This reminded me of an issue I ran into with rstan before (see here for a discussion: stan-dev/rstan#857) that I was ultimately able to work around.
I'm assuming my problems come from the fact that I'm stuck using a Windows computer. Here's my R version information if that helps:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
Is there anything I can do to work around this? Since I'm trying to use this for work, I cannot switch OS.
Update this issue with known things that have to happen before we feel ready for CRAN.
There are several tokenization conventions (e.g. the token used for padding, separating segments, etc.) that need to be specified when doing the wordpiece tokenization for BERT. Currently, some of these conventions are hard-coded in, while others are function parameters. We should decide on a consistent approach here.
Also, more clearly delineate what belongs in RBERT vs. wordpiece.
(macmillancontentscience/wordpiece#15)
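A single data object could serve as that consistent home for the conventions. The values below are the standard BERT special tokens (an assumption to verify against the vocab actually in use, rather than a definitive list):

```r
# One place for the tokenization conventions instead of scattered
# hard-coded strings and function parameters.
bert_special_tokens <- list(
  pad  = "[PAD]",   # padding token
  unk  = "[UNK]",   # unknown / out-of-vocabulary token
  cls  = "[CLS]",   # classification / sequence-start token
  sep  = "[SEP]",   # segment separator
  mask = "[MASK]"   # masked-LM token
)
```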
This can wait 'til after we require TF2, but... it'd be nice if we ran tensorflow::tf_version(), and prompted the user to install if they don't have the version we require.
Include the devtools instructions for installing RBERT.
I think we decided this would be a good thing to do.
The documentation for use_one_hot_embeddings says: "Logical; whether to use one-hot word embeddings or tf.embedding_lookup() for the word embeddings."
I can get most of that from the name of the parameter. What's the difference between those two options?
Also update .model_fn_builder_EF to inherit this param from extract_features (or vice versa?), so it's documented the same in both places. Actually, BertModel also uses it, and it looks like its "home" is embedding_lookup. The documentation there is slightly different, but neither really helps me grok what the difference is, or when I'd want it to be TRUE.
As an RBERT user, I'd like the output of extract_features to be tidy and more R-like, so that I can work with the output more conveniently.
We will have to update RBERTviz to expect the change, but I think a nested tibble would make this MUCH more convenient for R.
Note: If we add a recipes dependency, we'll inherit tibble and dplyr, so this wouldn't be a HUGE add on top of that. Importing tibble at a minimum feels like it should be fine for this. Alternatively I can keep it all in native dfs and move all the recipes stuff to a separate package, but I think we should aim to make this package as easy to work with as possible.
Just a note.
I've created https://github.com/bnosac/golgotha in order to easily use BERT embeddings in some downstream predictive models, after I tried RBERT and couldn't get the multilingual model to work.
It was also a trial to gauge the speed of getting these embeddings, to see what these models output, and to explore developing this directly against libtorch, bypassing Python.
I'm not sure yet if this should be inside RBERT or maybe integrated into tidymodels/textrecipes, but we should make it easy to extract features from text in some sort of standard form using BERT checkpoints (within a recipes pipeline).
The original BERT checkpoints released by Google are in a TensorFlow format.
It seems that most of the related work done by other teams is in the PyTorch implementation.
In particular, pre-trained models such as RoBERTa and DistilBERT have been released for PyTorch.
Many of these models are compatible with the BERT architecture, though possibly with different parameters or vocabularies. It would be great to be able to easily load these into RBERT.