jonathanbratt / rbert
Implementation of BERT in R
License: Apache License 2.0
run_classifier.R is mostly about fine-tuning BERT with a classifier head. RBERT is not quite working at this level yet; there are almost certainly bugs that would prevent fine-tuning from working in its present state.
Hi,
I am trying to run RBERT with TensorFlow on a small dataset. I have installed TensorFlow in a miniconda environment. Below is the code, which throws the error:
Sys.setenv(RETICULATE_PYTHON = "/Users/applemacbookpro/opt/miniconda3/envs/tensorflowa/bin/python")
#Make virtual environment in anaconda
reticulate::conda_list()[[1]][8] %>%
reticulate::use_condaenv(required = TRUE)
#Load the libraries
library(keras)
library(tidyverse)
library(stringr)
library(tidytext)
library(caret)
library(dplyr)
library(tm)
library(RBERT)
library(tensorflow)
library(reticulate)
#Install RBERT
devtools::install("/Users/applemacbookpro/Downloads/RBERT")
#Initiate BERT
BERT_PRETRAINED_DIR <- RBERT::download_BERT_checkpoint(model = "bert_base_uncased")
#Extract tokenized words from agency trainset
BERT_feats <- extract_features(
  examples = agency_trainset$agency,
  ckpt_dir = BERT_PRETRAINED_DIR,
  layer_indexes = 1:12
)
Error in py_call_impl(callable, dots$args, dots$keywords) :
RuntimeError: Evaluation error: ValueError: Tried to convert 'size' to a tensor and failed. Error: Cannot convert a partially known TensorShape to a Tensor: (128, ?).
Traceback:
stop(structure(list(message = "RuntimeError: Evaluation error: ValueError: Tried to convert 'size' to a tensor and failed. Error: Cannot convert a partially known TensorShape to a Tensor: (128, ?).",
call = py_call_impl(callable, dots$args, dots$keywords),
cppstack = structure(list(file = "", line = -1L, stack = c("1 reticulate.so 0x000000010773d3de _ZN4Rcpp9exceptionC2EPKcb + 222",
"2 reticulate.so 0x0000000107746245 _ZN4Rcpp4stopERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEE + 53", ...
13. python_function at call.py#21
12. fn at <string>#4
11. _call_model_fn at tpu_estimator.py#1524
10. call_without_tpu at tpu_estimator.py#1250
9. _model_fn at tpu_estimator.py#2470
8. _call_model_fn at estimator.py#1169
7. _call_model_fn at tpu_estimator.py#2186
6. predict at estimator.py#551
5. predict at tpu_estimator.py#2431
4. raise_errors at error_handling.py#128
3. predict at tpu_estimator.py#2437
2. result_iterator$`next`()
1. extract_features(examples = agency_trainset$agency, ckpt_dir = BERT_PRETRAINED_DIR, layer_indexes = 1:12)
Some tests require checkpoint files to be available. Start tests by downloading a checkpoint, and don't clean up till end of tests.
This takes priority over fixing for TF 1.14.
As an RBERT dev, I'd like the wordpiece tokenizer to be in its own, independently optimizable package, so that I don't have to think about it.
I would like to retrieve BERT word embeddings for entire paragraphs of more than 2 sentences. However, in the example below, where I have added a third sentence to the first "paragraph", I get the error in the title.
Does this mean that the third sentence has not been processed and has not contributed to the overall word embedding?
And what do I need to do to get a word embedding representing all three (and more) sentences in the first list?
text_to_process3 <- list(c("Impulse is equal to the change in momentum.",
"Changing momentum requires an impulse.", "A third sentence give a warning."),
c("An impulse is like a push.",
"Impulse is force times time."))
text_to_process3
BERT_feats <- extract_features(
examples = make_examples_simple(text_to_process3),
ckpt_dir = BERT_PRETRAINED_DIR,
layer_indexes = 1:12,
batch_size = 2L
)
Thank you for an awesome package!
As an RBERT user, it'd be convenient to be able to send text directly into extract_features, without having to first turn it into "examples." I.e., let's try to keep the example-ification internal. Hoping this won't be TOO difficult...
As an RBERT user, I'd like the tokenizer to be as fast as it can be, so that I don't have to wait for this step more than is absolutely necessary.
First thing to check: does keras::text_tokenizer (and friends) do what we need? If so, we should be able to save_text_tokenizer() when the model is downloaded for #51.
The tokenizer for a given model is deterministic (it only depends on the vocab file + whether it's cased). Producing the tokenizer takes 100x as long as loading a pre-processed tokenizer (about 4 s vs 40 ms for bert_base_uncased).
Save the tokenizer as part of the download process. If a model has a vocab but not a tokenizer, save a tokenizer once and then use it going forward (for backward compatibility with things that are already downloaded).
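The caching flow described above could look something like this sketch (make_tokenizer here is a stand-in for the real FullTokenizer constructor; the timing comments come from the estimate above for bert_base_uncased):

```r
# Sketch of the proposed tokenizer caching (hypothetical helper names;
# make_tokenizer stands in for RBERT's actual tokenizer constructor).
cached_tokenizer <- function(ckpt_dir, make_tokenizer) {
  cache_file <- file.path(ckpt_dir, "tokenizer.rds")
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # fast path: ~40 ms
  }
  # Slow path (~4 s): build from the vocab, then backfill the cache so
  # already-downloaded models pick up the tokenizer on first use.
  tokenizer <- make_tokenizer(file.path(ckpt_dir, "vocab.txt"))
  saveRDS(tokenizer, cache_file)
  tokenizer
}
```

The backfill branch gives the backward compatibility mentioned above: a model directory with a vocab but no saved tokenizer pays the slow path exactly once.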
On running the package for the first time, I am getting the below error message. Please can you offer some guidance? Thanks.
'rDownloading' is not recognized as an internal or external command,
operable program or batch file.
The layer_indexes parameter in extract_features was previously required to be a list, but currently must be an integer vector. We should allow either, to avoid breaking changes.
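A small shim (a hypothetical helper, not current RBERT API) could normalize both forms before use:

```r
# Accept layer_indexes as either a list (old behavior) or an integer
# vector (current behavior), returning an integer vector either way.
normalize_layer_indexes <- function(layer_indexes) {
  if (is.list(layer_indexes)) {
    layer_indexes <- unlist(layer_indexes)
  }
  as.integer(layer_indexes)
}
```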
We had to specify TF 1.11.0 to get the tests in test_modeling.R and test_run_classifier.R to work. What is it about those tests that's 1.11-specific?
Might be Windows-specific. Downloading "bert_large_uncased_wwm" failed (at the actual download.file step). Removing method = "libcurl" fixed it. I don't remember why we specify the method; need to try on different OSs. If the default ("auto") works everywhere, let's just do that.
extract_features, download_BERT_checkpoint, and probably some other functions use the model parameter, with a hard-coded list of models. Investigate listing those models in one place and automatically updating the formals of those functions. Ideally each function should still list them all in its documentation (and in RStudio autocomplete), so it'll take at least a little bit of fanciness. I'm pretty certain it's possible, though.
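One possible shape for this, sketched with stub functions (the model list below is illustrative, not exhaustive, and these are not the real RBERT definitions): keep a single character vector and either use it as the default for match.arg, or patch formals() programmatically.

```r
# Single source of truth for model names (illustrative subset).
.known_models <- c("bert_base_uncased", "bert_base_cased", "bert_large_uncased")

# Option 1: use the vector as the default, so match.arg validates it and
# the choices show up in documentation and autocomplete.
download_checkpoint_stub <- function(model = .known_models) {
  match.arg(model)
}

# Option 2: patch the formals of an existing function after the fact.
extract_features_stub <- function(model) model
formals(extract_features_stub)$model <- .known_models
```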
It doesn't complain right away, but if you run enough models, you get a message like:
WARNING:tensorflow:5 out of the last 6 calls to <function Model.make_predict_function..predict_function at 0x7fa8913872f0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
(@test-download_checkpoint.R#91) Error: testthat unit tests failed
@jonthegeek already has a fix for this, I believe.
In tokenizer and its methods, make the first argument (a) universal between the methods, and (b) less generic than "x".
Rather than make the user specify whether the vocabulary is cased, we should be able to infer this from the vocabulary itself with a very high degree of confidence.
The place to do this is probably in FullTokenizer (so anything upstream of that would lose do_lower_case as a parameter).
For example, we could add:
do_lower_case <- !any(grepl("^[A-Z]", inv_vocab))
after the line
inv_vocab <- names(vocab)
in tokenization.R.
The above assumes that a vocabulary is cased iff it contains at least one token that begins with an uppercase letter. This ensures that we skip any special tokens like [SEP] or [CLS]. Technically, somebody could perversely construct a vocab that is cased, but no tokens start with capitals. The above code would classify such a vocabulary as "uncased", though I believe the correct classification in that case (heh) would be "WTF".
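The heuristic is easy to check on toy vocabularies (toy data, not real BERT vocab files; note that special tokens like [CLS] and [SEP] start with "[", so the pattern skips them as intended):

```r
# A vocabulary is treated as cased iff at least one token starts with an
# uppercase letter. Bracketed special tokens never match "^[A-Z]".
is_cased_vocab <- function(inv_vocab) {
  any(grepl("^[A-Z]", inv_vocab))
}

cased_vocab   <- c("[CLS]", "[SEP]", "the", "The", "##ing")
uncased_vocab <- c("[CLS]", "[SEP]", "the", "##ing")
```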
extract_features should be able to figure out the vocab, config, and checkpoint file paths, given the directory path.
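A minimal sketch of that inference (hypothetical helper; the file names assumed here follow Google's released BERT checkpoints):

```r
# Given a checkpoint directory, build the three standard file paths.
locate_checkpoint_files <- function(ckpt_dir) {
  list(
    vocab_file       = file.path(ckpt_dir, "vocab.txt"),
    bert_config_file = file.path(ckpt_dir, "bert_config.json"),
    init_checkpoint  = file.path(ckpt_dir, "bert_model.ckpt")
  )
}
```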
For completeness, would be good to return the bare token embeddings before any transformer layers along with the layer outputs.
Doesn't have to be perfect yet, but let's get a pass!
Sometimes it is useful to know the tokenization for a particular piece of text without running the whole BERT model. This is possible, but currently not very easy to do in RBERT. Would be nice to have a function something like tokenize_text("some text to tokenize", ckpt_dir).
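For reference, the greedy longest-match wordpiece step such a function would expose is small enough to sketch in pure R (toy vocab; this is an illustration of the algorithm, not RBERT's actual implementation):

```r
# Greedy longest-match wordpiece tokenization of a single
# whitespace-split word against a vocabulary of pieces.
wordpiece_one <- function(word, vocab) {
  pieces <- character(0)
  start <- 1
  while (start <= nchar(word)) {
    end <- nchar(word)
    piece <- NULL
    while (end >= start) {
      cand <- substr(word, start, end)
      if (start > 1) cand <- paste0("##", cand)  # continuation marker
      if (cand %in% vocab) { piece <- cand; break }
      end <- end - 1
    }
    if (is.null(piece)) return("[UNK]")  # no piece matches: unknown token
    pieces <- c(pieces, piece)
    start <- end + 1
  }
  pieces
}
```

With a vocab containing "im" and "##pulse", wordpiece_one("impulse", vocab) splits the word into those two pieces.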
Out of habit, a lot of things that should be in \code{} in our docs are instead in backticks (e.g., line 21 of tokenization.R). Search through and replace those with \code{} (but be sure to check manually, because there may be legitimate backticks within the code itself).
As an RBERT user, I'd like to extract features from text without having to overthink the model backend. Allow the user to specify a "model" as in download_BERT_checkpoint, and then infer everything else from that.
Note that this will change the order of the arguments again (adding "model" as the second argument).
If the user sends us the same text 100 times, we shouldn't take the time to run BERT on it 100 times. Uniquify, then join at the end to get back the full list.
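The uniquify-then-join step could look like this sketch (run_bert here stands in for the expensive feature-extraction step; hypothetical helper name):

```r
# Run the expensive step once per unique input, then expand the results
# back out to match the original (possibly duplicated) input order.
extract_unique <- function(texts, run_bert) {
  u <- unique(texts)
  feats <- run_bert(u)      # expensive: one call per unique text
  feats[match(texts, u)]    # join back to the full input
}
```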
Somewhere we should add documentation about the available models. Maybe make a data object of names and URLs (and use that in .get_model_url), or just include the info in download_BERT_checkpoint.
Currently, we test with BERT_base, but we may as well use the smallest available model.
As an RBERT user, I'd like downloaded checkpoints to be "stable," so that I can confidently use them for future data prep.
Work out a way to do these things:
Probably just modify make_examples_simple to accept two-sequence examples.
I'm getting this when doing the example in the README. Any ideas or advice you can share on this? I'm running this locally on a Windows machine, so no TPU here.
> BERT_feats <- extract_features(
+ examples = make_examples_simple(text_to_process2),
+ vocab_file = vocab_file,
+ bert_config_file = bert_config_file,
+ init_checkpoint = init_checkpoint,
+ layer_indexes = 1:12,
+ batch_size = 2L
+ )
Error in py_get_attr_impl(x, name, silent) :
AttributeError: module 'tensorflow.contrib.tpu' has no attribute 'InputPipelineConfig'
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252
[3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RBERT_0.1.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 compiler_3.5.2 prettyunits_1.0.2 base64enc_0.1-3 remotes_2.0.2
[6] tools_3.5.2 testthat_2.0.1 digest_0.6.20 pkgbuild_1.0.2 pkgload_1.0.2
[11] jsonlite_1.6 memoise_1.1.0 debugme_1.1.0 lattice_0.20-38 rlang_0.4.0
[16] Matrix_1.2-15 cli_1.1.0 rstudioapi_0.9.0 curl_3.3 yaml_2.2.0
[21] xfun_0.9 withr_2.1.2 knitr_1.24 desc_1.2.0 fs_1.3.1
[26] devtools_2.0.1 rprojroot_1.3-2 grid_3.5.2 reticulate_1.12 glue_1.3.1
[31] R6_2.4.0 sessioninfo_1.1.1 whisker_0.4 callr_2.0.2 purrr_0.3.2
[36] magrittr_1.5 backports_1.1.4 tfruns_1.4 usethis_1.4.0 assertthat_0.2.1
[41] tensorflow_1.14.0 crayon_1.3.4
Reprex below.
library(RBERT)
# |- Python ----
reticulate::use_condaenv("r-reticulate")
#> Warning in normalizePath(path.expand(path), winslash, mustWork):
#> path[1]="C:\Users\leungi\AppData\Local\Continuum\anaconda3\envs\fenics/
#> python.exe": The system cannot find the file specified
reticulate::py_config()
#> python: C:\Users\leungi\AppData\Local\Continuum\anaconda3\envs\r-reticulate\python.exe
#> libpython: C:/Users/leungi/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python36.dll
#> pythonhome: C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\envs\R-RETI~1
#> version: 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 14:00:49) [MSC v.1915 64 bit (AMD64)]
#> Architecture: 64bit
#> numpy: C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\envs\R-RETI~1\lib\site-packages\numpy
#> numpy_version: 1.17.2
#>
#> python versions found:
#> C:\Users\leungi\AppData\Local\Continuum\anaconda3\envs\r-reticulate\python.exe
#> C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\envs\R-RETI~1\python.exe
#> C:\Users\leungi\AppData\Local\CONTIN~1\ANACON~1\python.exe
#> C:\PROGRA~2\MIB055~1\Shared\PYTHON~1\\python.exe
#> C:\Users\leungi\AppData\Local\CONTIN~1\MINICO~1\python.exe
#> C:\Users\leungi\AppData\Local\Continuum\miniconda3\python.exe
#> C:\Users\leungi\AppData\Local\Continuum\miniconda3\envs\r-tensorflow\python.exe
#> C:\Users\leungi\AppData\Local\conda\conda\envs\tensorflow_env\python.exe
# |- model ----
# path to downloaded BERT checkpoint
BERT_PRETRAINED_DIR <- file.path(
"output_data/",
"BERT_checkpoints",
"uncased_L-12_H-768_A-12"
)
if (!dir.exists(BERT_PRETRAINED_DIR)) {
# Download pre-trained BERT model.
RBERT::download_BERT_checkpoint(
model = "bert_base_uncased",
destination = "output_data/"
)
}
#> Warning in dir.create(checkpoint_dir): cannot create dir 'output_data\
#> \BERT_checkpoints', reason 'No such file or directory'
#> [1] "C:\\Users\\leungi\\AppData\\Local\\Temp\\RtmpMhv1E0\\reprex243242916591\\output_data\\BERT_checkpoints\\uncased_L-12_H-768_A-12"
vocab_file <- file.path(BERT_PRETRAINED_DIR, "vocab.txt")
init_checkpoint <- file.path(BERT_PRETRAINED_DIR, "bert_model.ckpt")
bert_config_file <- file.path(BERT_PRETRAINED_DIR, "bert_config.json")
# |- analyze ----
text_to_process <- c(
"Impulse is equal to the change in momentum.",
"Changing momentum requires an impulse.",
"An impulse is like a push.",
"Impulse is force times time."
)
BERT_feats <- RBERT::extract_features(
examples = RBERT::make_examples_simple(text_to_process),
vocab_file = vocab_file,
bert_config_file = bert_config_file,
init_checkpoint = init_checkpoint,
layer_indexes = as.list(1:12),
batch_size = 2L
)
#> [1] "*** Example ***"
#> [1] "unique_id: 1"
#> [1] "tokens:"
#> [1] "[CLS] impulse is equal to the change in momentum . [SEP]"
#> [1] "input_ids:"
#> [1] "101 14982 2003 5020 2000 1996 2689 1999 11071 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "*** Example ***"
#> [1] "unique_id: 2"
#> [1] "tokens:"
#> [1] "[CLS] changing momentum requires an impulse . [SEP]"
#> [1] "input_ids:"
#> [1] "101 5278 11071 5942 2019 14982 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "*** Example ***"
#> [1] "unique_id: 3"
#> [1] "tokens:"
#> [1] "[CLS] an impulse is like a push . [SEP]"
#> [1] "input_ids:"
#> [1] "101 2019 14982 2003 2066 1037 5245 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "*** Example ***"
#> [1] "unique_id: 4"
#> [1] "tokens:"
#> [1] "[CLS] impulse is force times time . [SEP]"
#> [1] "input_ids:"
#> [1] "101 14982 2003 2486 2335 2051 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_mask:"
#> [1] "1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
#> [1] "input_type_ids:"
#> [1] "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
BERT_feats
#> $layer_outputs
#> list()
#>
#> $attention_probs
#> list()
Session Info:
devtools::session_info()
#> - Session info ----------------------------------------------------------
#> setting value
#> version R version 3.6.0 (2019-04-26)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/Chicago
#> date 2019-09-07
#>
#> - Packages --------------------------------------------------------------
#> package * version date lib
#> assertthat 0.2.1 2019-03-21 [1]
#> backports 1.1.4 2019-04-10 [1]
#> base64enc 0.1-3 2015-07-28 [1]
#> callr 3.3.1 2019-07-18 [1]
#> cli 1.1.0 2019-03-19 [1]
#> crayon 1.3.4 2017-09-16 [1]
#> desc 1.2.0 2018-05-01 [1]
#> devtools 2.1.0 2019-07-06 [1]
#> digest 0.6.20 2019-07-04 [1]
#> evaluate 0.14 2019-05-28 [1]
#> fs 1.3.1 2019-05-06 [1]
#> glue 1.3.1 2019-03-12 [1]
#> highr 0.8 2019-03-20 [1]
#> htmltools 0.3.6 2017-04-28 [1]
#> jsonlite 1.6 2018-12-07 [1]
#> knitr 1.24 2019-08-08 [1]
#> magrittr 1.5 2014-11-22 [1]
#> memoise 1.1.0 2017-04-21 [1]
#> pkgbuild 1.0.3 2019-03-20 [1]
#> pkgload 1.0.2 2018-10-29 [1]
#> prettyunits 1.0.2 2015-07-13 [1]
#> processx 3.4.1 2019-07-18 [1]
#> ps 1.3.0 2018-12-21 [1]
#> R6 2.4.0 2019-02-14 [1]
#> RBERT * 0.1.0 2019-09-07 [1]
#> Rcpp 1.0.2 2019-07-25 [1]
#> remotes 2.1.0 2019-06-24 [1]
#> reticulate 1.13.0-9000 2019-09-07 [1]
#> rlang 0.4.0 2019-06-25 [1]
#> rmarkdown 1.14 2019-07-12 [1]
#> rprojroot 1.3-2 2018-01-03 [1]
#> sessioninfo 1.1.1 2018-11-05 [1]
#> stringi 1.4.3 2019-03-12 [1]
#> stringr 1.4.0 2019-02-10 [1]
#> tensorflow * 1.14.0.9000 2019-09-07 [1]
#> testthat 2.2.1 2019-07-25 [1]
#> tfruns 1.4 2018-08-25 [1]
#> usethis 1.5.1 2019-07-04 [1]
#> whisker 0.4 2019-08-28 [1]
#> withr 2.1.2 2018-03-15 [1]
#> xfun 0.8 2019-06-25 [1]
#> yaml 2.2.0 2018-07-25 [1]
#> source
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> Github (jonathanbratt/RBERT@8cf3b21)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> Github (rstudio/reticulate@5e0df26)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.0)
#> CRAN (R 3.6.1)
#> Github (rstudio/tensorflow@5185c97)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.1)
#> CRAN (R 3.6.0)
#>
#> [1] C:/Data/R/R-3.6.0/library
Currently, I attach tt_ids as an attribute to the tokenized input in tokenize_input. It feels like a misuse of attributes, but I also don't want to, say, pass the tt_ids as part of a list along with the tokenized input. We could just figure out the tt_ids later, and calculate them when we need them to run the model?
Update the "Introduction to RBERT" vignette to use a temp dir, so the code can be run as-is by someone following along. Also... delete vignettes/RBERT_intro.R? I'm not sure what that file is for.
Technically could be a separate issue, but maybe do the same /shared/ --> tmp fix in all the examples?
The filename says it all.
These were hastily written to get the branch into a working state, and should be refactored with more attention paid to speed and safety.
We really need a tiny checkpoint for tests. We currently include the smallest one we can (bert_base_uncased) via git-lfs, but I'd definitely like that to be smaller. Including it allows tests to run waaaaaaaaaaaay faster, though.
I think there are still a few tricks we could do to speed this up. Since this is the workhorse of the package, let's do everything we can to get there. Might need to look into base and/or data.table to replace some of the tidyr stuff I have in there.
I also want to look one more time at whether I'm "thinking" about the raw output from tensorflow correctly. In theory it should be sending back a max_seq_length x embedding_size matrix for each layer for each example. That shouldn't be TOO hard to tibble-ize. If the slowness is inside the actual BERT work, OK, we can't do anything about it... but I think a lot of it is still in this function.
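Under that assumption, the tibble-izing step for one example is a straightforward flatten (sketched in base R with toy sizes; column names are illustrative):

```r
# Flatten a list of per-layer matrices (max_seq_length x embedding_size)
# into one long data.frame with layer/token/dimension indices.
flatten_layers <- function(layers) {
  do.call(rbind, lapply(seq_along(layers), function(l) {
    m <- layers[[l]]
    data.frame(
      layer_index = l,
      token_index = rep(seq_len(nrow(m)), times = ncol(m)),
      dim         = rep(seq_len(ncol(m)), each = nrow(m)),
      value       = as.vector(m)  # column-major, matching the index columns
    )
  }))
}
```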
RBERTviz still has an attention processing step before visualization. The attention_arrays output of extract_features is purely for that use case, so just build it in.
Hi RBERT team,
Thank you all for putting this together. I'm working on transitioning to using BERT as my preferred set of embeddings, but since I program almost exclusively in R, I figured I was out of luck. Everything in this package seems tremendous, and I thank you for putting in such hard work.
As it stands, I'm having issues installing the package on my computer. When I run:
devtools::install_github(
"jonathanbratt/RBERT",
build_vignettes = TRUE
)
I'm asked to update several packages. Fine, not a big deal.
But every time we get past that, I receive this error:
"Error: Failed to install 'RBERT' from GitHub:
'local_makevars' is not an exported object from 'namespace:withr'"
I've updated withr to try and fix things (to no avail). This reminded me of an issue I ran into with rstan before (see here for a discussion: stan-dev/rstan#857) that I was ultimately able to work around.
I'm assuming my problems come from the fact that I'm stuck using a Windows computer. Here's my R version information if that helps:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
Is there anything I can do to work around this? Since I'm trying to use this for work, I cannot switch OS.
Update this issue with known things that have to happen before we feel ready for CRAN.
There are several tokenization conventions (e.g. the token used for padding, separating segments, etc.) that need to be specified when doing the wordpiece tokenization for BERT. Currently, some of these conventions are hard-coded in, while others are function parameters. We should decide on a consistent approach here.
Also, more clearly delineate what belongs in RBERT vs. wordpiece.
(macmillancontentscience/wordpiece#15)
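A single data object could serve as that consistent home for the conventions. The values below are the standard BERT special tokens (an assumption to verify against the vocab actually in use, rather than a definitive list):

```r
# One place for the tokenization conventions instead of scattered
# hard-coded strings and function parameters.
bert_special_tokens <- list(
  pad  = "[PAD]",   # padding token
  unk  = "[UNK]",   # unknown / out-of-vocabulary token
  cls  = "[CLS]",   # classification / sequence-start token
  sep  = "[SEP]",   # segment separator
  mask = "[MASK]"   # masked-LM token
)
```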
This can wait 'til after we require TF2, but... it'd be nice if we ran tensorflow::tf_version(), and prompted the user to install if they don't have the version we require.
Include the devtools instructions for installing RBERT.
I think we decided this would be a good thing to do.
The documentation for use_one_hot_embeddings says: "Logical; whether to use one-hot word embeddings or tf.embedding_lookup() for the word embeddings."
I can get most of that from the name of the parameter. What's the difference between those two options?
Also update .model_fn_builder_EF to inherit this param from extract_features (or vice versa?), so it's documented the same in both places. Actually, BertModel also uses it, and it looks like its "home" is embedding_lookup. The documentation there is slightly different, but neither really helps me grok what the difference is, or when I'd want it to be TRUE.
As an RBERT user, I'd like the output of extract_features to be tidy and more R-like, so that I can work with the output more conveniently.
We will have to update RBERTviz to expect the change, but I think a nested tibble would make this MUCH more convenient for R.
Note: If we add a recipes dependency, we'll inherit tibble and dplyr, so this wouldn't be a HUGE add on top of that. Importing tibble at a minimum feels like it should be fine for this. Alternatively I can keep it all in native dfs and move all the recipes stuff to a separate package, but I think we should aim to make this package as easy to work with as possible.
Just a note.
I've created https://github.com/bnosac/golgotha in order to easily use BERT embeddings in some downstream predictive models, after I tried RBERT and couldn't get the multilingual model to work.
It was also a trial to gauge the speed of getting these embeddings, to see what these models output, and to explore developing this directly against libtorch, bypassing Python.
I'm not sure yet if this should be inside RBERT or maybe integrated into tidymodels/textrecipes, but we should make it easy to extract features from text in some sort of standard form using BERT checkpoints (within a recipes pipeline).
The original BERT checkpoints released by Google are in a TensorFlow format.
It seems that most of the related work done by other teams is in the PyTorch implementation.
In particular, pre-trained models such as RoBERTa and DistilBERT have been released for PyTorch.
Many of these models are compatible with the BERT architecture, though possibly with different parameters or vocabularies. It would be great to be able to easily load these into RBERT.