kenhanscombe / ukbtools

An R package to manipulate and explore UK Biobank data

Home Page: https://kenhanscombe.github.io/ukbtools/

ukbtools's Issues

Fewer observations than expected

Hello! I have the same error. I have about 500,000 observations (eid), but "my_ukb_data" only contains 12152 observations.

Besides that, I get the warning:

my_ukb_data <- ukb_df("ukbxxxxx")
Warning: data_frame() is deprecated as of tibble 1.1.0.
Please use tibble() instead.
This warning is displayed once every 8 hours.
Call lifecycle::last_warnings() to see where this warning was generated.
Warning in data.table::fread(input = tab_location, sep = "\t", header = TRUE, :
Discarded single-line footer: <<4356529 1 NA NA NA 0 1941 1 NA NA NA 22 NA NA NA 24 NA NA NA 79 NA NA NA 96 NA NA NA 164 NA NA NA 135 NA NA NA 7 2009-06-03 NA NA NA 11010 NA NA NA 6 NA NA NA 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA >>

Originally posted by @FeFortti in #13 (comment)

problem with ukb_context

Hello,
I'm currently using the ukb_context function and I keep getting the following error:

Error: No expression to parse
Call `rlang::last_error()` to see a backtrace.

I've tried using dplyr::select and it appears to grab all the correct columns for the demographics data, and the nonmiss.var is definitely in the referenced dataset. I even tried setting all the demographic columns manually, but still no luck. Here is the code I ran. Any insight would be appreciated:

ukbxxxx_data <- ukb_df("ukbxxxx_symp", path = "/Users/amandarodrigue/Dropbox/Biobank_symp/Psychotic_experiences/")
ukbyyyy_data <-ukb_df("ukbyyyy_symp", path = "/Users/amandarodrigue/Dropbox/Biobank_symp/Psychotic_experiences/")

full_data<-ukb_df_full_join(ukbxxxx_data, ukbyyyy_data)

ukb_context(full_data, nonmiss.var = "volume_of_brainseg_whole_brain_f26514_2_0",
            bar.position = "fill", sex.var = "sex_f31_0_0",
            age.var = "age_when_attended_assessment_centre_f21003_0_0",
            socioeconomic.var = "townsend_deprivation_index_at_recruitment_f189_0_0",
            ethnicity.var = "ethnic_background_f21000_0_0",
            employment.var = "current_employment_status_corrected_f20119_0_0",
            centre.var = "uk_biobank_assessment_centre_f54_0_0")

I've also tried the ukb_context with a subset of data and get the same error.

fread(): found and resolved improper quoting out of sample

Hi Ken,
thank you for this package, I am using it extensively on my project and it's really great.

I have encountered an issue with a data point in my UKB application while importing the basket through ukb_df(), which causes a warning from fread() (with relative nonzero exit code) and a desperate attempt to fix the entry:

Warning message:
In data.table::fread(input = tab_location, sep = "\t", header = TRUE,  :
  Found and resolved improper quoting out-of-sample. First healed line 459278: <
<5893037   [...]>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

The culprit is what I suspect to be an entry from the medical records, containing several escaped quotation marks. Removing them with sed -e "s|\\\'||g" -e 's|\\\"||g' data.tab solves the issue.
Not sure if you plan on doing something about it (as the solution is relatively trivial), but I thought it was worth letting you know.
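
The warning text itself suggests disabling quote handling. A minimal sketch of that idea on a toy file follows. It uses base R's read.delim rather than ukb_df() or fread(), and the file contents and the read_tab_unquoted name are made up for illustration. Whether ukb_df() lets you pass such an option through is not confirmed anywhere in this thread.

```r
# Toy tab-delimited file with an escaped, unbalanced quotation mark in one field.
tab_path <- tempfile(fileext = ".tab")
writeLines(c("eid\tnote",
             "1\tfine",
             "2\tsaid \\\"hello",
             "3\tfine"), tab_path)

# Hypothetical helper: read a tab-delimited file with quote handling switched off,
# so quotation marks inside fields are treated as ordinary characters.
read_tab_unquoted <- function(path) {
  read.delim(path, quote = "", stringsAsFactors = FALSE)
}

dat <- read_tab_unquoted(tab_path)

# Compare the number of rows read with the number of data lines in the file.
n_lines <- length(readLines(tab_path)) - 1L  # minus the header line
print(c(rows_read = nrow(dat), data_lines = n_lines))
```

If the two counts differ on a real basket, the file has lines that the reader is merging or dropping, which is worth checking before any analysis.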

ukb_df error

Ken
I installed ukbtools in R and initially it worked with your sample dataset. Then it suddenly stopped working and I get the error
cannot find function ukb_df
How can I fix this?
Steven
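
One common reason for a "could not find function" message after a package worked earlier is a new R session in which the package has not been attached yet. That is only a guess for this report. The sketch below checks for it with base R; is_attached is a hypothetical helper name, not part of ukbtools.

```r
# Hypothetical helper: TRUE if a package is attached to the search path.
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

if (!is_attached("ukbtools")) {
  if (requireNamespace("ukbtools", quietly = TRUE)) {
    # Installed but not attached: attach it for this session.
    library(ukbtools)
  } else {
    # Not installed in any library this session can see.
    message("ukbtools is not installed in: ",
            paste(.libPaths(), collapse = ", "))
  }
}
```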

Memory issues with ukb_df_full_join()

Hi there,
Not sure how to report this and/or what kind of information to add, but I am encountering an awful lot of memory issues when using the function ukb_df_full_join() in an HPC environment, from memory leaks (OOM with ~100GB available per node) to segfaults.

Happy to provide all the information needed from my environment

Cheers

Only Gender context bar chart produced

I am keen to use this package for my work with UK Biobank data. Using the example to explore the context of my data, I get the following error when I run this code:

> subgroup_of_interest <- (my_ukb_data$body_mass_index_bmi_f21001_0_0>=25)
> ukb_context(my_ukb_data, nonmiss.var = NULL, subset.var = subgroup_of_interest,
+             bar.position = "fill", sex.var = "sex_f31_0_0",
+             age.var = "age_at_recruitment_f21022_0_0",
+             socioeconomic.var = "townsend_deprivation_index_at_recruitment_f189_0_0",
+             ethnicity.var = "ethnic_background_f21000_0_0",
+             employment.var = "current_employment_status_f6142_0_0",
+             centre.var = "uk_biobank_assessment_centre_f54_0_0")
Error in unit(x, default.units) : 'x' and 'units' must have length > 0
In addition: Warning messages:
1: Groups with fewer than two data points have been dropped. 
2: Groups with fewer than two data points have been dropped. 
3: Groups with fewer than two data points have been dropped. 
4: Groups with fewer than two data points have been dropped. 
5: Groups with fewer than two data points have been dropped.

I can only see the gender bar chart, not the ethnicity, Townsend etc. My variables are:

sex_f31_0_0
age_at_recruitment_f21022_0_0
ethnic_background_f21000_0_0
townsend_deprivation_index_at_recruitment_f189_0_0
uk_biobank_assessment_centre_f54_0_0
current_employment_status_f6142_0_0

ukb_context doesn't work with new variables in dataset

Hello, I get this error when I try to follow the example of the vignette:

subgroup_of_interest <- (my_ukb_data$body_mass_index_bmi_f21001_0_0 >= 25) 
ukb_context(my_ukb_data, subset.var = subgroup_of_interest)

Error: More than one expression parsed
Call `rlang::last_error()` to see a backtrace

I identified that the problem comes from the variable sex.var in the function ukb_context. The pattern sex.var = "^sex.*0_0" matches several variables:

Browse[3]> sex.var
[1] "sex_f31_0_0"                                                                       
[2] "sexually_molested_as_a_child_f20490_0_0"                                           
[3] "sexual_interference_by_partner_or_expartner_without_consent_as_an_adult_f20524_0_0"
[4] "sex_chromosome_aneuploidy_f22019_0_0"                                              
[5] "sex_inference_x_probeintensity_f22022_0_0"                                         
[6] "sex_inference_y_probeintensity_f22023_0_0"                                         
[7] "sex_of_baby_f41226_0_0"
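
The multiple matches can be reproduced with the column names listed above. This is a small base R sketch showing why the broad pattern returns seven names and how a fully anchored pattern returns one. Passing the exact column name as sex.var, as other posts in this list do, is a possible workaround, though it has not been verified here against any particular ukbtools version.

```r
# Column names copied from the debugger output above.
cols <- c(
  "sex_f31_0_0",
  "sexually_molested_as_a_child_f20490_0_0",
  "sexual_interference_by_partner_or_expartner_without_consent_as_an_adult_f20524_0_0",
  "sex_chromosome_aneuploidy_f22019_0_0",
  "sex_inference_x_probeintensity_f22022_0_0",
  "sex_inference_y_probeintensity_f22023_0_0",
  "sex_of_baby_f41226_0_0"
)

# The broad pattern matches every name that starts with "sex" and contains "0_0".
broad <- grep("^sex.*0_0", cols, value = TRUE)

# Anchoring both ends on the full field name leaves a single column.
narrow <- grep("^sex_f31_0_0$", cols, value = TRUE)

print(length(broad))
print(narrow)
```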

rlang_error from ukb_context

I have been trying your example commands
subgroup_of_interest <- (my_ukb_data$body_mass_index_bmi_0_0 >= 25)
ukb_context(my_ukb_data, subset.var = subgroup_of_interest)

but I keep getting following error:
<error/rlang_error>
More than one expression parsed
Backtrace:
x

-ukbtools::ukb_context(my_ukb_data, subset.var = subgroup_of_interest)
+-ukbtools:::multiplot(...)
+-ggplot2::ggplot(data, aes_string(sex.var, fill = fill.var))
+-ggplot2:::ggplot.default(data, aes_string(sex.var, fill = fill.var))
-ggplot2::aes_string(sex.var, fill = fill.var)
\-base::lapply(...)
  \-ggplot2:::FUN(X[[i]], ...)
    \-rlang::parse_expr(x)

What am I doing wrong?

Error in html_table_nodes[[data.pos]] : subscript out of bounds

Hello, Thank you for creating the ukbtools package.
The current issue I am facing is
Error in html_table_nodes[[data.pos]] : subscript out of bounds
In addition: Warning message:
XML content does not seem to be XML: './ukb26###.html'
I have moved all the files to the same drive, and re-converted the files using conv.
Any ideas what may be causing this?
Thanks heaps, Mahima

Originally posted by @mkapoor123 in #1 (comment)
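
The warning shows the .html file being looked up as './ukb26###.html', so one thing worth ruling out is that a file of the fileset is not where R is looking. A minimal base R sketch of that check follows. It assumes the fileset consists of the .tab, .r and .html files mentioned in other posts in this list, and missing_fileset_files is a hypothetical helper, not a ukbtools function.

```r
# Hypothetical helper: return the fileset files that do not exist under `path`.
missing_fileset_files <- function(fileset, path = ".") {
  files <- file.path(path, paste0(fileset, c(".tab", ".r", ".html")))
  files[!file.exists(files)]
}

# Toy demonstration in a temporary directory that only holds the .tab file.
demo_dir <- tempfile("ukb_demo_")
dir.create(demo_dir)
file.create(file.path(demo_dir, "ukb0000.tab"))

missing <- missing_fileset_files("ukb0000", path = demo_dir)
print(basename(missing))
```

An empty result means all three files were found. Anything else lists what to move or re-create before calling ukb_df().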

importing the data partially after ukb_df function

Thanks Ken for creating this package. I tried to import my data using the ukb_df function, but the resulting data frame has only 16089 observations, while my dataset has about 500,000. Do you know what could be the reason for this discrepancy?

Maher,
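
To pin down how many rows are being lost, the number of data lines in the .tab file can be compared with nrow() of the imported data frame. The sketch below is a base R illustration on a toy file. count_data_rows is a hypothetical helper, and it assumes one participant per line with a single header line.

```r
# Hypothetical helper: count the data lines in a tab file without loading it whole.
count_data_rows <- function(tab_path, chunk = 100000L) {
  con <- file(tab_path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk, warn = FALSE)
    if (length(lines) == 0L) break
    n <- n + length(lines)
  }
  max(n - 1L, 0L)  # subtract the header line
}

# Toy file: one header line and four data lines.
toy_tab <- tempfile(fileext = ".tab")
writeLines(c("eid\tx", "1\ta", "2\tb", "3\tc", "4\td"), toy_tab)

n_rows <- count_data_rows(toy_tab, chunk = 2L)
print(n_rows)
```

On real data, compare count_data_rows("ukbxxxxx.tab") with nrow(my_ukb_data). A large gap points at the reader stopping early rather than at the basket itself.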

ukb_icd_freq_by error

Hi,

Using the below command,

ukb_icd_freq_by(all_data, reference.var = "sex_f31_0_0", n.groups = 10,icd.code = c("^(F00)","^(F01)","^(F02)"), icd.labels = c("disease1", "disease2","disease3"), plot.title = "", legend.col = 1, legend.pos = "right", icd.version = 10, freq.plot = FALSE, reference.lab = "Reference variable", freq.lab = "UKB disease frequency")

I get the following error

Error in if (!(icd.code == c("^(I2[0-5])", "^(I6[0-9])", "^(J09|J1[0-9]|J2[0-2]|P23|U04)"))) { :
the condition has length > 1

I can't seem to understand the problem.
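
The error message points at an `if` whose condition compares two character vectors with `==`. That comparison is element-wise, so it yields one logical per element, and recent versions of R refuse a condition that is longer than one. A small base R sketch of the difference, using the codes from this post (the variable names are illustrative only):

```r
# Default patterns as shown in the error message, and the user's patterns.
default_codes <- c("^(I2[0-5])", "^(I6[0-9])", "^(J09|J1[0-9]|J2[0-2]|P23|U04)")
user_codes <- c("^(F00)", "^(F01)", "^(F02)")

# Element-wise comparison: a logical vector of length 3, not usable in `if`.
elementwise <- user_codes == default_codes

# A single TRUE/FALSE answer to "are these the same vectors?".
is_default <- identical(user_codes, default_codes)

print(elementwise)
print(is_default)
```

This only explains the message. Any fix would have to be made inside ukb_icd_freq_by itself, so the sketch is not a workaround for the installed package.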

ukb_gen_samples_to_remove does not exist?

Hello,

The ukbtools manual https://kenhanscombe.github.io/ukbtools/ describes the function ukb_gen_samples_to_remove

However when I try to use it, I get the following error:
Error in ukb_gen_samples_to_remove(my_relatedness_data, ukb_with_data = pheno$anxiety_self) :
could not find function "ukb_gen_samples_to_remove"

Other functions such as ukb_gen_rel_count work fine, though. I've re-installed ukbtools (just in case it was using an older version) but still not working.

Thanks,
J

Error in .subset2(x, i, exact = exact) : subscript out of bounds

I'm getting an error from the ukb_context() function.

library(ukbtools)

load("ukb34514_data.rda", verbose = T)
dim(ukb34514)

[1] 502527 3862

Now I supply a logical vector with subset.var:

subgroup_of_interest <- (ukb34514$body_mass_index_bmi_f21001_0_0 >= 25)
head(subgroup_of_interest)

[1] TRUE FALSE TRUE TRUE TRUE FALSE

length(subgroup_of_interest)

[1] 502527

For some reason though I get the following error:

ukb_context(ukb34514, subset.var = subgroup_of_interest)

Error in .subset2(x, i, exact = exact) : subscript out of bounds

The sessionInfo() is:

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ukbtools_0.11.3.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1         compiler_3.5.1     pillar_1.4.2       iterators_1.0.10   prettyunits_1.0.2  remotes_2.0.4     
 [7] tools_3.5.1        zeallot_0.1.0      digest_0.6.20      packrat_0.5.0      pkgbuild_1.0.3     pkgload_1.0.2     
[13] memoise_1.1.0.9000 tibble_2.1.3       gtable_0.3.0       pkgconfig_2.0.2    rlang_0.4.0        foreach_1.4.4     
[19] cli_1.1.0          rstudioapi_0.10    yaml_2.2.0         parallel_3.5.1     curl_3.3           stringr_1.4.0     
[25] withr_2.1.2        dplyr_0.8.3        vctrs_0.2.0        hms_0.5.0          desc_1.2.0         fs_1.2.7          
[31] devtools_2.0.2     rprojroot_1.3-2    grid_3.5.1         tidyselect_0.2.5   data.table_1.12.2  glue_1.3.1        
[37] R6_2.4.0           processx_3.3.0     XML_3.98-1.20      sessioninfo_1.1.1  tidyr_0.8.3        readr_1.3.1       
[43] callr_3.2.0        purrr_0.3.2        ggplot2_3.2.0      magrittr_1.5       codetools_0.2-16   backports_1.1.4   
[49] scales_1.0.0       ps_1.3.0           usethis_1.5.0      assertthat_0.2.1   colorspace_1.4-1   stringi_1.4.3     
[55] doParallel_1.0.14  lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4

Duplicated column names

Hi,

When I run the following

QUERY="F40|F41"
pre=ukb_icd_prevalence(data = my_ukb_data, icd.version = 10, icd.code = QUERY)

I get an error as follows:

Error in select_impl(.data, vars) : 
  found duplicated column name:

Indeed I found some duplicated variable names, for example

-Simon

error with R commands

I have a big file with lots of fields.
awk -F$'\t' '{if (NR < 2) print NF}' ukb677133.tab
8473

I try to create a key file...
library(ukbtools)
my_ukb_data <- ukb_df("ukb677133")
my_ukb_key <- ukb_df_field(my_ukb_data)
write.table(my_ukb_key, file = "./xukb677133_key.txt", sep = "\t")

But I get this error in R...
Error in lapply(value, as.character) :
R character strings are limited to 2^31-1 bytes
Calls: ukb_df_field -> -> regmatches<- -> lapply
Execution halted

ukbxxx.enc file in ukbtools

Hi Ken, do you know how to download the ukbxxx.enc file mentioned by ukbtools?

ukb_unpack ukbxxxx.enc key
ukb_conv ukbxxxx.enc_ukb r
ukb_conv ukbxxxx.enc_ukb docs

Thanks.

Shicheng

ukb_icd_diagnosis – Error: Column 1 must be named.

I was following instructions from here:
https://cran.r-project.org/web/packages/ukbtools/vignettes/explore-ukb-data.html
for the step for Retrieving ICD diagnoses. ukb_icd_diagnosis returns

Error: Column 1 must be named.
Use .name_repair to specify repair.
Call `rlang::last_error()` to see a backtrace

Do you know why this is happening?

Thanks

Error with ukb_df

Hi Ken.

(Hope you are well. You may remember me from the SGDP, years back we had a few processing meetings, with Oliver)

I am keen to try your biobank tool out, but when I run the ukb_df command I get the following error:

my_ukb <- ukb_df("ukbXXXX")
Error in source(if (path == ".") { :
/mnt/10tbstore/projects/ukb/ukbXXXX.r:219:18: unexpected symbol
218: lvl.100391 <- c(-3,-1,1,2,3,4)
219: lbl.100391 <- c("Prefer

I looked for changes in the ukbXXXX.r file before and after I ran ukb_df and can see that there is some weird substitution going on at line 219.

Before the command ukbXXXX.r looks like this on line 214:
lbl.100388 <- c("Prefer not to answer","Do not know","Never/rarely use spread","Butter/spreadable butter","Flora Pro-Active/Benecol","Other type of spread/margarine")

After the command the same line is a bit further down on line 219 and the path has been added in a weird way:
lbl.100388 <- c("Prefer not to answer","Do not know","Never/rarely use spread.delim('/mnt/10tbstore/projects/ukb/ukbXXXX.tab')

Any ideas what may be causing this?
All the best,
Johan

Is there an option from ukbtools to automatically collapse cohorts into one column?

Is there an option from ukbtools to automatically collapse cohorts into one column? For example, collapsing

"hdl_cholesterol_f30760_0_0" "hdl_cholesterol_f30760_1_0"

"hdl_cholesterol_f30760"

Curve types not supported

Although not described in http://biobank.ctsu.ox.ac.uk/crystal/help.cgi?cd=value_type, there seems to be a Curve type that I believe should be read in as categorical (likely is Compound).

See the below snapshots from an example basket we have and also the type on the website.

Installing UKBTOOLS using devtools

devtools::install_github("kenhanscombe/ukbtools", dependencies = TRUE)

When I use the above command I am getting the following error:

E> * checking for file ‘/nvme/pbs/tmpdir/pbs.202087.flashmgr2/RtmpCkMHkl/remotes615da87bc05/kenhanscombe-ukbtools-3dca23a/DESCRIPTION’ ... OK
E> * preparing ‘ukbtools’:
E> * checking DESCRIPTION meta-information ... OK
E> * installing the package to process help pages
E> * creating vignettes ... ERROR
E> Warning in engine$weave(file, quiet = quiet, encoding = enc) :
E> Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1.
E> Error: processing vignette 'explore-ukb-data.Rmd' failed with diagnostics:
E> The 'markdown' package should be installed and declared as a dependency of the 'ukbtools' package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter contains vignette(s) built with the 'markdown' package. Please see yihui/knitr#1864 for more information.
E> Execution halted

I do not have pandoc in the cluster.

How can I solve this problem?

Regards,

Sathish

unused argument (nThread = if (n_threads == "max") {

Hi, I used the following code to install the latest development version: devtools::install_github("kenhanscombe/ukbtools", build_vignettes = TRUE, dependencies = TRUE)

and then run the following two lines:

library(ukbtools)
my_ukb_data <- ukb_df("ukb9820", path="/restricted/projectnb/ukbiobank/jiehuang/data/ukb/pheno/raw")

But I still got the following error:
Error in data.table::fread(input = tab_location, sep = "\t", header = TRUE, :
unused argument (nThread = if (n_threads == "max") {
parallel::detectCores()
} else if (n_threads == "dt") {
data.table::getDTthreads()
} else if (is.numeric(n_threads)) {
min(n_threads, parallel::detectCores())
})
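
An "unused argument" error means the fread function that was called does not have a parameter named nThread. One plausible cause, stated here as an assumption, is an older installed data.table. The sketch below checks the installed function's formal arguments. fread_supports_nthread is a hypothetical helper name.

```r
# Hypothetical helper: does the installed data.table::fread accept nThread?
# Returns NA when data.table is not installed at all.
fread_supports_nthread <- function() {
  if (!requireNamespace("data.table", quietly = TRUE)) {
    return(NA)
  }
  "nThread" %in% names(formals(data.table::fread))
}

supported <- fread_supports_nthread()
print(supported)

if (isFALSE(supported)) {
  # Report the installed version so it can be compared with a current release.
  message("Installed data.table version: ",
          as.character(utils::packageVersion("data.table")))
}
```

If this prints FALSE, updating data.table and restarting R before calling ukb_df() again would be the first thing to try.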

ukbtools

Hello there,

I have read the other post about error in using the ukb_df command but have tried specifying the path to the file without success.

I have done

devtools::install_github("kenhanscombe/ukbtools", build_vignettes = TRUE, dependencies = TRUE)
library(ukbtools)
ukb_dataset <- ukb_df("ukb29xxx-4", path = "/Users/workspace/monday/DH")

But I keep getting the following, and no matter how I've tweaked it, it just doesn't work.
Error in html_table_nodes[[data.pos]] : subscript out of bounds
In addition: Warning message:
XML content does not seem to be XML: './ukb29xxx-4.tab.html'

I have also used

path_to_example_data <- system.file("extdata", package = "ukbtools")
df <- ukb_df("ukbxxxx", path = path_to_example_data)
df_field <- ukb_df_field("ukbxxxx", path = path_to_example_data)

and it does work with the example (just not the actual data)

Please do you have any advice?

Thanks a lot,
K

Help with ukb_gen_sqc_names

Hi Ken

I have used your package ukbtools to label the columns in my ukb_sqc_v2.txt file (really helpful, thank you). However, I still end up with those two columns at the start (x1 and x2) which remained unnamed after I run ukb_gen_sqc_names.

This sqc file doesn't appear to have an IID in it anywhere, However, one of the columns that remains unnamed after I use ukb_gen_sqc_names (x2) looks like it could be IIDs. So, I labelled it as such.

However... if I then try to merge the FID column from the .fam file, into the ukb_sqc file, matching on IID, I get only about 98k matched out of 488k. So presumably this unnamed column in the sqc file actually isn't IID? Or at least it doesn't match my IID column in the fam file?

Have you come across this issue? I've sunk about 3 days trying to sort this now.

Cheers!

Error in node$parent$priority[, node$name] : subscript out of bounds

Hi All,

Kindly help,

How can I fix Error in node$parent$priority[, node$name] : subscript out of bounds? Below I have attached my YAML file and AHP R code.
solar.txt

library(data.tree)
vignette(package = 'data.tree')

library(ahp)
pvAhp <- Load('solar.txt')
Calculate(pvAhp)
Visualize(pvAhp)
Analyze(pvAhp)
AnalyzeTable(pvAhp)

Inconsistent prevalence results

I was trying my hand at using the ukb_icd_prevalence function with a regular expression and I got some inconsistent results when checking it exhaustively against all the codes I was interested in.

Code below:
x <- ukb_icd_prevalence(my_ukb_data, icd.code = "K85.*", icd.version = 10)

y <- (ukb_icd_prevalence(my_ukb_data, icd.code = "K85", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K850", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K851", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K852", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K853", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K854", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K855", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K856", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K857", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K858", icd.version = 10) + ukb_icd_prevalence(my_ukb_data, icd.code = "K859", icd.version = 10))

I found the prevalence of x to be smaller in my dataset than the prevalence of y. This is confusing to me as K85.* should cover all the codes I was looking up in y. If anything I expected y to maybe be smaller than x, to account for K85.00, K85.01 etc. and other sub-codes I may have not included (but did not see in the dataset from a cursory overview).

I am not sure which result to trust. x is 0.0045 and y is 0.0066. Thoughts?
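
One general point that could account for y being larger than x (0.0066 against 0.0045): prevalences of individual codes are not additive when a participant can carry more than one of the codes. The sum counts such a participant once per code, while a single pattern counts them once. The toy base R sketch below illustrates this with made-up diagnoses. It does not call ukb_icd_prevalence, and how that function matches codes internally is not assumed.

```r
# Made-up diagnoses for four participants; p2 has two different K85 sub-codes.
diagnoses <- list(
  p1 = c("K850"),
  p2 = c("K851", "K859"),
  p3 = c("I21"),
  p4 = character(0)
)

# Proportion of participants with at least one code matching the pattern.
toy_prevalence <- function(pattern) {
  has_code <- vapply(diagnoses,
                     function(codes) any(grepl(pattern, codes)),
                     logical(1))
  mean(has_code)
}

# One pattern covering all K85 codes: p1 and p2 are each counted once.
x <- toy_prevalence("^K85")

# Sum over exact codes: p2 is counted twice (once for K851, once for K859).
y <- toy_prevalence("^K850$") + toy_prevalence("^K851$") + toy_prevalence("^K859$")

print(c(x = x, y = y))
```

Under this reading x is the figure that answers "what fraction has any K85 code". If the patterns in y are unanchored, a short pattern such as "K85" could also match the longer codes and inflate the sum further, but that depends on how matching is done.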

Use vroom instead of data.table::fread for faster loading?

A new package, vroom, is out that is much faster than fread on mixed data (that is, not only numerical) and that also supports the wonderful features found in readr. Maybe switch over to vroom instead?

Error: Cannot add ggproto objects together. Did you forget to add this object to a ggplot object?

The first icd frequency by bmi line chart does not work for me. The second, bmi by gender does produce a pair of bar charts.


> ukb_icd_freq_by(my_ukb_data, reference.var = "body_mass_index_bmi_f21001_0_0", freq.plot = TRUE)
Error: Cannot add ggproto objects together. Did you forget to add this object to a ggplot object? 
> ukb_icd_freq_by(my_ukb_data, reference.var = "sex_f31_0_0", freq.plot = TRUE)

If I set freq.plot = FALSE for the bmi chart, I get a correct data frame


# A tibble: 10 x 6
   categorized_var `coronary artery dis~ `cerebrovascular dis~ `lower respiratory tract~ lower upper
   <ord>                           <dbl>                 <dbl>                     <dbl> <dbl> <dbl>
 1 [12.1,22.1]                    0.0392                0.0186                    0.0451  12.1  22.1
 2 (22.1,23.6]                    0.0469                0.0182                    0.0368  22.1  23.6
 3 (23.6,24.7]                    0.0588                0.0192                    0.0376  23.6  24.7
 4 (24.7,25.7]                    0.0706                0.0214                    0.0401  24.7  25.7
 5 (25.7,26.7]                    0.0802                0.0235                    0.0423  25.7  26.7
 6 (26.7,27.9]                    0.0894                0.0244                    0.0468  26.7  27.9
 7 (27.9,29.1]                    0.0983                0.0265                    0.0495  27.9  29.1
 8 (29.1,30.8]                    0.109                 0.0285                    0.0545  29.1  30.8
 9 (30.8,33.6]                    0.126                 0.0306                    0.0629  30.8  33.6
10 (33.6,74.7]                    0.140                 0.0348                    0.0822  33.6  74.7

How to handle multiple fields for the same variable?

I'd like to write a covariates file using ukb_gen_write_plink such as ukb.variables = c("variable1", "variable2", "variable3")
However, I am wondering if there is a way to collapse the serial fields. For instance, I have 4 possible recordings taken for average monthly red wine intake:

average_monthly_red_wine_intake_f4407_0_0
average_monthly_red_wine_intake_f4407_1_0
average_monthly_red_wine_intake_f4407_2_0
average_monthly_red_wine_intake_f4407_3_0

Is there a way to just call it once such as "average_monthly_red_wine_intake_f4407" and to get ukbtools to report only the max,min, or most recent value? Or, in general, is the only way to do it to call for all of the 4 fields using the ukbtools, and then write my own script that collapses the fields in fam file so that I have "average_monthly_red_wine_intake" only once with my best value?
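
In case it helps while waiting for an answer, here is a minimal base R sketch of the "write my own script" route. collapse_instances is a hypothetical helper, not a ukbtools function. It assumes the columns are numeric, follow the name_fNNNN_instance_array pattern shown above, and that sorting the names puts instances in chronological order.

```r
# Hypothetical helper: collapse all instance columns of one field into one vector.
collapse_instances <- function(df, prefix, how = c("latest", "max", "min")) {
  how <- match.arg(how)
  cols <- sort(grep(paste0("^", prefix, "_[0-9]+_[0-9]+$"), names(df), value = TRUE))
  stopifnot(length(cols) > 0)
  m <- as.matrix(df[, cols, drop = FALSE])
  storage.mode(m) <- "double"
  vapply(seq_len(nrow(m)), function(i) {
    x <- m[i, ]
    x <- x[!is.na(x)]
    if (length(x) == 0L) return(NA_real_)  # no recording at any instance
    unname(switch(how, latest = x[length(x)], max = max(x), min = min(x)))
  }, numeric(1))
}

# Toy data with three instances of the field.
toy <- data.frame(
  eid = 1:3,
  average_monthly_red_wine_intake_f4407_0_0 = c(1, NA, NA),
  average_monthly_red_wine_intake_f4407_1_0 = c(2, 5, NA),
  average_monthly_red_wine_intake_f4407_2_0 = c(NA, 3, NA)
)

latest <- collapse_instances(toy, "average_monthly_red_wine_intake_f4407", "latest")
highest <- collapse_instances(toy, "average_monthly_red_wine_intake_f4407", "max")
print(latest)
print(highest)
```

The collapsed vector can then be added to the data frame as a new column and named in ukb.variables like any other variable.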

Writing permission required by "ukb_df()"?

I am familiarizing myself with ukbtools, since I am going to pull out phenotypes from the data. However, I am running into a problematic permission issue.

my_ukb_data <- ukb_df("ukbxxxx", path = "/shared/ukb/data/path")


Error in file(file, ifelse(append, "a", "w")) : 
  cannot open the connection
In addition: Warning messages:
1: `data_frame()` was deprecated in tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
2: In file(file, ifelse(append, "a", "w")) :
  cannot open file '/shared/ukb/data/path/ukbxxxx.r': Permission denied

Since I have both reading and execution permissions for the .r-file, this error message can only be caused by writing permissions. The problem is that I share this directory with other people, so I am worried if writing permissions can be an issue. Is it really true that ukb_df() requires writing permissions?

Alternatively is there another way to do this without writing to the files?
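
The message shows ukb_df() failing while opening the .r file for writing, so a possible workaround is to point it at a private, writable copy of the fileset. The sketch below is base R only. stage_fileset is a hypothetical helper, and it is an untested assumption that ukb_df() works with a symlinked .tab file.

```r
# Hypothetical helper: copy the small .r and .html files to a writable directory
# and link (or, failing that, copy) the large .tab file next to them.
stage_fileset <- function(fileset, path, dest = tempfile("ukb_stage_")) {
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  small <- file.path(path, paste0(fileset, c(".r", ".html")))
  # copy.mode = FALSE so a read-only source does not give a read-only copy.
  stopifnot(all(file.copy(small, dest, copy.mode = FALSE)))
  tab <- file.path(path, paste0(fileset, ".tab"))
  tab_dest <- file.path(dest, basename(tab))
  linked <- suppressWarnings(file.symlink(normalizePath(tab), tab_dest))
  if (!linked) stopifnot(file.copy(tab, tab_dest, copy.mode = FALSE))
  dest
}

# Toy demonstration with empty placeholder files.
shared <- tempfile("ukb_shared_")
dir.create(shared)
file.create(file.path(shared, paste0("ukb0000", c(".r", ".html", ".tab"))))

staged <- stage_fileset("ukb0000", shared)
print(list.files(staged))
# my_ukb_data <- ukb_df("ukbxxxx", path = staged)  # untested with real data
```

file.access(staged, mode = 2) returns 0 when the staged directory is writable, which is a quick way to confirm the copy landed somewhere usable.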

Best regards,

ukb_df permission denied issue

Running my_ukb_data <- ukb_df("ukbxxxx") with my ukb ID I get:

Warning message in file(file, ifelse(append, "a", "w")):
“cannot open file './ukbxxxx.r': Permission denied”

Error in file(file, ifelse(append, "a", "w")): cannot open the connection
Traceback:

ukb_df("ukbxxxx")
.update_tab_path(fileset, column_type = ukb_key$fread_column_type,
. path, n_threads = n_threads)
cat(f, file = r_location, sep = "\n")
file(file, ifelse(append, "a", "w"))

Any thoughts? I thought it would be resolved by running my jupyter notebook with the --allow-root option but that didn't do the trick.

Also I have read/write/execute permissions into the directory in question.

ukb_centre

Hello, I used ukb_centre to add the assessment centre as a text string. Looking at the frequencies this doesn't look right.

> table(my_ukb_data$ukb_centre)

            Barts        Birmingham           Bristol              Bury           Cardiff 
             3797             13939             14058             17878             18647 
Cheadle (revisit)           Croydon         Edinburgh           Glasgow          Hounslow 
            17198             19433             29411             28321             37002 
            Leeds         Liverpool        Manchester    Middlesborough         Newcastle 
            44198             43012             12582             33876             30396 
       Nottingham            Oxford           Reading         Sheffield Stockport (pilot) 
            32816             21286             28875             27380             25501 
            Stoke           Swansea 
             2281               649

For example, comparing with the showcase count https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=54, Birmingham has 25,501 participants, which is the count shown here for Stockport (pilot).

ukb_centre : Error: cannot allocate vector of size 15.0 Gb

Hello, I am using your ukb_centre function to give better descriptive names to the assessment centres. When I do this I get the following output.

> ukb_centre(my_ukb_data, centre.var = "uk_biobank_assessment_centre_f54_0_0") 
Error: cannot allocate vector of size 15.0 Gb 
> str(my_ukb_data$uk_biobank_assessment_centre_f54_0_0)  
chr [1:502536] "11017" "11007" "11011" "11009" "11011" "11021" "11016" "11018" "11010" "11016" ...
> str(ukbcentre)
'data.frame':	24 obs. of  2 variables:
 $ code  : int  11012 11021 11011 11008 11003 11024 11020 11005 11004 11018 ...
 $ centre: chr  "Barts" "Birmingham" "Bristol" "Bury" ...
 - attr(*, "spec")=
  .. cols(
  ..   code = col_integer(),
  ..   centre = col_character()
  .. )--

In the past I have found this is because there is a mismatch between the types of the variables that I am matching on (inner or outer joins?). Here I note that in my_ukb_data the assessment centre is a character string whilst in ukbcentre it is an int. I thought that the good work you have done with ukb_df/ukb_context might have also dealt with this, but possibly not so? Thanks.
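
For reference, the type mismatch described above can be sidestepped outside ukb_centre with a plain lookup. The base R sketch below uses the first few codes and centre names from the str() output in this post. Whether the mismatch is what triggers the 15.0 Gb allocation is the poster's hypothesis, not something shown here.

```r
# First rows of the lookup table, taken from the str(ukbcentre) output above.
centre_lookup <- data.frame(
  code = c(11012L, 11021L, 11011L, 11008L),
  centre = c("Barts", "Birmingham", "Bristol", "Bury"),
  stringsAsFactors = FALSE
)

# Character codes as they appear in the data frame column.
codes_chr <- c("11017", "11007", "11011", "11009", "11011", "11021")

# Coerce to integer before matching so both sides have the same type.
centre_name <- centre_lookup$centre[match(as.integer(codes_chr), centre_lookup$code)]
print(centre_name)
```

Codes missing from this shortened lookup come back as NA. With the full 24-row table every centre code should resolve.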

Error: Length of logical index vector for `[` must equal number of columns (or 1):

I'm getting an error when trying to use ukb_context on a subgroup of interest.

my_ukb_data <- ukb_df("ukb24898", path = "/share/projects/uk_biobank/pheno_data")
my_ukb_key <- ukb_df_field("ukb24898", path = "/share/projects/uk_biobank/pheno_data")

One thing I noticed is that the ukb_df_field() command is appending uses_datacoding_... to all of the variables which seems a bit odd -not what I see from the vignette- but perhaps this is because there are multiple UDI's for each Description (e.g. Never eat eggs, dairy, wheat, sugar (pilot) Uses data-coding 100672 has four UDI's: 10855-0.0, 10855-0.1, 10855-0.2, 10855-0.3)?

The error I'm getting is from the ukb_context() function:

heavy_abuse_subgroup <- (my_ukb_data$physically_abused_by_family_as_a_childuses_datacoding_532_f20488_0_0 == "Very often true")
ukb_context(my_ukb_data, nonmiss.var = heavy_abuse_subgroup )

Error: Length of logical index vector for `[` must equal number of columns (or 1):
* `.data` has 3177 columns
* Index vector has length 502543

The phenodata we paid for (41975) apparently does not have body_mass_index or BMI so I cannot try what you have in the vignette. I can however provide you with the data dictionary if we need to troubleshoot using another variable.

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /share/apps/anaconda2/lib/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] feather_0.3.3   ukbtools_0.11.2 usethis_1.4.0   devtools_2.0.2 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1         plyr_1.8.4         compiler_3.5.3     pillar_1.3.1       iterators_1.0.10   prettyunits_1.0.2 
 [7] remotes_2.0.4      tools_3.5.3        testthat_2.1.1     digest_0.6.18      packrat_0.5.0      pkgbuild_1.0.3    
[13] pkgload_1.0.2      memoise_1.1.0.9000 tibble_2.1.1       gtable_0.3.0       pkgconfig_2.0.2    rlang_0.3.4       
[19] foreach_1.4.4      cli_1.1.0          rstudioapi_0.10    parallel_3.5.3     xfun_0.6           knitr_1.22        
[25] stringr_1.4.0      withr_2.1.2        dplyr_0.8.0.1      hms_0.4.2          desc_1.2.0         fs_1.2.7          
[31] rprojroot_1.3-2    grid_3.5.3         tidyselect_0.2.5   data.table_1.12.2  glue_1.3.1         R6_2.4.0          
[37] processx_3.3.0     XML_3.98-1.19      sessioninfo_1.1.1  tidyr_0.8.3        readr_1.3.1        callr_3.2.0       
[43] purrr_0.3.2        ggplot2_3.1.1      magrittr_1.5       codetools_0.2-16   backports_1.1.4    scales_1.0.0      
[49] ps_1.3.0           assertthat_0.2.1   colorspace_1.4-1   stringi_1.4.3      doParallel_1.0.14  lazyeval_0.2.2    
[55] munsell_0.5.0      crayon_1.3.4
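
A possible reading of the error: in other posts in this list, nonmiss.var is given a column name as a string, while a logical vector with one element per participant is given to subset.var. Here the logical vector went to nonmiss.var, and the message complains that it is being used to index columns. The base R sketch below only checks that a vector has the shape a per-participant flag should have. check_subset_flag is a hypothetical helper, and the commented call is an untested suggestion.

```r
# Hypothetical helper: TRUE if `flag` is a logical vector with one element per row.
check_subset_flag <- function(df, flag) {
  is.logical(flag) && length(flag) == nrow(df)
}

# Toy data standing in for the real data frame.
toy <- data.frame(
  eid = 1:4,
  answer = c("Very often true", "Never true", NA, "Very often true"),
  stringsAsFactors = FALSE
)

subgroup <- toy$answer == "Very often true"
print(check_subset_flag(toy, subgroup))

# ukb_context(my_ukb_data, subset.var = heavy_abuse_subgroup)  # untested suggestion
```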

Column 1 error again

Hello,

I tried using your example dataset "ukbxxxx" and when I tried to use ukb_icd_diagnosis, it came up with the error:

ukb_icd_diagnosis(mydata, id = "1", icd.version = 10) ## mydata is same as my_ukb_data just change to make typing easier
Error: Column 1 must be named.
Use .name_repair to specify repair.
Run rlang::last_error() to see where the error occurred.

Is there something I did wrong? path = (my pc location)\ukbtools-master\ukbtools-master\inst\extdata

ukb_df, Error in data.table::fread, unused argument

Dear Ken,

I am using your ukbtools package to process the UK Biobank data downloaded. The
ukb_df_field() function works fine but I came across the following strange problem with the ukb_df() function:

> library(devtools)
> devtools::install_github("kenhanscombe/ukbtools", dependencies = TRUE)
> library(ukbtools)
> my_ukb_data <- ukb_df("ukb23009")
Error in data.table::fread(input = tab_location, sep = "\t", header = TRUE,  : 
  unused argument (nThread = if (n_threads == "max") {
    parallel::detectCores()
} else if (n_threads == "dt") {
    data.table::getDTthreads()
} else if (is.numeric(n_threads)) {
    min(n_threads, parallel::detectCores())
})

The problem appeared in both Linux and Mac environments. It would be great if you could help!

Thanks in advance

Wenhua
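
An "unused argument (nThread = ...)" error from data.table::fread() would be consistent with an installed data.table that predates the nThread argument. That is an assumption, not something confirmed in the report. A small sketch for checking whether a function accepts a given argument, where has_arg() is a hypothetical helper and not part of ukbtools or data.table:

```r
# Hypothetical helper: does function `fn` have a formal argument named `arg`?
has_arg <- function(fn, arg) {
  arg %in% names(formals(fn))
}

# Only inspect data.table if it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  # Installed version, for comparison with the package's release notes.
  print(packageVersion("data.table"))
  # TRUE if this fread() accepts nThread, FALSE otherwise.
  print(has_arg(data.table::fread, "nThread"))
}
```

If the second value printed is FALSE, updating data.table (for example with install.packages("data.table")) and restarting R may resolve the error.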

errors

I get these errors:

uk <- ukb_df("ukb43365")
Error in html_table_nodes[[data.pos]] : subscript out of bounds
In addition: Warning message:
XML content does not seem to be XML: './ukb43365.html'

uk <– ukb_df("ukb43365", path = "/sc/arion/work/lehres01")
Error: unexpected input in "uk <▒"
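
The second error looks like it comes from the assignment operator itself: the pasted line appears to contain an en dash (U+2013) in place of the ASCII hyphen in `<-`, which R cannot parse. A minimal base R sketch of how one might check for and repair this, where fix_en_dash() and parses() are hypothetical helpers and not part of ukbtools:

```r
# Hypothetical helper: replace en dashes (U+2013) with ASCII hyphens.
fix_en_dash <- function(code) {
  gsub("\u2013", "-", code, fixed = TRUE)
}

# Hypothetical helper: TRUE if the text parses as R code, FALSE otherwise.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

# A line with an en dash in the assignment operator, as in the report above.
bad <- "uk <\u2013 1"

parses(bad)               # check the line as pasted
parses(fix_en_dash(bad))  # check the line after the repair
```

Retyping the operator by hand in the console has the same effect as the repair step.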

ukb_gen_samples_to_remove results in integer(0)

I've been trying to use the ukb_gen_samples_to_remove command:

ukb_gen_samples_to_remove(ukb_kinship_data, ukb_with_data = my_list_of_eids)

The documentation says that my_list_of_eids should be an integer vector of eids (other places say character vector; I have tried both). My kinship data file has worked with other commands, but I cannot get any command that requires the 'ukb_with_data' argument to work. The result is always an empty vector (integer or character, depending on how I transform the data).

Is this a known issue or am I missing something? Thanks.
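
One way to narrow this down is to check how many related pairs have both members in the eid vector. This is a diagnostic sketch with toy data, not the package's own logic. The column names ID1 and ID2 are an assumption based on the UK Biobank relatedness file layout, and n_pairs_in_sample() is a hypothetical helper:

```r
# Toy stand-in for the kinship data; ID1/ID2 column names are an assumption.
kinship <- data.frame(
  ID1 = c(1001L, 1002L, 1005L),
  ID2 = c(1003L, 1004L, 1006L),
  Kinship = c(0.25, 0.12, 0.09)
)

# Toy stand-in for the vector of eids with data.
my_eids <- c(1001L, 1003L, 1004L, 1007L)

# Hypothetical helper: count related pairs where both members are in `eids`.
n_pairs_in_sample <- function(kin, eids) {
  sum(kin$ID1 %in% eids & kin$ID2 %in% eids)
}

# Confirm the eids are an atomic vector rather than a data frame or list.
str(my_eids)

# Number of related pairs fully contained in the eid vector.
n_pairs_in_sample(kinship, my_eids)
```

If the count is zero for the real data, an empty result might simply mean that no related pair falls entirely within the supplied eids. If the eids are held in a data frame column, extracting that column as a vector first would be worth trying.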

image processing functions

Hi Ken,

Are there any image processing functions on the development roadmap?

Thanks

Shicheng

Trouble loading files into ukbtools

Hello,
I created the following fileset using ukbconv:

ukb42106.html
ukb42106.tab
ukb42106.r

However, after installing the tools in R, I cannot seem to load them up.

getwd()
[1] "/mnt/BIOINFX/UK-Biobank"

library(ukbtools)
my_ukb_data <- ukb_df("ukb42106")
Error: Can't subset columns that don't exist.
✖ Location 2 doesn't exist.
ℹ There are only 1 column.
Run rlang::last_error() to see where the error occurred.

Backtrace:

ukbtools::ukb_df("ukb42106")
ukbtools::ukb_df_field(fileset, path = path)
ukbtools:::fill_missing_description(html_table)
tibble:::`[.tbl_df`(data[, "Description"], i)
tibble:::vectbl_as_col_location(...)
vctrs::vec_as_location(j, n, names)
vctrs:::stop_subscript_oob(...)
vctrs:::stop_subscript(...)
Run rlang::last_trace() to see the full context.

Backtrace:
█

├─ukbtools::ukb_df("ukb42106")
│ ├─ukb_df_field(fileset, path = path) %>% mutate(fread_column_type = col_type[col.type])
│ └─ukbtools::ukb_df_field(fileset, path = path)
│ └─ukbtools:::fill_missing_description(html_table)
│ ├─data[, "Description"][i]
│ └─tibble:::`[.tbl_df`(data[, "Description"], i)
│ └─tibble:::vectbl_as_col_location(...)
│ ├─tibble:::subclass_col_index_errors(...)
│ │ └─base::withCallingHandlers(...)
│ └─vctrs::vec_as_location(j, n, names)
│ └─(function () ...
│ └─vctrs:::stop_subscript_oob(...)
│ └─vctrs:::stop_subscript(...)
└─dplyr::mutate(., fread_column_type = col_type[col.type])

$ head ukb42106.r

# R program ukb42106.tab created 2021-06-07 by ukb2r.cpp Mar 14 2018 14:22:05

bd <- read.table("/mnt/BIOINFX/UK-Biobank/ukb42106.tab", header=TRUE, sep="\t")
lvl.0009 <- c(0,1)
lbl.0009 <- c("Female","Male")
bd$f.31.0.0 <- ordered(bd$f.31.0.0, levels=lvl.0009, labels=lbl.0009)
lvl.0008 <- c(1,2,3,4,5,6,7,8,9,10,11,12)
lbl.0008 <- c("January","February","March","April","May","June","July","August","September","October","November","December")
bd$f.52.0.0 <- ordered(bd$f.52.0.0, levels=lvl.0008, labels=lbl.0008)
bd$f.53.0.0 <- as.Date(bd$f.53.0.0)

head ukb42106.html

Error with ukb_df in latest version

Hi Ken,

After installing your latest version (either the development or the CRAN one), I get the following error when running ukb_df.

my_ukb <- ukb_df(paste0(datasetname,".sub"))
Error in mutate_impl(.data, dots) :
Evaluation error: as_dictionary() is defunct as of rlang 0.3.0.
Please use as_data_pronoun() instead.

ukb_gen_write_bgenie - Error in UseMethod("left_join")

Hi, I have been trying to use your package to write a covariates table for use with BGENIE. Whilst doing so I encountered the following problem:

I loaded my data in using:
biobank_data <- ukb_df("ukb_key")
bgenie_covars <- ukb_gen_write_bgenie(biobank_data, "path_to_sample_file", "path_to_output_file", list_of_biobank_variables)

And received the error message:

Error in UseMethod("left_join") :
no applicable method for 'left_join' applied to an object of class "character"

Any help would be greatly appreciated!
