`NACHO::load_rcc` requires a (single) directory path to the folder in which RCC files can be found.
Following the NACHO vignettes (e.g., https://m.canouil.fr/NACHO/articles/NACHO-analysis.html), here is an example with subfolders:
library(dplyr)
library(tidyr)
library(tibble)
library(NACHO)
library(GEOquery)
gse <- getGEO("GSE70970")
targets <- pData(phenoData(gse[[1]]))
getGEOSuppFiles(GEO = "GSE70970", baseDir = tempdir())
data_directory1 <- file.path(tempdir(), "GSE70970", "data", "dir1")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory1
)
data_directory2 <- file.path(tempdir(), "GSE70970", "data", "dir2")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory2
)
targets <- rbind(targets, targets)
data_directory <- file.path(tempdir(), "GSE70970", "data")
targets$IDFILE <- list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
GSE70970 <- load_rcc(data_directory, targets, id_colname = "IDFILE")
#> [NACHO] Importing RCC files.
#>
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#> $ access : character
#> $ housekeeping_genes : character
#> $ housekeeping_predict: logical
#> $ housekeeping_norm : logical
#> $ normalisation_method: character
#> $ remove_outliers : logical
#> $ n_comp : numeric
#> $ data_directory : character
#> $ pc_sum : data.frame
#> $ nacho : data.frame
#> $ outliers_thresholds : list
from nacho.
I still can't get this to work:
# Get all relevant files
the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)
[1] "20211130_209524441022_RCC/20211130_209524441022_SC718373_01.RCC"
[2] "20211130_209524441022_RCC/20211130_209524441022_SC718374_04.RCC"
[3] "20211130_209524441022_RCC/20211130_209524441022_SC718375_07.RCC"
[4] "20211130_209524441022_RCC/20211130_209524441022_SC794160_10.RCC"
[5] "20211130_209524441022_RCC/20211130_209524441022_SC794164_02.RCC"
list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
rcc <- load_rcc(
  data_directory = the_files,
  ssheet_csv = "PATH/Desktop/IDv1.csv",
  id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE),
  housekeeping_predict = TRUE,
)
[NACHO] Importing RCC files.
Error: Must extract column with a single valid subscript.
x Subscript `id_colname` has size 0 but must be size 1.
Run `rlang::last_error()` to see where the error occurred.
I simply want to use this useful package to loop through all the subfolders and read in the RCC files.
from nacho.
Currently, your code has no chance of working because it does not follow any of the `load_rcc` requirements. Please have a look at the documentation https://m.canouil.fr/NACHO/reference/load_rcc.html and its example.
rcc <- load_rcc(
  data_directory = the_files, # this should be a directory, not a list of files
  ssheet_csv = "PATH/Desktop/IDv1.csv", # this should contain a column with RCC filenames (and possibly subdirectories, as in my examples)
  id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE), # this should be a column name of "ssheet_csv", not a list of files
  housekeeping_predict = TRUE
)
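A minimal sketch of why the error reports a size-0 subscript, using base R only. The temporary folder and the empty placeholder files are assumptions made up for illustration; they are not real RCC data. Calling `list.files()` on file paths rather than on a directory finds nothing, so the value passed to `id_colname` in the call above is an empty vector.

```r
# Hypothetical layout: one run subfolder holding two empty placeholder files.
root <- file.path(tempdir(), "rcc_demo")
run_dir <- file.path(root, "run1_RCC")
dir.create(run_dir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(run_dir, c("sample_01.RCC", "sample_02.RCC")))

# Relative paths to the files, as in the original attempt.
the_files <- list.files(path = root, recursive = TRUE, pattern = "\\.RCC$")

# list.files() expects directories; given file paths it returns nothing.
id_colname <- list.files(the_files, pattern = "\\.RCC$", recursive = TRUE)

length(the_files)  # two file paths, not a directory
length(id_colname) # zero, which matches "has size 0 but must be size 1"
```

In other words, `the_files` is already the list of relative file names that belongs in a sample sheet column, while `data_directory` should stay the single parent folder (`root` in this sketch).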
from nacho.
Hi,
When all the data is in a single directory, my code works. However, this is difficult to parse back out, so I will be dropping this. Thanks so much for the input.
install.packages("NACHO")
library("NACHO")
setwd("PATH")
# keep for now
rcc <- load_rcc(
  data_directory = "PATH",
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE,
)
nacho_norm <- normalize(
  nacho_object = rcc,
  remove_outliers = TRUE
)
I was trying to back this out using the limited documentation....
# Define from and to dirs, and the file pattern
from_dir <- "PATH"
to_dir <- "PATH1"
pattern <- ".RCC"
# Get all relevant files
the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)
rcc <- load_rcc(
  data_directory = the_files,
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE,
)
The issue is that, for some reason, the IDFILE column that has all the names recognizes the RCC files in a single folder, but not in a list.files format. I don't understand why.
from nacho.
The way you are using NACHO is not at all the intended way, so it can lead to unexpected results; if it works like that, it is only because of a "lucky" side effect.
The documentation is quite clear (I think) about what the value and type of each argument should be.
`data_directory` is the parent directory, which can include (as in my example before) multiple directories with RCC files.
/GSE70970/data
+-- dir1
|   +-- GSM1824143_NPC-T-1.RCC.gz
|   +-- GSM1824144_NPC-T-10.RCC.gz
|   +-- GSM1824145_NPC-T-100.RCC.gz
|   +-- ...
|   \-- GSM1824405_NP-V-N9.RCC.gz
\-- dir2
    +-- GSM1824143_NPC-T-1.RCC.gz
    +-- GSM1824144_NPC-T-10.RCC.gz
    +-- GSM1824145_NPC-T-100.RCC.gz
    +-- ...
    \-- GSM1824405_NP-V-N9.RCC.gz
Then, build the sample sheet with the "IDFILE" column, whose name will be provided to the "id_colname" argument.
Here, the IDFILE values include the subfolders as well.
targets[c(1:5, 264:269), c(1:2, ncol(targets))]
#> title geo_accession IDFILE
#> GSM1824143 NPC-Training Set-1 GSM1824143 dir1/GSM1824143_NPC-T-1.RCC.gz
#> GSM1824144 NPC-Training Set-10 GSM1824144 dir1/GSM1824144_NPC-T-10.RCC.gz
#> GSM1824145 NPC-Training Set-100 GSM1824145 dir1/GSM1824145_NPC-T-100.RCC.gz
#> GSM1824146 NPC-Training Set-101 GSM1824146 dir1/GSM1824146_NPC-T-101.RCC.gz
#> GSM1824147 NPC-Training Set-102 GSM1824147 dir1/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241431 NPC-Training Set-1 GSM1824143 dir2/GSM1824143_NPC-T-1.RCC.gz
#> GSM18241441 NPC-Training Set-10 GSM1824144 dir2/GSM1824144_NPC-T-10.RCC.gz
#> GSM18241451 NPC-Training Set-100 GSM1824145 dir2/GSM1824145_NPC-T-100.RCC.gz
#> GSM18241461 NPC-Training Set-101 GSM1824146 dir2/GSM1824146_NPC-T-101.RCC.gz
#> GSM18241471 NPC-Training Set-102 GSM1824147 dir2/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241481 NPC-Training Set-103 GSM1824148 dir2/GSM1824148_NPC-T-103.RCC.gz
It will work exactly the same way in a "for" loop to go through directories.
for (idir in c("dir1", "dir2")) {
  targets_subdir <- targets[dirname(targets[["IDFILE"]]) %in% idir, ]
  targets_subdir[["IDFILE_nodir"]] <- basename(targets_subdir[["IDFILE"]])
  assign(x = idir, value = load_rcc(file.path(data_directory, idir), targets_subdir, id_colname = "IDFILE_nodir"))
}
#> [NACHO] Importing RCC files.
#>
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#> $ access : character
#> $ housekeeping_genes : character
#> $ housekeeping_predict: logical
#> $ housekeeping_norm : logical
#> $ normalisation_method: character
#> $ remove_outliers : logical
#> $ n_comp : numeric
#> $ data_directory : character
#> $ pc_sum : data.frame
#> $ nacho : data.frame
#> $ outliers_thresholds : list
#> [NACHO] Importing RCC files.
#>
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#> $ access : character
#> $ housekeeping_genes : character
#> $ housekeeping_predict: logical
#> $ housekeeping_norm : logical
#> $ normalisation_method: character
#> $ remove_outliers : logical
#> $ n_comp : numeric
#> $ data_directory : character
#> $ pc_sum : data.frame
#> $ nacho : data.frame
#> $ outliers_thresholds : list
dir1
#> List of 11
#> $ access : chr "IDFILE_nodir"
#> $ housekeeping_genes : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#> $ housekeeping_predict: logi FALSE
#> $ housekeeping_norm : logi TRUE
#> $ normalisation_method: chr "GEO"
#> $ remove_outliers : logi FALSE
#> $ n_comp : num 10
#> $ data_directory : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir1"
#> $ pc_sum :'data.frame': 10 obs. of 4 variables:
#> $ nacho :'data.frame': 198170 obs. of 119 variables:
#> $ outliers_thresholds :List of 6
#> - attr(*, "RCC_type")= chr "n1"
#> - attr(*, "class")= chr "nacho"
dir2
#> List of 11
#> $ access : chr "IDFILE_nodir"
#> $ housekeeping_genes : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#> $ housekeeping_predict: logi FALSE
#> $ housekeeping_norm : logi TRUE
#> $ normalisation_method: chr "GEO"
#> $ remove_outliers : logi FALSE
#> $ n_comp : num 10
#> $ data_directory : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir2"
#> $ pc_sum :'data.frame': 10 obs. of 4 variables:
#> $ nacho :'data.frame': 198170 obs. of 119 variables:
#> $ outliers_thresholds :List of 6
#> - attr(*, "RCC_type")= chr "n1"
#> - attr(*, "class")= chr "nacho"
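The per-directory subsetting used in the loop above can also be sketched with `split()`, which keeps one sample sheet per subfolder in a named list instead of creating variables with `assign()`. This is only an illustration in base R: the small `targets` data frame below is made up and stands in for the real sample sheet.

```r
# Hypothetical sample sheet: IDFILE holds paths relative to the parent directory.
targets <- data.frame(
  title = c("A", "B", "A", "B"),
  IDFILE = c("dir1/a.RCC.gz", "dir1/b.RCC.gz", "dir2/a.RCC.gz", "dir2/b.RCC.gz"),
  stringsAsFactors = FALSE
)

# One sample sheet per subdirectory, each with a file-name-only ID column.
targets_by_dir <- lapply(
  split(targets, dirname(targets[["IDFILE"]])),
  function(x) {
    x[["IDFILE_nodir"]] <- basename(x[["IDFILE"]])
    x
  }
)

names(targets_by_dir)
targets_by_dir[["dir2"]][["IDFILE_nodir"]]
```

Each element could then be passed to `load_rcc()` together with `file.path(data_directory, <name>)` and `id_colname = "IDFILE_nodir"`, exactly as in the loop.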
To summarise, I recommend using the documented approach; otherwise, I cannot guarantee that the behaviour NACHO exhibits is the intended (and correct) one.
I am not at all confident that the results you get using file paths instead of a directory path are correct.
from nacho.
Hi Mcanouil,
I think there might be some confusion between us. The initial way is the way the documentation states. Regardless, our group likes the plots coming off autoplot! I will try to loop this as suggested.
from nacho.
Hum, I do not see where in the documentation `load_rcc` uses files instead of a directory for the `data_directory` parameter.
Can you tell me where you saw that? Are you using the latest version?
In your first code examples (and after), we can see that you used files, not a directory.
So, I do not see where the confusion is on my side.
`data_directory` is a character vector of length one giving the path to a directory.
`the_files` in your case is a character vector of length strictly greater than one giving the paths to RCC files, and thus an incorrect input for `load_rcc`.
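The difference between the two kinds of input can be checked before calling `load_rcc`. The helper below is hypothetical (it is not part of NACHO) and is only a sketch of the "character vector of length one giving the path to a directory" condition in base R.

```r
# Hypothetical helper (not part of NACHO): TRUE only for one existing directory path.
is_single_directory <- function(x) {
  is.character(x) && length(x) == 1L && !is.na(x) && dir.exists(x)
}

is_single_directory(tempdir())                      # a single existing directory
is_single_directory(c("run1/a.RCC", "run1/b.RCC"))  # several file paths
```

The first call returns TRUE and the second FALSE, because the second input has length two.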
For more help, try to make a small reproducible example using, for instance, the {reprex} R package, and/or show your directory tree structure with `fs::dir_tree`.
Based on your different inputs, if I try to guess and write simple working code, it should be:
library("NACHO")
data_directory <- "PATH"
ssheet_df <- data.frame(
  sample_label = list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
)
load_rcc(
  data_directory = data_directory,
  ssheet_csv = ssheet_df,
  id_colname = "sample_label"
)
The code above will import all RCC files found within `data_directory` recursively.
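Before running the import, it may help to confirm that every ID in the sample sheet resolves to a file under the parent directory. The sketch below uses base R only; the temporary folder and empty placeholder files are assumptions for illustration and are not real RCC data.

```r
# Hypothetical parent directory with two subfolders of empty placeholder files.
data_directory <- file.path(tempdir(), "rcc_check_demo")
for (sub_dir in c("dir1", "dir2")) {
  dir.create(file.path(data_directory, sub_dir), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(data_directory, sub_dir, "sample_01.RCC"))
}

# Sample sheet built the same way as above: paths relative to data_directory.
ssheet_df <- data.frame(
  sample_label = list.files(data_directory, pattern = "\\.RCC$|\\.RCC\\.gz$", recursive = TRUE),
  stringsAsFactors = FALSE
)

# Every ID should point to an existing file once joined to the parent directory.
missing_files <- ssheet_df[["sample_label"]][
  !file.exists(file.path(data_directory, ssheet_df[["sample_label"]]))
]

ssheet_df[["sample_label"]]
length(missing_files)
```

If `missing_files` is not empty, the sample sheet and the folder contents disagree, which is worth fixing before the import.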
from nacho.