Comments (7)

mcanouil commented on August 16, 2024

NACHO::load_rcc requires a (single) directory path to the folder in which RCC files can be found.

Following the NACHO vignettes (e.g., https://m.canouil.fr/NACHO/articles/NACHO-analysis.html), here is an example with subfolders:

library(dplyr)
library(tidyr)
library(tibble)
library(NACHO)
library(GEOquery)

gse <- getGEO("GSE70970")
targets <- pData(phenoData(gse[[1]]))
getGEOSuppFiles(GEO = "GSE70970", baseDir = tempdir())

data_directory1 <- file.path(tempdir(), "GSE70970", "data", "dir1")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory1
)
data_directory2 <- file.path(tempdir(), "GSE70970", "data", "dir2")
untar(
  tarfile = file.path(tempdir(), "GSE70970", "GSE70970_RAW.tar"),
  exdir = data_directory2
)

targets <- rbind(targets, targets)

data_directory <- file.path(tempdir(), "GSE70970", "data")
targets$IDFILE <- list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)

GSE70970 <- load_rcc(data_directory, targets, id_colname = "IDFILE")
#> [NACHO] Importing RCC files.
#> 
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#>   $ access              : character
#>   $ housekeeping_genes  : character
#>   $ housekeeping_predict: logical
#>   $ housekeeping_norm   : logical
#>   $ normalisation_method: character
#>   $ remove_outliers     : logical
#>   $ n_comp              : numeric
#>   $ data_directory      : character
#>   $ pc_sum              : data.frame
#>   $ nacho               : data.frame
#>   $ outliers_thresholds : list

from nacho.

ChadAHighfill commented on August 16, 2024

Still can't get this to work:

# Get all relevant files

the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)

[1] "20211130_209524441022_RCC/20211130_209524441022_SC718373_01.RCC"
[2] "20211130_209524441022_RCC/20211130_209524441022_SC718374_04.RCC"
[3] "20211130_209524441022_RCC/20211130_209524441022_SC718375_07.RCC"
[4] "20211130_209524441022_RCC/20211130_209524441022_SC794160_10.RCC"
[5] "20211130_209524441022_RCC/20211130_209524441022_SC794164_02.RCC"

list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)

rcc <- load_rcc(
  data_directory = the_files,
  ssheet_csv = "PATH/Desktop/IDv1.csv",
  id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE),
  housekeeping_predict = TRUE
)
[NACHO] Importing RCC files.
Error: Must extract column with a single valid subscript.
x Subscript id_colname has size 0 but must be size 1.
Run rlang::last_error() to see where the error occurred.

I simply want to use this useful package to loop through all the subfolders and read in the RCC files.


mcanouil commented on August 16, 2024

Currently, your code has no chance of working because it does not follow any of the load_rcc requirements. Please have a look at the documentation (https://m.canouil.fr/NACHO/reference/load_rcc.html) and its example.

rcc <- load_rcc(
  data_directory = the_files, # this should be a directory, not a list of files
  ssheet_csv = "PATH/Desktop/IDv1.csv", # this should contain a column with RCC filenames (and possibly the subdirectory, as in my example)
  id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE), # this should be a column name of "ssheet_csv", not a list of files
  housekeeping_predict = TRUE
)
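
The three requirements noted in the comments above can be checked before calling load_rcc. The following is only a minimal base R sketch: check_rcc_inputs is a hypothetical helper, not part of NACHO, and it assumes the sample sheet has already been read into a data frame. The demo files are empty placeholders, not real RCC data.

```r
# Hypothetical helper (not part of NACHO): returns a character vector
# describing every problem found, or character(0) if the inputs look fine.
check_rcc_inputs <- function(data_directory, ssheet, id_colname) {
  problems <- character(0)
  if (!is.character(data_directory) || length(data_directory) != 1L) {
    problems <- c(problems, "data_directory must be a single character string")
  } else if (!dir.exists(data_directory)) {
    problems <- c(problems, "data_directory must be an existing directory")
  }
  if (!is.character(id_colname) || length(id_colname) != 1L) {
    problems <- c(problems, "id_colname must be a single column name")
  } else if (!id_colname %in% names(ssheet)) {
    problems <- c(problems, "id_colname must be a column of the sample sheet")
  } else if (length(problems) == 0L) {
    # Paths in the ID column are resolved relative to data_directory.
    paths <- file.path(data_directory, as.character(ssheet[[id_colname]]))
    n_missing <- sum(!file.exists(paths))
    if (n_missing > 0L) {
      problems <- c(problems, sprintf("%d listed file(s) not found under data_directory", n_missing))
    }
  }
  problems
}

# Demo on a temporary directory holding empty placeholder files.
demo_dir <- file.path(tempdir(), "rcc_check_demo")
dir.create(file.path(demo_dir, "run1"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "run1", c("a.RCC", "b.RCC")))
the_files <- list.files(demo_dir, pattern = "\\.RCC$", recursive = TRUE)
ssheet <- data.frame(IDFILE = the_files, stringsAsFactors = FALSE)

ok_call <- check_rcc_inputs(demo_dir, ssheet, "IDFILE")    # directory + column name
bad_call <- check_rcc_inputs(the_files, ssheet, the_files) # file vectors, as in the call above
print(ok_call)
print(bad_call)
```

The second call reproduces both mistakes from the annotated call: a vector of files is passed where a directory is expected, and again where a column name is expected.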


ChadAHighfill commented on August 16, 2024

Hi,

When all the data is in a single directory, my code works. However, it is difficult to parse this back out, so I will be dropping this. Thanks so much for the input.

install.packages("NACHO")
library("NACHO")

setwd("PATH")

# keep for now

rcc <- load_rcc(
  data_directory = "PATH",
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE
)

nacho_norm <- normalize(
  nacho_object = rcc,
  remove_outliers = TRUE
)

I was trying to back this out using the limited documentation....

# Define from and to dirs, and the file pattern

from_dir <- "PATH"
to_dir <- "PATH1"
pattern <- ".RCC"

# Get all relevant files

the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)

rcc <- load_rcc(
  data_directory = the_files,
  ssheet_csv = "PATH/IDv1.csv",
  id_colname = "IDFILE",
  housekeeping_predict = TRUE
)

The issue is that, for some reason, the IDFILE column that has all the names recognizes the RCC files in a single folder, but not in a list.files format. I don't understand why.


mcanouil commented on August 16, 2024

The way you are using NACHO is not at all the intended way, so it can lead to unexpected results. If it works like that, it is only because of a "lucky" side effect.

The documentation is quite clear (I think) about the expected value and type of each argument.

data_directory is the parent directory, which can include (as in my example before) multiple directories with RCC files.

/GSE70970/data
+-- dir1
|   +-- GSM1824143_NPC-T-1.RCC.gz
|   +-- GSM1824144_NPC-T-10.RCC.gz
|   +-- GSM1824145_NPC-T-100.RCC.gz
|   +-- ...
|   \-- GSM1824405_NP-V-N9.RCC.gz
\-- dir2
    +-- GSM1824143_NPC-T-1.RCC.gz
    +-- GSM1824144_NPC-T-10.RCC.gz
    +-- GSM1824145_NPC-T-100.RCC.gz
    +-- ...
    \-- GSM1824405_NP-V-N9.RCC.gz

Then, build the sample sheet with an "IDFILE" column, whose name is passed to the "id_colname" argument.
Here, IDFILE includes the subfolders as well.

targets[c(1:5, 264:269), c(1:2, ncol(targets))]
#>                            title geo_accession                           IDFILE
#> GSM1824143    NPC-Training Set-1    GSM1824143   dir1/GSM1824143_NPC-T-1.RCC.gz
#> GSM1824144   NPC-Training Set-10    GSM1824144  dir1/GSM1824144_NPC-T-10.RCC.gz
#> GSM1824145  NPC-Training Set-100    GSM1824145 dir1/GSM1824145_NPC-T-100.RCC.gz
#> GSM1824146  NPC-Training Set-101    GSM1824146 dir1/GSM1824146_NPC-T-101.RCC.gz
#> GSM1824147  NPC-Training Set-102    GSM1824147 dir1/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241431   NPC-Training Set-1    GSM1824143   dir2/GSM1824143_NPC-T-1.RCC.gz
#> GSM18241441  NPC-Training Set-10    GSM1824144  dir2/GSM1824144_NPC-T-10.RCC.gz
#> GSM18241451 NPC-Training Set-100    GSM1824145 dir2/GSM1824145_NPC-T-100.RCC.gz
#> GSM18241461 NPC-Training Set-101    GSM1824146 dir2/GSM1824146_NPC-T-101.RCC.gz
#> GSM18241471 NPC-Training Set-102    GSM1824147 dir2/GSM1824147_NPC-T-102.RCC.gz
#> GSM18241481 NPC-Training Set-103    GSM1824148 dir2/GSM1824148_NPC-T-103.RCC.gz
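
A sample sheet with this layout can be rebuilt on a toy directory tree without downloading anything. This is only a sketch in base R: the file names are invented, the files are empty placeholders rather than real RCC data, and the subdir and sample columns are illustrative extras that load_rcc does not require.

```r
# Toy layout mirroring the tree above: one parent directory, two sub-directories.
data_directory <- file.path(tempdir(), "ssheet_demo")
for (idir in c("dir1", "dir2")) {
  dir.create(file.path(data_directory, idir), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(data_directory, idir, c("sample_A.RCC", "sample_B.RCC")))
}

# With recursive = TRUE, list.files() returns paths relative to data_directory,
# so each IDFILE value already carries its sub-directory (e.g. "dir1/...").
ssheet <- data.frame(
  IDFILE = list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE),
  stringsAsFactors = FALSE
)

# Illustrative extra columns derived from the relative path.
ssheet[["subdir"]] <- dirname(ssheet[["IDFILE"]])
ssheet[["sample"]] <- sub("\\.RCC(\\.gz)?$", "", basename(ssheet[["IDFILE"]]))
print(ssheet)
```

Because every IDFILE value is relative to data_directory, file.path(data_directory, ssheet[["IDFILE"]]) points at an existing file for each row, which is what the single recursive load_rcc call relies on.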

It will work exactly the same way in a "for" loop to go through directories.

for (idir in c("dir1", "dir2")) {
  targets_subdir <- targets[dirname(targets[["IDFILE"]]) %in% idir, ]
  targets_subdir[["IDFILE_nodir"]] <- basename(targets_subdir[["IDFILE"]])
  assign(x = idir, value = load_rcc(file.path(data_directory, idir), targets_subdir, id_colname = "IDFILE_nodir"))
}
#> [NACHO] Importing RCC files.
#> 
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#>   $ access              : character
#>   $ housekeeping_genes  : character
#>   $ housekeeping_predict: logical
#>   $ housekeeping_norm   : logical
#>   $ normalisation_method: character
#>   $ remove_outliers     : logical
#>   $ n_comp              : numeric
#>   $ data_directory      : character
#>   $ pc_sum              : data.frame
#>   $ nacho               : data.frame
#>   $ outliers_thresholds : list
#> [NACHO] Importing RCC files.
#> 
#> [NACHO] Performing QC and formatting data.
#> [NACHO] Computing normalisation factors using "GEO" method.
#> [NACHO] Missing values have been replaced with zeros for PCA.
#> [NACHO] Normalising data using "GEO" method with housekeeping genes.
#> [NACHO] Returning a list.
#>   $ access              : character
#>   $ housekeeping_genes  : character
#>   $ housekeeping_predict: logical
#>   $ housekeeping_norm   : logical
#>   $ normalisation_method: character
#>   $ remove_outliers     : logical
#>   $ n_comp              : numeric
#>   $ data_directory      : character
#>   $ pc_sum              : data.frame
#>   $ nacho               : data.frame
#>   $ outliers_thresholds : list
dir1
#> List of 11
#>  $ access              : chr "IDFILE_nodir"
#>  $ housekeeping_genes  : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#>  $ housekeeping_predict: logi FALSE
#>  $ housekeeping_norm   : logi TRUE
#>  $ normalisation_method: chr "GEO"
#>  $ remove_outliers     : logi FALSE
#>  $ n_comp              : num 10
#>  $ data_directory      : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir1"
#>  $ pc_sum              :'data.frame':   10 obs. of  4 variables:
#>  $ nacho               :'data.frame':   198170 obs. of  119 variables:
#>  $ outliers_thresholds :List of 6
#>  - attr(*, "RCC_type")= chr "n1"
#>  - attr(*, "class")= chr "nacho"
dir2
#> List of 11
#>  $ access              : chr "IDFILE_nodir"
#>  $ housekeeping_genes  : chr [1:5] "RPLP0" "RPL19" "ACTB" "GAPDH" ...
#>  $ housekeeping_predict: logi FALSE
#>  $ housekeeping_norm   : logi TRUE
#>  $ normalisation_method: chr "GEO"
#>  $ remove_outliers     : logi FALSE
#>  $ n_comp              : num 10
#>  $ data_directory      : chr "D:\\Profils\\mcanouil\\AppData\\Local\\Temp\\Rtmp4a7wNw\\GSE70970\\data\\dir2"
#>  $ pc_sum              :'data.frame':   10 obs. of  4 variables:
#>  $ nacho               :'data.frame':   198170 obs. of  119 variables:
#>  $ outliers_thresholds :List of 6
#>  - attr(*, "RCC_type")= chr "n1"
#>  - attr(*, "class")= chr "nacho"
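
As an alternative to assign, the same per-directory loop can return one named list. The sketch below is an assumption-laden illustration: load_by_dir and fake_loader are hypothetical names, and fake_loader only reports the paths it would read so that the code runs without NACHO or real data. In real use, NACHO::load_rcc would be passed as loader, called exactly as in the loop above.

```r
# Toy sample sheet; IDFILE values follow the "subdir/file" layout shown above.
targets <- data.frame(
  IDFILE = c("dir1/s1.RCC", "dir1/s2.RCC", "dir2/s1.RCC"),
  stringsAsFactors = FALSE
)
data_directory <- "PATH" # placeholder parent directory

# One sample sheet per sub-directory, named after the sub-directory.
targets_by_dir <- split(targets, dirname(targets[["IDFILE"]]))

# Hypothetical wrapper: calls `loader` once per sub-directory and
# returns the results in a list named by sub-directory.
load_by_dir <- function(targets_by_dir, data_directory, loader) {
  Map(
    function(idir, targets_subdir) {
      targets_subdir[["IDFILE_nodir"]] <- basename(targets_subdir[["IDFILE"]])
      loader(file.path(data_directory, idir), targets_subdir, id_colname = "IDFILE_nodir")
    },
    names(targets_by_dir),
    targets_by_dir
  )
}

# Stand-in for load_rcc: returns the file paths it was asked to read.
fake_loader <- function(data_directory, ssheet_csv, id_colname) {
  file.path(data_directory, ssheet_csv[[id_colname]])
}

res <- load_by_dir(targets_by_dir, data_directory, fake_loader)
print(names(res))
```

Keeping the results in a list avoids creating one global variable per directory and makes it easy to iterate over them later, for example with lapply.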

To summarise, I recommend using the documented approach; otherwise, I cannot guarantee that the behaviour NACHO exhibits is the intended (and correct) one.
I am not at all confident the results you get using file path instead of directory path are correct.


ChadAHighfill commented on August 16, 2024

Hi Mcanouil,

I think there might be some confusion between us. The initial way is the way the documentation states. Regardless, our group likes the plots coming from autoplot! I will try to loop this as suggested.


mcanouil commented on August 16, 2024

Hum, I do not see in the documentation where load_rcc uses files instead of a directory for the data_directory parameter.
Can you tell me where you saw that? Are you using the latest version?

In your first code example (and the later ones), we can see that you used files, not a directory.
So, I do not see where the confusion is on my side.
data_directory is a character vector of length one giving the path to a directory.
the_files in your case is a character vector of length strictly greater than one giving the paths to RCC files, and is thus an incorrect input for load_rcc.

> Still can't get this to work:
>
> # Get all relevant files
>
> the_files <- list.files(path = from_dir, recursive = TRUE, pattern = pattern)
>
> [1] "20211130_209524441022_RCC/20211130_209524441022_SC718373_01.RCC"
> [2] "20211130_209524441022_RCC/20211130_209524441022_SC718374_04.RCC"
> [3] "20211130_209524441022_RCC/20211130_209524441022_SC718375_07.RCC"
> [4] "20211130_209524441022_RCC/20211130_209524441022_SC794160_10.RCC"
> [5] "20211130_209524441022_RCC/20211130_209524441022_SC794164_02.RCC"
>
> list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
>
> rcc <- load_rcc(
>   data_directory = the_files,
>   ssheet_csv = "PATH/Desktop/IDv1.csv",
>   id_colname = list.files(the_files, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE),
>   housekeeping_predict = TRUE
> )
> [NACHO] Importing RCC files.
> Error: Must extract column with a single valid subscript.
> x Subscript id_colname has size 0 but must be size 1.
> Run rlang::last_error() to see where the error occurred.
>
> I simply want to utilize this useful package and loop through all the subfolders and read into RCC.

For more help, try to make a small reproducible example, for instance with the {reprex} R package.
You could also show your directory tree structure, for example with fs::dir_tree.

Based on your different inputs, if I try to guess and write a simple working example, it should be:

library("NACHO")
data_directory <- "PATH"
ssheet_df <- data.frame(
  sample_label = list.files(data_directory, pattern = "\\.RCC$|\\.RCC.gz$", recursive = TRUE)
)
load_rcc(
  data_directory = data_directory, 
  ssheet_csv = ssheet_df, 
  id_colname = "sample_label"
)

The code above will import all RCC files found within data_directory recursively.

