carpentries-lab / metagenomics-analysis

Data Processing and Visualization for Metagenomics

Home Page: https://carpentries-lab.github.io/metagenomics-analysis/

License: Other

Ruby 0.82% Makefile 7.96% R 7.24% Shell 4.76% Python 77.04% HTML 2.17%

lesson metagenomics english life-sciences carpentries-lab peer-reviewed stable

metagenomics-analysis's Issues

Improve Introducing the Shell

Reviewer's comments:

In the Introducing the Shell episode, a consideration could be made for accessing the shell when having a local installation (i.e. not through ssh).
A short disclaimer could be added about the differences in output between an AWS instance and a local instance (in order to avoid confusion).
In the Introducing the Shell episode, the "Shortcut: Tab Completion" section assumes a different starting point from where the learners might be by that time.

Exercises and discussions in Data processing lesson

Reviewer's comment: Most exercises are command-line based; adding more conceptual questions or discussions could help evaluate understanding and the relevance of the topic better.

Create a pipeline with all commands

I think we need an automated step.

Improve setup page

The Setup page assumes that attendees know what "pre-imaged" means.
Put the versions of the programs in the yml file.
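
A minimal sketch of what pinning versions could look like, written as a shell snippet that generates the file. The file name `environment.yml`, the environment name `metagenomics`, and the channel list are assumptions for illustration. Only the Trimmomatic version (0.38) comes from the lesson's own commands; the other tools would need the versions the lesson was actually tested with.

```shell
# Write a conda environment file with an explicitly pinned version.
# Assumptions: file name, environment name, and channels are illustrative.
workdir=$(mktemp -d)
env_file="$workdir/environment.yml"

cat > "$env_file" <<'EOF'
name: metagenomics
channels:
  - conda-forge
  - bioconda
dependencies:
  - trimmomatic=0.38   # version referenced in the Trimming and Filtering episode
  # add the remaining tools here, each with the version the lesson was tested with
EOF

# Count the dependencies that carry an explicit version pin.
pinned=$(grep -c '=[0-9]' "$env_file")
echo "pinned packages: $pinned"
```

An environment could then be created from such a file with `conda env create -f environment.yml`.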

Integrate Shell lesson within the metagenomics lesson

Although jumping between the Metagenomics lesson and "Introduction to the Command Line for Genomics" was good enough, many bash commands are required to parse metagenomics files, mostly in the Diversity analysis lesson. I think teaching the Shell could be better integrated. Ideally, the relevant lessons from "Introduction to the Command Line for Genomics" should be taught using the data from the metagenomics pipeline.

Provide files created during the lesson

Reviewer's comment: The file cuatroc.biom, which is used in the episode Diversity Tackled With R, would be useful to provide (in case learners encounter issues creating it).

To do: Make sure that the file is in the hidden backup directory and put a note somewhere about it.

R section needs to be expanded

The R section needs more analysis; maybe include further examples from Phyloseq.

Compress Quality control Workflow

Automating a Quality control Workflow
Replace this episode with a box at the end of the trimming episode and reference the genomics episode.

Nanopore metagenomics can be added as a box in subsection other resources

Paul and Abel can create a box describing software and references for nanopore, and why nanopore differs from Illumina.
Nelly

Adapter files

Adapter files are not in the correct location. They must be moved to the right location, and the Zenodo record must be updated accordingly.

16s data could also be included in R section

Software Carpentry already has a pipeline for 16S at https://nwu-eresearch.github.io/2017-10-24-ARC-16S/
and there are 4 Cienegas data available at MG-RAST that may be included

Intro to Command Line too dense

Reviewer's comment: It is quite a dense lesson and although the structure is appropriate for learners, I think it could be too much for people who have never used the command line. The exercises would definitely help ease them into the shell. I don't really have any practical suggestions other than considering splitting this part or maybe adding more practical exercises in between.

Typos

Reviewer's comment: Throughout the lesson, there are several small typos that need fixing. :)

To do: Review all episodes for typos and correct them.

Improve Structuring data in spreadsheets in Data Tidiness

Reviewer's comment: In the Data Tidiness episode, the "Structuring data in spreadsheets" section should also contain a definition of sample/observation vs. variable.

To do: Add definition of sample/observation and variable.

Improve Working with Files and Directories

Reviewer's comments:
1) In the Working with Files and Directories episode, the "Examining Files" section starts with the cat command, but it's not actually used. This might cause confusion, so it might need to be removed.
2) In the Working with Files and Directories episode, the "Details on the FASTQ format" section is rather disconnected from the rest of the episode. It does provide useful context, but it could either be added as an optional/note box or moved to the episode on analyzing the FASTQ files.

Required software setup section

Reviewer's comments:
1) The Windows installation instructions for Git are not up-to-date (there are additional steps in the installer, as of 27/06/2022). For the Mac instructions, in addition to the helpful video, a direct link to the software could be added (for convenience).

Option B: there are several links to the manual pages of the required software that are missing. This poses a challenge when trying to find the installation instructions (a clear case is the CheckM-genome software that requires a bit of an effort to find the install instructions). Additionally, the version of trimmomatic is listed as 0.38, but only versions 0.37/0.39/0.40 are available on the linked page. Finally, the link to the MaxBin2 software links back to the workshop page.
A local installation through conda was challenging, as the current versions led to multiple conflicts that could not be resolved. This was equally true both when installing each tool independently, and when using the provided yaml file. However, using the specification file (listed under Option A), it was possible to create a working environment. It may be useful to highlight the specification file also as an option to create a local instance (including the few commands that need to be executed as the final step), as well as for setting up the AWS instance.
The table "Software for Bash" might need some edits, e.g. removing or updating the "Available for" entries, and also moving the description of the KronaTools to the appropriate column.

Misformatted exercise

In https://carpentries-incubator.github.io/metagenomics/07-Diversity-tackled-with-R/index.html, the exercise/solution block for exercise 2 seems to be misformatted.

Alternative text of images

To do: Improve the alternative text of the mentioned images following these alternative text guidelines:
UCSF: Accessible Images Best Practices
The Big Hack: Avoid these common alt-text mistakes
Also review the alternative text of the remaining images not mentioned here.

Improve Writing Scripts and Working with Data

Reviewer's comments:

In the Writing Scripts and Working with Data episode, there is no proposed text to write in nano. Although it's understood that this allows for more creativity from the learners, it may be useful to add an example for guidance. Such an example will also assist in the flow of the section on "Writing files", as it is currently a bit unclear.
In the Writing Scripts and Working with Data episode, the sentence "You will learn more about writing scripts in a later lesson." links back to the same episode.
In the Writing Scripts and Working with Data episode, the "Transferring data between your local machine and the cloud" needs to be adapted to also fit the case of a local installation - or be provided with a possible alternative.
In the Writing Scripts and Working with Data episode, the section on "Versioning scripts with Git and GitHub" could lead to confusion, given the target audience. Although knowledge of Git is undoubtedly a useful skill, it may not be easily connected here.

Data processing and visualization too dense

Reviewer's comments:
1) Compared to the content of the first three lessons, this lesson is much more dense, both in terms of content as well as in topics per episode. It could be considered to split some of the episodes into smaller ones.
2) This is the most dense lesson, I believe (at least the longest). I like the flow of the lesson and the contents but splitting it into two parts would be beneficial for the learners, in my honest opinion.
3) A more general comment is regarding the overall distribution of the load across the workshop. In its current form, it is a bit "heavier" towards the end. Although this is fully understandable, as a lot of the metagenomics-specific topics are tackled at that time, it may be useful to review, and possibly split, some of the episodes into smaller ones. Another approach would be to ensure that, in a 2-day workshop context, some of the first episodes of the Data processing and visualization for metagenomics lesson are tackled during Day 1, thus distributing the load a bit.

Claudia's suggestion:

Move the ggplot section from Taxonomic Analysis with R to a new episode in Intro to R.
Make a new episode to introduce Phyloseq and create the phyloseq objects and arrangements needed for the diversity plots and taxonomic analysis.
Keep only diversity theory, the alpha diversity plot, and the beta diversity plot in the Diversity episode.
Keep only the abundance plots in the Taxonomic analysis episode.

Intro to R for metagenomics is misleading

Reviewer's comment: The overall content of this episode might be misleading, compared to the actual title of "Intro to R for metagenomics". Also, the dataset used (i.e. musicians) is not directly connected to metagenomics, so a more relevant toy dataset could be constructed for these purposes.

To do: Decide whether to change the name of the lesson or change its content to be more about metagenomics.

Graphics of Taxonomic Analysis with R

Reviewer's comment: Some of the figures in Taxonomic Analysis with R could be improved for color-blindness.

To do: Wait until the structure of the R episodes is improved and the code works to change the abundance graphs so they use a colorblind-friendly palette.

Variable names in Intro to R

Reviewer's comments: As a suggestion, I noticed the use of the period in variable names (v.examp) throughout the lesson. In R the period is also used in function names (as.logical()), which could be confusing. Variable names could be just single words to avoid confusion.

Improve discussions in Project Organization and Management lesson

Reviewer's comments:

There are several discussion exercises that could have a diagnostic context.
In the Data Tidiness episode, the "Discussion 2" box has a few typos and phrasing issues that need to be addressed.
In the Planning for NGS Projects episode, it would be useful to have some potential discussion "solutions" after "Discussion 1" box.

To do: Review all discussions in the lesson. If appropriate make them diagnostic and add solutions. Correct typos and phrasing.

Improve Trimming and Filtering episode

Reviewer's comments:

In the Trimming and Filtering episode, it may be useful to provide the Trimmomatic adapter file (cp ~/.miniconda3/pkgs/trimmomatic-0.38-0/share/trimmomatic-0.38-0/adapters/TruSeq3-PE.fa .) as a downloadable option from the material, in case the command doesn't work due to different versions.
Idea: in the Trimming and Filtering section, give an example of what a multi-line command would look like.
Maybe mention something about anaconda/miniconda, as it appears in some commands. There is a box about conda, but the relationship between them is not clear. To do: Check in which episodes it is mentioned and choose a place for this.
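
A sketch of what a multi-line command could look like. The sample file names and the trimming parameters are assumptions for illustration, not the lesson's actual values. The command is prefixed with `echo` so that it is printed rather than executed, which lets the layout be checked without Trimmomatic installed.

```shell
# A trailing backslash tells the shell that the command continues on the next line.
# Nothing, not even a space, may follow the backslash.
# File names and parameter values below are hypothetical.
multi_line=$(echo trimmomatic PE \
    sample_1.fastq.gz sample_2.fastq.gz \
    sample_1.trim.fastq.gz sample_1un.trim.fastq.gz \
    sample_2.trim.fastq.gz sample_2un.trim.fastq.gz \
    SLIDINGWINDOW:4:20 MINLEN:35 \
    ILLUMINACLIP:TruSeq3-PE.fa:2:40:15)

# The same command written on a single line.
single_line=$(echo trimmomatic PE sample_1.fastq.gz sample_2.fastq.gz sample_1.trim.fastq.gz sample_1un.trim.fastq.gz sample_2.trim.fastq.gz sample_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:35 ILLUMINACLIP:TruSeq3-PE.fa:2:40:15)

# Both spellings are the same command as far as the shell is concerned.
if [ "$multi_line" = "$single_line" ]; then
    echo "Both forms produce the same command"
fi
```

Grouping related arguments on each line (inputs, outputs, trimming steps) tends to make long commands easier to read in a narrow terminal window.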

Introduce biom format

Kraken-biom is used in the R section, so we need to introduce the biom format.

Add R to the remote computer

The Diversity analysis required jumping from the command line on the remote computer to RStudio on each student's local computer. This caused incompatibilities between versions and operating systems. It would be better to do as much as possible in the same interface and only bring files to the local computer when they are most likely to work for everyone (e.g. just the JPG figures from R).

Update zenodo link in Examining Data on the NCBI SRA Database

Reviewer's comment: In Examining Data on the NCBI SRA Database, the previous version of the Zenodo dataset is linked (sequencing dataset (from Okie, et al. 2020) adapted for this workshop), rather than the version linked in the Setup page.

To do: Put the permanent DOI

Replace formula images with MathJax

Mathematical formulae are displayed as images in Diversity Tackled with R without alternative text. These images are inaccessible for anyone visiting the lesson using a screen reader.

To make the lesson more accessible, and easier to maintain, the lesson template allows you to replace these images with MathJax elements. Follow the steps below to replace these images.

Add the line math: yes to the bottom of your _config.yml file.
In the episode file, replace the images of formulae with the LaTeX equivalent, wrapped in $$ signs, e.g.

$$E(y) = \beta_0 + \beta_1 \times x_1 + \beta_2 \times x_2.$$

For an example of what I describe above, you can browse the _config.yml and episode page source of the Multiple Linear Regression for Public Health lesson in the Incubator.

See the Lesson Example for further documentation of this feature of the lesson template.
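
As a further illustration, and assuming the episode includes the Shannon diversity index (a common alpha diversity measure; the formulae actually shown in the episode may differ), the image could be replaced by a line such as the following, where $S$ is the number of taxa and $p_i$ is the relative abundance of taxon $i$:

$$H' = -\sum_{i=1}^{S} p_i \ln p_i$$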

MAGs /binning section needs to be developed

We do not have a binning section, and it would be nice to have one.

Improve Assessing Read Quality episode

Reviewer's comment: In the Assessing Read Quality episode, the "Exercise 2: Looking at files metadata" could be better structured, in order for each answer to provide a different context to the original question.

Add confidentiality in Data Tidiness

Reviewer's comment: In the Data Tidiness section, it could be worth briefly mentioning confidentiality issues that may arise from storing and sharing the metadata, i.e. in the case of human samples or patents.

To do: Add a paragraph or box about confidentiality in the Data Tidiness episode.

Spanish names

Reviewer's comment: In Intro to R for metagenomics. Some of the screenshots have a Spanish localization (as well as the names of a few variables). This is not an issue by itself, but it would be useful to be consistent across the material.

Workshop prerequisites

Reviewer's comment: I think the audience needs not only to "have some familiarity with biological concepts, such as that each living organism has a genome" but also some understanding of what sequencing techniques are and what they can achieve. Otherwise, the introduction should elaborate more on that.

To do: Decide whether to modify the prerequisite or add an explanation about sequencing before the current Introduction in the Data Tidiness episode.

Database for taxonomic assignment

In the taxonomic assignment part we show how to download the minikraken database, but the kraken-db is the one used in the kraken command. Since we will not run that command anyway, we should show how to download the kraken-db instead of the minikraken db, or show both.

R does not display plots

Have the pipeline save the plots to files so that they can be opened and viewed.

Add a Metagenomics Workshop Overview

As with other Data Carpentry lessons, add a workshop overview that includes:

Project Organization and Management for Metagenomics: Data Tidiness
Introduction to the Command Line: Introducing the Shell
Data Wrangling and Processing: Background and Metadata
The Workshop Overview also provides a standardized way of sharing the setup.

Examples:
https://datacarpentry.org/ecology-workshop/
https://datacarpentry.org/genomics-workshop/

Min max and mean in diversity analysis exercise didn't work

In this episode, nothing is learned when the min, max and mean lines are run. Their purpose is not clear.

https://carpentries-incubator.github.io/metagenomics/09-diversity-analysis/index.html

Image transcription in Examining Data on the NCBI SRA Database

Reviewer's comment: In the NCBI SRA section (the last one), some images are quite difficult to describe, especially those of the process of downloading data from SRA. Complete transcriptions of images that consist only of text could also be provided.

To do: Provide complete transcription of images with text.

Develop Instructor Notes

Instructor notes must be developed for the Intro to R and Data processing lessons here: https://carpentries-incubator.github.io/metagenomics-workshop/guide/index.html

More biological context needs to be added to the lesson

Conclusions of the paper and from other studies should be included in the lesson as discussion topics and exercises.

Create correct FAQ and data sites from workshop overview

The FAQ and data sites still point to the genomics workshop:
https://nselem.github.io/metagenomics-workshop/
These sites need to be adapted. We also need to find a place to credit the authors of the genomics lesson.

Improve R datatypes

Reviewer's comment:

In the R datatypes episode, there is a non-rendered link in the solution of Exercise 1.
In the Types of data section, only the integer data type is hyperlinked. I don't know if this is on purpose, but maybe all data types could link to a more detailed description.
In this Types of data section there is a mix of English and Spanish ("resultado <- "4 and 3 are not the same in Earth. In Mars maybe... ").

To do: Remove the link from the exercise solution. Add links to all the data types. Put everything in English.

Learning objectives in Intro to R

Reviewer's comment: The first two episodes have the exact same LO defined ("What types of data does the R language has?").

To do: Put that learning objective only in the appropriate episode and remove it from the other.

Improve Examining Data on the NCBI SRA Database episode

Reviewer's comments:

1) The SRA-Run section might need to be slightly restructured, mostly because it currently tries to address both the new and the legacy SRA-Run tool.
2) Examining Data on the NCBI SRA Database can be a little overwhelming. There is too much information on the screen, but I don't know if there is something to do about it other than taking it slow and making sure everyone follows. I didn't really get the reason for using the "old RunSelector" instead of the new one; I imagine the old one will become deprecated at some point...

To do: Improve the sequence of steps and remove confusing mentions to the old run selector.

Improve Redirection

Reviewer's comments:

In the Redirection episode, in the "Exercise 1: Using grep" some additional context on the produced output might be useful (esp. in the 2nd example where it has a long screen).
In the Redirection episode, the "File extensions - part 2" part might be a bit confusing, given that the episode flow up until that point, would not lead to the expected error (unless you run the same command twice). It could be useful to clarify this (or slightly rephrase).
In the Redirection episode, the "Writing for loops" and "Using Basename in for loops" sections do not have an evaluation associated with it (e.g. an expected output file, or a discussion on how it can be further adapted). Unless this section is critical for other parts of the overall workshop, it may be useful to change it as an "optional" content, keeping only the basename command as part of the content.
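
One way the basename section could be given an expected output file is sketched below. The file names are hypothetical and the files are empty placeholders created in a scratch directory; this is an illustration, not the lesson's actual exercise.

```shell
# Work in a scratch directory with two empty, hypothetical FASTQ files.
workdir=$(mktemp -d)
touch "$workdir/sampleA.fastq" "$workdir/sampleB.fastq"

# basename strips the directory and, given a second argument, that suffix too.
# Redirecting the whole loop collects every name in a single output file.
for path in "$workdir"/*.fastq
do
    basename "$path" .fastq
done > "$workdir/sample_names.txt"

# Learners can compare this file with the expected list of sample names.
cat "$workdir/sample_names.txt"
```

The resulting `sample_names.txt` gives learners something concrete to check, and the loop body can be adapted later to run a real tool on each file.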

Improve Data about the experiment in Data Tidiness

Reviewer's comment: In the Data Tidiness episode, the "Data about the experiment" might be a bit confusing, as it currently refers to three different templates (file, Here and the guide). It may be more practical to focus on one, and have the rest as optional choices (although they appear to have a significant overlap).

To do: Improve the explanation about README files so we refer only to one thing. And if we are mentioning the other options give them different names.

Change name of Data Files

Data files have very long, unreadable names. Given the size of the shell window, commands quickly become very difficult to follow when they combine multiple arguments and files. I think it would be more didactic if filenames were shorter and more descriptive. I think a simple cp + mv would work at the beginning, when also teaching the importance of keeping raw data untouched.
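
A minimal sketch of the cp + mv idea, using a scratch directory and hypothetical file and directory names (the lesson's real file names are not reproduced here):

```shell
# Set up a scratch area with a raw-data directory and a working directory.
workdir=$(mktemp -d)
mkdir "$workdir/raw" "$workdir/working"

# Hypothetical long file name standing in for a sequencer-generated one.
long_name="very_long_sequencer_generated_name_R1_001.fastq"
touch "$workdir/raw/$long_name"

# Copy first, so the raw file is never renamed or modified...
cp "$workdir/raw/$long_name" "$workdir/working/"

# ...then give the working copy a shorter, descriptive name.
mv "$workdir/working/$long_name" "$workdir/working/sampleA_R1.fastq"

ls "$workdir/raw" "$workdir/working"
```

Keeping the original name in the raw directory preserves the link back to the sequencing run, while the short name is the one used in later commands.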

Add explanation about factors in Intro to R

Factors are used in episode 9, so they should be explained in the R section.

Break lessons that took a lot of time to teach

The "Diversity analysis" and the "Taxonomic Assignation" lessons lasted over 1.5 hours. I think both of them should be split into two lessons.
