Introduction to the Command Line for Genomics

Home Page: https://datacarpentry.org/shell-genomics

License: Other

carpentries data-carpentry lesson shell programming english genomics stable

shell-genomics's Introduction

Shell Genomics lessons

An introduction to the Unix shell for people working with genomics data. This lesson is part of the Data Carpentry Genomics Workshop. Please see http://www.datacarpentry.org/shell-genomics/ for a rendered version of this material.

Contribution

Make a suggestion or correct an error by raising an Issue.

Code of Conduct

All participants should agree to abide by the Data Carpentry Code of Conduct.

Authors

Shell Genomics is authored and maintained by the community.

Citation

Please cite as:

Erin Alison Becker, Anita Schürch, Tracy Teal, Sheldon John McKay, Jessica Elizabeth Mizzi, François Michonneau, et al. (2019, June). datacarpentry/shell-genomics: Data Carpentry: Introduction to the shell for genomics data, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3260560

shell-genomics's Issues

List of Issues

Incomplete list of significant changes to content that need to be written and/or revised:

General

See the standard lesson release checklist #47 for things to look for when creating issues.
#1 Deletion of master branch - both files in master branch are present in gh-pages so master branch could be deleted
Write instructor notes
Add Setup instructions either for shell specifically, or for the entire genomics lesson? This is an open question.
Extras -> References, About, Discussion, and Figures need content as noted in #38
Find a single place to list resources for further learning. At present, they appear at the ends of both episode 1 and episode 5. See this additional resource, which also needs a home.
slides.md is mentioned in some issues and PRs. Is it deprecated? It doesn't seem to be part of this lesson any longer. See PR 9 and issue 23.

Episode 1 - Introducing the Shell

Typos noted in #39

Episode 2 - The Filesystem

Episode 3 - Working with Files

Episode 4 - Redirection

Episode 5 - Writing Scripts

Clean up links at bottom of page or move all learning resources to a single place. There are currently others listed at the end of episode 1 as well.
Time estimates and more needed as noted in #38
Permissions discussion incomplete as noted in #45
Nano screen-shot shows a different file name on the titlebar (shows awesome.sh, should be README.txt)

Episode 6 - Project Organization

bad-reads vs bad_reads

Some exercises and examples use the filename bad-reads.txt and others use bad_reads.txt. Learners may end up with two different "bad reads" files and be confused about which they are using for a particular exercise. We should either distinguish between these two files (if we want to make a point about using different methods to pull out the bad reads) or make this consistent and make sure we're instructing the learners to delete the bad reads file between each new method.

provide starting working directory for exercise (filesystem episode)

There's an exercise where the challenge is

There is a hidden directory in our file system. Explore the options for ls to find out how to see hidden directories. List the contents of the directory and identify the name of the text file in that directory.

but it occurs immediately after a section about learning how to navigate between directories, so adventurous learners may not be in the right directory when this challenge arises. Adding the starting directory to the question would help learners get the intended takeaway.

Add instructor notes document for this lesson

I'm working on helping direct instructor attention towards fixing up/contributing to instructor notes. Currently don't have a link to provide for instructor notes for this lesson. Please add - even a blank document would be somewhere to point towards.

Add "clear" command.

03-working-with-files has a section for command history that includes the following:

^-C (Control-C) will cancel the command you are writing, and give you a fresh prompt.
^-R will do a reverse-search through your command history. This is very useful.

Should add (Control-R) to the second line for clarity.
Can also add the clear command.

Standard Lesson Release Checklist

Lesson Release checklist

For each lesson release, copy this checklist to an issue and check off items during preparation for release.

Scheduled Freeze Date: YYYY-MM-DD
Scheduled Release Date: YYYY-MM-DD

Checklist of tasks to complete before release:

typo

In 03-working-with-files

to go backwarsd

math fail in sequence counting (redirection)

It counts the number of lines or characters. So, we can use it to count the number of lines we’re getting back from our grep command. And that will magically tell us how many sequences we’re finding.

Since we're grepping with one line of context before (-B1) and two lines after (-A2), the line count is actually 4x the number of sequences.
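To make the arithmetic concrete, here is a minimal sketch using a made-up three-read FASTQ file (the file name and reads are hypothetical, not the lesson data). It assumes grep prints a "--" separator between non-adjacent context groups, so those lines are filtered out before dividing by 4. Counting only the matching lines with grep -c is shown as an alternative.

```shell
# Build a tiny, hypothetical FASTQ file: two reads with a run of 10 Ns,
# separated by one clean read.
workdir=$(mktemp -d)
fastq="$workdir/example.fastq"
printf '%s\n' \
  '@read1' 'ACGTNNNNNNNNNNACGT' '+' '!!!!!!!!!!!!!!!!!!' \
  '@read2' 'ACGTACGTACGTACGTAC' '+' 'IIIIIIIIIIIIIIIIII' \
  '@read3' 'NNNNNNNNNNACGTACGT' '+' '!!!!!!!!!!!!!!!!!!' \
  > "$fastq"

# Lines returned with context: 4 per matching read. The "--" group
# separators that grep inserts are dropped before counting.
context_lines=$(grep -B1 -A2 NNNNNNNNNN "$fastq" | grep -v '^--$' | wc -l | tr -d ' ')

# Number of matching reads = context lines divided by 4.
reads_from_context=$((context_lines / 4))

# Alternative: count only the lines that contain the match itself.
reads_from_count=$(grep -c NNNNNNNNNN "$fastq")

echo "context lines:        $context_lines"
echo "reads (context / 4):  $reads_from_context"
echo "reads (grep -c):      $reads_from_count"

rm -rf "$workdir"
```

Both approaches agree on this toy file. The grep -c form avoids the division entirely, which may be the simpler thing to teach.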

instructions for windows Git bash users not needed

In 02_the_filesystem.md there is a part that talks about accessing help files for Windows users accessing Terminal through Git bash. This text can be deleted.

or if you’re using Git bash for Windows
$ ls --help

Add CONTRIBUTING.md and CONTRIBUTORS.md

exercise not formatting

The second exercise in the examining files section of 03-working-with-files is not rendering.

redirection example could be clearer

In 04-redirection, we discuss the difference between > and >> for redirecting output to a file. First we have learners grep for a string of 10 Ns in both of the fastq files and output that to a new file (bad_reads.txt). Then we have them do this iteratively, doing the first file and then using >> to append the search results for the second file.

Introducing both > and >> is a good idea. However, we could make the difference between these two commands clearer. For example, we could use wc -l to illustrate that new reads have been added, like so:

grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt
wc -l bad_reads.txt

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt
wc -l bad_reads.txt

We would need to introduce wc -l here (I think we do this later?). Also, there are no bad reads in the second fastq file, so the output actually isn't different because the second grep doesn't add anything to bad_reads.txt. Should use a different search string to make sure that we have different results. Or add a fake read to SRR097977.fastq that does have a match.
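One way to guarantee that the line counts differ is to use inputs that are known to contain a match. The sketch below uses two small made-up text files (first.txt and second.txt are hypothetical stand-ins, not the lesson's fastq files) to show that > replaces the file contents while >> adds to them.

```shell
workdir=$(mktemp -d)
start_dir=$PWD
cd "$workdir"

# Two hypothetical input files, each containing exactly one matching line.
printf 'AAAA\nNNNNNNNNNN\n' > first.txt
printf 'NNNNNNNNNN\nCCCC\n' > second.txt

# ">" creates bad_reads.txt (or overwrites it if it already exists).
grep NNNNNNNNNN first.txt > bad_reads.txt
after_first=$(wc -l < bad_reads.txt | tr -d ' ')

# ">" again: the previous contents are replaced, so the count stays at 1.
grep NNNNNNNNNN second.txt > bad_reads.txt
after_overwrite=$(wc -l < bad_reads.txt | tr -d ' ')

# ">>" appends to the existing contents, so the count grows to 2.
grep NNNNNNNNNN first.txt >> bad_reads.txt
after_append=$(wc -l < bad_reads.txt | tr -d ' ')

echo "after first >:  $after_first"
echo "after second >: $after_overwrite"
echo "after >>:       $after_append"

cd "$start_dir"
rm -rf "$workdir"
```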

Difficult to keep commands up to date with changes in top level directory name

An issue is that we often change the name of the main directory that files are downloaded to. This then means that the commands that people need to write change, and then either all the commands have to be updated, or they just don't match what we're typing.

One thing that might help is to define the 'directory' in the YAML header then just use {{ page.directory }} in the commands. Although I'm not sure if the YAML will render within code blocks.

grep manual page

Teaching a workshop with @ErinBecker and a learner pointed out a cool way to search manuals:

man cut | grep delimiter

The example above searches the manual for cut for the flag to change the delimiter.

This would make a great challenge exercise!

Should be no setup.

No setup info specific to this lesson as it's on the cloud. Also, the Shell lesson starts assuming folks are on a local machine; should be scrubbed.

probable typo - introducing the shell

For example, type cd to go back to your home directly, then enter:

probably intended as home directory

Partial tab complete

In 02-the-filesystem.md, the discussion of tab complete says that typing SR<tab> does nothing. However, it actually does auto-complete to SRR09. The text here needs to be updated.

Nothing happens because there are multiple files which start with SR. The shell does not know which one to fill in. When you hit tab again, the shell will list the possible choices.

Updating Learning Objectives

Hi @mckays630 and @kcranston,

I sent an e-mail on December 19th regarding updating the learning objectives for the command line and data wrangling lessons. The proposed learning objectives are attached. I'd appreciate your feedback on these revised learning objectives, and once you've approved then I'll submit a pull request to update the lesson.

Data Carpentry is in the process of revising the learning objectives for all of our lessons such that they are measurable.

Thank you for your help and time. The files are attached.

Kari
Data-wrangling-and-processing-LOsharedwithmaintainers.docx
Introduction-to-the-command-line-LOsharedwithmaintainers.docx

things people learned

This issue is for things that people have said they learned or liked about this lesson.

Student comments

"I learned how to access and begin to utilize the terminal on my personal laptop"

"I have a better understanding of how to write/save/execute scripts"

Confusing wording "the home directory" (filesystem episode)

Navigate to the home directory if you are not already there.

Intention is to get learner to THEIR home directory, but if you're not used to the concept of your home directory, this could come across as "the directory called home". Proposing change to

Navigate to your home directory if you are not already there.

typos in redirection episode

"grep" misspelled as "greap"
In the second "sort" example there is extra whitespace

error in text of directory navigation

02-the-filesystem.md says:

You can chain these together like so:
$ ls ../../
prints the contents of /home/dcuser which is your home directory.

However, doing ls ../../ at this point in the lesson takes the user all the way to /home/ (not /home/dcuser/).
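The behaviour can be checked with a throwaway directory tree. This sketch recreates the lesson's path layout under a temporary directory (the tree is a stand-in, not the real instance) and reports where ../ and ../../ lead from two different starting points.

```shell
workdir=$(mktemp -d)
start_dir=$PWD

# Stand-in for /home/dcuser/dc_sample_data/untrimmed_fastq.
mkdir -p "$workdir/home/dcuser/dc_sample_data/untrimmed_fastq"

# Starting in dc_sample_data: one level up is dcuser, two levels up is home.
cd "$workdir/home/dcuser/dc_sample_data"
one_up=$(cd .. && basename "$PWD")
two_up=$(cd ../.. && basename "$PWD")

# Starting in untrimmed_fastq: two levels up is dcuser.
cd untrimmed_fastq
two_up_from_fastq=$(cd ../.. && basename "$PWD")

echo "from dc_sample_data,  ../    -> $one_up"
echo "from dc_sample_data,  ../../ -> $two_up"
echo "from untrimmed_fastq, ../../ -> $two_up_from_fastq"

cd "$start_dir"
rm -rf "$workdir"
```

So the lesson text is only correct if learners are in untrimmed_fastq when they run the command.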

Provide files for rendering gh-pages

Similarly to https://github.com/datacarpentry/excel-ecology, this repository needs _includes, _layout, css, etc. to render correctly in gh-pages.

Formatting issues in Redirection & Writing Scripts episodes

Exercises/challenges not formatted with colored boxes
When append (>>) is introduced, the characters are oddly formatted (redirection only)
Use of inline "as-code" formatting for unix commands (cut, sort, uniq, etc.) & directory names disappears as you go further down the page (redirection) or is spotty (writing scripts)
Code examples not block-quoted (bottom of writing scripts)

introducing bad file naming practices

The 03-working-with-files episode has the learner renaming a file to SRR098026-copy.fastq_DO_NOT_TOUCH! and then changing the name to a better name later. I dislike the idea of introducing a filename like this, even if we're having them change it later, unless we're explicitly making it clear why this type of filename isn't good practice. There are a couple of reasons I can think of, but there may be others:

use of special characters "!"
not putting file extension last in file name

So I think either we should remove this intermediate step (of having learners change the name to this) or add text about why this isn't good practice.

`/` is the root directory

03-working-with-files.md gives an example and an exercise using wildcards to search files in /usr/bin. It's not immediately clear, however, to those not familiar with the filesystem that the leading / in /usr/bin corresponds to the root (top level) directory and that this command will work regardless of the user's current location in the filesystem. Need to add this to the "Navigational shortcuts" section in the previous episode.
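A short demonstration that could accompany such a note, assuming a typical Unix system where /usr/bin exists: because the path starts with /, it resolves to the same place from any working directory.

```shell
start_dir=$PWD

# Resolve the same absolute path from two different working directories.
cd /usr
from_usr=$(ls -d /usr/bin)

cd /
from_root=$(ls -d /usr/bin)

echo "from /usr: $from_usr"
echo "from /:    $from_root"

cd "$start_dir"
```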

"cat" is a confusing example for text searching

In 03-working-with-files, section on examining files -

Note, if you are at the end of the file and search for the word “cat”, less will not find it. You need to go to the beginning of the file and search.

I know that "cat" is being used because it's a three letter string that will occur in a nucleotide file, but it's confusing because cat is also a command that was discussed just recently. Change to another string, like "caa", to avoid confusion.

Reference for teaching the command line to biologists

This could be a useful reference in thinking about teaching or presenting the shell to people new to the command line

Addressing the digital divide in contemporary biology: Lessons from teaching UNIX
http://www.biorxiv.org/content/early/2017/04/07/122424.full.pdf+html

HT @ctb

output of $ ls SR<tab><tab> command

In 01-introduction, after the "$ ls SR" it would be nice to see the output of that command.

Missing "structural" elements

Overall:

Setup (known issue, included for completeness)
Instructor Notes (known issue, included for completeness)
Extras>Glossary is blank
Extras>About only shows LC/SWC/DC generic "about" info
Extras>Figures & Extras>Discussion are blank

http://www.datacarpentry.org/shell-genomics/01-introduction/

Fine

http://www.datacarpentry.org/shell-genomics/02-the-filesystem/

Time estimates missing
Typo/fragment in Questions
No key points

http://www.datacarpentry.org/shell-genomics/03-working-with-files/

Time estimates missing
Questions missing
No key points
answer to "bonus" wildcard question missing

http://www.datacarpentry.org/shell-genomics/04-redirection/

Time estimates missing
Questions missing
No key points
Exercises lack solutions

http://www.datacarpentry.org/shell-genomics/05-writing-scripts/

Time estimates missing
Existing Question needs question mark

http://www.datacarpentry.org/shell-genomics/XX-organization/

Time estimates missing
Questions missing
Objectives missing
No key points

Episode 6 appears somewhat out of order

I think this could be fixed by removing

In this exercise we will setup a filesystem for the project we will be working on during this workshop. We will also introduce you to some more helpful shell commands, programs and tools, including:
mkdir
history
tail
|
nano

and

We will talk much more about the | command in a later lesson. For now, it’s important to know that this is called a pipe and it sends the output of the first command (history) as input to the next command (tail). We have used the -n option to give the last 7 lines of our history.

and perhaps also the "Questions" at the bottom, which don't seem too connected

Delete the master branch?

For consistency, Data Carpentry lessons are stored in the gh-pages branch, which allows HTML pages to be generated via GitHub Jekyll. It would be good to apply this solution to this repository.
Suggest:

Check if the files currently in master are in gh-pages
If they are, delete master and leave only gh-pages.
If they are not, move them into gh-pages and delete master.

Add solutions for exercises

Not all of the exercises have solutions.

Use containers for configuring software environment?

For the most part, beginner practitioners of genomics don't really need to worry about whether their terminal is connected to their laptop, a cluster on campus, Jetstream, AWS, or something else. (Of course, they need to know that some computers have more resources than others, so sometimes we connect to a remote computer instead of our laptop/desktop for big jobs.) But for beginners I don't think the distinctions between laptop, cloud, HPC, etc., are all that meaningful.

Perhaps one of the challenges is having a unified software installation across all of these platforms? If so, I'd suggest we create a single unified Docker container with all of the prerequisite software installed, and use this to teach the shell independent of the hardware we're using for the back end. AWS machines can run Docker images, Jetstream machines can run Docker images, and anyone with root access to their laptop or desktop can run Docker images. The biggest potential issue is HPC provisioning, since most sysadmins are reluctant to install Docker (requires privilege escalation). But Singularity is a compatible alternative (should run Docker images fine) that does not require privilege escalation.

I'd be happy to spearhead the effort to create the official "Data Carpentry shell genomics" Docker image if there is support for this idea.

Output of ls in dc_sample_data is different from answer in exercise

As of 9/15/17, the output for ls -l is below:

$ ls -l
total 62788
drwxrwxr-x 4 dcuser dcuser 4096 May 21 2016 r_genomics
drwxr-x--- 2 dcuser dcuser 4096 Jul 30 2015 sra_metadata
drwxr-xr-x 2 dcuser dcuser 4096 Jul 30 2015 untrimmed_fastq
-rw-rw-r-- 1 dcuser dcuser 64281061 Jul 31 2015 variant_calling.tar.gz

The lesson currently shows:

drwxrwxr-x 4 dcuser dcuser 4096 May 21 2016 r_genomics
drwxr-x--- 2 dcuser dcuser 4096 Jul 30 2015 sra_metadata
drwxr-xr-x 2 dcuser dcuser 4096 Jul 30 2015 untrimmed_fastq
drwxr-xr-x 3 dcuser dcuser 4096 Jul 31 2015 variant_calling
-rw-rw-r-- 1 dcuser dcuser 64281061 Jul 31 2015 variant_calling.tar.gz

Need to remove the unzipped variant_calling file from exercise solution.

searching the wrong file

03-working-with-files says:

For instance, let’s search the file we have open for the sequence GTGCGGGCAATTAACAGGGGTTCAC. You can see that we go right to that sequence and can see what it looks like.

However, this search string is from SRR097977.fastq and learners have been working with SRR098026.fastq.

commands and output not differentiated

In 04-redirection, near the middle and bottom of the page, there are several chunks that are rendering as extended code blocks that include both input code and output. These should be rendered separately.

Amazon AMI for this lesson?

I was just proofreading this lesson for the issue bonanza, and was unable to test the commands "for real" because I don't have access to the EC2 instance learners would have. Luckily, this is pretty data-agnostic stuff, so no complaints, but it got me wondering if there was an Amazon AMI for this lesson so that solo learners or non-official workshop runners can follow along? I found this blog post http://www.datacarpentry.org/blog/amsterdam-genomics/ referencing such an AMI, but did not find it in the Amazon AMI marketplace. Can we consider making it part of lesson materials?

Exercise switching from /usr/bin to /bin

The exercise in 01_the_filesystem.html that starts with

List all of the files in /bin that start with the letter 'c

asks the learner to use the /bin folder, while just before the /usr/bin folder was used. At the Oslo workshop learners failed to notice. This didn't matter for the exercise (they still got the answers right) but maybe sticking to /usr/bin is better?

Permissions discussion incomplete (writing scripts)

We see that it says “-rw-r--r--” which means that the file can mainly be read. That’s the ‘r’.

I don't think this is a complete enough explanation, my argument being that it doesn't meet the standard of giving learners a starting place to learn the finer points themselves. I know this borders on "adding a concept" but maybe we can brainstorm something?
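As one possible starting point for that brainstorm, the permission string can be split into its four parts: one character for the file type, then three characters each for the owner, the group, and everyone else. The sketch below does this for the string quoted above. The variable names are illustrative only.

```shell
# The permission string as it appears in the first column of "ls -l".
perms='-rw-r--r--'

# Character 1: file type ("-" for a regular file, "d" for a directory).
file_type=$(printf '%s\n' "$perms" | cut -c1)

# Characters 2-4, 5-7 and 8-10: read/write/execute for owner, group, others.
owner=$(printf '%s\n' "$perms" | cut -c2-4)
group=$(printf '%s\n' "$perms" | cut -c5-7)
others=$(printf '%s\n' "$perms" | cut -c8-10)

echo "type:   $file_type"
echo "owner:  $owner"
echo "group:  $group"
echo "others: $others"
```

Read this way, the owner can read and write the file, while the group and all other users can only read it.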

fastq vs .fastq

03-working-with-files.md uses *fastq as the first example of a wildcard. A subsequent example then does *.fastq. I would always use *.fastq in my work, and typing *fastq feels weird. I know that these two will be equivalent, but for cognitive load reasons I think we should pick one and be consistent.
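The two patterns match the same files in the lesson's data directory, but not in general. A small sketch with made-up file names (notes_about_fastq is hypothetical) shows where they diverge, which may be one more argument for settling on *.fastq.

```shell
workdir=$(mktemp -d)
start_dir=$PWD
cd "$workdir"

# Two files with a .fastq extension and one that merely ends in "fastq".
touch SRR097977.fastq SRR098026.fastq notes_about_fastq

# *fastq matches anything ending in "fastq", extension or not.
without_dot=$(ls *fastq | wc -l | tr -d ' ')

# *.fastq requires the literal ".fastq" at the end of the name.
with_dot=$(ls *.fastq | wc -l | tr -d ' ')

echo "*fastq  matches $without_dot files"
echo "*.fastq matches $with_dot files"

cd "$start_dir"
rm -rf "$workdir"
```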

Waiting to introduce navigating two directories at once

In 01-introduction the concept of navigating two directories at once, "cd dc_sample_data/untrimmed_fastq" is introduced. This is actually a difficult concept, so it's likely best to wait to introduce it until later in the lesson.

Missing newline at end of SraRunTable.txt

On the current Amazon AMI, the file '/home/dcuser/dc_sample_data/sra_metadata/SraRunTable.txt' is missing a newline at the end. This results in wc -l counting 37 lines instead of 38.
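The off-by-one arises because wc -l counts newline characters rather than lines of text. A minimal reproduction with two-line inputs (not the actual SraRunTable.txt) is sketched below. awk is used as a comparison because it counts the final unterminated line as a record.

```shell
# Two lines, both terminated by a newline: wc -l reports 2.
with_newline=$(printf 'line1\nline2\n' | wc -l | tr -d ' ')

# Same text without the final newline: wc -l reports 1.
without_newline=$(printf 'line1\nline2' | wc -l | tr -d ' ')

# awk counts records, so the unterminated last line is still included.
awk_count=$(printf 'line1\nline2' | awk 'END { print NR }')

echo "wc -l with trailing newline:    $with_newline"
echo "wc -l without trailing newline: $without_newline"
echo "awk NR without trailing newline: $awk_count"
```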

More information on the shell information

At the bottom of 01-introduction, there's a section "More information on the shell". Are those the correct resources to point to? Should we provide more information on what these resources are?

broken link in slides.md

Noticed a broken link to nano1.png and nano2.png under the "Writing files" section of slides.md

add loops episode

When I taught this lesson with @k8hertweck last month, we finished the shell lessons early and ended up improvising an episode on loops. Introducing loops on the first day was very useful for the learners and they used that skill several times on day 2 (I'm not sure whether it is part of the normal day 2 curriculum to do loops or if that was something Kate added).

I propose introducing an episode on loops between 03-working-with-files and 04-redirection.

A very useful for loop at this point in the lesson would be:
for FILENAME in *; do mv "$FILENAME" "$FILENAME-copy"; done
to allow learners to rename all of the backup files they created in the previous episode.

This also allows for more challenging exercises including:

a for loop including changing file permissions in batch to change backup files to read-only
a for loop including creating of the backup copies, in addition to renaming them
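A sketch of what the two challenge exercises might look like when combined, run on made-up empty files in a temporary directory. The -backup naming scheme is an assumption for illustration, not something the lesson prescribes. Variables are quoted so that file names containing spaces are handled safely.

```shell
workdir=$(mktemp -d)
start_dir=$PWD
cd "$workdir"

# Hypothetical stand-ins for the lesson's fastq files.
touch SRR097977.fastq SRR098026.fastq

# The glob is expanded once, before the loop starts, so the backups
# created inside the loop are not themselves looped over.
for filename in *.fastq; do
  backup="${filename%.fastq}-backup.fastq"   # strip the extension, re-add it last
  cp "$filename" "$backup"                   # create the backup copy
  chmod a-w "$backup"                        # remove write permission for everyone
done

backups=$(ls *-backup.fastq | wc -l | tr -d ' ')
echo "backups created: $backups"
ls -l

cd "$start_dir"
rm -rf "$workdir"
```

Keeping the .fastq extension at the end of the backup name means wildcard patterns such as *.fastq continue to match the copies.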

Typos and additional exercise

I am writing to make a contribution to “Genomics Workshop” lessons as part of my Data Carpentry instructor training checkout procedure.

I have proof-read the lesson materials and noticed the following typos in the code or in the text:

http://www.datacarpentry.org/shell-genomics/lessons/01_the_filesystem.html:

"Let's go into the sample data directory: cd dc_sample data” - the directory name should be “dc_sample_data"
“We can use a command line argument with 'ls' to get more information”
"Another useful one is '-a', which show everything"

https://github.com/datacarpentry/cloud-genomics/blob/gh-pages/lessons/1.logging-onto-cloud.md:

"Commercial clouds are can be very powerful”
"We will cover as much as we you need to get through the Data Carpentry lessons”

https://github.com/datacarpentry/cloud-genomics/blob/gh-pages/lessons/4.parallel-analysis.md:

"And you computer does each execution in order”
"But what if you do'nt have enough cores to match the number of programs run?”
"What commands can use you to see your process IDs?”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/00-readQC.md:

"it will help your remember what you did”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/01-automating_a_workflow.md:

"First, recall the code from our our fastqc workflow from this morning”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/02-variant-calling-workflow.md:

"so the name of each fastq file will by assigned to $fq"

At the lesson: https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/02-variant-calling-workflow.md
mkdir command is repeated 5 times to create directories:
$ mkdir -p results/sai
$ mkdir -p results/sam
$ mkdir -p results/bam
$ mkdir -p results/bcf
$ mkdir -p results/vcf
It could be done in one line instead: mkdir -p results/sai results/sam results/bam results/bcf results/vcf

3. An additional exercise for https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/02-variant-calling-workflow.md lesson could be: “Explore the content of sam file and find the first 6 mandatory fields listed in SAM format specifications: https://samtools.github.io/hts-specs/SAMv1.pdf “

Please let me know if you need further clarification regarding my contributions and whether these would be sufficient for the checkout procedure.

Thank you
Asli Uyar

hidden directory for exercise

The exercise in 02-the-filesystem has learners using a directory ".hidden". It's likely not necessary to have it be an actual hidden directory with the . in front of it, as it adds cognitive overhead.

This would involve an update of the instance, but could the directory instead just be a regular directory called "hidden"?
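For reference, the behaviour the current exercise relies on can be reproduced in a scratch directory. In this sketch the directory and file names are stand-ins. A plain ls omits names beginning with a dot, while ls -a includes them.

```shell
workdir=$(mktemp -d)
start_dir=$PWD
cd "$workdir"

# One hidden directory (leading dot) and one regular directory.
mkdir .hidden visible
touch .hidden/example.txt

# Plain ls shows only the regular directory.
plain_count=$(ls | wc -l | tr -d ' ')

# ls -a also lists entries whose names start with a dot.
hidden_seen=$(ls -a | grep -c '^\.hidden$')

# The hidden directory's contents can be listed by naming it explicitly.
hidden_contents=$(ls .hidden)

echo "entries shown by ls:       $plain_count"
echo ".hidden shown by ls -a:    $hidden_seen"
echo "contents of .hidden:       $hidden_contents"

cd "$start_dir"
rm -rf "$workdir"
```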

introduce file permissions

In 03-working-with-files, we introduce the idea of making a backup copy of our raw data. This would be a good place to introduce file permissions and change permissions for these files to read-only so that we can't accidentally overwrite our backup copies.
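A minimal sketch of how that step could be shown to learners, using a made-up backup file. The starting permissions are set explicitly to 644 so that the before and after strings are predictable. On a real system the starting value depends on the user's umask.

```shell
workdir=$(mktemp -d)
backup="$workdir/SRR098026-backup.fastq"   # hypothetical backup file name
touch "$backup"

# Start from a known state: owner read/write, group and others read-only.
chmod 644 "$backup"
before=$(ls -l "$backup" | cut -c1-10)

# Remove write permission for all users to protect the backup.
chmod a-w "$backup"
after=$(ls -l "$backup" | cut -c1-10)

echo "before: $before"
echo "after:  $after"

rm -rf "$workdir"
```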

Site links on 'lessons' page point into github repo

Hi, it is confusing to me that on

http://www.datacarpentry.org/lessons/#genomics-workshop

the "Introduction to the command line - site" link points to https://github.com/datacarpentry/shell-genomics/tree/gh-pages/lessons, rather than to a rendered page. Is this intentional?

If not, I can try to find the source to that lessons page and fix it there; just let me know where the correct URL is (or I can try to find that, too ;)