datacarpentry / shell-genomics

Introduction to the Command Line for Genomics

Home Page: https://datacarpentry.org/shell-genomics

License: Other

carpentries data-carpentry lesson shell programming english genomics stable

shell-genomics's Introduction

Shell Genomics lessons

An introduction to the Unix shell for people working with genomics data. This lesson is part of the Data Carpentry Genomics Workshop. Please see http://www.datacarpentry.org/shell-genomics/ for a rendered version of this material.

Contribution

Make a suggestion or correct an error by raising an Issue.

Code of Conduct

All participants should agree to abide by the Data Carpentry Code of Conduct.

Authors

Shell Genomics is authored and maintained by the community.

Citation

Please cite as:

Erin Alison Becker, Anita Schürch, Tracy Teal, Sheldon John McKay, Jessica Elizabeth Mizzi, François Michonneau, et al. (2019, June). datacarpentry/shell-genomics: Data Carpentry: Introduction to the shell for genomics data, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3260560

shell-genomics's People

Contributors

ajank, akshayparopkari, amyehodge, aschuerch, binxiepeterson, cpauvert, ctb, erinbecker, esallychang, fmichonneau, ggrimes, hoytpr, jessicalumian, jmastough, joiry, kariljordan, kcranston, martinosorb, mckays630, mfoos, nbkingsley, p-j-smith, shannonekj, smcclatchy, sstevens2, swang8, taylorreiter, tobyhodges, tracykteal, zkamvar

shell-genomics's Issues

List of Issues

Incomplete list of significant changes to content that need to be written and/or revised:

General

Episode 1 - Introducing the Shell

Episode 2 - The Filesystem

Episode 3 - Working with Files

Episode 4 - Redirection

Episode 5 - Writing Scripts

Episode 6 - Project Organization

bad-reads vs bad_reads

Some exercises and examples use the filename bad-reads.txt and others use bad_reads.txt. Learners may end up with two different "bad reads" files and be confused about which they are using for a particular exercise. We should either distinguish between these two files (if we want to make a point about using different methods to pull out the bad reads) or make this consistent and make sure we're instructing the learners to delete the bad reads file between each new method.

provide starting working directory for exercise (filesystem episode)

There's an exercise where the challenge is

There is a hidden directory in our file system. Explore the options for ls to find out how to see hidden directories. List the contents of the directory and identify the name of the text file in that directory.

but it occurs immediately after a section about learning how to navigate between directories, so adventurous learners may not be in the right directory when this challenge arises. Adding the starting directory to the question would help learners get the intended takeaway.

Add instructor notes document for this lesson

I'm working on helping direct instructor attention towards fixing up/contributing to instructor notes. I currently don't have a link to provide for instructor notes for this lesson. Please add one; even a blank document would be somewhere to point towards.

Add "clear" command.

03-working-with-files has a section for command history that includes the following:

^-C (Control-C) will cancel the command you are writing, and give you a fresh prompt.
^-R will do a reverse-search through your command history. This is very useful.

Should add (Control-R) to the second line for clarity.
Can also add the clear command.

Standard Lesson Release Checklist

Lesson Release checklist

For each lesson release, copy this checklist to an issue and check off
during preparation for release

Scheduled Freeze Date: YYYY-MM-DD
Scheduled Release Date: YYYY-MM-DD

Checklist of tasks to complete before release:

  • check that the learning objectives reflect the content of the lessons
  • check that learning objectives are phrased as statements using action words
  • check for typos
  • check that the live coding examples work as expected
  • if example code generates warnings, explain in narrative and instructor notes
  • check that challenges and their solutions work as expected
  • check that the challenges test skills that have been seen
  • check that the setup instructions are up to date (e.g., update version numbers)
  • check that data is available and mentions of the data in the lessons are accurate
  • check that the instructor guide is up to date with the content of the lessons
  • check that all the links within the lessons work (this should be automated)
  • check that the cheat sheets included in lessons are up to date (e.g., RStudio updates them regularly)
  • check that language is clear and free of idioms and colloquialisms
  • make sure formatting of the code in the lesson looks good (e.g. line breaks)
  • check for clarity and flow of narrative
  • update README as needed
  • fill out “overview” for each module - minutes needed for teaching and exercises, questions and learning objectives
  • check that contributor guidelines are clear and consistent
  • clean up files (e.g. delete deprecated files, ensure filenames are consistent)
  • update the release notes (NEWS)
  • tag release on GitHub

typo

In 03-working-with-files

to go backwarsd

math fail in sequence counting (redirection)

It counts the number of lines or characters. So, we can use it to count the number of lines we’re getting back from our grep command. And that will magically tell us how many sequences we’re finding.

Since we're grepping with -B1 (one line before each match) and -A2 (two lines after), the line count is actually 4x the number of sequences.
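
The arithmetic can be checked on a small file. This is a minimal sketch, assuming a made-up three-read FASTQ file (example.fastq and its contents are hypothetical, not lesson data). It also assumes grep prints a "--" separator between non-adjacent context groups, which would inflate the line count if left in.

```bash
# Build a tiny, made-up FASTQ file: three reads, two of them containing ten Ns.
workdir=$(mktemp -d)
cat > "$workdir/example.fastq" <<'EOF'
@read1
ACGTNNNNNNNNNNACGT
+
IIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTAC
+
IIIIIIIIIIIIIIIIII
@read3
NNNNNNNNNNACGTACGT
+
IIIIIIIIIIIIIIIIII
EOF

# Each match brings one line before and two lines after: 4 lines per read.
# grep also prints a "--" line between non-adjacent groups; drop those first.
total_lines=$(grep -B1 -A2 NNNNNNNNNN "$workdir/example.fastq" | grep -v '^--$' | wc -l)
n_reads=$(( total_lines / 4 ))
echo "lines: $total_lines, reads: $n_reads"

# Counting only the matching lines gives the number of reads more directly.
n_matches=$(grep -c NNNNNNNNNN "$workdir/example.fastq")
echo "matching lines: $n_matches"

rm -rf "$workdir"
```

Either dividing the filtered line count by four or using grep -c would give the number of sequences; which one fits the lesson better is a judgment call for the maintainers.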

instructions for windows Git bash users not needed

In 02_the_filesystem.md there is a part that talks about accessing help files for Windows users accessing Terminal through Git bash. This text can be deleted.

or if you’re using Git bash for Windows
$ ls --help

exercise not formatting

The second exercise in the examining files section of 03-working-with-files is not rendering.

redirection example could be clearer

In 04-redirection, we discuss the difference between > and >> for redirecting output to a file. First we have learners grep for a string of 10 Ns in both of the fastq files and output that to a new file (bad_reads.txt). Then we have them do this iteratively, doing the first file and then using >> to append the search results for the second file.

Introducing both > and >> is a good idea; however, we could make the difference between these two commands clearer. For example, we could use wc -l to illustrate that new reads have been added, like so:

grep -B1 -A2 NNNNNNNNNN *.fastq > bad_reads.txt
wc -l bad_reads.txt

grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt
wc -l bad_reads.txt
grep -B1 -A2 NNNNNNNNNN SRR097977.fastq >> bad_reads.txt
wc -l bad_reads.txt

We would need to introduce wc -l here (I think we do this later?). Also, there are no bad reads in the second fastq file, so the output actually isn't different because the second grep doesn't add anything to bad_reads.txt. Should use a different search string to make sure that we have different results. Or add a fake read to SRR097977.fastq that does have a match.
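
One way to guarantee different results is to use inputs that both contain a match. The sketch below assumes two made-up files (first.fastq and second.fastq are hypothetical stand-ins, not the lesson's SRR files), each holding a single read with ten Ns, so the line count visibly grows after the append.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Two made-up FASTQ files, each with one read containing ten Ns.
printf '@a1\nNNNNNNNNNNAC\n+\nIIIIIIIIIIII\n' > first.fastq
printf '@b1\nACNNNNNNNNNN\n+\nIIIIIIIIIIII\n' > second.fastq

# ">" creates bad_reads.txt, or overwrites it if it already exists.
grep -B1 -A2 NNNNNNNNNN first.fastq > bad_reads.txt
after_overwrite=$(wc -l < bad_reads.txt)

# ">>" appends, so the line count grows instead of being reset.
grep -B1 -A2 NNNNNNNNNN second.fastq >> bad_reads.txt
after_append=$(wc -l < bad_reads.txt)

echo "after >: $after_overwrite lines; after >>: $after_append lines"

cd / && rm -rf "$workdir"
```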

Difficult to keep commands up to date with changes in top level directory name

An issue is that we often change the name of the main directory that files are downloaded to. This then means that the commands that people need to write change, and then either all the commands have to be updated, or they just don't match what we're typing.

One thing that might help is to define the 'directory' in the YAML header then just use {{ page.directory }} in the commands. Although I'm not sure if the YAML will render within code blocks.

grep manual page

Teaching a workshop with @ErinBecker and a learner pointed out a cool way to search manuals:

man cut | grep delimiter

The example above searches the manual for cut for the flag to change the delimiter.

This would make a great challenge exercise!

Should be no setup.

No setup info specific to this lesson is needed, as it's on the cloud. Also, the Shell lesson starts by assuming folks are on a local machine; this should be scrubbed.

Partial tab complete

In 02-the-filesystem.md, the discussion of tab complete says that typing SR<tab> does nothing. However, it actually does auto-complete to SRR09. The text here needs to be updated.

Nothing happens because there are multiple files which start with SR. The shell does not know which one to fill in. When you hit tab again, the shell will list the possible choices.

Updating Learning Objectives

Hi @mckays630 and @kcranston,

I sent an e-mail on December 19th regarding updating the learning objectives for the command line and data wrangling lessons. The proposed learning objectives are attached. I'd appreciate your feedback on these revised learning objectives, and once you've approved then I'll submit a pull request to update the lesson.

Data Carpentry is in the process of revising the learning objectives for all of our lessons such that they are measurable.

Thank you for your help and time. The files are attached.

Kari
Data-wrangling-and-processing-LOsharedwithmaintainers.docx
Introduction-to-the-command-line-LOsharedwithmaintainers.docx

things people learned

This issue is for things that people have said they learned or liked about this lesson.

Student comments

"I learned how to access and begin to utilize the terminal on my personal laptop"

"I have a better understanding of how to write/save/execute scripts"

Confusing wording "the home directory" (filesystem episode)

Navigate to the home directory if you are not already there.

Intention is to get learner to THEIR home directory, but if you're not used to the concept of your home directory, this could come across as "the directory called home". Proposing change to

Navigate to your home directory if you are not already there.

error in text of directory navigation

02-the-filesystem.md says:

You can chain these together like so:
$ ls ../../
prints the contents of /home/dcuser which is your home directory.

However, doing ls ../../ at this point in the lesson takes the user all the way to /home/ (not /home/dcuser/).
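
The discrepancy comes from where the learner is standing when the command runs. As a rough illustration, the sketch below rebuilds the lesson's directory names inside a scratch directory (the scratch location is an assumption; the real paths live under /home) and resolves ../../ from two starting points.

```bash
# Recreate the lesson's directory layout inside a scratch directory.
root=$(mktemp -d)
mkdir -p "$root/home/dcuser/dc_sample_data/untrimmed_fastq"

# From dc_sample_data, ../../ climbs two levels: to dcuser, then to home.
cd "$root/home/dcuser/dc_sample_data"
from_sample_data=$(cd ../../ && basename "$PWD")

# From untrimmed_fastq, the same ../../ lands in dcuser instead.
cd "$root/home/dcuser/dc_sample_data/untrimmed_fastq"
from_untrimmed=$(cd ../../ && basename "$PWD")

echo "from dc_sample_data:  ../../ -> $from_sample_data"
echo "from untrimmed_fastq: ../../ -> $from_untrimmed"

cd / && rm -rf "$root"
```

So the lesson text would be correct only if learners were in untrimmed_fastq at that point; stating the expected working directory before the command would remove the ambiguity.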

Formatting issues in Redirection & Writing Scripts episodes

  1. Exercises/challenges not formatted with colored boxes
  2. When append (>>) is introduced, the characters are oddly formatted (redirection only)
  3. Use of inline "as-code" formatting for unix commands (cut, sort, uniq, etc.) & directory names disappears as you go further down the page (redirection) or is spotty (writing scripts)
  4. Code examples not block-quoted (bottom of writing scripts)

introducing bad file naming practices

The 03-working-with-files episode has the learner renaming a file to SRR098026-copy.fastq_DO_NOT_TOUCH! and then changing the name to a better name later. I dislike the idea of introducing a filename like this, even if we're having them change it later, unless we're explicitly making it clear why this type of filename isn't good practice. There are a couple of reasons I can think of, but there may be others:

  • use of special characters "!"
  • not putting file extension last in file name

So I think either we should remove this intermediate step (of having learners change the name to this) or add text about why this isn't good practice.
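
If explanatory text is added, a short demonstration might help. This is only a sketch, using empty placeholder files (SRR098026-backup.fastq is a hypothetical "better" name, not one taken from the lesson), showing that a name whose extension is not last is skipped by the usual wildcard.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Two empty placeholder files; only the names matter here.
# Single quotes keep the shell from treating "!" specially
# (interactive bash uses "!" for history expansion).
touch 'SRR098026-copy.fastq_DO_NOT_TOUCH!' 'SRR098026-backup.fastq'

# Only names that END in .fastq match, so the DO_NOT_TOUCH file is skipped.
matched=$(ls *.fastq)
echo "$matched"

cd / && rm -rf "$workdir"
```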

`/` is the root directory

03-working-with-files.md gives an example and an exercise using wildcards to search files in /usr/bin. It's not immediately clear, however, to those not familiar with the filesystem that the leading / in /usr/bin corresponds to the root (top level) directory and that this command will work regardless of the user's current location in the filesystem. Need to add this to the "Navigational shortcuts" section in the previous episode.
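
A possible illustration for that section, assuming a typical Unix system where both /tmp and /usr/bin exist: the same absolute path gives the same listing from two different working directories, while a relative path depends on where you are.

```bash
# An absolute path starts at the root directory "/", so it resolves the
# same way from any working directory.
cd /tmp
count_from_tmp=$(ls /usr/bin | wc -l)

cd /
count_from_root=$(ls /usr/bin | wc -l)

echo "from /tmp: $count_from_tmp entries; from /: $count_from_root entries"

# A relative path (no leading "/") is resolved against the current directory,
# so "usr/bin" only works here because the current directory is "/".
ls usr/bin > /dev/null && echo "usr/bin resolves from /"
```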

"cat" is a confusing example for text searching

In 03-working-with-files, section on examining files -

Note, if you are at the end of the file and search for the word “cat”, less will not find it. You need to go to the beginning of the file and search.

I know that "cat" is being used because it's a three letter string that will occur in a nucleotide file, but it's confusing because cat is also a command that was discussed just recently. Change to another string, like "caa", to avoid confusion.

Missing "structural" elements

Overall:

  • Setup (known issue, included for completeness)
  • Instructor Notes (known issue, included for completeness)
  • Extras>Glossary is blank
  • Extras>About only shows LC/SWC/DC generic "about" info
  • Extras>Figures & Extras>Discussion are blank

http://www.datacarpentry.org/shell-genomics/01-introduction/

  • Fine

http://www.datacarpentry.org/shell-genomics/02-the-filesystem/

  • Time estimates missing
  • Typo/fragment in Questions
  • No key points

http://www.datacarpentry.org/shell-genomics/03-working-with-files/

  • Time estimates missing
  • Questions missing
  • No key points
  • answer to "bonus" wildcard question missing

http://www.datacarpentry.org/shell-genomics/04-redirection/

  • Time estimates missing
  • Questions missing
  • No key points
  • Exercises lack solutions

http://www.datacarpentry.org/shell-genomics/05-writing-scripts/

  • Time estimates missing
  • Existing Question needs question mark

http://www.datacarpentry.org/shell-genomics/XX-organization/

  • Time estimates missing
  • Questions missing
  • Objectives missing
  • No key points

Episode 6 appears somewhat out of order

I think this could be fixed by removing

In this exercise we will setup a filesystem for the project we will be working on during this workshop. We will also introduce you to some more helpful shell commands, programs and tools, including:
mkdir
history
tail
|
nano

and

We will talk much more about the | command in a later lesson. For now, it’s important to know that this is called a pipe and it sends the output of the first command (history) as input to the next command (tail). We have used the -n option to give the last 7 lines of our history.

and perhaps also the "Questions" at the bottom, which don't seem too connected

Delete the master branch?

For consistency, Data Carpentry lessons are stored in the gh-pages branch, which allows HTML pages to be generated with Jekyll on GitHub. It would be good to apply this solution to this repository.
Suggest:

  1. Check if the files currently in master are in gh-pages
  2. If they are, delete master and leave only gh-pages.
  3. If they are not, move them into gh-pages and delete master.

Use containers for configuring software environment?

For the most part, beginner practitioners of genomics don't really need to worry about whether their terminal is connected to their laptop, a cluster on campus, Jetstream, AWS, or something else. (Of course, they need to know that some computers have more resources than others, so sometimes we connect to a remote computer instead of our laptop/desktop for big jobs.) But for beginners I don't think the distinctions between laptop, cloud, HPC, etc., are all that meaningful.

Perhaps one of the challenges is having a unified software installation across all of these platforms? If so, I'd suggest we create a single unified Docker container with all of the prerequisite software installed, and use this to teach the shell independent of the hardware we're using for the back end. AWS machines can run Docker images, Jetstream machines can run Docker images, and anyone with root access to their laptop or desktop can run Docker images. The biggest potential issue is HPC provisioning, since most sysadmins are reluctant to install Docker (requires privilege escalation). But Singularity is a compatible alternative (should run Docker images fine) that does not require privilege escalation.

I'd be happy to spearhead the effort to create the official "Data Carpentry shell genomics" Docker image if there is support for this idea.

Output of ls in dc_sample_data is different from answer in exercise

As of 9/15/17, the output for ls -l is below:

$ ls -l
total 62788
drwxrwxr-x 4 dcuser dcuser 4096 May 21 2016 r_genomics
drwxr-x--- 2 dcuser dcuser 4096 Jul 30 2015 sra_metadata
drwxr-xr-x 2 dcuser dcuser 4096 Jul 30 2015 untrimmed_fastq
-rw-rw-r-- 1 dcuser dcuser 64281061 Jul 31 2015 variant_calling.tar.gz

The lesson currently shows:

drwxrwxr-x 4 dcuser dcuser 4096 May 21 2016 r_genomics
drwxr-x--- 2 dcuser dcuser 4096 Jul 30 2015 sra_metadata
drwxr-xr-x 2 dcuser dcuser 4096 Jul 30 2015 untrimmed_fastq
drwxr-xr-x 3 dcuser dcuser 4096 Jul 31 2015 variant_calling
-rw-rw-r-- 1 dcuser dcuser 64281061 Jul 31 2015 variant_calling.tar.gz

Need to remove the unzipped variant_calling directory from the exercise solution.

searching the wrong file

03-working-with-files says:

For instance, let’s search the file we have open for the sequence GTGCGGGCAATTAACAGGGGTTCAC. You can see that we go right to that sequence and can see what it looks like.

However, this search string is from SRR097977.fastq and learners have been working with SRR098026.fastq.

commands and output not differentiated

In 04-redirection, near the middle and bottom of the page, there are several chunks that are rendering as extended code blocks that include both input code and output. These should be rendered separately.

Amazon AMI for this lesson?

I was just proofreading this lesson for the issue bonanza, and was unable to test the commands "for real" because I don't have access to the EC2 instance learners would have. Luckily, this is pretty data-agnostic stuff, so no complaints, but it got me wondering if there was an Amazon AMI for this lesson so that solo learners or non-official workshop runners can follow along? I found this blog post, http://www.datacarpentry.org/blog/amsterdam-genomics/, referencing such an AMI, but did not find it in the Amazon AMI marketplace. Can we consider making it part of lesson materials?

Exercise switching from /usr/bin to /bin

The exercise in 01_the_filesystem.html that starts with

List all of the files in /bin that start with the letter 'c'

asks the learner to use the /bin folder, while just before the /usr/bin folder was used. At the Oslo workshop learners failed to notice. This didn't matter for the exercise (they still got the answers right) but maybe sticking to /usr/bin is better?

Permissions discussion incomplete (writing scripts)

We see that it says “-rw-r--r--” which means that the file can mainly be read. That’s the ‘r’.

I don't think this is a complete enough explanation, my argument being that it doesn't meet the standard of giving learners a starting place to learn the finer points themselves. I know this borders on "adding a concept" but maybe we can brainstorm something?
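
As a starting point for that brainstorm, here is a hedged sketch of what a fuller explanation could walk through. The filename is a placeholder, and the first ten characters of ls -l are cut out so that platform-specific suffixes do not get in the way.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch bad-reads-script.sh   # empty placeholder file

# 644 = owner can read and write; group and others can only read.
chmod 644 bad-reads-script.sh
before=$(ls -l bad-reads-script.sh | cut -c1-10)
echo "$before"

# Position 1 is the file type ("-" = regular file, "d" = directory).
# Positions 2-4, 5-7 and 8-10 are read/write/execute for the owner,
# the group, and everyone else, in that order.
# Adding execute permission for the owner changes only position 4.
chmod u+x bad-reads-script.sh
after=$(ls -l bad-reads-script.sh | cut -c1-10)
echo "$after"

cd / && rm -rf "$workdir"
```

Reading the string as one type character followed by three rwx triplets may give learners enough of a foothold to look up the finer points themselves.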

*fastq vs *.fastq

03-working-with-files.md uses *fastq as the first example of a wildcard. A subsequent example then does *.fastq. I would always use *.fastq in my work, and typing *fastq feels weird. I know that these two will be equivalent, but for cognitive load reasons I think we should pick one and be consistent.
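
The two patterns are equivalent only as long as every matching name ends in ".fastq". A small sketch with placeholder files (notes_on_fastq is a hypothetical name added to show the difference):

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Placeholder files: two FASTQ names from the lesson plus one name that
# ends in "fastq" without the dot.
touch SRR097977.fastq SRR098026.fastq notes_on_fastq

with_dot=$(echo *.fastq)
without_dot=$(echo *fastq)

echo "*.fastq -> $with_dot"
echo "*fastq  -> $without_dot"

cd / && rm -rf "$workdir"
```

Because *fastq can pick up extra files, standardizing on *.fastq would be the more defensive choice.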

Waiting to introduce navigating two directories at once

In 01-introduction the concept of navigating two directories at once, "cd dc_sample_data/untrimmed_fastq" is introduced. This is actually a difficult concept, so it's likely best to wait to introduce it until later in the lesson.

Missing newline at end of SraRunTable.txt

On the current Amazon AMI, the file '/home/dcuser/dc_sample_data/sra_metadata/SraRunTable.txt' is missing a newline at the end. This results in wc -l counting 37 lines instead of 38.
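
The off-by-one can be reproduced without the AMI. This sketch uses three made-up rows rather than the real SraRunTable.txt:

```bash
# wc -l counts newline characters, not lines of text, so a final line
# without a trailing newline is not counted.
with_newline=$(printf 'row1\nrow2\nrow3\n' | wc -l)
without_newline=$(printf 'row1\nrow2\nrow3' | wc -l)
echo "with trailing newline: $with_newline; without: $without_newline"

# One workaround: count with grep, which treats an unterminated final
# line as a line.
grep_count=$(printf 'row1\nrow2\nrow3' | grep -c '')
echo "grep -c '': $grep_count"
```

Appending the missing newline to the file on the instance would fix the count at the source; the grep variant is only a workaround.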

More information on the shell information

At the bottom of 01-introduction, there's a section "More information on the shell". Are those the correct resources to point to? Should we provide more information on what these resources are?

broken link in slides.md

Noticed a broken link to nano1.png and nano2.png under the "Writing files" section of slides.md

add loops episode

When I taught this lesson with @k8hertweck last month, we finished the shell lessons early and ended up improvising an episode on loops. Introducing loops on the first day was very useful for the learners and they used that skill several times on day 2 (I'm not sure whether it is part of the normal day 2 curriculum to do loops or if that was something Kate added).

I propose introducing an episode on loops between 03-working-with-files and 04-redirection.

A very useful for loop at this point in the lesson would be:
for FILENAME in *; do mv "$FILENAME" "$FILENAME-copy"; done
to allow learners to rename all of the backup files they created in the previous episode.

This also allows for more challenging exercises including:

  • a for loop including changing file permissions in batch to change backup files to read-only
  • a for loop including creating of the backup copies, in addition to renaming them
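
The second exercise idea could look something like the sketch below. It assumes empty placeholder FASTQ files and a hypothetical backup directory and "-backup" suffix; none of these names come from the lesson.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch SRR097977.fastq SRR098026.fastq   # empty placeholder files

mkdir backup
for filename in *.fastq; do
    # ${filename%.fastq} strips the extension so "-backup" goes before it,
    # keeping .fastq at the end of the new name.
    cp "$filename" "backup/${filename%.fastq}-backup.fastq"
done

backups=$(ls backup)
echo "$backups"

cd / && rm -rf "$workdir"
```

Looping over *.fastq rather than * limits the loop to the sequence files, and quoting "$filename" keeps it working for names that contain spaces.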

Typos and additional exercise

I am writing to make a contribution to the “Genomics Workshop” lessons as part of my Data Carpentry instructor training checkout procedure.

  1. I have proof-read the lesson materials and noticed the following typos in the code or in the text:

http://www.datacarpentry.org/shell-genomics/lessons/01_the_filesystem.html:

  • "Let's go into the sample data directory: cd dc_sample data” - the directory name should be “dc_sample_data"

  • “We can use a command line argument with 'ls' to get more information”

  • "Another useful one is '-a', which show everything"

https://github.com/datacarpentry/cloud-genomics/blob/gh-pages/lessons/1.logging-onto-cloud.md:

  • "Commercial clouds are can be very powerful”
  • "We will cover as much as we you need to get through the Data Carpentry lessons”

https://github.com/datacarpentry/cloud-genomics/blob/gh-pages/lessons/4.parallel-analysis.md:

  • "And you computer does each execution in order”
  • "But what if you do'nt have enough cores to match the number of programs run?”
  • "What commands can use you to see your process IDs?”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/00-readQC.md:

  • "it will help your remember what you did”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/01-automating_a_workflow.md:

  • "First, recall the code from our our fastqc workflow from this morning”

https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/02-variant-calling-workflow.md:

  • "so the name of each fastq file will by assigned to $fq"
  2. In the lesson https://github.com/datacarpentry/wrangling-genomics/blob/gh-pages/lessons/02-variant-calling-workflow.md
    the mkdir command is repeated 5 times to create directories:
    $ mkdir -p results/sai
    $ mkdir -p results/sam
    $ mkdir -p results/bam
    $ mkdir -p results/bcf
    $ mkdir -p results/vcf
    It could be done in one line instead:
    $ mkdir -p results/sai results/sam results/bam results/bcf results/vcf

Please let me know if you need further clarification regarding my contributions and whether these would be sufficient for the checkout procedure.

Thank you
Asli Uyar

hidden directory for exercise

The exercise in 02-the-filesystem has learners using a directory ".hidden". It's likely not necessary to have it be an actual hidden directory with the . in front of it, as it adds cognitive overhead.

This would involve an update of the instance, but could the directory instead just be a regular directory called "hidden"?

introduce file permissions

In 03-working-with-files, we introduce the idea of making a backup copy of our raw data. This would be a good place to introduce file permissions and change the permissions for these files to read-only so that we can't accidentally overwrite our backup copies.
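
A minimal sketch of what that step might look like, assuming a placeholder backup file (the name and contents are hypothetical):

```bash
workdir=$(mktemp -d)
cd "$workdir"
echo "placeholder data" > SRR098026-backup.fastq

# Remove write permission for everyone (owner, group, and others).
chmod a-w SRR098026-backup.fastq
perms=$(ls -l SRR098026-backup.fastq | cut -c1-10)
echo "$perms"

# For a regular (non-root) user, a redirect such as
#   echo oops > SRR098026-backup.fastq
# would now be refused with "Permission denied".

# Restore write permission for the owner before cleaning up.
chmod u+w SRR098026-backup.fastq
cd / && rm -rf "$workdir"
```

The exact permission string printed depends on the umask in effect when the file was created, but it should no longer contain a "w".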
