Git and GitHub

What is Git?

Git is a version control system that helps a software team manage changes to source code over time. Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.

What is GitHub?

GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere. Think of it as a Dropbox for code.

Why Should I Use These?

Git and GitHub allow you to keep all of your code in one place, whether locally, remotely, or both, and provide an easy way to sync your code. They also keep track of every change you commit to your code.

You can also easily share your code with others through GitHub (or another remote repository host).

Providing the source code for analysis scripts and software is required when publishing work in which these custom codebases/workflows were used.

When used with other tools (Docker, Jupyter) you can have a fully functioning environment/instance of your code up in minutes.

Which Should I Use?

Ideally you would use both. These tools complement each other; they can be used independently, but the experience will not be as seamless.


Follow Along!

Check out the slides below to follow along and get familiar with both tools:

For a more in-depth look into using these tools, check out this book from the people at Git.

Containerization with Docker

BADAS Slides

Installation

To view detailed installation instructions you may visit the Docker website. For this tutorial we have taken out much of the details and provided you with the commands you will most likely need.

MacOS

All you need to do is download the dmg here. Alternatively, you can visit the Docker Hub website, create an account, and download it there.

Windows

Download Docker for Windows Installer.exe and follow the installation instructions.

Linux

For this tutorial we will assume that you are running Ubuntu. If you are running a different flavor of Linux please refer to the installation page on Docker Hub.

Set up the repository

sudo apt-get update

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo apt-key fingerprint 0EBFCD88

Output from the apt-key command should be the same as the output below

pub   rsa4096 2017-02-22 [SCEA]
      9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
uid           [ unknown] Docker Release (CE deb) <docker@docker.com>
sub   rsa4096 2017-02-22 [S]

If all is good then set up the stable repository

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Install Docker CE

Now that you have linked the Docker repository you must update your list of repositories so that you can then install via apt-get.

sudo apt-get update

sudo apt-get install docker-ce docker-ce-cli containerd.io

Validate the Installation

In order to test that your installation was successful, we will run the hello-world image using docker on the command line. To do so, open your terminal and run the following command.

Note: Linux installations will most likely need to run all docker commands with sudo

docker run hello-world

You should see the following output describing what just happened:

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/engine/userguide/

Congratulations! You’ve now got docker installed and working. Let’s move on to some cool stuff!

Pre-Built Docker Images

When setting up a particular environment with Docker you have two options on how to proceed:

Configure it yourself (pretty tedious and not too fun)
Grab a prebuilt image and modify if needed (preferred).

Docker Hub is a great resource for finding pre-built docker images that will save you from doing most of the leg work. Not only are there hundreds of thousands of publicly available options to choose from, you can also share your own images!

As seen in the hello-world example output, docker run contacts Docker Hub, looks for a prebuilt image called “hello-world” and, if it exists, downloads that image to your computer and executes its startup command.

Now that we know the order of operations, let’s start by checking to see if there are any Jupyter images currently available. You can find a list of stacks available on both Docker Hub as well as Jupyter’s Docker page.

Let’s start out by grabbing the basic Jupyter Notebook image:

docker run --rm --name jupyter -p 8888:8888 jupyter/minimal-notebook

The -p option binds the host port to the container’s port which is necessary in this instance. --rm automatically removes the container when it exits. This is good practice to prevent future headaches.

You can check to see that it is running by using docker ps:

$ docker ps
CONTAINER ID        IMAGE                      COMMAND                  CREATED             STATUS              PORTS                    NAMES
04af1ce8822c        jupyter/minimal-notebook   "tini -g -- start-no…"   23 minutes ago      Up 23 minutes       0.0.0.0:8888->8888/tcp   jupyter

The CONTAINER ID is how you can refer to this particular container (a running instance of the image). With it you can stop the container using docker kill 04af1ce8822c, but leave it running for now.

Docker Image Workflow

It is important to understand what just happened in order to make the best of what Docker has to offer. To summarize, Docker Hub hosts a plethora of prebuilt, immutable images which we can pull onto our local machine and run. A container is an instance of an image, which can then be modified and executed. In the next section we will be dealing primarily with containers.

Accessing Your Container

Containers can be accessed in a couple of ways, depending on how they’re configured as well as what purpose they serve. Since this particular container is hosting a Jupyter Notebook we will most likely want to access via a GUI.

Access via Web Interface

The output from the docker run command provided you with a URL which you can use to access via a web browser. It is running locally and thus can be accessed by visiting 127.0.0.1:8888. Note that you will need to copy and paste the alphanumeric token that is output in your terminal after running the docker run command as this is the password for the Jupyter notebook. Play around with the notebook to make sure it works as expected.

Upon successful login you should see a familiar interface. If you are not familiar with Jupyter Notebooks then please see our tutorial.

Access via Command Line

There will be instances where your container has no graphical interface or you need to make changes while it is running. To do so Docker provides an exec subcommand to make this possible. Not only can you access your container but you can send commands to the container.

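For now we will access this via the command line. To do so you will need the container ID or the container name. Because we started the container with --name jupyter, we can use that name here.

* NOTE: container IDs are different for every container, so copying and pasting the ID shown above will not work!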

docker exec -it jupyter bash

-i tells docker that you would like an interactive session, so you will actually enter the container.
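-t allocates a pseudo-terminal (TTY) for the session, so that the shell inside the container behaves like a normal terminal. The container itself is identified by the name that follows the options (here jupyter).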
bash is the command to run upon entering the container, which is essentially a terminal environment.

Take some time to look around and test it out. Do all the commands work? Is there anything missing?

Try ping google.com. So I can just install ping right?

sudo apt-get install ping doesn’t work though? Needs a password?

This is a common issue when dealing with images; often they don’t have everything we need, so we need to install it ourselves, but the default user’s password is unknown. One way to overcome this is to enter the container as the root user so that you can modify the container as needed.

Type exit or ctrl+d to exit then run:

docker exec --user root -it jupyter bash

Defining the user as root instead of jovyan (the default user of this image) enters the container as the superuser, so you can proceed as normal.

apt-get update
apt-get install inetutils-ping
ping google.com

Modifying the container in this manner is ideal for testing purposes. However, what if you want to always have certain packages or configurations every time you run the container? What if you want to publish the image exactly as it is in the current instance? Thankfully there are configuration files for this!

Dockerfiles

Docker images are built from a Dockerfile. You didn’t see one when we pulled from Docker Hub because that image was already built, which is a requirement for pushing to Docker Hub. Dockerfiles, in this instance, are useful for defining a base image and then modifying it as you see fit.

The documentation is quite thorough and can be referenced here. In this tutorial we will simply configure the dockerfile to include ping upon starting the image.

Let’s start by creating a brand new directory on your local machine: mkdir jupyter. Enter this directory and create a Dockerfile: touch Dockerfile. NOTE: By default, docker build looks for a file named exactly Dockerfile (case sensitive); a file with any other name has to be passed explicitly with the -f option.

Now, consider the Dockerfile as a layered configuration that describes exactly what you want in your environment. Layering means that every subsequent instruction is applied on top of the result of the previous instructions (called stacking). There is a specific format to adhere to and you can reference the best practices for further details and use cases.

Consider the following Dockerfile:

# Describes the base image from which to stack further instructions
FROM ubuntu:15.04

# Copies files from your current local '.' directory to the image path '/app'
COPY . /app

# Run a command while the image is being built.
# Build instructions are executed at build time, before any container is started from the image.
# This example builds the app that will be the main service of this container
RUN make /app

# Specify the default command to run when a container is started from the built image
# (i.e. by docker run, not by the docker exec command used in the previous exercise).
CMD python /app/app.py

Every instruction follows the same format: INSTRUCTION arguments. So COPY is the instruction and . and /app are the arguments.
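To make the INSTRUCTION arguments format concrete, here is a minimal Python sketch that splits Dockerfile text into instruction/argument pairs. The helper name parse_dockerfile is hypothetical, and this is only an illustration of the line format, not how Docker itself parses the file: it ignores line continuations, JSON-array arguments, and parser directives.

```python
# Hypothetical teaching helper: split Dockerfile text into
# (INSTRUCTION, arguments) pairs. Not Docker's real parser.
def parse_dockerfile(text):
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments
        if not line or line.startswith('#'):
            continue
        # Everything before the first space is the instruction,
        # everything after it is the argument string
        instruction, _, arguments = line.partition(' ')
        pairs.append((instruction.upper(), arguments.strip()))
    return pairs


dockerfile = """\
# Describes the base image
FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py
"""

for instruction, arguments in parse_dockerfile(dockerfile):
    print(instruction, '->', arguments)
```

Each printed line shows one instruction and the arguments it was given, in the order Docker would apply them.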

Building an Image

Once your Dockerfile is ready to go, you’re ready to build your modified image! To do so, docker provides a build command that looks for a Dockerfile in the specified directory. It is also good practice to tag each build (the -t option) if you plan on keeping these images around and/or distributing/maintaining them (think of this as versioning software).

docker build -t test .

This may take some time to build as it has to download the image from the repository and then run the set of build instructions. Upon successful build you can then run the image using the docker run command.

Adding Data to an Image

Being the expert data analysts y’all are, it’s common that we’ll have lots of data we need to work with. Adding it to the image sounds easy, right? It’s the COPY instruction you learned. Well, it depends; it is named COPY because it does just that, so for smaller files this is fine (think scripts, small csv files, etc.). However, what happens if you’re analyzing sequence data and the files are > 20GB? Those take forever to copy even outside of Docker!

Behold volumes. Volumes allow a particular directory or directories to be accessed via the docker container without copying the entire dataset into the container itself. The only real drawback is that the content within the mounted directories is not packaged with the image. The data are persisted, though, meaning that if the container is killed the data will still be accessible upon a restart.

Let’s give it a shot!

Notice how there is only the work directory in the login page of your Jupyter notebook. Let’s add a test file and see if we can get it to show up in the notebook.

Kill the container, then create a file named test in your current directory. Using the same docker run command, add the volume option:

docker run -v "$PWD":/home/jovyan/work --rm --name jupyter -p 8888:8888 jupyter/minimal-notebook

-v is the option for adding a volume to the container, with the syntax being host_directory:container_directory. This command makes everything in the local current working directory available in the container’s work directory. Now visit the work directory in your notebook.
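If you launch containers from a script, it can help to assemble the same command programmatically. The sketch below is a hypothetical helper (build_run_command is not part of Docker, and /data/project is a made-up host path) that only builds and prints the argument list, so the host_directory:container_directory and host_port:container_port pairs are easy to see. It does not start anything.

```python
import shlex


# Hypothetical helper: assemble the docker run command used above as a list
# of arguments. Nothing is executed here.
def build_run_command(host_dir, container_dir, name='jupyter', port=8888,
                      image='jupyter/minimal-notebook'):
    return [
        'docker', 'run',
        '-v', f'{host_dir}:{container_dir}',   # host_directory:container_directory
        '--rm',                                # remove the container when it exits
        '--name', name,
        '-p', f'{port}:{port}',                # host_port:container_port
        image,
    ]


# '/data/project' is an assumed example path; substitute your own directory
command = build_run_command('/data/project', '/home/jovyan/work')
print(shlex.join(command))
```

To actually start the container from Python you could pass the list to subprocess.run(command, check=True), assuming Docker is installed and the user is allowed to run it.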

Test Your Skills!

Now that you’ve gone through the basics of how to get up and running with Docker, see if you can set up a custom container yourself:

Add the ping command to the Jupyter image you downloaded at the beginning of this tutorial
Build the image
See if ping works

Jupyter Notebook

Jupyter notebooks are a very useful tool for sharing data and analyses. They are essentially an interactive coding environment that is feature rich and has very useful and practical integrations.

Some examples of notebooks that have been used in place of materials and methods sections can be found here. These are good reference materials for thinking of how to structure your notebook based on your analyses.

The GitHub repository where our notebooks are housed can be found here. The easiest way to obtain these notebooks that are hosted on GitHub is to clone them. You can do so with the following command:
git clone https://github.com/gencorefacility/BADAS.git

The instructions for running Jupyter Notebooks on NYU’s resources will be listed here when available, so stay tuned!

Resources for this BADAS seminar are below:

Regular Expressions

A regular expression (aka regex) is a sequence of characters that describes or matches a pattern of text. For example, the string aabb12 could be described as aabb12; as two a’s, two b’s, then 1, then 2; or as four letters followed by two numbers.

Keep in mind that regular expressions differ slightly per language but the theory and application remain the same. We will start with Python.

Metacharacters

Metacharacters are characters that have an alternate meaning rather than a literal meaning. There are many of these and they can be found on the Python regular expression documentation.

For example, we’ll take the following characters to search a text file:
– ‘A’
– ‘.’
– ‘$’

import re

text = 'Alan is the coolest man. Ever!$$$'

# compiling a regular expression allows you to reuse the regular expression.
A = re.compile('A')
dot = re.compile('.')
dollar = re.compile('$')

# Any pattern that is matched will be presented as an item within a list.
A.findall(text)

['A']

It matches ‘A’! No surprise here. Now let’s see what the ‘.’ does:

dot.findall(text)

['A',
 'l',
 'a',
 'n',
 ' ',
 'i',
 's',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'c',
 'o',
 'o',
 'l',
 'e',
 's',
 't',
 ' ',
 'm',
 'a',
 'n',
 '.',
 ' ',
 'E',
 'v',
 'e',
 'r',
 '!',
 '$',
 '$',
 '$']

It matched everything?! That’s because the ‘.’ is a special character that we will get into later.

Now let’s see what the dollar sign finds us:

dollar.findall(text)

['']

Nope, not a bug: ‘$’ did match something but it is also a special character.

Position Metacharacters

This type of regular expression is used to match characters based on where they are located as opposed to what the character means.

Let’s take the $ for example. I told you that it matched something, but nothing was returned in the output. That is because $ is a special character that matches the end of a line. Let’s try an example:

text = 'Alan is the coolest man ever, man'

# A pattern made only of literal characters matches that exact text wherever it occurs.
pattern = re.compile('man')

pattern.findall(text)

['man', 'man']

This is no surprise, it found both instances of ‘man’. Now, let’s say we only want to capture ‘man’ at the end of a string:

pattern = re.compile('man$')

pattern.findall(text)

['man']

Trust me, this is grabbing the last man. Don’t believe me?

text = 'Alan is the coolest man ever'

pattern.findall(text)

[]

Told ya. Now, what happens if we actually do want to match a dollar sign? Well, we can use an escape character, \ (backslash), before the dollar sign to tell the regex interpreter to use the literal meaning. Let’s use the first example:

text = 'Alan is the coolest man. Ever!$$$'

pattern = re.compile('$')
pattern.findall(text)

['']

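Nothing useful, just the empty match at the end of the string. Let’s try to escape it:

# the r prefix (a raw string) passes the backslash through to the regex engine unchanged
pattern = re.compile(r'\$')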
pattern.findall(text)

['$', '$', '$']

Now, what do you think this regular expression does?

pattern = re.compile(r'\$$')
pattern.findall(text)

['$']
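When the text you want to match literally contains metacharacters, you do not have to escape each one by hand. As a small sketch, Python's re.escape adds the backslashes for you; the variable names here are only illustrative.

```python
import re

text = 'Alan is the coolest man. Ever!$$$'

# re.escape backslash-escapes the regex metacharacters in a string
escaped = re.escape('$')
print(escaped)                      # \$

pattern = re.compile(escaped)
print(pattern.findall(text))        # ['$', '$', '$']

# The same works for longer literals: the '.' is escaped, the letters are not
literal = re.escape('man.')
print(re.findall(literal, text))    # ['man.']
```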

Similarly, ‘^’ (caret, but not the kind that rabbits eat) matches the beginning of a line:

# IGNORECASE tells the interpreter to match a character regardless of its case
pattern = re.compile('a', re.IGNORECASE)

pattern.findall(text)

['A', 'a', 'a']

And now with the caret:

pattern = re.compile('^a', re.IGNORECASE)

pattern.findall(text)

['A']
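One Python-specific detail: by default ‘^’ and ‘$’ anchor to the start and end of the whole string (‘$’ also matches just before a trailing newline), and they only work line by line when the re.MULTILINE flag is given. A minimal sketch with a made-up three-line string:

```python
import re

# Illustrative text containing three lines
text = 'alpha man\nbeta man\ngamma'

# Without MULTILINE, '$' only matches at the end of the whole string
print(re.findall(r'man$', text))                 # []

# With MULTILINE, '$' matches at the end of every line
print(re.findall(r'man$', text, re.MULTILINE))   # ['man', 'man']

# The same applies to '^' and the start of each line
print(re.findall(r'^\w+', text))                 # ['alpha']
print(re.findall(r'^\w+', text, re.MULTILINE))   # ['alpha', 'beta', 'gamma']
```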

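There are also boundary metacharacters. ‘\b’ matches a word boundary, the position between a word character and a non-word character (or the start or end of the string). For example, ‘ing\b’ matches an ‘ing’ that ends a word. Inversely, ‘\B’ matches any position that is not a word boundary, so ‘ing\B’ would match the ‘ing’ in ‘things’ but not in ‘thing’. This is useful for specifying substrings or whole words exclusively.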
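The ‘thing’ versus ‘things’ case can be checked directly. This is a minimal sketch; the variable names are only illustrative.

```python
import re

for word in ('thing', 'things'):
    # 'ing' followed by a word boundary: only matches when 'ing' ends the word
    at_end = re.findall(r'ing\b', word)
    # 'ing' followed by a non-boundary: only matches when the word continues
    inside = re.findall(r'ing\B', word)
    print(word, at_end, inside)

# thing ['ing'] []
# things [] ['ing']
```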

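NOTE: Python itself interprets some backslash sequences in ordinary strings (‘\b’ becomes a backspace character) before the regex engine ever sees them, while others are passed through as literals. Prefixing the pattern with r (a raw string) avoids this. See here for more details: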
https://stackoverflow.com/questions/2241600/python-regex-r-prefix
https://stackoverflow.com/questions/21104476/what-does-the-r-in-pythons-re-compiler-pattern-flags-mean
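A short sketch of why the r prefix matters for ‘\b’ in particular (the sample sentence is made up for illustration):

```python
import re

# In an ordinary string, '\b' is a single backspace character;
# in a raw string it stays as backslash + 'b'
print(len('\b'), len(r'\b'))         # 1 2
print('\b' == '\x08')                # True

text = 'this is it'

# Without the r prefix the pattern looks for literal backspace characters
print(re.findall('\bis\b', text))    # []

# With the r prefix the regex engine sees the word-boundary metacharacter
print(re.findall(r'\bis\b', text))   # ['is']
```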

text = 'pistol'

# 'is' is surrounded by word characters, so there is no word boundary on either side
pattern = re.compile(r'\bis\b')
print('word boundary',pattern.findall(text))

# However, both of its edges are non-boundary positions, so it is found by non-word boundary searches
pattern = re.compile(r'\Bis\B')
print('non-word boundary',pattern.findall(text))

word boundary []
non-word boundary ['is']

text = 'is'

pattern = re.compile(r'\bis\b')
print('word boundary', pattern.findall(text))

pattern = re.compile(r'\Bis\B')
print('non-word boundary', pattern.findall(text))

word boundary ['is']
non-word boundary []

Single Metacharacters

These metacharacters match specific types of characters. For example, you can match any word character (letters, digits, and the underscore) with ‘\w’ or any whitespace character with ‘\s’.

text = '123 abc !@#'

# any digit
pattern = re.compile(r'\d')
pattern.findall(text)

['1', '2', '3']

# any word character (letter, digit, or underscore)
pattern = re.compile(r'\w')
pattern.findall(text)

['1', '2', '3', 'a', 'b', 'c']

# any non-word character
pattern = re.compile(r'\W')
pattern.findall(text)

[' ', ' ', '!', '@', '#']

# any non-newline character
pattern = re.compile('.')
pattern.findall(text)

['1', '2', '3', ' ', 'a', 'b', 'c', ' ', '!', '@', '#']
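The whitespace metacharacter ‘\s’ mentioned above works the same way, and the uppercase forms are the complements of the lowercase ones. A minimal sketch, using a made-up string with a tab and a newline added so there is more than one kind of whitespace:

```python
import re

# Illustrative text: a space, a tab, and a trailing newline
text = '123 abc\t!@#\n'

# any whitespace character
print(re.findall(r'\s', text))   # [' ', '\t', '\n']

# any non-whitespace character
print(re.findall(r'\S', text))   # ['1', '2', '3', 'a', 'b', 'c', '!', '@', '#']

# any non-digit character
print(re.findall(r'\D', text))   # [' ', 'a', 'b', 'c', '\t', '!', '@', '#', '\n']
```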

Quantifiers

All of the examples so far have searched for individual characters. Quantifiers allow you to match repeated patterns.

text = 'aa bb cdef 123'

# this looks for every instance of a word character
pattern = re.compile(r'\w')
pattern.findall(text)

['a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', '1', '2', '3']

# the '+' tells the interpreter to look for one or more consecutive occurrences of the preceding character
pattern = re.compile(r'\w+')
pattern.findall(text)

['aa', 'bb', 'cdef', '123']

# '?' is looking for characters that appear once or not at all. It basically makes the preceding character optional.
text = 'colour'

pattern = re.compile('colou?r')
print('colour',pattern.findall(text))

text = 'color'
print('color', pattern.findall(text))

colour ['colour']
color ['color']

# The asterisk doesn't require the preceding character to be there, but if it is, it will
# match it as many times as it can.
text = 'b'
pattern = re.compile('bo*')
print('b',pattern.findall(text))

text = 'boo'
print('boo',pattern.findall(text))

text = 'boooo!'
print('boooo!', pattern.findall(text))

b ['b']
boo ['boo']
boooo! ['boooo']

# curly brackets define a number of times in which the preceding pattern will be repeated
pattern = re.compile('bo{2}')
text = 'boo'
print('boo',pattern.findall(text))

text = 'boooo!'
print('boooo!', pattern.findall(text))

boo ['boo']
boooo! ['boo']
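Curly brackets also accept a range. As a brief sketch, {2,} means “two or more” and {2,3} means “between two and three” repetitions of the preceding pattern:

```python
import re

texts = ['bo', 'boo', 'booo', 'boooo!']

at_least_two = re.compile(r'bo{2,}')    # 'b' followed by two or more 'o'
two_to_three = re.compile(r'bo{2,3}')   # 'b' followed by two or three 'o'

for text in texts:
    print(text, at_least_two.findall(text), two_to_three.findall(text))

# bo [] []
# boo ['boo'] ['boo']
# booo ['booo'] ['booo']
# boooo! ['boooo'] ['booo']
```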

Character Classes

A character class allows you to match a particular set of user defined characters as opposed to the predefined metacharacters we went over previously. Think of searching for any vowel.

text = 'I like chocolate'

pattern = re.compile('[AEIOUY]', re.IGNORECASE)
print('vowels', pattern.findall(text))

vowels ['I', 'i', 'e', 'o', 'o', 'a', 'e']

Alternatively, when you use a caret ^ within the square brackets it will match the inverse of those defined in the character class. In this case, it will match every character that is not a vowel: the consonants and, as the output shows, the spaces.

pattern = re.compile('[^AEIOUY]', re.IGNORECASE)
print('consonants', pattern.findall(text))

consonants [' ', 'l', 'k', ' ', 'c', 'h', 'c', 'l', 't']

You can also specify ranges of characters if they are naturally consecutive by using a hyphen - to separate the beginning and the end:

pattern = re.compile('[a-d]', re.IGNORECASE)
print('pattern', pattern.findall(text))

pattern ['c', 'c', 'a']
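Character classes are handy for validating input. The sketch below checks whether a string consists only of the four DNA bases, and shows that several ranges can be combined in one class. The example sequences are made up for illustration.

```python
import re

# One or more characters, each of which must be A, C, G, or T
dna_only = re.compile(r'[ACGT]+')

# fullmatch succeeds only if the entire string fits the pattern
for sequence in ('ACTGTTTT', 'ACTGNNTT', ''):
    print(repr(sequence), bool(dna_only.fullmatch(sequence)))

# 'ACTGTTTT' True
# 'ACTGNNTT' False
# '' False

# Several ranges can sit side by side in the same class
print(re.findall(r'[A-Za-z0-9]', 'a-Z_9!'))   # ['a', 'Z', '9']
```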

Alternation

This is essentially just an ‘or’ statement. An example would be if you are looking for whether a sentence says ‘we have ten dollars.’ or ‘I have ten dollars.’ Since the only varying piece of that sentence is the pronoun, you can try to match either.

text = 'I have ten dollars'

pattern = re.compile('we|i|they', re.IGNORECASE)
print('I', pattern.findall(text))

text = 'They have ten dollars'
print('They', pattern.findall(text))

text = 'We have ten dollars'
print('We', pattern.findall(text))

I ['I']
They ['They']
We ['We']
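One caveat with a bare alternation like this: a short option such as ‘i’ will also match inside other words. A minimal sketch of the problem and one possible fix, wrapping the options in a non-capturing group (?:...) and adding word boundaries; the sample sentence is made up.

```python
import re

text = 'This time we have ten dollars'

# The bare alternation also finds the 'i' inside 'This' and 'time'
loose = re.compile(r'we|i|they', re.IGNORECASE)
print(loose.findall(text))    # ['i', 'i', 'we']

# Grouping the options and requiring word boundaries keeps only whole words
strict = re.compile(r'\b(?:we|i|they)\b', re.IGNORECASE)
print(strict.findall(text))   # ['we']
```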

Backreferences

Back references/captures allow you to reuse regular expressions and/or patterns that match that regular expression.

Let’s give a biological example:
You’re looking for a motif that is flanked by the restriction site ACTG. The motif can be of any length and any composition of nucleotides but is always flanked by those cut sites:

text = 'ACTGTTTTTTTTTACTG'

# the '\1' is the reference to the first captured pattern.
# All captured patterns are enumerated but can also be named.
print('matches:',re.search(r'(ACTG)([GTAC]+)\1', text).groups())

text = 'ATCGCAGCTACGACTGAAAAAAAAAAAAAAACTG'
print('matches:',re.search(r'(ACTG)([ACTG]+)\1', text).groups())

matches: ('ACTG', 'TTTTTTTTT')
matches: ('ACTG', 'AAAAAAAAAAAAAA')
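Captured patterns can be named rather than numbered, which tends to read better once there are more than a couple of groups. A sketch of the first search above using Python's named-group syntax; the group names site and motif are chosen here only for illustration.

```python
import re

text = 'ACTGTTTTTTTTTACTG'

# (?P<name>...) names a group; (?P=name) refers back to what it captured
pattern = re.compile(r'(?P<site>ACTG)(?P<motif>[ACGT]+)(?P=site)')
match = pattern.search(text)

print(match.group('site'))    # ACTG
print(match.group('motif'))   # TTTTTTTTT
print(match.groupdict())      # {'site': 'ACTG', 'motif': 'TTTTTTTTT'}
```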

Substitution Mode

There are several ways that one can use regular expressions, though the two main modes are searching and substituting.

Searching/matching will only look for whether or not a pattern is matched. Substituting will replace any pattern that is matched with another pattern. Above we have used only searching, so below will only showcase an example of a substitution.

text = 'ACTGTTTTTTTTTACTG'
re.sub(r'ACTG', 'AAAA', text)

'AAAATTTTTTTTTAAAA'
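re.sub has a few more options worth knowing. As a brief sketch: the count argument limits the number of replacements, the replacement string can refer to captured groups, and a function can compute the replacement from each match.

```python
import re

text = 'ACTGTTTTTTTTTACTG'

# count limits how many matches are replaced (here only the first site)
first_only = re.sub(r'ACTG', 'AAAA', text, count=1)
print(first_only)   # AAAATTTTTTTTTACTG

# \1 and \2 in the replacement refer to the captured groups
marked = re.sub(r'(ACTG)([ACGT]+)\1', r'\1-\2-\1', text)
print(marked)       # ACTG-TTTTTTTTT-ACTG

# A function receives each match object and returns the replacement text
lowered = re.sub(r'ACTG', lambda m: m.group(0).lower(), text)
print(lowered)      # actgTTTTTTTTTactg
```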

Command Line Examples

Constructing a Regular Expression

Arguably the most difficult task with regular expressions is being able to construct one. Often I am trying to extract information that matches multiple different patterns, or the same pattern in different lines of text. In order to create a regular expression, you first need to identify the pattern. This generally requires some brief understanding of the file format and the relevant information as well as finding a pattern that applies to the information that you’re interested in.

To give you some insight on how to do this, the examples below are scenarios that I often find myself in that are a perfect fit for regular expressions.

Parsing a GFF file

There are a few tools, such as sed, awk, and grep, that are used for text munging but I happen to use Perl, as it provides a lot of flexibility when doing complex regular expressions and other data munging tasks.

grep can only match patterns (or the inverse of patterns) and cannot be used to replace or transform text. Though limited, it is a nice tool to have for quick searches in files or across filesystems. In this example, all I need is to match a pattern, so grep will suffice.

A GFF file is a standard tab-delimited file format for genomic annotations. Most GFF files contain every type of annotated feature for that organism (mRNA, exons, UTRs, motifs, etc.), which are not always relevant to the analysis and may sometimes actually interfere with it.

Let’s say I want to extract all annotated mRNAs from this file. First thing is to understand the GFF file format. Reading the documentation will tell you that it is tab delimited and the third column contains the feature type. So looking at the file will show you:

! head -n 20 caenorhabditis_elegans.PRJNA13758.WBPS9.annotations.gff3

##gff-version 3
##sequence-region I 1 15072434
##sequence-region II 1 15279421
##sequence-region III 1 13783801
##sequence-region IV 1 17493829
##sequence-region V 1 20924180
##sequence-region X 1 17718942
##sequence-region MtDNA 1 13794
I   BLAT_EST_OTHER  expressed_sequence_match    1   50  12.8    -   .   ID=yk585b5.5.6;Target=yk585b5.5 119 168 +
I   BLAT_Trinity_OTHER  expressed_sequence_match    1   52  20.4    +   .   ID=elegans_PE_SS_GG6116|c0_g1_i1.2;Target=elegans_PE_SS_GG6116|c0_g1_i1 174 225 +
I   inverted    inverted_repeat 1   212 66  .   .   Note=loop 426
I   Genbank assembly_component  1   2679    .   +   .   genbank=FO080985
I   Genomic_canonical   assembly_component  1   2679    .   +   .   Name=cTel33B;Note=Clone:cTel33B,GenBank:FO080985
I   Variation_project_Polymorphism  tandem_duplication  1   11000   .   +   .   variation=WBVar02123961;public_name=WBVar02123961;other_name=cewivar00854884;strain=JU533;polymorphism=1;consequence=Coding_exon
I   interpolated_pmap_position  gene    1   559784  .   .   .   ID=gmap:spe-13;gmap=spe-13;status=uncloned;Note=-21.3602 cM (+/- 1.84 cM)
I   Balanced_by_balancer    biological_region   1   5263413 .   .   .   balancer=Rearrangement:hT3;balancer_type=Translocation;Note=Summary: Translocation (rigorous proof of reciprocity lacking)%252C moderately well characterized%252C very stable. Very effective balancer for left portion of LG I from left end to around let-363%252C and the right portion of LG X from the right end to between dpy-7 and unc-3. hT3(I)%252C which disjoins from normal LG I%252C is probably LG X (right) translocated to LG I (right). hT3(X)%252Cwhich disjoins from normal LG X%252C is probably LG I (left) translocated to LG X (left).,Growth characteristics: Original isolate marked with dpy-5 and unc-29. Homozygous inviable%252C probably breaks in let-363 (I). Heterozygotes exhibit reduced viability%252C low level of X chromosome nondisjunction (1.2%25).,Handling: Easy to manipulate. Rare exceptional progeny carry one half-translocation as a free duplication. Recombination frequency in unbalanced intervals increased on both LG I and LG X.,Recommended use: General balancing%252C strain maintenance
I   Balanced_by_balancer    biological_region   1   7383197 .   .   .   balancer=Rearrangement:sDp2;balancer_type=Duplication;Note=Summary: Free duplication%252C well characterized%252C does not recombine with normal homologues. Very effective balancer for the left portion of LG I from the left end through unc-15 (just left of unc-13).,Handling: sDp2-bearing males mate and give some progeny%252C but are slow growing and do not compete well with non-Dp males in mating.,Recommended use: General balancing%252C strain maintenance%252C mutant screens.
I   Balanced_by_balancer    biological_region   1   7454088 .   .   .   balancer=Rearrangement:szDp1;balancer_type=Duplication;Note=Summary: Complex free duplication%252C well characterized%252C does not recombine with normal LG I. Very effective balancer for the left portion of LG I from the left end through unc-13. Consists of one half-translocation [szT1(X)] from szT1 maintained in addition to a normal chromosome complement.,Growth characteristics: Animals carrying two copies of szDp1 apparently inviable. Duplication strains give rise to spontaneous males through meiotic nondisjunction of the X chromosome.,Handling: szDp1-bearing males either do not mate or are infertile.,Recommended use:
I   Balanced_by_balancer    biological_region   1   7454088 .   .   .   balancer=Rearrangement:szT1;balancer_type=Translocation;Note=Summary: Reciprocal translocation%252C well characterized%252C very stable. Effective balancer for left portion of LG I through unc-13%252C nearly all of LG X from right end to around dpy-3. szT1(I) is large segment of LG X (right) translocated to LG I%252C disjoins from normal LG I. szT1(X) is LG I (left) translocated to fragment of LG X (left)%252C disjoins from normal LG X.,Handling: Easy to manipulate. Lon-2 szT1 males mate well. Rare exceptional progeny carry one half-translocation as a complex free duplication. Gives rise spontaneously to rare apparent lethal mutations that may represent fusion of szT1(X) and the normal X. Shows threefold enhanced recombination frequency immediately adjacent to right of LG I breakpoint and about twofold enhanced frequency in the unc-101 - unc-54 interval.,Recommended use: General balancing%252C strain construction%252C strain maintenance.
I   Balanced_by_balancer    biological_region   1   8244513 .   .   .   balancer=Rearrangement:hT1;balancer_type=Translocation;Note=Summary: Reciprocal translocation%252C well characterized%252C very stable. Very effective balancer for left portion of LG I from the left end through let-80%252C and the left portion of LG V from the left end through dpy-11. hT1(I) is LG V (left) translocated to LG I (right)%252C disjoins from normal LG I. hT1(V) is LG I (left) translocated to LG V (right)%252C disjoins from normal LG V.,Growth characteristics: Homozygous inviable%252C cause unknown. Arrests at L3. Brood size in heterozygotes ~75. Easy to manipulate. Heterozygous males mate well. Rare exceptional progeny carry one half-translocation as a complex free duplication. Recombination frequency in the unbalanced unc-101 - unc-54 interval on LG I is increased twofold.,Handling: Easy to manipulate. Heterozygous males mate well. Rare exceptional progeny carry one half-translocation as a complex free duplication. Recombination frequency in the unbalanced unc-101 - unc-54 interval on LG I is increased twofold.,Recommended use: General balancing%252C strain maintenance. mutant screens.

Looking at the entire file is daunting and doesn’t provide a list of features found within this file. So, we can use a few commands to get a full representation of each feature type described in the file.

! cut -f3 caenorhabditis_elegans.PRJNA13758.WBPS9.annotations.gff3 | sort | uniq

antisense_RNA
assembly_component
base_call_error_correction
binding_site
biological_region
CDS
complex_substitution
conserved_region
deletion
DNAseI_hypersensitive_site
duplication
enhancer
exon
experimental_result_region
expressed_sequence_match
five_prime_UTR
gene
##gff-version 3
G_quartet
histone_binding_site
insertion_site
intron
inverted_repeat
lincRNA
low_complexity_region
miRNA
miRNA_primary_transcript
mRNA
mRNA_region
nc_primary_transcript
ncRNA
nucleotide_match
operon
PCR_product
piRNA
point_mutation
polyA_signal_sequence
polyA_site
polypeptide_motif
possible_base_call_error
pre_miRNA
promoter
protein_coding_primary_transcript
protein_match
pseudogenic_rRNA
pseudogenic_transcript
pseudogenic_tRNA
reagent
regulatory_region
repeat_region
RNAi_reagent
rRNA
SAGE_tag
scRNA
##sequence-region I 1 15072434
##sequence-region II 1 15279421
##sequence-region III 1 13783801
##sequence-region IV 1 17493829
##sequence-region MtDNA 1 13794
##sequence-region V 1 20924180
##sequence-region X 1 17718942
SL1_acceptor_site
SL2_acceptor_site
snoRNA
SNP
snRNA
substitution
tandem_duplication
tandem_repeat
TF_binding_site
three_prime_UTR
transcribed_fragment
transcription_end_site
transcript_region
translated_nucleotide_match
transposable_element
transposable_element_insertion_site
tRNA
TSS_region
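
The `##gff-version` and `##sequence-region` entries in the listing above are header directives rather than feature types: they contain no tabs, so `cut` prints those lines whole. One possible way to skip them, and to count how often each feature type occurs, is sketched below. The file here is a tiny stand-in built for illustration (its rows and IDs are made up), not the real annotation file.

```shell
# Build a tiny stand-in GFF3 file (toy data, not the real annotation file)
toy_gff="$(mktemp)"
printf '##gff-version 3\n' > "$toy_gff"
printf 'I\tWormBase\tgene\t1\t100\t.\t+\t.\tID=g1\n' >> "$toy_gff"
printf 'I\tWormBase\tmRNA\t1\t100\t.\t+\t.\tID=t1\n' >> "$toy_gff"
printf 'I\tWormBase\tmRNA\t5\t90\t.\t+\t.\tID=t2\n' >> "$toy_gff"

# Drop the '#' header lines first, then count each feature type in column 3
feature_counts="$(grep -v '^#' "$toy_gff" | cut -f3 | sort | uniq -c)"
echo "$feature_counts"
```

On the real file, the same pipeline would be pointed at caenorhabditis_elegans.PRJNA13758.WBPS9.annotations.gff3 instead of the toy file.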

We see that there are a lot of RNAs here, and there happen to be two different mRNA feature types (mRNA and mRNA_region). So, now that we know the feature type and the format, we can construct our regular expression:

# The -P flag tells grep to use Perl-compatible regular expressions, which make things
# a bit easier (for example, \t is understood as a tab); otherwise you'll have to escape some characters!

# The pattern we're looking for is mRNA with a tab on either side, i.e. the whole third column
! grep -P '\tmRNA\t' caenorhabditis_elegans.PRJNA13758.WBPS9.annotations.gff3 | head -n 20

I   WormBase    mRNA    4116    10230   .   -   .   ID=Transcript:Y74C9A.3;Parent=Gene:WBGene00022277;Name=Y74C9A.3;wormpep=WP:CE28146;locus=homt-1
I   WormBase    mRNA    11495   16793   .   +   .   ID=Transcript:Y74C9A.2a.1;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.1;wormpep=WP:CE24660;locus=nlp-40
I   WormBase    mRNA    11495   16837   .   +   .   ID=Transcript:Y74C9A.2a.2;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.2;wormpep=WP:CE24660;locus=nlp-40
I   WormBase    mRNA    11499   16837   .   +   .   ID=Transcript:Y74C9A.2a.3;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.3;wormpep=WP:CE24660;locus=nlp-40
I   WormBase    mRNA    11505   16837   .   +   .   ID=Transcript:Y74C9A.2a.4;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.4;wormpep=WP:CE24660;locus=nlp-40
I   WormBase    mRNA    11618   16837   .   +   .   ID=Transcript:Y74C9A.2a.5;Parent=Gene:WBGene00022276;Name=Y74C9A.2a.5;wormpep=WP:CE24660;locus=nlp-40
I   WormBase    mRNA    11623   16837   .   +   .   ID=Transcript:Y74C9A.2b;Parent=Gene:WBGene00022276;Name=Y74C9A.2b;wormpep=WP:CE49228;locus=nlp-40
I   WormBase    mRNA    17487   26781   .   -   .   ID=Transcript:Y74C9A.4b;Parent=Gene:WBGene00022278;Name=Y74C9A.4b;wormpep=WP:CE28147;locus=rcor-1
I   WormBase    mRNA    17497   24796   .   -   .   ID=Transcript:Y74C9A.4d;Parent=Gene:WBGene00022278;Name=Y74C9A.4d;wormpep=WP:CE49439;locus=rcor-1
I   WormBase    mRNA    17497   26643   .   -   .   ID=Transcript:Y74C9A.4c;Parent=Gene:WBGene00022278;Name=Y74C9A.4c;wormpep=WP:CE49153;locus=rcor-1
I   WormBase    mRNA    17497   26781   .   -   .   ID=Transcript:Y74C9A.4a;Parent=Gene:WBGene00022278;Name=Y74C9A.4a;wormpep=WP:CE24662;locus=rcor-1
I   WormBase    mRNA    27591   32544   .   -   .   ID=Transcript:Y74C9A.5;Parent=Gene:WBGene00022279;Name=Y74C9A.5;wormpep=WP:CE40291;locus=sesn-1
I   WormBase    mRNA    43733   44677   .   +   .   ID=Transcript:Y74C9A.1;Parent=Gene:WBGene00022275;Name=Y74C9A.1;wormpep=WP:CE34428
I   WormBase    mRNA    47467   49857   .   +   .   ID=Transcript:Y48G1C.12;Parent=Gene:WBGene00044345;Name=Y48G1C.12;wormpep=WP:CE38647
I   WormBase    mRNA    49919   54426   .   +   .   ID=Transcript:Y48G1C.4a;Parent=Gene:WBGene00021677;Name=Y48G1C.4a;wormpep=WP:CE30021;locus=pgs-1
I   WormBase    mRNA    52292   54360   .   +   .   ID=Transcript:Y48G1C.4b;Parent=Gene:WBGene00021677;Name=Y48G1C.4b;wormpep=WP:CE49150;locus=pgs-1
I   WormBase    mRNA    55293   64066   .   -   .   ID=Transcript:Y48G1C.5;Parent=Gene:WBGene00021678;Name=Y48G1C.5;wormpep=WP:CE39437
I   WormBase    mRNA    71425   81071   .   +   .   ID=Transcript:Y48G1C.2b.2;Parent=Gene:WBGene00000812;Name=Y48G1C.2b.2;wormpep=WP:CE49183;locus=csk-1
I   WormBase    mRNA    71425   81063   .   +   .   ID=Transcript:Y48G1C.2b.1;Parent=Gene:WBGene00000812;Name=Y48G1C.2b.1;wormpep=WP:CE49183;locus=csk-1
I   WormBase    mRNA    71845   80633   .   +   .   ID=Transcript:Y48G1C.2a.1;Parent=Gene:WBGene00000812;Name=Y48G1C.2a.1;wormpep=WP:CE34405;locus=csk-1
grep: write error
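
Not every grep supports `-P` (the BSD grep shipped with macOS, for example, does not). As an alternative sketch, `awk` can compare the third column exactly, which also keeps `mRNA_region` rows out of the result. The file below is a toy stand-in: the first row borrows coordinates from the output above, and the other two rows are invented for illustration.

```shell
# Toy stand-in file with three feature types in column 3
toy_gff="$(mktemp)"
printf 'I\tWormBase\tmRNA\t4116\t10230\t.\t-\t.\tID=Transcript:Y74C9A.3\n' > "$toy_gff"
printf 'I\tWormBase\tmRNA_region\t4116\t4200\t.\t-\t.\tID=toy_region\n' >> "$toy_gff"
printf 'I\tWormBase\tgene\t4116\t10230\t.\t-\t.\tID=toy_gene\n' >> "$toy_gff"

# -F'\t' splits each line on tabs; $3 == "mRNA" is an exact string comparison
mrna_rows="$(awk -F'\t' '$3 == "mRNA"' "$toy_gff")"
echo "$mrna_rows"
```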

Renaming FastQ Files

The filenames for sequences are generally too long and carry too much information, which can be annoying when working with those files. To extract only the relevant information and shorten the filename, I often construct a regular expression, coupled with some bash commands, to rename the files as I see fit.

ls /home/at120/badas/2015-03-11_AE52Y-redo

000000000-AE52Y l01n01 flu mrt-pcr updated a1.341000000083f5.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated a2.3410000000847d.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated a3.341000000084f4.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated a4.3410000000857c.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated a5.341000000085f3.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated a6.3410000000867b.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated a7.341000000086f2.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated a8.3410000000877a.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated b1.34100000008403.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated b2.3410000000848a.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated b3.34100000008502.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated b4.34100000008589.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated b5.34100000008601.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated b6.34100000008688.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated b7.34100000008700.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated c1.34100000008410.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated c2.34100000008497.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated c3.3410000000851f.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated c4.34100000008596.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated c5.3410000000861e.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated c6.34100000008695.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated c7.3410000000871d.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated d1.3410000000842d.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated d2.341000000084a4.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated d3.3410000000852c.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated d4.341000000085a3.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated d5.3410000000862b.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated d6.341000000086a2.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated d7.3410000000872a.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated e1.3410000000843a.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated e2.341000000084b0.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated e3.34100000008539.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated e4.341000000085bf.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated e5.34100000008638.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated e6.341000000086be.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated e7.34100000008737.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated f1.34100000008447.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated f2.341000000084cd.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated f3.34100000008546.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated f4.341000000085cc.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated f5.34100000008645.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated f6.341000000086cb.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated f7.34100000008744.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated g1.34100000008454.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated g2.341000000084da.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated g3.34100000008553.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated g4.341000000085d9.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated g5.34100000008652.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated g6.341000000086d8.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated g7.34100000008751.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated h1.34100000008460.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated h2.341000000084e7.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated h3.3410000000856f.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated h4.341000000085e6.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated h5.3410000000866e.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated h6.341000000086e5.fastq.gz*
000000000-AE52Y l01n01 flu mrt-pcr updated h7.3410000000876d.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a1.342000000083f2.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a2.3420000000847a.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a3.342000000084f1.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a4.34200000008579.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a5.342000000085f0.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a6.34200000008678.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a7.342000000086ff.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated a8.34200000008777.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated b1.34200000008400.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated b2.34200000008487.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated b3.3420000000850f.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated b4.34200000008586.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated b5.3420000000860e.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated b6.34200000008685.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated b7.3420000000870d.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated c1.3420000000841d.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated c2.34200000008494.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated c3.3420000000851c.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated c4.34200000008593.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated c5.3420000000861b.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated c6.34200000008692.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated c7.3420000000871a.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated d1.3420000000842a.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated d2.342000000084a1.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated d3.34200000008529.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated d4.342000000085a0.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated d5.34200000008628.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated d6.342000000086af.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated d7.34200000008727.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated e1.34200000008437.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated e2.342000000084bd.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated e3.34200000008536.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated e4.342000000085bc.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated e5.34200000008635.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated e6.342000000086bb.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated e7.34200000008734.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated f1.34200000008444.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated f2.342000000084ca.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated f4.342000000085c9.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated f5.34200000008642.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated f6.342000000086c8.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated f7.34200000008741.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated g1.34200000008451.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated g2.342000000084d7.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated g3.34200000008550.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated g4.342000000085d6.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated g5.3420000000865f.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated g6.342000000086d5.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated g7.3420000000875e.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated h1.3420000000846d.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated h2.342000000084e4.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated h3.3420000000856c.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated h4.342000000085e3.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated h5.3420000000866b.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated h6.342000000086e2.fastq.gz*
000000000-AE52Y l01n02 flu mrt-pcr updated h7.3420000000876a.fastq.gz*

There is a lot of information here, some of which is intelligible only to Tubo. The only information that I would like to retain is the sample name, consisting of one lowercase letter followed by one digit, and the pairing information, which is n0 followed by a 1 or a 2.

What I usually do to get started is to work with one filename and construct the pattern from that. Regular expressions are read from left to right, so we should start by grabbing the first bit of information we want, which is the pair information:

%%bash
echo "000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz" | perl -pe 's/^.+n0([12]).+/$1/g'

The syntax for a perl substitution regular expression is as follows:

perl -pe '<mode>/<pattern>/<replacement text>/<modifier>'

mode can be:
– substitution s
– match m
– translate tr

pattern is your regular expression
replacement text is the text to replace text that matches your regular expression

modifier can be:
– ignore case i
– global g: will find all patterns, not just the first

and some others. Take a look at the quick reference for perl regular expressions for more information.

Next would be to grab the sample name:

%%bash

# It's good to be very specific with regular expressions for multiple reasons.
# 1) It limits the scope of the regular expression
# 2) It makes it easier to read later.
# So I do not NEED to specify 'flu' but it makes for fewer instructions and is more human readable.

echo "000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz" \
| perl -pe 's/^.+n0([12])\sflu.+\s([a-z]\d)\..+/$1$2/g'

2f3

Lastly, capture the file extension and rearrange the captured patterns:

%%bash
echo "000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz" \
| perl -pe 's/^.+n0([12])\sflu.+\s([a-z]\d)\..+(fastq.gz)$/$2.r$1.$3/g'

f3.r2.fastq.gz

Now to wrap it up in a for loop and invoke a copy command:

%%bash
cd /home/at120/badas/2015-03-11_AE52Y-redo
for i in *gz; do cp "$i" "$(echo "$i" | perl -pe 's/^.+n0([12])\sflu.+\s([a-z]\d)\..+(fastq.gz)$/$2.r$1.$3/g')" ; done

ls

000000000-AE52Y l01n01 flu mrt-pcr updated a1.341000000083f5.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a2.3410000000847d.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a3.341000000084f4.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a4.3410000000857c.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a5.341000000085f3.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a6.3410000000867b.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a7.341000000086f2.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated a8.3410000000877a.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b1.34100000008403.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b2.3410000000848a.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b3.34100000008502.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b4.34100000008589.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b5.34100000008601.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b6.34100000008688.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated b7.34100000008700.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated c1.34100000008410.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated c2.34100000008497.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated c3.3410000000851f.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated c4.34100000008596.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated c5.3410000000861e.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated c6.34100000008695.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated c7.3410000000871d.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated d1.3410000000842d.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated d2.341000000084a4.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated d3.3410000000852c.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated d4.341000000085a3.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated d5.3410000000862b.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated d6.341000000086a2.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated d7.3410000000872a.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated e1.3410000000843a.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated e2.341000000084b0.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated e3.34100000008539.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated e4.341000000085bf.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated e5.34100000008638.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated e6.341000000086be.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated e7.34100000008737.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated f1.34100000008447.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated f2.341000000084cd.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated f3.34100000008546.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated f4.341000000085cc.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated f5.34100000008645.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated f6.341000000086cb.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated f7.34100000008744.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated g1.34100000008454.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated g2.341000000084da.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated g3.34100000008553.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated g4.341000000085d9.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated g5.34100000008652.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated g6.341000000086d8.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated g7.34100000008751.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated h1.34100000008460.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated h2.341000000084e7.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated h3.3410000000856f.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated h4.341000000085e6.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated h5.3410000000866e.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated h6.341000000086e5.fastq.gz
000000000-AE52Y l01n01 flu mrt-pcr updated h7.3410000000876d.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a1.342000000083f2.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a2.3420000000847a.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a3.342000000084f1.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a4.34200000008579.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a5.342000000085f0.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a6.34200000008678.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a7.342000000086ff.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated a8.34200000008777.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated b1.34200000008400.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated b2.34200000008487.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated b3.3420000000850f.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated b4.34200000008586.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated b5.3420000000860e.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated b6.34200000008685.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated b7.3420000000870d.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated c1.3420000000841d.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated c2.34200000008494.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated c3.3420000000851c.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated c4.34200000008593.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated c5.3420000000861b.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated c6.34200000008692.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated c7.3420000000871a.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated d1.3420000000842a.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated d2.342000000084a1.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated d3.34200000008529.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated d4.342000000085a0.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated d5.34200000008628.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated d6.342000000086af.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated d7.34200000008727.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated e1.34200000008437.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated e2.342000000084bd.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated e3.34200000008536.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated e4.342000000085bc.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated e5.34200000008635.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated e6.342000000086bb.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated e7.34200000008734.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated f1.34200000008444.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated f2.342000000084ca.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated f3.34200000008543.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated f4.342000000085c9.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated f5.34200000008642.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated f6.342000000086c8.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated f7.34200000008741.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated g1.34200000008451.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated g2.342000000084d7.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated g3.34200000008550.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated g4.342000000085d6.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated g5.3420000000865f.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated g6.342000000086d5.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated g7.3420000000875e.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated h1.3420000000846d.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated h2.342000000084e4.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated h3.3420000000856c.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated h4.342000000085e3.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated h5.3420000000866b.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated h6.342000000086e2.fastq.gz
000000000-AE52Y l01n02 flu mrt-pcr updated h7.3420000000876a.fastq.gz
a1.r1.fastq.gz
a1.r2.fastq.gz
a2.r1.fastq.gz
a2.r2.fastq.gz
a3.r1.fastq.gz
a3.r2.fastq.gz
a4.r1.fastq.gz
a4.r2.fastq.gz
a5.r1.fastq.gz
a5.r2.fastq.gz
a6.r1.fastq.gz
a6.r2.fastq.gz
a7.r1.fastq.gz
a7.r2.fastq.gz
a8.r1.fastq.gz
a8.r2.fastq.gz
b1.r1.fastq.gz
b1.r2.fastq.gz
b2.r1.fastq.gz
b2.r2.fastq.gz
b3.r1.fastq.gz
b3.r2.fastq.gz
b4.r1.fastq.gz
b4.r2.fastq.gz
b5.r1.fastq.gz
b5.r2.fastq.gz
b6.r1.fastq.gz
b6.r2.fastq.gz
b7.r1.fastq.gz
b7.r2.fastq.gz
c1.r1.fastq.gz
c1.r2.fastq.gz
c2.r1.fastq.gz
c2.r2.fastq.gz
c3.r1.fastq.gz
c3.r2.fastq.gz
c4.r1.fastq.gz
c4.r2.fastq.gz
c5.r1.fastq.gz
c5.r2.fastq.gz
c6.r1.fastq.gz
c6.r2.fastq.gz
c7.r1.fastq.gz
c7.r2.fastq.gz
d1.r1.fastq.gz
d1.r2.fastq.gz
d2.r1.fastq.gz
d2.r2.fastq.gz
d3.r1.fastq.gz
d3.r2.fastq.gz
d4.r1.fastq.gz
d4.r2.fastq.gz
d5.r1.fastq.gz
d5.r2.fastq.gz
d6.r1.fastq.gz
d6.r2.fastq.gz
d7.r1.fastq.gz
d7.r2.fastq.gz
e1.r1.fastq.gz
e1.r2.fastq.gz
e2.r1.fastq.gz
e2.r2.fastq.gz
e3.r1.fastq.gz
e3.r2.fastq.gz
e4.r1.fastq.gz
e4.r2.fastq.gz
e5.r1.fastq.gz
e5.r2.fastq.gz
e6.r1.fastq.gz
e6.r2.fastq.gz
e7.r1.fastq.gz
e7.r2.fastq.gz
f1.r1.fastq.gz
f1.r2.fastq.gz
f2.r1.fastq.gz
f2.r2.fastq.gz
f3.r1.fastq.gz
f3.r2.fastq.gz
f4.r1.fastq.gz
f4.r2.fastq.gz
f5.r1.fastq.gz
f5.r2.fastq.gz
f6.r1.fastq.gz
f6.r2.fastq.gz
f7.r1.fastq.gz
f7.r2.fastq.gz
g1.r1.fastq.gz
g1.r2.fastq.gz
g2.r1.fastq.gz
g2.r2.fastq.gz
g3.r1.fastq.gz
g3.r2.fastq.gz
g4.r1.fastq.gz
g4.r2.fastq.gz
g5.r1.fastq.gz
g5.r2.fastq.gz
g6.r1.fastq.gz
g6.r2.fastq.gz
g7.r1.fastq.gz
g7.r2.fastq.gz
h1.r1.fastq.gz
h1.r2.fastq.gz
h2.r1.fastq.gz
h2.r2.fastq.gz
h3.r1.fastq.gz
h3.r2.fastq.gz
h4.r1.fastq.gz
h4.r2.fastq.gz
h5.r1.fastq.gz
h5.r2.fastq.gz
h6.r1.fastq.gz
h6.r2.fastq.gz
h7.r1.fastq.gz
h7.r2.fastq.gz
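
Before running a loop like the one above on real data, it can help to do a dry run that only prints the planned old and new names. The sketch below assumes a scratch directory holding two empty files named like the ones in the listing; `workdir` and `rename_plan` are hypothetical names used only for this illustration.

```shell
# Scratch directory with two empty files named like the real ones
workdir="$(mktemp -d)"
touch "$workdir/000000000-AE52Y l01n01 flu mrt-pcr updated a1.341000000083f5.fastq.gz"
touch "$workdir/000000000-AE52Y l01n02 flu mrt-pcr updated a1.342000000083f2.fastq.gz"

rename_plan=""
for i in "$workdir"/*gz; do
    old="$(basename "$i")"
    new="$(echo "$old" | perl -pe 's/^.+n0([12])\sflu.+\s([a-z]\d)\..+(fastq.gz)$/$2.r$1.$3/g')"
    # If the pattern did not match, perl returns the name unchanged; skip those files
    if [ "$new" = "$old" ]; then
        echo "no match, skipping: $old" >&2
        continue
    fi
    # Record the mapping instead of copying anything
    rename_plan="${rename_plan}${old} -> ${new}
"
done
printf '%s' "$rename_plan"
```

Once the printed mapping looks right, the same loop body can call cp (or mv) with "$i" and the new name.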

! echo "TAAAAAA" | perl -pe 's/([ACTG]){3}//g'

TAAAAAA

Customizing Your Unix Environment

This walkthrough is designed to show users some interesting ways to use the command line and set it up in a way that makes it most comfortable.

Getting Started

In order to follow the steps exactly as described in this tutorial you will need the following:

A personal computer with an ssh client
An ssh connection to a server (HPC, standalone, etc)
optional: a VPN connection to the network where the remote server is hosted.

What is a Unix Environment?

A Unix environment is composed of three components:

1) The kernel: this is the master control program of the operating system. It handles everything that makes a computer a computer, such as I/O streams, booting, peripherals, etc.

2) File system: this is a hierarchical file system. This is the system in which the operating system keeps track of all the files and folders it needs in order to function. An example filesystem is seen in the figure below:

3) The shell (a.k.a. terminal or command prompt): this is the interface between the user and the kernel. The main purpose of the shell is to take a command or series of commands and send them to the kernel to be processed into a set of instructions. This is the place where you will spend most of your time analyzing data, so you may as well make it comfortable to work in, which is the main objective of this tutorial.

.bashrc, .bash_profile, and .profile files

Each of these files could be considered a config file: they all help in making your terminal environment the way you want, though each has a particular purpose within the shell. Each of them is found in your home directory (/home/$USER/), which we will access via the shortcut ~ or ~/, which just saves you from having to type the full path to your home directory.

A note about hidden files: as you may have noticed, there is a preceding . in front of the filenames. This is Unix/Linux notation to hide a file by default when looking into a directory. This is because you most likely won’t want to see these files every time you ls your home directory (they can start to clutter things), but also because these are generally very important files that are critical to the way your account functions, so tampering with them by mistake could be very serious. In order to view them, you can run ls -a, which will list everything in the directory. To access them, use the complete filename, including the leading dot.
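
To see the difference between ls and ls -a without touching your real home directory, here is a minimal sketch that uses a throwaway directory (`demo_home` and the two file names are assumptions made for the example).

```shell
# Throwaway directory with one hidden file and one regular file
demo_home="$(mktemp -d)"
touch "$demo_home/.bashrc" "$demo_home/notes.txt"

# Plain ls skips names that start with a dot; ls -a lists everything,
# including the special entries . and ..
plain_listing="$(ls "$demo_home")"
full_listing="$(ls -a "$demo_home")"

echo "ls:"
echo "$plain_listing"
echo "ls -a:"
echo "$full_listing"
```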

.bashrc

This is the main file where you’ll add most of your configurations, such as aliases, the PATH variable, color schemes, etc. Unlike the other configuration files, this affects every terminal instance as opposed to only the login shell, which is handled by .bash_profile.

A few things that are defined in this file are:
* set the $PS1 variable, which displays hostname and current directory
* set the $PATH variable (discussed below)
* aliases
* history settings

IMPORTANT: this file should never output anything! You will run into some very frustrating situations if it does.
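
As a rough sketch of what such a file might contain, the fragment below sets one item from each category in the list above. The prompt format, the software directory ($HOME/software/bin), the alias names, and the history sizes are all assumptions chosen for illustration; note that nothing in it prints to the terminal.

```shell
# Prompt: user@host:current-directory$ (bash expands \u, \h and \w)
PS1='\u@\h:\w\$ '

# Put a personal software directory first on the PATH
# ($HOME/software/bin is an assumed location; adjust to your own setup)
export PATH="$HOME/software/bin:$PATH"

# Aliases: short names for commands you type often
alias ll='ls -lh'
alias la='ls -a'

# History settings: keep more commands and skip consecutive duplicates
HISTSIZE=10000
HISTFILESIZE=20000
HISTCONTROL=ignoredups
```

After editing the file, running source ~/.bashrc applies the changes to the current terminal.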

.bash_profile

This handles login shells, i.e. as soon as you ssh or log into a computer, this is the file that gets loaded. The reason is that sometimes you want to view diagnostics of the machine that you’re logging into (how long it has been running, whether any updates need to be installed, etc.), which you wouldn’t want to see in every other terminal instance. Unfortunately, a login shell does not pick up the settings in your .bashrc file on its own, so your path will not work unless you load your .bashrc file from within. To do so, you can add the following lines to your .bash_profile file (they may already be there, so check before you add them):

if [ -f "$HOME/.bashrc" ]; then
        source "$HOME/.bashrc"
fi

# likewise for .profile
if [ -f "$HOME/.profile" ]; then
        source "$HOME/.profile"
fi

.profile

This file isn’t used very often, however one thing to note: anything that should be available to graphical applications or sh MUST go here.

TMUX Session

Tmux is a terminal multiplexer that enables a number of terminals (or windows) to be created inside a single terminal window or remote terminal session. It is useful for dealing with multiple programs from a command-line interface, and for separating programs from the Unix shell that started the program. For the full set of features please check the man page or this handy cheatsheet.

To get started, ssh into your server:

ssh user@server

IMPORTANT: If you are using Dalma, please note that tmux is not installed on this system. Please skip to the next section of this tutorial or find another server to test this on.

To start a tmux session:

tmux

Or, you can name the tmux session to help keep things organized. In this example, the tmux session we will use is named ‘test’:

tmux new -s test

To detach from a tmux session (log out of the session but keep it running) press ctrl+b then d

To see existing sessions on your account:

tmux ls # or tmux list-sessions

NOTE: your tmux sessions will not appear if you are not on the same login node they were created on. For example, prince has log-0 and log-1. A tmux session created on log-0 will not be accessible from log-1 and vice versa. Please see the section on setting up your ssh alias below.

To log back into your tmux session:

tmux attach

Or, if you have multiple sessions:

tmux attach -t <session_name> # or test, in the case of this example
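
If you tend to forget whether a session already exists, one option is a small helper that attaches to it when it does and creates it otherwise. This is only a sketch: it assumes a tmux version that supports the -A flag of new-session, and the function name tmux_attach_or_create is made up for this example.

```shell
# Attach to the named session if it exists, otherwise create it.
# Assumes tmux is installed and its new-session command supports -A.
tmux_attach_or_create() {
    tmux new-session -A -s "$1"
}

# Usage (run inside an interactive terminal on the server):
#   tmux_attach_or_create test
```

A function like this could live in your .bashrc so that it is available in every terminal.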

Script Command

The script command is a way to log and replay the commands that were entered during a script session. This is useful for sharing workflows and keeping track of what has been done.

While in a tmux session, start a script session to begin logging your commands and name it test_commands:

script test_commands

To end a script session and dump the commands into a file, type exit or press ctrl+d. The results will be saved into a file named test_commands within your working directory. If no filename was provided, they will be saved to a file named typescript.

To append to an existing script file:

# in this example, the filename is test_commands
script -a test_commands

Downloading, Compiling, and Adding Software to the $PATH

Download and Compile

There are often times when we users do not have sufficient permissions to install software on a server, so we have to wait for someone with permissions to do it for us. As a workaround, you can download, compile (NOT install), and run packages yourself.

The first step is to grab the package you’re interested in. We can do this using wget, curl, or git clone. In this example, we will grab BWA using git clone. This will download the BWA source code into the current directory (it is good practice to keep all of your compiled software in one directory, say /scratch/$USER/software). Once it is done, we can compile the software using make. There are other ways to compile software, but this is one of the more common ones, so we will stick with it.

# Download the repository using git
git clone https://github.com/lh3/bwa.git

# change into the bwa directory
cd bwa

# compile using make
make

# once it is done, you can try to run bwa
./bwa

# You should see the help output from bwa.
# If you do then it works! If not, you may need to make the `bwa` binary file executable using `chmod +x ./bwa`
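
The check described in the comments above can be scripted. A minimal sketch follows; ensure_executable is a hypothetical helper name and is not part of BWA.

```shell
# Make a freshly built binary executable if it is not already.
# `ensure_executable` is a hypothetical helper, not part of BWA.
ensure_executable() {
    binary="$1"
    if [ ! -f "$binary" ]; then
        echo "not found: $binary (did make finish without errors?)" >&2
        return 1
    fi
    if [ ! -x "$binary" ]; then
        chmod +x "$binary"
    fi
}
```

From inside the bwa directory, you could then run ensure_executable ./bwa before trying ./bwa again.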

PATH Variable

Now that we have it running, we can continue with our analysis. However, it’s pretty annoying to have to write /scratch/$USER/software/bwa/bwa every time you want to run BWA; why can’t we just type bwa like all the other commands (ls, pwd, cp)? After all, it is a command, isn’t it? Well, in fact you can! This is where the PATH variable comes into play. For standard commands (such as those above), the PATH variable lists the directories that hold their binary executable files, telling your shell where to find them. So, if we print the $PATH variable:

echo $PATH
/opt/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games

You might see something like this, a list of directory paths, separated by :. If we ls one of those directories, say /bin:

ls /bin
bash          bzmore  dir        fuser   kill        lsblk   nc.openbsd    ntfsfix   pwd        sh            systemd-inhibit         uname       zfgrep
bunzip2       cat     dmesg      fusermount  kmod        lsmod   netcat        ntfsinfo  rbash      sh.distrib        systemd-machine-id-setup    uncompress      zforce
busybox       chacl   dnsdomainname  getfacl     less        mkdir   netstat       ntfsls    readlink   sleep         systemd-notify          unicode_start   zgrep
bzcat         chgrp   domainname     grep    lessecho    mknod   networkctl    ntfsmove  red        ss            systemd-tmpfiles        vdir        zless
bzcmp         chmod   dumpkeys       gunzip  lessfile    mktemp  nisdomainname     ntfstruncate  rm     static-sh         systemd-tty-ask-password-agent  vmmouse_detect  zmore
bzdiff        chown   echo       gzexe   lesskey     more    ntfs-3g       ntfswipe  rmdir      stty          tailf               wdctl       znew
bzegrep       chvt    ed         gzip    lesspipe    mount   ntfs-3g.probe     open      rnano      su            tar                 which       zsh
bzexe         cp      efibootmgr     hciconfig   ln      mountpoint  ntfs-3g.secaudit  openvt    run-parts  sync          tempfile            whiptail    zsh5
bzfgrep       cpio    egrep      hostname    loadkeys    mt      ntfs-3g.usermap   pidof     rzsh       systemctl         touch               ypdomainname
bzgrep        dash    false      ip      login       mt-gnu  ntfscat       ping      sed        systemd       true                zcat
bzip2         date    fgconsole      journalctl  loginctl    mv      ntfscluster       ping6     setfacl    systemd-ask-password  udevadm             zcmp
bzip2recover  dd      fgrep      kbd_mode    lowntfs-3g  nano    ntfscmp       plymouth  setfont    systemd-escape    ulockmgr_server         zdiff
bzless        df      findmnt        keyctl  ls      nc      ntfsfallocate     ps        setupcon   systemd-hwdb      umount              zegrep

These are many of the commands you use on a daily basis in the terminal (ls, pwd, cp). The $PATH variable is how your shell keeps track of which commands are available to you, since different users have different setups, privileges, software, etc. Now that we have a folder called software where we’re going to house all of our compiled software, we can add our software directories to the $PATH variable so that the binaries in them can be run from anywhere within the filesystem. If a file is executable and it resides in one of the directories listed in our $PATH variable, it can be executed from anywhere.
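
To see this lookup in action, you can print each $PATH entry on its own line and ask the shell which file it would run for a given command. A small sketch using only standard tools (the exact directories you see will depend on your system):

```shell
# Print each directory in $PATH on its own line, in search order
echo "$PATH" | tr ':' '\n'

# Ask the shell which file it would run for a given command
command -v ls
```

The shell searches the directories in the order listed, so the first match wins.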

To do this, we need to append the directory’s path to the $PATH variable. Use the full (absolute) path: a relative path is resolved against whatever directory you happen to be in, so the command would not be found from anywhere else. IMPORTANT: Make sure that you follow the commands exactly and append to the $PATH variable rather than overwrite it. If you overwrite it, the shell will not be able to find any of the necessary commands, since their directories will no longer be listed in the $PATH variable.

There are two ways to add the new path to the $PATH variable:

# To append to the path variable for the current session only (until you log out)
export PATH=$PATH:/scratch/$USER/software/bwa

# To verify, you can print your $PATH variable to see if it now contains the new path
echo $PATH

# To permanently add a path to your $PATH you will need to make this edit in your `.bashrc` file
vim ~/.bashrc

# Then at the top add the same command we used above and then save and quit
export PATH=$PATH:/scratch/$USER/software/bwa

# To reload your terminal environment with these new changes we will need to `source` the `.bashrc` file
source ~/.bashrc

# Now if you try to run `bwa` it should work. If not, then try logging out of your session and logging back in
bwa

# If you get the same output as ./bwa from the previous step, then congratulations!
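
One caveat: .bashrc is re-read every time you source it, so a plain export line appends the same directory again on each reload. A minimal sketch of a guard against that; path_append is a hypothetical helper name, not a built-in command.

```shell
# Append a directory to PATH only if it is not already listed.
# `path_append` is a hypothetical helper name.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: do nothing
        *) PATH="$PATH:$1" ;;        # otherwise append (never overwrite)
    esac
    export PATH
}
```

In .bashrc you could then write path_append "/scratch/$USER/software/bwa" in place of the export line. Repeated reloads would leave $PATH unchanged.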

Command History

Now that you know how to edit and reload the .bashrc file, we should go ahead and make some more changes that may be convenient for you.

It is common practice to search your command history to see the commands that were executed previously on your account (you can do so with the history command or search with ctrl+r). However, what if you enter 100,000 commands? Most accounts won’t remember that many by default; they typically keep only the last 1000 or so, even if some of those commands are redundant. To store all commands (or nearly all), you can add these lines to the .bashrc file:

vim ~/.bashrc

# Skips consecutive duplicate commands and doesn’t store commands that start with a space
HISTCONTROL=ignoreboth
# Maximum number of lines kept in the history file (chances are you won’t fill this up for several years)
HISTFILESIZE=10000000
# Number of commands that are stored in memory while your session is ongoing
HISTSIZE=100000

# save and quit and then source the file
source ~/.bashrc
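
For reference, here is what ignoreboth stands for, together with one optional extra setting that is not part of the original instructions. Treat this as a sketch: histappend is a bash-only option, so it is guarded here in case the file is read by another shell.

```shell
# ignoreboth is shorthand for these two options combined
HISTCONTROL=ignorespace:ignoredups

# In bash, append to the history file on exit instead of overwriting it,
# so several open sessions do not clobber each other's history
if [ -n "${BASH_VERSION:-}" ]; then
    shopt -s histappend
fi
```

This matters most when you keep several terminals or tmux windows open at once, as each one writes its own history when it exits.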

Aliases

Aliases are a way to rename commands or strings of commands. A common example is ssh; many analysts log into a server on a daily basis:

ssh -X at120@prince.hpc.nyu.edu

This is something I use on a daily basis. It may seem like not too much to type, but it does get annoying doing it every day. To make this more succinct, we can create an alias named something shorter which, when executed, will execute the ssh command. This is done in the .bashrc file as well. So open it up and add these lines:

# NOTE: you can change the server to any other server you'd like, you do not have to use prince.hpc.nyu.edu
# Also NOTE: you can change $USER to whatever username you use for that server. 
# For example, my laptop user is alan, but my hpc user is at120. So for this example I would use at120 instead of $USER
alias prince='ssh -X $USER@prince.hpc.nyu.edu'

source ~/.bashrc

Now you should be able to ssh into the server with just the command prince.
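
A note on quoting: with single quotes, $USER is expanded each time the alias runs, so it becomes your local username. If your remote username differs, one option is to expand a variable when the alias is defined, using double quotes. A minimal sketch; HPC_USER is a hypothetical variable name, and at120 is the username from the example above.

```shell
# Remote username from the example above; replace with your own.
# HPC_USER is a hypothetical variable name.
HPC_USER="at120"

# Double quotes expand $HPC_USER now, when the alias is defined
alias prince="ssh -X ${HPC_USER}@prince.hpc.nyu.edu"

# Print the stored definition to confirm what will run
alias prince
```

Printing the definition is a quick way to check that the username was filled in as expected before you rely on the alias.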

BUT I STILL NEED TO ENTER A PASSWORD! Is there a way to skip that part, too? Yes.

Password-less SSH

Believe it or not, this is a significantly more secure way to access other computers. However, that’s not the point of this section. In this section we’ll simply show briefly how it works and how to set it up.

The first step is to create a signature for your account, so that when you ssh into a computer it knows who you are and will let you in without prompting for a password. This signature is an RSA key pair: a private key (id_rsa) that stays on your machine and a public key (id_rsa.pub) that you share with the server. You may already have one; keys are stored in the ~/.ssh folder. If that folder doesn’t exist, create it:

# To make the .ssh directory
mkdir ~/.ssh # only do this if you do not have an .ssh directory in your home directory already!

# Then ls the .ssh directory to check to see if you have an RSA key already. It will be in the id_rsa.pub file. Your directory, if it exists, may look something like this
ls ~/.ssh
config  id_rsa  id_rsa.orig  id_rsa.pem  id_rsa.pub  id_rsa.pub.orig  keras-workshop.pem  known_hosts

# If id_rsa.pub exists, then skip the command immediately below.
# If it does not exist, we can create an RSA key with the following command
ssh-keygen -t rsa
# Press enter all the way through, even when it prompts for a passphrase!

# To verify, ls the directory and see if id_rsa.pub is now there
ls ~/.ssh

Now what we want to do is give our public RSA key to the server you’re going to access (presumably the one you created your alias for!). We can do that with a slightly sophisticated command (but don’t worry, you’ll learn what it means in one of our later posts 😉).

# This command appends your signature to the authorized keys file on the server, which is a list of signatures that it recognizes.
# NOTE: You will have to enter your password at this step.
cat ~/.ssh/id_rsa.pub | ssh at120@prince.hpc.nyu.edu 'cat >> .ssh/authorized_keys'

# Once the transfer is complete, try running the alias you created earlier
prince

Hopefully now you can log in without being prompted for a password!
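
If you are still asked for a password, one common cause is missing or overly permissive files on the server: the append command above assumes the .ssh directory already exists there, and the SSH server generally ignores keys whose files are writable by other users. A minimal sketch of a fix to run on the server; fix_ssh_perms is a hypothetical helper name.

```shell
# Create an SSH directory if needed and tighten its permissions.
# `fix_ssh_perms` is a hypothetical helper; with no argument it uses ~/.ssh.
fix_ssh_perms() {
    dir="${1:-$HOME/.ssh}"
    mkdir -p "$dir"
    chmod 700 "$dir"                    # only you can enter the directory
    touch "$dir/authorized_keys"
    chmod 600 "$dir/authorized_keys"    # only you can read or write the key list
}
```

Many OpenSSH installations also ship ssh-copy-id, which copies the key and sets these permissions in one step (for example, ssh-copy-id at120@prince.hpc.nyu.edu), if it is available on your machine.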