Regular Expressions

A regular expression (aka regex) is a sequence of characters that describes or matches a pattern of text. For example, the string aabb12 could be described as aabb12; as two a’s, two b’s, then 1, then 2; or as four letters followed by two numbers.

Keep in mind that regular expressions differ slightly between languages, but the theory and application remain the same. We will start with Python.

Metacharacters

Metacharacters are characters that have a special meaning rather than a literal one. There are many of these, and they are listed in the Python regular expression documentation.

For example, we’ll take the following characters to search a text file:
– ‘A’
– ‘.’
– ‘$’
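
A minimal sketch of that search in Python. The sample text here is made up and stands in for the text file; only the three patterns come from the list above.

```python
import re

# Hypothetical sample text standing in for the text file.
text = "A cat\nAnd a hat"

# Run each of the three patterns against the same text.
results = {pattern: re.findall(pattern, text) for pattern in ["A", ".", "$"]}

for pattern, matches in results.items():
    print(repr(pattern), len(matches), matches)
```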

It matches ‘A’! No surprise here. Now let’s see what the ‘.’ does:

It matched everything?! That’s because the ‘.’ is a special character that we will get into later.

Now let’s see what the dollar sign finds us:

Nope, not a bug: ‘$’ did match something but it is also a special character.

Position Metacharacters

This type of metacharacter matches based on where a character is located as opposed to what the character is.

Let’s take the $ for example. I told you that it matched something, but nothing was returned in the output. That is because $ is a special character that matches the end of a line. Let’s try an example:
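
One way to sketch this in Python, assuming a made-up sentence that contains ‘man’ twice:

```python
import re

# Hypothetical sentence; 'man' appears in the middle and at the end.
sentence = "The man spoke to the other man"

found = re.findall(r"man", sentence)
print(found)
```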

This is no surprise, it found both instances of ‘man’. Now, let’s say we only want to capture ‘man’ at the end of a string:
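
A sketch of the anchored version, again on a made-up sentence:

```python
import re

# Hypothetical sentence; 'man' appears in the middle and at the end.
sentence = "The man spoke to the other man"

# The trailing $ only allows a match at the end of the string.
end_only = re.findall(r"man$", sentence)
print(end_only)
```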

Trust me, this is grabbing the last man. Don’t believe me?
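
One way to check is to compare the position of the match with the length of the string (same made-up sentence as an assumption):

```python
import re

# Hypothetical sentence; 'man' appears in the middle and at the end.
sentence = "The man spoke to the other man"

match = re.search(r"man$", sentence)
# span() gives (start, end); the end should equal the string length.
print(match.span(), len(sentence))
```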

Told ya. Now, what happens if we actually do want to match a dollar sign? Well, we can use an escape character, \ (backslash), before the dollar sign to tell the regex interpreter to use the literal meaning. Let’s use the first example:

Nothing. Let’s try to escape it:
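
A sketch comparing the unescaped and escaped patterns, assuming a made-up string that contains a dollar sign:

```python
import re

# Hypothetical text containing a literal dollar sign.
price = "The total is $10"

# Unescaped: $ is the end-of-line anchor, so the match is an empty string.
unescaped = re.findall(r"$", price)

# Escaped: \$ means a literal dollar sign.
escaped = re.findall(r"\$", price)

print(unescaped, escaped)
```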

Now, what do you think this regular expression does?

Similarly, ‘^’ (caret, but not the kind that rabbits eat) matches the beginning of a line:

And now with the caret:
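
A sketch of both searches side by side, on a made-up sentence that starts and ends with ‘man’:

```python
import re

# Hypothetical sentence with 'man' at the start and at the end.
sentence = "man overboard, said the man"

# Without the caret: every occurrence is found.
anywhere = [m.start() for m in re.finditer(r"man", sentence)]

# With the caret: only the occurrence at the beginning is found.
at_start = [m.start() for m in re.finditer(r"^man", sentence)]

print(anywhere, at_start)
```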

There are also boundary metacharacters. For example, ‘\b’ matches a word boundary, so ‘ing\b’ matches an ‘ing’ only at the end of a word. Inversely, ‘\B’ matches a position that is not a word boundary, so ‘ing\B’ would match the ‘ing’ in ‘things’ but not in ‘thing’. This is useful for specifying substrings or whole words exclusively.
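
A small sketch of the two boundary metacharacters on the words ‘thing’ and ‘things’. The patterns ‘ing\b’ and ‘ing\B’ are assumptions chosen for illustration.

```python
import re

results = {}
for word in ["thing", "things"]:
    results[word] = {
        # ing followed by a word boundary (end of the word)
        "at_word_end": bool(re.search(r"ing\b", word)),
        # ing followed by another word character (inside the word)
        "inside_word": bool(re.search(r"ing\B", word)),
    }

print(results)
```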

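NOTE: Python interprets some backslash sequences in ordinary string literals before the regex engine ever sees them (for example, '\b' becomes a backspace character), so it is safest to write patterns as raw strings such as r'\bing'. See here for more details: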
https://stackoverflow.com/questions/2241600/python-regex-r-prefix
https://stackoverflow.com/questions/21104476/what-does-the-r-in-pythons-re-compiler-pattern-flags-mean

Single Metacharacters

These metacharacters match specific types of characters. For example, you can match any alphanumeric character (or underscore) with ‘\w’, or any whitespace character with ‘\s’.
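
A quick sketch of both on a made-up string containing a space and a tab:

```python
import re

# Hypothetical string with a space, a tab, and an underscore.
sample = "GFF v3\tfile_1"

word_chars = re.findall(r"\w", sample)   # letters, digits, underscore
whitespace = re.findall(r"\s", sample)   # space and tab

print(word_chars)
print(whitespace)
```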

Quantifiers

All of the examples so far have searched for individual characters. Quantifiers allow you to match repeated patterns.
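
A sketch of a few common quantifiers (‘+’ for one or more, ‘{n}’ for exactly n, ‘{n,m}’ for between n and m). The test strings are made up, apart from aabb12 from the introduction.

```python
import re

# Hypothetical text with runs of one, two, and three a's.
text = "a aa aaa"

one_or_more = re.findall(r"a+", text)
exactly_two = re.findall(r"a{2}", text)
two_or_three = re.findall(r"a{2,3}", text)

# "Four letters followed by two numbers", as in the aabb12 example.
intro_match = re.fullmatch(r"\w{4}\d{2}", "aabb12")

print(one_or_more, exactly_two, two_or_three, bool(intro_match))
```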

Character Classes

A character class allows you to match a particular set of user-defined characters as opposed to the predefined metacharacters we went over previously. Think of searching for any vowel.

Alternatively, when you use a caret ^ within the square brackets it will match the inverse of those defined in the character class. In this case, it will match every character that is not a vowel, which means consonants but also digits, whitespace, and punctuation.
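
A sketch of a vowel class and its negation on a made-up string. Note that the negated class also picks up the space and the digits.

```python
import re

# Hypothetical string with letters, a space, and digits.
word = "regex 101"

vowels = re.findall(r"[aeiou]", word)
non_vowels = re.findall(r"[^aeiou]", word)

print(vowels)
print(non_vowels)
```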

You can also specify ranges of characters if they are naturally consecutive by using a hyphen - to separate the beginning and the end:
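
A sketch of ranges inside a character class, on a made-up string:

```python
import re

# Hypothetical string mixing lowercase, uppercase, and digits.
sample = "sample a1, lane L004"

digit_runs = re.findall(r"[0-9]+", sample)          # runs of digits
uppercase = re.findall(r"[A-Z]", sample)            # single uppercase letters
letter_digit = re.findall(r"[a-z][0-9]", sample)    # lowercase letter then digit

print(digit_runs, uppercase, letter_digit)
```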

Alternations

This is essentially just an ‘or’ statement. An example would be if you are looking for whether a sentence says ‘we have ten dollars.’ or ‘I have ten dollars.’ Since the only varying piece of that sentence is the pronoun, you can try to match either.
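
A sketch of that ‘or’ using the pipe character inside a group. The third sentence is an added, made-up counterexample.

```python
import re

sentences = [
    "we have ten dollars.",
    "I have ten dollars.",
    "they have ten dollars.",  # hypothetical sentence that should not match
]

# (we|I) matches either pronoun; \. is a literal full stop.
pattern = re.compile(r"^(we|I) have ten dollars\.$")
matched = [s for s in sentences if pattern.search(s)]

print(matched)
```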

Backreferences

Backreferences/captures allow you to reuse part of a regular expression and/or the text that matched it.

Let’s give a biological example:
You’re looking for a motif that has flanking restriction sites ACTG. The motif can be of any length and any composition of nucleotides but is always flanked by those cut sites:
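
A minimal sketch, assuming a made-up DNA string and a non-greedy motif limited to the letters A, C, G, and T. The first group captures the flanking site and ‘\1’ requires the same text again.

```python
import re

# Hypothetical sequence: ACTG, then a motif, then ACTG again.
dna = "TTACTGGGATCCACTGAA"

# Group 1 is the flanking site, group 2 is the motif, \1 repeats group 1.
match = re.search(r"(ACTG)([ACGT]+?)\1", dna)
motif = match.group(2)

print(motif)
```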

Substitution Mode

There are several ways that one can use regular expressions, though the two main modes are searching and substituting.

Searching/matching will only look for whether or not a pattern is matched. Substituting will replace any pattern that is matched with another pattern. Above we have used only searching, so below will only showcase an example of a substitution.
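
A sketch of substitution with Python's re.sub, on made-up inputs:

```python
import re

# Replace every T with U (a DNA-to-RNA style substitution on a hypothetical sequence).
rna = re.sub(r"T", "U", "ACTGTT")

# Collapse runs of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "too    many   spaces")

print(rna)
print(tidy)
```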

Command Line Examples

Constructing a Regular Expression

Arguably the most difficult task with regular expressions is constructing one. Often I am trying to extract information that matches multiple different patterns, or the same pattern on different lines of text. In order to create a regular expression, you first need to identify the pattern. This generally requires a basic understanding of the file format and the relevant information, as well as finding a pattern that applies to the information that you’re interested in.

To give you some insight on how to do this, the examples below are scenarios that I often find myself in that are a perfect fit for regular expressions.

Parsing a GFF file

There are a few tools, such as sed, awk, and grep, that are used for text munging but I happen to use Perl, as it provides a lot of flexibility when doing complex regular expressions and other data munging tasks.

grep can only match patterns (or the inverse of patterns) and cannot be used to replace or transform text. Though limited, it is a nice tool to have for quick searches in files or across filesystems. In this example, all I need is to match a pattern, so grep will suffice.

A GFF file is a standard tab-delimited file format for genomic annotations. Most GFF files contain every type of annotated feature for that organism (mRNA, exons, UTRs, motifs, etc.), which are not always relevant to the analysis and may sometimes actually interfere with it.

Let’s say I want to extract all annotated mRNAs from this file. The first thing is to understand the GFF file format. Reading the documentation will tell you that it is tab-delimited and that the third column contains the feature type. So looking at the file will show you:

Looking at the entire file is daunting and doesn’t provide a list of features found within this file. So, we can use a few commands to get a full representation of each feature type described in the file.
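
The original commands are not reproduced here. As a rough Python sketch of the same idea, the snippet below counts the feature types in the third column. The GFF lines are invented for illustration and are not from the real file.

```python
from collections import Counter

# Hypothetical GFF lines (tab-delimited, feature type in column 3).
gff_lines = [
    "##gff-version 3",
    "Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "Chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=g1.1",
    "Chr1\tsrc\texon\t100\t400\t.\t+\t.\tParent=g1.1",
    "Chr1\tsrc\tmRNA\t1500\t2100\t.\t-\t.\tID=g2.1",
]

# Skip comment lines, then tally the third column.
feature_counts = Counter(
    line.split("\t")[2] for line in gff_lines if not line.startswith("#")
)

print(feature_counts)
```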

We see that there are a lot of RNAs here, and there happen to be two different mRNA feature types. So, now that we know the feature type and the format we can construct our regular expression:
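
A sketch of such a pattern in Python rather than grep. The lines and the second feature name (mRNA_TE_gene) are assumptions made up for this example, since the actual file contents are not shown here.

```python
import re

# Hypothetical GFF lines; the feature names are assumed for illustration.
gff_lines = [
    "Chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "Chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=g1.1",
    "Chr1\tsrc\tmRNA_TE_gene\t950\t1400\t.\t+\t.\tID=te1",
    "Chr1\tsrc\tmiRNA\t1500\t1600\t.\t-\t.\tID=mi1",
]

# Skip two tab-delimited columns, then require the third to start with mRNA.
mrna_pattern = re.compile(r"^(?:[^\t]*\t){2}mRNA\w*\t")
mrna_lines = [line for line in gff_lines if mrna_pattern.search(line)]

for line in mrna_lines:
    print(line)
```

Anchoring the pattern to the third column keeps it from matching ‘mRNA’ that merely appears somewhere in the attributes column.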

Renaming FastQ Files

The filenames for sequences generally have too many characters and too much information, which can be annoying when working with those files. To extract only the relevant information and shorten the filename, I often construct a regular expression, coupled with some bash commands, to rename the files as I see fit.

There is a lot of information here, some of which is intelligible only to Tubo. The only information that I would like to retain is the sample name, consisting of one lowercase letter followed by one number, and the pairing information, which is n0 and then a 1 or a 2.

What I usually do to get started is to work with one filename and construct the pattern from that. Regular expressions are read from left to right, so we should start by grabbing the first bit of information we want, which is the pair information:
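
A sketch of that first step in Python rather than Perl. The filename below is invented (the real filenames are not shown here); it only assumes the pair information appears before the sample name.

```python
import re

# Invented filename for illustration only.
filename = "run42_n01_lane3_a1_extra.fastq.gz"

# n0 followed by a 1 or a 2.
pair = re.search(r"n0[12]", filename).group()

print(pair)
```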

The syntax for a perl substitution regular expression is as follows:

perl -pe '<mode>/<pattern>/<replacement text>/<modifier>'

mode can be:
– substitution s
– match m
– translate tr

pattern is your regular expression
replacement text is the text to replace text that matches your regular expression

modifier can be:
– ignore case i
– global g: will find all patterns, not just the first

and some others. Take a look at the quick reference for perl regular expressions for more information.
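
For comparison, a rough Python counterpart to the substitution mode and its modifiers. One difference worth knowing: re.sub replaces every match unless count is given, so the g modifier is effectively the default. The input text is made up.

```python
import re

# Hypothetical input text.
text = "tag tag Tag"

# Like s/tag/x/ : replace only the first match.
first_only = re.sub(r"tag", "x", text, count=1)

# Like s/tag/x/g : replace all matches.
all_matches = re.sub(r"tag", "x", text)

# Like s/tag/x/gi : replace all matches, ignoring case.
ignore_case = re.sub(r"tag", "x", text, flags=re.IGNORECASE)

print(first_only, "|", all_matches, "|", ignore_case)
```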

Next would be to grab the sample name:
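
Extending the sketch with a second capture group, still in Python and still on an invented filename:

```python
import re

# Invented filename for illustration only.
filename = "run42_n01_lane3_a1_extra.fastq.gz"

# Group 1: pair information. Group 2: one lowercase letter and one digit
# sitting between underscores.
match = re.search(r"(n0[12])_.*?_([a-z][0-9])_", filename)
pair, sample = match.groups()

print(pair, sample)
```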

Lastly, capture the file extension and rearrange the captured patterns:
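
A sketch of the full pattern with three capture groups and a rearranged replacement. The filename and the .fastq.gz extension are assumptions for illustration.

```python
import re

# Invented filename for illustration only.
filename = "run42_n01_lane3_a1_extra.fastq.gz"

# Group 1: pair, group 2: sample, group 3: extension.
pattern = re.compile(r"^.*(n0[12])_.*_([a-z][0-9])_.*?(\.fastq\.gz)$")

# Put the sample first, then the pair, then the extension.
new_name = pattern.sub(r"\2_\1\3", filename)

print(new_name)
```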

Now to wrap it up in a for loop and invoke a copy command:
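
A Python sketch of the loop, using a temporary directory and empty, invented files so that it can run anywhere. Copying rather than moving leaves the originals in place in case the pattern is wrong.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Same assumed pattern: pair, sample, and extension as capture groups.
pattern = re.compile(r"^.*(n0[12])_.*_([a-z][0-9])_.*?(\.fastq\.gz)$")

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)

    # Create empty stand-in files with invented names.
    for name in ["run42_n01_lane3_a1_extra.fastq.gz",
                 "run42_n02_lane3_a1_extra.fastq.gz"]:
        (workdir / name).touch()

    # sorted() builds the file list before any copies are made.
    for path in sorted(workdir.glob("*.fastq.gz")):
        new_name = pattern.sub(r"\2_\1\3", path.name)
        shutil.copy(path, workdir / new_name)

    renamed = sorted(
        p.name for p in workdir.iterdir() if not p.name.startswith("run42")
    )

print(renamed)
```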
