Institut Pasteur Logo

1 Preliminary considerations

2 Instructions

2.1

Log into the virtual machine (accessible from https://desktop.pasteur.fr if you are in Institut Pasteur of if the VPN is on):

  • Use your usual Pasteur credentials.

  • Select the “Ubuntu PhD HubBioIT” machine.

  • The part before the @ in the short form of your e-mail will be your login / user name in this machine.

2.2

Open a command-line terminal (“Show Applications” button at the bottom of the left-side dock, then search for “Terminal”), as well as a graphical file browser (“Files” button in the left-side dock).

2.3

Look at what the path to your current working directory is, using pwd.

username@host:~$ pwd
/home/username

By default, a new terminal starts with the current working directory set as your home directory.

It is the directory within which you are allowed to create, modify and erase files and folders. The rest of the file system should be protected against your modification attempts (voluntary or not).

2.4

Use the ls command to list the files present in your home folder.

username@host:~$ ls
build
data
Documents
exercises
exercises.zip
galaxy.md
images
index.md
main_chars.md
Makefile
Notes.md
Program_interfaces.md
README.md
R_script_practice.md
R_script_practice.Rmd
scripts
Shell_practice.md
Shell_practice.Rmd

(The actual list will likely be different in your session.)

It is good practice not to work directly inside your home directory. You should have a Documents folder better suited for that.

2.5

Use the cd (change directory) command followed by this folder’s path as argument in order to set your working directory to be this folder.

username@host:~$ cd Documents

Tab-completion

In the above example, you can just type cd Doc and then use the tabulation key of your keyboard to have the path auto-completed. This helps avoiding typos and saves time. If there are several possibilities, hitting the tab key twice reveals the available options. Add the necessary letters to lift the ambiguity, and use tab again.


In the virtual machine, with the default configuration, your prompt will probably be updated to reflect the change, but the command itself will not generate any output in the terminal.

2.6

Check what the path to your working directory is now, using pwd.

username@host:~/Documents$ pwd
/home/username/Documents

2.7

Look at the content of this directory using ls.

username@host:~/Documents$ ls

If you are using a fresh new account, this directory should be empty, and no results will be displayed.

2.8

Create a new folder named HTS with the mkdir (make directory) command.

username@host:~/Documents$ mkdir HTS

If everything went well, you should get a new prompt, without any other textual feedback.

2.9

Use ls to check that the new folder has indeed been created.

username@host:~/Documents$ ls
HTS

2.10

Use the cd command to set your working directory to be this folder.

username@host:~/Documents$ cd HTS

2.11

Check again what your working directory is.

username@host:~/Documents/HTS$ pwd
/home/username/Documents/HTS

2.12

Download the zip archive containing the exercise material: http://hub-courses.pages.pasteur.fr/refresher_utilities_hts/exercises.zip

This can be done using the wget command, with the URL of the file to download as argument.

username@host:~/Documents/HTS$ wget http://hub-courses.pages.pasteur.fr/refresher_utilities_hts/exercises.zip

(If you happen to do this exercise on your own computer under Mac OSX, and wget is not available, try the curl command.)

2.13

Check with ls that the archive appeared.

username@host:~/Documents/HTS$ ls
exercises.zip

2.14

List the content of the archive to see what’s inside (using the zipinfo command), then extract the archive (using the unzip command). Remember that you can use tab-completion to save time and avoid typos.

username@host:~/Documents/HTS$ zipinfo exercises.zip
Archive:  exercises.zip
Zip file size: 92540 bytes, number of entries: 13
drwxr-xr-x  3.0 unx        0 bx stor 24-Mar-28 21:02 exercises/
drwxrwxrwx  3.0 unx        0 bx stor 24-Mar-28 21:02 exercises/scripts/
-rw-rw-rw-  3.0 unx     9565 tx defN 24-Mar-28 21:02 exercises/scripts/Rpractice.R
-rw-rw-rw-  3.0 unx        0 bx stor 24-Mar-28 21:02 exercises/scripts/.gitkeep
-rw-rw-rw-  3.0 unx      940 tx defN 24-Mar-28 21:02 exercises/scripts/DEPRECATED_practice1.R
drwxrwxrwx  3.0 unx        0 bx stor 24-Mar-28 21:02 exercises/data/
-rw-rw-rw-  3.0 unx    30282 tx defN 24-Mar-28 21:02 exercises/data/nCov.fasta
-rw-rw-rw-  3.0 unx    30001 tx defN 24-Mar-28 21:02 exercises/data/NC_045512_one_line.fa
-rw-rw-rw-  3.0 unx   198205 tx defN 24-Mar-28 21:02 exercises/data/mi.csv
-rw-rw-rw-  3.0 unx     4738 bx defN 24-Mar-28 21:02 exercises/data/scerevisiae_2-micron.20170117.gff.gz
-rw-rw-rw-  3.0 unx   119393 tx defN 24-Mar-28 21:02 exercises/data/nCov_variants.fa
-rw-rw-rw-  3.0 unx        0 bx stor 24-Mar-28 21:02 exercises/data/.gitkeep
-rw-rw-rw-  3.0 unx    16070 tx defN 24-Mar-28 21:02 exercises/data/nCov.bed
13 files, 409194 bytes uncompressed, 90158 bytes compressed:  78.0%
username@host:~/Documents/HTS$ unzip exercises.zip
Archive:  exercises.zip
   creating: exercises/
   creating: exercises/scripts/
  inflating: exercises/scripts/Rpractice.R  
 extracting: exercises/scripts/.gitkeep  
  inflating: exercises/scripts/DEPRECATED_practice1.R  
   creating: exercises/data/
  inflating: exercises/data/nCov.fasta  
  inflating: exercises/data/NC_045512_one_line.fa  
  inflating: exercises/data/mi.csv   
  inflating: exercises/data/scerevisiae_2-micron.20170117.gff.gz  
  inflating: exercises/data/nCov_variants.fa  
 extracting: exercises/data/.gitkeep  
  inflating: exercises/data/nCov.bed  

2.15

Check with ls that you now have a new folder (the extra information provided by option -l start with a “d” when a “directory” is in the content list).

username@host:~/Documents/HTS$ ls -l
total 96
drwxr-xr-x 4 root root  4096 Mar 28 21:02 exercises
-rw-r--r-- 1 root root 92540 Mar 28 21:02 exercises.zip

2.16

Set your current working directory to be that new folder, using cd.

username@host:~/Documents/HTS$ cd exercises

2.17

Look at the content of this directory using ls, with extra information in the content list.

username@host:~/Documents/HTS/exercises$ ls -l
total 8
drwxrwxrwx 2 root root 4096 Mar 28 21:02 data
drwxrwxrwx 2 root root 4096 Mar 28 21:02 scripts

2.18

What do we have in the data folder? (Use ls again, with an argument)

username@host:~/Documents/HTS/exercises$ ls data
mi.csv
NC_045512_one_line.fa
nCov.bed
nCov.fasta
nCov_variants.fa
scerevisiae_2-micron.20170117.gff.gz

The file extensions give hints about the type of content.

  • .csv stands for “comma-separated values” (more about this type of file here).

  • .fa or .fasta is usually used for text files containing one or more nucleic acid or amino-acid sequence(s).

  • .bed is for a common text-based format used to list features associated with genomic coordinates (more information here).

  • .gff is for another text-based format used to document annotations in a genome (more information here).

  • An extra .gz extension is for files compressed in the gzip format.

These are all purely text-based formats. This makes them easy to process using a large variety of tools, from generalist text-processing utilities to specialized bioinformatics programs. We will soon see how it looks inside.

Note that you can in principle use whatever extension you like in Linux, independently of the actual content of the file. It is better to stick to common conventions, though.

2.19

Compare with ls -l and ls -lh.


Command history and editing

You can use the up and down arrows to navigate in the history of the commands you have already typed. Then, you can edit a re-called command. Use the left and right arrows to position your insertion point if mouse clicking does’t work in the command-line. Editing previous commands avoids wasting time when testing variants of the same command.


username@host:~/Documents/HTS/exercises$ ls -l data
total 404
-rw-rw-rw- 1 root root 198205 Mar 28 21:02 mi.csv
-rw-rw-rw- 1 root root  30001 Mar 28 21:02 NC_045512_one_line.fa
-rw-rw-rw- 1 root root  16070 Mar 28 21:02 nCov.bed
-rw-rw-rw- 1 root root  30282 Mar 28 21:02 nCov.fasta
-rw-rw-rw- 1 root root 119393 Mar 28 21:02 nCov_variants.fa
-rw-rw-rw- 1 root root   4738 Mar 28 21:02 scerevisiae_2-micron.20170117.gff.gz
username@host:~/Documents/HTS/exercises$ ls -lh data
total 404K
-rw-rw-rw- 1 root root 194K Mar 28 21:02 mi.csv
-rw-rw-rw- 1 root root  30K Mar 28 21:02 NC_045512_one_line.fa
-rw-rw-rw- 1 root root  16K Mar 28 21:02 nCov.bed
-rw-rw-rw- 1 root root  30K Mar 28 21:02 nCov.fasta
-rw-rw-rw- 1 root root 117K Mar 28 21:02 nCov_variants.fa
-rw-rw-rw- 1 root root 4.7K Mar 28 21:02 scerevisiae_2-micron.20170117.gff.gz

2.20

Look at the first 10 lines of the nCov.fasta file, using the head command with the path to this file as argument (remember that this file is inside the data folder and that the tab key is your friend).

username@host:~/Documents/HTS/exercises$ head data/nCov.fasta
>EPI_ISL_402124
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA
AATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGG
ACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
CGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGC
CTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACAT
CTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAA
ACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAG
AACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGA

The first line, starting with “>”, is a sequence header. It sets a name for the sequence that follows.

2.21

Change directory so that from now on you will work from within the data folder.

username@host:~/Documents/HTS/exercises$ cd data

2.22

Check the manual of the tail command in order to find how to display the last three lines of a text file. For this, use the man command, followed by the name of the command you want to know more about.

username@host:~/Documents/HTS/exercises/data$ man tail

You can navigate within this manual using the up and down arrows of the keyboard. Press q to exit the manual. This is an example of program having an interactive keyboard-driven interface. Many commands have an entry in the manual.

2.23

Now use what you just learned to display the last three lines of the nCov.fasta file. Don’t forget that you recently changed your working directory.

username@host:~/Documents/HTS/exercises/data$ tail -n 3 nCov.fasta
ACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAATTTTAGTAGTGCTATCCCCAT
GTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAAAAAAAAAAA

You just used a command (tail), an option (-n) with its own argument (3), and an argument for the whole command (the path to the file nCov.fasta).

(In this case, a shortcut syntax is also possible: tail -3 nCov.fasta.)

2.24

The grep command can extract lines from a text file that contain a certain pattern. Use it to extract the sequence headers from the nCov.fasta file.

You should use two arguments. The first one should be the pattern you are looking for (between quotes), the second should be the path to the file.

username@host:~/Documents/HTS/exercises/data$ grep ">" nCov.fasta
>EPI_ISL_402124

Notice that there is only one header line. This fasta file contains one large sequence: An example SARS-CoV-2 genome.

If you were dealing with the genome of a eukaryote, you would likely have one sequence for each chromosome. For a bacteria, you could have one sequence for the main chromosome, and one for each plasmid.

2.25

Try the same search on the file nCov_variants.fa.

username@host:~/Documents/HTS/exercises/data$ grep ">" nCov_variants.fa
>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
>SARS-CoV2-B.1.1.7-501Y.V1-UK
>SARS-CoV2-B.1.135-501Y.V2-SAf
>SARS-CoV2-P.1-501Y.V3-BR

2.26

The file nCov.bed contains information about features present at some genomic coordinates. Count the number of lines in this file, using the wc (word count) command and its -l option to count lines instead of words.

username@host:~/Documents/HTS/exercises/data$ wc -l nCov.bed
59 nCov.bed

2.27

Have a look at the first four lines of this file, using head and a shortcut notation similar to the one mentioned earlier for tail.

username@host:~/Documents/HTS/exercises/data$ head -4 nCov.bed
EPI_ISL_402124  0   265 id-EPI_ISL_402124:1..265    .   +   RefSeq  five_prime_UTR  .   ID=id-EPI_ISL_402124:1..265;gbkey=5'UTR
EPI_ISL_402124  0   29903   EPI_ISL_402124:1..29903 .   +   RefSeq  region  .   ID=EPI_ISL_402124:1..29903;Dbxref=taxon:2697049;collection-date=Dec-2019;country=China;gb-acronym=SARS-CoV-2;gbkey=Src;genome=genomic;isolate=Wuhan-Hu-1;mol_type=genomic RNA;nat-host=Homo sapiens;old-name=Wuhan seafood market pneumonia virus
EPI_ISL_402124  265 805 id-YP_009724389.1:1..180    .   +   RefSeq  mature_protein_region_of_CDS    .   ID=id-YP_009724389.1:1..180;Note=nsp1%3B produced by both pp1a and pp1ab;Parent=cds-YP_009724389.1;gbkey=Prot;product=leader protein;protein_id=YP_009725297.1
EPI_ISL_402124  265 805 id-YP_009725295.1:1..180    .   +   RefSeq  mature_protein_region_of_CDS    .   ID=id-YP_009725295.1:1..180;Note=nsp1%3B produced by both pp1a and pp1ab;Parent=cds-YP_009725295.1;gbkey=Prot;product=leader protein;protein_id=YP_009742608.1

A word on bed format

This contains tabular information where each line corresponds to a genomic feature, and where columns are separated by tabulations, a special blank character widely used in bioinformatics.

The first three columns correspond to the genomic coordinates:

  1. a “chromosome” name;
  2. a (zero-based) start position of the feature within the chromosome;
  3. a (one-based) stop position of the feature.

(For instance, EPI_ISL_402124 0 265 corresponds to the first 265 positions of chromosome “EPI_ISL_402124”.)

  1. The fourth column contains a name for the described feature.

  2. The fifth contains a “score” (a dot is used when no score information is provided).

  3. The sixth column specifies the strand with respect to which the feature should be interpreted (“+” or “-”).

Other columns can be used to provide more information.


We will use this bed file to extract the individual sequences corresponding to each feature.

But before creating new files, let us create a folder in which we can safely work with copies of the bed and fasta files without risking to erase existing data.

We want to name it work and create it outside the current data directory.

For this, we need to move back to the folder containing the one in which we are currently working. This can be done using the special .. argument to cd.

When .. is present in a path, it means “the directory just above”: exercises/data/../../exercises/data/.. is actually the same thing as exercises.

2.28

Using the above trick, use cd to get back to the exercises folder contained in the HTS folder that is contained in the Documents folder which belongs to our home directory.

username@host:~/Documents/HTS/exercises/data$ cd ..

2.29

Now, use mkdir to create a work directory, and check with ls that it has been created.

username@host:~/Documents/HTS/exercises$ mkdir work
username@host:~/Documents/HTS/exercises$ ls
data
scripts
work

The command to copy files is cp and takes two arguments. The first one is the “source” (the path to the existing file) and the second is the “destination” (the path to the future copy).

2.30

Make copies of the nCov.fasta and nCov.bed files present in the data folder and put them inside the work folder.

username@host:~/Documents/HTS/exercises$ cp data/nCov.fasta work/nCov.fasta

It also works if we only specify a destination folder. The name of the copy will be the same as the source.

username@host:~/Documents/HTS/exercises$ cp data/nCov.bed work

2.31

Now move into the work directory, and check its content.

cd work
ls

We will use the bed file and the fasta file to create a new fasta file containing one sequence for each feature.

This can be done using bedtools.

Like many commands, bedtools can display some help information. The actual way to obtain this information depends on the command. Most often, it is obtained via a -h or a --help (with two dashes) option. Sometimes, just typing the command without arguments does that too. bedtools accepts the three ways.

2.32

Look at bedtool’s help using the -h option.

username@host:~/Documents/HTS/exercises/work$ bedtools -h
bedtools is a powerful toolset for genome arithmetic.

Version:   v2.27.1
About:     developed in the quinlanlab.org and by many contributors worldwide.
Docs:      http://bedtools.readthedocs.io/
Code:      https://github.com/arq5x/bedtools2
Mail:      https://groups.google.com/forum/#!forum/bedtools-discuss

Usage:     bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    shift         Adjust the position of intervals.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
    random        Generate random intervals in a genome.
    shuffle       Randomly redistrubute intervals in a genome.
    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.

[ General help ]
    --help        Print this help menu.
    --version     What version of bedtools are you using?.
    --contact     Feature requests, bugs, mailing lists, etc.

We will use the getfasta sub-command, which can be called using bedtools with the getfasta argument.

2.33

To have an idea of how it works, invoke it using it’s own -h option.

username@host:~/Documents/HTS/exercises/work$ bedtools getfasta -h

Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.27.1
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
    -fi Input FASTA file
    -fo Output file (opt., default is STDOUT
    -bed    BED/GFF/VCF file of ranges to extract from -fi
    -name   Use the name field for the FASTA header
    -name+  Use the name field and coordinates for the FASTA header
    -split  given BED12 fmt., extract and concatenate the sequences
        from the BED "blocks" (e.g., exons)
    -tab    Write output in TAB delimited format.
        - Default is FASTA format.

    -s  Force strandedness. If the feature occupies the antisense,
        strand, the sequence will be reverse complemented.
        - By default, strand information is ignored.

    -fullHeader Use full fasta header.
        - By default, only the word before the first space or tab 
        is used.

We see that we need to specify a fasta file using -fi and a bed file using -bed. There’s actually also an option to specify an output file: -fo. However, this option is not mentioned in the help of bedtools versions prior to 2.27.

We also see that we can use the -s option to take into account the feature’s strand information, and the -name option to use the fourth column of the bed file as headers in the resulting fasta output.

2.34

Put this all together to generate a new fasta file named nCov_features.fa, and check that the desired file has been created.

username@host:~/Documents/HTS/exercises/work$ bedtools getfasta -fi nCov.fasta -bed nCov.bed -fo nCov_features.fa -s -name
index file nCov.fasta.fai not found, generating...
Feature (EPI_ISL_402124:0-29903) beyond the length of EPI_ISL_402124 size (29891 bp).  Skipping.
Feature (EPI_ISL_402124:29674-29903) beyond the length of EPI_ISL_402124 size (29891 bp).  Skipping.
username@host:~/Documents/HTS/exercises/work$ ls
nCov.bed
nCov.fasta
nCov.fasta.fai
nCov_features.fa

We’ll now use the grep command to identify genomic features in nCov.bed that could correspond to the spike protein.

2.35

First, try to search for the "SPIKE" pattern in the file.

username@host:~/Documents/HTS/exercises/work$ grep "SPIKE" nCov.bed

By default, grep is a case-sensitive command. Maybe the file contains “spike”, but not “SPIKE”.

We can confirm this by counting how many lines contain “SPIKE” and how many contain “spike”.

2.36

Use the man command (arrows to navigate, q to quit) to find a suitable option in grep, and use it to count lines containing “SPIKE”, then containing “spike”.

username@host:~/Documents/HTS/exercises/work$ man grep
username@host:~/Documents/HTS/exercises/work$ grep -c "SPIKE" nCov.bed
0
username@host:~/Documents/HTS/exercises/work$ grep -c "spike" nCov.bed
2

2.37

There is also an option to ignore case. Find it in the manual, then try it with the pattern "SPIKE".

username@host:~/Documents/HTS/exercises/work$ man grep
username@host:~/Documents/HTS/exercises/work$ grep -i "SPIKE" nCov.bed
EPI_ISL_402124  21562   25384   cds-YP_009724390.1  .   +   RefSeq  CDS 0   ID=cds-YP_009724390.1;Parent=gene-GU280_gp02;Dbxref=Genbank:YP_009724390.1,GeneID:43740568;Name=YP_009724390.1;Note=structural protein%3B spike protein;gbkey=CDS;gene=S;locus_tag=GU280_gp02;product=surface glycoprotein;protein_id=YP_009724390.1
EPI_ISL_402124  21562   25384   gene-GU280_gp02 .   +   RefSeq  gene    .   ID=gene-GU280_gp02;Dbxref=GeneID:43740568;Name=S;gbkey=Gene;gene=S;gene_biotype=protein_coding;gene_synonym=spike glycoprotein;locus_tag=GU280_gp02

We notice that there are actually two lines containing the desired pattern, but in lower-case. One is for the full gene, the other for the coding part only (CDS).

2.38

Using, grep, locate the sequence of the CDS of the spike protein in the nCov_features.fa file. (Remember that you used the names in the bed file (4th column) to generate the headers in the fasta file.)

username@host:~/Documents/HTS/exercises/work$ grep "cds-YP_009724390.1" nCov_features.fa
>cds-YP_009724390.1(+)

(Even more precise: grep ">cds-YP_009724390.1" nCov_features.fa)

2.39

Use the -A option to not only display the header line, but also the line just after it. (You can use grep --help for more options: grep has both a command-line help and a manual page.)

The options should be put before the pattern and the file.

username@host:~/Documents/HTS/exercises/work$ grep -A 1 "cds-YP_009724390.1" nCov_features.fa
>cds-YP_009724390.1(+)
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTCGAAGACCCAGTCCCTACTTATTGTTAATAACGCTACTAATGTTGTTATTAAAGTCTGTGAATTTCAATTTTGTAATGATCCATTTTTGGGTGTTTATTACCACAAAAACAACAAAAGTTGGATGGAAAGTGAGTTCAGAGTTTATTCTAGTGCGAATAATTGCACTTTTGAATATGTCTCTCAGCCTTTTCTTATGGACCTTGAAGGAAAACAGGGTAATTTCAAAAATCTTAGGGAATTTGTGTTTAAGAATATTGATGGTTATTTTAAAATATATTCTAAGCACACGCCTATTAATTTAGTGCGTGATCTCCCTCAGGGTTTTTCGGCTTTAGAACCATTGGTAGATTTGCCAATAGGTATTAACATCACTAGGTTTCAAACTTTACTTGCTTTACATAGAAGTTATTTGACTCCTGGTGATTCTTCTTCAGGTTGGACAGCTGGTGCTGCAGCTTATTATGTGGGTTATCTTCAACCTAGGACTTTTCTATTAAAATATAATGAAAATGGAACCATTACAGATGCTGTAGACTGTGCACTTGACCCTCTCTCAGAAACAAAGTGTACGTTGAAATCCTTCACTGTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTTTATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTGATTATTCTGTCCTATATAATTCCGCATCATTTTCCACTTTTAAGTGTTATGGAGTGTCTCCTACTAAATTAAATGATCTCTGCTTTACTAATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAAATTACCAGATGATTTTACAGGCTGCGTTATAGCTTGGAATTCTAACAATCTTGATTCTAAGGTTGGTGGTAATTATAATTACCTGTATAGATTGTTTAGGAAGTCTAATCTCAAACCTTTTGAGAGAGATATTTCAACTGAAATCTATCAGGCCGGTAGCACACCTTGTAATGGTGTTGAAGGTTTTAATTGTTACTTTCCTTTACAATCATATGGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATGCACCAGCAACTGTTTGTGGACCTAAAAAGTCTACTAATTTGGTTAAAAACAAATGTGTCAATTTCAACTTCAATGGTTTAACAGGCACAGGTGTTCTTACTGAGTCTAACAAAAAGTTTCTGCCTTTCCAACAATTTGGCAGAGACATTGCTGACACTACTGATGCTGTCCGTGATCCACAGACACTTGAGATTCTTGACATTACACCATGTTCTTTTGGTGGTGTCAGTGTTATAACACCAGGAACAAATACTTCTAACCAGGTTGCTGTTCTTTATCAGGATGTTAACTGCACAGAAGTCCCTGTTGCTATTCATGCAGATCAACTTACTCCTACTTGGCGTGTTTATTCTACAGGTTCTAATGTTTTTCAAACACGTGCAGGCTGTTTAATAGGGGCTGAACATGTCAACAACTCATATGAGTGTGACATACCCATTGGTGCAGGTATATGCGCTAGTTATCAGACTCAGACTAATTCTCCTCGGCGGGCACGTAGTGTAGCTAGTCAATCCATCATTGCCTACACTATGTCACTTGGTGCAGAAAATTCAGTTGCTTACTCTAATAACTCTATTGCCATACCCACAAATTTTACTATTAGTGTTACCACAGAAATTCTACCAGTGTCTATGACCAAGACATCAGTAGATTGTACAATGTACATTTGTGGTGATTCAACTGAATGCAGCAATCTTTTGTTGCAATATGGCAGTTTTTGTACACAATTAAACCGTGCTTTAACTGGAATAGCTGTTGAACAAGACAAAAACACCCAAGAAGTTTTTGCACAAGTCAAACAAATTTACAAAACACCACCAATTAAAGATTTTGGTGGTTTTAATTTTTCACAAATATTACCAGATCCATCAAAACCAAGCAAGAGGTCATTTATTGAAGATCTACTTTTCAACAAAGTGACACTTGCAGATGCTGGCTTCATCAAACAATATGGTGATTGCCTTGGTGATATTGCTGCTAGAGACCTCATTTGTGCACAAAAGTTTAACGGCCTTACTGTTTTGCCACCTTTGCTCACAGATGAAATGATTGCTCAATACACTTCTGCACTGTTAGCGGGTACAATCACTTCTGGTTGGACCTTTGGTGCAGGTGCTGCATTACAAATACCATTTGCTATGCAAATGGCTTATAGGTTTAATGGTATTGGAGTTACACAGAATGTTCTCTATGAGAACCAAAAATTGATTGCCAACCAATTTAATAGTGCTATTGGCAAAATTCAAGACTCACTTTCTTCCACAGCAAGTGCACTTGGAAAACTTCAAGATGTGGTCAACCAAAATGCACAAGCTTTAAACACGCTTGTTAAACAACTTAGCTCCAATTTTGGTGCAATTTCAAGTGTTTTAAATGATATCCTTTCACGTCTTGACAAAGTTGAGGCTGAAGTGCAAATTGATAGGTTGATCACAGGCAGACTTCAAAGTTTGCAGACATATGTGACTCAACAATTAATTAGAGCTGCAGAAATCAGAGCTTCTGCTAATCTTGCTGCTACTAAAATGTCAGAGTGTGTACTTGGACAATCAAAAAGAGTTGATTTTTGTGGAAAGGGCTATCATCTTATGTCCTTCCCTCAGTCAGCACCTCATGGTGTAGTCTTCTTGCATGTGACTTATGTCCCTGCACAAGAAAAGAACTTCACAACTGCTCCTGCCATTTGTCATGATGGAAAAGCACACTTTCCTCGTGAAGGTGTCTTTGTTTCAAATGGCACACACTGGTTTGTAACACAAAGGAATTTTTATGAACCACAAATCATTACTACAGACAACACATTTGTGTCTGGTAACTGTGATGTTGTAATAGGAATTGTCAACAACACAGTTTATGATCCTTTGCAACCTGAATTAGACTCATTCAAGGAGGAGTTAGATAAATATTTTAAGAATCATACATCACCAGATGTTGATTTAGGTGACATCTCTGGCATTAATGCTTCAGTTGTAAACATTCAAAAAGAAATTGACCGCCTCAATGAGGTTGCCAAGAATTTAAATGAATCTCTCATCGATCTCCAAGAACTTGGAAAGTATGAGCAGTATATAAAATGGCCATGGTACATTTGGCTAGGTTTTATAGCTGGCTTGATTGCCATAGTAATGGTGACAATTATGCTTTGCTGTATGACCAGTTGCTGTAGTTGTCTCAAGGGCTGTTGTTCTTGTGGATCCTGCTGCAAATTTGATGAAGACGACTCTGAGCCAGTGCTCAAAGGAGTCAAATTACATTACACATAA

2.40

To create a file named spike_CDS.fa containing the selected sequence, you can use the redirection symbol > between the previous command and the path to the desired file.

username@host:~/Documents/HTS/exercises/work$ grep -A 1 ">cds-YP_009724390.1" nCov_features.fa > spike_CDS.fa

Instead of being displayed in the terminal, the fasta header and the following sequence were redirected into a file.

Note that if a file already exists with this name, the redirection with > will overwrite it. It is possible to “append” content at the end of an existing file by using >> instead of >.

2.41

Check that the file has indeed been created.

username@host:~/Documents/HTS/exercises/work$ ls
nCov.bed
nCov.fasta
nCov.fasta.fai
nCov_features.fa
spike_CDS.fa

2.42

Use the cat command to display the full content of the file.

username@host:~/Documents/HTS/exercises/work$ cat spike_CDS.fa
>cds-YP_009724390.1(+)
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTCGAAGACCCAGTCCCTACTTATTGTTAATAACGCTACTAATGTTGTTATTAAAGTCTGTGAATTTCAATTTTGTAATGATCCATTTTTGGGTGTTTATTACCACAAAAACAACAAAAGTTGGATGGAAAGTGAGTTCAGAGTTTATTCTAGTGCGAATAATTGCACTTTTGAATATGTCTCTCAGCCTTTTCTTATGGACCTTGAAGGAAAACAGGGTAATTTCAAAAATCTTAGGGAATTTGTGTTTAAGAATATTGATGGTTATTTTAAAATATATTCTAAGCACACGCCTATTAATTTAGTGCGTGATCTCCCTCAGGGTTTTTCGGCTTTAGAACCATTGGTAGATTTGCCAATAGGTATTAACATCACTAGGTTTCAAACTTTACTTGCTTTACATAGAAGTTATTTGACTCCTGGTGATTCTTCTTCAGGTTGGACAGCTGGTGCTGCAGCTTATTATGTGGGTTATCTTCAACCTAGGACTTTTCTATTAAAATATAATGAAAATGGAACCATTACAGATGCTGTAGACTGTGCACTTGACCCTCTCTCAGAAACAAAGTGTACGTTGAAATCCTTCACTGTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTTTATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTGATTATTCTGTCCTATATAATTCCGCATCATTTTCCACTTTTAAGTGTTATGGAGTGTCTCCTACTAAATTAAATGATCTCTGCTTTACTAATGTCTATGCAGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGATTATAATTATAAATTACCAGATGATTTTACAGGCTGCGTTATAGCTTGGAATTCTAACAATCTTGATTCTAAGGTTGGTGGTAATTATAATTACCTGTATAGATTGTTTAGGAAGTCTAATCTCAAACCTTTTGAGAGAGATATTTCAACTGAAATCTATCAGGCCGGTAGCACACCTTGTAATGGTGTTGAAGGTTTTAATTGTTACTTTCCTTTACAATCATATGGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTTCTTTTGAACTTCTACATGCACCAGCAACTGTTTGTGGACCTAAAAAGTCTACTAATTTGGTTAAAAACAAATGTGTCAATTTCAACTTCAATGGTTTAACAGGCACAGGTGTTCTTACTGAGTCTAACAAAAAGTTTCTGCCTTTCCAACAATTTGGCAGAGACATTGCTGACACTACTGATGCTGTCCGTGATCCACAGACACTTGAGATTCTTGACATTACACCATGTTCTTTTGGTGGTGTCAGTGTTATAACACCAGGAACAAATACTTCTAACCAGGTTGCTGTTCTTTATCAGGATGTTAACTGCACAGAAGTCCCTGTTGCTATTCATGCAGATCAACTTACTCCTACTTGGCGTGTTTATTCTACAGGTTCTAATGTTTTTCAAACACGTGCAGGCTGTTTAATAGGGGCTGAACATGTCAACAACTCATATGAGTGTGACATACCCATTGGTGCAGGTATATGCGCTAGTTATCAGACTCAGACTAATTCTCCTCGGCGGGCACGTAGTGTAGCTAGTCAATCCATCATTGCCTACACTATGTCACTTGGTGCAGAAAATTCAGTTGCTTACTCTAATAACTCTATTGCCATACCCACAAATTTTACTATTAGTGTTACCACAGAAATTCTACCAGTGTCTATGACCAAGACATCAGTAGATTGTACAATGTACATTTGTGGTGATTCAACTGAATGCAGCAATCTTTTGTTGCAATATGGCAGTTTTTGTACACAATTAAACCGTGCTTTAACTGGAATAGCTGTTGAACAAGACAAAAACACCCAAGAAGTTTTTGCACAAGTCAAACAAATTTACAAAACACCACCAATTAAAGATTTTGGTGGTTTTAATTTTTCACAAATATTACCAGATCCATCAAAACCAAGCAAGAGGTCATTTATTGAAGATCTACTTTTCAACAAAGTGACACTTGCAGATGCTGGCTTCATCAAACAATATGGTGATTGCCTTGGTGATATTGCTGCTAGAGACCTCATTTGTGCACAAAAGTTTAACGGCCTTACTGTTTTGCCACCTTTGCTCACAGATGAAATGATTGCTCAATACACTTCTGCACTGTTAGCGGGTACAATCACTTCTGGTTGGACCTTTGGTGCAGGTGCTGCATTACAAATACCATTTGCTATGCAAATGGCTTATAGGTTTAATGGTATTGGAGTTACACAGAATGTTCTCTATGAGAACCAAAAATTGATTGCCAACCAATTTAATAGTGCTATTGGCAAAATTCAAGACTCACTTTCTTCCACAGCAAGTGCACTTGGAAAACTTCAAGATGTGGTCAACCAAAATGCACAAGCTTTAAACACGCTTGTTAAACAACTTAGCTCCAATTTTGGTGCAATTTCAAGTGTTTTAAATGATATCCTTTCACGTCTTGACAAAGTTGAGGCTGAAGTGCAAATTGATAGGTTGATCACAGGCAGACTTCAAAGTTTGCAGACATATGTGACTCAACAATTAATTAGAGCTGCAGAAATCAGAGCTTCTGCTAATCTTGCTGCTACTAAAATGTCAGAGTGTGTACTTGGACAATCAAAAAGAGTTGATTTTTGTGGAAAGGGCTATCATCTTATGTCCTTCCCTCAGTCAGCACCTCATGGTGTAGTCTTCTTGCATGTGACTTATGTCCCTGCACAAGAAAAGAACTTCACAACTGCTCCTGCCATTTGTCATGATGGAAAAGCACACTTTCCTCGTGAAGGTGTCTTTGTTTCAAATGGCACACACTGGTTTGTAACACAAAGGAATTTTTATGAACCACAAATCATTACTACAGACAACACATTTGTGTCTGGTAACTGTGATGTTGTAATAGGAATTGTCAACAACACAGTTTATGATCCTTTGCAACCTGAATTAGACTCATTCAAGGAGGAGTTAGATAAATATTTTAAGAATCATACATCACCAGATGTTGATTTAGGTGACATCTCTGGCATTAATGCTTCAGTTGTAAACATTCAAAAAGAAATTGACCGCCTCAATGAGGTTGCCAAGAATTTAAATGAATCTCTCATCGATCTCCAAGAACTTGGAAAGTATGAGCAGTATATAAAATGGCCATGGTACATTTGGCTAGGTTTTATAGCTGGCTTGATTGCCATAGTAATGGTGACAATTATGCTTTGCTGTATGACCAGTTGCTGTAGTTGTCTCAAGGGCTGTTGTTCTTGTGGATCCTGCTGCAAATTTGATGAAGACGACTCTGAGCCAGTGCTCAAAGGAGTCAAATTACATTACACATAA

2.43


Command control

We will not go much into these details, but it is important to know that the shell doesn’t just consists in commands.

It also provides a ways to have an extended control over the effects of the commands, such as > (to send command results to a file), | (to establish communication between commands) or & (to run several commands simultaneously) and advanced ways to deal with command failure or to automatically control what command should be run depending on chosen conditions.

The shell is a programming language.


Just to show one possible usage of the | (“pipe”), we can check that the output of the grep -A 1 "cds-YP_009724390.1" nCov_features.fa command is the same as the content of the file we created by redirecting this output to a file, using the md5sum command. Given a content, md5sum should return a number almost guaranteed to be unique: a checksum of this content.

We can use this command in two ways.

  • Using the path of a file as argument:
username@host:~/Documents/HTS/exercises/work$ md5sum spike_CDS.fa
68ef4caa1902b4e999cf7db644f49dfa  spike_CDS.fa
  • Piping the “standard output” of another command to the “standard input” of the md5sum command:
username@host:~/Documents/HTS/exercises/work$ grep -A 1 "cds-YP_009724390.1" nCov_features.fa | md5sum
68ef4caa1902b4e999cf7db644f49dfa  -

2.44

Now, move back to the parent of the current working directory (the one containing the data and the work directories.

username@host:~/Documents/HTS/exercises/work$ cd ..

2.45

From this new working directory, look at the first two lines of the mi.csv file contained in the data folder.

username@host:~/Documents/HTS/exercises$ head -2 data/mi.csv
"","Age","OwnsHouse","PhysicalActivity","Sex","LivesWithPartner","LivesWithKids","BornInCity","Inbreeding","BMI","CMVPositiveSerology","FluIgG","MetabolicScore","LowAppetite","TroubleConcentrating","TroubleSleeping","HoursOfSleep","Listless","UsesCannabis","RecentPersonalCrisis","Smoking","Employed","Education","DustExposure","Income","HadMeasles","HadRubella","HadChickenPox","HadMumps","HadTonsillectomy","HadAppendicectomy","VaccineHepA","VaccineMMR","VaccineTyphoid","VaccineWhoopingCough","VaccineYellowFever","VaccineHepB","VaccineFlu","SUBJID","DepressionScore","HeartRate","Temperature","HourOfSampling","DayOfSampling"
"1",22.33,"Yes",3,"Female","No","No","Yes",94.9627,20.13,"No",0.464318721724723,0,0,0,1,9,3,"No","No","Never","No","PhD","No","(1000-2000]","No","No","Yes","No","No","No","No","No","No","Yes","No","Yes","No",2,0,66,36.8,8.883,"40"

What we see are textual data representing a table where columns are separated by commas. The first line contains column headers, the other lines contain the corresponding observations for various individuals.

2.46

How many individuals are there in the table?

username@host:~/Documents/HTS/exercises$ wc -l data/mi.csv
817 data/mi.csv

Remembering that the first line contains column headers, we can now calculate that the table contains data for 816 individuals.

This kind of table can be processed in various useful ways using R. We will open it using RStudio, a program with a GUI that makes R easier to use.

2.47

Now start RStudio using the rstudio command.

username@host:~/Documents/HTS/exercises$ rstudio

It is now time to learn a little about R and RStudio: https://hub-courses.pages.pasteur.fr/R_pasteur_phd/First_steps_RStudio.html#2_Smooth_introduction_to_R