replace pattern

tr

tr 'a-z' 'A-Z' <allmysequences
tr -d '\r'
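
A minimal sketch of the two tr calls above, fed from printf instead of a real file. The sample sequences and the variable names upper and clean are made up for illustration.

```shell
# Upper-case a (made-up) lowercase sequence, as tr 'a-z' 'A-Z' does.
upper=$(printf 'acgtacgt\n' | tr 'a-z' 'A-Z')
echo "$upper"

# Strip the carriage return (\r) that Windows line endings leave behind.
clean=$(printf 'acgt\r\n' | tr -d '\r')
echo "$clean"
```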

sed

sed reads data (from a file or from a pipe) and writes the filtered data to stdout.

The sed command is very versatile, so we cover only a few of its features.

The two commands to remember are:

remove lines

replace pattern
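
The two commands named above correspond to the sed commands d (delete) and s (substitute). A minimal sketch on made-up input; the variable names are only for illustration.

```shell
# Made-up three-line input used for both commands.
input=$(printf '>seq1\nACGT\n# comment\n')

# d: delete (remove) every line that matches the address /^#/.
deleted=$(printf '%s\n' "$input" | sed '/^#/d')
echo "$deleted"

# s: substitute (replace) the first match of the pattern on each line.
replaced=$(printf '%s\n' "$input" | sed 's/seq/SEQ/')
echo "$replaced"
```

Adding the g flag (s/seq/SEQ/g) would replace every match on a line instead of only the first one.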

special characters

  • Repetition.
    • ? matches the preceding element at most once, i.e. A? represents zero or one A => "" or "A"
    • + matches the preceding element one or more times, i.e. A+ represents one or more successive A => "A" or "AA" or "AAA" etc.
    • * matches the preceding element zero or more times, i.e. A* represents zero or more successive A => "" or "A" or "AA" or "AAA" etc.
    • {n} matches the preceding element exactly n times, i.e. A{5} represents exactly 5 successive A => "AAAAA"
    • {m,n} matches the preceding element from m to n times, i.e. A{2,5} represents from 2 to 5 successive A => "AA" or "AAA" or "AAAA" or "AAAAA"
  • Logical OR.
    • (one|two) matches the string "one" or the string "two"
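
The operators above can be tried as follows. They are written here in extended regular expression syntax, so this sketch assumes a sed that accepts the -E option (GNU sed and BSD sed do); without -E most of them need a backslash or are not available. The input lines are made up.

```shell
# A{2,5}: between 2 and 5 successive A. -n with p prints only matching lines.
reps=$(printf 'A\nAA\nAAAAAA\n' | sed -E -n '/^A{2,5}$/p')
echo "$reps"

# A?: zero or one A, so "CT" and "CAT" match but "CAAT" does not.
opt=$(printf 'CT\nCAT\nCAAT\n' | sed -E -n '/^CA?T$/p')
echo "$opt"

# (one|two): either alternative is replaced by X.
alt=$(printf 'one two three\n' | sed -E 's/(one|two)/X/g')
echo "$alt"
```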

special characters

  • Recording
    • \(pattern\) : records the text that matched the pattern
  • Negation
    • ! : negates the address it follows, so the command runs on the lines that do not match
  • More information on regular expressions in man sed
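
A small sketch of negation, assuming a made-up two-sequence fasta input: the ! is placed between the address and the command.

```shell
# Delete every line that does NOT start with '>': only the fasta headers remain.
headers=$(printf '>seq1\nACGT\n>seq2\nTTGA\n' | sed '/^>/!d')
echo "$headers"
```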

Recording

sed can memorize the text matched by a pattern and reuse it in the replacement. Memorization is done with the special syntax '\(pattern\)'; the matched text is then available in the replacement as '\N', where N is the number of the recorded group.

Example: add the prefix 'ID_' to the id of each sequence in a fasta file

  • sed 's/^>\(.*\)/>ID_\1/' file
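
To see recording at work without a file, the command above can be run on made-up input. The second call is an extra illustration with two recorded groups, \1 and \2, swapped in the replacement; the header content is hypothetical.

```shell
# One recorded group: everything after '>' is put back after the ID_ prefix.
prefixed=$(printf '>seq1\nACGT\n' | sed 's/^>\(.*\)/>ID_\1/')
echo "$prefixed"

# Two recorded groups: \1 is the first word of the header, \2 the rest.
swapped=$(printf '>seq1 human\n' | sed 's/^>\([^ ]*\) \(.*\)$/>\2 \1/')
echo "$swapped"
```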

Exercises

Work on the blast result: sed_play/BlastResultIl2.align

Work on the file sed_play/Q8I1R6_PLAF7_alu.blastp.txt

  • Replace "A" by "a" in all the names of the sequences from the bank (not in the name of the query)

Work on the file sed_play/seq.fasta: add the species ('S Aureus', 'Y Pestis') as a comment on the id lines of the relevant sequences

Exercises

  1. Work on sed_play/BlastResultIl2.align. Change the bank name sp to uniprot in the file.
  2. Keep only the second field (as cut would do), then extract the ID from it.
  3. Clean the blast report to keep only the bank and the entry name, like this:
     input: IL2_HUMAN       sp|Q9XS38|IL2_PAPHA     92.86   154     10      1       1       153     1       154     5e-74    224
     output: sp:Q9XS38

Exercises

We want to create a file containing the 10 sequences most similar to il2_human and align them (the first step in modeling a sequence by homology).

  1. get the il2_human sequence in fasta format
  2. perform a blastp (blastall -p blastp -d uniprot_sprot -i the-input -m8)
  3. filter the output according to the % of identity
  4. keep the 10 best hits
  5. reformat each line to keep only the bank and entry name, in the format bk:entry_name

Exercises (continued)

  1. use golden to get the sequences
  2. convert each sequence to fasta format and concatenate them into one file
  3. run clustalw (clustalw -align -infile=filename)

Addresses

A sed command can be given with no address, in which case it is executed for every input line. Otherwise, the command is executed only for the input lines that match the address.

Address examples
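
A few common address forms, sketched on a made-up four-line input. The -n option suppresses the default output, so only the lines selected by the p command are printed.

```shell
input=$(printf 'one\ntwo\nthree\nfour\n')

# No address: the command runs on every line.
all=$(printf '%s\n' "$input" | sed 's/o/0/')

# Line number: only line 2.
second=$(printf '%s\n' "$input" | sed -n '2p')

# Range of line numbers: lines 2 to 3.
range=$(printf '%s\n' "$input" | sed -n '2,3p')

# Regular expression: lines starting with t.
regex=$(printf '%s\n' "$input" | sed -n '/^t/p')

# $ stands for the last line.
last=$(printf '%s\n' "$input" | sed -n '$p')

printf '%s\n' "$all" "$second" "$range" "$regex" "$last"
```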

Exercises

Convert the fastq file sample.fastq to fasta

step by step

  1. print only the header lines (every 4th line, starting at the first line)
  2. replace @ by > at the beginning of every header
  3. print every 4th line, starting at the second line

Extract paragraph

sed can extract a paragraph contained between two lines that each match a regular expression

sed -n '/RE_START/,/RE_END/p'

Remember that sed works line by line: it looks for a line matching RE_START, then prints every line from there up to and including the next line matching RE_END. If RE_END is never found, printing continues to the end of the input.
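
A minimal sketch of the range form on made-up input, where the literal lines START and END stand in for RE_START and RE_END:

```shell
input=$(printf 'before\nSTART\nbody\nEND\nafter\n')

# Print only the lines from the one matching ^START$ to the one matching ^END$.
para=$(printf '%s\n' "$input" | sed -n '/^START$/,/^END$/p')
echo "$para"
```

The two boundary lines are part of the output; lines before START and after END are dropped.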

Exercises

Using bank.embl, extract the entry for LISIN11262