replace pattern

tr

translate
replaces or removes characters (NOT strings) from its input dataset (standard input)

tr 'a-z' 'A-Z' <allmysequences

-d deletes the listed characters instead of translating them

tr -d '\r'
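
The two tr usages above can be tried on a small file. This is a minimal sketch: the file name sample_dos.txt and its content are made up for the illustration.

```shell
# Hypothetical sample file: lowercase sequences with Windows (CRLF) line endings.
printf 'acgt\r\nttga\r\n' > sample_dos.txt

# tr only reads standard input, hence the "<" redirection.
# The first tr deletes the carriage returns, the second upper-cases the letters.
upper=$(tr -d '\r' < sample_dos.txt | tr 'a-z' 'A-Z')
echo "$upper"

# tr maps characters one to one, it does not replace the string "ac" by "TG":
# every a becomes T and every c becomes G, wherever they occur.
mapped=$(printf 'acca\n' | tr 'ac' 'TG')
echo "$mapped"
```

The second command shows why tr is said to work on characters and not on strings: each character of the first set is mapped to the character at the same position in the second set.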

sed

stream editor for filtering and transforming text

sed reads data (from a file or from a pipe) and writes the filtered data to stdout.

the sed command is very versatile, so we just cover a few of its features.

The three commands to remember are:

remove lines
replace patterns
print lines

remove lines

sed '/pattern/d'
remove lines where pattern is present (like grep -v)
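
A minimal sketch of the d command, assuming a made-up file sample_d.txt in which comment lines contain a "#":

```shell
# Hypothetical sample: sequences mixed with comment lines.
printf '# comment\n>seq1\nACGT\n# another comment\n>seq2\nTTGA\n' > sample_d.txt

# Delete every line that contains a "#"; all other lines are printed unchanged.
kept=$(sed '/#/d' sample_d.txt)
echo "$kept"

# grep -v selects the same lines.
same=$(grep -v '#' sample_d.txt)
echo "$same"
```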

replace patterns

sed 'sSEPpatternSEPreplaceSEP' file
SEP is the separator, a single character of your choice (most often /)
sed 's/chromosome/chr/' arrayAnnot.txt replaces only the 1st occurrence of chromosome by 'chr' in each line of arrayAnnot.txt
sed 's#chromosome#chr#' arrayAnnot.txt does the same.
sed 's/chromosome/chr/g' arrayAnnot.txt replaces all occurrences of chromosome by 'chr' in arrayAnnot.txt
add the -r option to use extended regular expressions in the script (on some systems, e.g. BSD or macOS, the option is -E; GNU sed accepts both)
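
The difference between the plain s command and the g flag, and the use of another separator, can be checked on a single line. This is a minimal sketch; the input lines are made up.

```shell
line='chromosome1 chromosome2'

# Without g: only the first occurrence on each line is replaced.
first=$(printf '%s\n' "$line" | sed 's/chromosome/chr/')
echo "$first"

# With g: every occurrence on each line is replaced.
all=$(printf '%s\n' "$line" | sed 's/chromosome/chr/g')
echo "$all"

# With # as separator, slashes in the pattern need no escaping.
path=$(printf '/data/old/file\n' | sed 's#/old/#/new/#')
echo "$path"
```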

print lines

sed '/pattern/p'
print (duplicate) lines where pattern is present

usually used in combination with -n option

The "-n" option will not print anything unless an explicit request to print is found

sed -n -e '/pattern/p' will print only the lines containing pattern (like grep)
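
A minimal sketch of the p command with and without -n, on a made-up three-line file:

```shell
# Hypothetical sample file.
printf 'alpha\nbeta\ngamma\n' > sample_p.txt

# Without -n every line is printed once, and matching lines a second time.
dup=$(sed '/beta/p' sample_p.txt)
echo "$dup"

# With -n only the explicit p prints, so only matching lines come out.
only=$(sed -n '/beta/p' sample_p.txt)
echo "$only"
```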

special characters

useful metacharacters:
- . any character
- [] bracket expressions: defines a set of characters
  
  [abc] : matches character "a" or "b" or "c"
  
  [a-z] : matches one character in the range from "a" to "z", ie "a" or "b" ... or "z"
  
  \w : matches one word character (alphanumeric or _)
  
  \s : matches one blank character (space, tabulation)
- Anchoring.
  
  ^ represents the start of the line.
  
  $ represents the end of the line.
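
A minimal sketch of ".", bracket expressions and anchors, on made-up lines. Setting LC_ALL=C is an assumption added here so that ranges such as [a-z] mean plain ASCII letters whatever the locale of the machine.

```shell
# Hypothetical sample lines (the third one starts with a space).
printf 'acgt\nACGT\n acgt\nacgt1\n' > sample_re.txt

# ^[a-z] : lines whose first character is a lowercase letter.
starts_lower=$(LC_ALL=C sed -n '/^[a-z]/p' sample_re.txt)
echo "$starts_lower"

# [0-9]$ : lines whose last character is a digit.
ends_digit=$(LC_ALL=C sed -n '/[0-9]$/p' sample_re.txt)
echo "$ends_digit"

# ^.C : any first character, followed by an uppercase C.
second_c=$(LC_ALL=C sed -n '/^.C/p' sample_re.txt)
echo "$second_c"
```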

special characters

Repetition.

? matches the preceding element at most once, ie A? represents zero or one A => "" or "A"

+ matches the preceding element one or more times, ie A+ represents one to any number of successive A => "A" or "AA" or "AAA" etc ...

* matches the preceding element zero or more times, ie A* represents zero to any number of successive A => "" or "A" or "AA" or "AAA" etc ...

{n} matches the preceding element exactly n times, ie A{5} represents exactly 5 successive A => "AAAAA"

{m,n} matches the preceding element m to n times, ie A{2,5} represents 2 to 5 successive A => "AA" or "AAA" or "AAAA" or "AAAAA"

Logical OR.

(one|two) match the strings "one" or "two"
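
The repetition and OR operators can be tried with extended regular expressions. This is a minimal sketch on made-up input; it uses -E, which both GNU and BSD sed accept (-r is the equivalent GNU spelling).

```shell
# A{2,5}: 2 to 5 successive A; the anchors force the whole line to match.
two_to_five=$(printf 'A\nAA\nAAAAAA\n' | sed -n -E '/^A{2,5}$/p')
echo "$two_to_five"

# A?: zero or one A between C and G.
optional=$(printf 'CG\nCAG\nCAAG\n' | sed -n -E '/^CA?G$/p')
echo "$optional"

# A+: one or more A between C and G.
one_or_more=$(printf 'CG\nCAG\nCAAG\n' | sed -n -E '/^CA+G$/p')
echo "$one_or_more"

# A*: zero or more A between C and G.
any_number=$(printf 'CG\nCAG\nCAAG\n' | sed -n -E '/^CA*G$/p')
echo "$any_number"

# (one|two): either of the two strings.
either=$(printf 'one\ntwo\nthree\n' | sed -n -E '/^(one|two)$/p')
echo "$either"
```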

special characters

Recording

\(pattern\) : will record the text matched by pattern (written (pattern) with the -r option)

Negation

! : placed between an address and a command, applies the command to the lines that do NOT match the address
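
A minimal sketch of negation on a made-up fasta file. In sed the "!" is written after the address and before the command.

```shell
# Hypothetical sample file.
printf '>seq1\nACGT\n>seq2\nTTGA\n' > sample_neg.fasta

# Delete the lines that do NOT start with ">": only the headers remain.
headers=$(sed '/^>/!d' sample_neg.fasta)
echo "$headers"

# Print the lines that do NOT start with ">": only the sequences remain.
sequences=$(sed -n '/^>/!p' sample_neg.fasta)
echo "$sequences"
```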

more information on regular expressions in man sed

expression

keep only the field bank|accession number|entry_name (sp|P80874|GS69_BACSU)
sed -r 's/.*\t(\w*\|\w*\|\w*).*/\1/g' blast2.txt

Recording

sed can memorize the text matched by a part of the pattern, and this text can be reused in the replacement. Memorization is done using the special syntax '\(pattern\)' (or '(pattern)' with the -r option). The text matched by the Nth group is inserted in the replacement with '\N'.

example, add 'ID_' in front of the id of each sequence in a fasta file
sed 's/^>\(.*\)/>ID_\1/' file
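
A minimal sketch of recording, assuming a made-up file sample_ids.fasta. It shows the same substitution with basic syntax \( \) and with extended syntax ( ), then two groups reused in a different order.

```shell
# Hypothetical sample file.
printf '>seq1 sample\nACGT\n' > sample_ids.fasta

# Basic regular expressions: the group is written \( \), reused as \1.
bre=$(sed 's/^>\(.*\)/>ID_\1/' sample_ids.fasta)
echo "$bre"

# Extended regular expressions (-E, or -r with GNU sed): the group is written ( ).
ere=$(sed -E 's/^>(.*)/>ID_\1/' sample_ids.fasta)
echo "$ere"

# Two groups, written back in reverse order with \2 and \1.
swapped=$(printf 'IL2_HUMAN\n' | sed -E 's/^([A-Z0-9]+)_([A-Z]+)$/\2 \1/')
echo "$swapped"
```

Lines that do not match the pattern, such as the sequence line here, are printed unchanged.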

Exercises

Work on the blast result: sed_play/BlastResultIl2.align

replace all '|' characters with '_' character
replace the first "|" by a space and the second by "_"

Work on the file sed_play/Q8I1R6_PLAF7_alu.blastp.txt
Replace "A" by "a" in all the names of the sequences from the bank (not in the name of the query)

Work on the file sed_play/seq.fasta
Add the species ('S Aureus', 'Y Pestis') as a comment on the id lines of the relevant sequences

Exercises

Work on sed_play/BlastResultIl2.align. Change the name of the bank sp into uniprot, in file.
keep only the second field (like cut), extract the ID from this file
Clean the blast report to keep only the bank and the entry name like this:
input: IL2_HUMAN sp|Q9XS38|IL2_PAPHA 92.86 154 10 1 1 153 1 154 5e-74 224
output: sp:Q9XS38

Exercises

We want to create a file containing the 10 sequences most similar to il2_human and align them (first step to model a sequence by homology).

get the il2_human sequence in fasta format
perform a blastp (blastall -p blastp -d uniprot_sprot -i the-input -m8)
filter the output according to the % of identity
keep the 10 best hits
reformat the line to keep only the bank and entry name in format (bk:entry_name)

Exercises (continued)

use golden to get the sequences
transform each sequence in fasta format and concatenate them in one file
run clustalw (clustalw -align -infile=filename)

Addresses

Sed commands can be given with no address, in which case the command is executed for all input lines. With an address, the command is executed only for the input lines that match that address.

number Match only the specified line number.
first~step Match every step'th line starting with line first (GNU sed extension)
$ Match the last line.
/regexp/ Match lines matching the regular expression regexp.

Address examples

sed '2p' will duplicate the second line
sed -n '1~2p' will print all the odd-numbered lines
sed -n '2~5p' will match every fifth line, starting with the second
sed -n '$p' will print only the last line
sed -n '4,$p' will print from the 4th to the last line (included)
sed '3,5s/t/T/g' will replace every t by T from the 3rd to the 5th line
sed '/motif/s/line/sentence/' => ??
sed '1!s/l/L/g' => ??
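
Some of the address forms can be checked on a small numbered file. This is a minimal sketch: the file sample_addr.txt is made up, and the first~step form assumes GNU sed.

```shell
# Hypothetical sample file with six lines: l1 ... l6.
printf 'l1\nl2\nl3\nl4\nl5\nl6\n' > sample_addr.txt

# A line number: print only line 2.
second=$(sed -n '2p' sample_addr.txt)
echo "$second"

# first~step (GNU sed only): every 2nd line starting at line 1.
odd=$(sed -n '1~2p' sample_addr.txt)
echo "$odd"

# $ is the last line.
last=$(sed -n '$p' sample_addr.txt)
echo "$last"

# A range: from line 4 to the last line, both included.
tail_part=$(sed -n '4,$p' sample_addr.txt)
echo "$tail_part"

# A range in front of s: the substitution is applied on lines 3 to 5 only.
ranged=$(sed '3,5s/l/L/' sample_addr.txt)
echo "$ranged"
```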

Exercises

Transform the fastq file sample.fastq into fasta

step by step

print only the header lines (every 4th line, starting at the first line)
replace @ by > at the beginning of every header
print every 4th line, starting at the second line

http://www.grymoire.com/Unix/Sed.html

sed in bioinfo

Extract paragraph

sed can extract a paragraph contained between 2 lines matching regular expressions

sed -n '/RE_START/,/RE_END/p'

remember sed works line by line: it looks for a line matching RE_START, then prints every line up to and including the next line matching RE_END. if no line matches RE_END, everything up to the end of the input is printed.
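
A minimal sketch of paragraph extraction. The file sample_range.txt and the BEGIN and END markers are made up for the illustration. sed prints each line of the range as it reads it, so when no line matches the end expression the printing continues to the end of the input.

```shell
# Hypothetical sample file with one block delimited by BEGIN and END.
printf 'header\nBEGIN\nline a\nline b\nEND\nfooter\n' > sample_range.txt

# Print from the line matching ^BEGIN$ to the next line matching ^END$, both included.
block=$(sed -n '/^BEGIN$/,/^END$/p' sample_range.txt)
echo "$block"

# No END line in this input: everything from BEGIN to the last line is printed.
unclosed=$(printf 'x\nBEGIN\ny\nz\n' | sed -n '/^BEGIN$/,/^END$/p')
echo "$unclosed"
```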

Exercises

Using the file bank.embl, extract the entry for LISIN11262