tr 'a-z' 'A-Z' <allmysequences
tr -d '\r'
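As a small illustration of the two tr commands above, the sketch below removes Windows carriage returns and upper-cases a sequence. The input string is made up for the example.

```shell
# Hypothetical input: a lower-case sequence line with a Windows line ending (\r\n).
# tr -d '\r' deletes the carriage return; tr 'a-z' 'A-Z' upper-cases the letters.
upper=$(printf 'acgtn\r\n' | tr -d '\r' | tr 'a-z' 'A-Z')
printf '%s\n' "$upper"
```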
sed reads data (from a file or from a pipe) and writes the filtered data to stdout.
The sed command is very versatile, so we cover only a few of its features.
The two commands to remember are:
sed 's/chromosome/chr/' arrayAnnot.txt
replaces only the 1st occurrence of chromosome by 'chr' in each line of arrayAnnot.txt
sed 's#chromosome#chr#' arrayAnnot.txt
does the same, using # as the delimiter instead of /.
sed 's/chromosome/chr/g' arrayAnnot.txt
replaces all occurrences of chromosome by 'chr' in arrayAnnot.txt
The p (print) command is usually used in combination with the -n option.
The "-n" option suppresses automatic printing: nothing is printed unless a command explicitly requests it.
sed -n -e '/pattern/p' prints only the lines containing pattern.
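The commands above can be tried on a small made-up input; the two lines below only stand in for arrayAnnot.txt, whose real contents are not shown here.

```shell
# Toy input standing in for arrayAnnot.txt (the contents are made up).
input='chromosome1 chromosome2
gene7 chromosome3'

# Without g: only the first match on each line is replaced.
first=$(printf '%s\n' "$input" | sed 's/chromosome/chr/')
# The same substitution with # as the delimiter.
hash=$(printf '%s\n' "$input" | sed 's#chromosome#chr#')
# With g: every match on each line is replaced.
all=$(printf '%s\n' "$input" | sed 's/chromosome/chr/g')
# -n suppresses automatic printing; p prints only the lines matching the pattern.
only=$(printf '%s\n' "$input" | sed -n '/gene/p')

printf '%s\n' "$first" "$all" "$only"
```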
Useful metacharacters:
- . : any character
- [] : bracket expression, defines a set of characters
  - [abc] : matches the character "a" or "b" or "c"
  - [a-z] : matches one character in the range from "a" to "z", i.e. "a" or "b" ... or "z"
- \w : matches a word character (alphanumeric or _)
- \s : matches a blank character (space, tab)
- Anchoring
  - ^ : represents the start of the line
  - $ : represents the end of the line
- Repetition (as written here with the -r option; without it, GNU sed expects \?, \+ and \{n\})
  - ? : matches the preceding element at most once, e.g. A? represents zero or one A => "" or "A"
  - + : matches the preceding element one or more times, e.g. A+ represents one to any number of successive A => "A" or "AA" or "AAA" etc.
  - * : matches the preceding element zero or more times, e.g. A* represents zero to any number of successive A => "" or "A" or "AA" or "AAA" etc.
  - {n} : matches the preceding element exactly n times, e.g. A{5} represents exactly 5 successive A => "AAAAA"
  - {m,n} : matches the preceding element m to n times, e.g. A{2,5} represents from 2 to 5 successive A => "AA" or "AAA" or "AAAA" or "AAAAA"
- Logical OR (with the -r option)
  - (one|two) : matches the string "one" or the string "two"
- Recording
  - \(pattern\) : records the text that matched the pattern (written (pattern) with the -r option)
- Negation
  - ! : placed after an address, applies the command to the lines that do not match it, e.g. sed -n '/pattern/!p'
More information on regular expressions can be found in man sed.
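A few of these metacharacters in action, as a minimal sketch on made-up lines. It assumes a sed that accepts -E for extended regular expressions; GNU sed accepts both -E and -r.

```shell
# Toy lines (made up for the example).
lines='AAAAA
AA
chr12
one fish
three fish'

# ^ and $ anchor the match; A{5} requires exactly five A on the line.
five=$(printf '%s\n' "$lines" | sed -n -E '/^A{5}$/p')
# [0-9]+ matches one or more digits.
digits=$(printf '%s\n' "$lines" | sed -n -E '/[0-9]+/p')
# (one|two) matches either alternative.
alt=$(printf '%s\n' "$lines" | sed -n -E '/^(one|two) /p')
# ! after the address prints the lines that do NOT match it.
notfish=$(printf '%s\n' "$lines" | sed -n '/fish/!p')

printf '%s\n' "$five" "$digits" "$alt" "$notfish"
```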
sed -r 's/.*\t(\w*\|\w*\|\w*).*/\1/g' blast2.txt
sed can memorize the text that matches a pattern and reuse it in the replacement. Memorization is done with the special syntax '\(pattern\)' (written '(pattern)' with the -r option); the Nth memorized match is inserted in the replacement with '\N'.
Example: add 'ID_' to the id of each sequence in a fasta file:
sed 's/^>\(.*\)/>ID_\1/' file
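To see what the memorized match does, the command above can be run on a tiny fasta input. The two records are made up; only the header lines (starting with >) should change.

```shell
# Toy fasta input (made up for the example).
fasta='>seq1 first protein
MKV
>seq2
MAL'

# \(.*\) memorizes everything after ">"; \1 writes it back after the prefix.
tagged=$(printf '%s\n' "$fasta" | sed 's/^>\(.*\)/>ID_\1/')
printf '%s\n' "$tagged"
```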
Work on the blast result: sed_play/BlastResultIl2.align
Work on the file sed_play/Q8I1R6_PLAF7_alu.blastp.txt:
- Replace "A" by "a" in all the names of the sequences from the bank (not in the name of the query)
Work on the file sed_play/seq.fasta: add the species ('S Aureus', 'Y Pestis') as a comment on the id lines of the relevant sequences.
IL2_HUMAN sp|Q9XS38|IL2_PAPHA 92.86 154 10 1 1 153 1 154 5e-74 224
output:
sp:Q9XS38
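One possible way to get from the input line to this output, as a sketch only: it assumes the columns are tab-separated (as in tabular blast output) and combines cut with a sed substitution. Other solutions exist.

```shell
# The example line, rebuilt with tabs between the columns (assumption: tabular output).
line=$(printf 'IL2_HUMAN\tsp|Q9XS38|IL2_PAPHA\t92.86\t154\t10\t1\t1\t153\t1\t154\t5e-74\t224')

# cut keeps the 2nd column; sed memorizes the first two |-separated fields
# and writes them back joined by ":" (| is a literal character here, no -r).
acc=$(printf '%s\n' "$line" | cut -f2 | sed 's/^\([^|]*\)|\([^|]*\)|.*/\1:\2/')
printf '%s\n' "$acc"
```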
We want to create a file containing the 10 sequences most similar to il2_human and align them (the first step in modelling a sequence by homology).
blastall -p blastp -d uniprot_sprot -i the-input -m8
clustalw -align -infile=filename
Sed commands can be given with no address, in which case the command is executed for all input lines; otherwise, the command is executed only for the input lines that match the address.
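The difference can be seen on three made-up lines: the same substitution is run with no address, with a line-number address and with a regular-expression address.

```shell
# Toy input (made up for the example).
text='alpha
beta
gamma'

# No address: the substitution runs on every line.
every=$(printf '%s\n' "$text" | sed 's/a/A/')
# Line-number address: only line 2 is changed.
second=$(printf '%s\n' "$text" | sed '2s/a/A/')
# Regular-expression address: only lines matching /^g/ are changed.
matching=$(printf '%s\n' "$text" | sed '/^g/s/a/A/')

printf '%s\n' "$every" "$second" "$matching"
```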
Transform the fastq file sample.fastq into fasta
step by step
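One possible sketch of the conversion, assuming every record is exactly four lines (header, sequence, "+" line, qualities) with no wrapped sequences. The two reads are made up and stand in for sample.fastq.

```shell
# Toy fastq input (made up for the example).
fastq='@read1
ACGT
+
IIII
@read2
TTGA
+
!!!!'

# N;N;N     : append the next three lines, so one whole record is in the pattern space
# s/^@/>/   : turn the fastq header into a fasta header
# s/\n+.*// : drop everything from the "+" line onwards (separator and qualities)
fasta=$(printf '%s\n' "$fastq" | sed 'N;N;N;s/^@/>/;s/\n+.*//')
printf '%s\n' "$fasta"
```

Working on the whole record at once means a quality line that happens to start with @ is never mistaken for a header.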
sed can extract a paragraph contained between 2 matching regular expressions:
sed -n '/RE_START/,/RE_END/p'
Remember that sed works line by line: it looks for a line matching RE_START, then prints every line from there up to and including the next line matching RE_END. If RE_END is never found, printing continues to the end of the input.
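A minimal sketch of such a range on an EMBL-style toy bank: each entry starts with an ID line and ends with a line containing only //. The entry names are made up.

```shell
# Toy bank in EMBL-like layout (entry names are hypothetical).
bank='ID   ENTRY1
DE   first entry
//
ID   ENTRY2
DE   second entry
//
ID   ENTRY3
//'

# Print from the ID line of ENTRY2 up to and including the next "//" line.
entry=$(printf '%s\n' "$bank" | sed -n '/^ID   ENTRY2/,/^\/\//p')
printf '%s\n' "$entry"
```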
Using bank.embl, extract the entry for LISIN11262.