Searching and Filtering

Searching In Files

grep

print lines matching a pattern

FASTA file format

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
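As a minimal sketch of how grep applies to this format, the commands below list and count the header lines of a FASTA file. The file demo.fasta and its two records are made up for illustration.

```shell
# Create a small hypothetical FASTA file with two made-up records.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fasta" <<'EOF'
>seq1 hypothetical protein A
LCLYTHIGRNIYYGSYLYSETWNTGIMLLL
>seq2 hypothetical protein B
EWIWGGFSVDKATLNRFFAFHFILPFTMVA
EOF

# Print only the header lines (those starting with ">").
# The pattern is quoted so the shell does not read ">" as a redirection.
headers=$(grep '^>' "$tmpdir/demo.fasta")
echo "$headers"

# -c counts the matching lines instead of printing them.
n_records=$(grep -c '^>' "$tmpdir/demo.fasta")
echo "$n_records"

rm -rf "$tmpdir"
```

Quoting the pattern matters here: an unquoted > would make the shell redirect output into a file instead of passing the character to grep.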

FASTQ file format

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
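A FASTQ record as shown above spans four lines: header, sequence, separator, and quality string. The sketch below, on a made-up file demo.fastq, shows why counting lines that start with "@" can overcount reads (the quality string may itself begin with "@"), and one possible alternative that assumes exactly four lines per record.

```shell
# Create a hypothetical FASTQ file with two made-up reads.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fastq" <<'EOF'
@read1
GATTTGGGGTTCAAAGCAGT
+
!''*((((***+))%%%++)
@read2
ATCGATCAAATAGTAAATCC
+
@@CCF>>>>>>CCCCCCC65
EOF

# Counting lines that start with "@" overcounts here, because the
# quality line of read2 also begins with "@".
n_at=$(grep -c '^@' "$tmpdir/demo.fastq")

# Dividing the line count by 4 gives the number of records
# (assumption: every record is written on exactly four lines).
n_lines=$(wc -l < "$tmpdir/demo.fastq")
n_reads=$((n_lines / 4))

echo "lines starting with @: $n_at, records: $n_reads"
rm -rf "$tmpdir"
```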

Exercises

grep with Regular Expressions
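The following sketch illustrates a few regular-expression features with grep on a made-up FASTA file. The "sp|" and "tr|" header prefixes are used only as example patterns and are an assumption, not something taken from the course data.

```shell
# Hypothetical FASTA file with three made-up records.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fasta" <<'EOF'
>sp|P00001|PROT_A hypothetical entry
MKTAYIAKQR
>tr|Q00002|PROT_B hypothetical entry
GATTACAGAT
>gi|5524211|gb|AAD44166.1| cytochrome b
LCLYTHIGRN
EOF

# -E enables extended regular expressions: "^" anchors at line start,
# "(sp|tr)" matches either alternative, "\|" is a literal pipe character.
uniprot_like=$(grep -E '^>(sp|tr)\|' "$tmpdir/demo.fasta")
echo "$uniprot_like"

# -v inverts the match. The first grep drops header lines; the second
# drops every line containing a character other than A, C, G or T,
# so only lines made purely of those four letters remain.
dna_only=$(grep -v '^>' "$tmpdir/demo.fasta" | grep -v '[^ACGT]')
echo "$dna_only"

rm -rf "$tmpdir"
```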

Exercises

Filtering/sorting results

cut

who | cut -d ' ' -f 1
find . -atime 0 | cut -d '/' -f 2,3
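To see what -d and -f do in the two commands above without depending on the local system, here is a small sketch on made-up input lines. Real who and find output will differ.

```shell
# Made-up line in the style of `who` output.
line='alice pts/0 2024-01-01 10:00'

# -d sets the field delimiter, -f selects which fields to keep.
user=$(echo "$line" | cut -d ' ' -f 1)
echo "$user"

# For a path such as ./data/run1/seq.fasta split on "/", field 1 is ".",
# so fields 2 and 3 are the first two directory levels below ".".
parts=$(echo './data/run1/seq.fasta' | cut -d '/' -f 2,3)
echo "$parts"
```

When several fields are selected, cut joins them in the output with the same delimiter it used for splitting.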

sort

who | cut -d ' ' -f 1 | sort | uniq -c | sort -k 1n,2
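The key options of sort (-k for the field, n for numeric comparison, r for reverse order, -t for the delimiter) can be sketched on a made-up tab-separated file. The file scores.tsv and its values are hypothetical.

```shell
# Hypothetical two-column, tab-separated file: name, integer score.
tmpdir=$(mktemp -d)
printf 'geneA\t12\ngeneB\t100\ngeneC\t9\n' > "$tmpdir/scores.tsv"
tab=$(printf '\t')

# Without "n", field 2 is compared as text, so "100" sorts first.
first_text=$(sort -t "$tab" -k 2,2 "$tmpdir/scores.tsv" | head -n 1 | cut -f 1)

# With "n", field 2 is compared as a number, so 9 sorts first.
first_num=$(sort -t "$tab" -k 2,2n "$tmpdir/scores.tsv" | head -n 1 | cut -f 1)

# Adding "r" reverses the order, so the largest value comes first.
first_desc=$(sort -t "$tab" -k 2,2nr "$tmpdir/scores.tsv" | head -n 1 | cut -f 1)

echo "$first_text $first_num $first_desc"
rm -rf "$tmpdir"
```

Writing the key as 2,2 limits the comparison to field 2 only. A key such as -k 2 would extend to the end of the line.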

Exercise

We ran BLAST with -m8 (tabular) output, so the following fields are displayed, separated by tabs:

query id, subject (hit) id, percent identity, alignment length, number of mismatches,
number of gap openings, query start, query end,
subject start, subject end,
Expect value, HSP bit score

Exercise (continued)

  1. copy the file blast2_m8.txt from the course project directory to your home directory.
  2. sort the output by percent identity (highest identity at the top).
  3. display only the columns hit id, percent identity, Expect value, and HSP bit score.
  4. store the results in a new file.

uniq

report or omit repeated lines (Filter adjacent matching lines)
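Because uniq only compares neighbouring lines, it is usually combined with sort. The sketch below shows the difference on a few made-up lines.

```shell
# Unsorted input: the two "a" lines separated by "b" are not adjacent,
# so uniq keeps both groups.
unsorted=$(printf '%s\n' a b a a | uniq | wc -l | tr -d ' ')

# After sorting, all identical lines are adjacent and collapse to one.
sorted=$(printf '%s\n' a b a a | sort | uniq | wc -l | tr -d ' ')

# -d prints only the lines that occur more than once.
dups=$(printf '%s\n' a b a a | sort | uniq -d)

echo "unsorted: $unsorted lines, sorted: $sorted lines, repeated: $dups"
```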

Exercise

xargs

build and execute command lines from standard input

xargs executes the command with any initial arguments, followed by items read from standard input.

Do not forget to add -L 1 (written -l in older GNU xargs, now deprecated) to process standard input line by line

find . -name '*.fasta' | xargs grep ">"
find . -name '*.fasta' | xargs -I fic cat fic >>allmysequences
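The two commands above can be tried safely in a scratch directory. This sketch uses made-up FASTA files, and it assumes file names without spaces or newlines, since xargs splits its input on whitespace by default.

```shell
# Two hypothetical one-record FASTA files in a temporary directory.
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$tmpdir/a.fasta"
printf '>seq2\nTTGA\n' > "$tmpdir/b.fasta"

# Default mode: xargs appends the file names to a single grep call.
# -h hides the file name prefix grep adds when given several files.
headers=$(find "$tmpdir" -name '*.fasta' | sort | xargs grep -h '>')

# -I fic runs the command once per input line, replacing "fic"
# (an arbitrary placeholder string) with that line.
find "$tmpdir" -name '*.fasta' | sort | xargs -I fic cat fic >> "$tmpdir/all.txt"
n_lines=$(wc -l < "$tmpdir/all.txt" | tr -d ' ')

# -L 1 runs the command once per input line; without it, all items
# are passed to one invocation.
per_line=$(printf 'a b\nc d\n' | xargs -L 1 echo | wc -l | tr -d ' ')
at_once=$(printf 'a b\nc d\n' | xargs echo | wc -l | tr -d ' ')

echo "$headers"
echo "$n_lines $per_line $at_once"
rm -rf "$tmpdir"
```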

tee
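tee copies its standard input both to a file and to standard output, which makes it possible to save an intermediate result in the middle of a pipeline. A minimal sketch with made-up names:

```shell
tmpdir=$(mktemp -d)

# tee writes the sorted list to sorted.txt and also passes it on,
# so uniq and wc still receive the data.
n_unique=$(printf '%s\n' alice bob alice | sort | tee "$tmpdir/sorted.txt" | uniq | wc -l | tr -d ' ')

# The saved file holds the intermediate (sorted, not yet deduplicated) lines.
n_saved=$(wc -l < "$tmpdir/sorted.txt" | tr -d ' ')

echo "unique: $n_unique, saved lines: $n_saved"
rm -rf "$tmpdir"
```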

wget

wget http://www.uniprot.org/uniprot/ABCD1_MOUSE.fasta