Searching and Filtering

Searching In Files

grep

print lines matching a pattern

The pattern is either an exact text string or a regular expression: a sequence of characters, written in a formalised syntax, that defines a search pattern.

FASTA file format

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
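As a minimal sketch of grep on a FASTA file, the commands below write a small two-record file and search its header lines. The file name demo.fasta, the record names and the shortened sequences are made up for illustration.

```shell
# Create a small FASTA file (hypothetical name, shortened sequences).
cat > demo.fasta <<'EOF'
>seq1 cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLL
>seq2 hypothetical protein
MKVLAAGIVGLLLA
EOF

# Print the header lines, i.e. the lines that start with '>'.
# The quotes matter: an unquoted > would be taken by the shell as a redirection.
grep '^>' demo.fasta

# Count the sequences (-c prints the number of matching lines).
n_seq=$(grep -c '^>' demo.fasta)
echo "$n_seq"

# Keep only the headers that do NOT mention Elephas (-v inverts the match).
grep '^>' demo.fasta | grep -v 'Elephas'
```

Counting header lines works here because, in FASTA, only header lines begin with '>'.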

FASTQ file format

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
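The same header-counting idea is less reliable on FASTQ, because '@' is also a valid quality character and a quality line may start with it. The sketch below assumes every record takes exactly four lines (no wrapped sequences). The file demo.fastq and its reads are invented for the demonstration.

```shell
# Two made-up reads; the quality line of the second one starts with '@'.
cat > demo.fastq <<'EOF'
@SEQ_1
GATTTGGGGTTC
+
!''*((((***+
@SEQ_2
ACGTACGTACGT
+
@@CCF>>>>>CC
EOF

# Naive count: every line starting with '@', quality lines included.
naive=$(grep -c '^@' demo.fastq)

# Count based on the 4-lines-per-record assumption.
n_lines=$(wc -l < demo.fastq)
reads=$((n_lines / 4))

echo "naive=$naive reads=$reads"
```

Here the naive count reports one read too many, while the line count gives the two reads that are actually in the file.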

Exercises

grep with Regular Expressions

sed

Regular expression vs shell special characters.

warning

  • remember that some characters have a special meaning for the shell, e.g. >, >>, 2>, *, ?, |, ...
  • remember that the shell evaluates the command line before the command is executed
example: what do you expect from the following incorrect commands?
  • grep > file
  • grep * file

Regular expressions and the shell share some metacharacters, but with different meanings. When you want to search for those characters, you have to protect them.

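hint: always protect your search pattern by enclosing it in quotes. Single quotes ' are the safest choice, because inside double quotes " the shell still interprets $, ` and \.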
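The difference between shell evaluation and regular expression matching can be seen in a small experiment. This is only a sketch: the scratch directory and the files data.txt and notes.txt are created for the demonstration and are not part of the course material.

```shell
# Work in a scratch directory so that * expands to a known set of files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a*b\nab\naab\n' > data.txt
printf 'other\n' > notes.txt

# Unquoted: the shell expands * to the file names before grep ever runs.
# echo shows the command line that grep would actually receive.
expanded=$(echo grep * data.txt)
echo "$expanded"

# Quoted: grep receives the pattern a*b. As a regular expression, a* means
# "zero or more a", so every line that contains b matches.
regex_hits=$(grep -c 'a*b' data.txt)

# A backslash removes the regex meaning of *: only the literal text a*b matches.
literal_hits=$(grep -c 'a\*b' data.txt)

echo "$regex_hits $literal_hits"
```

In the unquoted case grep ends up with data.txt as its pattern and notes.txt and data.txt as its input files, which is rarely what was intended.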

Exercises

Filtering/sorting results

cut

who | cut -d ' ' -f 1
find . -atime 0 | cut -d '/' -f 2,3
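Because the output of who and find differs from one machine to another, here is a small sketch on fixed input. The file hits.tsv and its values are invented for illustration.

```shell
# Three tab-separated fields per line (made-up values).
printf 'q1\thitA\t98.5\nq2\thitB\t75.0\n' > hits.tsv

# Tab is cut's default delimiter, so -d is not needed here.
# -f 1,3 keeps the first and third fields.
cut -f 1,3 hits.tsv

# With any other delimiter, name it with -d.
# The text before the first / is field 1, so fields 2 and 3 are "local" and "bin".
echo 'usr/local/bin' | cut -d '/' -f 2,3
```

Note that cut treats every single delimiter as a field boundary, so runs of spaces (as in the output of who) produce empty fields.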

sort

who | cut -d ' ' -f 1 | sort | uniq -c | sort -k 1n,2
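To follow this pipeline step by step, it can be replayed on fixed input. In the sketch below the file logins.txt stands in for the first column of who, and the login names are made up.

```shell
# One login name per line, as `who | cut -d ' ' -f 1` would produce.
printf 'alice\nbob\nalice\ncarol\nalice\nbob\n' > logins.txt

# Step 1: sort puts identical names next to each other.
sort logins.txt

# Step 2: uniq -c collapses each group and prefixes it with its count.
sort logins.txt | uniq -c

# Step 3: sort -k 1n orders the lines by that count, numerically.
sort logins.txt | uniq -c | sort -k 1n

# Variation: r reverses the order, so the most frequent name comes first.
sort logins.txt | uniq -c | sort -k 1nr | head -n 1
```

The n modifier matters: without it, sort compares the counts as text, and 10 would come before 9.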

Exercise

We ran a blast with -m8 output, so the following fields are displayed, separated by tabs:

query id, subject (hit) id, percent identity, alignment length, number of mismatches,
number of gap openings, query start, query end,
subject start, subject end,
Expect value, HSP bit score

  1. copy the file blast2_m8.txt from the course project directory into your home directory.
  2. sort the output by percent identity (highest identity first).
  3. display only the columns hit id, percent identity, Expect value and HSP bit score.
  4. store the results in a new file.

uniq

report or omit repeated lines (Filter adjacent matching lines)
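The word "adjacent" is the important part, as this small sketch shows. The file letters.txt is a made-up example.

```shell
# b appears three times, but only two of them are on consecutive lines.
printf 'b\na\nb\nb\nc\n' > letters.txt

# uniq alone only merges the two adjacent b lines; the first b is kept as well.
uniq letters.txt

# Sorting first makes all identical lines adjacent, so each value appears once.
sort letters.txt | uniq

# -d prints only the values that are repeated, -u only those that are not.
sort letters.txt | uniq -d
sort letters.txt | uniq -u
```

This is why uniq is almost always used right after sort in a pipeline.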

Exercise