replace pattern

tr

tr 'a-z' 'A-Z' <allmysequences
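
As a small illustration of the command above, the sketch below runs it on a made-up sequence file (the name demo_sequences and its content are assumptions, not course data). tr reads only standard input, which is why the file is supplied with <, and every lowercase letter is converted, header line included.

```shell
# Create a small demo file (hypothetical name and content).
workdir=$(mktemp -d)
printf '>seq1\nacgtacgt\n' > "$workdir/demo_sequences"

# Translate each character of the first set into the matching
# character of the second set: lowercase becomes uppercase.
upper=$(tr 'a-z' 'A-Z' < "$workdir/demo_sequences")
printf '%s\n' "$upper"
```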

newlines nightmare

Mac OS, Windows, and Unix use different newline characters that are not compatible.

  • UNIX: LF (aka \n)
  • Windows: CR+LF (aka \r\n)
  • Mac OS X: LF (aka \n)
  • Classic Mac OS (version 9 and earlier): CR (aka \r)

When you view a file, these CR characters may be displayed as ^M or <cr> at the end of each line, or as a second line break.

This can cause problems for programs that are unable to interpret the newlines correctly.

You can use tr to convert the newlines and repair the file: translate CR into LF for a file with CR-only newlines, or delete the CR characters for a file with CR+LF newlines.

tr '\r' '\n' < crappy_file > correct_file
tr -d '\r' < crappy_file > correct_file
cat crappy_file | tr '\r' '\n' > correct_file
cat crappy_file | tr -d '\r' > correct_file

or the program dos2unix

dos2unix crappy_file

dos2unix converts the file in place.
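
To see which tr form fits which kind of file, here is a minimal sketch on two made-up files (all file names are assumptions). It uses od -c to make the CR characters visible, deletes CR in the CR+LF file, and translates CR into LF in the CR-only file.

```shell
workdir=$(mktemp -d)

# Build a two-line file with Windows (CR+LF) newlines.
printf 'line1\r\nline2\r\n' > "$workdir/windows_file"

# od -c prints every byte, so each CR shows up as \r.
od -c "$workdir/windows_file"

# CR+LF file: delete the CR, the LF after it is already there.
tr -d '\r' < "$workdir/windows_file" > "$workdir/from_windows"

# Build the same two lines with CR-only newlines.
printf 'line1\rline2\r' > "$workdir/cr_file"

# CR-only file: there is no LF at all, so translate CR into LF.
tr '\r' '\n' < "$workdir/cr_file" > "$workdir/from_cr"

# Both results should now be identical, LF-only files.
cmp "$workdir/from_windows" "$workdir/from_cr" && echo "same content"
```

Running tr '\r' '\n' on the CR+LF file instead would turn every newline into two LF characters, which is the "second line break" symptom described above.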

sed

sed reads data (from a file or from a pipe) and writes the filtered data to stdout.

The sed command is very versatile, so we cover only a few of its features.
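
The sketch below illustrates the two ways of feeding sed and the fact that the result goes to stdout. The file greeting.txt and the substitution are made up for the illustration.

```shell
workdir=$(mktemp -d)
printf 'hello world\n' > "$workdir/greeting.txt"

# Input given as a file name argument.
from_file=$(sed 's/world/sed/' "$workdir/greeting.txt")

# Same input arriving through a pipe.
from_pipe=$(cat "$workdir/greeting.txt" | sed 's/world/sed/')

# The result goes to stdout; the input file itself is not modified.
after=$(cat "$workdir/greeting.txt")
printf '%s\n%s\n%s\n' "$from_file" "$from_pipe" "$after"
```

To keep the filtered data, redirect stdout to a new file, as in the tr examples.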

The three commands to remember are:

remove lines

replace pattern

expression
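
The text does not name the commands, so the following sketch rests on an assumption: "remove lines" is read as the d command, "replace pattern" as the s command, and "expression" as the -e option, which lets one sed call carry several expressions. The file fruits.txt is made up.

```shell
workdir=$(mktemp -d)
printf 'apple red\nbanana yellow\ncherry red\n' > "$workdir/fruits.txt"

# d: remove the lines that match the pattern, print all the others.
removed=$(sed '/^banana/d' "$workdir/fruits.txt")

# s: replace the first match of the pattern on each line.
replaced=$(sed 's/red/green/' "$workdir/fruits.txt")

# -e: give several expressions; they are applied in order to each line.
both=$(sed -e '/^banana/d' -e 's/red/green/' "$workdir/fruits.txt")

printf '%s\n---\n%s\n---\n%s\n' "$removed" "$replaced" "$both"
```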

Exercises

  1. Copy blast2_m8.txt from projets on central-bio to your local machine. Change the name of the bank from sp to uniprot in the blast2_m8.t file.
  2. Keep only the first field (like cut).
  3. Clean the blast report to keep only the bank and the entry name, like this:
     input:  AK1BA_HUMAN sp|O08782|ALD2_CRIGR 83.23 316   53 0  1  316   1  316   0.0    537
     output: sp:ALD2_CRIGR

Exercises

We want to create a file containing the 10 sequences most similar to il2_human and align them (the first step in modelling a sequence by homology).

  1. get the il2_human sequence in fasta format
  2. perform a blastp (blastall -p blastp -d uniprot_sprot -i the-input -m8)
  3. filter the output according to the % of identity
  4. keep the 10 best hits
  5. reformat each line to keep only the bank and entry name, in the format bk:entry_name

Exercises (continued)

  1. use golden to get the sequences
  2. transform each sequence into fasta format and concatenate them into one file
  3. run clustalw (clustalw -align -infile=filename)

Addresses

Sed commands can be given with no address, in which case the command is executed for all input lines. Otherwise, the command is executed only for the input lines that match the address.

Addresses examples

sed_play.txt .
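
A minimal sketch of the common address forms, run on a made-up five-line file (numbers.txt is an assumption and is not the sed_play.txt file). The -n option suppresses the default output so that only the lines selected by p are printed.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$workdir/numbers.txt"

# Line number: print only line 2.
line2=$(sed -n '2p' "$workdir/numbers.txt")

# Range of line numbers: delete lines 2 to 4.
outer=$(sed '2,4d' "$workdir/numbers.txt")

# Regular expression: print the lines starting with t.
with_t=$(sed -n '/^t/p' "$workdir/numbers.txt")

# $ stands for the last line: delete it.
no_last=$(sed '$d' "$workdir/numbers.txt")

# Range between two patterns: from the line matching two to the one matching four.
span=$(sed -n '/two/,/four/p' "$workdir/numbers.txt")

printf '%s\n' "$line2" "$outer" "$with_t" "$no_last" "$span"
```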

Transform filenames

Transform the fastq file brca.example.illumina.0.1.fastq into fasta (try your sed expression on test.fastaq before using it on the real file)

step by step

  1. print only the header lines (every 4th line, starting at the first line)
  2. replace @ by > at the beginning of every header
  3. print every 4th line, starting at the second line
  4. How do you show the second read in brca.example.illumina.0.1.fastq?

Filenames are strings, so they can be manipulated with the previous tools (tr and sed), but Unix also provides built-in tools to manipulate and transform filenames easily.
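
As an illustration of treating a filename as a plain string, the sketch below cleans up a made-up name with tr and sed (the name is hypothetical, and no file is actually renamed).

```shell
# Hypothetical file name used only for illustration.
filename='my sequences.fastq'

# tr: replace every space with an underscore.
no_space=$(printf '%s\n' "$filename" | tr ' ' '_')

# sed: replace the .fastq suffix with .fasta ($ anchors the match at the end).
renamed=$(printf '%s\n' "$no_space" | sed 's/\.fastq$/.fasta/')

printf '%s\n' "$renamed"
```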

basename

strip directory and suffix from filenames

syntax:

basename filename [suffix]

Remove any directory components from filename. If suffix is specified, also remove the trailing suffix.

examples:

basename file.txt => file.txt
basename /xxxxx/yyyyy/zzzzz/file.txt => file.txt
basename /xxxxx/yyyyy/zzzzz/file.txt .txt => file
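
A typical use is to derive an output name from an input path. The sketch below wraps this in a small helper; the function name fasta_name and the paths are assumptions made for the example. The files do not need to exist, because basename only works on the string.

```shell
# Hypothetical helper: turn a path ending in .fastq into a bare .fasta name.
fasta_name() {
    # Strip the directories and the .fastq suffix, then append .fasta.
    printf '%s.fasta\n' "$(basename "$1" .fastq)"
}

for path in /data/run1/sample_a.fastq /data/run2/sample_b.fastq; do
    printf '%s -> %s\n' "$path" "$(fasta_name "$path")"
done
```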

dirname

strip non-directory suffix from file name

examples:

dirname /some/directory/path/to/file => /some/directory/path/to
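
dirname and basename are often combined to build a new path next to an existing file. A minimal sketch, assuming a hypothetical input path and a made-up .out suffix:

```shell
# Hypothetical path; nothing is read from or written to disk.
input=/some/directory/path/to/file.txt

dir=$(dirname "$input")          # directory part of the path
stem=$(basename "$input" .txt)   # file name without directory and suffix

# New file name in the same directory as the input.
output="$dir/$stem.out"
printf '%s\n' "$output"

# A name without any slash is taken to be in the current directory.
here=$(dirname file.txt)
printf '%s\n' "$here"
```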