Scripting

What we call scripting is the execution of external programs such as blast, hmmsearch, bwa, … from our python programs. The scripting is a important part for bioinformatics.

In Python there is a module Subprocess which dedicated to run and communicate with external programs. Subprocess has been add in Python since 2.4 version. This is the recommended method to execute a program within a python script. This module intend to standardize and replace several other modules and functions.

  • os.system
  • os.spawn*
  • os.popen*
  • popen2.*
  • commands.*

Although the API has been easier than previous one, there still traps to avoid.

Shell True vs False

The shell parameter specify if the subprocess is executed in a sub-shell or directly. But this also influence the way the command line is passed to the Popen constructor.

  • shell = False : the program is executed directly, the first argument is a list of string and arg[0] must be the binary to execute. The others items are the options of this binary.

    from subprocess import Popen
    blast_process = Popen( args = ['/local/gensoft/scripts/blastall',
                                   '-p', 'blastp',
                                   '-d', 'uniprot_sprot',
                                   '-i', 'DataBio/Sequences/Proteique/abcd2_mouse.fa',
                                   '-b', '2',
                                   '-v', '5'])
    
  • shell = True : in this case the subprocessus is executed in a sub-shell, args must be a string formatted as the command is typed in a terminal. This include the quote or backslash to escape spaces etc. If args is a list, the first item will be executed in a sub-shell but the other items will be consider as options of the shell itself.

    from subprocess import Popen
    blast_process = Popen(args = 'blastall -p blastp -d uniprot_sprot \
                                     -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
                          shell = True)
    

Environment Variables

By default the subprocess inherits of the main process environment. If we want that the subprocess inherits of a environment variable simply add it in the main environment.

import os
os.environ['BLASTDB'] = '/home/toto/BioBank/blast'
from subprocess import Popen
blast_process = Popen(args = 'blastall -p blastp -d my_bank \
                                  -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
                      shell = True)

It is also possible to specify a new environment to the subprocess via the env argument. But be careful, in this case all the environment is replaced. BEWARE to the PATH. In this case the env is a dictionary.

new_env = { 'PATH' : '/home/toto/bin' ,
            'BLASTDB' : '/home/toto/BioBank/blast',
            'BLASTMAT' : '/home/toto/share/Matrix'
           }
blast_process = Popen(args = 'blastall -p blastp -d my_bank \
                                  -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
                       env = new_env, shell = True)

Get the standard and error output

By default Popen redirect subprocess the standard and error outputs on sys.stdout and sys.stderr respectively.

import sys
blast_process = Popen('blastall -p blastp -d my_bank \
                                -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
                       shell = True,
                       stdout = blast_out,
                       stderr = blast_err)

Redirect outputs in files

instead to diplay stderr and stdout it’s often useful to harvest results in the following of the script in files.

from subprocess import Popen
blast_out = open('blast.out', 'w')
blast_err = open('blast.err', 'w')
try:
   blast_process = Popen('blastall -p blastp -d uniprot_sprot \
                                   -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
                          shell=True,
                          stdout = blast_out,
                          stderr = blast_err)
   blast_process.wait()
finally:
   blast_out.close()
   blast_out.close()
if blast_process.returncode != 0:
   msg = "probleme durant l'execution du blast:\n"
   with open('blast.err', 'r') as blast_err:
      for line in blast_err:
         msg = msg + line
   raise RuntimeError( msg )
else:
   print "le blast c'est bien fini, suite du script"
   with open('blast.out', 'r') as blast_out:
      for line in blast_out:
         print line,

PIPEs Problems

Sometimes we want to get the standard and/or error output directly without using files. To do this we need to pass the constant subprocress.PIPE to the arguments stdout and stderr. BEWARE in this case the subprocess write in a buffer available via the property stdout or stderr of the subprocess object. BUT if ONE of the buffer become to be full the process is blocked. This situation can induced a dead lock.

from subprocess import Popen, PIPE

blast_process = Popen('blastall -p blastp -d uniprot_sprot \
                                -i DataBio/Sequences/Proteique/abcd2_mouse.fa',
                      shell = True,
                      stdout = PIPE,
                      stderr = PIPE)
blast_process.wait()

print "This code could never be executed"

The call tho the wait method block the python script execution until the subprocess is finished. But the subprocess filled the buffer if this one is full. We are in a deadlock. python wait the subprocess which wait python consume the buffers. So we should not use wait the end of subprocess but use a loop while and the poll method. The poll method return None while the subprocess is running. and we have to consume the both output in the same time. To consume several flow at the same time we can use the select module.

This module provide 2 functions select and poll available for most of the operating system and epoll for linux > kernel 2.5 and kqueue on BSD. On windows select and poll work on sockets, for the others OS it works also on the files and pipes.

poll implementation

pseudo code of poll using

create a poll object
register flow we want to watch with the right corresponding filter
start the flow watching
at each event on a flux
  check wich event happened
  check which flow generate this event
      provide an adequate response
import select
process_ = Popen(
                  'blastall -p blastp -d uniprot_sprot \
                  -i DataBio/Sequences/Proteique/abcd2_mouse.fa',
                  shell = True,
                  shell = True ,
                  stdout = PIPE ,
                  stdin = None ,
                  stderr = PIPE ,
                  )
READ_ONLY = select.POLLIN | select.POLLPRI | select.POLLHUP | select.POLLERR
# create a poll object
poller = select.poll()
# register the flow with reading filter
poller.register(process_.stdout, READ_ONLY)
poller.register(process_.stderr, READ_ONLY)
#start watching the flows
while process_.poll() is None:
    # at each poll call we have a list of tuple with 2 int.
    # [(fd1, flag) , (fd2,flag)]
    # fd is a file descriptor
    # flag match a combination of
    # select.POLLIN | select.POLLPRI | select.POLLHUP | select.POLLERR
    # this list match with the fd ready to be processed in
    # reading or writing depending of their creation.
    # beware this is a blocking call while a fd is not ready (we provide a timeout as argument)
    events =  poller.poll()
    while events :
        for fd, flag in events:
            if flag & (select.POLLIN | select.EPOLLPRI): # some data are ready to be read
                if fd == process_.stdout.fileno():
                    sys.stdout.write( process_.stdout.read() )
                if fd == process_.stderr.fileno():
                    sys.stderr.write( process_.stderr.read() )
            elif flag & select.EPOLLHUP: # the fd has been closed by the source
                poller.unregister(fd)
            elif flag & select.EPOLLERR: # an error on the fd has occurred
                poller.unregister(fd)
                # handle the error
        events =  poller.poll(1)
        # the number as argument is the timeout (in millisecond)
        # if we deregister the 2 flow at this point, we stay blocked at this instruction.

if process_.returncode != 0:
    raise RuntimeError

poll usage example by Doug Hellmann

select implementation

It is possible to implement the solution using select.select()

import select
process_ = Popen(
                  'blastall -p blastp -d uniprot_sprot \
                            -i DataBio/Sequences/Proteique/abcd2_mouse.fa',
                  shell=True,
                  shell = True ,
                  stdout = PIPE ,
                  stdin = None ,
                  stderr = PIPE ,
               )
inputs = [process_.stdout, process_.stderr]
while process_.poll() is None:
    # select has 3 parameters, 3 lists, the sockets, the fileobject to watch
    # in reading, writing, the errors
    # in addition a timeout option (the call is blocking while a fileObject
    # is not ready to be processed)
    # by return we get 3 lists with the fileObject to be processed
    # in reading, writing, errors.
    readable , writable, exceptional = select.select(inputs, [], [] , 1)
    while readable and inputs:
        for flow in readable:
            data = flow.read()
            if not data:
                # the flow ready in reading which has no data
                # is a closed flow
                # thus we must stop to watch it
                inputs.remove(flow)
            if flow is process_.stdout:
                sys.stdout.write(data)
            elif flow is process_.stderr:
                sys.stdout.write(data)
        readable , writable, exceptional = select.select( inputs, [], [] , 1 )
if process_.returncode != 0:
    raise RuntimeError

select usage example by Doug Hellmann

using communicate

Popen.communicate(input=None) allow to read data from stdout and stderr at the same time. This method interact with process: Send data to stdin. Read data from stdout and stderr, until end-of-file is reached.

Wait for process to terminate.
communicate() returns a tuple (stdoutdata, stderrdata).

Warning

The data read is buffered in memory, so do NOT use this method if the data size is large or unlimited.

from subprocess import Popen, PIPE

blast_process = Popen('blastall -p blastp -d uniprot_sprot \
                                -i DataBio/Sequences/Proteique/abcd2_mouse.fa',
                       shell = True,
                       stdout = PIPE,
                       stderr = PIPE)
stdout, stderr = blast_process.communicate()

return_code = blast_process.poll()
if return_code != 0 :
   raise RuntimeError("something goes wrong with blastp :" + stderr)