.. _Scripting:


*********
Scripting
*********

What we call *scripting* is the execution of external programs such as *blast*, *hmmsearch*, *bwa*, ... 
from our python programs. The scripting is a important part for bioinformatics.

In Python there is a module ``Subprocess`` which dedicated to run and communicate with external programs.
``Subprocess`` has been add in Python since 2.4 version. This is the recommended method to execute a program within a python script.
This module intend to standardize and replace several other modules and functions.

* os.system
* os.spawn\*
* os.popen\*
* popen2.\*
* commands.\*

Although the API has been easier than previous one, there still traps to avoid.

Shell True vs False
===================

The **shell** parameter specify if the subprocess is executed in a sub-shell or directly. 
But this also influence the way the command line is passed to the Popen constructor.

* **shell = False** : the program is executed directly, the first argument is a list of string and arg[0] must be the binary to execute.
  The others items are the options of this binary.
  
  ::

   from subprocess import Popen
   blast_process = Popen( args = ['/local/gensoft/scripts/blastall', 
                                  '-p', 'blastp', 
                                  '-d', 'uniprot_sprot', 
                                  '-i', 'DataBio/Sequences/Proteique/abcd2_mouse.fa', 
                                  '-b', '2', 
                                  '-v', '5'])

* **shell = True** : in this case the subprocessus is executed in a sub-shell, args must be a ``string`` formatted as the command is typed in a terminal.
  This include the *quote* or *backslash* to escape spaces etc.
  If args is a list, the first item will be executed in a sub-shell but the other items will be consider as options of the shell itself.
   
  ::

   from subprocess import Popen
   blast_process = Popen(args = 'blastall -p blastp -d uniprot_sprot \
                                    -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5', 
                         shell = True)

Environment Variables
=====================

By default the subprocess inherits of the main process environment. 
If we want that the subprocess inherits of a environment variable simply add it in the main environment.

::

   import os
   os.environ['BLASTDB'] = '/home/toto/BioBank/blast'
   from subprocess import Popen
   blast_process = Popen(args = 'blastall -p blastp -d my_bank \
                                     -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5', 
                         shell = True)

It is also possible to specify a new environment to the subprocess *via* the ``env`` argument.
But be careful, in this case all the environment is replaced. **BEWARE** to the PATH. In this case the ``env`` is a dictionary.

.. code-block:: python

   new_env = { 'PATH' : '/home/toto/bin' ,
               'BLASTDB' : '/home/toto/BioBank/blast', 
               'BLASTMAT' : '/home/toto/share/Matrix'
              }
   blast_process = Popen(args = 'blastall -p blastp -d my_bank \
                                     -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5', 
                          env = new_env, shell = True)


Get the standard and error output
=================================

By default ``Popen`` redirect subprocess the *standard* and *error* outputs on ``sys.stdout`` and ``sys.stderr`` respectively.

::

   import sys
   blast_process = Popen('blastall -p blastp -d my_bank \
                                   -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5', 
                          shell = True, 
                          stdout = blast_out, 
                          stderr = blast_err)

Redirect outputs in files
-------------------------

instead to diplay ``stderr`` and ``stdout`` it's often useful to harvest results in the following of the script in files.

.. code-block:: python

   from subprocess import Popen
   blast_out = open('blast.out', 'w')
   blast_err = open('blast.err', 'w')
   try:
      blast_process = Popen('blastall -p blastp -d uniprot_sprot \
                                      -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5', 
                             shell=True, 
                             stdout = blast_out, 
                             stderr = blast_err)
      blast_process.wait()
   finally:
      blast_out.close()
      blast_out.close()
   if blast_process.returncode != 0:
      msg = "probleme durant l'execution du blast:\n"
      with open('blast.err', 'r') as blast_err:
         for line in blast_err:
            msg = msg + line
      raise RuntimeError( msg )
   else:
      print "le blast c'est bien fini, suite du script"
      with open('blast.out', 'r') as blast_out:
         for line in blast_out:
            print line,


PIPEs Problems
--------------

Sometimes we want to get the standard and/or error output  directly without using files.
To do this we need to pass the constant ``subprocress.PIPE`` to the arguments *stdout* and *stderr*.
**BEWARE** in this case the subprocess write in a buffer available *via* the property ``stdout`` or ``stderr`` of the subprocess object.
**BUT** if **ONE** of the buffer become to be full the process is blocked. This situation can induced a dead lock. 

.. code-block:: python

   from subprocess import Popen, PIPE
   
   blast_process = Popen('blastall -p blastp -d uniprot_sprot \
                                   -i DataBio/Sequences/Proteique/abcd2_mouse.fa', 
                         shell = True, 
                         stdout = PIPE, 
                         stderr = PIPE)
   blast_process.wait()
 
   print "This code could never be executed"

The call tho the ``wait`` method block the python script execution until the subprocess is finished. But the subprocess
filled the buffer if this one is full. We are in a deadlock. python wait the subprocess which wait python consume the buffers.
So we should not use wait the end of subprocess but use a loop while and the *poll* method.
The *poll* method return None while the subprocess is running. and we have to consume the both output in the same time.
To consume several flow at the same time we can use the ``select`` module.

This module provide 2 functions ``select`` and ``poll`` available for most of the operating system and ``epoll`` for linux > kernel 2.5 and kqueue on BSD.
On windows ``select`` and ``poll`` work on sockets, for the others OS it works also on the files and pipes.

poll implementation
"""""""""""""""""""

pseudo code of poll using

.. code-block:: python

    create a poll object
    register flow we want to watch with the right corresponding filter
    start the flow watching
    at each event on a flux
      check wich event happened
      check which flow generate this event
          provide an adequate response


.. code-block:: python

   import select
   process_ = Popen(
                     'blastall -p blastp -d uniprot_sprot \
                     -i DataBio/Sequences/Proteique/abcd2_mouse.fa', 
                     shell = True,
                     shell = True ,
                     stdout = PIPE ,
                     stdin = None ,
                     stderr = PIPE ,
                     )
   READ_ONLY = select.POLLIN | select.POLLPRI | select.POLLHUP | select.POLLERR
   # create a poll object 
   poller = select.poll()
   # register the flow with reading filter
   poller.register(process_.stdout, READ_ONLY)
   poller.register(process_.stderr, READ_ONLY)
   #start watching the flows
   while process_.poll() is None:
       # at each poll call we have a list of tuple with 2 int.
       # [(fd1, flag) , (fd2,flag)]
       # fd is a file descriptor
       # flag match a combination of 
       # select.POLLIN | select.POLLPRI | select.POLLHUP | select.POLLERR
       # this list match with the fd ready to be processed in 
       # reading or writing depending of their creation.
       # beware this is a blocking call while a fd is not ready (we provide a timeout as argument)
       events =  poller.poll()
       while events :
           for fd, flag in events:
               if flag & (select.POLLIN | select.EPOLLPRI): # some data are ready to be read
                   if fd == process_.stdout.fileno():
                       sys.stdout.write( process_.stdout.read() )
                   if fd == process_.stderr.fileno():
                       sys.stderr.write( process_.stderr.read() )
               elif flag & select.EPOLLHUP: # the fd has been closed by the source
                   poller.unregister(fd)
               elif flag & select.EPOLLERR: # an error on the fd has occurred
                   poller.unregister(fd)
                   # handle the error
           events =  poller.poll(1)
           # the number as argument is the timeout (in millisecond)
           # if we deregister the 2 flow at this point, we stay blocked at this instruction.
    
   if process_.returncode != 0:
       raise RuntimeError

poll usage example by `Doug Hellmann <http://pymotw.com/2/select/#poll>`_

select implementation
"""""""""""""""""""""

It is possible to implement the solution using select.select()

.. code-block:: python

   import select
   process_ = Popen(
                     'blastall -p blastp -d uniprot_sprot \
                               -i DataBio/Sequences/Proteique/abcd2_mouse.fa',
                     shell=True,
                     shell = True ,
                     stdout = PIPE ,
                     stdin = None ,
                     stderr = PIPE ,
                  )
   inputs = [process_.stdout, process_.stderr]
   while process_.poll() is None:
       # select has 3 parameters, 3 lists, the sockets, the fileobject to watch
       # in reading, writing, the errors
       # in addition a timeout option (the call is blocking while a fileObject
       # is not ready to be processed)
       # by return we get 3 lists with the fileObject to be processed
       # in reading, writing, errors.
       readable , writable, exceptional = select.select(inputs, [], [] , 1)
       while readable and inputs:
           for flow in readable:
               data = flow.read()
               if not data:
                   # the flow ready in reading which has no data
                   # is a closed flow
                   # thus we must stop to watch it
                   inputs.remove(flow)
               if flow is process_.stdout:
                   sys.stdout.write(data)
               elif flow is process_.stderr:
                   sys.stdout.write(data)
           readable , writable, exceptional = select.select( inputs, [], [] , 1 )
   if process_.returncode != 0:
       raise RuntimeError

select usage example by `Doug Hellmann <http://pymotw.com/2/select/>`_

using communicate
"""""""""""""""""
 
Popen.communicate(input=None) allow to read data from stdout and stderr at the same time.
This method interact with process: Send data to stdin. Read data from stdout and stderr, **until end-of-file is reached**. 

| Wait for process to terminate. 
| communicate() returns a tuple (stdoutdata, stderrdata).
    
    
.. warning::
   The data read is buffered in memory, 
   so do **NOT** use this method if the data size is large or unlimited.
    
.. code-block:: python

   from subprocess import Popen, PIPE
   
   blast_process = Popen('blastall -p blastp -d uniprot_sprot \
                                   -i DataBio/Sequences/Proteique/abcd2_mouse.fa', 
                          shell = True, 
                          stdout = PIPE, 
                          stderr = PIPE)
   stdout, stderr = blast_process.communicate()

   return_code = blast_process.poll()
   if return_code != 0 :
      raise RuntimeError("something goes wrong with blastp :" + stderr)