Scripting¶
What we call scripting is the execution of external programs such as blast, hmmsearch, bwa, … from our python programs. The scripting is a important part for bioinformatics.
In Python there is a module Subprocess
which dedicated to run and communicate with external programs.
Subprocess
has been add in Python since 2.4 version. This is the recommended method to execute a program within a python script.
This module intend to standardize and replace several other modules and functions.
- os.system
- os.spawn*
- os.popen*
- popen2.*
- commands.*
Although the API has been easier than previous one, there still traps to avoid.
Shell True vs False¶
The shell parameter specify if the subprocess is executed in a sub-shell or directly. But this also influence the way the command line is passed to the Popen constructor.
shell = False : the program is executed directly, the first argument is a list of string and arg[0] must be the binary to execute. The others items are the options of this binary.
from subprocess import Popen blast_process = Popen( args = ['/local/gensoft/scripts/blastall', '-p', 'blastp', '-d', 'uniprot_sprot', '-i', 'DataBio/Sequences/Proteique/abcd2_mouse.fa', '-b', '2', '-v', '5'])
shell = True : in this case the subprocessus is executed in a sub-shell, args must be a
string
formatted as the command is typed in a terminal. This include the quote or backslash to escape spaces etc. If args is a list, the first item will be executed in a sub-shell but the other items will be consider as options of the shell itself.from subprocess import Popen blast_process = Popen(args = 'blastall -p blastp -d uniprot_sprot \ -i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5', shell = True)
Environment Variables¶
By default the subprocess inherits of the main process environment. If we want that the subprocess inherits of a environment variable simply add it in the main environment.
import os
os.environ['BLASTDB'] = '/home/toto/BioBank/blast'
from subprocess import Popen
blast_process = Popen(args = 'blastall -p blastp -d my_bank \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
shell = True)
It is also possible to specify a new environment to the subprocess via the env
argument.
But be careful, in this case all the environment is replaced. BEWARE to the PATH. In this case the env
is a dictionary.
new_env = { 'PATH' : '/home/toto/bin' ,
'BLASTDB' : '/home/toto/BioBank/blast',
'BLASTMAT' : '/home/toto/share/Matrix'
}
blast_process = Popen(args = 'blastall -p blastp -d my_bank \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
env = new_env, shell = True)
Get the standard and error output¶
By default Popen
redirect subprocess the standard and error outputs on sys.stdout
and sys.stderr
respectively.
import sys
blast_process = Popen('blastall -p blastp -d my_bank \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
shell = True,
stdout = blast_out,
stderr = blast_err)
Redirect outputs in files¶
instead to diplay stderr
and stdout
it’s often useful to harvest results in the following of the script in files.
from subprocess import Popen
blast_out = open('blast.out', 'w')
blast_err = open('blast.err', 'w')
try:
blast_process = Popen('blastall -p blastp -d uniprot_sprot \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa -b 2 -v 5',
shell=True,
stdout = blast_out,
stderr = blast_err)
blast_process.wait()
finally:
blast_out.close()
blast_out.close()
if blast_process.returncode != 0:
msg = "probleme durant l'execution du blast:\n"
with open('blast.err', 'r') as blast_err:
for line in blast_err:
msg = msg + line
raise RuntimeError( msg )
else:
print "le blast c'est bien fini, suite du script"
with open('blast.out', 'r') as blast_out:
for line in blast_out:
print line,
PIPEs Problems¶
Sometimes we want to get the standard and/or error output directly without using files.
To do this we need to pass the constant subprocress.PIPE
to the arguments stdout and stderr.
BEWARE in this case the subprocess write in a buffer available via the property stdout
or stderr
of the subprocess object.
BUT if ONE of the buffer become to be full the process is blocked. This situation can induced a dead lock.
from subprocess import Popen, PIPE
blast_process = Popen('blastall -p blastp -d uniprot_sprot \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa',
shell = True,
stdout = PIPE,
stderr = PIPE)
blast_process.wait()
print "This code could never be executed"
The call tho the wait
method block the python script execution until the subprocess is finished. But the subprocess
filled the buffer if this one is full. We are in a deadlock. python wait the subprocess which wait python consume the buffers.
So we should not use wait the end of subprocess but use a loop while and the poll method.
The poll method return None while the subprocess is running. and we have to consume the both output in the same time.
To consume several flow at the same time we can use the select
module.
This module provide 2 functions select
and poll
available for most of the operating system and epoll
for linux > kernel 2.5 and kqueue on BSD.
On windows select
and poll
work on sockets, for the others OS it works also on the files and pipes.
poll implementation¶
pseudo code of poll using
create a poll object
register flow we want to watch with the right corresponding filter
start the flow watching
at each event on a flux
check wich event happened
check which flow generate this event
provide an adequate response
import select
process_ = Popen(
'blastall -p blastp -d uniprot_sprot \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa',
shell = True,
shell = True ,
stdout = PIPE ,
stdin = None ,
stderr = PIPE ,
)
READ_ONLY = select.POLLIN | select.POLLPRI | select.POLLHUP | select.POLLERR
# create a poll object
poller = select.poll()
# register the flow with reading filter
poller.register(process_.stdout, READ_ONLY)
poller.register(process_.stderr, READ_ONLY)
#start watching the flows
while process_.poll() is None:
# at each poll call we have a list of tuple with 2 int.
# [(fd1, flag) , (fd2,flag)]
# fd is a file descriptor
# flag match a combination of
# select.POLLIN | select.POLLPRI | select.POLLHUP | select.POLLERR
# this list match with the fd ready to be processed in
# reading or writing depending of their creation.
# beware this is a blocking call while a fd is not ready (we provide a timeout as argument)
events = poller.poll()
while events :
for fd, flag in events:
if flag & (select.POLLIN | select.EPOLLPRI): # some data are ready to be read
if fd == process_.stdout.fileno():
sys.stdout.write( process_.stdout.read() )
if fd == process_.stderr.fileno():
sys.stderr.write( process_.stderr.read() )
elif flag & select.EPOLLHUP: # the fd has been closed by the source
poller.unregister(fd)
elif flag & select.EPOLLERR: # an error on the fd has occurred
poller.unregister(fd)
# handle the error
events = poller.poll(1)
# the number as argument is the timeout (in millisecond)
# if we deregister the 2 flow at this point, we stay blocked at this instruction.
if process_.returncode != 0:
raise RuntimeError
poll usage example by Doug Hellmann
select implementation¶
It is possible to implement the solution using select.select()
import select
process_ = Popen(
'blastall -p blastp -d uniprot_sprot \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa',
shell=True,
shell = True ,
stdout = PIPE ,
stdin = None ,
stderr = PIPE ,
)
inputs = [process_.stdout, process_.stderr]
while process_.poll() is None:
# select has 3 parameters, 3 lists, the sockets, the fileobject to watch
# in reading, writing, the errors
# in addition a timeout option (the call is blocking while a fileObject
# is not ready to be processed)
# by return we get 3 lists with the fileObject to be processed
# in reading, writing, errors.
readable , writable, exceptional = select.select(inputs, [], [] , 1)
while readable and inputs:
for flow in readable:
data = flow.read()
if not data:
# the flow ready in reading which has no data
# is a closed flow
# thus we must stop to watch it
inputs.remove(flow)
if flow is process_.stdout:
sys.stdout.write(data)
elif flow is process_.stderr:
sys.stdout.write(data)
readable , writable, exceptional = select.select( inputs, [], [] , 1 )
if process_.returncode != 0:
raise RuntimeError
select usage example by Doug Hellmann
using communicate¶
Popen.communicate(input=None) allow to read data from stdout and stderr at the same time. This method interact with process: Send data to stdin. Read data from stdout and stderr, until end-of-file is reached.
Warning
The data read is buffered in memory, so do NOT use this method if the data size is large or unlimited.
from subprocess import Popen, PIPE
blast_process = Popen('blastall -p blastp -d uniprot_sprot \
-i DataBio/Sequences/Proteique/abcd2_mouse.fa',
shell = True,
stdout = PIPE,
stderr = PIPE)
stdout, stderr = blast_process.communicate()
return_code = blast_process.poll()
if return_code != 0 :
raise RuntimeError("something goes wrong with blastp :" + stderr)