9. Input/Output

Most of the programs we have seen so far are transient in the sense that they run for a short time and produce some output, but when they end, their data disappears. If you run the program again, it starts in a “clean” state.

Other programs are persistent: they run for a long time (or all the time); they keep at least some of their data in permanent storage (a hard drive, for instance); and if they shut down and restart, they pick up where they left off. Examples of persistent programs are operating systems, which run pretty much whenever a computer is on, and web servers, which run all the time, waiting for requests to come in on the network.

One of the simplest ways for programs to maintain their data is by reading and writing text files. In this chapter we will see programs that read and write text files. An alternative is to store the state of the program in a database. This aspect of persistent storage will not be covered in this course.

9.1. File Object

A text file is a sequence of characters stored on a permanent medium like a hard drive, flash memory, or CD-ROM.

You do not manipulate these media directly. Python provides, by default, the basic functions and methods needed to manipulate files. Most file manipulation is done through a file object. A file object is a high-level representation of a file that abstracts the medium on which the data are stored, and provides methods to interact with the file.

Files are like books. You open them to start working, then read or write in them, and you close them when you have finished your work. However, you need to know where you are in the book. Just as children follow the characters on a page with a finger when they learn to read, the file object maintains a file pointer which indicates the current position in the file.

9.1.1. Opening a file

The first step is to get a file object. This is done using the open built-in function, which returns a file object.

The first two arguments of this function are file, a string indicating the path to the file you want to read or write, and mode, a string indicating the “access mode”, that is, how the file is to be opened.

More arguments exist for the open function, for instance pertaining to the file encoding: https://docs.python.org/3/library/functions.html#open

The most commonly used values for access mode are the following:

Character   Meaning
‘r’         open for reading (default)
‘w’         open for writing, truncating the file first
‘a’         open for writing, appending to the end of the file if it exists
‘+’         enable both reading and writing, in interaction with r, w or a
‘b’         binary mode
‘t’         text mode (default)

These letters are used in combinations to indicate whether you will read or write data to the file, and whether the data corresponds to text or binary data.

To read a simple text file, you should use the "rt" mode, which you can actually omit, because it corresponds to the default values:

>>> fasta_file = open("data/sv40.fasta")
>>> type(fasta_file)
<class '_io.TextIOWrapper'>

The above example assumes that you have an existing sv40.fasta file in the data subdirectory of your current working directory.

If you try to open in “read mode” a file that does not exist, a FileNotFoundError exception is raised:

>>> wrong_path_fasta_file = open("data/sv41.fasta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'data/sv41.fasta'

If the file cannot be opened because of inadequate permissions on the file system, a PermissionError exception is raised:

>>> unreadable_file = open("/root/.profile")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
PermissionError: [Errno 13] Permission denied: '/root/.profile'

In the above example, the user of the Python program happens to not have the read permissions on the existing /root folder.

If you want to create a new text file or overwrite an existing one, you should use the "wt" mode, but you can omit the "t", because it is the default value:

>>> results_file = open("data/results.txt", "w")
>>> type(results_file)
<class '_io.TextIOWrapper'>

A FileNotFoundError exception will be raised if the directory in which you want to write the file does not exist:

>>> wrong_path_results_file = open("dato/results.txt", "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'dato/results.txt'

And a PermissionError will occur if you do not have the write permission on the destination directory:

>>> unwritable_file = open("/etc/test_forbidden.txt", "w")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
PermissionError: [Errno 13] Permission denied: '/etc/test_forbidden.txt'

The above example assumes that the /etc directory exists, but the user does not have the write permission on this part of the file system.
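The two exceptions described above can be caught separately. The following is a minimal sketch, assuming a scratch directory created with tempfile; the names scratch_dir, bad_path and outcome are hypothetical and only serve the illustration:

```python
import os
import tempfile

# Hypothetical path: the "dato" subdirectory is deliberately never created
scratch_dir = tempfile.mkdtemp()
bad_path = os.path.join(scratch_dir, "dato", "results.txt")

try:
    results_file = open(bad_path, "w")
except FileNotFoundError:
    # The directory that should contain the file does not exist
    outcome = "directory not found"
except PermissionError:
    # The directory exists but we are not allowed to write in it
    outcome = "permission denied"
else:
    # No exception: the file was opened, so close it again
    outcome = "opened"
    results_file.close()

print(outcome)
os.rmdir(scratch_dir)
```

Handling each exception in its own except clause lets the program give a precise message instead of a generic failure.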

If you opened the file using "r" or "w", the file object will start its life with the file pointer at the beginning of the file. If you used "a", the file pointer will be at the end.
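As an illustration, the starting position can be checked with the tell method (described later in this chapter). This is only a sketch: the file pointer_demo.txt and the temporary directory are assumptions, not part of the course data.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "pointer_demo.txt")

# "w" creates the file; the pointer starts at position 0
with open(demo_path, "w") as fh:
    start_w = fh.tell()
    fh.write("ACGT")

# "r" also starts at the beginning of the file
with open(demo_path, "r") as fh:
    start_r = fh.tell()

# "a" starts at the end, after the 4 characters already written
with open(demo_path, "a") as fh:
    start_a = fh.tell()

print(start_w, start_r, start_a)

# Clean up the scratch file and directory
os.remove(demo_path)
os.rmdir(demo_dir)
```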

It is possible to obtain a file object able to both read and write, by using "r+", "w+" or "a+". For more details, you can refer to this discussion about the exact effects of these modes and mode combinations.

Python distinguishes between binary and text I/O. Files opened in binary mode (including "b" in the mode argument) return their contents as bytes objects without any decoding. In text mode (the default, or when "t" is included in the mode argument), the contents of the file are decoded and returned as str.
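The difference can be seen by reading the same file in both modes. The sketch below assumes a scratch file named modes_demo.txt and an explicit UTF-8 encoding; both are choices made for the illustration.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "modes_demo.txt")

# "caf\u00e9" is the word "cafe" with an accented final e
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write("caf\u00e9")

# Text mode: the bytes are decoded into a str
with open(demo_path, "rt", encoding="utf-8") as fh:
    as_text = fh.read()

# Binary mode: the raw bytes are returned without decoding
with open(demo_path, "rb") as fh:
    as_bytes = fh.read()

print(type(as_text), len(as_text))    # a str of 4 characters
print(type(as_bytes), len(as_bytes))  # 5 bytes: the accented e takes 2 bytes in UTF-8

os.remove(demo_path)
os.rmdir(demo_dir)
```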

Note

Python 3 has “universal newlines” support for files opened in text mode. This means that lines may be terminated by any of the following: the Unix end-of-line convention ‘\n’, the (old) Macintosh convention ‘\r’, or the Windows convention ‘\r\n’. All of these external representations are seen as ‘\n’ by the Python program. For more details about newlines: https://en.wikipedia.org/wiki/Newline
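A small sketch of this translation, assuming a scratch file named newlines_demo.txt written in binary mode so that the three conventions reach the disk unchanged:

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "newlines_demo.txt")

# Write raw bytes mixing the three end-of-line conventions
with open(demo_path, "wb") as fh:
    fh.write(b"unix\nold mac\rwindows\r\n")

# Text mode (default newline handling) translates all of them to "\n"
with open(demo_path, "r") as fh:
    lines = fh.readlines()

print(lines)  # ['unix\n', 'old mac\n', 'windows\n']

os.remove(demo_path)
os.rmdir(demo_dir)
```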

After a file object has been used, it should be closed using its close method:

>>> results_file.close()
>>> fasta_file.close()

Note

The file object is sometimes also called a “file handle”. This explains why it is common to see code that uses fh as an identifier when binding the return value of a call to open.

Now, we will see what you can do with the file object between the open and close operations.

9.1.2. File object methods and attributes

Note

In the following function description, square brackets are used to indicate optional arguments, not lists. This is actually a common convention.

9.1.2.1. read([size])

Read at most size characters from the file in text mode, or size bytes in binary mode (fewer if the read hits EOF first). If the size argument is negative or omitted, read all data until EOF is reached. The data are returned as a str object in text mode and as a bytes object in binary mode.
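Reading with a size argument makes it possible to process a file piece by piece. The following sketch assumes a scratch file read_demo.txt and a chunk size of 4, both chosen arbitrarily:

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "read_demo.txt")

with open(demo_path, "w") as fh:
    fh.write("ATGCATGCAT")

chunks = []
with open(demo_path) as fh:
    while True:
        chunk = fh.read(4)  # at most 4 characters per call
        if not chunk:       # an empty string means EOF was reached
            break
        chunks.append(chunk)

print(chunks)  # ['ATGC', 'ATGC', 'AT']

os.remove(demo_path)
os.rmdir(demo_dir)
```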

Note

“EOF” stands for end of file.

9.1.2.2. readline([size])

Read one entire line from the file. A trailing newline character is kept in the returned string (but may be absent when a file ends with an incomplete line).

9.1.2.3. readlines([sizehint])

Read until EOF using readline and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read.
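The sketch below contrasts readline and readlines on a small file whose last line is incomplete. The file name lines_demo.txt and its contents are made up for the illustration.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "lines_demo.txt")

# The last line deliberately has no trailing newline
with open(demo_path, "w") as fh:
    fh.write(">seq1\nACGT\nTTGA")

with open(demo_path) as fh:
    header = fh.readline()      # first line, newline kept
    remaining = fh.readlines()  # all the lines that are left
    at_eof = fh.readline()      # nothing left: empty string

print(repr(header))  # '>seq1\n'
print(remaining)     # ['ACGT\n', 'TTGA']
print(repr(at_eof))  # ''

os.remove(demo_path)
os.rmdir(demo_dir)
```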

9.1.2.4. close()

Close the file. A closed file cannot be read or written any more. Any operation which requires that the file be open will raise a ValueError after the file has been closed. Calling close more than once is allowed.
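A short sketch of this behavior, assuming a scratch file named close_demo.txt:

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "close_demo.txt")

fh = open(demo_path, "w")
fh.write("done")
fh.close()
fh.close()  # closing twice is allowed and does nothing

try:
    fh.write("more")  # the file is closed: this must fail
    message = None
except ValueError as err:
    message = str(err)

print(message)

os.remove(demo_path)
os.rmdir(demo_dir)
```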

9.1.2.5. flush()

Flush the internal buffer (i.e. the string to be written in the file).

Note

In order to optimize I/O operations, Python does not necessarily write data to the file immediately. This is why the file object keeps a buffer. flush can be used to force the transmission of the content of the buffer to the operating system. However, the operating system can still hold its own buffer. Use os.fsync() to force it to really write the file.
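The two steps can be chained as in the sketch below. It assumes a scratch file flush_demo.txt; os.fsync takes the file descriptor, obtained with the fileno method of the file object.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "flush_demo.txt")

with open(demo_path, "w") as fh:
    fh.write("important result")
    fh.flush()             # Python buffer -> operating system
    os.fsync(fh.fileno())  # operating system buffer -> storage device
    # The data are on disk even though the file is still open
    size_on_disk = os.path.getsize(demo_path)

print(size_on_disk)

os.remove(demo_path)
os.rmdir(demo_dir)
```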

9.1.2.6. tell()

Return the current position of the file pointer.

9.1.2.7. seek(offset[, whence])

Set the file pointer position to offset, relative to a position that depends on the value of whence.

The whence argument is optional. Possible values are:

  • os.SEEK_SET or 0 (absolute file positioning, default);

  • os.SEEK_CUR or 1 (seek relative to the current position);

  • os.SEEK_END or 2 (seek relative to the end of the file).

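In Python 3, seek returns the new absolute position. In text mode, only offsets relative to the beginning of the file are allowed (apart from seek(0, 2), which jumps to the very end), so the example below uses binary mode. The "wb+" mode creates workfile, or truncates it if it already exists.

>>> f = open('workfile', 'wb+') #  '+' allows to both read and write the file
>>> f.write(b'0123456789abcdef')
16
>>> f.seek(5)     #  Go to the 6th byte in the file
5
>>> f.read(1)
b'5'
>>> f.seek(-3, 2) #  Go to the 3rd byte before the end
13
>>> f.read(1)
b'd'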

9.1.2.8. __next__()

A file object is its own iterator. When a file is used as an iterator, typically in a for loop, its __next__ method is called repeatedly (this is also the method that the built-in next function calls). It returns the next input line, or raises StopIteration when EOF is reached. This applies to files open for reading; trying to iterate over a file open only for writing raises an error.

f = open('foobar', 'r')
for line in f:
    do_something(line)
f.close()
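The iterator behavior can be sketched as follows, assuming a scratch file iter_demo.txt. The built-in next function consumes the first line, and the for loop then continues from the second one.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "iter_demo.txt")

with open(demo_path, "w") as fh:
    fh.write("first\nsecond\nthird\n")

lengths = []
with open(demo_path) as fh:
    first = next(fh)  # consume the first line explicitly
    for line in fh:   # the loop resumes at the second line
        lengths.append(len(line.rstrip("\n")))

print(repr(first))  # 'first\n'
print(lengths)      # [6, 5]

os.remove(demo_path)
os.rmdir(demo_dir)
```

Iterating over the file object reads one line at a time, so the whole file never has to fit in memory.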

9.1.2.9. write(string)

Write a string to the file and return the number of characters written. Due to buffering, the string may not actually show up in the file until the flush or close method is called.

9.1.2.10. writelines(sequence)

Write a sequence of strings to the file. The sequence can be any iterable object producing strings, typically a list of strings. There is no return value. (The name is intended to match readlines; writelines does not add line separators.)

Note

When dealing with large amounts of data, it may be impossible to hold them as a full list of strings in memory. In this case, one possibility is to use a generator that produces the strings to be written “on demand”.
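A minimal sketch of this approach, assuming a hypothetical generator function numbered_lines and a scratch file writelines_demo.txt:

```python
import os
import tempfile


def numbered_lines(count):
    """Yield the lines one at a time instead of building a full list."""
    for number in range(1, count + 1):
        # writelines adds no separator, so each string carries its own "\n"
        yield f"line {number}\n"


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "writelines_demo.txt")

with open(demo_path, "w") as fh:
    fh.writelines(numbered_lines(1000))

# Count the lines actually written
with open(demo_path) as fh:
    line_count = sum(1 for _ in fh)

print(line_count)

os.remove(demo_path)
os.rmdir(demo_dir)
```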

9.1.2.11. closed

Boolean indicating the current state of the file object. This is a read-only attribute; the close method changes the value.

9.1.2.12. name

The name of the file.
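Both attributes can be inspected as in this sketch, which assumes a scratch file named attributes_demo.txt:

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "attributes_demo.txt")

fh = open(demo_path, "w")
name_seen = fh.name        # the path that was given to open
closed_before = fh.closed  # False: the file is still open
fh.close()
closed_after = fh.closed   # True: close changed the attribute

print(name_seen, closed_before, closed_after)

os.remove(demo_path)
os.rmdir(demo_dir)
```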

9.2. Why it is important to close files

Python does not promise that it will close the files for you. The operating system does, when the program exits. In a very trivial program, that happens right away. But if your program does something else for a while, or repeats this sequence of steps dozens of times, you could run out of resources, or overwrite something.

The place to close the input file is right after you’re done reading the data, and before opening the output file.

Let us examine in practice how to close a file.

try:
    f = open('/tmp/foo', 'r')
except IOError as err:
    print("cannot open")

for line in f:
    fields = line.split()
    id = fields[1]
    comment = fields[2]

In this piece of code we protect the open in a try/except, but what happens if an error occurs during the data processing, for instance if one line does not have a fields[2]? An error is raised but the file is not closed. If this piece of code is enclosed in a larger try/except, the file will stay open until the end of the script.

try:
    f = open('/tmp/foo', 'r')
    for line in f:
        fields = line.split()
        id = fields[1]
        comment = fields[2]
except IOError as err:
    print("cannot open")
except Exception:
    f.close()

Now we catch all errors, and we close the file if an error occurs during the data reading or processing, but the file is not closed if the processing finishes normally. So we can improve our code again:

try:
    f = open('/tmp/foo', 'r')
    for line in f:
        fields = line.split()
        id = fields[1]
        comment = fields[2]
except IOError as err:
    print("cannot open")
finally:
    f.close()

This works when the file can be opened (if open itself fails, f is never assigned and the finally clause raises a NameError), but we have to write technical code (try/except/finally), and the close can end up far away from the open. This does not help the readability of the code.

Python introduced the notion of context manager in Python 2.6 with the with keyword. This is a generalization of:

try:
    use a resource
except Exception as err:
    rollback
else:
    commit
finally:
    close resource

The code to open a file using a with becomes:

with open('/tmp/foo', 'r') as f:
    for line in f:
        fields = line.split()
        id = fields[1]
        comment = fields[2]

It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter and more readable than writing equivalent try-finally blocks.
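The claim that the file is closed even when an exception is raised can be checked with a small sketch. The scratch file with_demo.txt and its single, too short line are assumptions made for the illustration.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "with_demo.txt")

# Only two fields on the line, so fields[2] does not exist
with open(demo_path, "w") as fh:
    fh.write("id1 comment\n")

error_seen = False
try:
    with open(demo_path) as f:
        for line in f:
            fields = line.split()
            comment = fields[2]  # raises IndexError
except IndexError:
    error_seen = True

# The with block was left through an exception, yet the file is closed
closed_after_error = f.closed
print(error_seen, closed_after_error)

os.remove(demo_path)
os.rmdir(demo_dir)
```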

9.3. Helpful functions to manipulate files

9.3.1. os.path

The os.path module provides a lot of useful functions to manipulate paths.

9.3.1.1. os.path.abspath(path)

Return a normalized absolutized version of the pathname path.

Note

An absolute path is one starting at the root of the directory hierarchy, in contrast with relative paths, which are given relative to the current working directory.

9.3.1.2. os.path.exists(path)

Return True if path refers to an existing path. Returns False for broken symbolic links. On some platforms, this function may return False if permission is not granted to execute os.stat on the requested file, even if the path physically exists.

9.3.1.3. os.path.realpath(path)

Return the canonical path of the specified filename, eliminating any symbolic links encountered in the path (if they are supported by the operating system).

9.3.1.4. os.path.join(path1[, path2[, …]])

Join one or more path components intelligently. If any component is an absolute path, all previous components (on Windows, including the previous drive letter, if there was one) are thrown away, and joining continues. The return value is the concatenation of path1, and optionally path2, etc., with exactly one directory separator (os.sep) following each non-empty part except the last.
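The following sketch exercises join, abspath and exists. The path components ("data", "results", "out.txt", "tmp") are arbitrary examples and do not need to exist on disk.

```python
import os

# Components are joined with the separator of the current platform
joined = os.path.join("data", "results", "out.txt")

# An absolute component discards everything that precedes it
restarted = os.path.join("data", os.sep + "tmp", "out.txt")

# abspath prepends the current working directory to a relative path
absolute = os.path.abspath(joined)

print(joined)
print(restarted)
print(absolute)
print(os.path.exists(os.getcwd()))  # the working directory exists
```

Building paths with os.path.join rather than by concatenating strings keeps the program portable between operating systems.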

For complete descriptions: https://docs.python.org/3/library/os.path.html#module-os.path

9.3.2. os

The os module provides functions to manipulate files and directories:

9.3.2.2. os.rename(src, dst)

Rename the file or directory src to dst. If dst is a directory, OSError will be raised. On Unix, if dst exists and is a file, it will be replaced silently if the user has permission.

9.3.2.3. os.mkdir(path[, mode])

Create a directory named path with numeric mode mode. The default permission mode is 0o777 (octal). On some systems, mode is ignored. Where it is used, the current umask value is first masked out. If the directory already exists, OSError is raised.

Note

umask is a value that determines default permissions for files created by the user.

9.3.2.4. os.makedirs(path[, mode[, exist_ok]])

Recursive directory creation function. Like mkdir, but makes all intermediate-level directories needed to contain the leaf directory. If exist_ok is set to True, the function silently accepts a target directory that already exists. Otherwise (default), an OSError is raised.

9.3.2.5. os.getcwd()

Return a string representing the current working directory.

9.3.2.6. os.listdir(path)

Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries '.' and '..' even if they are present in the directory.

9.3.2.7. os.rmdir(path)

Remove (delete) the directory path. Only works when the directory is empty, otherwise, OSError is raised.

9.3.2.8. os.removedirs(path)

Remove directories recursively. Works like rmdir except that, if the leaf directory is successfully removed, removedirs tries to successively remove every parent directory mentioned in path until an error is raised (which is ignored, because it generally means that a parent directory is not empty).
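The directory functions above can be tried safely inside a temporary directory. In this sketch, the names base, results, run1, run2 and keep.txt are hypothetical; the file keep.txt is only there to show where removedirs stops.

```python
import os
import tempfile

base = tempfile.mkdtemp()
# Sentinel file: keeps base non-empty so that removedirs stops there
sentinel = os.path.join(base, "keep.txt")
with open(sentinel, "w") as fh:
    fh.write("placeholder")

nested = os.path.join(base, "results", "run1")
os.makedirs(nested)                 # creates "results", then "run1"
os.makedirs(nested, exist_ok=True)  # already there: silently accepted

try:
    os.mkdir(nested)                # mkdir refuses an existing directory
    mkdir_failed = False
except OSError:
    mkdir_failed = True

renamed = os.path.join(base, "results", "run2")
os.rename(nested, renamed)
entries = os.listdir(os.path.join(base, "results"))
print(entries)  # ['run2']

# Removes "run2", then "results"; stops at base, which still holds keep.txt
os.removedirs(renamed)
results_left = os.path.exists(os.path.join(base, "results"))
base_left = os.path.exists(base)

# Clean up the scratch directory
os.remove(sentinel)
os.rmdir(base)
```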

For all the functions and their complete descriptions: https://docs.python.org/3/library/os.html#files-and-directories

9.3.3. shutil

If you have to remove a non-empty directory tree, use rmtree from the shutil module.

9.3.3.1. shutil.rmtree(path[, ignore_errors[, onerror]])

Delete an entire directory tree; path must point to a directory (but not a symbolic link to a directory).

https://docs.python.org/3/library/shutil.html#shutil.rmtree
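The difference with os.rmdir can be sketched as follows, in a temporary directory. The tree project/data/seq.txt is a made-up example.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()
tree = os.path.join(base, "project")
os.makedirs(os.path.join(tree, "data"))
with open(os.path.join(tree, "data", "seq.txt"), "w") as fh:
    fh.write("ACGT\n")

try:
    os.rmdir(tree)  # fails: the directory is not empty
    rmdir_failed = False
except OSError:
    rmdir_failed = True

shutil.rmtree(tree)  # removes the directory and everything below it
tree_exists = os.path.exists(tree)
print(rmdir_failed, tree_exists)

os.rmdir(base)
```

Because rmtree deletes everything below the given path without confirmation, it is worth double-checking the path before calling it.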

Here is an example illustrating typical work with files. We read a file containing several numbers per line, and write the average of each line to a file named ‘my_script.out’ in a new directory ‘results’:

import os

def average(input_filename):
    """
    compute the average of the numbers on each line of file input_filename
    and write the results in file results/my_script.out

    :param input_filename: the path to the file containing data
    :type input_filename: string
    """
    here = os.getcwd()
    result_dir = os.path.join(here, 'results')
    if not os.path.exists(result_dir):
        os.mkdir(result_dir)
    result_path = os.path.join(result_dir, 'my_script.out')
    input_path = os.path.realpath(input_filename)
    with open(input_path, 'r') as inputs:
        with open(result_path, 'w') as output:
            for line in inputs:
                fields = line.split()
                if not fields:
                    # skip empty lines
                    continue
                numbers = [float(item) for item in fields]
                ave = sum(numbers) / len(numbers)
                output.write(str(ave) + "\n")

average('average_inputs')

script average.py

9.4. Exercises

9.4.1. Exercise

Write a function that takes the path of a file as parameter and displays its content on the screen.

We expect the same behavior as the shell cat command.

9.4.2. Exercise

Write a function that takes the path of a file in rebase format (see example file below) and returns the collection of the enzymes contained in the file, as a dictionary. The sequence of the binding site must be cleaned up.

Example file: rebase_light.txt

9.4.3. Exercise

Write a function that takes the path of a fasta file and returns a data structure of your choice that stores the id of the sequence and the sequence itself.

Example file: seq.fasta

9.4.4. Exercise

Modify the code of the previous exercise to read a fasta file containing multiple sequences. Use the file abcd.fasta to test your code.

9.4.5. Exercise

Read a multiple sequence file in fasta format and write to new files, one sequence per file, only the sequences starting with methionine and containing at least six tryptophans (W).

Use the same file as in the previous exercise to test your code. (You should create files for sequences: ABCD1_HUMAN, ABCD1_MOUSE, ABCD2_HUMAN, ABCD2_MOUSE, ABCD2_RAT, ABCD4_HUMAN, ABCD4_MOUSE)

9.4.5.1. bonus

Write sequences with 80 aa per line.

9.4.6. Exercise

We ran a blast with the following command blastall -p blastp -d uniprot_sprot -i query_seq.fasta -e 1e-05 -m 8 -o blast2.txt

-m 8 is to ask for tabular output: fields are separated from one another by a tab character (‘\t’).

The fields are: query id, database sequence (subject) id, percent identity, alignment length, number of mismatches, number of gap openings, query start, query end, subject start, subject end, Expect value, HSP bit score.

Example file: blast2.txt

  • Parse the file.

  • Sort the hits by their percent identity in the descending order.

  • Write the results in a new file.

(adapted from Managing Your Biological Data with Python)

9.4.6.1. Hint:

Have a look at operator.itemgetter on the Python website.
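As a pointer, here is a generic sketch of how operator.itemgetter can serve as a sort key. The records list and its values are invented for the illustration and are unrelated to the blast file.

```python
from operator import itemgetter

# Hypothetical records: (name, score)
records = [("geneA", 12.5), ("geneB", 98.0), ("geneC", 47.3)]

# itemgetter(1) builds a function that returns the element at index 1
by_score = sorted(records, key=itemgetter(1), reverse=True)

print(by_score)  # [('geneB', 98.0), ('geneC', 47.3), ('geneA', 12.5)]
```

Fields read from a text file are strings, so they generally need to be converted to numbers before a numeric sort.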

9.4.7. Exercise

  • Parse the files exp1.csv and exp2.csv (create a function to parse a file and keep only the following fields: GenAge ID, symbol, name, entrez gene id, uniprot)

  • Get the genes which are in exp1 but not in exp2, comparing them on the uniprot identifier. The two files are in csv format.

  • Write the result in a file in csv format.

9.4.7.1. Hint:

Use the csv Python module: https://docs.python.org/3/library/csv.html#module-csv

Use a reader as below:

>>> reader = csv.reader(input, quotechar='"')

9.5. Command-line arguments

A function can vary its behavior based on the value given to its arguments when it is called.

Similarly, a full program can vary its behavior based on values given by the user on the command line when executing it. This can be used to specify the names of input and output files, or select options or parameters.

You can have access to the command line arguments in your Python program using the sys.argv list (you need to import the sys module to access it):

#!/usr/bin/env python3
"""
This program demonstrates some useful features of the sys module.

It should run in Python 3.6 or above.
"""

import sys

# sys.argv is a list of strings representing the command line.
# The first element in the list is the program name:
prog_name = sys.argv[0]

# Then come the command line arguments:
input_filename = sys.argv[1]
if len(sys.argv) > 2:
    output_filename = sys.argv[2]
else:
    output_filename = None

# Using f-strings (Python 3.6 or above needed)
print(f"I'm program {prog_name}.")
print(f"I will read file {input_filename}.")

# The with context manager will ensure the input file will be closed
with open(input_filename) as input_file:
    # Stays None when writing to standard output,
    # or when the output file could not be opened
    opened_output = None
    try:
        if output_filename is not None:
            opened_output = open(output_filename, "w")
            output_file = opened_output
            print(f"I will write to file {output_filename}.")
        else:
            # sys.stdout can be used like a file object,
            # but text will be printed to the screen
            output_file = sys.stdout
            print("I will write to standard output.")
        # Copying the input line by line to the output
        for line in input_file:
            output_file.write(line)
    except IOError as err:
        print(f"Problem while trying to write output.\n{err}")
        # Exit with non-zero code to signal a problem
        # (the finally clause below still runs)
        sys.exit(1)
    finally:
        if opened_output is not None:
            # Ensure the output file is closed
            opened_output.close()

sys.exit(0)

script command_line.py

Here is how you would use this program with two arguments:

$ ./command_line.py test_table.txt test_output.txt
I'm program ./command_line.py.
I will read file test_table.txt.
I will write to file test_output.txt.

And with only one argument:

$ ./command_line.py test_table.txt
I'm program ./command_line.py.
I will read file test_table.txt.
I will write to standard output.
A B C
1 2 3
4 5 6
7 8 9