.. _Input_Output:

************
Input/Output
************

Most of the programs we have seen so far are transient in the sense that
they run for a short time and produce some output,
but when they end, their data disappears.
If you run the program again, it starts in a "clean" state.

Other programs are persistent: they run for a long time (or all the time);
they keep at least some of their data in permanent storage (a hard drive, for instance);
and if they shut down and restart, they pick up where they left off.

Examples of persistent programs are operating systems, which run pretty much whenever a computer is on,
and web servers, which run all the time, waiting for requests to come in on the network.

One of the simplest ways for programs to maintain their data is by reading and writing text files.
In this chapter we will see programs that read and write text files.

An alternative is to store the state of the program in a database.
This aspect of persistent storage will not be covered in this course.

File Object
===========

A text file is a sequence of characters stored on a permanent medium like a hard drive, flash memory, or CD-ROM.
You do not manipulate these media directly.
By default, Python provides the basic functions and methods needed to manipulate files.
You can do most file manipulation using a **file object**.
A file object is a high-level representation of a file that abstracts away the medium
on which the data are stored, and provides methods to interact with the file.

Files are like books: you open them to start working, read or write in them,
and close them when you have finished your work.
At any time, however, you need to know where you are in the book.
Just as children follow the characters on a page with a finger when they learn to read,
the file object maintains a **file pointer**, which indicates the current position in the file.

Opening a file
--------------

The first step is to get a file object.
This is done using the ``open`` built-in function, which returns a file object.
The first two arguments of this function are ``file``, a string indicating the path to the file
you want to read or write, and ``mode``, a string indicating the "access mode",
that is, how the file is to be opened.

The ``open`` function accepts more arguments, for instance to specify the file encoding:
https://docs.python.org/3/library/functions.html#open

The most commonly used values for access mode are the following:

.. tabularcolumns:: |p{1cm}|p{7cm}|p{1cm}|p{7cm}|

+-----------+-----------------------------------------------------------------+
| Character | Meaning                                                         |
+===========+=================================================================+
| 'r'       | open for reading (default)                                      |
+-----------+-----------------------------------------------------------------+
| 'w'       | open for writing, truncating the file first                     |
+-----------+-----------------------------------------------------------------+
| 'a'       | open for writing, appending to the end of the file if it exists |
+-----------+-----------------------------------------------------------------+
| '+'       | enable both reading and writing, in combination with r, w or a  |
+-----------+-----------------------------------------------------------------+
| 'b'       | binary mode                                                     |
+-----------+-----------------------------------------------------------------+
| 't'       | text mode (default)                                             |
+-----------+-----------------------------------------------------------------+

These letters are combined to indicate whether you will read from or write to the file,
and whether the data is text or binary.
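
As a minimal sketch of how these mode letters behave, the following code writes, appends to,
and reads back a scratch file. The file name ``modes_demo.txt`` and the temporary directory
are assumptions made for the illustration, and each file object is closed explicitly
with its ``close`` method once it is no longer needed.

.. code-block:: python

   import os
   import tempfile

   # Hypothetical scratch file, created in a fresh temporary directory.
   demo_dir = tempfile.mkdtemp()
   demo_path = os.path.join(demo_dir, "modes_demo.txt")

   # "w" creates the file (or truncates it if it already exists).
   fh = open(demo_path, "w")
   fh.write("first line\n")
   fh.close()

   # "a" keeps the existing content and writes at the end of the file.
   fh = open(demo_path, "a")
   fh.write("second line\n")
   fh.close()

   # "r" (the default) opens the file for reading, in text mode.
   fh = open(demo_path, "r")
   text_content = fh.read()
   fh.close()

   # "rb" reads the same file as raw bytes, without any decoding.
   fh = open(demo_path, "rb")
   binary_content = fh.read()
   fh.close()

   print(repr(text_content))
   print(type(binary_content))

Opening the file a second time with ``"w"`` instead of ``"a"`` would have discarded the first line.
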

To read a simple text file, you should use the ``"rt"`` mode,
which you can actually omit, because it corresponds to the default values::

   >>> fasta_file = open("data/sv40.fasta")
   >>> type(fasta_file)
   <class '_io.TextIOWrapper'>

The above example assumes that you have an existing ``sv40.fasta`` file
in the ``data`` subdirectory of your current working directory.

If you try to open in "read mode" a file that does not exist,
a ``FileNotFoundError`` exception is raised::

   >>> wrong_path_fasta_file = open("data/sv41.fasta")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   FileNotFoundError: [Errno 2] No such file or directory: 'data/sv41.fasta'

If the file cannot be opened because of inadequate permissions on the file system,
a ``PermissionError`` exception is raised::

   >>> unreadable_file = open("/root/.profile")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   PermissionError: [Errno 13] Permission denied: '/root/.profile'

In the above example, the user running the Python program happens
not to have read permission on the existing ``/root`` folder.
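
A program can also catch these two exceptions itself instead of stopping with a traceback.
The following is only a sketch: the helper name ``describe_open_failure`` and the strings
it returns are hypothetical choices made for this illustration.

.. code-block:: python

   def describe_open_failure(path):
       """Try to open ``path`` for reading and report what happened."""
       try:
           fh = open(path)
       except FileNotFoundError:
           # The path does not point to an existing file.
           return "missing"
       except PermissionError:
           # The file exists, but we are not allowed to read it.
           return "forbidden"
       # The file was opened successfully: close it before returning.
       fh.close()
       return "ok"

   print(describe_open_failure("data/sv41.fasta"))

Each ``except`` clause handles one of the two situations described above;
any other exception is left to propagate normally.
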

If you want to create a new text file or overwrite an existing one, you should use the ``"wt"`` mode,
but you can omit the ``"t"``, because it is the default value::

   >>> results_file = open("data/results.txt", "w")
   >>> type(results_file)
   <class '_io.TextIOWrapper'>

A ``FileNotFoundError`` exception will be raised if the directory
in which you want to write the file does not exist::

   >>> wrong_path_results_file = open("dato/results.txt", "w")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   FileNotFoundError: [Errno 2] No such file or directory: 'dato/results.txt'

And a ``PermissionError`` will occur if you do not have write permission on the destination directory::

   >>> unwritable_file = open("/etc/test_forbidden.txt", "w")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   PermissionError: [Errno 13] Permission denied: '/etc/test_forbidden.txt'

The above example assumes that the ``/etc`` directory exists,
but that the user does not have write permission on this part of the file system.

If you opened the file using ``"r"`` or ``"w"``, the file object will start its life
with the file pointer at the beginning of the file.
If you used ``"a"``, the file pointer will be at the end.

It is possible to obtain a file object able to both read and write,
by using ``"r+"``, ``"w+"`` or ``"a+"``.

For more details, you can refer to `this discussion `_
about the exact effects of these modes and mode combinations.

Python distinguishes between binary and text I/O.
Files opened in binary mode (including ``"b"`` in the mode argument)
return their contents as ``bytes`` objects without any decoding.
In text mode (the default, or when ``"t"`` is included in the mode argument),
the contents of the file are represented as ``str``.

.. note::
   Python 3 has "universal newlines" support for files opened in text mode.
   This means that lines may be terminated by any of the following:
   the Unix end-of-line convention '\\n', the (old) Macintosh convention '\\r',
   or the Windows convention '\\r\\n'.
   All of these external representations are seen as '\\n' by the Python program.
   For more details about newlines: https://en.wikipedia.org/wiki/Newline

After a file object has been used, it should be closed using its ``close`` method::

   >>> results_file.close()
   >>> fasta_file.close()

.. note::
   The file object is sometimes also called a "file handle".
   This explains why it is common to see code that uses ``fh`` as an identifier
   when binding the return value of a call to ``open``.

Now, we will see what you can do with the file object between the ``open`` and ``close`` operations.

File object methods and attributes
----------------------------------

.. note::
   In the following function descriptions, square brackets are used
   to indicate optional arguments, not lists.
   This is a common convention.

read([size])
^^^^^^^^^^^^

Read at most ``size`` characters (in text mode) or bytes (in binary mode) from the file
(fewer if the read hits EOF before obtaining ``size`` of them).
If the ``size`` argument is negative or omitted, read all data until EOF is reached.
The data is returned as a ``str`` in text mode and as a ``bytes`` object in binary mode.

.. note::
   "EOF" stands for `end of file `_.

readline([size])
^^^^^^^^^^^^^^^^

Read one entire line from the file.
A trailing newline character is kept in the returned string
(but may be absent when a file ends with an incomplete line).
If ``size`` is given, at most ``size`` characters (or bytes, in binary mode) of the line are read.

readlines([sizehint])
^^^^^^^^^^^^^^^^^^^^^

Read until EOF using ``readline`` and return a list containing the lines thus read.
If the optional ``sizehint`` argument is present, instead of reading up to EOF,
whole lines totalling approximately ``sizehint`` bytes
(possibly after rounding up to an internal buffer size) are read.

close()
^^^^^^^

Close the file. A closed file cannot be read or written any more.
Any operation which requires that the file be open will raise a ``ValueError``
after the file has been closed.
Calling ``close`` more than once is allowed.

flush()
^^^^^^^

Flush the internal buffer (i.e. the data waiting to be written to the file).

.. note::
   In order to optimize I/O operations, Python does not necessarily write data to the file immediately.
   This is why the file object keeps a buffer.
   ``flush`` can be used to force the transmission of the content of the buffer to the operating system.
   However, the operating system can still hold its own buffer.
   Use ``os.fsync()`` to force it to really write the file to disk.

tell()
^^^^^^

Return the current position of the file pointer.

seek(offset[, whence])
^^^^^^^^^^^^^^^^^^^^^^

Set the file pointer position to ``offset``, relative to a position
that depends on the value of ``whence``.
The ``whence`` argument is optional. Possible values are:

* ``os.SEEK_SET`` or ``0`` (absolute file positioning, default);
* ``os.SEEK_CUR`` or ``1`` (seek relative to the current position);
* ``os.SEEK_END`` or ``2`` (seek relative to the end of the file).

``seek`` returns the new absolute position.
In text mode, only seeks relative to the beginning of the file are allowed
(the exception being seeking to the very end of the file with ``seek(0, 2)``),
so the example below opens an existing file in binary mode.

::

   >>> f = open('workfile', 'rb+')  # '+' allows both reading and writing the file
   >>> f.write(b'0123456789abcdef')
   16
   >>> f.seek(5)      # Go to the 6th byte in the file
   5
   >>> f.read(1)
   b'5'
   >>> f.seek(-3, 2)  # Go to the 3rd byte before the end
   13
   >>> f.read(1)
   b'd'

__next__()
^^^^^^^^^^

A file object is its own iterator.
When a file is used as an iterator, typically in a ``for`` loop,
the ``__next__`` method is called repeatedly.
This method returns the next input line, or raises ``StopIteration`` when EOF is hit.
The file must be open for reading::

   f = open('foobar', 'r')
   for line in f:
       print(line, end='')  # each line keeps its trailing newline
   f.close()

write(string)
^^^^^^^^^^^^^

Write a string to the file and return the number of characters written.
Due to buffering, the string may not actually show up in the file
until the ``flush`` or ``close`` method is called.

writelines(sequence)
^^^^^^^^^^^^^^^^^^^^

Write a sequence of strings to the file.
The sequence can be any iterable object producing strings, typically a list of strings.
There is no return value.
(The name is intended to match ``readlines``; ``writelines`` does not add line separators.)

.. note::
   When dealing with a large amount of data, it may be impossible to hold it
   as a full list of strings in memory.
   In this case, a possibility is to use a **generator** that will produce
   the strings to be written "on demand".

closed
^^^^^^

Boolean indicating the current state of the file object.
This is a read-only attribute; the ``close`` method changes the value.

name
^^^^

The name of the file.

Why it is important to close files
==================================

Python does not promise that it will close the files for you.
The operating system does, when the program exits.
In a very short program, that happens right away.
But if your program does something else for a while, or repeats this sequence of steps dozens of times,
you could **run out** of resources, or **overwrite** something.

The place to close the input file is right after you are done reading the data,
and before opening the output file.

Let us examine in practice how to close a file.

.. code-block:: python

   try:
       f = open('/tmp/foo', 'r')
   except IOError as err:
       print("cannot open")
   for line in f:
       fields = line.split()
       id = fields[1]
       comment = fields[2]

In this piece of code we protect the call to ``open`` with a ``try``/``except``,
but what happens if an error occurs during the data processing,
for instance if a line has no ``fields[2]``?
An error is raised, but the file is not closed.
If this piece of code is enclosed in a larger ``try``/``except``,
the file will stay open until the end of the script.

.. code-block:: python

   try:
       f = open('/tmp/foo', 'r')
       for line in f:
           fields = line.split()
           id = fields[1]
           comment = fields[2]
   except IOError as err:
       print("cannot open")
   except Exception:
       f.close()

Now we catch all errors, and we close the file if an error occurs while reading or processing the data,
but the file is not closed if the process finishes normally.
So we can improve our code again:

.. code-block:: python

   try:
       f = open('/tmp/foo', 'r')
       for line in f:
           fields = line.split()
           id = fields[1]
           comment = fields[2]
   except IOError as err:
       print("cannot open")
   finally:
       f.close()

This closes the file whether or not an error occurred
(provided the file could be opened in the first place),
but we have to write technical code (``try``/``except``/``finally``)
and the ``close`` can be far away from the ``open``.
This does not contribute positively to the readability of the code.

Context managers, used through the ``with`` keyword, have been available by default since Python 2.6.
The ``with`` statement is a generalization of the following pattern:

.. code-block:: python

   try:
       ...  # use a resource
   except Exception as err:
       ...  # roll back
       raise
   else:
       ...  # commit
   finally:
       ...  # close the resource

The code to open a file using a ``with`` becomes::

   with open('/tmp/foo', 'r') as f:
       for line in f:
           fields = line.split()
           id = fields[1]
           comment = fields[2]

It is **good practice** to use the ``with`` keyword when dealing with file objects.
This has the advantage that the file is properly closed after its suite finishes,
even if an exception is raised on the way.
It is also much shorter and more readable than writing equivalent *try-finally* blocks.

Helpful functions to manipulate files
=====================================

os.path
-------

| The ``os.path`` module provides a lot of useful functions to manipulate paths.
| (Official documentation at: https://docs.python.org/3/library/os.path.html)

os.path.abspath(path)
^^^^^^^^^^^^^^^^^^^^^

Return a normalized absolutized version of the pathname ``path``.

.. note::
   An absolute path is one starting at the root of the directory hierarchy,
   in contrast with relative paths, which are given relative to the current working directory.

.. TODO: os.path.basename is also useful

os.path.exists(path)
^^^^^^^^^^^^^^^^^^^^

Return ``True`` if ``path`` refers to an existing path.
Returns ``False`` for broken symbolic links.
On some platforms, this function may return ``False`` if permission is not granted
to execute ``os.stat`` on the requested file, even if the path physically exists.
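
As a small sketch combining ``os.path.exists`` and ``os.path.abspath``, the following helper
returns the absolute version of a path only if something exists there.
The name ``resolve_existing`` and the choice of returning ``None`` are assumptions
made for this illustration.

.. code-block:: python

   import os.path

   def resolve_existing(path):
       """Return the absolute version of ``path``, or None if nothing exists there."""
       if not os.path.exists(path):
           return None
       return os.path.abspath(path)

   # "." always designates the current working directory.
   print(resolve_existing("."))
   # A path that (presumably) does not exist on your machine.
   print(resolve_existing("no_such_directory/no_such_file.txt"))
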

os.path.realpath(path)
^^^^^^^^^^^^^^^^^^^^^^

Return the canonical path of the specified filename, eliminating any symbolic links
encountered in the path (if they are supported by the operating system).

os.path.join(path1[, path2[, ...]])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Join one or more path components intelligently.
If any component is an absolute path, all previous components
(on Windows, including the previous drive letter, if there was one) are thrown away,
and joining continues.
The return value is the concatenation of ``path1``, and optionally ``path2``, etc.,
with exactly one directory separator (``os.sep``) following each non-empty part except the last.

For complete descriptions: https://docs.python.org/3/library/os.path.html#module-os.path

os
--

The ``os`` module provides functions to manipulate files:

os.remove(path) or os.unlink(path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remove (delete) the file ``path``.
If ``path`` is a directory, ``OSError`` is raised.
See ``rmdir`` below to remove a directory.

os.rename(src, dst)
^^^^^^^^^^^^^^^^^^^

Rename the file or directory ``src`` to ``dst``.
If ``dst`` is a directory, ``OSError`` will be raised.
On Unix, if ``dst`` exists and is a file, it will be replaced silently if the user has permission.

os.mkdir(path[, mode])
^^^^^^^^^^^^^^^^^^^^^^

Create a directory named ``path`` with numeric mode ``mode``.
The default permission mode is ``0o777`` (octal).
On some systems, ``mode`` is ignored.
Where it is used, the current ``umask`` value is first masked out.
If the directory already exists, ``FileExistsError`` (a subclass of ``OSError``) is raised.

.. note::
   ``umask`` is a value that determines default permissions for files created by the user.

os.makedirs(path[, mode[, exist_ok]])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Recursive directory creation function.
Like ``mkdir``, but makes all intermediate-level directories needed to contain the leaf directory.
If ``exist_ok`` is set to ``True``, the function will silently ignore
already existing directories in ``path``.
Otherwise (default), an ``OSError`` is raised.

os.getcwd()
^^^^^^^^^^^

Return a string representing the current working directory.

os.listdir(path)
^^^^^^^^^^^^^^^^

Return a list containing the names of the entries in the directory given by ``path``.
The list is in arbitrary order.
It does not include the special entries ``'.'`` and ``'..'`` even if they are present in the directory.

os.rmdir(path)
^^^^^^^^^^^^^^

Remove (delete) the directory ``path``.
This only works when the directory is empty; otherwise, ``OSError`` is raised.

os.removedirs(path)
^^^^^^^^^^^^^^^^^^^

Remove directories recursively.
Works like ``rmdir`` except that, if the leaf directory is successfully removed,
``removedirs`` tries to successively remove every parent directory mentioned in ``path``
until an error is raised (which is ignored, because it generally means
that a parent directory is not empty).

For the full list of functions and their complete descriptions:
https://docs.python.org/3/library/os.html#files-and-directories

shutil
------

If you have to remove a non-empty directory tree, use ``rmtree`` from the ``shutil`` module.

shutil.rmtree(path[, ignore_errors[, onerror]])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Delete an entire directory tree; ``path`` must point to a directory
(but not a symbolic link to a directory).

https://docs.python.org/3/library/shutil.html#shutil.rmtree

Here is an example to illustrate typical work with files.
In this example we read a file containing several numbers per line,
and write the average in a file named 'my_script.out' in a new directory 'results':

.. literalinclude:: _static/code/average.py
   :linenos:
   :language: python

:download:`script average.py<_static/code/average.py>`

Exercises
=========

Exercise
--------

Write a function that takes the path of a file as parameter and displays its content on the screen.
We expect the same behavior as the shell ``cat`` command.

Exercise
--------

Write a function that takes the path of a file in rebase format (see example file below)
and returns the collection of the enzymes contained in the file, as a dictionary.
The sequence of the binding site must be cleaned up.

Example file: :download:`rebase_light.txt <_static/data/rebase_light.txt>`

Exercise
--------

Write a function that takes the path of a fasta file and returns a data structure of your choice
that allows you to store the id of the sequence and the sequence itself.

Example file: :download:`seq.fasta <_static/data/seq.fasta>`

Exercise
--------

Modify the code of the previous exercise to read a fasta file containing multiple sequences.
Use the file :download:`abcd.fasta <_static/data/abcd.fasta>` to test your code.

Exercise
--------

Read a multiple sequence file in fasta format and write to new files, one sequence per file,
only the sequences starting with methionine and containing at least six tryptophans (W).

Use the same file as in the previous exercise to test your code.

(*you should create files for sequences: ABCD1_HUMAN, ABCD1_MOUSE, ABCD2_HUMAN, ABCD2_MOUSE, ABCD2_RAT, ABCD4_HUMAN, ABCD4_MOUSE*)

bonus
^^^^^

Write sequences with 80 aa per line.

Exercise
--------

We ran a blast with the following command
*blastall -p blastp -d uniprot_sprot -i query_seq.fasta -e 1e-05 -m 8 -o blast2.txt*

The ``-m 8`` option asks for tabular output: fields are separated from one another by a ``'\t'``.

The fields are: query id, database sequence (subject) id, percent identity, alignment length,
number of mismatches, number of gap openings, query start, query end, subject start, subject end,
Expect value, HSP bit score.

:download:`blast2.txt <_static/data/blast2.txt>`.

* Parse the file.
* Sort the hits by their *percent identity* in descending order.
* Write the results in a new file.

(adapted from *managing your biological data with python*)

Hint:
^^^^^

Have a look at ``operator.itemgetter`` on the Python website.

Exercise
--------

* Parse the files exp1.csv and exp2.csv
  (:download:`exp1.csv <_static/data/exp1.csv>`, :download:`exp2.csv <_static/data/exp2.csv>`)
  (create a function to parse a file and keep only the following fields:
  GenAge ID, symbol, name, entrez gene id, uniprot)
* Get the genes which are in exp1 but not in exp2.
  The two files are in csv format; base the comparison on the uniprot identifier.
* Write the result in a file in csv format.

Hint:
^^^^^

Use the ``csv`` Python module: https://docs.python.org/3/library/csv.html#module-csv

Use a reader as below::

   >>> reader = csv.reader(input, quotechar='"')

Command-line arguments
======================

A function can vary its behavior based on the values given to its arguments when it is called.
Similarly, a full program can vary its behavior based on values given by the user
on the command line when executing it.
This can be used to specify the names of input and output files, or to select options or parameters.

You can access the command-line arguments in your Python program using the ``sys.argv`` list
(you need to import the ``sys`` module to access it):

.. literalinclude:: _static/code/command_line.py
   :linenos:
   :language: python

:download:`script command_line.py<_static/code/command_line.py>`

Here is how you would use this program with two arguments::

   $ ./command_line.py test_table.txt test_output.txt
   I'm program ./command_line.py.
   I will read file test_table.txt.
   I will write to file test_output.txt.

And with only one argument::

   $ ./command_line.py test_table.txt
   I'm program ./command_line.py.
   I will read file test_table.txt.
   I will write to standard output.
   A B C
   1 2 3
   4 5 6
   7 8 9
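
The following is only a rough sketch of how a program might inspect ``sys.argv``
to produce messages like those shown above; it is not the content of ``command_line.py``.
The function name ``describe_arguments`` is hypothetical, and the sketch assumes
that the first argument is the input file and the optional second one the output file.

.. code-block:: python

   import sys

   def describe_arguments(argv):
       """Build the messages announcing what a program would do with ``argv``."""
       # argv[0] is the name of the program itself.
       messages = ["I'm program {}.".format(argv[0])]
       if len(argv) > 1:
           messages.append("I will read file {}.".format(argv[1]))
       if len(argv) > 2:
           messages.append("I will write to file {}.".format(argv[2]))
       else:
           # No output file given: fall back to standard output.
           messages.append("I will write to standard output.")
       return messages

   if __name__ == "__main__":
       for message in describe_arguments(sys.argv):
           print(message)

Passing the argument list to a function, rather than reading ``sys.argv`` everywhere,
makes the logic easy to try out with a hand-written list.
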