TC, JBM, BN, AZ
Institut Pasteur, Paris, 20-31 March 2017

Motivation

  • Python is a general-purpose language. It was not designed for scientific computing.
  • NumPy is the core tools for performant numerical computing with Python. It is used by many other scientific libraries including SciPy, Pandas, matplotlib.

Installation

(in terminal)

conda install numpy

Convention

In [2]:
import numpy as np

Arrays

an array is like a sequence but contains a homogeneous collections of items (e.g. all integers)

In [5]:
import numpy as np
x = np.array([1, 2, 3])
x
Out[5]:
array([1, 2, 3])

NB1:

x is an instance of the object numpy.ndarray. The constructor takes as argument a sequence. Here we provided a list hence the ([ ]) syntax.

NB2: Following the previous nota bene about the syntax we have:

a = np.array(1, 2, 3, 4)    # WRONG
a = np.array([1, 2, 3, 4])  # RIGHT

NumPy provides fast and memory efficient data structures

Pure Python

In [3]:
l = range(1000000)
%time sum(l)
CPU times: user 19.9 ms, sys: 0 ns, total: 19.9 ms
Wall time: 19.8 ms
Out[3]:
499999500000

Numpy

In [7]:
x = np.array(l)
%time x.sum()
CPU times: user 1.34 ms, sys: 1.05 ms, total: 2.38 ms
Wall time: 1.32 ms
Out[7]:
499999500000

About 15 times faster

Another example:

How would you compute $\sum X_i^2$ given $X$ is a Python list ?

In [9]:
l = range(1000000)
%time sum([x**2 for x in l])
CPU times: user 292 ms, sys: 10.9 ms, total: 303 ms
Wall time: 302 ms
Out[9]:
333332833333500000
In [10]:
x = np.array(l)
%time (x**2).sum()
CPU times: user 3.62 ms, sys: 1.07 ms, total: 4.69 ms
Wall time: 3.62 ms
Out[10]:
333332833333500000

!!! about 70-80 times faster

Create N-D arrays

1-D array

In [9]:
x = np.array([1, 2, 10, 2, 1 ])
x.shape
Out[9]:
(5,)
In [10]:
len(x)
Out[10]:
5

Indexing follows the same syntax as with Python sequences.

In [11]:
x[2]
Out[11]:
10
In [12]:
x.ndim
Out[12]:
1

2-D arrays

In [11]:
np.array([[1,2], [3,4]])
Out[11]:
array([[1, 2],
       [3, 4]])

Here is a naive way to build a 2D matrix with values going from 1 to 12. Later, we will use the arange function + reshape method

In [3]:
n1 = [1, 2, 3]
n2 = [4, 5, 6]
n3 = [7, 8, 9]
n4 = [10, 11, 12]
m = np.array([n1, n2, n3, n4])
In [4]:
m.ndim
Out[4]:
2
In [5]:
m.shape
Out[5]:
(4, 3)
In [6]:
len(m)
Out[6]:
4

Indexing: LC convention (Line / Column)

For a 5x5 matrix, the indexing works as follows

In [46]:
m
Out[46]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
In [57]:
# To get 11, last row, second column:
m[3,1]
Out[57]:
11
In [58]:
#equivalent but a bit slower:
m[3][1]
Out[58]:
11

4-D arrays ? why not ?

Given a cubic volume of air, we decompose it in a 2x2x2 smaller cubes. In each small cube, we measure 3 variables: eg, pressure, temperature...

In [7]:
c1 = [1,2,3]; c2 = [1,2,3]; c3 = [1,2,3]; c4 = [1,2,3]; 
c5 = [1,2,3]; c6 = [1,2,3]; c7 = [1,2,3]; c8 = [1,2,3]; 
x = np.array(
    [                   # first dimension (2 slices)
        [               # second dimension (2 rows)
            [           # third dimension (2 columns)
                c1, c2
            ],  
            [
                c3, c4
            ]
        ],
        [
            [
                c5, c6
            ],
            [
                c7, c8
            ]
        ]
    ])        
In [24]:
x.shape
Out[24]:
(2, 2, 2, 3)

Functions to create arrays

The previous 4D example was not created by hand but based on:

The ones function

In [11]:
np.ones((2, 2, 2, 3))
Out[11]:
array([[[[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]],

        [[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]]],


       [[[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]],

        [[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]]]])
In [12]:
# a 2D example
np.ones((4, 2))
Out[12]:
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

The arange function

Evenly spaced values within a given interval based on a step

In [13]:
np.arange(1, 10) # not that the end is exclusive and the step is 1 by default
Out[13]:
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [14]:
np.arange(0, 10, step=2)
Out[14]:
array([0, 2, 4, 6, 8])
In [15]:
# Remember how we created to first 2D example array ?
# here is the proper way with arange and reshape
np.arange(12).reshape(4,3)
Out[15]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

the linspace function

Evenly spaced values within a given interval based on a number of points

In [8]:
np.linspace(0, 1, 10)
Out[8]:
array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])

ones, zeros, diag, eye, empty

In [25]:
np.diag((5,5,1,1))
Out[25]:
array([[5, 0, 0, 0],
       [0, 5, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])
In [26]:
np.ones((2,2))
Out[26]:
array([[ 1.,  1.],
       [ 1.,  1.]])
In [27]:
np.zeros((2,2))
Out[27]:
array([[ 0.,  0.],
       [ 0.,  0.]])
In [28]:
np.eye(3)
Out[28]:
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

Random values

Python language has its own random module but numpy has more functionalities.

In [29]:
# uniform random values between 0 and 1
np.random.rand()
Out[29]:
0.7959159810777837
In [12]:
# normal distribution. Note the **n** after rand
np.random.randn()
Out[12]:
0.36012177378781063
In [16]:
# array of normally distributed values
np.random.normal(size=10)
Out[16]:
array([-0.14782388, -1.27359207,  0.50721445, -0.77446479,  1.69835029,
       -0.37707183, -0.91804253,  0.54775832, -0.10592372, -0.58636633])

Array creation

  1. Create a 4x4 matrix with 2's on the diagonal. What about a 100x100 matrix ?
  2. Create a matrix 5x5 with random number uniformly distributed
  3. Creata a 1D array of 1000 points normally distributed. Compute the mean and standard deviation.
  4. Same question for a uniform distribution.
  5. can you compute the maximum and miminum of the previously created arrays. Does it make sense ?
  6. Create these matrices:
    1 0 0 0 0
    0 1 0 0 0
    0 0 5 0 0
    0 0 0 1 0
    0 0 0 0 1
    and
    0 1  1  1  1
    1 0  1  1  1
    1 1 -4  1  1
    1 1  1  0  1
    1 1  1  1  0

Hints and solutions

In [4]:
# 1 -
x = np.diag([2,2,2,2]); 
x = np.diag([2]*100);
# 2 -
x = np.random.normal(size=25).reshape(5,5)
# 3 -
x = np.random.normal(size=1000)
x.mean()
x.std()
# 4 - for uniform distribution, replace 'normal' by 'uniform'
# 5 - 
x.min()
x.max()
# takes the min/max of U should give [0,1] and on N should give
# -2, +2 but random so may change. does it make sense -> yes
# 6 - 
A = np.diag([1]*5); A[2,2]=5
B = np.ones((5,5)) - A

data types

In [14]:
x = np.array([1, 2, 3])
x.dtype
Out[14]:
dtype('int64')
In [37]:
x = np.array([1., 2, 3.5])
x.dtype
Out[37]:
dtype('float64')
In [15]:
x = np.array([1, 2, 3], dtype=float)
x.dtype
Out[15]:
dtype('float64')
In [33]:
x = np.array([1, 2, 3])
x = x.astype(float)
x
Out[33]:
array([ 1.,  2.,  3.])
If you mix types, the most complex type is used
In [35]:
np.array([1.0, 1, "oups"])
Out[35]:
array(['1.0', '1', 'oups'], 
      dtype='<U32')

Basic indexing and slicing

  • Create an array that goes from 0 to 9 (included)

Using slicing and indexing:

  • Select only odd values
  • Select only even values
  • Select last two values only
  • Select all values except first 2

Note that the syntax is the same as in pure Python slicing and indexing

Solutions

In [11]:
data = np.arange(10)
# you may use pure python and cast to array:
data = np.array(range(10))
print("even: {}".format(data[::2]))
print("odd: {}".format(data[1::2]))
print("last two : {}".format(data[-2:]))
print("from pos 1: {}".format(data[2:]))
even: [0 2 4 6 8]
odd: [1 3 5 7 9]
last two : [8 9]
from pos 1: [2 3 4 5 6 7 8 9]

Example in 2D

Syntax. First index is for rows and second for columns:

    Matrix[0, :]
    Matrix[:, :]
    Matrix[:, 0]

Create the array shown above. Then, with slicing and indexing,

  • extract first row,
  • extract first column (orange cells)
  • extract even values only, odd values only
  • extract the 4 blue cells
  • extract the 2 green cells
  • extract the 2x2 sub-matrix in bottom right corner

Solutions

In [12]:
# blue cells
x = np.arange(5)
M = np.array([x, x+10, x+20, x+30, x+40])
# first row:
M[0,:]
# first colum,n
M[:,0]
# even values
M[:,::2]
# odd values
M[:, 1::2] 
Out[12]:
array([[ 1,  3],
       [11, 13],
       [21, 23],
       [31, 33],
       [41, 43]])
In [17]:
# sub corner: 
M[-2:, -2:]
# green cells: 
M[2, 2:]
# blue cells: 
M[1::2, 1::2]
Out[17]:
array([[11, 13],
       [31, 33]])

Copies and views

Like in Python, we are manipulating objects. So be careful with the references

In [158]:
a = np.array([1,2,3,4,5])
In [160]:
b = a[:]
In [164]:
a[2] = 30
In [165]:
b
Out[165]:
array([ 1,  2, 30,  4,  5])
In [166]:
b[-2:] = 0
In [167]:
b
Out[167]:
array([ 1,  2, 30,  0,  0])
In [168]:
a
Out[168]:
array([ 1,  2, 30,  0,  0])
In Numpy, the slice : is a view (reference), not a copy

copies

In [27]:
a = np.array([1,2,3])
b = a.copy()
In [28]:
b[1] = 100
b
Out[28]:
array([  1, 100,   3])
In [29]:
a
Out[29]:
array([1, 2, 3])

Fancy indexing

As we have seen before, standard Python slicing and indexing works on NumpPy array. Yet, NumPy provides more indexing, which can be performed with boolean or integer arrays, also called masked

Indexing with boolean masks

Boolean mask is a very powerful feature in NumPy. It can be used to index an array, and assign new values to a sub-array. Note also that it creates copies not views

In [21]:
data = np.arange(16)
data
Out[21]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

Find all multiple of 7

In [22]:
(data % 7 == 0)
Out[22]:
array([ True, False, False, False, False, False, False,  True, False,
       False, False, False, False, False,  True, False], dtype=bool)
In [23]:
mask = (data % 7 == 0)
data[mask]
Out[23]:
array([ 0,  7, 14])
In [24]:
# Replaces values: 
data[mask] = -100
data
Out[24]:
array([-100,    1,    2,    3,    4,    5,    6, -100,    8,    9,   10,
         11,   12,   13, -100,   15])

Indexing with an array of integers

In [40]:
X = np.array([-1, 2, -3, -4, -5, 10, 20])
indices = [0, 1, 2, 3]
X[indices]
Out[40]:
array([-1,  2, -3, -4])

Indices are nice if you have uneven selection and you know what you want before hand

In [41]:
indices = [0,4]
X[indices]
Out[41]:
array([-1, -5])

Of course in the first example, we could have been smarter with just a mask:

In [122]:
X[X<0]
Out[122]:
array([-1, -3, -4, -5])

Fancy indexing

Create the array above and extract the following data sets:

  • the 9 blue cells
  • the 5 orange cells
  • the green cells

Solutions

In [21]:
x = np.arange(6)
M = np.array([x, x+10, x+20, x+30, x+40, x+50])
M
Out[21]:
array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])
In [123]:
# Creation using tiles
M = np.tile(range(6), (6, 1))
for i in range(6):
    M[i,:] += i*10
M
Out[123]:
array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])
In [22]:
# orange: 
M[(0,1,2,3,4), (1,2,3,4,5)]
# blue: 
M[3:, [0,2,5]]
# green: 
M[np.array([1, 0,1,0,0,1], dtype=bool),4]
# green: 
M[np.array([True, False,True,False,False,True]),4]
Out[22]:
array([ 4, 24, 54])
  • Create a range from -5 to 5 (included)
  • With a mask, find the positive values and replace them with +1
  • With a mask, find the negative values and replace them with -1

Solutions

In [26]:
M = np.arange(-5,6)
M[M<0]
M[M>0]
Out[26]:
array([1, 2, 3, 4, 5])

Numerical operations

In [41]:
A = np.array([[4, 7], 
              [2, 6]])

B = np.array([[0.6, -0.7],
              [-0.2, 0.4]])
In [42]:
A + B
Out[42]:
array([[ 4.6,  6.3],
       [ 1.8,  6.4]])
Again be careful with copies and views
In [43]:
C = B.copy()
In [44]:
C *= 2
B         # unchanged because we did a copy !
Out[44]:
array([[ 0.6, -0.7],
       [-0.2,  0.4]])
In [45]:
# elementwise product
A * B
Out[45]:
array([[ 2.4, -4.9],
       [-0.4,  2.4]])
This is not a matrix product
In [116]:
# matrix product
A.dot(B)
Out[116]:
array([[  1.00000000e+00,   4.44089210e-16],
       [ -2.22044605e-16,   1.00000000e+00]])
In [124]:
A.dot(B).round()
Out[124]:
array([[ 1.,  0.],
       [-0.,  1.]])

Comparisons

In [148]:
a = np.array([1,2,3,4])
b = np.array([1,2,3,100])
In [149]:
a == b
Out[149]:
array([ True,  True,  True, False], dtype=bool)
In [151]:
a > b
Out[151]:
array([False, False, False, False], dtype=bool)

array-wise comparison

In [153]:
# method1 
np.array_equal(a, b)
Out[153]:
False
In [154]:
# method2
np.all(a==b)
Out[154]:
False

Logical operations

In [43]:
a = np.array([1,1,0,0])
b = np.array([1,1,0,1])

OR / AND operations

In [44]:
np.logical_or(a,b)
Out[44]:
array([ True,  True, False,  True], dtype=bool)

Comparison and logical operations return array of boolean type, so can be used as mask to select a sub set

Exercices

a = [1,1,0,1,1]
b = [0,1,0,1,0]
c = [1,2,3,4,5]
  1. In pure Python, check that a*b does not work.
  2. In pure Python, compute the a*b on an element-wise basis For instance [1,2][2,3] should give [2,6]. Apply to **ab**
  3. In pure Python, compute the logical AND of a with respect to b (element-wise)
  4. Same question: compute a*b in Pure Numpy
  5. Guess (without typing the code) the output of
    np.array([True, 1, 2]) + np.array([3, 4, False])
    
  6. Guess the output of
    np.array([[1,2], [3,4], [5,6]]).shape
    

Solutions

In [36]:
a = [1,1,0,1,1]
b = [0,1,0,1,0]
c = [1,2,3,4,5]
# 1. 
# 2.
[x*y for x,y in zip(a,b)]
# 3. 
a and b # we get ones and zeros, but no boolean values
[x and y for x,y in zip(a,b)] 
[bool(x and y) for x,y in zip(a,b)] 
# 4. in numpy:
np.logical_and(a,b)
# 5. 
# [4,5,2] 
# 6.
# the dimension of the matrix
(3, 2)
Out[36]:
(3, 2)

Some linear algebra ?

In [118]:
from numpy import linalg

\begin{equation} A * A^{-1} = I \end{equation}

In [145]:
A = np.array([[4, 7], 
              [2, 6]])
B = linalg.inv(A)
B
Out[145]:
array([[ 0.6, -0.7],
       [-0.2,  0.4]])
In [146]:
A.dot(B).round()
Out[146]:
array([[ 1.,  0.],
       [-0.,  1.]])

Reductions (function applied on 1 axis)

In [29]:
x = np.array([1, 2, 3, 4, -1, -2, -3, -4])
In [171]:
x.sum()
Out[171]:
0
In [172]:
x.min()
Out[172]:
-4
In [173]:
x.max()
Out[173]:
4
In [174]:
# Position of the max
x.argmax()
Out[174]:
3

Reduction in higher dimensions

In [179]:
A = np.array([
    [1,10,1],
    [2,8,3]
])
In [180]:
A.sum()
Out[180]:
25

What about sum along the first axis (Line)

What about sum along the second axis (Column)

In [181]:
# Compute the sum along the axis 0 (rows). You reduce the row, 
# so in this example you should end up with 3 values
A.sum(axis=0)
Out[181]:
array([ 3, 18,  4])
In [183]:
# Here, we sum along axis 1 so we should get the sum for each line.
A.sum(axis=1)
Out[183]:
array([12, 13])

Reductions

  1. Create a 1,000,000 long vector (whatever distribution) but with positives and negatives values.
  2. Write 3 functions to get the max using
    • numpy max method,
    • pure Python max function,
    • a loop searching for the max (keeping track of the max). Compare the computation times
  3. Create a 2x3 matrix with random values (negative/positive). Experiments with the mean, max, argmin, argmax absolute, std.
  4. On the previous example, check that the transpose() method is equivalent to reshape(3,2)

Solutions

In [39]:
#1.
X = np.random.randn(1000000)

#2.
def max_numpy(X):
    return np.max(X)
def max_python(X):
    return max(X)
def max_list(X):
    M = -1e6
    for this in X:
        if this > M:
            M = this
%timeit max_numpy(X)
%timeit max_python(X)
%timeit max_list(X)
# 3.
X = np.random.rand(6).reshape(3,2)
X.mean(axis=0)
1000 loops, best of 3: 520 µs per loop
10 loops, best of 3: 58.6 ms per loop
10 loops, best of 3: 69.4 ms per loop
Out[39]:
array([ 0.58224888,  0.72470475])

Iterations

In [194]:
X = np.random.normal(size=12).reshape(6,2)
X
Out[194]:
array([[-0.41710565, -1.26354034],
       [-0.46755957, -0.45431898],
       [ 0.93871098, -0.67561552],
       [-0.05776603, -1.45499429],
       [ 0.15186369,  0.47294566],
       [ 1.04601121, -0.14891296]])
In [196]:
for row in X:
    if row.sum() > 0 :
        print(row)
[ 0.93871098 -0.67561552]
[ 0.15186369  0.47294566]
[ 1.04601121 -0.14891296]
In [197]:
for item in X.flat:
    if item >1:
        print(item)
1.04601121441

Iterations

  • Given an array 6x2 of random numbers normally distributed, loop over the row and print the row when there is at least one positive value
  • Given the same 6x2 matrix, loop over the rows, and print the index of the row when there is at least one positive value

solutions

In [43]:
np.random.seed(15)
m = np.random.randn(12).reshape(6,2)
print(m)
print("-------*---------")
for row in m:
    if sum(row>0) >0:
        print(row)
print("-------*---------")        
for i,row in enumerate(m):
    if sum(row>0) > 0:
        print(i)
    
[[-0.31232848  0.33928471]
 [-0.15590853 -0.50178967]
 [ 0.23556889 -1.76360526]
 [-1.09586204 -1.08776574]
 [-0.30517005 -0.47374837]
 [-0.20059454  0.35519677]]
-------*---------
[-0.31232848  0.33928471]
[ 0.23556889 -1.76360526]
[-0.20059454  0.35519677]
-------*---------
0
2
5

inf and nan

In [50]:
res = np.sqrt(-1)
/home/cokelaer/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: invalid value encountered in sqrt
  if __name__ == '__main__':
In [51]:
res
Out[51]:
nan
In [52]:
np.isnan(res)
Out[52]:
True
In [83]:
import sys
sys.float_info.max
Out[83]:
1.7976931348623157e+308
In [85]:
res = np.array([1e308,1e300]) * 10
/home/cokelaer/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: overflow encountered in multiply
  if __name__ == '__main__':
In [86]:
res
Out[86]:
array([              inf,   1.00000000e+301])

Resizing

In [24]:
a = np.diag([1,2,3,4])
In [26]:
# Can be used to add a column or a row
a.resize((5,4))
a
Out[26]:
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4],
       [0, 0, 0, 0]])

Restriction: no reference permitted

Sorting

In [30]:
X = np.array([5,1,10,2,7,8])
X.sort() # inplace
X
Out[30]:
array([ 1,  2,  5,  7,  8, 10])
In [31]:
X = np.array([5,1,10,2,7,8])
np.sort(X)
Out[31]:
array([ 1,  2,  5,  7,  8, 10])
In [32]:
X = np.array([5,1,10,2,7,8])
X.argsort()
Out[32]:
array([1, 3, 0, 4, 5, 2])

Loading data

Numpy has its own reader of tabulated data sets (e.g. genfromtext)

In [61]:
data = np.genfromtxt("data/numpy.csv", delimiter=",")
data[:,1]
Out[61]:
array([ 0.,  1.,  2.,  1.,  1.,  3.,  0.])

However, we will use pandas read_csv function that is far better

CSV that genfromtext does not handle:

- strings that contains comma (see data/chembl.csv)
- mix of integers and strings (see data/brain_size.csv)

vstack and hstack (will be used in other notebooks)

In [53]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack([a, b])
Out[53]:
array([[1, 2, 3],
       [4, 5, 6]])
In [52]:
# equivalent of the + operator with list
np.hstack([a, b])
Out[52]:
array([1, 1, 0, 1, 1, 0, 1, 0, 1, 0])

Summary

For scientific computation with large data sets, use NumPy what you should keep in mind:

  • How to create arrays with : array, arange, ones, diag, zeros, ...
  • Know the shape and dimensions
  • Perform slicing and indexing at least on 2D arrays
  • How to use operators. Keep in mind that * operator is an element-wise operation.
  • Reduction operations: min,max, argmax, argmin, std, var, mean, absolute
  • Maths operators: abs, add, arccos,round, floor, cum, cumsum
  • ordering, sorting: argmax, sort, argsort

To go further:

- Broadcasting: operations on array with differen sizes
- splitting arrays
- check out some other functions: convolve, correlate, corrcoef, cov

References: NumPy documentation and Scipy Lectures Notes