TC, JBM, BN, AZ
Institut Pasteur, Paris, 20-31 March 2017

Motivation¶

Python is a general-purpose language. It was not designed for scientific computing.

NumPy is the core tools for performant numerical computing with Python. It is used by many other scientific libraries including SciPy, Pandas, matplotlib.

Installation¶

(in terminal)

conda install numpy

Convention¶

In [2]:

import numpy as np

Arrays¶

an array is like a sequence but contains a homogeneous collections of items (e.g. all integers)

In [5]:

import numpy as np
x = np.array([1, 2, 3])
x

Out[5]:

array([1, 2, 3])

NB1:

x is an instance of the object numpy.ndarray. The constructor takes as argument a sequence. Here we provided a list hence the ([ ]) syntax.

NB2: Following the previous nota bene about the syntax we have:

a = np.array(1, 2, 3, 4)    # WRONG
a = np.array([1, 2, 3, 4])  # RIGHT

NumPy provides fast and memory efficient data structures

Pure Python¶

In [3]:

l = range(1000000)
%time sum(l)

CPU times: user 19.9 ms, sys: 0 ns, total: 19.9 ms
Wall time: 19.8 ms

Out[3]:

499999500000

Numpy¶

In [7]:

x = np.array(l)
%time x.sum()

CPU times: user 1.34 ms, sys: 1.05 ms, total: 2.38 ms
Wall time: 1.32 ms

Out[7]:

499999500000

About 15 times faster

Another example:

How would you compute $\sum X_i^2$ given $X$ is a Python list ?

In [9]:

l = range(1000000)
%time sum([x**2 for x in l])

CPU times: user 292 ms, sys: 10.9 ms, total: 303 ms
Wall time: 302 ms

Out[9]:

333332833333500000

In [10]:

x = np.array(l)
%time (x**2).sum()

CPU times: user 3.62 ms, sys: 1.07 ms, total: 4.69 ms
Wall time: 3.62 ms

Out[10]:

333332833333500000

!!! about 70-80 times faster

Create N-D arrays¶

1-D array¶

In [9]:

x = np.array([1, 2, 10, 2, 1 ])
x.shape

Out[9]:

(5,)

In [10]:

len(x)

Out[10]:

Indexing follows the same syntax as with Python sequences.

In [11]:

x[2]

Out[11]:

In [12]:

x.ndim

Out[12]:

2-D arrays¶

In [11]:

np.array([[1,2], [3,4]])

Out[11]:

array([[1, 2],
       [3, 4]])

Here is a naive way to build a 2D matrix with values going from 1 to 12. Later, we will use the arange function + reshape method

In [3]:

n1 = [1, 2, 3]
n2 = [4, 5, 6]
n3 = [7, 8, 9]
n4 = [10, 11, 12]
m = np.array([n1, n2, n3, n4])

In [4]:

m.ndim

Out[4]:

In [5]:

m.shape

Out[5]:

(4, 3)

In [6]:

len(m)

Out[6]:

Indexing: LC convention (Line / Column)¶

For a 5x5 matrix, the indexing works as follows

In [46]:

Out[46]:

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [57]:

# To get 11, last row, second column:
m[3,1]

Out[57]:

In [58]:

#equivalent but a bit slower:
m[3][1]

Out[58]:

4-D arrays ? why not ?¶

Given a cubic volume of air, we decompose it in a 2x2x2 smaller cubes. In each small cube, we measure 3 variables: eg, pressure, temperature...

In [7]:

c1 = [1,2,3]; c2 = [1,2,3]; c3 = [1,2,3]; c4 = [1,2,3]; 
c5 = [1,2,3]; c6 = [1,2,3]; c7 = [1,2,3]; c8 = [1,2,3]; 
x = np.array(
    [                   # first dimension (2 slices)
        [               # second dimension (2 rows)
            [           # third dimension (2 columns)
                c1, c2
            ],  
            [
                c3, c4
            ]
        ],
        [
            [
                c5, c6
            ],
            [
                c7, c8
            ]
        ]
    ])

In [24]:

x.shape

Out[24]:

(2, 2, 2, 3)

Functions to create arrays¶

The previous 4D example was not created by hand but based on:

The ones function¶

In [11]:

np.ones((2, 2, 2, 3))

Out[11]:

array([[[[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]],

        [[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]]],


       [[[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]],

        [[ 1.,  1.,  1.],
         [ 1.,  1.,  1.]]]])

In [12]:

# a 2D example
np.ones((4, 2))

Out[12]:

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

The arange function¶

Evenly spaced values within a given interval based on a step

In [13]:

np.arange(1, 10) # not that the end is exclusive and the step is 1 by default

Out[13]:

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [14]:

np.arange(0, 10, step=2)

Out[14]:

array([0, 2, 4, 6, 8])

In [15]:

# Remember how we created to first 2D example array ?
# here is the proper way with arange and reshape
np.arange(12).reshape(4,3)

Out[15]:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

the linspace function¶

Evenly spaced values within a given interval based on a number of points

In [8]:

np.linspace(0, 1, 10)

Out[8]:

array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])

ones, zeros, diag, eye, empty¶

In [25]:

np.diag((5,5,1,1))

Out[25]:

array([[5, 0, 0, 0],
       [0, 5, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])

In [26]:

np.ones((2,2))

Out[26]:

array([[ 1.,  1.],
       [ 1.,  1.]])

In [27]:

np.zeros((2,2))

Out[27]:

array([[ 0.,  0.],
       [ 0.,  0.]])

In [28]:

np.eye(3)

Out[28]:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

Random values¶

Python language has its own random module but numpy has more functionalities.

In [29]:

# uniform random values between 0 and 1
np.random.rand()

Out[29]:

0.7959159810777837

In [12]:

# normal distribution. Note the **n** after rand
np.random.randn()

Out[12]:

0.36012177378781063

In [16]:

# array of normally distributed values
np.random.normal(size=10)

Out[16]:

array([-0.14782388, -1.27359207,  0.50721445, -0.77446479,  1.69835029,
       -0.37707183, -0.91804253,  0.54775832, -0.10592372, -0.58636633])

Array creation

Create a 4x4 matrix with 2's on the diagonal. What about a 100x100 matrix ?

Create a matrix 5x5 with random number uniformly distributed

Creata a 1D array of 1000 points normally distributed. Compute the mean and standard deviation.

Same question for a uniform distribution.

can you compute the maximum and miminum of the previously created arrays. Does it make sense ?
Create these matrices:
1 0 0 0 0
0 1 0 0 0
0 0 5 0 0
0 0 0 1 0
0 0 0 0 1
and
0 1  1  1  1
1 0  1  1  1
1 1 -4  1  1
1 1  1  0  1
1 1  1  1  0

Hints and solutions¶

In [4]:

# 1 -
x = np.diag([2,2,2,2]); 
x = np.diag([2]*100);
# 2 -
x = np.random.normal(size=25).reshape(5,5)
# 3 -
x = np.random.normal(size=1000)
x.mean()
x.std()
# 4 - for uniform distribution, replace 'normal' by 'uniform'
# 5 - 
x.min()
x.max()
# takes the min/max of U should give [0,1] and on N should give
# -2, +2 but random so may change. does it make sense -> yes
# 6 - 
A = np.diag([1]*5); A[2,2]=5
B = np.ones((5,5)) - A

data types¶

In [14]:

x = np.array([1, 2, 3])
x.dtype

Out[14]:

dtype('int64')

In [37]:

x = np.array([1., 2, 3.5])
x.dtype

Out[37]:

dtype('float64')

In [15]:

x = np.array([1, 2, 3], dtype=float)
x.dtype

Out[15]:

dtype('float64')

In [33]:

x = np.array([1, 2, 3])
x = x.astype(float)
x

Out[33]:

array([ 1.,  2.,  3.])

If you mix types, the most complex type is used

In [35]:

np.array([1.0, 1, "oups"])

Out[35]:

array(['1.0', '1', 'oups'], 
      dtype='<U32')

Basic indexing and slicing

Create an array that goes from 0 to 9 (included)

Using slicing and indexing:

Select only odd values

Select only even values

Select last two values only

Select all values except first 2

Note that the syntax is the same as in pure Python slicing and indexing

Solutions¶

In [11]:

data = np.arange(10)
# you may use pure python and cast to array:
data = np.array(range(10))
print("even: {}".format(data[::2]))
print("odd: {}".format(data[1::2]))
print("last two : {}".format(data[-2:]))
print("from pos 1: {}".format(data[2:]))

even: [0 2 4 6 8]
odd: [1 3 5 7 9]
last two : [8 9]
from pos 1: [2 3 4 5 6 7 8 9]

Example in 2D

Syntax. First index is for rows and second for columns:

    Matrix[0, :]
    Matrix[:, :]
    Matrix[:, 0]

Create the array shown above. Then, with slicing and indexing,

extract first row,

extract first column (orange cells)

extract even values only, odd values only

extract the 4 blue cells

extract the 2 green cells

extract the 2x2 sub-matrix in bottom right corner

Solutions¶

In [12]:

# blue cells
x = np.arange(5)
M = np.array([x, x+10, x+20, x+30, x+40])
# first row:
M[0,:]
# first colum,n
M[:,0]
# even values
M[:,::2]
# odd values
M[:, 1::2]

Out[12]:

array([[ 1,  3],
       [11, 13],
       [21, 23],
       [31, 33],
       [41, 43]])

In [17]:

# sub corner: 
M[-2:, -2:]
# green cells: 
M[2, 2:]
# blue cells: 
M[1::2, 1::2]

Out[17]:

array([[11, 13],
       [31, 33]])

Copies and views¶

Like in Python, we are manipulating objects. So be careful with the references

In [158]:

a = np.array([1,2,3,4,5])

In [160]:

b = a[:]

In [164]:

a[2] = 30

In [165]:

Out[165]:

array([ 1,  2, 30,  4,  5])

In [166]:

b[-2:] = 0

In [167]:

Out[167]:

array([ 1,  2, 30,  0,  0])

In [168]:

Out[168]:

array([ 1,  2, 30,  0,  0])

In Numpy, the slice : is a view (reference), not a copy

copies¶

In [27]:

a = np.array([1,2,3])
b = a.copy()

In [28]:

b[1] = 100
b

Out[28]:

array([  1, 100,   3])

In [29]:

Out[29]:

array([1, 2, 3])

Fancy indexing¶

As we have seen before, standard Python slicing and indexing works on NumpPy array. Yet, NumPy provides more indexing, which can be performed with boolean or integer arrays, also called masked

Indexing with boolean masks¶

Boolean mask is a very powerful feature in NumPy. It can be used to index an array, and assign new values to a sub-array. Note also that it creates copies not views

In [21]:

data = np.arange(16)
data

Out[21]:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

Find all multiple of 7

In [22]:

(data % 7 == 0)

Out[22]:

array([ True, False, False, False, False, False, False,  True, False,
       False, False, False, False, False,  True, False], dtype=bool)

In [23]:

mask = (data % 7 == 0)
data[mask]

Out[23]:

array([ 0,  7, 14])

In [24]:

# Replaces values: 
data[mask] = -100
data

Out[24]:

array([-100,    1,    2,    3,    4,    5,    6, -100,    8,    9,   10,
         11,   12,   13, -100,   15])

Indexing with an array of integers¶

In [40]:

X = np.array([-1, 2, -3, -4, -5, 10, 20])
indices = [0, 1, 2, 3]
X[indices]

Out[40]:

array([-1,  2, -3, -4])

Indices are nice if you have uneven selection and you know what you want before hand

In [41]:

indices = [0,4]
X[indices]

Out[41]:

array([-1, -5])

Of course in the first example, we could have been smarter with just a mask:

In [122]:

X[X<0]

Out[122]:

array([-1, -3, -4, -5])

Fancy indexing

Create the array above and extract the following data sets:

the 9 blue cells

the 5 orange cells

the green cells

Solutions¶

In [21]:

x = np.arange(6)
M = np.array([x, x+10, x+20, x+30, x+40, x+50])
M

Out[21]:

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [123]:

# Creation using tiles
M = np.tile(range(6), (6, 1))
for i in range(6):
    M[i,:] += i*10
M

Out[123]:

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [22]:

# orange: 
M[(0,1,2,3,4), (1,2,3,4,5)]
# blue: 
M[3:, [0,2,5]]
# green: 
M[np.array([1, 0,1,0,0,1], dtype=bool),4]
# green: 
M[np.array([True, False,True,False,False,True]),4]

Out[22]:

array([ 4, 24, 54])

Create a range from -5 to 5 (included)

With a mask, find the positive values and replace them with +1

With a mask, find the negative values and replace them with -1

Solutions¶

In [26]:

M = np.arange(-5,6)
M[M<0]
M[M>0]

Out[26]:

array([1, 2, 3, 4, 5])

Numerical operations¶

In [41]:

A = np.array([[4, 7], 
              [2, 6]])

B = np.array([[0.6, -0.7],
              [-0.2, 0.4]])

In [42]:

A + B

Out[42]:

array([[ 4.6,  6.3],
       [ 1.8,  6.4]])

Again be careful with copies and views

In [43]:

C = B.copy()

In [44]:

C *= 2
B         # unchanged because we did a copy !

Out[44]:

array([[ 0.6, -0.7],
       [-0.2,  0.4]])

In [45]:

# elementwise product
A * B

Out[45]:

array([[ 2.4, -4.9],
       [-0.4,  2.4]])

This is not a matrix product

In [116]:

# matrix product
A.dot(B)

Out[116]:

array([[  1.00000000e+00,   4.44089210e-16],
       [ -2.22044605e-16,   1.00000000e+00]])

In [124]:

A.dot(B).round()

Out[124]:

array([[ 1.,  0.],
       [-0.,  1.]])

Comparisons¶

In [148]:

a = np.array([1,2,3,4])
b = np.array([1,2,3,100])

In [149]:

a == b

Out[149]:

array([ True,  True,  True, False], dtype=bool)

In [151]:

a > b

Out[151]:

array([False, False, False, False], dtype=bool)

array-wise comparison¶

In [153]:

# method1 
np.array_equal(a, b)

Out[153]:

False

In [154]:

# method2
np.all(a==b)

Out[154]:

False

Logical operations¶

In [43]:

a = np.array([1,1,0,0])
b = np.array([1,1,0,1])

OR / AND operations¶

In [44]:

np.logical_or(a,b)

Out[44]:

array([ True,  True, False,  True], dtype=bool)

Comparison and logical operations return array of boolean type, so can be used as mask to select a sub set

Exercices

a = [1,1,0,1,1]
b = [0,1,0,1,0]
c = [1,2,3,4,5]

In pure Python, check that a*b does not work.

In pure Python, compute the a*b on an element-wise basis For instance [1,2][2,3] should give [2,6]. Apply to **ab**

In pure Python, compute the logical AND of a with respect to b (element-wise)

Same question: compute a*b in Pure Numpy
Guess (without typing the code) the output of
np.array([True, 1, 2]) + np.array([3, 4, False])
Guess the output of
np.array([[1,2], [3,4], [5,6]]).shape

Solutions¶

In [36]:

a = [1,1,0,1,1]
b = [0,1,0,1,0]
c = [1,2,3,4,5]
# 1. 
# 2.
[x*y for x,y in zip(a,b)]
# 3. 
a and b # we get ones and zeros, but no boolean values
[x and y for x,y in zip(a,b)] 
[bool(x and y) for x,y in zip(a,b)] 
# 4. in numpy:
np.logical_and(a,b)
# 5. 
# [4,5,2] 
# 6.
# the dimension of the matrix
(3, 2)

Out[36]:

(3, 2)

Some linear algebra ?¶

In [118]:

from numpy import linalg

\begin{equation} A * A^{-1} = I \end{equation}

In [145]:

A = np.array([[4, 7], 
              [2, 6]])
B = linalg.inv(A)
B

Out[145]:

array([[ 0.6, -0.7],
       [-0.2,  0.4]])

In [146]:

A.dot(B).round()

Out[146]:

array([[ 1.,  0.],
       [-0.,  1.]])

Reductions (function applied on 1 axis)¶

In [29]:

x = np.array([1, 2, 3, 4, -1, -2, -3, -4])

In [171]:

x.sum()

Out[171]:

In [172]:

x.min()

Out[172]:

-4

In [173]:

x.max()

Out[173]:

In [174]:

# Position of the max
x.argmax()

Out[174]:

Reduction in higher dimensions¶

In [179]:

A = np.array([
    [1,10,1],
    [2,8,3]
])

In [180]:

A.sum()

Out[180]:

What about sum along the first axis (Line)

What about sum along the second axis (Column)

In [181]:

# Compute the sum along the axis 0 (rows). You reduce the row, 
# so in this example you should end up with 3 values
A.sum(axis=0)

Out[181]:

array([ 3, 18,  4])

In [183]:

# Here, we sum along axis 1 so we should get the sum for each line.
A.sum(axis=1)

Out[183]:

array([12, 13])

Reductions

Create a 1,000,000 long vector (whatever distribution) but with positives and negatives values.

Write 3 functions to get the max using

numpy max method,

pure Python max function,

a loop searching for the max (keeping track of the max). Compare the computation times

Create a 2x3 matrix with random values (negative/positive). Experiments with the mean, max, argmin, argmax absolute, std.

On the previous example, check that the transpose() method is equivalent to reshape(3,2)

Solutions¶

In [39]:

#1.
X = np.random.randn(1000000)

#2.
def max_numpy(X):
    return np.max(X)
def max_python(X):
    return max(X)
def max_list(X):
    M = -1e6
    for this in X:
        if this > M:
            M = this
%timeit max_numpy(X)
%timeit max_python(X)
%timeit max_list(X)
# 3.
X = np.random.rand(6).reshape(3,2)
X.mean(axis=0)

1000 loops, best of 3: 520 µs per loop
10 loops, best of 3: 58.6 ms per loop
10 loops, best of 3: 69.4 ms per loop

Out[39]:

array([ 0.58224888,  0.72470475])

Iterations¶

In [194]:

X = np.random.normal(size=12).reshape(6,2)
X

Out[194]:

array([[-0.41710565, -1.26354034],
       [-0.46755957, -0.45431898],
       [ 0.93871098, -0.67561552],
       [-0.05776603, -1.45499429],
       [ 0.15186369,  0.47294566],
       [ 1.04601121, -0.14891296]])

In [196]:

for row in X:
    if row.sum() > 0 :
        print(row)

[ 0.93871098 -0.67561552]
[ 0.15186369  0.47294566]
[ 1.04601121 -0.14891296]

In [197]:

for item in X.flat:
    if item >1:
        print(item)

1.04601121441

Iterations

Given an array 6x2 of random numbers normally distributed, loop over the row and print the row when there is at least one positive value

Given the same 6x2 matrix, loop over the rows, and print the index of the row when there is at least one positive value

solutions¶

In [43]:

np.random.seed(15)
m = np.random.randn(12).reshape(6,2)
print(m)
print("-------*---------")
for row in m:
    if sum(row>0) >0:
        print(row)
print("-------*---------")        
for i,row in enumerate(m):
    if sum(row>0) > 0:
        print(i)

[[-0.31232848  0.33928471]
 [-0.15590853 -0.50178967]
 [ 0.23556889 -1.76360526]
 [-1.09586204 -1.08776574]
 [-0.30517005 -0.47374837]
 [-0.20059454  0.35519677]]
-------*---------
[-0.31232848  0.33928471]
[ 0.23556889 -1.76360526]
[-0.20059454  0.35519677]
-------*---------
0
2
5

inf and nan¶

In [50]:

res = np.sqrt(-1)

/home/cokelaer/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: invalid value encountered in sqrt
  if __name__ == '__main__':

In [51]:

res

Out[51]:

nan

In [52]:

np.isnan(res)

Out[52]:

True

In [83]:

import sys
sys.float_info.max

Out[83]:

1.7976931348623157e+308

In [85]:

res = np.array([1e308,1e300]) * 10

/home/cokelaer/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: overflow encountered in multiply
  if __name__ == '__main__':

In [86]:

res

Out[86]:

array([              inf,   1.00000000e+301])

Resizing¶

In [24]:

a = np.diag([1,2,3,4])

In [26]:

# Can be used to add a column or a row
a.resize((5,4))
a

Out[26]:

array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4],
       [0, 0, 0, 0]])

Restriction: no reference permitted

Sorting¶

In [30]:

X = np.array([5,1,10,2,7,8])
X.sort() # inplace
X

Out[30]:

array([ 1,  2,  5,  7,  8, 10])

In [31]:

X = np.array([5,1,10,2,7,8])
np.sort(X)

Out[31]:

array([ 1,  2,  5,  7,  8, 10])

In [32]:

X = np.array([5,1,10,2,7,8])
X.argsort()

Out[32]:

array([1, 3, 0, 4, 5, 2])

Loading data¶

Numpy has its own reader of tabulated data sets (e.g. genfromtext)

In [61]:

data = np.genfromtxt("data/numpy.csv", delimiter=",")
data[:,1]

Out[61]:

array([ 0.,  1.,  2.,  1.,  1.,  3.,  0.])

However, we will use pandas read_csv function that is far better

CSV that genfromtext does not handle:

- strings that contains comma (see data/chembl.csv)
- mix of integers and strings (see data/brain_size.csv)

vstack and hstack (will be used in other notebooks)¶

In [53]:

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack([a, b])

Out[53]:

array([[1, 2, 3],
       [4, 5, 6]])

In [52]:

# equivalent of the + operator with list
np.hstack([a, b])

Out[52]:

array([1, 1, 0, 1, 1, 0, 1, 0, 1, 0])

Summary¶

For scientific computation with large data sets, use NumPy what you should keep in mind:

How to create arrays with : array, arange, ones, diag, zeros, ...

Know the shape and dimensions

Perform slicing and indexing at least on 2D arrays

How to use operators. Keep in mind that * operator is an element-wise operation.

Reduction operations: min,max, argmax, argmin, std, var, mean, absolute

Maths operators: abs, add, arccos,round, floor, cum, cumsum

ordering, sorting: argmax, sort, argsort

To go further:

- Broadcasting: operations on array with differen sizes
- splitting arrays
- check out some other functions: convolve, correlate, corrcoef, cov

References: NumPy documentation and Scipy Lectures Notes