- Python is a general-purpose language. It was not designed for scientific computing.
- NumPy is the core tools for performant numerical computing with Python. It is used by many other scientific libraries including SciPy, Pandas, matplotlib.
(in terminal)
conda install numpy
import numpy as np
an array is like a sequence but contains a homogeneous collections of items (e.g. all integers)
import numpy as np
x = np.array([1, 2, 3])
x
array([1, 2, 3])
NB1:
x is an instance of the object numpy.ndarray. The constructor takes as argument a sequence. Here we provided a list hence the ([ ]) syntax.
NB2: Following the previous nota bene about the syntax we have:
a = np.array(1, 2, 3, 4) # WRONG
a = np.array([1, 2, 3, 4]) # RIGHT
NumPy provides fast and memory efficient data structures
l = range(1000000)
%time sum(l)
CPU times: user 19.9 ms, sys: 0 ns, total: 19.9 ms Wall time: 19.8 ms
499999500000
x = np.array(l)
%time x.sum()
CPU times: user 1.34 ms, sys: 1.05 ms, total: 2.38 ms Wall time: 1.32 ms
499999500000
About 15 times faster
Another example:
How would you compute $\sum X_i^2$ given $X$ is a Python list ?
l = range(1000000)
%time sum([x**2 for x in l])
CPU times: user 292 ms, sys: 10.9 ms, total: 303 ms Wall time: 302 ms
333332833333500000
x = np.array(l)
%time (x**2).sum()
CPU times: user 3.62 ms, sys: 1.07 ms, total: 4.69 ms Wall time: 3.62 ms
333332833333500000
!!! about 70-80 times faster
x = np.array([1, 2, 10, 2, 1 ])
x.shape
(5,)
len(x)
5
Indexing follows the same syntax as with Python sequences.
x[2]
10
x.ndim
1
np.array([[1,2], [3,4]])
array([[1, 2], [3, 4]])
Here is a naive way to build a 2D matrix with values going from 1 to 12. Later, we will use the arange function + reshape method
n1 = [1, 2, 3]
n2 = [4, 5, 6]
n3 = [7, 8, 9]
n4 = [10, 11, 12]
m = np.array([n1, n2, n3, n4])
m.ndim
2
m.shape
(4, 3)
len(m)
4
For a 5x5 matrix, the indexing works as follows
m
array([[ 1, 2, 3], [ 4, 5, 6], [ 7, 8, 9], [10, 11, 12]])
# To get 11, last row, second column:
m[3,1]
11
#equivalent but a bit slower:
m[3][1]
11
Given a cubic volume of air, we decompose it in a 2x2x2 smaller cubes. In each small cube, we measure 3 variables: eg, pressure, temperature...
c1 = [1,2,3]; c2 = [1,2,3]; c3 = [1,2,3]; c4 = [1,2,3];
c5 = [1,2,3]; c6 = [1,2,3]; c7 = [1,2,3]; c8 = [1,2,3];
x = np.array(
[ # first dimension (2 slices)
[ # second dimension (2 rows)
[ # third dimension (2 columns)
c1, c2
],
[
c3, c4
]
],
[
[
c5, c6
],
[
c7, c8
]
]
])
x.shape
(2, 2, 2, 3)
The previous 4D example was not created by hand but based on:
np.ones((2, 2, 2, 3))
array([[[[ 1., 1., 1.], [ 1., 1., 1.]], [[ 1., 1., 1.], [ 1., 1., 1.]]], [[[ 1., 1., 1.], [ 1., 1., 1.]], [[ 1., 1., 1.], [ 1., 1., 1.]]]])
# a 2D example
np.ones((4, 2))
array([[ 1., 1.], [ 1., 1.], [ 1., 1.], [ 1., 1.]])
Evenly spaced values within a given interval based on a step
np.arange(1, 10) # not that the end is exclusive and the step is 1 by default
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(0, 10, step=2)
array([0, 2, 4, 6, 8])
# Remember how we created to first 2D example array ?
# here is the proper way with arange and reshape
np.arange(12).reshape(4,3)
array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11]])
Evenly spaced values within a given interval based on a number of points
np.linspace(0, 1, 10)
array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444, 0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])
np.diag((5,5,1,1))
array([[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
np.ones((2,2))
array([[ 1., 1.], [ 1., 1.]])
np.zeros((2,2))
array([[ 0., 0.], [ 0., 0.]])
np.eye(3)
array([[ 1., 0., 0.], [ 0., 1., 0.], [ 0., 0., 1.]])
Python language has its own random module but numpy has more functionalities.
# uniform random values between 0 and 1
np.random.rand()
0.7959159810777837
# normal distribution. Note the **n** after rand
np.random.randn()
0.36012177378781063
# array of normally distributed values
np.random.normal(size=10)
array([-0.14782388, -1.27359207, 0.50721445, -0.77446479, 1.69835029, -0.37707183, -0.91804253, 0.54775832, -0.10592372, -0.58636633])
- Create a 4x4 matrix with 2's on the diagonal. What about a 100x100 matrix ?
- Create a matrix 5x5 with random number uniformly distributed
- Creata a 1D array of 1000 points normally distributed. Compute the mean and standard deviation.
- Same question for a uniform distribution.
- can you compute the maximum and miminum of the previously created arrays. Does it make sense ?
- Create these matrices:
and1 0 0 0 0 0 1 0 0 0 0 0 5 0 0 0 0 0 1 0 0 0 0 0 1
0 1 1 1 1 1 0 1 1 1 1 1 -4 1 1 1 1 1 0 1 1 1 1 1 0
# 1 -
x = np.diag([2,2,2,2]);
x = np.diag([2]*100);
# 2 -
x = np.random.normal(size=25).reshape(5,5)
# 3 -
x = np.random.normal(size=1000)
x.mean()
x.std()
# 4 - for uniform distribution, replace 'normal' by 'uniform'
# 5 -
x.min()
x.max()
# takes the min/max of U should give [0,1] and on N should give
# -2, +2 but random so may change. does it make sense -> yes
# 6 -
A = np.diag([1]*5); A[2,2]=5
B = np.ones((5,5)) - A
x = np.array([1, 2, 3])
x.dtype
dtype('int64')
x = np.array([1., 2, 3.5])
x.dtype
dtype('float64')
x = np.array([1, 2, 3], dtype=float)
x.dtype
dtype('float64')
x = np.array([1, 2, 3])
x = x.astype(float)
x
array([ 1., 2., 3.])
np.array([1.0, 1, "oups"])
array(['1.0', '1', 'oups'], dtype='<U32')
- Create an array that goes from 0 to 9 (included)
Using slicing and indexing:
- Select only odd values
- Select only even values
- Select last two values only
- Select all values except first 2
Note that the syntax is the same as in pure Python slicing and indexing
data = np.arange(10)
# you may use pure python and cast to array:
data = np.array(range(10))
print("even: {}".format(data[::2]))
print("odd: {}".format(data[1::2]))
print("last two : {}".format(data[-2:]))
print("from pos 1: {}".format(data[2:]))
even: [0 2 4 6 8] odd: [1 3 5 7 9] last two : [8 9] from pos 1: [2 3 4 5 6 7 8 9]
Syntax. First index is for rows and second for columns:
Matrix[0, :]
Matrix[:, :]
Matrix[:, 0]
Create the array shown above. Then, with slicing and indexing,
- extract first row,
- extract first column (orange cells)
- extract even values only, odd values only
- extract the 4 blue cells
- extract the 2 green cells
- extract the 2x2 sub-matrix in bottom right corner
# blue cells
x = np.arange(5)
M = np.array([x, x+10, x+20, x+30, x+40])
# first row:
M[0,:]
# first colum,n
M[:,0]
# even values
M[:,::2]
# odd values
M[:, 1::2]
array([[ 1, 3], [11, 13], [21, 23], [31, 33], [41, 43]])
# sub corner:
M[-2:, -2:]
# green cells:
M[2, 2:]
# blue cells:
M[1::2, 1::2]
array([[11, 13], [31, 33]])
Like in Python, we are manipulating objects. So be careful with the references
a = np.array([1,2,3,4,5])
b = a[:]
a[2] = 30
b
array([ 1, 2, 30, 4, 5])
b[-2:] = 0
b
array([ 1, 2, 30, 0, 0])
a
array([ 1, 2, 30, 0, 0])
a = np.array([1,2,3])
b = a.copy()
b[1] = 100
b
array([ 1, 100, 3])
a
array([1, 2, 3])
As we have seen before, standard Python slicing and indexing works on NumpPy array. Yet, NumPy provides more indexing, which can be performed with boolean or integer arrays, also called masked
Boolean mask is a very powerful feature in NumPy. It can be used to index an array, and assign new values to a sub-array. Note also that it creates copies not views
data = np.arange(16)
data
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
Find all multiple of 7
(data % 7 == 0)
array([ True, False, False, False, False, False, False, True, False, False, False, False, False, False, True, False], dtype=bool)
mask = (data % 7 == 0)
data[mask]
array([ 0, 7, 14])
# Replaces values:
data[mask] = -100
data
array([-100, 1, 2, 3, 4, 5, 6, -100, 8, 9, 10, 11, 12, 13, -100, 15])
X = np.array([-1, 2, -3, -4, -5, 10, 20])
indices = [0, 1, 2, 3]
X[indices]
array([-1, 2, -3, -4])
Indices are nice if you have uneven selection and you know what you want before hand
indices = [0,4]
X[indices]
array([-1, -5])
Of course in the first example, we could have been smarter with just a mask:
X[X<0]
array([-1, -3, -4, -5])
Create the array above and extract the following data sets:
- the 9 blue cells
- the 5 orange cells
- the green cells
x = np.arange(6)
M = np.array([x, x+10, x+20, x+30, x+40, x+50])
M
array([[ 0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15], [20, 21, 22, 23, 24, 25], [30, 31, 32, 33, 34, 35], [40, 41, 42, 43, 44, 45], [50, 51, 52, 53, 54, 55]])
# Creation using tiles
M = np.tile(range(6), (6, 1))
for i in range(6):
M[i,:] += i*10
M
array([[ 0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15], [20, 21, 22, 23, 24, 25], [30, 31, 32, 33, 34, 35], [40, 41, 42, 43, 44, 45], [50, 51, 52, 53, 54, 55]])
# orange:
M[(0,1,2,3,4), (1,2,3,4,5)]
# blue:
M[3:, [0,2,5]]
# green:
M[np.array([1, 0,1,0,0,1], dtype=bool),4]
# green:
M[np.array([True, False,True,False,False,True]),4]
array([ 4, 24, 54])
- Create a range from -5 to 5 (included)
- With a mask, find the positive values and replace them with +1
- With a mask, find the negative values and replace them with -1
M = np.arange(-5,6)
M[M<0]
M[M>0]
array([1, 2, 3, 4, 5])
A = np.array([[4, 7],
[2, 6]])
B = np.array([[0.6, -0.7],
[-0.2, 0.4]])
A + B
array([[ 4.6, 6.3], [ 1.8, 6.4]])
C = B.copy()
C *= 2
B # unchanged because we did a copy !
array([[ 0.6, -0.7], [-0.2, 0.4]])
# elementwise product
A * B
array([[ 2.4, -4.9], [-0.4, 2.4]])
# matrix product
A.dot(B)
array([[ 1.00000000e+00, 4.44089210e-16], [ -2.22044605e-16, 1.00000000e+00]])
A.dot(B).round()
array([[ 1., 0.], [-0., 1.]])
a = np.array([1,2,3,4])
b = np.array([1,2,3,100])
a == b
array([ True, True, True, False], dtype=bool)
a > b
array([False, False, False, False], dtype=bool)
# method1
np.array_equal(a, b)
False
# method2
np.all(a==b)
False
a = np.array([1,1,0,0])
b = np.array([1,1,0,1])
np.logical_or(a,b)
array([ True, True, False, True], dtype=bool)
Comparison and logical operations return array of boolean type, so can be used as mask to select a sub set
a = [1,1,0,1,1]
b = [0,1,0,1,0]
c = [1,2,3,4,5]
- In pure Python, check that a*b does not work.
- In pure Python, compute the a*b on an element-wise basis For instance [1,2][2,3] should give [2,6]. Apply to **ab**
- In pure Python, compute the logical AND of a with respect to b (element-wise)
- Same question: compute a*b in Pure Numpy
- Guess (without typing the code) the output of
np.array([True, 1, 2]) + np.array([3, 4, False])- Guess the output of
np.array([[1,2], [3,4], [5,6]]).shape
a = [1,1,0,1,1]
b = [0,1,0,1,0]
c = [1,2,3,4,5]
# 1.
# 2.
[x*y for x,y in zip(a,b)]
# 3.
a and b # we get ones and zeros, but no boolean values
[x and y for x,y in zip(a,b)]
[bool(x and y) for x,y in zip(a,b)]
# 4. in numpy:
np.logical_and(a,b)
# 5.
# [4,5,2]
# 6.
# the dimension of the matrix
(3, 2)
(3, 2)
from numpy import linalg
\begin{equation} A * A^{-1} = I \end{equation}
A = np.array([[4, 7],
[2, 6]])
B = linalg.inv(A)
B
array([[ 0.6, -0.7], [-0.2, 0.4]])
A.dot(B).round()
array([[ 1., 0.], [-0., 1.]])
x = np.array([1, 2, 3, 4, -1, -2, -3, -4])
x.sum()
0
x.min()
-4
x.max()
4
# Position of the max
x.argmax()
3
A = np.array([
[1,10,1],
[2,8,3]
])
A.sum()
25
What about sum along the first axis (Line)
What about sum along the second axis (Column)
# Compute the sum along the axis 0 (rows). You reduce the row,
# so in this example you should end up with 3 values
A.sum(axis=0)
array([ 3, 18, 4])
# Here, we sum along axis 1 so we should get the sum for each line.
A.sum(axis=1)
array([12, 13])
- Create a 1,000,000 long vector (whatever distribution) but with positives and negatives values.
- Write 3 functions to get the max using
- numpy max method,
- pure Python max function,
- a loop searching for the max (keeping track of the max). Compare the computation times
- Create a 2x3 matrix with random values (negative/positive). Experiments with the mean, max, argmin, argmax absolute, std.
- On the previous example, check that the transpose() method is equivalent to reshape(3,2)
#1.
X = np.random.randn(1000000)
#2.
def max_numpy(X):
return np.max(X)
def max_python(X):
return max(X)
def max_list(X):
M = -1e6
for this in X:
if this > M:
M = this
%timeit max_numpy(X)
%timeit max_python(X)
%timeit max_list(X)
# 3.
X = np.random.rand(6).reshape(3,2)
X.mean(axis=0)
1000 loops, best of 3: 520 µs per loop 10 loops, best of 3: 58.6 ms per loop 10 loops, best of 3: 69.4 ms per loop
array([ 0.58224888, 0.72470475])
X = np.random.normal(size=12).reshape(6,2)
X
array([[-0.41710565, -1.26354034], [-0.46755957, -0.45431898], [ 0.93871098, -0.67561552], [-0.05776603, -1.45499429], [ 0.15186369, 0.47294566], [ 1.04601121, -0.14891296]])
for row in X:
if row.sum() > 0 :
print(row)
[ 0.93871098 -0.67561552] [ 0.15186369 0.47294566] [ 1.04601121 -0.14891296]
for item in X.flat:
if item >1:
print(item)
1.04601121441
- Given an array 6x2 of random numbers normally distributed, loop over the row and print the row when there is at least one positive value
- Given the same 6x2 matrix, loop over the rows, and print the index of the row when there is at least one positive value
np.random.seed(15)
m = np.random.randn(12).reshape(6,2)
print(m)
print("-------*---------")
for row in m:
if sum(row>0) >0:
print(row)
print("-------*---------")
for i,row in enumerate(m):
if sum(row>0) > 0:
print(i)
[[-0.31232848 0.33928471] [-0.15590853 -0.50178967] [ 0.23556889 -1.76360526] [-1.09586204 -1.08776574] [-0.30517005 -0.47374837] [-0.20059454 0.35519677]] -------*--------- [-0.31232848 0.33928471] [ 0.23556889 -1.76360526] [-0.20059454 0.35519677] -------*--------- 0 2 5
res = np.sqrt(-1)
/home/cokelaer/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: invalid value encountered in sqrt if __name__ == '__main__':
res
nan
np.isnan(res)
True
import sys
sys.float_info.max
1.7976931348623157e+308
res = np.array([1e308,1e300]) * 10
/home/cokelaer/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: overflow encountered in multiply if __name__ == '__main__':
res
array([ inf, 1.00000000e+301])
a = np.diag([1,2,3,4])
# Can be used to add a column or a row
a.resize((5,4))
a
array([[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4], [0, 0, 0, 0]])
Restriction: no reference permitted
X = np.array([5,1,10,2,7,8])
X.sort() # inplace
X
array([ 1, 2, 5, 7, 8, 10])
X = np.array([5,1,10,2,7,8])
np.sort(X)
array([ 1, 2, 5, 7, 8, 10])
X = np.array([5,1,10,2,7,8])
X.argsort()
array([1, 3, 0, 4, 5, 2])
Numpy has its own reader of tabulated data sets (e.g. genfromtext)
data = np.genfromtxt("data/numpy.csv", delimiter=",")
data[:,1]
array([ 0., 1., 2., 1., 1., 3., 0.])
However, we will use pandas read_csv function that is far better
CSV that genfromtext does not handle:
- strings that contains comma (see data/chembl.csv)
- mix of integers and strings (see data/brain_size.csv)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack([a, b])
array([[1, 2, 3], [4, 5, 6]])
# equivalent of the + operator with list
np.hstack([a, b])
array([1, 1, 0, 1, 1, 0, 1, 0, 1, 0])
For scientific computation with large data sets, use NumPy what you should keep in mind:
- How to create arrays with : array, arange, ones, diag, zeros, ...
- Know the shape and dimensions
- Perform slicing and indexing at least on 2D arrays
- How to use operators. Keep in mind that * operator is an element-wise operation.
- Reduction operations: min,max, argmax, argmin, std, var, mean, absolute
- Maths operators: abs, add, arccos,round, floor, cum, cumsum
- ordering, sorting: argmax, sort, argsort
To go further:
- Broadcasting: operations on array with differen sizes
- splitting arrays
- check out some other functions: convolve, correlate, corrcoef, cov
References: NumPy documentation and Scipy Lectures Notes