Pandas
TC, BN, JBM, AZ

(in terminal)

conda install pandas
conda install xlrd     # for excel reading support
conda install openpyxl # for excel writing support

Import pandas

In [1]:
import pandas as pd

Part 1. Panda's metabolism

Problem: Let's build a metabolic model of a panda (Ailuropoda melanoleuca, taxon id 9646) using a reaction database.

Glycolysis

Glucose + ATP = Glucose-6-phosphate + ADP

(catalysed by hexokinase)

Resources:

  • Rhea -- a curated reaction database: cross-reference table between reactions and Uniprot ids of catalyzing enzymes
  • OrthoDB -- a comprehensive catalog of orthologous genes: gene table

Plan

Reading tables

Table

Reaction-to-Uniprot cross-reference table

In [7]:
# from tab/csv ...
rxn_df = pd.read_csv('./data/rhea2uniprot.tsv', index_col=None, header=0,
                     sep='\t')
In [8]:
# ... or from Excel
rxn_df = pd.read_excel('./data/rhea2uniprot.xlsx', index_col=None, header=0,
                       sheet_name=0)

Viewing tables

In [9]:
# show the first few rows
rxn_df.head()
Out[9]:
RHEA_ID DIRECTION MASTER_ID ID
0 10011 BI 10008 A5ENR3
1 10011 BI 10008 A5VCB6
2 10011 BI 10008 A7HCA3
3 10011 BI 10008 A9KBI4
4 10011 BI 10008 B4UA87
In [10]:
# specify how many is few
rxn_df.head(n=3)
Out[10]:
RHEA_ID DIRECTION MASTER_ID ID
0 10011 BI 10008 A5ENR3
1 10011 BI 10008 A5VCB6
2 10011 BI 10008 A7HCA3
In [11]:
# show the last 3 rows
rxn_df.tail(n=3)
Out[11]:
RHEA_ID DIRECTION MASTER_ID ID
34396 51223 BI 51220 Q9Y7Q2
34397 51223 BI 51220 Q9ZRW8
34398 51223 BI 51220 Q9ZW30
In [12]:
# let's see some descriptive statistics
rxn_df.describe(include='all')
Out[12]:
RHEA_ID DIRECTION MASTER_ID ID
count 34399.000000 34399 34399.000000 34399
unique NaN 1 NaN 24315
top NaN BI NaN P59911
freq NaN 34399 NaN 17
mean 23635.779703 NaN 23632.779703 NaN
std 10532.838213 NaN 10532.838213 NaN
min 10011.000000 NaN 10008.000000 NaN
25% 15644.000000 NaN 15641.000000 NaN
50% 21251.000000 NaN 21248.000000 NaN
75% 28097.000000 NaN 28094.000000 NaN
max 51223.000000 NaN 51220.000000 NaN
In [13]:
# table dimensions
rxn_df.shape
Out[13]:
(34399, 4)
In [14]:
# len gives the number of rows
len(rxn_df)
Out[14]:
34399

Subsetting tables

Column names are stored in the columns attribute (an Index object that can be used like a list)

In [15]:
rxn_df.columns
Out[15]:
Index(['RHEA_ID', 'DIRECTION', 'MASTER_ID', 'ID'], dtype='object')

So, RHEA_ID is the reaction identifier, ID is the Uniprot identifier. The rest we do not need for now and can remove.

We can select a subset using a list of valid column names

In [16]:
# Select specific columns
rxn_df = rxn_df[['RHEA_ID', 'ID']]
rxn_df.head(n=3)
Out[16]:
RHEA_ID ID
0 10011 A5ENR3
1 10011 A5VCB6
2 10011 A7HCA3
In [17]:
# which RHEA_ID is in the 4th row?
rxn_df.loc[3, 'RHEA_ID']
Out[17]:
10011
In [18]:
# what are the values of columns 0 and 1 for rows 1765 to 1769?
rxn_df.iloc[1765:1770, [0, 1]]
Out[18]:
RHEA_ID ID
1765 11063 Q9RUP8
1766 11063 Q9XDB4
1767 11067 A0M0J2
1768 11067 A0M764
1769 11067 A0RP36

NB: loc takes labels (row index values and column names), while iloc takes integer positions

Remember the indexing and slicing syntax in Python. It's the same mechanism in Pandas dataframes
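To make the difference concrete, here is a minimal sketch (illustrative only) using the rxn_df table above:

# label-based access: row with index label 3, column named 'RHEA_ID'
rxn_df.loc[3, 'RHEA_ID']
# position-based access: 4th row, 1st column -- the same cell here,
# because the default RangeIndex makes labels and positions coincide
rxn_df.iloc[3, 0]

# after filtering, labels and positions no longer coincide
subset = rxn_df[rxn_df['RHEA_ID'] == 11063]
subset.iloc[0, 0]             # first row of the subset
# subset.loc[0, 'RHEA_ID']    # KeyError: index label 0 is not in the subset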

In [19]:
# select Uniprot column
rxn_df['ID'].head()
Out[19]:
0    A5ENR3
1    A5VCB6
2    A7HCA3
3    A9KBI4
4    B4UA87
Name: ID, dtype: object
In [20]:
# how many genes are there?
len(rxn_df['ID'])
Out[20]:
34399
In [21]:
# and unique ones?
len(rxn_df['ID'].unique())
Out[21]:
24315

But how do we know which of these genes are the panda's?

Panda and genes

Reading tables

Gene table

odb9_genes.tab

  1. Ortho DB unique gene id (not stable between releases)
  2. organism tax id
  3. protein sequence id, as downloaded together with the sequence
  4. Uniprot id, evaluated by mapping
  5. ENSEMBL gene name, evaluated by mapping
  6. NCBI gid, evaluated by mapping
  7. description, evaluated by mapping
In [22]:
gene_df = pd.read_table('./data/odb9_genes_sample.tab', index_col=None, 
                        sep='\t', header=None)

Viewing tables

In [23]:
gene_df.head(n=3)
Out[23]:
0 1 2 3 4 5 6
0 323097:000d58 323097 ABE64319 Q1QHH8 Nham_3591 NaN Alpha-IPM isomerase
1 390235:000063 390235 ACA70606 B1J4B0 PputW619_0100 NaN Tryptophan synthase alpha chain
2 283643:0000f4 283643 P0CP81 P0CP81 NaN 4936978.0 hypothetical protein

Subsetting tables

In [24]:
# Let's only keep gene id and Uniprot id, and rename the columns
gene_df = gene_df.iloc[:, [0, 3]]
gene_df.columns = ['gene', 'Uniprot']
gene_df.head(n=3)
Out[24]:
gene Uniprot
0 323097:000d58 Q1QHH8
1 390235:000063 B1J4B0
2 283643:0000f4 P0CP81

Filtering tables

In [25]:
# select the gene column
gene_df['gene'].head(n=5)
Out[25]:
0    323097:000d58
1    390235:000063
2    283643:0000f4
3    284812:0000c8
4      9646:003cac
Name: gene, dtype: object

Panda's taxon id

In [26]:
# which rows contain panda's genes?
# there are many ways to do it; here we use a regular expression (^ anchors the match to the start of the string)
mask = gene_df['gene'].str.contains('^9646:')
mask.head(n=5)
Out[26]:
0    False
1    False
2    False
3    False
4     True
Name: gene, dtype: bool
In [27]:
# select only those rows that contain panda's genes
gene_df[mask].head(n=3)
Out[27]:
gene Uniprot
4 9646:003cac NaN
9 9646:00079a G1L5N9
12 9646:0042dd G1MI96
In [28]:
# get panda's uniprots
up_column = gene_df.loc[mask, 'Uniprot']
panda_uniprots = up_column.unique()
len(panda_uniprots)
Out[28]:
4807

Comments about the syntax

We can write the code fragment above more concisely by chaining the calls:

In [29]:
# get panda's uniprots
panda_uniprots = gene_df.loc[mask, 'Uniprot'].unique()
len(panda_uniprots)
Out[29]:
4807

Combining reaction and gene tables?

In [36]:
# Select reactions that are catalysed by panda's genes
panda_rxn_df = rxn_df[rxn_df['ID'].isin(panda_uniprots)]
panda_rxn_df.head()
Out[36]:
RHEA_ID ID
12112 17992 D2HXI8
28498 34282 D2HBV9
33177 46611 D2HXI8

Another way to access the reactions catalysed by panda's genes: query

You can use column names directly in query; external variables should be prefixed with @

In [37]:
panda_rxn_df = rxn_df.query("ID in @panda_uniprots")
panda_rxn_df.head()
Out[37]:
RHEA_ID ID
12112 17992 D2HXI8
28498 34282 D2HBV9
33177 46611 D2HXI8
In [38]:
# Let's rename the columns
panda_rxn_df.columns = ['RHEA_ID', 'panda_id']
panda_rxn_df.head(n=10)
Out[38]:
RHEA_ID panda_id
12112 17992 D2HXI8
28498 34282 D2HBV9
33177 46611 D2HXI8

Only three panda reactions, so sad :'(

Sad panda

Part 2. Let's search for orthologs!

Task 2

Reading and viewing tables

OG table

odb9_OG2genes.tab

  1. OG unique id
  2. Ortho DB unique gene id
In [40]:
og_df = pd.read_table('./data/odb9_OG2genes_sample.tab', index_col=None, 
                          sep='\t', header=None)
og_df.head(n=3)
Out[40]:
0 1
0 POG091H02K5 190486:0003bc
1 POG091B02KO 221109:0007c2
2 EOG090A04GW 9646:003fea
In [41]:
# let's have better names for columns
og_df.columns = ['OG', 'gene']

Filtering tables

In [42]:
# select rows that contain panda's genes
mask = og_df['gene'].str.contains('^9646:') 
og_df[mask].head(n=3)
Out[42]:
OG gene
2 EOG090A04GW 9646:003fea
5 EOG090B094A 9646:002cbf
7 EOG0908065O 9646:0034fe
In [43]:
# select panda genes
panda_genes = og_df.loc[mask, 'gene'].unique()
panda_genes
Out[43]:
array(['9646:003fea', '9646:002cbf', '9646:0034fe', ..., '9646:00404d',
       '9646:001a26', '9646:0013f1'], dtype=object)
In [44]:
# select OGs that contain panda's genes
panda_ogs = og_df.loc[mask, 'OG'].unique()
panda_ogs
Out[44]:
array(['EOG090A04GW', 'EOG090B094A', 'EOG0908065O', ..., 'EOG091G02ID',
       'EOG091G0F1K', 'EOG090803ZY'], dtype=object)
In [45]:
# filter the table to contain only these OGs
old_len = len(og_df)
og_df = og_df[og_df['OG'].isin(panda_ogs)]
(old_len, len(og_df))
Out[45]:
(33880, 8347)

Merging tables

Merge
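pd.merge performs an SQL-style join: rows of the two tables are paired wherever the values in the key column(s) match. A toy sketch with made-up data, before we apply it to our tables:

left = pd.DataFrame({'OG': ['A', 'A', 'B'], 'gene': ['g1', 'g2', 'g3']})
right = pd.DataFrame({'OG': ['A', 'C'], 'gene': ['g4', 'g5']})

# inner join: only keys present in both tables are kept;
# overlapping column names get the suffixes _x and _y
pd.merge(left, right, on='OG', how='inner')
#   OG gene_x gene_y
# 0  A     g1     g4
# 1  A     g2     g4

# left join: all rows of the left table are kept, unmatched ones get NaN
pd.merge(left, right, on='OG', how='left')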

In [46]:
og_df.head(n=3)
Out[46]:
OG gene
2 EOG090A04GW 9646:003fea
4 EOG091G0B1Y 473542:002ab8
5 EOG090B094A 9646:002cbf
In [50]:
# mask selecting panda's genes
mask = og_df['gene'].isin(panda_genes)
mask.head(n=3)
Out[50]:
2     True
4    False
5     True
Name: gene, dtype: bool
In [53]:
# mask selecting non-panda's genes
~mask.head(n=3)
Out[53]:
2    False
4     True
5    False
Name: gene, dtype: bool

Merge

In [80]:
# get panda's gene to orthologous gene mapping
gene2gene = pd.merge(og_df[mask], og_df[~mask], on='OG', how='inner')
gene2gene.head(n=3)
Out[80]:
OG gene_x gene_y
0 EOG091G00PO 9646:000d66 9615:0027eb
1 EOG090M04R7 9646:0002b5 10090:001b13
2 EOG090705I4 9646:0007fd 9606:003167
In [81]:
# we only need the gene columns, and with better names
gene2gene = gene2gene.iloc[:, 1:]
gene2gene.columns = ['panda_gene', 'other_gene']
gene2gene.head(n=3)
Out[81]:
panda_gene other_gene
0 9646:000d66 9615:0027eb
1 9646:0002b5 10090:001b13
2 9646:0007fd 9606:003167

But these are not Uniprot ids...

In [82]:
# merge panda's genes with the gene table to obtain their Uniprots
gene2gene = pd.merge(gene2gene, gene_df, left_on='panda_gene', 
                         right_on='gene', how='left')
gene2gene.head(n=3)
Out[82]:
panda_gene other_gene gene Uniprot
0 9646:000d66 9615:0027eb 9646:000d66 G1LA54
1 9646:0002b5 10090:001b13 9646:0002b5 G1L1J8
2 9646:0007fd 9606:003167 9646:0007fd G1L5Y1
In [83]:
# merge other genes with the gene table to obtain their Uniprots
gene2gene = pd.merge(gene2gene, gene_df, left_on='other_gene', 
                     right_on='gene', how='inner')
gene2gene.head(n=3)
Out[83]:
panda_gene other_gene gene_x Uniprot_x gene_y Uniprot_y
0 9646:000d66 9615:0027eb 9646:000d66 G1LA54 9615:0027eb Q9MZF4
1 9646:000d66 9615:0027eb 9646:000d66 G1LA54 9615:0027eb Q9MZF4
2 9646:0002b5 10090:001b13 9646:0002b5 G1L1J8 10090:001b13 O35495

Cleaning tables

In [84]:
# remove duplicated columns produced by merge...
gene2gene = gene2gene.iloc[:, [0, 1, 3, 5]]
gene2gene.columns = ['panda_gene', 'other_gene', 'panda_Uniprot', 'other_Uniprot']
gene2gene.head(n=3)
Out[84]:
panda_gene other_gene panda_Uniprot other_Uniprot
0 9646:000d66 9615:0027eb G1LA54 Q9MZF4
1 9646:000d66 9615:0027eb G1LA54 Q9MZF4
2 9646:0002b5 10090:001b13 G1L1J8 O35495
In [85]:
# remove rows without other Uniprot
old_len = len(gene2gene)
gene2gene.dropna(axis=0, how='any', subset=['other_Uniprot'], inplace=True)
(old_len, len(gene2gene))
Out[85]:
(239, 239)
In [86]:
# remove duplicated rows
old_len = len(gene2gene)
gene2gene.drop_duplicates(inplace=True)
(old_len, len(gene2gene))
Out[86]:
(239, 215)

Merging tables

Merge the gene mapping with the reaction table

In [87]:
# Let's define panda_id as the Uniprot id if it's present, and the panda gene id otherwise
import numpy

gene2gene['panda_id'] = numpy.where(gene2gene['panda_Uniprot'].isnull(), 
                                    gene2gene['panda_gene'], 
                                    gene2gene['panda_Uniprot'])
gene2gene.head(n=3)
Out[87]:
panda_gene other_gene panda_Uniprot other_Uniprot panda_id
0 9646:000d66 9615:0027eb G1LA54 Q9MZF4 G1LA54
2 9646:0002b5 10090:001b13 G1L1J8 O35495 G1L1J8
4 9646:0007fd 9606:003167 G1L5Y1 Q16875 G1L5Y1
In [88]:
# Remove unneeded columns
gene2gene = gene2gene.iloc[:, 3:]
gene2gene.head(n=3)
Out[88]:
other_Uniprot panda_id
0 Q9MZF4 G1LA54
2 O35495 G1L1J8
4 Q16875 G1L5Y1
In [91]:
# What was in the reaction table?
rxn_df.head(n=2)
Out[91]:
RHEA_ID ID
0 10011 A5ENR3
1 10011 A5VCB6
In [92]:
# And in the gene2gene one?
gene2gene.head(n=2)
Out[92]:
other_Uniprot panda_id
0 Q9MZF4 G1LA54
2 O35495 G1L1J8
In [93]:
# Merge them
gene2rxn = pd.merge(gene2gene, rxn_df, left_on='other_Uniprot', 
                        right_on='ID', how='inner')
gene2rxn.head(n=3)
Out[93]:
other_Uniprot panda_id RHEA_ID ID
0 Q9MZF4 G1LA54 11263 Q9MZF4
1 Q9MZF4 G1LA54 11267 Q9MZF4
2 O35495 G1L1J8 17992 O35495
In [94]:
# Clean a bit
gene2rxn = gene2rxn.iloc[:, 1:3]
gene2rxn.drop_duplicates(inplace=True)
gene2rxn.tail(n=5)
Out[94]:
panda_id RHEA_ID
305 G1MGG1 24595
306 D2HGR3 11359
307 D2HGR3 29094
308 G1M3B6 16240
310 D2HD49 13068
In [95]:
# Add panda gene mapping
gene2rxn = pd.concat([gene2rxn, panda_rxn_df], axis=0)
gene2rxn.drop_duplicates(inplace=True)
gene2rxn.tail(n=5)
Out[95]:
RHEA_ID panda_id
308 16240 G1M3B6
310 13068 D2HD49
12112 17992 D2HXI8
28498 34282 D2HBV9
33177 46611 D2HXI8
In [96]:
gene2rxn.describe(include='all')
Out[96]:
RHEA_ID panda_id
count 254.000000 254
unique NaN 173
top NaN G1L820
freq NaN 11
mean 23635.767717 NaN
std 12437.744535 NaN
min 10011.000000 NaN
25% 14656.000000 NaN
50% 18278.000000 NaN
75% 32235.000000 NaN
max 51223.000000 NaN
In [97]:
# change Rhea ID type to string for the description to be more meaningful
gene2rxn['RHEA_ID'] = gene2rxn['RHEA_ID'].astype(str)
gene2rxn.describe(include='all')
Out[97]:
RHEA_ID panda_id
count 254 254
unique 152 173
top 46611 G1L820
freq 20 11

Saving tables

In [98]:
# save mappings in tab-delimited format
gene2gene.to_csv('./data/gene2gene.tab', sep='\t', header=True, index=False)
gene2rxn.to_csv('./data/rxn2gene.tab', sep='\t', header=True, index=False)
In [99]:
# save mappings in Excel format
gene2gene.to_excel('./data/gene2gene.xlsx', header=True, index=False)
gene2rxn.to_excel('./data/rxn2gene.xlsx', header=True, index=False)
In [100]:
# save mappings to the same Excel document
with pd.ExcelWriter('./data/mappings.xlsx') as writer:
    for df, sheet in ((gene2gene, 'gene'), (gene2rxn, 'reaction')):
        df.to_excel(writer, sheet_name=sheet, header=True, index=False)

Happy panda

Part 3. Extract more info about reactions

We know the Rhea reaction ids, but not their formulas.

Stoichiometric matrix


Let's extract the reaction formulas using Rhea webservices...

Rhea

In [111]:
import requests
from xml.etree import ElementTree

RHEA_XML_PREFIX = '{http://www.xml-cml.org/schema/cml2/react}'

def get_metabolites(rxn_id):
    """
    Given a Rhea (http://www.rhea-db.org/) reaction id,
    returns its participant metabolites as a dict: {metabolite: stoichiometry},
    e.g. '2 H2 + O2 = 2 H2O' would be represented as {'H2': -2, 'O2': -1, 'H2O': 2}.
    :param rxn_id: Rhea reaction id
    :return: dict of participant metabolites.
    """
    response = requests.get('http://www.rhea-db.org/rest/1.0/ws/reaction/cmlreact/%s' % rxn_id)
    
    ms = {}
    if response.ok:
        # parse the XML that comes as a response
        tree = ElementTree.fromstring(response.content)
        
        ms.update({m.attrib['title']: -float(m.attrib['count'])
                   for m in tree.iter('%sreactant' % RHEA_XML_PREFIX)})
        ms.update({m.attrib['title']: float(m.attrib['count'])
                   for m in tree.iter('%sproduct' % RHEA_XML_PREFIX)})
    return ms
In [125]:
# As web services are slow, let's work on just a few reactions
rxn_df = gene2rxn.head(n=5).copy()
rxn_df
Out[125]:
RHEA_ID panda_id
0 11263 G1LA54
1 11267 G1LA54
2 17992 G1L1J8
3 46611 G1L1J8
4 15656 G1L5Y1
In [126]:
# load metabolites of reaction 11263 from web services
get_metabolites('11263')
Out[126]:
{'H(+)': -1.0, 'H2O2': 1.0, 'NADP(+)': 1.0, 'NADPH': -1.0, 'O2': -1.0}
In [127]:
# load metabolites for all our reactions, takes about a minute
rxn2ms = {r_id: get_metabolites(r_id) for r_id in rxn_df['RHEA_ID'].unique()}
rxn2ms['11263']
Out[127]:
{'H(+)': -1.0, 'H2O2': 1.0, 'NADP(+)': 1.0, 'NADPH': -1.0, 'O2': -1.0}
In [129]:
# Add information on metabolites to our table
rxn_df['Reactants'] = rxn_df['RHEA_ID'].apply(lambda r_id: [m for (m, s) in rxn2ms[r_id].items() if s < 0])
rxn_df['Products'] = rxn_df['RHEA_ID'].apply(lambda r_id: [m for (m, s) in rxn2ms[r_id].items() if s > 0])
rxn_df
Out[129]:
RHEA_ID panda_id Reactants Products
0 11263 G1LA54 [H(+), NADPH, O2] [NADP(+), H2O2]
1 11267 G1LA54 [NADH, H(+), O2] [H2O2, NAD(+)]
2 17992 G1L1J8 [ATP, [protein]-L-serine] [ADP, [protein]-O-phospho-L-serine, H(+)]
3 46611 G1L1J8 [ATP, [protein]-L-threonine] [H(+), ADP, [protein]-O-phospho-L-threonine]
4 15656 G1L5Y1 [ATP, beta-D-fructose 6-phosphate] [ADP, H(+), beta-D-fructose 2,6-bisphosphate]

Convert a dictionary to a DataFrame

In [130]:
stoich_matrix = pd.DataFrame.from_dict(rxn2ms, orient='columns')
stoich_matrix
Out[130]:
11263 11267 15656 17992 46611
ADP NaN NaN 1.0 1.0 1.0
ATP NaN NaN -1.0 -1.0 -1.0
H(+) -1.0 -1.0 1.0 1.0 1.0
H2O2 1.0 1.0 NaN NaN NaN
NAD(+) NaN 1.0 NaN NaN NaN
NADH NaN -1.0 NaN NaN NaN
NADP(+) 1.0 NaN NaN NaN NaN
NADPH -1.0 NaN NaN NaN NaN
O2 -1.0 -1.0 NaN NaN NaN
[protein]-L-serine NaN NaN NaN -1.0 NaN
[protein]-L-threonine NaN NaN NaN NaN -1.0
[protein]-O-phospho-L-serine NaN NaN NaN 1.0 NaN
[protein]-O-phospho-L-threonine NaN NaN NaN NaN 1.0
beta-D-fructose 2,6-bisphosphate NaN NaN 1.0 NaN NaN
beta-D-fructose 6-phosphate NaN NaN -1.0 NaN NaN
In [131]:
# Replace NaNs with zeros
stoich_matrix.fillna(0, inplace=True)
stoich_matrix
Out[131]:
11263 11267 15656 17992 46611
ADP 0.0 0.0 1.0 1.0 1.0
ATP 0.0 0.0 -1.0 -1.0 -1.0
H(+) -1.0 -1.0 1.0 1.0 1.0
H2O2 1.0 1.0 0.0 0.0 0.0
NAD(+) 0.0 1.0 0.0 0.0 0.0
NADH 0.0 -1.0 0.0 0.0 0.0
NADP(+) 1.0 0.0 0.0 0.0 0.0
NADPH -1.0 0.0 0.0 0.0 0.0
O2 -1.0 -1.0 0.0 0.0 0.0
[protein]-L-serine 0.0 0.0 0.0 -1.0 0.0
[protein]-L-threonine 0.0 0.0 0.0 0.0 -1.0
[protein]-O-phospho-L-serine 0.0 0.0 0.0 1.0 0.0
[protein]-O-phospho-L-threonine 0.0 0.0 0.0 0.0 1.0
beta-D-fructose 2,6-bisphosphate 0.0 0.0 1.0 0.0 0.0
beta-D-fructose 6-phosphate 0.0 0.0 -1.0 0.0 0.0

Calculate metabolite frequencies

In [137]:
# mask for selecting non-zero positions
mask_df = stoich_matrix != 0
mask_df.head()
Out[137]:
11263 11267 15656 17992 46611
ADP False False True True True
ATP False False True True True
H(+) True True True True True
H2O2 True True False False False
NAD(+) False True False False False
In [138]:
# Calculate metabolite frequencies
freq_df = mask_df.sum(axis='columns')
freq_df.head()
Out[138]:
ADP       3
ATP       3
H(+)      5
H2O2      2
NAD(+)    1
dtype: int64

Visualizing

In [139]:
%pylab inline

freq_df.plot(kind='barh')
Populating the interactive namespace from numpy and matplotlib
Out[139]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f03b9060400>
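On recent IPython versions, %matplotlib inline is the preferred replacement for %pylab inline. The bars are also easier to read if the frequencies are sorted first; a small optional tweak:

freq_df.sort_values().plot(kind='barh')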

Part 4. Nucleotide BLAST

Goal: Given several RNA sequences (sequences.csv), find their RefSeq identifiers.

Already done: BLASTn of those sequences against the rna database:

blastn -query input.fasta -db rna -word_size 8 -evalue 0.1 -out blastn.csv \
-outfmt "10 qseqid sseqid length mismatch gapopen qstart qend sstart send evalue bitscore score nident"

Input: blastn.csv containing BLASTn results, sequences.csv containing the sequence id to sequence mapping

TODO: keep only the RefSeq matches that are at least 80% identical to our sequences.

Reading tables

In [212]:
blast_df = pd.read_csv('./data/blastn.csv', index_col=False, header=None, 
                       sep=',')
blast_df.head()
Out[212]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 GSE25001|ILMN_1698220 gi|188219564|ref|NM_020432.4| 50 0 0 1 50 4612 4661 2.000000e-19 93.5 50 50
1 GSE25001|ILMN_1698220 gi|188219568|ref|NM_001127357.1| 50 0 0 1 50 4624 4673 2.000000e-19 93.5 50 50
2 GSE25001|ILMN_1698220 gi|188219570|ref|NM_001127358.1| 50 0 0 1 50 4701 4750 2.000000e-19 93.5 50 50
3 GSE25001|ILMN_1810835 gi|148229143|ref|NM_001097589.1| 50 0 0 1 50 601 650 2.000000e-19 93.5 50 50
4 GSE25001|ILMN_1810835 gi|147905322|ref|NM_005416.2| 50 0 0 1 50 685 734 2.000000e-19 93.5 50 50
In [213]:
# Let's name the columns properly
blast_df.columns = ['qseqid', 'gi.refseq.v', 'alignment.len', 'num.mismatch', 
                    'num.gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 
                    'bitscore', 'score', 'num.identical']

Cleaning tables

In [214]:
# Extract RefSeq identifiers without GI prefix and without versions, 
# e.g. gi|188219564|ref|NM_020432.4|  --> NM_020432

# First remove gi|XXXXXXX|ref|...
blast_df['RefSeq'] = blast_df['gi.refseq.v'].str.replace(r'^gi\|\d+\|ref\|', '', regex=True)
# ...then remove the version: .X|
blast_df['RefSeq'] = blast_df['RefSeq'].str.replace(r'\.\d+\|$', '', regex=True)
blast_df[['gi.refseq.v', 'RefSeq', 'evalue']].head()
Out[214]:
gi.refseq.v RefSeq evalue
0 gi|188219564|ref|NM_020432.4| NM_020432 2.000000e-19
1 gi|188219568|ref|NM_001127357.1| NM_001127357 2.000000e-19
2 gi|188219570|ref|NM_001127358.1| NM_001127358 2.000000e-19
3 gi|148229143|ref|NM_001097589.1| NM_001097589 2.000000e-19
4 gi|147905322|ref|NM_005416.2| NM_005416 2.000000e-19
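For reference, the same accession could be captured in a single step with str.extract; a sketch equivalent to the two replacements above (the pattern assumes the gi|...|ref|ACCESSION.version| format shown in the table):

blast_df['RefSeq'] = blast_df['gi.refseq.v'].str.extract(r'ref\|([A-Z]{2}_\d+)\.\d+\|',
                                                         expand=False)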
In [215]:
# Remove 'gi.refseq.v' column
blast_df.drop(labels=['gi.refseq.v'], axis=1, inplace=True)

Filtering tables

In [216]:
# Remove duplicated assignments of the same RefSeq to the same sequence
# (that could appear due to multiple BLASTN alignments).

# To do so we first sort the table, so that better e-values appear last...
blast_df.sort_values(by=['evalue'], axis=0, ascending=False, 
                     inplace=True, na_position='first')
blast_df[['qseqid', 'RefSeq', 'evalue']].head()
Out[216]:
qseqid RefSeq evalue
47757 GSE25001|ILMN_1661479 NM_002304 0.042
46921 GSE25001|ILMN_1784380 NM_030660 0.042
46916 GSE25001|ILMN_1784380 NM_001164779 0.042
46917 GSE25001|ILMN_1784380 NM_001164778 0.042
46918 GSE25001|ILMN_1784380 NM_001164780 0.042
In [217]:
# ... and then only keep the last occurrence of each sequence-RefSeq combination.
old_len = len(blast_df)
blast_df.drop_duplicates(subset=['qseqid', 'RefSeq'], 
                         keep='last', inplace=True)
(old_len, len(blast_df))
Out[217]:
(190476, 149648)

Calculating percentage of overlap

percent.ident = 100 * num.ident / seq.len

But we do not know the sequence lengths! So let's calculate them.

In [218]:
# Read sequence table
seq_df = pd.read_table('./data/sequences.csv', header=None, 
                       index_col=False, sep=',')
# Rename columns
seq_df.columns = ['qseqid', 'seq']
seq_df.head(n=2)
Out[218]:
qseqid seq
0 GSE25001|ILMN_1698220 CAAAGAGAATTGTGGCAGATGTTGTGTGTGAACTGTTGTTTCTTTG...
1 GSE25001|ILMN_1810835 GAAGCCAACCACCAGATGCTGGACACCCTCTTCCCATCTGTTTCTG...
In [219]:
# Calculate sequence lengths
seq_df['len'] = seq_df['seq'].str.len()
seq_df[['qseqid', 'len']].head(n=2)
Out[219]:
qseqid len
0 GSE25001|ILMN_1698220 50
1 GSE25001|ILMN_1810835 50
In [220]:
# Merge blast_df and seq_df tables
blast_df = pd.merge(blast_df, seq_df, on='qseqid', how='inner')
blast_df.loc[:, ['qseqid', 'RefSeq', 'num.identical', 'evalue', 'len']].head()
Out[220]:
qseqid RefSeq num.identical evalue len
0 GSE25001|ILMN_1661479 NM_002304 28 0.042 50
1 GSE25001|ILMN_1661479 NM_001040167 28 0.042 50
2 GSE25001|ILMN_1661479 NM_001166355 28 0.042 50
3 GSE25001|ILMN_1661479 NM_017542 23 0.042 50
4 GSE25001|ILMN_1661479 NM_001962 24 0.012 50
In [221]:
# Calculate percentage of match
blast_df['percent.ident'] = 100 * blast_df['num.identical'] / blast_df['len']
blast_df[['qseqid', 'RefSeq', 'num.identical', 'len', 'percent.ident']].head()
Out[221]:
qseqid RefSeq num.identical len percent.ident
0 GSE25001|ILMN_1661479 NM_002304 28 50 56.0
1 GSE25001|ILMN_1661479 NM_001040167 28 50 56.0
2 GSE25001|ILMN_1661479 NM_001166355 28 50 56.0
3 GSE25001|ILMN_1661479 NM_017542 23 50 46.0
4 GSE25001|ILMN_1661479 NM_001962 24 50 48.0
In [222]:
# Filter by percentage
blast_df = blast_df[blast_df['percent.ident'] >= 80]
blast_df = blast_df.sort_values(by=['percent.ident'], axis=0, ascending=True, na_position='last')
blast_df[['qseqid', 'RefSeq', 'percent.ident']].head()
Out[222]:
qseqid RefSeq percent.ident
30471 GSE25001|ILMN_1806032 NR_026974 80.0
89517 GSE25001|ILMN_1652464 NR_003608 80.0
35794 GSE25001|ILMN_1666652 XR_109511 80.0
35777 GSE25001|ILMN_1666652 NR_026999 80.0
35769 GSE25001|ILMN_1666652 NM_001170423 80.0

Saving tables

In [223]:
# Save the result
blast_df.to_csv('./data/blastn_filtered.csv', header=True, index=False, sep=',')

Part 5. Exercise: Which of the genes are significant?

We performed 5 screens and obtained z-scores for some of the genes from the previous example, and would like to extract the significant ones among them (present in at least 3 screens, with a p-value below 0.01 computed from the mean z-score).

Input: blastn_filtered.csv from the previous example, screens.csv mapping genes identified by RefSeq to z-scores from different experiments.

Goal: A table of significant genes (pvalue < 0.01) identified by qseqid, their pvalues and percent.ident.

Steps

  1. Prepare blastn table:
    1. If needed, read './data/blastn_filtered.csv'
    2. Keep only columns 'qseqid', 'RefSeq' and 'percent.ident'
    3. Remove duplicates
  2. Prepare the screen table:
    1. Read the table './data/screens.csv'
    2. Remove genes that were found in fewer than 3 screens. Check out the pd.DataFrame.count(...) function
    3. Create a new column that will contain the mean z-score (based on the available screen z-scores). Check out the pd.DataFrame.mean(...) function
    4. Convert the mean z-score to a two-sided p-value and store it in a separate column: p_values = 2 * scipy.stats.norm.sf(abs(z_scores))
    5. Remove unneeded columns
  3. Combine the tables
    1. Merge screen and blastn tables. Keep only genes for which a p-value is available
    2. Visualize a histogram of pvalues. Check out pd.DataFrame.hist(...) function
  4. Select significant genes
    1. Filter the combined table to select significant genes
    2. Sort by pvalue (most significant first)
  5. Save the table in tab-delimited format and to Excel (a possible sketch of these steps follows)
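One possible sketch of the exercise (the layout of screens.csv is not shown here, so a header row with a 'RefSeq' column followed by one column per screen is assumed; column and output file names are made up and should be adapted):

import scipy.stats

# step 2.1: read the screen table (assumed layout: 'RefSeq' + one column per screen)
screen_df = pd.read_csv('./data/screens.csv', index_col=False, header=0, sep=',')
zscore_cols = [c for c in screen_df.columns if c != 'RefSeq']

# step 2.2: keep genes measured in at least 3 screens (count ignores NaNs)
screen_df = screen_df[screen_df[zscore_cols].count(axis='columns') >= 3]

# steps 2.3-2.5: mean z-score over the available screens, converted to a two-sided p-value
screen_df['mean.zscore'] = screen_df[zscore_cols].mean(axis='columns')
screen_df['pvalue'] = 2 * scipy.stats.norm.sf(screen_df['mean.zscore'].abs())
screen_df = screen_df[['RefSeq', 'pvalue']]

# step 3: merge with the blastn table, keep only genes with a p-value, and plot the p-values
result = pd.merge(blast_df[['qseqid', 'RefSeq', 'percent.ident']].drop_duplicates(),
                  screen_df, on='RefSeq', how='inner')
result['pvalue'].hist()

# step 4: significant genes, most significant first
result = result[result['pvalue'] < 0.01].sort_values(by='pvalue')

# step 5: save in tab-delimited and Excel formats
result.to_csv('./data/significant_genes.tab', sep='\t', header=True, index=False)
result.to_excel('./data/significant_genes.xlsx', header=True, index=False)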