(in terminal)
conda install pandas
conda install xlrd # for Excel reading support
conda install openpyxl # for Excel writing support
import pandas as pd
Problem: Let's build a metabolic model of a panda (Ailuropoda melanoleuca, taxon id 9646) using a reaction database.
Glucose + ATP = Glucose-6-phosphate + ADP
(catalysed by hexokinase)
# from tab/csv ...
rxn_df = pd.read_csv('./data/rhea2uniprot.tsv', index_col=None, header=0,
sep='\t')
# ... or from Excel
rxn_df = pd.read_excel('./data/rhea2uniprot.xlsx', index_col=None, header=0,
sheet_name=0)
# show the first few rows
rxn_df.head()
RHEA_ID | DIRECTION | MASTER_ID | ID | |
---|---|---|---|---|
0 | 10011 | BI | 10008 | A5ENR3 |
1 | 10011 | BI | 10008 | A5VCB6 |
2 | 10011 | BI | 10008 | A7HCA3 |
3 | 10011 | BI | 10008 | A9KBI4 |
4 | 10011 | BI | 10008 | B4UA87 |
# specify how many rows "a few" is
rxn_df.head(n=3)
RHEA_ID | DIRECTION | MASTER_ID | ID | |
---|---|---|---|---|
0 | 10011 | BI | 10008 | A5ENR3 |
1 | 10011 | BI | 10008 | A5VCB6 |
2 | 10011 | BI | 10008 | A7HCA3 |
# show the last 3 rows
rxn_df.tail(n=3)
RHEA_ID | DIRECTION | MASTER_ID | ID | |
---|---|---|---|---|
34396 | 51223 | BI | 51220 | Q9Y7Q2 |
34397 | 51223 | BI | 51220 | Q9ZRW8 |
34398 | 51223 | BI | 51220 | Q9ZW30 |
# let's see some descriptive statistics
rxn_df.describe(include='all')
RHEA_ID | DIRECTION | MASTER_ID | ID | |
---|---|---|---|---|
count | 34399.000000 | 34399 | 34399.000000 | 34399 |
unique | NaN | 1 | NaN | 24315 |
top | NaN | BI | NaN | P59911 |
freq | NaN | 34399 | NaN | 17 |
mean | 23635.779703 | NaN | 23632.779703 | NaN |
std | 10532.838213 | NaN | 10532.838213 | NaN |
min | 10011.000000 | NaN | 10008.000000 | NaN |
25% | 15644.000000 | NaN | 15641.000000 | NaN |
50% | 21251.000000 | NaN | 21248.000000 | NaN |
75% | 28097.000000 | NaN | 28094.000000 | NaN |
max | 51223.000000 | NaN | 51220.000000 | NaN |
# table dimensions
rxn_df.shape
(34399, 4)
# len gives the number of rows
len(rxn_df)
34399
Column names are stored in the columns attribute as an Index object
rxn_df.columns
Index(['RHEA_ID', 'DIRECTION', 'MASTER_ID', 'ID'], dtype='object')
So, RHEA_ID is the reaction identifier and ID is the Uniprot identifier; the rest we do not need for now and can remove.
We can select a subset using a list of valid column names
# Select specific columns
rxn_df = rxn_df[['RHEA_ID', 'ID']]
rxn_df.head(n=3)
RHEA_ID | ID | |
---|---|---|
0 | 10011 | A5ENR3 |
1 | 10011 | A5VCB6 |
2 | 10011 | A7HCA3 |
# which RHEA_ID is in the 4th row?
rxn_df.loc[3, 'RHEA_ID']
10011
# what are the values of columns 0 and 1 for rows 1765 to 1769?
rxn_df.iloc[1765:1770, [0, 1]]
RHEA_ID | ID | |
---|---|---|
1765 | 11063 | Q9RUP8 |
1766 | 11063 | Q9XDB4 |
1767 | 11067 | A0M0J2 |
1768 | 11067 | A0M764 |
1769 | 11067 | A0RP36 |
NB: loc selects by labels (row and column names), while iloc selects by integer positions.
Remember the indexing and slicing syntax in Python: Pandas dataframes use the same mechanism.
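One difference is worth remembering: iloc slices are end-exclusive like Python lists, while loc slices include the end label. A minimal sketch on a toy frame (the column names here are made up for illustration):

```python
import pandas as pd

# a toy frame with a default integer index (labels happen to equal positions)
df = pd.DataFrame({'a': [10, 20, 30, 40], 'b': list('wxyz')})

# iloc: positional, end-exclusive -- like Python list slicing
pos = df.iloc[1:3]   # rows at positions 1 and 2

# loc: label-based, end-INCLUSIVE
lab = df.loc[1:3]    # rows with labels 1, 2 AND 3

print(len(pos), len(lab))  # 2 3
```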
# select Uniprot column
rxn_df['ID'].head()
0    A5ENR3
1    A5VCB6
2    A7HCA3
3    A9KBI4
4    B4UA87
Name: ID, dtype: object
# how many genes are there?
len(rxn_df['ID'])
34399
# and unique ones?
len(rxn_df['ID'].unique())
24315
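As a side note, Series.nunique() computes the same count in one call (it also skips NaN values by default). A toy example:

```python
import pandas as pd

s = pd.Series(['A5ENR3', 'A5VCB6', 'A5ENR3'])

# nunique() is a shortcut for len(s.unique()) when there are no NaNs
print(len(s.unique()), s.nunique())  # 2 2
```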
But how do we know which of these genes are the panda's?
Let's find out using the odb9_genes.tab gene table (here we use a sample of it, odb9_genes_sample.tab).
gene_df = pd.read_table('./data/odb9_genes_sample.tab', index_col=None,
sep='\t', header=None)
gene_df.head(n=3)
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 323097:000d58 | 323097 | ABE64319 | Q1QHH8 | Nham_3591 | NaN | Alpha-IPM isomerase |
1 | 390235:000063 | 390235 | ACA70606 | B1J4B0 | PputW619_0100 | NaN | Tryptophan synthase alpha chain |
2 | 283643:0000f4 | 283643 | P0CP81 | P0CP81 | NaN | 4936978.0 | hypothetical protein |
# Let's only keep gene id and Uniprot id, and rename the columns
gene_df = gene_df.iloc[:, [0, 3]]
gene_df.columns = ['gene', 'Uniprot']
gene_df.head(n=3)
gene | Uniprot | |
---|---|---|
0 | 323097:000d58 | Q1QHH8 |
1 | 390235:000063 | B1J4B0 |
2 | 283643:0000f4 | P0CP81 |
# select the gene column
gene_df['gene'].head(n=5)
0    323097:000d58
1    390235:000063
2    283643:0000f4
3    284812:0000c8
4      9646:003cac
Name: gene, dtype: object
# which rows contain panda's genes?
# many ways to do it; here we use a regular expression (^ anchors the match to the start of the string)
mask = gene_df['gene'].str.contains('^9646:')
mask.head(n=5)
0    False
1    False
2    False
3    False
4     True
Name: gene, dtype: bool
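For a plain prefix test like this one, str.startswith works too, with no regular expression needed. A toy example (the ids are made up):

```python
import pandas as pd

genes = pd.Series(['323097:000d58', '9646:003cac', '96461:0000aa'])

# str.startswith does a literal prefix test; keeping the trailing ':'
# still guards against other taxa such as 96461
mask = genes.str.startswith('9646:')
print(mask.tolist())  # [False, True, False]
```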
# select only those rows that contain panda's genes
gene_df[mask].head(n=3)
gene | Uniprot | |
---|---|---|
4 | 9646:003cac | NaN |
9 | 9646:00079a | G1L5N9 |
12 | 9646:0042dd | G1MI96 |
# get panda's uniprots
up_column = gene_df.loc[mask, 'Uniprot']
panda_uniprots = up_column.unique()
len(panda_uniprots)
4807
We can write the code fragment above more concisely by chaining the calls:
# get panda's uniprots
panda_uniprots = gene_df.loc[mask, 'Uniprot'].unique()
len(panda_uniprots)
4807
# Select reactions that are catalysed by panda's genes
panda_rxn_df = rxn_df[rxn_df['ID'].isin(panda_uniprots)]
panda_rxn_df.head()
RHEA_ID | ID | |
---|---|---|
12112 | 17992 | D2HXI8 |
28498 | 34282 | D2HBV9 |
33177 | 46611 | D2HXI8 |
Alternatively, query lets you refer to column names directly; external variables must be prefixed with @:
panda_rxn_df = rxn_df.query("ID in @panda_uniprots")
panda_rxn_df.head()
RHEA_ID | ID | |
---|---|---|
12112 | 17992 | D2HXI8 |
28498 | 34282 | D2HBV9 |
33177 | 46611 | D2HXI8 |
# Let's rename the columns
panda_rxn_df.columns = ['RHEA_ID', 'panda_id']
panda_rxn_df.head(n=10)
RHEA_ID | panda_id | |
---|---|---|
12112 | 17992 | D2HXI8 |
28498 | 34282 | D2HBV9 |
33177 | 46611 | D2HXI8 |
Only three of the panda's reactions, so sad :'(
Let's find more via orthologs, using the OrthoDB orthologous-group table (odb9_OG2genes_sample.tab).
og_df = pd.read_table('./data/odb9_OG2genes_sample.tab', index_col=None,
sep='\t', header=None)
og_df.head(n=3)
0 | 1 | |
---|---|---|
0 | POG091H02K5 | 190486:0003bc |
1 | POG091B02KO | 221109:0007c2 |
2 | EOG090A04GW | 9646:003fea |
# let's have better names for columns
og_df.columns = ['OG', 'gene']
# select rows that contain panda's genes
mask = og_df['gene'].str.contains('^9646:')
og_df[mask].head(n=3)
OG | gene | |
---|---|---|
2 | EOG090A04GW | 9646:003fea |
5 | EOG090B094A | 9646:002cbf |
7 | EOG0908065O | 9646:0034fe |
# select panda genes
panda_genes = og_df.loc[mask, 'gene'].unique()
panda_genes
array(['9646:003fea', '9646:002cbf', '9646:0034fe', ..., '9646:00404d', '9646:001a26', '9646:0013f1'], dtype=object)
# select OGs that contain panda's genes
panda_ogs = og_df.loc[mask, 'OG'].unique()
panda_ogs
array(['EOG090A04GW', 'EOG090B094A', 'EOG0908065O', ..., 'EOG091G02ID', 'EOG091G0F1K', 'EOG090803ZY'], dtype=object)
# filter the table to contain only these OGs
old_len = len(og_df)
og_df = og_df[og_df['OG'].isin(panda_ogs)]
(old_len, len(og_df))
(33880, 8347)
og_df.head(n=3)
OG | gene | |
---|---|---|
2 | EOG090A04GW | 9646:003fea |
4 | EOG091G0B1Y | 473542:002ab8 |
5 | EOG090B094A | 9646:002cbf |
# mask selecting panda's genes
mask = og_df['gene'].isin(panda_genes)
mask.head(n=3)
2     True
4    False
5     True
Name: gene, dtype: bool
# mask selecting non-panda's genes
~mask.head(n=3)
2    False
4     True
5    False
Name: gene, dtype: bool
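Masks combine with the element-wise boolean operators & (and), | (or) and ~ (not); the Python keywords and/or/not do not work on Series. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'gene': ['9646:01', '9615:02', '9646:03'],
                   'OG': ['EOG1', 'EOG1', 'EOG2']})

is_panda = df['gene'].str.startswith('9646:')
in_og1 = df['OG'] == 'EOG1'

# element-wise boolean algebra; parenthesise compound conditions
both = is_panda & in_og1
either = is_panda | in_og1
print(both.tolist(), either.tolist())
```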
# get panda's gene to orthologous gene mapping
gene2gene = pd.merge(og_df[mask], og_df[~mask], on='OG', how='inner')
gene2gene.head(n=3)
OG | gene_x | gene_y | |
---|---|---|---|
0 | EOG091G00PO | 9646:000d66 | 9615:0027eb |
1 | EOG090M04R7 | 9646:0002b5 | 10090:001b13 |
2 | EOG090705I4 | 9646:0007fd | 9606:003167 |
# we only need the gene columns, and with better names
gene2gene = gene2gene.iloc[:, 1:]
gene2gene.columns = ['panda_gene', 'other_gene']
gene2gene.head(n=3)
panda_gene | other_gene | |
---|---|---|
0 | 9646:000d66 | 9615:0027eb |
1 | 9646:0002b5 | 10090:001b13 |
2 | 9646:0007fd | 9606:003167 |
But these are not Uniprot ids...
# merge panda's genes with the gene table to obtain their Uniprots
gene2gene = pd.merge(gene2gene, gene_df, left_on='panda_gene',
right_on='gene', how='left')
gene2gene.head(n=3)
panda_gene | other_gene | gene | Uniprot | |
---|---|---|---|---|
0 | 9646:000d66 | 9615:0027eb | 9646:000d66 | G1LA54 |
1 | 9646:0002b5 | 10090:001b13 | 9646:0002b5 | G1L1J8 |
2 | 9646:0007fd | 9606:003167 | 9646:0007fd | G1L5Y1 |
# merge other genes with the gene table to obtain their Uniprots
gene2gene = pd.merge(gene2gene, gene_df, left_on='other_gene',
right_on='gene', how='inner')
gene2gene.head(n=3)
panda_gene | other_gene | gene_x | Uniprot_x | gene_y | Uniprot_y | |
---|---|---|---|---|---|---|
0 | 9646:000d66 | 9615:0027eb | 9646:000d66 | G1LA54 | 9615:0027eb | Q9MZF4 |
1 | 9646:000d66 | 9615:0027eb | 9646:000d66 | G1LA54 | 9615:0027eb | Q9MZF4 |
2 | 9646:0002b5 | 10090:001b13 | 9646:0002b5 | G1L1J8 | 10090:001b13 | O35495 |
# remove duplicated columns produced by merge...
gene2gene = gene2gene.iloc[:, [0, 1, 3, 5]]
# ...and rename them all (an Index is immutable, so we assign a new list of names)
gene2gene.columns = ['panda_gene', 'other_gene', 'panda_Uniprot', 'other_Uniprot']
gene2gene.head(n=3)
panda_gene | other_gene | panda_Uniprot | other_Uniprot | |
---|---|---|---|---|
0 | 9646:000d66 | 9615:0027eb | G1LA54 | Q9MZF4 |
1 | 9646:000d66 | 9615:0027eb | G1LA54 | Q9MZF4 |
2 | 9646:0002b5 | 10090:001b13 | G1L1J8 | O35495 |
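As a side note, the _x/_y cleanup can be avoided altogether by naming the suffixes at merge time. A minimal sketch with toy data (ids made up):

```python
import pandas as pd

left = pd.DataFrame({'OG': ['EOG1', 'EOG1'], 'gene': ['9646:01', '9646:02']})
right = pd.DataFrame({'OG': ['EOG1'], 'gene': ['9615:09']})

# suffixes= controls how clashing column names are disambiguated
merged = pd.merge(left, right, on='OG', suffixes=('_panda', '_other'))
print(list(merged.columns))  # ['OG', 'gene_panda', 'gene_other']
```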
# remove rows without other Uniprot
old_len = len(gene2gene)
gene2gene.dropna(axis=0, how='any', subset=['other_Uniprot'], inplace=True)
(old_len, len(gene2gene))
(239, 239)
# remove duplicated rows
old_len = len(gene2gene)
gene2gene.drop_duplicates(inplace=True)
(old_len, len(gene2gene))
(239, 215)
Merge the gene mapping with the reaction table
# Let's specify a panda_id to be Uniprot if it's present, else panda_gene
import numpy
gene2gene['panda_id'] = numpy.where(gene2gene['panda_Uniprot'].isnull(),
gene2gene['panda_gene'],
gene2gene['panda_Uniprot'])
gene2gene.head(n=3)
panda_gene | other_gene | panda_Uniprot | other_Uniprot | panda_id | |
---|---|---|---|---|---|
0 | 9646:000d66 | 9615:0027eb | G1LA54 | Q9MZF4 | G1LA54 |
2 | 9646:0002b5 | 10090:001b13 | G1L1J8 | O35495 | G1L1J8 |
4 | 9646:0007fd | 9606:003167 | G1L5Y1 | Q16875 | G1L5Y1 |
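The same "Uniprot if present, else gene" logic can also be written with fillna, which reads a bit more directly than numpy.where. A toy sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'panda_Uniprot': ['G1LA54', np.nan],
                   'panda_gene': ['9646:000d66', '9646:003cac']})

# fillna takes a Series of per-row fallback values
df['panda_id'] = df['panda_Uniprot'].fillna(df['panda_gene'])
print(df['panda_id'].tolist())  # ['G1LA54', '9646:003cac']
```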
# Remove unneeded columns
gene2gene = gene2gene.iloc[:, 3:]
gene2gene.head(n=3)
other_Uniprot | panda_id | |
---|---|---|
0 | Q9MZF4 | G1LA54 |
2 | O35495 | G1L1J8 |
4 | Q16875 | G1L5Y1 |
# What was in the reaction table?
rxn_df.head(n=2)
RHEA_ID | ID | |
---|---|---|
0 | 10011 | A5ENR3 |
1 | 10011 | A5VCB6 |
# And in the gene2gene one?
gene2gene.head(n=2)
other_Uniprot | panda_id | |
---|---|---|
0 | Q9MZF4 | G1LA54 |
2 | O35495 | G1L1J8 |
# Merge them
gene2rxn = pd.merge(gene2gene, rxn_df, left_on='other_Uniprot',
right_on='ID', how='inner')
gene2rxn.head(n=3)
other_Uniprot | panda_id | RHEA_ID | ID | |
---|---|---|---|---|
0 | Q9MZF4 | G1LA54 | 11263 | Q9MZF4 |
1 | Q9MZF4 | G1LA54 | 11267 | Q9MZF4 |
2 | O35495 | G1L1J8 | 17992 | O35495 |
# Clean a bit
gene2rxn = gene2rxn.iloc[:, 1:3]
gene2rxn.drop_duplicates(inplace=True)
gene2rxn.tail(n=5)
panda_id | RHEA_ID | |
---|---|---|
305 | G1MGG1 | 24595 |
306 | D2HGR3 | 11359 |
307 | D2HGR3 | 29094 |
308 | G1M3B6 | 16240 |
310 | D2HD49 | 13068 |
# Add panda gene mapping
gene2rxn = pd.concat([gene2rxn, panda_rxn_df], axis=0)
gene2rxn.drop_duplicates(inplace=True)
gene2rxn.tail(n=5)
RHEA_ID | panda_id | |
---|---|---|
308 | 16240 | G1M3B6 |
310 | 13068 | D2HD49 |
12112 | 17992 | D2HXI8 |
28498 | 34282 | D2HBV9 |
33177 | 46611 | D2HXI8 |
gene2rxn.describe(include='all')
RHEA_ID | panda_id | |
---|---|---|
count | 254.000000 | 254 |
unique | NaN | 173 |
top | NaN | G1L820 |
freq | NaN | 11 |
mean | 23635.767717 | NaN |
std | 12437.744535 | NaN |
min | 10011.000000 | NaN |
25% | 14656.000000 | NaN |
50% | 18278.000000 | NaN |
75% | 32235.000000 | NaN |
max | 51223.000000 | NaN |
# change Rhea ID type to string for the description to be more meaningful
gene2rxn['RHEA_ID'] = gene2rxn['RHEA_ID'].astype(str)
gene2rxn.describe(include='all')
RHEA_ID | panda_id | |
---|---|---|
count | 254 | 254 |
unique | 152 | 173 |
top | 46611 | G1L820 |
freq | 20 | 11 |
# save mappings in tab-delimited format
gene2gene.to_csv('./data/gene2gene.tab', sep='\t', header=True, index=False)
gene2rxn.to_csv('./data/rxn2gene.tab', sep='\t', header=True, index=False)
# save mappings in Excel format
gene2gene.to_excel('./data/gene2gene.xlsx', header=True, index=False)
gene2rxn.to_excel('./data/rxn2gene.xlsx', header=True, index=False)
# save mappings to the same Excel document
with pd.ExcelWriter('./data/mappings.xlsx') as writer:
for df, sheet in ((gene2gene, 'gene'), (gene2rxn, 'reaction')):
df.to_excel(writer, sheet_name=sheet, header=True, index=False)
We know the Rhea reaction ids, but not their formulas.
Let's extract the reaction formulas using the Rhea web services...
import requests
from xml.etree import ElementTree
RHEA_XML_PREFIX = '{http://www.xml-cml.org/schema/cml2/react}'
def get_metabolites(rxn_id):
"""
Given a Rhea (http://www.rhea-db.org/) reaction id,
returns its participant metabolites as a dict: {metabolite: stoichiometry},
e.g. '2 H2 + O2 = 2 H2O' would be represented as {'H2': -2, 'O2': -1, 'H2O': 2}.
:param rxn_id: Rhea reaction id
:return: dict of participant metabolites.
"""
response = requests.get('http://www.rhea-db.org/rest/1.0/ws/reaction/cmlreact/%s' % rxn_id)
ms = {}
if response.ok:
# parse the XML response
tree = ElementTree.fromstring(response.content)
ms.update({m.attrib['title']: -float(m.attrib['count'])
for m in tree.iter('%sreactant' % RHEA_XML_PREFIX)})
ms.update({m.attrib['title']: float(m.attrib['count'])
for m in tree.iter('%sproduct' % RHEA_XML_PREFIX)})
return ms
# As web services are slow, let's work on just a few reactions
rxn_df = gene2rxn.head(n=5).copy()
rxn_df
RHEA_ID | panda_id | |
---|---|---|
0 | 11263 | G1LA54 |
1 | 11267 | G1LA54 |
2 | 17992 | G1L1J8 |
3 | 46611 | G1L1J8 |
4 | 15656 | G1L5Y1 |
# load metabolites of reaction 11263 from web services
get_metabolites('11263')
{'H(+)': -1.0, 'H2O2': 1.0, 'NADP(+)': 1.0, 'NADPH': -1.0, 'O2': -1.0}
# load metabolites for all our reactions, takes about a minute
rxn2ms = {r_id: get_metabolites(r_id) for r_id in rxn_df['RHEA_ID'].unique()}
rxn2ms['11263']
{'H(+)': -1.0, 'H2O2': 1.0, 'NADP(+)': 1.0, 'NADPH': -1.0, 'O2': -1.0}
# Add information on metabolites to our table
rxn_df['Reactants'] = rxn_df['RHEA_ID'].apply(lambda r_id: [m for (m, s) in rxn2ms[r_id].items() if s < 0])
rxn_df['Products'] = rxn_df['RHEA_ID'].apply(lambda r_id: [m for (m, s) in rxn2ms[r_id].items() if s > 0])
rxn_df
RHEA_ID | panda_id | Reactants | Products | |
---|---|---|---|---|
0 | 11263 | G1LA54 | [H(+), NADPH, O2] | [NADP(+), H2O2] |
1 | 11267 | G1LA54 | [NADH, H(+), O2] | [H2O2, NAD(+)] |
2 | 17992 | G1L1J8 | [ATP, [protein]-L-serine] | [ADP, [protein]-O-phospho-L-serine, H(+)] |
3 | 46611 | G1L1J8 | [ATP, [protein]-L-threonine] | [H(+), ADP, [protein]-O-phospho-L-threonine] |
4 | 15656 | G1L5Y1 | [ATP, beta-D-fructose 6-phosphate] | [ADP, H(+), beta-D-fructose 2,6-bisphosphate] |
# build the stoichiometric matrix (metabolites x reactions)
stoich_matrix = pd.DataFrame.from_dict(rxn2ms, orient='columns')
stoich_matrix
11263 | 11267 | 15656 | 17992 | 46611 | |
---|---|---|---|---|---|
ADP | NaN | NaN | 1.0 | 1.0 | 1.0 |
ATP | NaN | NaN | -1.0 | -1.0 | -1.0 |
H(+) | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
H2O2 | 1.0 | 1.0 | NaN | NaN | NaN |
NAD(+) | NaN | 1.0 | NaN | NaN | NaN |
NADH | NaN | -1.0 | NaN | NaN | NaN |
NADP(+) | 1.0 | NaN | NaN | NaN | NaN |
NADPH | -1.0 | NaN | NaN | NaN | NaN |
O2 | -1.0 | -1.0 | NaN | NaN | NaN |
[protein]-L-serine | NaN | NaN | NaN | -1.0 | NaN |
[protein]-L-threonine | NaN | NaN | NaN | NaN | -1.0 |
[protein]-O-phospho-L-serine | NaN | NaN | NaN | 1.0 | NaN |
[protein]-O-phospho-L-threonine | NaN | NaN | NaN | NaN | 1.0 |
beta-D-fructose 2,6-bisphosphate | NaN | NaN | 1.0 | NaN | NaN |
beta-D-fructose 6-phosphate | NaN | NaN | -1.0 | NaN | NaN |
# Replace NaNs with zeros
stoich_matrix.fillna(0, inplace=True)
stoich_matrix
11263 | 11267 | 15656 | 17992 | 46611 | |
---|---|---|---|---|---|
ADP | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
ATP | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 |
H(+) | -1.0 | -1.0 | 1.0 | 1.0 | 1.0 |
H2O2 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
NAD(+) | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
NADH | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 |
NADP(+) | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
NADPH | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
O2 | -1.0 | -1.0 | 0.0 | 0.0 | 0.0 |
[protein]-L-serine | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 |
[protein]-L-threonine | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 |
[protein]-O-phospho-L-serine | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
[protein]-O-phospho-L-threonine | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
beta-D-fructose 2,6-bisphosphate | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
beta-D-fructose 6-phosphate | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
# mask for selecting non-zero positions
mask_df = stoich_matrix != 0
mask_df.head()
11263 | 11267 | 15656 | 17992 | 46611 | |
---|---|---|---|---|---|
ADP | False | False | True | True | True |
ATP | False | False | True | True | True |
H(+) | True | True | True | True | True |
H2O2 | True | True | False | False | False |
NAD(+) | False | True | False | False | False |
# Calculate metabolite frequencies
freq_df = mask_df.sum(axis='columns')
freq_df.head()
ADP       3
ATP       3
H(+)      5
H2O2      2
NAD(+)    1
dtype: int64
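The axis argument decides the direction of the sum: axis='columns' collapses the columns (one value per metabolite), axis='index' collapses the rows (one value per reaction). A toy sketch with made-up reaction names:

```python
import pandas as pd

m = pd.DataFrame({'r1': [1, 0], 'r2': [1, 1]}, index=['ADP', 'ATP'])

# sum across columns: one value per row (metabolite)
per_metabolite = m.sum(axis='columns')
# sum down the rows: one value per column (reaction)
per_reaction = m.sum(axis='index')
print(per_metabolite.tolist(), per_reaction.tolist())  # [2, 1] [1, 2]
```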
%matplotlib inline
freq_df.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x7f03b9060400>
Goal: Given several RNA sequences (sequences.csv), find their RefSeq identifiers.
Already done: a BLASTn search of those sequences against the rna database:
blastn -query input.fasta -db rna -word_size 8 -evalue 0.1 -out blastn.tab \
-outfmt "10 qseqid sseqid length mismatch gapopen qstart qend sstart send evalue bitscore score nident"
Input: blastn.tab containing BLASTn results, sequences.csv containing sequence id to sequence mapping
TODO: keep only the RefSeq matches that are at least 80% identical to our sequences.
blast_df = pd.read_csv('./data/blastn.csv', index_col=False, header=None,
sep=',')
blast_df.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GSE25001|ILMN_1698220 | gi|188219564|ref|NM_020432.4| | 50 | 0 | 0 | 1 | 50 | 4612 | 4661 | 2.000000e-19 | 93.5 | 50 | 50 |
1 | GSE25001|ILMN_1698220 | gi|188219568|ref|NM_001127357.1| | 50 | 0 | 0 | 1 | 50 | 4624 | 4673 | 2.000000e-19 | 93.5 | 50 | 50 |
2 | GSE25001|ILMN_1698220 | gi|188219570|ref|NM_001127358.1| | 50 | 0 | 0 | 1 | 50 | 4701 | 4750 | 2.000000e-19 | 93.5 | 50 | 50 |
3 | GSE25001|ILMN_1810835 | gi|148229143|ref|NM_001097589.1| | 50 | 0 | 0 | 1 | 50 | 601 | 650 | 2.000000e-19 | 93.5 | 50 | 50 |
4 | GSE25001|ILMN_1810835 | gi|147905322|ref|NM_005416.2| | 50 | 0 | 0 | 1 | 50 | 685 | 734 | 2.000000e-19 | 93.5 | 50 | 50 |
# Let's name the columns properly
blast_df.columns = ['qseqid', 'gi.refseq.v', 'alignment.len', 'num.mismatch',
'num.gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue',
'bitscore', 'score', 'num.identical']
# Extract RefSeq identifiers without GI prefix and without versions,
# e.g. gi|188219564|ref|NM_020432.4| --> NM_020432
# First remove gi|XXXXXXX|ref|...
blast_df['RefSeq'] = blast_df['gi.refseq.v'].str.replace(r'^gi\|\d+\|ref\|', '', regex=True)
# ...then remove the version: .X|
blast_df['RefSeq'] = blast_df['RefSeq'].str.replace(r'\.\d+\|$', '', regex=True)
blast_df[['gi.refseq.v', 'RefSeq', 'evalue']].head()
gi.refseq.v | RefSeq | evalue | |
---|---|---|---|
0 | gi|188219564|ref|NM_020432.4| | NM_020432 | 2.000000e-19 |
1 | gi|188219568|ref|NM_001127357.1| | NM_001127357 | 2.000000e-19 |
2 | gi|188219570|ref|NM_001127358.1| | NM_001127358 | 2.000000e-19 |
3 | gi|148229143|ref|NM_001097589.1| | NM_001097589 | 2.000000e-19 |
4 | gi|147905322|ref|NM_005416.2| | NM_005416 | 2.000000e-19 |
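The two replace calls could also be collapsed into a single str.extract with one capturing group, assuming the accessions always look like LETTERS_DIGITS.VERSION. A toy sketch:

```python
import pandas as pd

s = pd.Series(['gi|188219564|ref|NM_020432.4|',
               'gi|147905322|ref|NM_005416.2|'])

# one capturing group pulls out the version-less accession in a single pass
refseq = s.str.extract(r'ref\|([A-Z]+_\d+)\.', expand=False)
print(refseq.tolist())  # ['NM_020432', 'NM_005416']
```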
# Remove 'gi.refseq.v' column
blast_df.drop(labels=['gi.refseq.v'], axis=1, inplace=True)
# Remove repeated assignments of the same RefSeq to the same sequence
# (these can appear due to multiple BLASTn alignments).
# To do so we first sort the table, so that better e-values appear last...
blast_df.sort_values(by=['evalue'], axis=0, ascending=False,
inplace=True, na_position='first')
blast_df[['qseqid', 'RefSeq', 'evalue']].head()
qseqid | RefSeq | evalue | |
---|---|---|---|
47757 | GSE25001|ILMN_1661479 | NM_002304 | 0.042 |
46921 | GSE25001|ILMN_1784380 | NM_030660 | 0.042 |
46916 | GSE25001|ILMN_1784380 | NM_001164779 | 0.042 |
46917 | GSE25001|ILMN_1784380 | NM_001164778 | 0.042 |
46918 | GSE25001|ILMN_1784380 | NM_001164780 | 0.042 |
# ... and then keep only the last occurrence of each sequence-RefSeq combination.
old_len = len(blast_df)
blast_df.drop_duplicates(subset=['qseqid', 'RefSeq'],
keep='last', inplace=True)
(old_len, len(blast_df))
(190476, 149648)
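An alternative to sort-then-drop-duplicates is to pick the best (smallest) e-value row per pair directly, via groupby and idxmin. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'qseqid': ['q1', 'q1', 'q1'],
                   'RefSeq': ['NM_1', 'NM_1', 'NM_2'],
                   'evalue': [0.042, 2e-19, 0.012]})

# idxmin returns the row index of the minimal e-value within each group
best = df.loc[df.groupby(['qseqid', 'RefSeq'])['evalue'].idxmin()]
print(sorted(best['evalue'].tolist()))  # [2e-19, 0.012]
```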
percent.ident = 100 * num.identical / sequence length
But we do not know the sequence lengths! So let's calculate them.
# Read sequence table
seq_df = pd.read_table('./data/sequences.csv', header=None,
index_col=False, sep=',')
# Rename columns
seq_df.columns = ['qseqid', 'seq']
seq_df.head(n=2)
qseqid | seq | |
---|---|---|
0 | GSE25001|ILMN_1698220 | CAAAGAGAATTGTGGCAGATGTTGTGTGTGAACTGTTGTTTCTTTG... |
1 | GSE25001|ILMN_1810835 | GAAGCCAACCACCAGATGCTGGACACCCTCTTCCCATCTGTTTCTG... |
# Calculate sequences lengths
seq_df['len'] = seq_df['seq'].str.len()
seq_df[['qseqid', 'len']].head(n=2)
qseqid | len | |
---|---|---|
0 | GSE25001|ILMN_1698220 | 50 |
1 | GSE25001|ILMN_1810835 | 50 |
# Merge blast_df and seq_df tables
blast_df = pd.merge(blast_df, seq_df, on='qseqid', how='inner')
blast_df.loc[:, ['qseqid', 'RefSeq', 'num.identical', 'evalue', 'len']].head()
qseqid | RefSeq | num.identical | evalue | len | |
---|---|---|---|---|---|
0 | GSE25001|ILMN_1661479 | NM_002304 | 28 | 0.042 | 50 |
1 | GSE25001|ILMN_1661479 | NM_001040167 | 28 | 0.042 | 50 |
2 | GSE25001|ILMN_1661479 | NM_001166355 | 28 | 0.042 | 50 |
3 | GSE25001|ILMN_1661479 | NM_017542 | 23 | 0.042 | 50 |
4 | GSE25001|ILMN_1661479 | NM_001962 | 24 | 0.012 | 50 |
# Calculate percentage of match
blast_df['percent.ident'] = 100 * blast_df['num.identical'] / blast_df['len']
blast_df[['qseqid', 'RefSeq', 'num.identical', 'len', 'percent.ident']].head()
qseqid | RefSeq | num.identical | len | percent.ident | |
---|---|---|---|---|---|
0 | GSE25001|ILMN_1661479 | NM_002304 | 28 | 50 | 56.0 |
1 | GSE25001|ILMN_1661479 | NM_001040167 | 28 | 50 | 56.0 |
2 | GSE25001|ILMN_1661479 | NM_001166355 | 28 | 50 | 56.0 |
3 | GSE25001|ILMN_1661479 | NM_017542 | 23 | 50 | 46.0 |
4 | GSE25001|ILMN_1661479 | NM_001962 | 24 | 50 | 48.0 |
# Filter by percentage
blast_df = blast_df[blast_df['percent.ident'] >= 80].copy()
blast_df.sort_values(by=['percent.ident'], axis=0, ascending=True, inplace=True, na_position='last')
blast_df[['qseqid', 'RefSeq', 'percent.ident']].head()
qseqid | RefSeq | percent.ident | |
---|---|---|---|
30471 | GSE25001|ILMN_1806032 | NR_026974 | 80.0 |
89517 | GSE25001|ILMN_1652464 | NR_003608 | 80.0 |
35794 | GSE25001|ILMN_1666652 | XR_109511 | 80.0 |
35777 | GSE25001|ILMN_1666652 | NR_026999 | 80.0 |
35769 | GSE25001|ILMN_1666652 | NM_001170423 | 80.0 |
# Save the result
blast_df.to_csv('./data/blastn_filtered.csv', header=True, index=False, sep=',')
We performed 5 screens and obtained z-scores for some of the genes from the previous example, and would like to extract the significant ones among them (present in at least 3 experiments, mean pvalue < 0.01).
Input: blastn_filtered.csv from the previous example, screens.csv mapping genes identified by RefSeq to z-scores from different experiments.
Goal: A table of significant genes (pvalue < 0.01) identified by qseqid, their pvalues and percent.ident.
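The significance step could be sketched as follows. Everything here is an assumption about the (unseen) screens.csv: one row per gene keyed by RefSeq, one z-score column per screen (the names z1..z5 are hypothetical), NaN where a gene was absent from a screen; two-sided p-values come from the standard normal via the complementary error function.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical stand-in for screens.csv (layout and column names assumed)
screens = pd.DataFrame({
    'RefSeq': ['NM_002304', 'NM_005416', 'NM_017542'],
    'z1': [4.1, 0.2, 3.9],
    'z2': [3.8, np.nan, 4.2],
    'z3': [4.4, 0.1, np.nan],
    'z4': [np.nan, np.nan, np.nan],
    'z5': [3.6, 0.3, np.nan],
})
z_cols = ['z1', 'z2', 'z3', 'z4', 'z5']

# two-sided p-value of a z-score under the standard normal, NaN-safe
def z_to_p(z):
    return math.erfc(abs(z) / math.sqrt(2)) if pd.notnull(z) else np.nan

pvals = screens[z_cols].apply(lambda col: col.map(z_to_p))
screens['n.experiments'] = pvals.notnull().sum(axis='columns')
screens['mean.pvalue'] = pvals.mean(axis='columns')  # mean skips NaN

# keep genes seen in >= 3 screens with mean p-value < 0.01
significant = screens[(screens['n.experiments'] >= 3)
                      & (screens['mean.pvalue'] < 0.01)]
print(significant['RefSeq'].tolist())  # ['NM_002304']
```

The remaining step would be an inner merge with blastn_filtered.csv on RefSeq to recover qseqid and percent.ident for each significant gene.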