Introduction to sequence alignment, Entrez database retrieval and curve fitting.
Download original file: 7_introduction_to_sequence_alignment_Entrez_and_curve_fitting.ipynb
View original file in nbviewer: 7_introduction_to_sequence_alignment_Entrez_and_curve_fitting.ipynb
from Bio import pairwise2
Task:
You have two sequences: seq1 = ‘ACGTCCTTCATT’ seq2 = ‘GTCTCATG’
You have a scoring scheme where a match gives you +1 a mismatch gives you 0 * gap opening costs −10
Find the best alignment of the two sequences.
You have a scoring scheme where: a match gives you +1 a mismatch gives you −1 * gap opening costs you −1
Find the best alignment of the two sequences.
HINT: look at the following documentation page for Biopython: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
seq1 = 'ACGTCCTTCATT'
seq2 = 'GTCTCATG'
pairwise2.align.localms(seq1, seq2, 1, 0, -10, -10000000)
[('ACGTCCTTCATT', '----GTCTCATG', 4, 7, 12),
('ACGTCCTTCATT', '----GTCTCATG', 4, 7, 11)]
pairwise2.align.localms(seq1, seq2, 1, -1, -1, -10000000)
[('ACGTCCTTCATT', '--GTC-T-CATG', 5, 2, 11),
('ACGTCCTTCATT', '--GT-CT-CATG', 5, 2, 11),
('ACGTCCTTCATT', '--GT-C-TCATG', 5, 2, 11)]
The format_alignment function
pairwise2 has a format_alignment function to better visualize the alignment output.
out_tuplesA = pairwise2.align.localms(seq1, seq2, 1, 0, -10, -10000000)
out_tuplesB = pairwise2.align.localms(seq1, seq2, 1, -1, -1, -1)
print('1, 0, -10, -inf:')
for ot in out_tuplesA:
print(pairwise2.format_alignment(*ot))
print('\n1, -1, -1, -inf:')
for ot in out_tuplesB:
print(pairwise2.format_alignment(*ot))
1, 0, -10, -inf:
ACGTCCTTCATT
|||||
----GTCTCATG
Score=4
ACGTCCTTCATT
||||
----GTCTCATG
Score=4
1, -1, -1, -inf:
ACGTCCTTCATT
|||||||||
--GTC-T-CATG
Score=5
ACGTCCTTCATT
|||||||||
--GT-CT-CATG
Score=5
ACGTCCTTCATT
|||||||||
--GT-C-TCATG
Score=5
ACGTCCTTCATT
|||||||||
--GTC--TCATG
Score=5
Using the Entrez databases via Biopython
You can access all of the information in the Entrez databases via Biopython, although a new Python package, Bioservices, is probably easier and provides access to MANY different online databases. https://pythonhosted.org/bioservices/
from Bio import Entrez
Entrez.email = 'schryer@gmail.com'
handle = Entrez.einfo()
record = Entrez.read(handle)
record['DbList']
/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/Parser.py:525: UserWarning: Unable to load DTD file einfo.dtd.
Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez.
Though most of NCBI's DTD files are included in the Biopython distribution,
sometimes you may find that a particular DTD file is missing. While we can
access the DTD file through the internet, the parser is much faster if the
required DTD files are available locally.
For this purpose, please download einfo.dtd from
http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd
and save it either in directory
/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/DTDs
or in directory
/home/david/.biopython/Bio/Entrez/DTDs
in order for Bio.Entrez to find it.
Alternatively, you can save einfo.dtd in the directory
Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.
Please also inform the Biopython developers about this missing DTD, by
reporting a bug on https://github.com/biopython/biopython/issues or sign
up to our mailing list and emailing us, so that we can include it with the
next release of Biopython.
Proceeding to access the DTD file through the internet...
warnings.warn(message)
['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 'genome', 'assembly', 'genomeprj', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'dbvar', 'epigenomics', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'toolkit', 'toolkitall', 'toolkitbook', 'unigene', 'gencoll', 'gtr']
handle = Entrez.einfo(db='pubmed')
record = Entrez.read(handle)
record['DbInfo']['Description']
'PubMed bibliographic record'
handle = Entrez.esearch(db='pubmed', term='persister')
record = Entrez.read(handle)
record['IdList']
/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/Parser.py:525: UserWarning: Unable to load DTD file esearch.dtd.
Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez.
Though most of NCBI's DTD files are included in the Biopython distribution,
sometimes you may find that a particular DTD file is missing. While we can
access the DTD file through the internet, the parser is much faster if the
required DTD files are available locally.
For this purpose, please download esearch.dtd from
http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd
and save it either in directory
/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/DTDs
or in directory
/home/david/.biopython/Bio/Entrez/DTDs
in order for Bio.Entrez to find it.
Alternatively, you can save esearch.dtd in the directory
Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.
Please also inform the Biopython developers about this missing DTD, by
reporting a bug on https://github.com/biopython/biopython/issues or sign
up to our mailing list and emailing us, so that we can include it with the
next release of Biopython.
Proceeding to access the DTD file through the internet...
warnings.warn(message)
['25304862', '25302955', '25283595', '25267859', '25254624', '25245998', '25203125', '25192989', '25144274', '25130648', '25120957', '25118250', '25102868', '25100240', '25049258', '25041421', '24987115', '24891106', '24885389', '24837402']
handle = Entrez.efetch(db='pubmed', id='25267859')
print(handle.read())
Pubmed-entry ::= {
pmid 25267859,
medent {
em std {
year 2014,
month 9,
day 30
},
cit {
title {
name "Controlling bacterial behavior with indole-containing natural
products and derivatives."
},
authors {
names std {
{
name ml "Melander RJ",
affil str "Department of Chemistry, North Carolina State
University, Raleigh, NC, 27695."
},
{
name ml "Minvielle MJ",
affil str "Department of Chemistry, North Carolina State
University, Raleigh, NC, 27695."
},
{
name ml "Melander C",
affil str "Department of Chemistry, North Carolina State
University, Raleigh, NC, 27695."
}
}
},
from journal {
title {
iso-jta "Tetrahedron",
ml-jta "Tetrahedron",
issn "0040-4020"
},
imp {
date std {
year 2014,
month 9,
day 16
},
volume "70",
issue "37",
pages "6363-6372",
language "ENG",
pubstatus ppublish,
history {
{
pubstatus other,
date std {
year 2014,
month 10,
day 1,
hour 6,
minute 0
}
},
{
pubstatus pubmed,
date std {
year 2014,
month 10,
day 1,
hour 6,
minute 0
}
},
{
pubstatus medline,
date std {
year 2014,
month 10,
day 1,
hour 6,
minute 0
}
},
{
pubstatus other,
date std {
year 2015,
month 9,
day 16,
hour 0,
minute 0
}
}
}
}
},
ids {
doi "10.1016/j.tet.2014.05.089",
pubmed 25267859,
other {
db "pmc",
tag str "PMC4175420"
},
other {
db "mid",
tag str "NIHMS601941"
}
}
},
abstract "Indole has recently been implicated as an important small
molecule signal utilized by many bacteria to coordinate various forms of
behavior. Indole plays a role in numerous bacterial processes, including:
biofilm formation and maintenance, virulence factor production, antibiotic
resistance and persister cell formation. Intercepting indole-signaling
pathways with appropriately designed small molecules provides a n opportunity
to control unwanted bacterial behaviors, and is an attractive anti-virulence
therapeutic strategy. In this review, we give an overview of the process
controlled by indole signaling, and summarize current efforts to design
indole-containing small molecules to intercept these pathways, and detail the
synthetic efforts towards accessing indole derived bioactive small molecules.",
idnum {
"R01 GM055769/GM/NIGMS NIH HHS"
},
pmid 25267859,
pub-type {
"JOURNAL ARTICLE"
},
status publisher
}
}
Curve fitting
One common task for biologists is to fit experimental data to a function. For example, one may want to fit a 4 parameter logistic equation to enzyme-linked immunosorbent assay (ELISA) data. The usual formula for this model is:
where
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq
def logistic4(x, A, B, C, D):
'''Four parameter logistic equation.
$f(x) = \frac{A - D}{1 + \left(\frac{x}{C}\right)^B} + D$
'''
return ((A-D)/(1.0+((x/C)**B))) + D
def residuals(parameters, x, y):
'''Deviations of data from logistic4'''
A, B, C, D = parameters
err = y - logistic4(x, A, B, C, D)
return err
# Make up some data for fitting and add noise
# In practice, y_meas would be read in from a file
x_measured = np.linspace(0, 20, 12)
x_true = np.linspace(0, 20, 100)
# True parameter values
A, B, C, D = true_parameters = 0.5, 2.5, 8, 7.3
y_true = logistic4(x_true, *true_parameters)#A, B, C, D)
# Simulated measurements of data.
y_measured = logistic4(x_measured, *true_parameters) + 0.1 * np.random.randn(len(x_measured))
# Initial guess for parameters
initial_parameters = [0, 1, 1, 1]
# Fit equation using least squares optimization
leastsq_output = leastsq(residuals, initial_parameters, full_output=True, args=(x_measured, y_measured))
# Plot results
plt.plot(x_true, logistic4(x_true, *leastsq_output[0]), 'r--')
plt.plot(x_measured, y_measured, 'bo')
plt.plot(x_true, y_true, 'g-')
plt.title('Least squares fit to logistic equation with noisy data')
plt.legend(['Fit', 'Data', 'True'], loc='upper left')
plt.text(10, 2.5, 'Parameter True Estimated')
for index, estimated_value in enumerate(leastsq_output[0]):
param_key = 'ABCD'[index]
true_value = true_parameters[index]
plot_text = '{:>10} {:>12.2f} {:>10.2f}'.format(param_key, true_value, estimated_value)
plt.text(10, 2-index*0.5, plot_text)
plt.show()
import pandas as pd
filename = '../../data/python_excel.xlsx'
df = pd.read_excel(filename)
df.columns
df['weight']
df.hist(column='weight', bins=20)
array([[<matplotlib.axes.AxesSubplot object at 0x7f6549a044e0>]], dtype=object)
df['weight_squared'] = df['weight'] ** 2
s = pd.Series([1,2,3])
df['random_series'] = s
del df['random_series']
for column_name in df.columns:
if column_name not in ['year', 'weight', 'weight_squared']:
del df[column_name]
df.to_excel('small_table.xls', 'Year Weight Weight squared')
ls
1_reading_python_code.ipynb
2_fundamental_python_types.ipynb
3_python_strings_as_containers.ipynb
4_python_keywords.ipynb
5_elementwise_and_vector_operations.ipynb
6_functions_with_tests.ipynb
7_introduction_to_sequence_alignment_Entrez_and_curve_fitting.ipynb
8_questions_from_unit_6.ipynb
python_PART_4.ipynb
python_PART_5.ipynb
python_PART_6.ipynb
sequence_data.tsv
small_table.xls
X_.ipynb