Introduction to sequence alignment, Entrez database retrieval and curve fitting.

Download original file: 7_introduction_to_sequence_alignment_Entrez_and_curve_fitting.ipynb

View original file in nbviewer: 7_introduction_to_sequence_alignment_Entrez_and_curve_fitting.ipynb

from Bio import pairwise2

Task:

You have two sequences: seq1 = ‘ACGTCCTTCATT’ seq2 = ‘GTCTCATG’

You have a scoring scheme where a match gives you +1 a mismatch gives you 0 * gap opening costs −10

Find the best alignment of the two sequences.

You have a scoring scheme where: a match gives you +1 a mismatch gives you −1 * gap opening costs you −1

Find the best alignment of the two sequences.

HINT: look at the following documentation page for Biopython: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

seq1 = 'ACGTCCTTCATT'
seq2 =  'GTCTCATG'
pairwise2.align.localms(seq1, seq2, 1, 0, -10, -10000000)




[('ACGTCCTTCATT', '----GTCTCATG', 4, 7, 12),
 ('ACGTCCTTCATT', '----GTCTCATG', 4, 7, 11)]




pairwise2.align.localms(seq1, seq2, 1, -1, -1, -10000000)




[('ACGTCCTTCATT', '--GTC-T-CATG', 5, 2, 11),
 ('ACGTCCTTCATT', '--GT-CT-CATG', 5, 2, 11),
 ('ACGTCCTTCATT', '--GT-C-TCATG', 5, 2, 11)]

The format_alignment function

pairwise2 has a format_alignment function to better visualize the alignment output.

out_tuplesA = pairwise2.align.localms(seq1, seq2, 1, 0, -10, -10000000)
out_tuplesB = pairwise2.align.localms(seq1, seq2, 1, -1, -1, -1)

print('1, 0, -10, -inf:')
for ot in out_tuplesA:
    print(pairwise2.format_alignment(*ot))

print('\n1, -1, -1, -inf:')
for ot in out_tuplesB:
    print(pairwise2.format_alignment(*ot))

1, 0, -10, -inf:
ACGTCCTTCATT
       |||||
----GTCTCATG
  Score=4

ACGTCCTTCATT
       ||||
----GTCTCATG
  Score=4


1, -1, -1, -inf:
ACGTCCTTCATT
  |||||||||
--GTC-T-CATG
  Score=5

ACGTCCTTCATT
  |||||||||
--GT-CT-CATG
  Score=5

ACGTCCTTCATT
  |||||||||
--GT-C-TCATG
  Score=5

ACGTCCTTCATT
  |||||||||
--GTC--TCATG
  Score=5

Using the Entrez databases via Biopython

You can access all of the information in the Entrez databases via Biopython, although a new Python package, Bioservices, is probably easier and provides access to MANY different online databases. https://pythonhosted.org/bioservices/

from Bio import Entrez
Entrez.email = 'schryer@gmail.com'


handle = Entrez.einfo()
record = Entrez.read(handle)
record['DbList']

/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/Parser.py:525: UserWarning: Unable to load DTD file einfo.dtd.

Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez.
Though most of NCBI's DTD files are included in the Biopython distribution,
sometimes you may find that a particular DTD file is missing. While we can
access the DTD file through the internet, the parser is much faster if the
required DTD files are available locally.

For this purpose, please download einfo.dtd from

http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd

and save it either in directory

/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/DTDs

or in directory

/home/david/.biopython/Bio/Entrez/DTDs

in order for Bio.Entrez to find it.

Alternatively, you can save einfo.dtd in the directory
Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.

Please also inform the Biopython developers about this missing DTD, by
reporting a bug on https://github.com/biopython/biopython/issues or sign
up to our mailing list and emailing us, so that we can include it with the
next release of Biopython.

Proceeding to access the DTD file through the internet...

  warnings.warn(message)





['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 'genome', 'assembly', 'genomeprj', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'dbvar', 'epigenomics', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'toolkit', 'toolkitall', 'toolkitbook', 'unigene', 'gencoll', 'gtr']




handle = Entrez.einfo(db='pubmed')
record = Entrez.read(handle)
record['DbInfo']['Description']




'PubMed bibliographic record'




handle = Entrez.esearch(db='pubmed', term='persister')
record = Entrez.read(handle)
record['IdList']

/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/Parser.py:525: UserWarning: Unable to load DTD file esearch.dtd.

Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI Entrez.
Though most of NCBI's DTD files are included in the Biopython distribution,
sometimes you may find that a particular DTD file is missing. While we can
access the DTD file through the internet, the parser is much faster if the
required DTD files are available locally.

For this purpose, please download esearch.dtd from

http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd

and save it either in directory

/home/david/pyenvs/py3/lib/python3.4/site-packages/Bio/Entrez/DTDs

or in directory

/home/david/.biopython/Bio/Entrez/DTDs

in order for Bio.Entrez to find it.

Alternatively, you can save esearch.dtd in the directory
Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.

Please also inform the Biopython developers about this missing DTD, by
reporting a bug on https://github.com/biopython/biopython/issues or sign
up to our mailing list and emailing us, so that we can include it with the
next release of Biopython.

Proceeding to access the DTD file through the internet...

  warnings.warn(message)





['25304862', '25302955', '25283595', '25267859', '25254624', '25245998', '25203125', '25192989', '25144274', '25130648', '25120957', '25118250', '25102868', '25100240', '25049258', '25041421', '24987115', '24891106', '24885389', '24837402']




handle = Entrez.efetch(db='pubmed', id='25267859')
print(handle.read())


Pubmed-entry ::= {
  pmid 25267859,
  medent {
    em std {
      year 2014,
      month 9,
      day 30
    },
    cit {
      title {
        name "Controlling bacterial behavior with indole-containing natural
 products and derivatives."
      },
      authors {
        names std {
          {
            name ml "Melander RJ",
            affil str "Department of Chemistry, North Carolina State
 University, Raleigh, NC, 27695."
          },
          {
            name ml "Minvielle MJ",
            affil str "Department of Chemistry, North Carolina State
 University, Raleigh, NC, 27695."
          },
          {
            name ml "Melander C",
            affil str "Department of Chemistry, North Carolina State
 University, Raleigh, NC, 27695."
          }
        }
      },
      from journal {
        title {
          iso-jta "Tetrahedron",
          ml-jta "Tetrahedron",
          issn "0040-4020"
        },
        imp {
          date std {
            year 2014,
            month 9,
            day 16
          },
          volume "70",
          issue "37",
          pages "6363-6372",
          language "ENG",
          pubstatus ppublish,
          history {
            {
              pubstatus other,
              date std {
                year 2014,
                month 10,
                day 1,
                hour 6,
                minute 0
              }
            },
            {
              pubstatus pubmed,
              date std {
                year 2014,
                month 10,
                day 1,
                hour 6,
                minute 0
              }
            },
            {
              pubstatus medline,
              date std {
                year 2014,
                month 10,
                day 1,
                hour 6,
                minute 0
              }
            },
            {
              pubstatus other,
              date std {
                year 2015,
                month 9,
                day 16,
                hour 0,
                minute 0
              }
            }
          }
        }
      },
      ids {
        doi "10.1016/j.tet.2014.05.089",
        pubmed 25267859,
        other {
          db "pmc",
          tag str "PMC4175420"
        },
        other {
          db "mid",
          tag str "NIHMS601941"
        }
      }
    },
    abstract "Indole has recently been implicated as an important small
 molecule signal utilized by many bacteria to coordinate various forms of
 behavior. Indole plays a role in numerous bacterial processes, including:
 biofilm formation and maintenance, virulence factor production, antibiotic
 resistance and persister cell formation. Intercepting indole-signaling
 pathways with appropriately designed small molecules provides a n opportunity
 to control unwanted bacterial behaviors, and is an attractive anti-virulence
 therapeutic strategy. In this review, we give an overview of the process
 controlled by indole signaling, and summarize current efforts to design
 indole-containing small molecules to intercept these pathways, and detail the
 synthetic efforts towards accessing indole derived bioactive small molecules.",
    idnum {
      "R01 GM055769/GM/NIGMS NIH HHS"
    },
    pmid 25267859,
    pub-type {
      "JOURNAL ARTICLE"
    },
    status publisher
  }
}

Curve fitting

One common task for biologists is to fit experimental data to a function. For example, one may want to fit a 4 parameter logistic equation to enzyme-linked immunosorbent assay (ELISA) data. The usual formula for this model is:

$f(x) = \frac{A - D}{1 + \left(\frac{x}{C}\right)^B} + D$,

where $x$ is the concentration, $A$ is the minimum asymptote, $B$ is the steepness, $C$ is the inflection point and $D$ is the maximum asymptote.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

def logistic4(x, A, B, C, D):
    '''Four parameter logistic equation.

    $f(x) = \frac{A - D}{1 + \left(\frac{x}{C}\right)^B} + D$
    '''
    return ((A-D)/(1.0+((x/C)**B))) + D

def residuals(parameters, x, y):
    '''Deviations of data from logistic4'''
    A, B, C, D = parameters
    err = y - logistic4(x, A, B, C, D)
    return err

# Make up some data for fitting and add noise
# In practice, y_meas would be read in from a file
x_measured = np.linspace(0, 20, 12)
x_true = np.linspace(0, 20, 100)

# True parameter values
A, B, C, D = true_parameters = 0.5, 2.5, 8, 7.3
y_true = logistic4(x_true, *true_parameters)#A, B, C, D)

# Simulated measurements of data.
y_measured = logistic4(x_measured, *true_parameters) + 0.1 * np.random.randn(len(x_measured))

# Initial guess for parameters
initial_parameters = [0, 1, 1, 1]

# Fit equation using least squares optimization
leastsq_output = leastsq(residuals, initial_parameters, full_output=True, args=(x_measured, y_measured))

# Plot results
plt.plot(x_true, logistic4(x_true, *leastsq_output[0]), 'r--')
plt.plot(x_measured, y_measured, 'bo')
plt.plot(x_true, y_true, 'g-')
plt.title('Least squares fit to logistic equation with noisy data')
plt.legend(['Fit', 'Data', 'True'], loc='upper left')

plt.text(10, 2.5, 'Parameter   True   Estimated')
for index, estimated_value in enumerate(leastsq_output[0]):
    param_key = 'ABCD'[index]
    true_value = true_parameters[index]
    plot_text = '{:>10} {:>12.2f} {:>10.2f}'.format(param_key, true_value, estimated_value)
    plt.text(10, 2-index*0.5, plot_text)

plt.show()

png

import pandas as pd


filename = '../../data/python_excel.xlsx'


df = pd.read_excel(filename)


df.columns
df['weight']
df.hist(column='weight', bins=20)




array([[<matplotlib.axes.AxesSubplot object at 0x7f6549a044e0>]], dtype=object)

png

df['weight_squared'] = df['weight'] ** 2


s = pd.Series([1,2,3])


df['random_series'] = s


del df['random_series']


for column_name in df.columns:
    if column_name not in ['year', 'weight', 'weight_squared']:
        del df[column_name]


df.to_excel('small_table.xls', 'Year Weight Weight squared')


ls

1_reading_python_code.ipynb
2_fundamental_python_types.ipynb
3_python_strings_as_containers.ipynb
4_python_keywords.ipynb
5_elementwise_and_vector_operations.ipynb
6_functions_with_tests.ipynb
7_introduction_to_sequence_alignment_Entrez_and_curve_fitting.ipynb
8_questions_from_unit_6.ipynb
python_PART_4.ipynb
python_PART_5.ipynb
python_PART_6.ipynb
sequence_data.tsv
small_table.xls
X_.ipynb

Introduction to sequence alignment, Entrez database retrieval and curve fitting.

Task:

The format_alignment function

Using the Entrez databases via Biopython

Curve fitting

Categories

Tags