Responses to student questions from unit 6.
Answers to questions raised in unit 7
Going through strange code to try and understand it is the very best way of learning new tricks, but often raises questions. This notebook tries to address the questions raised last week in unit 7.
Question 1:
I don’t understand this line:
return set(sequence.upper()) - set(alphabet.letters.upper()) == set()
Response 1:
To understand code, one should separate it into discrete chunks that are interpreted in order. In this case there are six separate elements that occur in one line:
Element 1:
The .upper string method returns the sequence in upper case:
sequence = 'abcd'
sequence.upper()
'ABCD'
Element 2:
This element takes the letters attribute from the alphabet and returns them in upper case:
from Bio.Alphabet import IUPAC
alphabet = IUPAC.ambiguous_rna
alphabet.letters.upper()
'GAUCRYWSMKHBVDN'
Element 3:
The set function takes a sequence or iterable and turns it into a set:
iterable = 'A string is an iterable.'
set(iterable)
{' ', '.', 'A', 'a', 'b', 'e', 'g', 'i', 'l', 'n', 'r', 's', 't'}
Note that a set contains only unique members and that their order does not matter.
Element 4:
If A is a set and B is a set, we can compute the difference between the sets. The difference between sets means we find all members in A that do not appear in B. The difference can be computed in two ways:
A = set('Any iterable')
B = set('iterable B')
# This method is preferred when it is not clear what A and B are
D = A.difference(B)
# This method is used here because it is more readable in this case:
S = A - B
S == D
True
Element 5:
This evaluates to True if A has the same members as B:
A == B
False
Element 6:
The final element returns the result of element 5 when the function is called:
return result
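Putting the six elements together, a minimal sketch of such a check might look like the following. The function name is_valid_sequence and its letters parameter are hypothetical; the allowed letters are passed in as a plain string (the value shown for IUPAC.ambiguous_rna in element 2) so that the sketch does not depend on Biopython.

```python
def is_valid_sequence(sequence, letters):
    """Return True if every character of sequence occurs in letters."""
    # Elements 1 to 3: upper-case both strings and turn them into sets.
    found = set(sequence.upper())
    allowed = set(letters.upper())
    # Element 4: members of found that do not appear in allowed.
    not_allowed = found - allowed
    # Elements 5 and 6: an empty difference means the sequence is valid.
    return not_allowed == set()


# Hypothetical constant, copied from the output of element 2.
RNA_LETTERS = 'GAUCRYWSMKHBVDN'

print(is_valid_sequence('gauc', RNA_LETTERS))     # True
print(is_valid_sequence('GATTACA', RNA_LETTERS))  # False, 'T' is not an RNA letter
```

Splitting the one-liner into named intermediate values makes it easier to print not_allowed and see which letters caused a failure.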
Question 2:
Why do we use methods A and B (and C)? Sometimes it seems that one method would be sufficient.
Response 2:
I included all answers provided by all learners as different methods. In this way people can compare the code for different methods so that they can learn different ways of accomplishing the same thing. The test methods prove that all methods are equivalent.
Question 3:
I have trouble understanding the built-in function enumerate().
Response 3:
Here is the documentation for the enumerate function:
https://docs.python.org/3/library/functions.html#enumerate
There are a number of quirks to this function:
The function enumerate returns an iterator that behaves like a generator
enumerate is equivalent to:
def enumerate(sequence, start=0):
    n = start
    for elem in sequence:
        yield n, elem
        n += 1
a = enumerate('ABCD')
The equivalent function above uses the yield keyword to create a generator.
https://docs.python.org/3/reference/expressions.html#yield-expressions
A generator is an object we can iterate over exactly one time. This contrasts with a list or other container that can be iterated over as many times as one desires. Consider the following example:
# Note that round brackets are used.
my_generator = (x*x for x in range(3))
# Note that square brackets are used.
my_list = [x*x for x in range(3)]
for item in my_generator:
    print('Generator item:', item)
for item in my_list:
    print('List item:', item)
for item in my_generator:
    print('This should never be printed out.')
for item in my_list:
    print('Second time through list:', item)
Generator item: 0
Generator item: 1
Generator item: 4
List item: 0
List item: 1
List item: 4
Second time through list: 0
Second time through list: 1
Second time through list: 4
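The same single-pass behaviour applies to the object returned by the built-in enumerate. A short sketch, with illustrative variable names:

```python
numbered = enumerate('AB')

# The first pass consumes every (counter, item) pair.
first_pass = list(numbered)
# The second pass finds nothing left to iterate over.
second_pass = list(numbered)

print(first_pass)   # [(0, 'A'), (1, 'B')]
print(second_pass)  # []
```

If the pairs are needed more than once, store them in a list as first_pass does here, or call enumerate again.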
The enumerate function returns a two-tuple
The enumerate function returns items from an iterable one at a time, together with the value of a counter that starts at start. If the default start=0 is used, the value of the counter is the same as the index used to access items from this iterable, if that iterable supports retrieval by index.
list(enumerate('ABCD', start=99))
[(99, 'A'), (100, 'B'), (101, 'C'), (102, 'D')]
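In a for loop, each two-tuple is usually unpacked straight into two names. The sketch below (variable names are illustrative) checks that the counter equals the index when the default start=0 is used, and shows that this no longer holds with start=1:

```python
word = 'ABCD'

# Default start=0: the counter can be used as an index into word.
matches = []
for index, letter in enumerate(word):
    matches.append(word[index] == letter)
print(matches)  # [True, True, True, True]

# With start=1 the counter is one ahead of the index.
for number, letter in enumerate(word, start=1):
    print(number, letter, word[number - 1] == letter)
```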
Question 4:
The following code is difficult to understand because the assert keyword uses two parameters and the loop uses the enumerate() function:
new_seq = 'AAAABBBBCCCC'
# Has only one base that is different from new_seq.
# In this case the assert statement is passed silently.
random_sequence = 'AAAABBBBCCDC'
# Has more than one base that is different from new_seq.
# In this case the assert is not passed.
#random_sequence = 'ADAABBBBCCDC'
N_differences = 0
for index, base in enumerate(new_seq):
    if base != random_sequence[index]:
        N_differences += 1
assert N_differences == 1, N_differences
Response 4:
Assuming we now understand the enumerate function from response 3, we can focus on only two elements of this code.
Element 1: The counting loop
Let us rewrite this a little bit to help us understand what is going on:
new_seq = 'AAAABBBBCCCC'
random_sequence = 'AAAABBBBCCDC'
N_differences = 0
for index, base in enumerate(new_seq):
    other_base = random_sequence[index]
    print('Comparing original index {} (base {})'.format(index, base) +
          ' to random index {} (base {}).'.format(index, other_base))
    if base != other_base:
        print(' Bases are different. Adding 1 to N_differences')
        N_differences += 1
print('N_differences =', N_differences)
Comparing original index 0 (base A) to random index 0 (base A).
Comparing original index 1 (base A) to random index 1 (base A).
Comparing original index 2 (base A) to random index 2 (base A).
Comparing original index 3 (base A) to random index 3 (base A).
Comparing original index 4 (base B) to random index 4 (base B).
Comparing original index 5 (base B) to random index 5 (base B).
Comparing original index 6 (base B) to random index 6 (base B).
Comparing original index 7 (base B) to random index 7 (base B).
Comparing original index 8 (base C) to random index 8 (base C).
Comparing original index 9 (base C) to random index 9 (base C).
Comparing original index 10 (base C) to random index 10 (base D).
Bases are different. Adding 1 to N_differences
Comparing original index 11 (base C) to random index 11 (base C).
N_differences = 1
This use of enumerate allows us to retrieve the base from random_sequence that is in the same position as the current base in new_seq.
If these bases are different, we add 1 to the total number of differences (N_differences).
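For comparison, the same count can be written without indexing by pairing the two sequences with the built-in zip. This is an alternative sketch, not the code from the unit; the function name count_differences is hypothetical.

```python
def count_differences(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    # zip stops at the shorter sequence, so check the lengths first.
    if len(seq_a) != len(seq_b):
        raise ValueError('Sequences must have the same length.')
    # Each comparison is True (counts as 1) or False (counts as 0).
    return sum(base_a != base_b for base_a, base_b in zip(seq_a, seq_b))


print(count_differences('AAAABBBBCCCC', 'AAAABBBBCCDC'))  # 1
print(count_differences('AAAABBBBCCCC', 'ADAABBBBCCDC'))  # 2
```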
Element 2: The assert statement
https://docs.python.org/3/reference/simple_stmts.html#grammar-token-assert_stmt
The assert statement can be given either one or two expressions, as follows:
expression1 = 2 == 1+1
expression2 = 1*4/(3-6**4)%99
assert expression1
assert expression1, expression2
assert(expression1)
assert(expression1), expression2
Note that the second expression can be wrapped in parentheses, or can be a tuple of other expressions:
assert expression1, (expression2)
assert(expression1), (expression2)
assert(expression1), (expression1, expression2, 'Or more...')
If only one expression is given to an assert statement and this evaluates to False, an AssertionError is raised without any arguments:
if expression1:
    raise AssertionError
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-43-3d79098c5148> in <module>()
1 if expression1:
----> 2 raise AssertionError
AssertionError:
If two expressions are given to an assert statement, an AssertionError is raised with expression2 as the argument:
if True:
    raise AssertionError(expression2)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-42-83355e252f98> in <module>()
1 if True:
----> 2 raise AssertionError(expression2)
AssertionError: 98.9969064191802
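To see the second expression arrive as the argument of the exception, a failing assert can be wrapped in try/except. This is a small sketch with illustrative variable names; note that assert statements are skipped entirely when Python is run with the -O option.

```python
N_differences = 2

try:
    # The first expression is False, so the second one becomes
    # the argument of the AssertionError.
    assert N_differences == 1, N_differences
except AssertionError as error:
    caught = error.args
    print('AssertionError raised with arguments:', caught)
```

Here caught is the tuple (2,), which is why the traceback for the code in question 4 would end with the number of differences that was found.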
Question 5:
This part is also difficult to understand:
def test_sequence_content_fraction():
    f_A = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='A')
    f_B = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='B')
    f_C = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='C')
    # This is how you should check if two floats have the same value
    # (the values must be close, but not exactly the same)
    assert abs(f_A - f_B) < 1e-12
    assert abs(f_A - f_C) < 1e-12
    print('Test OK.')
test_sequence_content_fraction()
Response 5:
# We need a dummy function to explain this code:
def sequence_content_fraction(*args, **kwds):
    import random
    almost_the_same = 0.654 + 1e-14 * random.random()
    print('Returning sequence_content_fraction:', almost_the_same)
    return almost_the_same
def test_sequence_content_fraction():
    f_A = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='A')
    f_B = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='B')
    f_C = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='C')
    # This is how you should check if two floats have the same value
    # (the values must be close, but not exactly the same)
    assert abs(f_A - f_B) < 1e-12
    assert abs(f_A - f_C) < 1e-12
    print('Test OK.')
test_sequence_content_fraction()
Returning sequence_content_fraction: 0.6540000000000047
Returning sequence_content_fraction: 0.6540000000000042
Returning sequence_content_fraction: 0.6540000000000017
Test OK.
This code runs the function sequence_content_fraction three times and stores the results in f_A, f_B, and f_C.
The assert statements answer the question “Are these numbers equal to within a small tolerance (here 1e-12)?” using the following expressions:
f_A = float('0.6540000000000082') # Your numbers may be different.
f_B = float('0.6540000000000096')
f_C = float('0.6540000000000085')
expressionA = abs(f_A - f_B) < 1e-12
expressionB = abs(f_A - f_C) < 1e-12
The abs function is required to avoid the following error in logic:
f_A = 0.25
f_B = 500000
# This evaluates to True even though the numbers are not close.
print(f_A - f_B < 1e-12)
# This correctly evaluates to False (numbers are not close.)
print(abs(f_A - f_B) < 1e-12)
True
False
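The standard library also offers math.isclose for this kind of comparison. The sketch below is an alternative to the abs expression used in the test, not the code from the unit; rel_tol=0.0 is passed so that only the absolute tolerance of 1e-12 is applied, as in the assert statements above.

```python
import math

f_A = 0.6540000000000082
f_B = 0.6540000000000096

# Same check as abs(f_A - f_B) < 1e-12, apart from isclose using <=.
close_values = math.isclose(f_A, f_B, rel_tol=0.0, abs_tol=1e-12)
far_values = math.isclose(0.25, 500000, rel_tol=0.0, abs_tol=1e-12)
print(close_values)  # True
print(far_values)    # False

# Why exact comparison of floats is unreliable:
exact = (0.1 + 0.2 == 0.3)
approximate = math.isclose(0.1 + 0.2, 0.3)
print(exact)        # False
print(approximate)  # True
```

Because math.isclose is symmetric in its two arguments, the mistake of forgetting abs cannot occur.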