Responses to student questions from unit 6.

Answers to questions raised in unit 6

Working through unfamiliar code to try and understand it is one of the best ways of learning new tricks, but it often raises questions. This notebook tries to address the questions raised last week in unit 6.

Question 1:

I don’t understand this line:

return set(sequence.upper()) - set(alphabet.letters.upper()) == set()

Response 1:

To understand code, one should separate it into discrete chunks that are interpreted in order. In this case there are six separate elements that occur in one line:

Element 1:

The .upper string method returns the sequence in upper case:

sequence = 'abcd'
sequence.upper()




'ABCD'

Element 2:

This element takes the letters attribute from the alphabet and returns them in upper case:

from Bio.Alphabet import IUPAC
alphabet = IUPAC.ambiguous_rna
alphabet.letters.upper()




'GAUCRYWSMKHBVDN'

Element 3:

The set function takes a sequence or iterable and turns it into a set:

iterable = 'A string is an iterable.'
set(iterable)




{' ', '.', 'A', 'a', 'b', 'e', 'g', 'i', 'l', 'n', 'r', 's', 't'}

Note that a set only contains unique members and the order does not matter.

Element 4:

If A is a set and B is a set we can compute the difference between sets. The difference between sets means we find all members in A that do not appear in B. The difference can be computed in two ways:

A = set('Any iterable')
B = set('iterable B')

# This method is preferred when it is not clear what A and B are
D = A.difference(B)

# This method is used here because it is more readable in this case:
S = A - B
S == D




True
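
Two properties of the set difference are easy to miss, so here is a minimal sketch using made-up sets. It shows that the order of the operands matters, and that the .difference method accepts any iterable while the - operator needs a set on both sides. The variable names are only for illustration.

```python
A = set('ABCD')
B = set('CDEF')

# The difference is not symmetric: A - B and B - A are different sets.
a_minus_b = A - B
b_minus_a = B - A
print(sorted(a_minus_b))  # ['A', 'B']
print(sorted(b_minus_a))  # ['E', 'F']

# The method form accepts any iterable, for example a plain string.
from_string = A.difference('CDEF')
print(from_string == a_minus_b)  # True

# The operator form only works when both sides are sets.
try:
    A - 'CDEF'
    operator_failed = False
except TypeError:
    operator_failed = True
print(operator_failed)  # True
```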

Element 5:

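This evaluates to True if A has the same members as B. In the line from the question, B is the empty set, set(), so the comparison is True only when the difference from element 4 has no members: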

A == B




False

Element 6:

The final element returns the result of element 5 when the function is called:

return result
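
Putting the six elements back together, the line from the question can be written out step by step. This is only a sketch: the function name is_valid and the constant RNA_LETTERS are hypothetical, and a plain string stands in for alphabet.letters so the example runs without Biopython.

```python
# Hypothetical stand-in for alphabet.letters (the ambiguous RNA letters shown above).
RNA_LETTERS = 'GAUCRYWSMKHBVDN'

def is_valid(sequence, letters=RNA_LETTERS):
    # Elements 1 and 3: upper-case the sequence and turn it into a set.
    sequence_letters = set(sequence.upper())
    # Elements 2 and 3: upper-case the allowed letters and turn them into a set.
    allowed_letters = set(letters.upper())
    # Element 4: letters in the sequence that are not allowed.
    leftover = sequence_letters - allowed_letters
    # Element 5: the sequence is valid when nothing is left over.
    result = leftover == set()
    # Element 6: hand the result back to the caller.
    return result

print(is_valid('gauc'))  # True
print(is_valid('GATC'))  # False, because 'T' is not an RNA letter
```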

Question 2:

Why do we use method A and B (and C)? Sometimes it seems that only one method is sufficient to use.

Response 2:

I included all answers provided by all learners as different methods. In this way people can compare the code for different methods so that they can learn different ways of accomplishing the same thing. The test functions check that all methods give the same result.
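
As a small illustration of this idea, here is a sketch with a hypothetical function count_base (not one of the course functions) that solves the same task in two ways, plus a test that checks both methods agree.

```python
def count_base(sequence, base, method='A'):
    """Count how often base occurs in sequence (hypothetical example)."""
    if method == 'A':
        # Method A: let the string do the counting.
        return sequence.count(base)
    if method == 'B':
        # Method B: count by hand in a loop.
        total = 0
        for letter in sequence:
            if letter == base:
                total += 1
        return total
    raise ValueError('Unknown method: {}'.format(method))

def test_count_base():
    n_A = count_base('AAABBBCCC', 'B', method='A')
    n_B = count_base('AAABBBCCC', 'B', method='B')
    assert n_A == n_B, (n_A, n_B)
    print('Test OK.')

test_count_base()
```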

Question 3:

I have trouble understanding the built-in function enumerate().

Response 3:

Here is the documentation for the enumerate function:

https://docs.python.org/3/library/functions.html#enumerate

There are a number of quirks to this function:

The function enumerate returns a generator

enumerate is equivalent to:

def enumerate(sequence, start=0):
    n = start
    for elem in sequence:
        yield n, elem
        n += 1


a = enumerate('ABCD')

The equivalent definition above uses the yield keyword, which makes it a generator function. The built-in enumerate returns an iterator that behaves in the same way.

https://docs.python.org/3/reference/expressions.html#yield-expressions

A generator is an object we can iterate over exactly one time. This contrasts with a list or other container that can be iterated over as many times as one desires. Consider the following example:

# Note that round brackets are used.
my_generator = (x*x for x in range(3))

# Note that square brackets are used.
my_list = [x*x for x in range(3)]

for item in my_generator:
    print('Generator item:', item)

for item in my_list:
    print('List item:', item)

for item in my_generator:
    print('This should never be printed out.')

for item in my_list:
    print('Second time through list:', item)

Generator item: 0
Generator item: 1
Generator item: 4
List item: 0
List item: 1
List item: 4
Second time through list: 0
Second time through list: 1
Second time through list: 4
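
The same one-pass behaviour applies to the object returned by the built-in enumerate. A minimal sketch (the names first_pass and second_pass are only for illustration):

```python
a = enumerate('ABCD')

# The first pass consumes every item.
first_pass = list(a)

# The second pass finds nothing left.
second_pass = list(a)

print(first_pass)   # [(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D')]
print(second_pass)  # []
```

If the pairs are needed more than once, store them in a list first or call enumerate again.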

The enumerate function yields two-tuples

The enumerate function returns items from an iterable one at a time together with the value of a counter that starts at start. If the default start=0 is used, the value of the counter is the same as the index used to access items from this iterable, if that iterable supports retrieval by index.

list(enumerate('ABCD', start=99))




[(99, 'A'), (100, 'B'), (101, 'C'), (102, 'D')]
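
Because each item is a two-tuple, a for loop can unpack it into two names straight away. The sketch below (with a made-up string called word) first unpacks by hand and then lets the for statement do it. It also confirms that with the default start=0 the counter matches the index.

```python
word = 'ABCD'

# Unpacking by hand: each item is a (counter, element) tuple.
by_hand = []
for pair in enumerate(word):
    index, letter = pair
    by_hand.append((index, letter))

# Unpacking directly in the for statement gives the same pairs.
direct = []
for index, letter in enumerate(word):
    # With start=0 the counter is the index of the letter.
    assert word[index] == letter
    direct.append((index, letter))

print(by_hand == direct)  # True
```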

Question 4:

The following code is difficult to understand because the assert keyword uses two parameters and the loop uses the enumerate() function.

new_seq = 'AAAABBBBCCCC'

# Has only one base that is different from new_seq.
# In this case the assert statement is passed silently.
random_sequence = 'AAAABBBBCCDC'

# Has more than one base that is different from new_seq.
# In this case the assert is not passed.
#random_sequence = 'ADAABBBBCCDC'

N_differences = 0
for index, base in enumerate(new_seq):
    if base != random_sequence[index]:
        N_differences += 1

assert N_differences == 1, N_differences

Response 4:

Assuming we now understand the enumerate function from response 3, we can focus on only two elements of this code.

Element 1: The counting loop

Let us rewrite this a little bit to help us understand what is going on:

new_seq = 'AAAABBBBCCCC'
random_sequence = 'AAAABBBBCCDC'

N_differences = 0
for index, base in enumerate(new_seq):
    other_base = random_sequence[index]

    print('Comparing original index {} (base {})'.format(index, base) + 
          ' to random index {} (base {}).'.format(index, other_base))

    if base != other_base:

        print('   Bases are different. Adding 1 to N_differences')
        N_differences += 1

print('N_differences =', N_differences)

Comparing original index 0 (base A) to random index 0 (base A).
Comparing original index 1 (base A) to random index 1 (base A).
Comparing original index 2 (base A) to random index 2 (base A).
Comparing original index 3 (base A) to random index 3 (base A).
Comparing original index 4 (base B) to random index 4 (base B).
Comparing original index 5 (base B) to random index 5 (base B).
Comparing original index 6 (base B) to random index 6 (base B).
Comparing original index 7 (base B) to random index 7 (base B).
Comparing original index 8 (base C) to random index 8 (base C).
Comparing original index 9 (base C) to random index 9 (base C).
Comparing original index 10 (base C) to random index 10 (base D).
   Bases are different. Adding 1 to N_differences
Comparing original index 11 (base C) to random index 11 (base C).
N_differences = 1

This use of enumerate allows us to retrieve the base from random_sequence that is in the same position as the current base in new_seq.

If these bases are different we add 1 to the total number of differences (N_differences).
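
For comparison, the same count can be obtained by pairing the two sequences with zip instead of indexing. This is an alternative sketch, not the method used in the course code. The names n_enumerate and n_zip are only for illustration.

```python
new_seq = 'AAAABBBBCCCC'
random_sequence = 'AAAABBBBCCDC'

# The enumerate version from above.
n_enumerate = 0
for index, base in enumerate(new_seq):
    if base != random_sequence[index]:
        n_enumerate += 1

# zip pairs up the bases at the same position in both sequences.
# True counts as 1 and False as 0 when summed.
n_zip = sum(base != other_base
            for base, other_base in zip(new_seq, random_sequence))

print(n_enumerate, n_zip)  # 1 1
```

One difference to keep in mind: zip silently stops at the end of the shorter sequence, while the indexing version raises an IndexError if random_sequence is shorter than new_seq.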

Element 2: The assert statement

https://docs.python.org/3/reference/simple_stmts.html#grammar-token-assert_stmt

The assert statement can be given either one or two expressions as such:

expression1 = 2 == 1+1
expression2 = 1*4/(3-6**4)%99

assert expression1
assert expression1, expression2 
assert(expression1)
assert(expression1), expression2

Note that the second expression can be wrapped in parentheses, and it can also be a tuple of other expressions:

assert expression1, (expression2)
assert(expression1), (expression2)
assert(expression1), (expression1, expression2, 'Or more...')

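If only one expression is given to an assert statement and this evaluates to False, an AssertionError is raised without any arguments. The cell below raises the same error by hand so that we can see what it looks like. Here expression1 is True, so the if branch runs; a real assert raises only when its expression is False: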

if expression1:
    raise AssertionError


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

<ipython-input-43-3d79098c5148> in <module>()
      1 if expression1:
----> 2     raise AssertionError


AssertionError:

If two expressions are given to an assert statement and the first evaluates to False, an AssertionError is raised with expression2 as the argument:

if True:
    raise AssertionError(expression2)


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)

<ipython-input-42-83355e252f98> in <module>()
      1 if True:
----> 2     raise AssertionError(expression2)


AssertionError: 98.9969064191802
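
To see the second expression arrive as the argument of the error, we can catch the AssertionError and inspect it. This is a minimal sketch: the function check_differences is hypothetical and simply wraps the assert from the question. It assumes Python is running normally, since assert statements are skipped when Python is started with the -O option.

```python
def check_differences(n_differences):
    # Same form as in the question: a condition, then a value to report.
    assert n_differences == 1, n_differences

# Passes silently because the condition is True.
check_differences(1)

try:
    check_differences(3)
except AssertionError as error:
    # The second expression is stored in the args of the error.
    caught_args = error.args

print(caught_args)  # (3,)
```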

Question 5:

This part is also difficult to understand:

def test_sequence_content_fraction():
    f_A = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='A')
    f_B = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='B')
    f_C = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='C')

    # This is how you should check if two floats have the same value
    # (the values must be close, but not exactly the same)
    assert abs(f_A - f_B) < 1e-12
    assert abs(f_A - f_C) < 1e-12
    print('Test OK.')
test_sequence_content_fraction()

Response 5:


# We need a dummy function to explain this code:
def sequence_content_fraction(*args, **kwds):
    import random
    almost_the_same = 0.654 + 1e-14 * random.random()
    print('Returning sequence_content_fraction:', almost_the_same)
    return almost_the_same

def test_sequence_content_fraction():
    f_A = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='A')
    f_B = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='B')
    f_C = sequence_content_fraction('AAABBBCCC', 'BC', 'ABC', method='C')

    # This is how you should check if two floats have the same value
    # (the values must be close, but not exactly the same)
    assert abs(f_A - f_B) < 1e-12
    assert abs(f_A - f_C) < 1e-12
    print('Test OK.')

test_sequence_content_fraction()

Returning sequence_content_fraction: 0.6540000000000047
Returning sequence_content_fraction: 0.6540000000000042
Returning sequence_content_fraction: 0.6540000000000017
Test OK.

This code runs the function sequence_content_fraction three times and stores the results in f_A, f_B, and f_C.

The assert statements check the question “Are these numbers equal to within a small tolerance (here 1e-12)?” using the following expressions:

f_A = float('0.6540000000000082') # Your numbers may be different.
f_B = float('0.6540000000000096')
f_C = float('0.6540000000000085')

expressionA = abs(f_A - f_B) < 1e-12
expressionB = abs(f_A - f_C) < 1e-12

The abs function is required to avoid the following error in logic:

f_A = 0.25
f_B = 500000

# This evaluates to True even though the numbers are not close.
print(f_A - f_B < 1e-12)

# This correctly evaluates to False (numbers are not close.)
print(abs(f_A - f_B) < 1e-12)

True
False
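
The standard library also offers math.isclose (Python 3.5 and later) for this kind of comparison. The sketch below is an alternative to the abs(...) < 1e-12 check used in the test, not a replacement chosen by the course. Setting rel_tol=0.0 and abs_tol=1e-12 makes it behave like the absolute comparison above.

```python
import math

f_A = 0.1 + 0.2
f_B = 0.3

# Exact comparison fails because of rounding in floating point arithmetic.
exactly_equal = f_A == f_B
print(exactly_equal)  # False

# The check used in the test: absolute difference below a tolerance.
close_by_abs = abs(f_A - f_B) < 1e-12
print(close_by_abs)  # True

# The same check written with math.isclose and an absolute tolerance.
close_by_isclose = math.isclose(f_A, f_B, rel_tol=0.0, abs_tol=1e-12)
print(close_by_isclose)  # True

# Numbers that are far apart are reported as not close.
far_apart = math.isclose(0.25, 500000, rel_tol=0.0, abs_tol=1e-12)
print(far_apart)  # False
```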