Algorithms-DNA-Sequencing

🧬 Algorithms for DNA Sequencing by Johns Hopkins University


Project maintained by claytonjwong Hosted on GitHub Pages — Theme by mattgraham

Algorithms for DNA Sequencing

Week 4: Algorithms for Assembly

Lectures

  1. Shortest Common Substring
  2. Greedy Shortest Common Substring
  3. 3rd Law of Assembly: Repeats are Bad
  4. De Bruijn Graphs and Eulerian Walks
  5. When Eulerian Walks Go Wrong
  6. Assemblers in Practice
  7. The Future is Long?
  8. Wrap Up
    • Computer Scientists are signficiant contributors to Life Sciences since DNA, RNA, proteins, etc can be abstracted into algorithms based on strings.

Assignment 4

Questions 1 and 2

In a practical, we saw the scs function (copied below along with overlap) for finding the shortest common superstring of a set of strings.

def overlap(a, b, min_length=3):
    """ Return length of longest suffix of 'a' matching
        a prefix of 'b' that is at least 'min_length'
        characters long.  If no such overlap exists,
        return 0. """
    start = 0  # start all the way at the left
    while True:
        start = a.find(b[:min_length], start)  # look for b's suffx in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1  # move just past previous match

import itertools

def scss(ss):
    """ Returns shortest common superstring of given
        strings, which must be the same length """
    shortest_sup = None
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]  # superstring starts as first string
        for i in range(len(ss)-1):
            # overlap adjacent strings A and B in the permutation
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            # add non-overlapping portion of B to superstring
            sup += ssperm[i+1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup  # found shorter superstring
    return shortest_sup  # return shortest    

It’s possible for there to be multiple different shortest common superstrings for the same set of input strings. Consider the input strings ABC, BCA, CAB. One shortest common superstring is ABCAB but another is BCABC and another is CABCA.

Questions 3 and 4

Download this FASTQ file containing synthetic sequencing reads from a mystery virus. All the reads are the same length (100 bases) and are exact copies of substrings from the forward strand of the virus genome. You don’t have to worry about sequencing errors, ploidy, or reads coming from the reverse strand.

Assemble these reads using one of the approaches discussed, such as greedy shortest common superstring. Since there are many reads, you might consider ways to make the algorithm faster, such as the one discussed in the programming assignment in the previous module.

def pick_max_overlap(reads, k):
    best_a, best_b, best_len = None, None, 0
    for a, b in itertools.permutations(reads, 2):
        length = overlap(a, b, min_length=k)
        if length > best_len:
            best_a, best_b, best_len = a, b, length
    return best_a, best_b, best_len

def greedy_scss(reads, k):
    while True:
        a, b, olen = pick_max_overlap(reads, k)
        if olen == 0:
            break
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[olen:])
    return ''.join(reads) # append all non-overlaps onto eachother and return the concatenated string

Hint: the virus genome you are assembling is exactly 15,894 bases long

Supplemental

The greedy shortest common superstring algorithm is a Monte Carlo algorithm:

#
# optimal solution derived from the naive scss algorithm (which is very slow!)
#
res, _ = scss(['ABCD', 'CDBC', 'BCDA'])
print(res)
print(len(res))
#
# non-optimal solution (ie. greedy solution is a monte-carlo algorithm [sometimes its wrong!])
#
res = greedy_scss(['ABCD', 'CDBC', 'BCDA'], 1)
print(res)
print(len(res))

Solution 4