Algorithms-DNA-Sequencing

🧬 Algorithms for DNA Sequencing by Johns Hopkins University


Project maintained by claytonjwong Hosted on GitHub Pages — Theme by mattgraham

Algorithms for DNA Sequencing

DNA Sequencing

https://en.wikipedia.org/wiki/DNA_sequencing

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

3 Laws of Assembly

  1. If the suffix of read A is similar to the prefix of read B then A and B might overlap in the genome
  2. More coverage leads to more and longer overlaps
  3. Repeats make assembly difficult

Week 1: DNA Sequencing, Strings and Matching

Lectures

  1. Introduction
  2. Why Study This?
  3. DNA sequencing past and present
  4. Genomes as strings, reads as substrings
  5. String definitions and Python examples
  6. How DNA gets copied
  7. How second-generation sequencers work
  8. Sequencing errors and base qualities
  9. Working with sequencing reads (FASTQ format)
  10. Sequencers give pieces to genomic puzzles
  11. Read alignment and why it’s hard
  12. Naive exact matching

Assignment

Resources

Week 2: Preprocessing, Indexing, and Approximate Matching

Lectures

  1. Boyer-Moore: Basics
  2. Boyer-Moore: Putting It All Together
  3. Diversion: Repetitive Elements
  4. Preprocessing
  5. Indexing and K-mers
  6. Data Structures for Indexing
  7. Hash Tables for Indexing
  8. Variations on K-mer Indexes
  9. Indexing by Suffix
  10. Approximate Matching, Hamming, and Edit Distance
  11. Pigeonhole Principle

Assignment

Resources

Week 3: Edit Distance, Assembly, and Overlaps

Lectures

  1. Edit Distance (part 1)
  2. Edit Distance (part 2)
  3. Edit Distance (part 3)
  4. Edit Distance (part 4)
  5. Global and Local Alignment
  6. De Novo Shotgun Assembly
  7. 1st and 2nd Laws of Assembly: Overlaps and Coverage
  8. Overlap Graph

Assignment

Resources

Week 4: Algorithms for Assembly

Lectures

  1. Shortest Common Substring
  2. Greedy Shortest Common Substring
  3. 3rd Law of Assembly: Repeats are Bad
  4. De Bruijn Graphs and Eulerian Walks
  5. When Eulerian Walks Go Wrong
  6. Assemblers in Practice
  7. The Future is Long?
  8. Wrap Up

Resources

Assignment

Utility Functions

def read_FAST_A(filename):
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            # ignore header line with genome information
            if not line[0] == '>':
                genome += line.rstrip()
    return genome
def readFAST_Q(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline()  # skip name line
            seq = fh.readline().rstrip()  # read base sequence
            fh.readline()  # skip placeholder line
            qual = fh.readline().rstrip() # base quality line
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

External Resources

Supplemental

$ jupyter notebook 1_notebook.ipynb
[I 11:45:05.991 NotebookApp] The Jupyter Notebook is running at:
[I 11:45:05.991 NotebookApp] http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp]  or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

[C 11:45:05.998 NotebookApp]
    To access the notebook, open this file in a browser:
        file:///Users/.../Library/Jupyter/runtime/nbserver-38748-open.html
    Or copy and paste one of these URLs:
        http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
     or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
NameError: name 'get_ipython' is not defined

Auxillary

K-mers are a fundamental concept for creating “words” from a DNA sequencing read. These “words” are abstracted to computer science string algorithms (ie. simply finding pattern in text).

For example, a DNA substring consisting of two neucleotides is a 2-mer (regardless of Mr. Schwarzenegger’s beliefs):