Algorithms-DNA-Sequencing

🧬 Algorithms for DNA Sequencing by Johns Hopkins University

Project maintained by claytonjwong Hosted on GitHub Pages — Theme by mattgraham

Algorithms for DNA Sequencing

DNA Sequencing

https://en.wikipedia.org/wiki/DNA_sequencing

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

3 Laws of Assembly

1. If the suffix of read A is similar to the prefix of read B, then A and B might overlap in the genome.
   - See Week 3 - Lecture 7 for details
2. More coverage leads to more and longer overlaps.
   - See Week 3 - Lecture 7 for details
3. Repeats make assembly difficult.
   - See Week 4 - Lecture 3 for details
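The first law can be illustrated with a small suffix/prefix check. This is a minimal sketch that simplifies "similar" to "exactly equal"; the function name `overlap` and the `min_length` parameter are illustrative choices, not taken from the text above.

```python
def overlap(a, b, min_length=3):
    """Return the length of the longest suffix of `a` that exactly matches
    a prefix of `b` and is at least `min_length` long; return 0 if none."""
    start = 0
    while True:
        # Find the next place in `a` where the first min_length characters of `b` occur.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0  # no candidate position left
        # The overlap is valid only if the whole suffix a[start:] is a prefix of `b`.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1  # keep searching further right (shorter suffixes)


if __name__ == '__main__':
    print(overlap('TTACGT', 'CGTGTGC'))  # 3: suffix 'CGT' matches prefix 'CGT'
    print(overlap('TTACGT', 'GTACCC'))   # 0: no suffix/prefix match of length >= 3
```

Searching from left to right means the first hit is the longest overlap. Raising `min_length` reduces chance matches between unrelated reads.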

Week 1: DNA Sequencing, Strings and Matching

Lectures

Assignment

Resources

Week 2: Preprocessing, Indexing, and Approximate Matching

Lectures

Assignment

Resources

Week 3: Edit Distance, Assembly, and Overlaps

Lectures

Assignment

Resources

Week 4: Algorithms for Assembly

Lectures

Resources

mystery.fq

Assignment

Utility Functions

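def read_FAST_A(filename):
    """Read a FASTA file and return its sequence as a single string.

    Header lines (starting with '>') are skipped, so a file with several
    records is returned as one concatenated sequence.
    """
    parts = []
    with open(filename, 'r') as f:
        for line in f:
            # ignore header line with genome information
            if not line.startswith('>'):
                parts.append(line.rstrip())
    # join once at the end instead of repeated string concatenation
    return ''.join(parts)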

def readFAST_Q(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline()  # skip name line
            seq = fh.readline().rstrip()  # read base sequence
            fh.readline()  # skip placeholder line
            qual = fh.readline().rstrip() # base quality line
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities
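The quality strings returned by `readFAST_Q` are ASCII-encoded, one character per base. The sketch below decodes them, assuming the file uses Phred+33 encoding (the common convention, but an assumption here since the text does not state it). The helper name `phred33_to_q` is hypothetical.

```python
def phred33_to_q(qual):
    """Convert a Phred+33 encoded quality string to a list of integer scores.

    Assumption: each character encodes Q as chr(Q + 33).
    """
    return [ord(ch) - 33 for ch in qual]


if __name__ == '__main__':
    # 'I' -> 40, '#' -> 2, '5' -> 20
    print(phred33_to_q('I#5'))  # [40, 2, 20]
```

Higher scores mean the sequencer was more confident in the corresponding base call.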

External Resources

Supplemental

Jupyter Notebooks can be launched from the command line:

$ jupyter notebook 1_notebook.ipynb
[I 11:45:05.991 NotebookApp] The Jupyter Notebook is running at:
[I 11:45:05.991 NotebookApp] http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp]  or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
[I 11:45:05.991 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

[C 11:45:05.998 NotebookApp]
    To access the notebook, open this file in a browser:
        file:///Users/.../Library/Jupyter/runtime/nbserver-38748-open.html
    Or copy and paste one of these URLs:
        http://localhost:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422
     or http://127.0.0.1:8889/?token=070644d6de70204df12235b2356476b577d0744b5df41422

The Python file for each Jupyter Notebook can be executed using ipython. Executing it with plain python instead raises the following error:

NameError: name 'get_ipython' is not defined
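The error arises because `get_ipython` is a name that IPython provides to the code it runs, and the plain interpreter does not define it. The following is a minimal sketch of a check for which environment a script is running in. The helper name `running_under_ipython` is hypothetical and not part of this project.

```python
def running_under_ipython():
    """Return True if the IPython-provided name get_ipython is available."""
    try:
        get_ipython  # noqa: F821 - only defined inside an IPython session
    except NameError:
        return False
    return True


if __name__ == '__main__':
    if running_under_ipython():
        print('IPython session')
    else:
        print('plain Python interpreter')
```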

Auxiliary

K-mers are a fundamental concept: they are the length-k substrings, or “words”, of a DNA sequencing read. Treating reads as collections of “words” lets them be handled with general computer science string algorithms (i.e., simply finding a pattern in a text).
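As a minimal sketch, all k-mers of a read can be listed by sliding a window of width k along it. The function name `kmers` is an illustrative choice.

```python
def kmers(read, k):
    """Return every length-k substring of `read`, in left-to-right order."""
    if k <= 0:
        raise ValueError('k must be a positive integer')
    # A read of length n has n - k + 1 k-mers (none if k > n).
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == '__main__':
    print(kmers('GATTACA', 2))  # ['GA', 'AT', 'TT', 'TA', 'AC', 'CA']
    print(kmers('GATTACA', 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```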

For example, a DNA substring consisting of two nucleotides is a 2-mer (regardless of Mr. Schwarzenegger’s beliefs):