Project #5: Gene Sequence Alignment

Overview

In this project, you will implement dynamic programming algorithms for computing the minimal cost of aligning gene sequences and for extracting optimal alignments.

Objectives

Design and implement a Dynamic Programming algorithm that solves the gene sequence alignment problem.
Think carefully about the use of memory in your implementation.
Solve a non-trivial computational genomics problem.
Mature as problem-solvers.

Distribution

Code: http://faculty.cs.byu.edu/~ringger/cs312/projects/Project5-GeneSequenceAlignment-VS2008.zip

The data is distributed with the code as an .mdb database file. However, if you want to look at plain text, see the following raw data:

http://faculty.cs.byu.edu/~ringger/cs312/projects/GeneData.zip

Background

You are using dynamic programming to align multiple gene sequences, two at a time. In light of the SARS outbreak a few years ago, we have chosen to use the SARS virus as one of our DNA sequences. SARS is a coronavirus, and we have also provided the genomes for several other coronaviruses. It is your job in this project to align all pairs in order to assess the pair-wise similarity.

To prepare to succeed on this project, make sure you understand the sequence alignment and solution extraction algorithms as presented in class. The edit distance algorithm in section 6.3 of the Dasgupta et al. textbook is essentially the same algorithm.

Provided Framework

You are provided with some scaffolding code to help you get started on the project. We provide a Visual Studio project with a GUI containing a 10×10 matrix, with a row and a column for each of 10 organisms. The organism at row $X$ is the same organism for column $X$. Note that this matrix is

not* the dynamic programming table; it is simply a place to store and display the final minimum alignment cost result from the alignment of the gene sequences for each pair of organisms. Thus, cell $(i, j)$ in this table will contain the minimum cost alignment score of the genome for organism $i$ with the genome for organism $j$.

When you press the “Process” button on the toolbar, the matrix fills with numbers, one number for each pair, as shown in the figure below. You will generate these numbers by aligning exacty the first 5000 characters (bases) in each sequence pair. Currently, the numbers are computed by a quick placeholder computation (the difference in sequence lengths) that is not particularly useful; these are the numbers shown in the figure below. Your job will be to replace that number with the output from a sequence alignment function that employs dynamic programming. You will fill the matrix with the pair-wise scores. When you press the “Process” button, the event handler for the button-down event should call the PairWiseAlign.Align() method which you implement.

genesequencealignment1.png

Each gene sequence consists of a string of letters and is stored in the given database file. The scaffolding code loads the data from the database. For example, the record for the “sars9” genome contains the following sequence (prefix shown here):

atattaggtttttacctacccaggaaaagccaaccaacctcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgtagctgtcgctcggctgca
tgcctagtgcacctacgcagtataaacaataataaattttactgtcgttgacaagaaacgagtaactcgtccctct…

We encourage you to familiarize yourself with the few lines of existing code required for using the database file, in order to become familiar with this functionality of C#. Note that you can also open the .mdb database file using Microsoft Access, if you have installed a copy (possibly through the MSDN Academic Alliance), to inspect the data that way.

Be sure that you are using the latest .NET framework: Bring up the properties of your project (can use the key-chord Alt+F7). For the property “Target framework:”, specify .NET Framework 4 (or higher).

Summary:

For each pair of genes, you will apply the DP algorithm to compute their alignment cost.
You will summarize the alignment costs for all pairs in the results table (the 10 x 10 table) in the GUI. The results table is just a way of presenting the results.

Required: Whiteboard Experience (TM)

You are developing as a problem solver. You are also learning to solve a problem before you start writing code. In this project, we require you to step up to a whiteboard after reading these instructions before writing any code. Use the whiteboard to walk through the solution with a friend. Your friend could be another CS major, a CS 312 classmate, or another technical person with an eye for detail.

Make sure you sketch out and understand the distinction between the problem matrix and the individual DP tables. Simulate the alignment algorithm on an example DP table. Think out loud and with sketches.

In short, make sure you understand what you are going to do before you write a stitch of code. As you will note below, including a photo of your “whiteboard experience” is a required element of your project report.

Have a “whiteboard experience” and document it with a photograph of you next to the whiteboard sketches, etc.

Prepare a whiteboard experience report as a .pdf file. It should consist of one page with the following elements:

your name
your section number
[5 points] your photo
[5 points] a caption (2-3 sentences) explaining why you are ready to proceed to write code
the names of the people in your audience

Submit the whiteboard experience report online by the whiteboard experience deadline for this project.

To Do

Have a “whiteboard experience” as directed above, and submit it online by the deadline for the whiteboard experience.
Plan out your data structures, and write pseudo-code for the alignment and extraction algorithms that you need to implement. There are two different (but very similar) alignment algorithms to implement:
- The first algorithm should simply score the alignment between two sequences but cannot be used to extract the actual alignment. Call this the “scoring algorithm”.
- The second algorithm should align the sequences and compute the alignment score (like the first algorithm) but do so in such a way that the actual character-by-character alignment can also be extracted. Call this the “extraction algorithm”.
- Each algorithm should use dynamic programming: in particular, use the Needleman/Wunsch algorithm we used in class. Find the best alignment by minimizing the total cost.
- The requirement is that your scoring algorithm run in at most $O(n^2)$ time and $O(n)$ space; this will require a modification to the algorithm as given in order to be more sparing in the use of space. On the other hand, your extraction algorithm can use $O(n^2)$ in time and space, as shown in class and practiced in the homework.
- Your scoring and extraction algorithms should use the following operator costs to compute the distance between two sequences:
  - Substitutions, which are single character mis-matches, are penalized 1 unit
  - Insertions/Deletions (“indels”), which create gaps, are penalized 5 units
  - Matches are rewarded -3 units
- Make your pseudo-code easy to follow, but keep it within one page. If needed for clarity, document your algorithm with legible comments in important places in the pseudo-code. Clear pseudo-code provides evidence of your comprehension. The correctness of your approach should be easy for the reader (e.g., TA) to verify. If the TA cannot easily determine the code's correctness, then fewer points will be awarded.
Implement the scoring algorithm. When scoring, only align the first 5000 characters (bases) of each sequence. Implement your algorithm in the PairWiseAlign.Align() method. That method should return the alignment score.
- Document your algorithm with legible comments in important places in the code that make it obvious to the reader what your solution is doing. This documentation also provides evidence of your comprehension. With the aid of adequate documentation, the correctness of your approach should be easy for the reader (e.g., TA) to verify. If the TA cannot easily determine the code's correctness, then fewer points will be awarded.
Record the pair-wise scores computed by the scoring algorithm in the given 10×10 score matrix such that position (i, j) in the display shows the optimal distance from $sequence_i$ to $sequence_j$ for the first 5000 characters. Since the matrix is symmetric, you should only fill the upper part of the matrix (above the diagonal). Correctness of your algorithm will be checked using this matrix. (There is also no need to fill the diagonal, since the alignment of a sequence with itself consists only of matches.)
As an aid to debugging, the first two sequences in the database are the strings: “polynomial” and “exponential”. The string “polynomial” is the sequence for the first row (and column) of the score matrix, and “exponential” comes second. While these strings aren’t biologically valid DNA sequences, they are the strings used in an example in the textbook.
- To help you verify the correctness of your algorithm, with our operator costs the optimal alignment of these two strings should be -1 (your code should compute that result for the cell at row 1 and column 2 in the table).
  - You could also try the operator costs in the book chapter and verify your DP table against the one in the book (see pp. 163-164 in the print textbook).
- As a further aid for verifying the correctness of your algorithm, here is another value that should appear in your table: (counting from 1) at row 3, column 4, the result should be -14972. That is the result of aligning the first 5000 bases each of the genomes for “Bovine coronavirus isolate BCoV-LUN” and “Bovine coronavirus isolate BCoV-ENT”.
Implement the extraction algorithm. Your function should produce a character-by-character alignment of its two sequence arguments. (This function should not be called to fill the score matrix.)
- Add a text field to your GUI for displaying an extracted alignment. (Right click on MainForm.cs in the “Solution Explorer”, and choose “View Designer” to view the GUI in designer mode and gain access to the toolbox of visual controls.)
- Add a click event to the GUI table cells. The method handling this event should use your extraction algorithm to calculate the optimal alignment for the strings corresponding to that cell and should display the alignment in the text field. Thus, the field will be updated whenever you click on a cell in the score table.
- To add the click event, select the DataGridView (the gray box in the middle of the main form), and select the Events tab (the lightning bolt) on the Properties list. Add a handler for CellMouseClick by double clicking on its row.
- Display the alignment in a manner that easily reveals the matches, substitutions, and insertions/deletions (indels), such as is shown in the short example below. Gaps should be shown with a '-' (hyphen).
- Extract and display the alignment of the first 100 characters (bases) (no more, no less) of each of the selected sequences in the pair. (The alignment itself will most likely include gaps and will therefore be longer than 100 characters.)
- In case there is more than one optimal alignment, choose one of them.
- For the extra mile, display all of the optimal alignments.
Align the first 100 characters (bases) of sequence #3 (row 3) and #10 (column 10) (assuming the first sequence in the table is numbered as #1); extract and display the alignment in your report using a fixed-width font. For example, if you were aligning “AGCTCATGCT” and “ACTGCATCA”, the side-by-side alignment should be presented in the manner shown here:

AGCTCATGCT
ACTGCAT-CA

Performance Requirement

You must fill the 10×10 score matrix with your scoring algorithm in less than 30 seconds using $O(n^2)$ time and $O(n)$ space. Remember: Since the matrix is symmetric, you should only fill the upper part of the matrix (above the diagonal)!

If you have implemented the algorithms correctly but are struggling to meet the time requirement, it is worth noting that your program will run more quickly if you:

Build and run the “release” version (rather than debug)
Run your program unattached to Visual Studio (“Start without Debugging (Ctrl-F5)”)

You are welcome to run on a lab machine if it helps provide you with a faster time. Any time you report should be a real time on a real machine.

Implementation Notes

Platform

x86

If you are running on a 64-bit operating system and wish to use our project out of the box, you will need to switch to a 32-bit configuration. You have the choice. Start by changing the build properties for the project as follows:

Load the solution into Visual Studio.
In the Solution Explorer (usually on the right side of the screen), right click “Genetics Lab” and choose “Properties”.
On the page that appears, click the tab labeled “Build”.
Now, click the “Platform Target” drop-down box and choose “x86”.
Save

Note that you may need to reset the Platform Target if you switch from Debug to Release or vice versa.

Setting the Platform Target to “x86” eliminates the InvalidOperationException with the message “The 'Microsoft.Jet.OLEDB.4.0' provider is not registered on the local machine.”

x64

If you wish to use 64-bit Visual Studio on 64-bit Windows, you will need to install the Access database engine added in MS Office 2010. It is 64-bit and 100% backwards compatible. Simply change the provider in the call to the database constructor as follows, and it will work out of the box:

Original (32-bit):

string strAccessConn = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + dbaseFileName;

New (x64):

string strAccessConn = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + dbaseFileName;

The 64-bit database engine cannot run with the 32-bit version of Office installed. So you would need to install the 64-bit version of Office to get it to work.

It should also work without Office installed. Here's a link to the 64-bit installer:

: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=13255

(Thanks to Nathan Standiford and Jason Rasmussen for the information on this x64 option.)

Coordinates

X-Y coordinates vs. Row-Column coordinates:

Typically we address points in the plane with coordinates (x,y), where x refers to the horizontal offset from the origin, and y refers to the vertical offset from the origin
However, we address points in a 2-D table with coordinates (row, column), where row refers to the vertical position relative to cell (0,0) (usually in the upper left), and column refers to the horizontal position from cell (0,0)
In this project, we use (row, column) addressing for our tables. Be sure that you are thinking in these terms as well!

Report

Write a type-written report containing the following distinct sections as a single PDF document, and submit it through Learning Suite.

Place your name, net ID, and section number at the top of the first page.
Also include an estimate of the amount of time you spent on this project at the top of the first page.
Include a “methods” section containing the pseudo-code for:
- [20 points] your dynamic programming scoring algorithm (linear space)
- [15 points] your alignment extraction algorithm, including the backtrace
- Make your pseudo-code easy to follow, but keep it within one page. If needed for clarity, document your algorithm with legible comments in important places in the pseudo-code. Clear pseudo-code provides evidence of your comprehension. The correctness of your approach should be easy for the reader (e.g., TA) to verify. If the TA cannot easily determine the code's correctness, then fewer points will be awarded.
Analyze your pseudo-code for both space and time complexity (worst-case). Give reasonable justifications for your analysis.
- [5 points] Your analysis should show that your scoring algorithm is at most $O(n^2)$ time and at most $O(n)$ space
- [5 points] Your analysis should show that your extraction algorithm is at most $O(n^2)$ space and time.
[5 points] Write a paragraph that explains how your alignment extraction algorithm works.
Include a “results” section showing:
- [10 points] a screen-shot of your 10×10 score matrix.
- [10 points] your running time for producing the screenshot. (The running time may appear in the screenshot.) (Full points for meeting the performance requirement.)
- [10 points] the extracted alignment for the first 100 characters of sequences #3 and #10 (counting from 1), printed in a side-by-side fashion in such a way that matches, substitutions, and insertions/deletions are clearly discernible as shown above in the To Do section.
[20 points] Attach your source code for both your scoring and extraction algorithms.
- Document your algorithm with legible comments in important places in the code that make it obvious to the reader what your solution is doing. This documentation also provides evidence of your comprehension. With the aid of adequate documentation, the correctness of your approach should be easy for the reader (e.g., TA) to verify. If the TA cannot easily determine the code's correctness, then fewer points will be awarded.

Further Exploration

Here are a couple of suggestions for further exploration. They are listed without prejudice against things that are or are not on this list.

Experiment with alternative operator (substitution, match, indel) costs and discuss the impact of changing their values.
Try aligning longer (sub-)sequences. Conduct an empirical analysis. Discuss impact on the alignment score matrix.
Re-implement your alignment algorithm in top-down fashion using recursion and a memory function. How does this algorithm compare to the iterative, bottom-up implementation?
Try a shortest-path algorithm like Dijkstra’s to solve this problem.
Get ahead of the game, learn about the A* shortest-path algorithm, and implement it with a tight admissible heuristic (we will get to this topic later in the course, but you might enjoy trying it in the context of the alignment problem).

Appendix: Coronaviruses

The following explains the biological setting of this project, including some background on SARS and Coronaviruses in general from the Department of Microbiology and Immunology, University of Leicester.

Coronaviruses: Coronaviruses were first isolated from chickens in 1937. After the discovery of Rhinoviruses in the 1950’s, approximately 50% of colds still could not be ascribed to known agents. In 1965, Tyrell and Bynoe used cultures of human ciliated embryonal trachea to propagate the first human coronavirus (HCoV) in vitro. There are now approximately 15 species in this family, which infect not only man but cattle, pigs, rodents, cats, dogs, and birds (some are serious veterinary pathogens, especially in chickens).

genesequencer2.png genesequencer3.png

Coronavirus particles are irregularly-shaped, ~60-220nm in diameter, with an outer envelope bearing distinctive, ‘club-shaped’ peplomers (~20nm long x 10nm at wide distal end). This ‘crown-like’ appearance (Latin, corona) gives the family its name. The center of the particle appears amorphous in negatively shaped stained EM preps, the nucleocapsid being in a loosely wound rather disordered state.

Most human coronaviruses do not grow in cultured cells, therefore relatively little is known about them, but two strains grow in some cell lines and have been used as a model. Replication is slow compared to other envelope viruses, e.g. influenza.

Coronavirus infection is very common and occurs worldwide. The incidence of infection is strongly seasonal, with the greatest incidence in children in winter. Adult infections are less common. The number of coronavirus serotypes and the extent of antigenic variation is unknown. Re-infections occur throughout life, implying multiple serotypes (at least four are known) and/or antigenic variation, hence the prospects for immunization appear bleak.

SARS: SARS is a type of viral pneumonia, with symptoms including fever, a dry cough, dyspnea (shortness of breath), headache, and hypoxaemia (low blood oxygen concentration). Typical laboratory findings include lymphopaenia (reduced lymphocyte numbers) and mildly elevated aminotransferase levels (indicating liver damage). Death may result from progressive respiratory failure due to alveolar damage. The typical clinical course of SARS involves an improvement in symptoms during the first week of infection, followed by a worsening during the second week. Studies indicate that this worsening may be related to a patient’s immune responses rather than uncontrolled viral replication.

genesequencer4.png

The outbreak is believed to have originated in February 2003 in the Guangdong province of China, where 300 people became ill, and at least five died. After initial reports that a paramyxovirus was responsible, the true cause appears to be a novel coronavirus with some unusual properties. For one thing, SARS virus can be grown in Vero cells (a fibroblast cell line isolated in 1962 from a primate)—a novel property for HCoV’s, most of which cannot be cultivated. In these cells, virus infection results in a cytopathic effect, and budding of coronavirus-like particles from the endoplasmic reticulum within infected cells.

Amplification of short regions of the polymerase gene, (the most strongly conserved part of the coronavirus genome) by reverse transcriptase polymerase chain reaction (RT-PCR) and nucleotide sequencing revealed that the SARS virus is a novel coronavirus which has not previously been present in human populations. This conclusion is confirmed by serological (antigenic) investigations. We now know the complete ~29,700 nucleotide sequence of many isolates of the SARS virus. The sequence appears to be typical of coronaviruses, with no obviously unusual features, although there are some differences in the make up of the non-structural proteins which are unusual.

There is currently no general agreement that antiviral drugs have been shown to be consistently successful in treating SARS or any coronavirus infection, nor any vaccine against SARS. However, new drugs targeted specifically against this virus are under development. Coronaviruses with 99% sequence similarity to the surface spike protein of human SARS isolates have been isolated in Guangdong, China, from apparently healthy masked palm civets (Paguma larvata), a cat-like mammal closely related to the mongoose. The unlucky palm civet is regarded as a delicacy in Guangdong and it is believed that humans became infected as they raised and slaughtered the animals rather than by consumption of infected meat.

Might SARS coronavirus recombine with other human coronaviruses to produce an even more deadly virus? Fortunately, the coronaviruses of which we are aware indicate that recombination has not occurred between viruses of different groups, only within a group, so recombination does not seem likely given the distance between the SARS virus and HCoV.

SARS, The Disease and the Virus (from The National Center for Biotechnology Information): In late 2002, an outbreak of severe, atypical pneumonia was reported in Guangdong Province of China. The disease had an extremely high mortality rate (currently up to 15-19%), and quickly expanded to over 25 countries. The World Health Organization coined it “severe acute respiratory syndrome”, or SARS. In April 2003, a previously unknown coronavirus was isolated from patients and subsequently proven to be the causative agent according to Koch's postulates in experiments on monkeys. The virus has been named SARS coronavirus (SARS-CoV).

The first complete sequence of SARS coronavirus was obtained in BCCA Genome Sciences Centre, Canada, about two weeks after the virus was detected in SARS patients. It was immediately submitted to GenBank prior to publication as a raw nucleotide sequence. GenBank released the sequence to the public the same day under accession number AY274119; the NCBI Viral Genomes Group annotated the sequence also the same day and released it in the form of the Entrez Genomes RefSeq record NC_004718 at 2 am next day. As of the beginning of May of 2003, all the SARS-CoV RNA transcripts have been detected and sequenced in at least two laboratories; further experiments are underway.

The availability of the sequence data and functional dissection of the SARS-CoV genome is the first step towards developing diagnostic tests, antiviral agents, and vaccines.

cs-312/project-5.txt · Last modified: 2015/03/04 09:58 by ringger

Back to top

Table of Contents