Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
cs-312:project-5 [2015/02/18 22:30]
ringger [Distribution]
cs-312:project-5 [2015/03/04 09:58]
ringger
Line 5: Line 5:
  
 ==Objectives== ==Objectives==
-*Design and implement a Dynamic Programming algorithm that has applications to gene sequence alignment.+*Design and implement a Dynamic Programming algorithm that solves the gene sequence alignment ​problem.
 *Think carefully about the use of memory in your implementation. *Think carefully about the use of memory in your implementation.
 *Solve a non-trivial computational genomics problem. *Solve a non-trivial computational genomics problem.
Line 25: Line 25:
 ==Provided Framework== ==Provided Framework==
  
-You are provided with some scaffolding code to help you get started on the project. ​ We provide a Visual Studio project with a GUI containing a 10x10 matrix, with a row and a column for each of 10 organisms. ​ The organism at row is the same organism for column ​n.  Note that this matrix is '''​*not*'''​ the dynamic programming table; it is simply a place to store and display the final minimum alignment cost result from the alignment of the gene sequences for each pair of organisms. ​ Thus, cell (i, j) in this table will contain the minimum cost alignment score of the genome for organism i with the genome for organism j.+You are provided with some scaffolding code to help you get started on the project. ​ We provide a Visual Studio project with a GUI containing a 10x10 matrix, with a row and a column for each of 10 organisms. ​ The organism at row $X$ is the same organism for column ​$X$.  Note that this matrix is '''​*not*'''​ the dynamic programming table; it is simply a place to store and display the final minimum alignment cost result from the alignment of the gene sequences for each pair of organisms. ​ Thus, cell $(i, j)in this table will contain the minimum cost alignment score of the genome for organism ​$iwith the genome for organism ​$j$.
  
-When you press the “Process” button on the toolbar, the matrix fills with numbers, one number for each pair, as shown in the figure below. ​ You will generate these numbers by aligning '''​exacty the first 5000 characters'''​ (bases) in each sequence pair.  ''​Currently,​ the numbers are computed by a quick placeholder computation (the difference in sequence lengths) that is not particularly useful; these are the numbers shown in the figure below. ''​ Your job will be to replace that number with the output from a sequence alignment function that employs dynamic programming. ​ You will fill the matrix with the pair-wise scores. ​ When you press the “Process” button, the event handler for the button-down event should call the <code>​PairWiseAlign.Align()</​code> method which you implement.+When you press the “Process” button on the toolbar, the matrix fills with numbers, one number for each pair, as shown in the figure below. ​ You will generate these numbers by aligning '''​exacty the first 5000 characters'''​ (bases) in each sequence pair.  ''​Currently,​ the numbers are computed by a quick placeholder computation (the difference in sequence lengths) that is not particularly useful; these are the numbers shown in the figure below. ''​ Your job will be to replace that number with the output from a sequence alignment function that employs dynamic programming. ​ You will fill the matrix with the pair-wise scores. ​ When you press the “Process” button, the event handler for the button-down event should call the <tt>​PairWiseAlign.Align()</​tt> method which you implement.
  
 [[media:​cs-312:​project-5:​genesequencealignment1.png?​500px]] [[media:​cs-312:​project-5:​genesequencealignment1.png?​500px]]
Line 41: Line 41:
  
 Be sure that you are using the latest .NET framework: Bring up the properties of your project (can use the key-chord Alt+F7). For the property "​Target framework:",​ specify .NET Framework 4 (or higher). Be sure that you are using the latest .NET framework: Bring up the properties of your project (can use the key-chord Alt+F7). For the property "​Target framework:",​ specify .NET Framework 4 (or higher).
 +
 +Summary:
 +* For each pair of genes, you will apply the DP algorithm to compute their alignment cost.
 +* You will summarize the alignment costs for all pairs in the results table (the 10 x 10 table) in the GUI.  The results table is just a way of presenting the results. ​
  
 == Required: Whiteboard Experience (TM) == == Required: Whiteboard Experience (TM) ==
Line 74: Line 78:
 #** Matches are rewarded -3 units #** Matches are rewarded -3 units
 #* Make your pseudo-code easy to follow, but keep it within one page.  If needed for clarity, document your algorithm with legible comments in important places in the pseudo-code. ​ Clear pseudo-code provides evidence of your comprehension. ​ The correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded. #* Make your pseudo-code easy to follow, but keep it within one page.  If needed for clarity, document your algorithm with legible comments in important places in the pseudo-code. ​ Clear pseudo-code provides evidence of your comprehension. ​ The correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded.
-# Implement the scoring algorithm. ​ When scoring, only align the first 5000 characters (bases) of each sequence. ​ Implement your algorithm in the <code>​PairWiseAlign.Align()</​code> method. ​ That method should return the alignment score.+# Implement the scoring algorithm. ​ When scoring, only align the first 5000 characters (bases) of each sequence. ​ Implement your algorithm in the <tt>​PairWiseAlign.Align()</​tt> method. ​ That method should return the alignment score.
 #* Document your algorithm with legible comments in important places in the code that make it obvious to the reader what your solution is doing. ​ This documentation also provides evidence of your comprehension. ​ With the aid of adequate documentation,​ the correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded. #* Document your algorithm with legible comments in important places in the code that make it obvious to the reader what your solution is doing. ​ This documentation also provides evidence of your comprehension. ​ With the aid of adequate documentation,​ the correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded.
 # Record the pair-wise scores computed by the scoring algorithm in the given 10x10 score matrix such that position (i, j) in the display shows the optimal distance from $sequence_i$ to $sequence_j$ for the first 5000 characters. ​ Since the matrix is symmetric, you should only fill the upper part of the matrix (above the diagonal). ​ Correctness of your algorithm will be checked using this matrix. ​ (There is also no need to fill the diagonal, since the alignment of a sequence with itself consists only of matches.) # Record the pair-wise scores computed by the scoring algorithm in the given 10x10 score matrix such that position (i, j) in the display shows the optimal distance from $sequence_i$ to $sequence_j$ for the first 5000 characters. ​ Since the matrix is symmetric, you should only fill the upper part of the matrix (above the diagonal). ​ Correctness of your algorithm will be checked using this matrix. ​ (There is also no need to fill the diagonal, since the alignment of a sequence with itself consists only of matches.)
Line 82: Line 86:
 #* As a further aid for verifying the correctness of your algorithm, here is another value that should appear in your table: ​ (counting from 1) at row 3, column 4, the result should be '''​-14972'''​. ​ That is the result of aligning the first 5000 bases each of the genomes for “Bovine coronavirus isolate BCoV-LUN” and “Bovine coronavirus isolate BCoV-ENT”. #* As a further aid for verifying the correctness of your algorithm, here is another value that should appear in your table: ​ (counting from 1) at row 3, column 4, the result should be '''​-14972'''​. ​ That is the result of aligning the first 5000 bases each of the genomes for “Bovine coronavirus isolate BCoV-LUN” and “Bovine coronavirus isolate BCoV-ENT”.
 # Implement the extraction algorithm. ​ Your function should produce a character-by-character alignment of its two sequence arguments. ​ (This function should not be called to fill the score matrix.) # Implement the extraction algorithm. ​ Your function should produce a character-by-character alignment of its two sequence arguments. ​ (This function should not be called to fill the score matrix.)
-#* Add a text field to your GUI for displaying an extracted alignment. ​ (Right click on <code>​MainForm.cs</​code> in the "​Solution Explorer",​ and choose "View Designer"​ to view the GUI in designer mode and gain access to the toolbox of visual controls.)+#* Add a text field to your GUI for displaying an extracted alignment. ​ (Right click on <tt>​MainForm.cs</​tt> in the "​Solution Explorer",​ and choose "View Designer"​ to view the GUI in designer mode and gain access to the toolbox of visual controls.)
 #* Add a click event to the GUI table cells. ​ The method handling this event should use your extraction algorithm to calculate the optimal alignment for the strings corresponding to that cell and should display the alignment in the text field. ​ Thus, the field will be updated whenever you click on a cell in the score table. #* Add a click event to the GUI table cells. ​ The method handling this event should use your extraction algorithm to calculate the optimal alignment for the strings corresponding to that cell and should display the alignment in the text field. ​ Thus, the field will be updated whenever you click on a cell in the score table.
-#* To add the click event, select the <code>​DataGridView</​code> (the gray box in the middle of the main form), and select the Events tab (the lightning bolt) on the Properties list.  Add a handler for <code>​CellMouseClick</​code> by double clicking on its row.+#* To add the click event, select the <tt>​DataGridView</​tt> (the gray box in the middle of the main form), and select the Events tab (the lightning bolt) on the Properties list.  Add a handler for <tt>​CellMouseClick</​tt> by double clicking on its row.
 #* Display the alignment in a manner that easily reveals the matches, substitutions,​ and insertions/​deletions (indels), such as is shown in the short example below. ​ Gaps should be shown with a '​-'​ (hyphen). #* Display the alignment in a manner that easily reveals the matches, substitutions,​ and insertions/​deletions (indels), such as is shown in the short example below. ​ Gaps should be shown with a '​-'​ (hyphen).
 #* Extract and display the alignment of the first 100 characters (bases) (no more, no less) of each of the selected sequences in the pair. (The alignment itself will most likely include gaps and will therefore be longer than 100 characters.) #* Extract and display the alignment of the first 100 characters (bases) (no more, no less) of each of the selected sequences in the pair. (The alignment itself will most likely include gaps and will therefore be longer than 100 characters.)
Line 91: Line 95:
 # Align the first 100 characters (bases) of sequence #3 (row 3) and #10 (column 10) (assuming the first sequence in the table is numbered as #1); extract and display the alignment in your report using a fixed-width font.  For example, if you were aligning "​AGCTCATGCT"​ and "​ACTGCATCA",​ the side-by-side alignment should be presented in the manner shown here: # Align the first 100 characters (bases) of sequence #3 (row 3) and #10 (column 10) (assuming the first sequence in the table is numbered as #1); extract and display the alignment in your report using a fixed-width font.  For example, if you were aligning "​AGCTCATGCT"​ and "​ACTGCATCA",​ the side-by-side alignment should be presented in the manner shown here:
 <pre> <pre>
-AGCT-CATGCT +AGCTCATGCT 
-A-CTGCAT-CA+ACTGCAT-CA
 </​pre>​ </​pre>​
  
Line 119: Line 123:
 Note that you may need to reset the Platform Target if you switch from Debug to Release or vice versa. Note that you may need to reset the Platform Target if you switch from Debug to Release or vice versa.
  
-Setting the Platform Target to "​x86"​ eliminates the <code>​InvalidOperationException</​code> with the message "The '​Microsoft.Jet.OLEDB.4.0'​ provider is not registered on the local machine."​+Setting the Platform Target to "​x86"​ eliminates the <tt>​InvalidOperationException</​tt> with the message "The '​Microsoft.Jet.OLEDB.4.0'​ provider is not registered on the local machine."​
  
 ==== x64 ==== ==== x64 ====
Line 174: Line 178:
 #Experiment with alternative operator (substitution,​ match, indel) costs and discuss the impact of changing their values. #Experiment with alternative operator (substitution,​ match, indel) costs and discuss the impact of changing their values.
 #Try aligning longer (sub-)sequences. ​ Conduct an empirical analysis. ​ Discuss impact on the alignment score matrix. #Try aligning longer (sub-)sequences. ​ Conduct an empirical analysis. ​ Discuss impact on the alignment score matrix.
-#​Re-implement your alignment algorithm in top-down fashion using recursion and a memory function. ​ How does this algorithm compare to the implementation ​using a table?+#​Re-implement your alignment algorithm in top-down fashion using recursion and a memory function. ​ How does this algorithm compare to the iterative, bottom-up ​implementation?​
 #Try a shortest-path algorithm like Dijkstra’s to solve this problem. #Try a shortest-path algorithm like Dijkstra’s to solve this problem.
 #Get ahead of the game, learn about the A* shortest-path algorithm, and implement it with a tight admissible heuristic (we will get to this topic later in the course, but you might enjoy trying it in the context of the alignment problem). #Get ahead of the game, learn about the A* shortest-path algorithm, and implement it with a tight admissible heuristic (we will get to this topic later in the course, but you might enjoy trying it in the context of the alignment problem).
Line 211: Line 215:
  
 The availability of the sequence data and functional dissection of the SARS-CoV genome is the first step towards developing diagnostic tests, antiviral agents, and vaccines. The availability of the sequence data and functional dissection of the SARS-CoV genome is the first step towards developing diagnostic tests, antiviral agents, and vaccines.
- 
cs-312/project-5.txt · Last modified: 2015/03/04 09:58 by ringger
Back to top
CC Attribution-Share Alike 4.0 International
chimeric.de = chi`s home Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0