Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revision Both sides next revision
cs-312:project-5 [2014/12/31 16:20]
ringger
cs-312:project-5 [2015/02/21 11:23]
ringger [Further Exploration]
Line 5: Line 5:
  
 ==Objectives== ==Objectives==
-*Design and implement a Dynamic Programming algorithm that has applications to gene sequence alignment.+*Design and implement a Dynamic Programming algorithm that solves the gene sequence alignment ​problem.
 *Think carefully about the use of memory in your implementation. *Think carefully about the use of memory in your implementation.
 *Solve a non-trivial computational genomics problem. *Solve a non-trivial computational genomics problem.
Line 14: Line 14:
 Code: http://​faculty.cs.byu.edu/​~ringger/​cs312/​projects/​Project5-GeneSequenceAlignment-VS2008.zip Code: http://​faculty.cs.byu.edu/​~ringger/​cs312/​projects/​Project5-GeneSequenceAlignment-VS2008.zip
  
-The data is distributed with the code as an .mdb database file.  However, if you want to look at plain text, see the following: http://​faculty.cs.byu.edu/​~ringger/​cs312/​projects/​GeneData.zip+The data is distributed with the code as an .mdb database file.  However, if you want to look at plain text, see the following ​raw data 
 + 
 +http://​faculty.cs.byu.edu/​~ringger/​cs312/​projects/​GeneData.zip
  
 ==Background== ==Background==
Line 23: Line 25:
 ==Provided Framework== ==Provided Framework==
  
-You are provided with some scaffolding code to help you get started on the project. ​ We provide a Visual Studio project with a GUI containing a 10x10 matrix, with a row and a column for each of 10 organisms. ​ The organism at row is the same organism for column ​n.  Note that this matrix is '''​*not*'''​ the dynamic programming table; it is simply a place to store and display the final minimum alignment cost result from the alignment of the gene sequences for each pair of organisms. ​ Thus, cell (i, j) in this table will contain the minimum cost alignment score of the genome for organism i with the genome for organism j.+You are provided with some scaffolding code to help you get started on the project. ​ We provide a Visual Studio project with a GUI containing a 10x10 matrix, with a row and a column for each of 10 organisms. ​ The organism at row $X$ is the same organism for column ​$X$.  Note that this matrix is '''​*not*'''​ the dynamic programming table; it is simply a place to store and display the final minimum alignment cost result from the alignment of the gene sequences for each pair of organisms. ​ Thus, cell $(i, j)in this table will contain the minimum cost alignment score of the genome for organism ​$iwith the genome for organism ​$j$.
  
-When you press the “Process” button on the toolbar, the matrix fills with numbers, one number for each pair, as shown in the figure below. ​ You will generate these numbers by aligning '''​exacty the first 5000 characters'''​ (bases) in each sequence pair.  ''​Currently,​ the numbers are computed by a quick placeholder computation (the difference in sequence lengths) that is not particularly useful; these are the numbers shown in the figure below. ''​ Your job will be to replace that number with the output from a sequence alignment function that employs dynamic programming. ​ You will fill the matrix with the pair-wise scores. ​ When you press the “Process” button, the event handler for the button-down event should call the <code>​PairWiseAlign.Align()</​code> method which you implement.+When you press the “Process” button on the toolbar, the matrix fills with numbers, one number for each pair, as shown in the figure below. ​ You will generate these numbers by aligning '''​exacty the first 5000 characters'''​ (bases) in each sequence pair.  ''​Currently,​ the numbers are computed by a quick placeholder computation (the difference in sequence lengths) that is not particularly useful; these are the numbers shown in the figure below. ''​ Your job will be to replace that number with the output from a sequence alignment function that employs dynamic programming. ​ You will fill the matrix with the pair-wise scores. ​ When you press the “Process” button, the event handler for the button-down event should call the <tt>​PairWiseAlign.Align()</​tt> method which you implement.
  
-[[Image:geneSequenceAlignment1.png|500px|center]]+[[media:cs-312:​project-5:​genesequencealignment1.png?500px]]
  
 Each gene sequence consists of a string of letters and is stored in the given database file.  The scaffolding code loads the data from the database. ​ For example, the record for the “sars9” genome contains the following sequence (prefix shown here): Each gene sequence consists of a string of letters and is stored in the given database file.  The scaffolding code loads the data from the database. ​ For example, the record for the “sars9” genome contains the following sequence (prefix shown here):
Line 39: Line 41:
  
 Be sure that you are using the latest .NET framework: Bring up the properties of your project (can use the key-chord Alt+F7). For the property "​Target framework:",​ specify .NET Framework 4 (or higher). Be sure that you are using the latest .NET framework: Bring up the properties of your project (can use the key-chord Alt+F7). For the property "​Target framework:",​ specify .NET Framework 4 (or higher).
 +
 +Summary:
 +* For each pair of genes, you will apply the DP algorithm to compute their alignment cost.
 +* You will summarize the alignment costs for all pairs in the results table (the 10 x 10 table) in the GUI.  The results table is just a way of presenting the results. ​
  
 == Required: Whiteboard Experience (TM) == == Required: Whiteboard Experience (TM) ==
Line 72: Line 78:
 #** Matches are rewarded -3 units #** Matches are rewarded -3 units
 #* Make your pseudo-code easy to follow, but keep it within one page.  If needed for clarity, document your algorithm with legible comments in important places in the pseudo-code. ​ Clear pseudo-code provides evidence of your comprehension. ​ The correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded. #* Make your pseudo-code easy to follow, but keep it within one page.  If needed for clarity, document your algorithm with legible comments in important places in the pseudo-code. ​ Clear pseudo-code provides evidence of your comprehension. ​ The correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded.
-# Implement the scoring algorithm. ​ When scoring, only align the first 5000 characters (bases) of each sequence. ​ Implement your algorithm in the <code>​PairWiseAlign.Align()</​code> method. ​ That method should return the alignment score.+# Implement the scoring algorithm. ​ When scoring, only align the first 5000 characters (bases) of each sequence. ​ Implement your algorithm in the <tt>​PairWiseAlign.Align()</​tt> method. ​ That method should return the alignment score.
 #* Document your algorithm with legible comments in important places in the code that make it obvious to the reader what your solution is doing. ​ This documentation also provides evidence of your comprehension. ​ With the aid of adequate documentation,​ the correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded. #* Document your algorithm with legible comments in important places in the code that make it obvious to the reader what your solution is doing. ​ This documentation also provides evidence of your comprehension. ​ With the aid of adequate documentation,​ the correctness of your approach should be easy for the reader (e.g., TA) to verify. ​ If the TA cannot easily determine the code's correctness,​ then fewer points will be awarded.
 # Record the pair-wise scores computed by the scoring algorithm in the given 10x10 score matrix such that position (i, j) in the display shows the optimal distance from $sequence_i$ to $sequence_j$ for the first 5000 characters. ​ Since the matrix is symmetric, you should only fill the upper part of the matrix (above the diagonal). ​ Correctness of your algorithm will be checked using this matrix. ​ (There is also no need to fill the diagonal, since the alignment of a sequence with itself consists only of matches.) # Record the pair-wise scores computed by the scoring algorithm in the given 10x10 score matrix such that position (i, j) in the display shows the optimal distance from $sequence_i$ to $sequence_j$ for the first 5000 characters. ​ Since the matrix is symmetric, you should only fill the upper part of the matrix (above the diagonal). ​ Correctness of your algorithm will be checked using this matrix. ​ (There is also no need to fill the diagonal, since the alignment of a sequence with itself consists only of matches.)
Line 80: Line 86:
 #* As a further aid for verifying the correctness of your algorithm, here is another value that should appear in your table: ​ (counting from 1) at row 3, column 4, the result should be '''​-14972'''​. ​ That is the result of aligning the first 5000 bases each of the genomes for “Bovine coronavirus isolate BCoV-LUN” and “Bovine coronavirus isolate BCoV-ENT”. #* As a further aid for verifying the correctness of your algorithm, here is another value that should appear in your table: ​ (counting from 1) at row 3, column 4, the result should be '''​-14972'''​. ​ That is the result of aligning the first 5000 bases each of the genomes for “Bovine coronavirus isolate BCoV-LUN” and “Bovine coronavirus isolate BCoV-ENT”.
 # Implement the extraction algorithm. ​ Your function should produce a character-by-character alignment of its two sequence arguments. ​ (This function should not be called to fill the score matrix.) # Implement the extraction algorithm. ​ Your function should produce a character-by-character alignment of its two sequence arguments. ​ (This function should not be called to fill the score matrix.)
-#* Add a text field to your GUI for displaying an extracted alignment. ​ (Right click on <code>​MainForm.cs</​code> in the "​Solution Explorer",​ and choose "View Designer"​ to view the GUI in designer mode and gain access to the toolbox of visual controls.)+#* Add a text field to your GUI for displaying an extracted alignment. ​ (Right click on <tt>​MainForm.cs</​tt> in the "​Solution Explorer",​ and choose "View Designer"​ to view the GUI in designer mode and gain access to the toolbox of visual controls.)
 #* Add a click event to the GUI table cells. ​ The method handling this event should use your extraction algorithm to calculate the optimal alignment for the strings corresponding to that cell and should display the alignment in the text field. ​ Thus, the field will be updated whenever you click on a cell in the score table. #* Add a click event to the GUI table cells. ​ The method handling this event should use your extraction algorithm to calculate the optimal alignment for the strings corresponding to that cell and should display the alignment in the text field. ​ Thus, the field will be updated whenever you click on a cell in the score table.
-#* To add the click event, select the <code>​DataGridView</​code> (the gray box in the middle of the main form), and select the Events tab (the lightning bolt) on the Properties list.  Add a handler for <code>​CellMouseClick</​code> by double clicking on its row.+#* To add the click event, select the <tt>​DataGridView</​tt> (the gray box in the middle of the main form), and select the Events tab (the lightning bolt) on the Properties list.  Add a handler for <tt>​CellMouseClick</​tt> by double clicking on its row.
 #* Display the alignment in a manner that easily reveals the matches, substitutions,​ and insertions/​deletions (indels), such as is shown in the short example below. ​ Gaps should be shown with a '​-'​ (hyphen). #* Display the alignment in a manner that easily reveals the matches, substitutions,​ and insertions/​deletions (indels), such as is shown in the short example below. ​ Gaps should be shown with a '​-'​ (hyphen).
 #* Extract and display the alignment of the first 100 characters (bases) (no more, no less) of each of the selected sequences in the pair. (The alignment itself will most likely include gaps and will therefore be longer than 100 characters.) #* Extract and display the alignment of the first 100 characters (bases) (no more, no less) of each of the selected sequences in the pair. (The alignment itself will most likely include gaps and will therefore be longer than 100 characters.)
Line 117: Line 123:
 Note that you may need to reset the Platform Target if you switch from Debug to Release or vice versa. Note that you may need to reset the Platform Target if you switch from Debug to Release or vice versa.
  
-Setting the Platform Target to "​x86"​ eliminates the <code>​InvalidOperationException</​code> with the message "The '​Microsoft.Jet.OLEDB.4.0'​ provider is not registered on the local machine."​+Setting the Platform Target to "​x86"​ eliminates the <tt>​InvalidOperationException</​tt> with the message "The '​Microsoft.Jet.OLEDB.4.0'​ provider is not registered on the local machine."​
  
 ==== x64 ==== ==== x64 ====
Line 172: Line 178:
 #Experiment with alternative operator (substitution,​ match, indel) costs and discuss the impact of changing their values. #Experiment with alternative operator (substitution,​ match, indel) costs and discuss the impact of changing their values.
 #Try aligning longer (sub-)sequences. ​ Conduct an empirical analysis. ​ Discuss impact on the alignment score matrix. #Try aligning longer (sub-)sequences. ​ Conduct an empirical analysis. ​ Discuss impact on the alignment score matrix.
-#​Re-implement your alignment algorithm in top-down fashion using recursion and a memory function. ​ How does this algorithm compare to the implementation ​using a table?+#​Re-implement your alignment algorithm in top-down fashion using recursion and a memory function. ​ How does this algorithm compare to the iterative, bottom-up ​implementation?​
 #Try a shortest-path algorithm like Dijkstra’s to solve this problem. #Try a shortest-path algorithm like Dijkstra’s to solve this problem.
 #Get ahead of the game, learn about the A* shortest-path algorithm, and implement it with a tight admissible heuristic (we will get to this topic later in the course, but you might enjoy trying it in the context of the alignment problem). #Get ahead of the game, learn about the A* shortest-path algorithm, and implement it with a tight admissible heuristic (we will get to this topic later in the course, but you might enjoy trying it in the context of the alignment problem).
Line 180: Line 186:
  
 '''​Coronaviruses:'''​ Coronaviruses were first isolated from chickens in 1937.  After the discovery of Rhinoviruses in the 1950’s, approximately 50% of colds still could not be ascribed to known agents. ​ In 1965, Tyrell and Bynoe used cultures of human ciliated embryonal trachea to propagate the first human coronavirus (HCoV) in vitro. ​ There are now approximately 15 species in this family, which infect not only man but cattle, pigs, rodents, cats, dogs, and birds (some are serious veterinary pathogens, especially in chickens). ​ '''​Coronaviruses:'''​ Coronaviruses were first isolated from chickens in 1937.  After the discovery of Rhinoviruses in the 1950’s, approximately 50% of colds still could not be ascribed to known agents. ​ In 1965, Tyrell and Bynoe used cultures of human ciliated embryonal trachea to propagate the first human coronavirus (HCoV) in vitro. ​ There are now approximately 15 species in this family, which infect not only man but cattle, pigs, rodents, cats, dogs, and birds (some are serious veterinary pathogens, especially in chickens). ​
-  ​ + 
-[[Image:GeneSequencer2.png|300px|thumb|Coronavirus]] +[[media:cs-312:​project-5:​genesequencer2.png?300]] 
-[[Image:GeneSequencer3.png|300px|thumb|Coronavirus]]+[[media:​cs-312:​project-5:genesequencer3.png?300]]
  
 Coronavirus particles are irregularly-shaped,​ ~60-220nm in diameter, with an outer envelope bearing distinctive,​ ‘club-shaped’ peplomers (~20nm long x 10nm at wide distal end).  This ‘crown-like’ appearance (Latin, corona) gives the family its name.  The center of the particle appears amorphous in negatively shaped stained EM preps, the nucleocapsid being in a loosely wound rather disordered state. ​ Coronavirus particles are irregularly-shaped,​ ~60-220nm in diameter, with an outer envelope bearing distinctive,​ ‘club-shaped’ peplomers (~20nm long x 10nm at wide distal end).  This ‘crown-like’ appearance (Latin, corona) gives the family its name.  The center of the particle appears amorphous in negatively shaped stained EM preps, the nucleocapsid being in a loosely wound rather disordered state. ​
Line 192: Line 198:
 '''​SARS:'''​ SARS is a type of viral pneumonia, with symptoms including fever, a dry cough, dyspnea (shortness of breath), headache, and hypoxaemia (low blood oxygen concentration). ​ Typical laboratory findings include lymphopaenia (reduced lymphocyte numbers) and mildly elevated aminotransferase levels (indicating liver damage). ​ Death may result from progressive respiratory failure due to alveolar damage. ​ The typical clinical course of SARS involves an improvement in symptoms during the first week of infection, followed by a worsening during the second week.  Studies indicate that this worsening may be related to a patient’s immune responses rather than uncontrolled viral replication. '''​SARS:'''​ SARS is a type of viral pneumonia, with symptoms including fever, a dry cough, dyspnea (shortness of breath), headache, and hypoxaemia (low blood oxygen concentration). ​ Typical laboratory findings include lymphopaenia (reduced lymphocyte numbers) and mildly elevated aminotransferase levels (indicating liver damage). ​ Death may result from progressive respiratory failure due to alveolar damage. ​ The typical clinical course of SARS involves an improvement in symptoms during the first week of infection, followed by a worsening during the second week.  Studies indicate that this worsening may be related to a patient’s immune responses rather than uncontrolled viral replication.
  
-[[Image:GeneSequencer4.png|300px|thumb|SARS virus]]+ 
 +[[media:cs-312:​project-5:​genesequencer4.png?300]]
  
 The outbreak is believed to have originated in February 2003 in the Guangdong province of China, where 300 people became ill, and at least five died.  After initial reports that a paramyxovirus was responsible,​ the true cause appears to be a novel coronavirus with some unusual properties. ​ For one thing, SARS virus can be grown in Vero cells (a fibroblast cell line isolated in 1962 from a primate)—a novel property for HCoV’s, most of which cannot be cultivated. ​ In these cells, virus infection results in a cytopathic effect, and budding of coronavirus-like particles from the endoplasmic reticulum within infected cells. ​ The outbreak is believed to have originated in February 2003 in the Guangdong province of China, where 300 people became ill, and at least five died.  After initial reports that a paramyxovirus was responsible,​ the true cause appears to be a novel coronavirus with some unusual properties. ​ For one thing, SARS virus can be grown in Vero cells (a fibroblast cell line isolated in 1962 from a primate)—a novel property for HCoV’s, most of which cannot be cultivated. ​ In these cells, virus infection results in a cytopathic effect, and budding of coronavirus-like particles from the endoplasmic reticulum within infected cells. ​
Line 208: Line 215:
  
 The availability of the sequence data and functional dissection of the SARS-CoV genome is the first step towards developing diagnostic tests, antiviral agents, and vaccines. The availability of the sequence data and functional dissection of the SARS-CoV genome is the first step towards developing diagnostic tests, antiviral agents, and vaccines.
- 
cs-312/project-5.txt · Last modified: 2015/03/04 09:58 by ringger
Back to top
CC Attribution-Share Alike 4.0 International
chimeric.de = chi`s home Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0