CSC 17 Assignment: DNA Sequence Alignment with Dynamic Programming Part I: Planning the Implementation We are going to implement the algorithm for global sequence alignment in two stages. First you are going to form groups and formulate a detailed outline of how to organize your implementation. Each person in the group should have a copy of this outline. We will then hold a "meeting" to discuss the plan you've come up with, then write the programs. You will write the programs individually. 0. Be sure to read the tutorials I gave you (also see web links). You will need to implement two versions of the algorithm, using the following scoring schemes: Simple Scoring Scheme: match = 1 mismatch penalty = 0 gap penalty = 0 Advanced Scoring Scheme: match = 2 mismatch penalty = -1 gap penalty = -2 You will need to use what you learned about OOP (inheritance, overriding) to implement new scoring schemes using as little code as possible. You need to think about how to set up your code for the simple algorithm so that it can be easily extended to the advanced one. This time you will have to write the program almost from scratch, I will not give you skeleton code to begin with. You will need to organize your implementation around a central class. You may define other classes as you see fit. 1. Determine the data structures you will need. In particular, answer the following: a. how to represent the two DNA sequences. That is, are you going to use array of chars or Strings, etc... (recommend strings) b. How to represent the matrix. This is a no-brainer. You'll obviously need a 2-D array. But also think about where to declare and create the array. c. How to represent the final alignment. Your program MUST produce output that looks like: GGCTTATG-CTCGAGCCCGGG || || ||| | | | -GCACTTGCCTCTCG-GC-GC Note that this sequence is longer than either original sequence, since there could be gaps in either final sequence. How would you represent this information? What data structure would you use to represent the final alignment (Strings, linked lists, etc..). (recommend strings) 2. How are you going to represent the scoring function and the gap penalties (also see part 5 below). You may assume (as in the tutorials) that the gap penalty (w) is a constant. 3. As the tutorial indicates, the algorithm has three basic stages: Initialization, Matrix fill, and Traceback. We should think of implementing each stage as a seperate function. Sketch out what Each function should do. Be clear as to what parameters and return value the methods will have (if any). Outline the algorithm each function will execute (write it out in pseudo code). Think especially about how your algorithm is going to use the data structures you've decided on above. For traceback, you may want to use a separate data structure to keep track of the *where* each value in the matrix came from. As you trace back, you should generate structures (strings?) to represent the final alignment (like the one above, including the verticle lines that indicate matching characters). ***** I will grade part I qualitatively. That is, it doesn't ***** have to be completely correct, but it needs to be carefully done ***** and well-thought out. Please spend some time on this. Sloppy ***** last-minute put-togethers are spotted easily. 4. Write out (as actual code if possible) the outline of your class: what methods and variables it will contain, as well as pseudocode for the main functions. 5. Consider what changes will need to be made to your program if the scoring scheme changes (what if we adopted the advanced scoring scheme?). You'll notice that the "advance" scoring scheme uses mostly the same techniques as the "simple" one. How would you incorporate the new scoring scheme into your program while making minimal changes to your code. In fact, you should think about how to make your implementation general enough to handle any variation in the scoring scheme. ============================================================================= Part II: Implementation and Testing You MUST implement BOTH the "simple" and "advanced" scoring schemes as described in the handout, and do it in a modular manner (such as using inheritance). I will NOT accept two different versions of the program, one for each scoring scheme. I will also be looking at your code, and it should follow the outline you've agreed upon, unless there was a significant problem no one could have forseen. To test your program, you need to: 1. Print out the alignment and alignment score, such as: CATTAATTACACTCTCGCACTCAC-CACCAAACATCCTA-AACCCAGACAGGCCTCGACTCC | ||| | | ||||| || | || | | || || | ||| | ||| | -ACTAAACA-AGACTCGC-CTGTCTAACTAGGGAGTTTATAATGAACCGTGGCGTAGAC-CA Alignment score is 27 (sample was produced using the advanced scoring scheme) 2. Use small, hand crafted examples, such as (but not limited to): AATCG ATCAG (checks if it figures out where to put the gaps) ********* For at least one example, you need to show how your program generated output that corresponds to what you have worked out on paper. You need to be able to print the matrix generated by your program, and show that it corresponds to what you have done by hand. ********* 3. Use large, randomly generated sequences. Test your program several times before declaring that "it works". Be sure to use examples where the lengths of the sequences are not the same. Here's a function that will generate a random sequence of length n as a string public static String randseq(int n) { char[] S = new char[n]; // faster than building string char by char String DNA = "ACGT"; for(int i=0;i (int)'9' then the first argument is not numeric. Alternatively, you can use the following trick: int n = 20; String sequence=""; try { n = Integer.parseInt(argv[0]); // attempt convert 1st arg to integer; sequence = randseq(n); } catch (NullPointerException npe) { sequence = randseq(n); } catch (NumberFormatException nfe) { sequence = argv[0]; } When a try {...} block of code throws an exception, it can be caught by one of the catch clauses, which determines what to do if the exception/error occured. try-catch should in general be avoided This code will attempt to parse the first command line argument passed to main (argv[0]) as an integer. If an error resulted because argv was null, then a random sequence of length 20 (default) will be generated. If an error resulted because argv[0] does not represent an integer, then the argv[0] string itself will be used for the sequence string. Otherwise, a sequence of length represented by argv[0] will be generated. ------ Finally, as a WARNING, in the simple scoring scheme it is ok to intiallize the entire matrix to 0s because there is no gap penalty. But with more advanced scoring schemes, the first row and column of the matrix need to reflect initial gaps in either sequence. This aspect of the algorithm needs to be programmed correctly.