CSC 17 Assignment: DNA Sequence Alignment with Dynamic Programming

	      Part I: Planning the Implementation


We are going to implement the algorithm for global sequence alignment
in two stages.  First you are going to form groups and formulate a
detailed outline of how to organize your implementation.  Each person
in the group should have a copy of this outline.  We will then hold a
"meeting" to discuss the plan you've come up with, then write the
programs.  You will write the programs individually.

0. Be sure to read the tutorials I gave you (also see web links).  You will
need to implement two versions of the algorithm, using the following scoring
schemes:

Simple Scoring Scheme:
   match = 1
   mismatch penalty = 0
   gap penalty = 0

Advanced Scoring Scheme:
   match = 2
   mismatch penalty = -1
   gap penalty = -2

You will need to use what you learned about OOP (inheritance,
overriding) to implement new scoring schemes using as little code as
possible.  You need to think about how to set up your code for the
simple algorithm so that it can be easily extended to the advanced
one.

This time you will have to write the program almost from scratch, I
will not give you skeleton code to begin with.  You will need to
organize your implementation around a central class.  You may define
other classes as you see fit.

1. Determine the data structures you will need.  In particular, answer
   the following:

     a. how to represent the two DNA sequences.  That is, are you going to
        use array of chars or Strings, etc...  (recommend strings)

     b. How to represent the matrix.  This is a no-brainer.  You'll obviously
        need a 2-D array.  But also think about where to declare and
        create the array.  

     c. How to represent the final alignment.  Your program MUST produce
        output that looks like:

             GGCTTATG-CTCGAGCCCGGG
              ||   || |||  |  | |
             -GCACTTGCCTCTCG-GC-GC

        Note that this sequence is longer than either original sequence, since
	there could be gaps in either final sequence.  How would you represent
	this information?  What data structure would you use to represent the
	final alignment (Strings, linked lists, etc..). (recommend strings)

2. How are you going to represent the scoring function and the gap penalties
   (also see part 5 below).  You may assume (as in the tutorials) that
   the gap penalty (w) is a constant.
 
3. As the tutorial indicates, the algorithm has three basic stages:
   Initialization, Matrix fill, and Traceback.  We should think of
   implementing each stage as a seperate function.  Sketch out what
   Each function should do.  Be clear as to what parameters and
   return value the methods will have (if any).  Outline the algorithm
   each function will execute (write it out in pseudo code).  Think
   especially about how your algorithm is going to use the data structures
   you've decided on above.  For traceback, you may want to use a 
   separate data structure to keep track of the *where* each value in the
   matrix came from.

   As you trace back, you should generate structures (strings?) to represent
   the final alignment (like the one above, including the verticle lines that
   indicate matching characters).

   ***** I will grade part I qualitatively.  That is, it doesn't
   ***** have to be completely correct, but it needs to be carefully done
   ***** and well-thought out.  Please spend some time on this.  Sloppy
   ***** last-minute put-togethers are spotted easily.

4. Write out (as actual code if possible) the outline of your class: what
   methods and variables it will contain, as well as pseudocode for the main
   functions.

5. Consider what changes will need to be made to your program if the scoring
   scheme changes (what if we adopted the advanced scoring scheme?).  You'll
   notice that the "advance" scoring scheme uses mostly the same techniques
   as the "simple" one.  How would you incorporate the new scoring scheme
   into your program while making minimal changes to your code.  In fact,
   you should think about how to make your implementation general enough 
   to handle any variation in the scoring scheme.

=============================================================================


  	        Part II: Implementation and Testing


You MUST implement BOTH the "simple" and "advanced" scoring schemes as
described in the handout, and do it in a modular manner (such as using
inheritance).  I will NOT accept two different versions of the
program, one for each scoring scheme.  I will also be looking at your
code, and it should follow the outline you've agreed upon, unless there
was a significant problem no one could have forseen.


To test your program, you need to:

1. Print out the alignment and alignment score, such as:

CATTAATTACACTCTCGCACTCAC-CACCAAACATCCTA-AACCCAGACAGGCCTCGACTCC
 | |||  | |  ||||| ||  |  || |   |   || ||   |    ||| | ||| |
-ACTAAACA-AGACTCGC-CTGTCTAACTAGGGAGTTTATAATGAACCGTGGCGTAGAC-CA

Alignment score is 27

(sample was produced using the advanced scoring scheme)


2. Use small, hand crafted examples, such as (but not limited to):

   AATCG
   ATCAG    (checks if it figures out where to put the gaps)

********* For at least one example, you need to show how your program
generated output that corresponds to what you have worked out on paper.
You need to be able to print the matrix generated by your program, and show
that it corresponds to what you have done by hand.
*********

3. Use large, randomly generated sequences.  Test your program several times
before declaring that "it works". Be sure to use examples where the lengths
of the sequences are not the same.

Here's a function that will generate a random sequence of length n as a string

    public static String randseq(int n)
    {
	char[] S = new char[n];  // faster than building string char by char
	String DNA = "ACGT";
	for(int i=0;i<n;i++)
	    {
		int r = (int)(Math.random()*4);
		S[i] = DNA.charAt(r);
	    }
	return new String(S); // constructor converts char[] to String
    } // randseq 

However, the above function assumes you're going to represent the
sequence as a String.  But you need to remember that the matrix actually
contains one more row and one more column, and you must adjust your
string and array indices accordingly.  One way to do this is to add a
dummy char in front of the string: s += "."+s;

--------

String manipulations.  You will of course want to print the sequence as
a string.  Alternatively, you may choose to use a string to begin with
instead of a char[].  Consult the Java API documentation for the String
class, where you'll find many useful functions built-in.  For example,
given char[] A: String s= new String(A); will transform the array of
chars to a string, and s.toCharArray() will convert it back.   Strings 
have the advantage in that they can be easily modified.  For example,
s = 'A'+s; will add a new character 'A' to the front of the string.
The length of String s is s.length().  s.chatAt(0) is the first char of the
string, etc...  Strings are non-destructable (unlike arrays).


-------

Use command-line arguments (the String[] argv passed to main).  Here's a
transcript of my own program.  The first run uses the two arguments, 
argv[0] and argv[1] as the dna sequences.  The second run uses the
advanced scoring scheme on the same sequences.  The third run recognized
that argv[0], argv[1] are numbers and generated random sequences of the
specified lengths.  Your program should be just as flexible and ouput 
the same amount of information (i.e., be almost identical).

$ java dna ATGC ATGAC
ATG_C
||| |
ATGAC
Alignment score = 4, and

$ java dna ATGC ATGAC adv
ATG_C
||| |
ATGAC
Alignment score = 6

$ java dna 30 32
CGGCTCTCA___CAGTA_ATACGTACC_CG_GCCTG__
 ||| | ||    || | || |   ||  | ||| |  
_GGCGC_CATTTGAGCACAT_C___CCGAGCGCCAGAT
Alignment score = 19


To tell the difference between a numeric or non-numeric string, simply
check the ascii value of the first character of the first argument.
That is, if args[0].charAt(0) > (int)'9' then the first argument is
not numeric.  Alternatively, you can use the following trick:

int n = 20;
String sequence="";
try
{
  n = Integer.parseInt(argv[0]); // attempt convert 1st arg to integer;
  sequence = randseq(n);
}
catch (NullPointerException npe)
{
  sequence = randseq(n);
}
catch (NumberFormatException nfe)
{
  sequence = argv[0];
}

When a try {...} block of code throws an exception, it can be caught by one
of the catch clauses, which determines what to do if the exception/error
occured.  try-catch should in general be avoided 

This code will attempt to parse the first command line argument passed
to main (argv[0]) as an integer.  If an error resulted because argv was
null, then a random sequence of length 20 (default) will be generated.
If an error resulted because argv[0] does not represent an integer, then
the argv[0] string itself will be used for the sequence string.  Otherwise,
a sequence of length represented by argv[0] will be generated.
------

Finally, as a WARNING, in the simple scoring scheme it is ok to
intiallize the entire matrix to 0s because there is no gap penalty.
But with more advanced scoring schemes, the first row and column of
the matrix need to reflect initial gaps in either sequence.  This
aspect of the algorithm needs to be programmed correctly.