CSC 15 DNA Lab 

The goal of this lab is to write some procedures for string
matching.  We will apply these procedures to random DNA sequences.
Procedures for file I/O will be given to you.

######### Prelims: Some operations on strings that may be useful:

# Given some string
s = "abcdefg"
# the expression s[a:b] will return the substring of s from
# s[a] to s[b-1], so s[2:5] will return "cde".  The length of the substring
# is b-a (when 0 <= a <= b <= len(s)).  The expression s[:3] is the same as
# s[0:3] and s[3:] is the same as s[3:len(s)].  For example:
s2 = s[0:2] + 'x' + s[3:]  # s2 is "abxdefg" ('c' replaced by 'x'); s is unchanged

# However, for large strings, it is more efficient to change individual
# characters by first converting the string into an array of characters.

sl = list(s)  # will set sl to ['a','b','c','d','e','f','g']
# Since arrays (called lists in Python) are mutable, it's more efficient
# to make changes to an array than to reconstruct the entire string:
sl[1] = 'z'  # changes second char to z

t = "".join(sl)  # this operation will convert an array of chars to a string
# t will become "azcdefg"
#########

                            PART I:

0. Write a procedure that generates a random DNA sequence of length n:

def randomDNA(n):  # should return a STRING such as "ATTGACGAC"

   Think about the algorithm first: you'll start with an empty string "",
   then run a loop n times.  Inside the loop, you need to generate a 
   randint(0,3) and depending on the value of this random number, you
   will add either an "A", "C", "G" or "T" to the end of your string.
   For example, to add the character "C" to the end of string S, use the
   statement S = S+"C".

   Alternatively, you can pre-allocate memory for a large array, then
   fill the array with random DNA characters.  At the end you will
   convert the array of chars to a string using "".join(arrayname).

   The function should return the string generated.

   To use the randint function, put  *from random import randint*  at the
   top of your program.  Then randint(a,b) generates a random integer
   between a and b, inclusive (both a and b are possible results).
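A minimal sketch of the second (array-based) approach described above, offered
as one possible starting point rather than the required solution.  Indexing
into the string "ACGT" with the random number is a shortcut chosen here in
place of an if/elif chain; the names letters and chars are illustrative only.

```python
from random import randint

def randomDNA(n):
    letters = "ACGT"            # assumption: index 0..3 picks the letter
    chars = [""] * n            # pre-allocate an array of n slots
    for i in range(n):
        chars[i] = letters[randint(0, 3)]   # randint(0,3) includes 0 and 3
    return "".join(chars)       # convert the array of chars to a string

print(randomDNA(12))            # output varies from run to run
```

Building the string with S = S+"C" inside the loop also works; the array
version simply avoids creating a new string on every iteration.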


#######
1. Write a function count(X,S) that returns the number of times that the
character X appears in string S.  For example, count("A","GATCAATC") should
return 3.  Test it on your random DNA sequences.
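One way this could look, as a sketch that walks the string one character at a
time (the variable name total is illustrative):

```python
def count(X, S):
    total = 0
    for ch in S:          # visit every character of S
        if ch == X:
            total += 1    # one more occurrence of X
    return total

print(count("A", "GATCAATC"))   # the example from the problem statement
```

Python strings also have a built-in method, S.count(X), which can be used to
double-check the result of a hand-written loop.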


######
2. A DNA sequence mutates when some of its symbols change randomly.
   Write a procedure that randomly mutates up to n of the symbols in the
   DNA sequence.  For example, mutate("ATTGCGA",2) may return
   "ATAGCGC".

        Algorithm: first convert the DNA string into an array of chars
        so that it can be easily modified.  Then run a loop n times.
        Inside the loop, generate a random array index, and modify the
        array content at the index.  Calling your randomDNA function
        with an argument of 1 should return a random letter: "A", "C",
        "G" or "T".  After the loop is finished, convert the array
        back to a string and return the string.

        Note that it's possible that a mutation could end up with the
        same letter as before.  That's why the function is specified
        to mutate "up to" n of the symbols.

        If you convert the string to an array first, remember to convert
        it back to a string before returning it.

    CHALLENGE:  Write another version of this function that mutates EXACTLY
    n symbols.


def mutate(dna,n):
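
A sketch of the "up to n" algorithm described above, under a few assumptions:
to keep the example self-contained, a random letter is picked directly with
"ACGT"[randint(0,3)] where the text suggests calling randomDNA(1), and an
empty input is returned unchanged because it has no valid index.  This does
not address the CHALLENGE version.

```python
from random import randint

def mutate(dna, n):
    chars = list(dna)                 # strings are immutable, lists are not
    if len(chars) == 0:
        return dna                    # assumption: nothing to mutate
    for _ in range(n):
        i = randint(0, len(chars) - 1)        # random array index
        chars[i] = "ACGT"[randint(0, 3)]      # stand-in for randomDNA(1)
    return "".join(chars)             # convert back to a string

print(mutate("ATTGCGA", 2))           # output varies from run to run
```

Because the same index can be chosen twice, and the new letter can equal the
old one, the result may differ from the input in fewer than n positions.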

######
3. Write a function "splice":  (READ THE SPECS CAREFULLY!)

  def splice(A,B,j):

  Where A is a DNA string, B is a DNA string and j is a number.

  The function needs to replace a segment of B with the contents of A.
  j will indicate the starting index in B where this replacement will
  take place.  For example, splice("AAA","ATGTTGC",2) should return the
  string "ATAAAGC": the contents of string B starting at index 2 (the third
  symbol) are replaced by string A.  Please note that you're not "inserting" A into
  B, but replacing a substring in B with something else.  The length of B
  should remain the same after the operation.

  Your function also needs to detect error conditions: it's possible that
  the length of A is larger than the length of B.  It's also possible that
  j+len(A) is larger than the length of B.  In either case, your function
  should return B unchanged.
   
  This time, you'll have to come up with the algorithm yourself.
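
  Since the algorithm is left to you, the sketch below only checks a finished
  splice against the specs above; it does not implement splice.  The helper
  name check_splice and the test cases other than the first are illustrative
  assumptions derived from the stated rules.

```python
def check_splice(splice):
    """Run a candidate splice function on a few cases; return the failures."""
    cases = [
        ("AAA", "ATGTTGC", 2, "ATAAAGC"),   # example from the specs
        ("AAA", "ATGTTGC", 4, "ATGTAAA"),   # A fits exactly at the end of B
        ("AAA", "ATGTTGC", 5, "ATGTTGC"),   # j+len(A) > len(B): B unchanged
        ("ATGTTGCA", "ATG", 0, "ATG"),      # A longer than B: B unchanged
    ]
    failures = []
    for A, B, j, expected in cases:
        got = splice(A, B, j)
        if got != expected or len(got) != len(B):
            failures.append((A, B, j, expected, got))
    return failures
```

  Calling check_splice(splice) with your own function should return an empty
  list when every case behaves as specified.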


######
4. Using your functions above, generate several DNA sequences of varying
   lengths.  Mutate and splice the smaller sequences into the larger ones.
   Write the DNA sequences to files.  Please do not create DNA sequences
   longer than 2**20 characters (about 1 megabyte).

You can use the following function to write a string to a file:

# write a string S to a file:
def writefile(S, filename):
    fd = open(filename,"w")
    fd.write(S)
    fd.close()
# write to file (function does not return value, just writes to file)

...and the following function to read and return a string from a file:

def readfile(filename):  # return a string containing the file contents
    fd = open(filename,"r")
    S = fd.read()
    fd.close()
    return S
# read from file and returns string read (could be very big string)

Given a string s, writefile(s,"file.txt") will write s to the named file, and
t = readfile("somefile.txt") will read the named file and return its contents
as a string.
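
As a quick sanity check, the two functions can be exercised together: write a
string, read it back, and compare.  The sketch below repeats the two given
functions so it runs on its own; the helper name roundtrip and the file name
sample.dna are illustrative, and a temporary directory is used so no file is
left behind.

```python
import os
import tempfile

def writefile(S, filename):
    fd = open(filename, "w")
    fd.write(S)
    fd.close()

def readfile(filename):
    fd = open(filename, "r")
    S = fd.read()
    fd.close()
    return S

def roundtrip(S):
    # write S into a temporary directory, then read it back
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.dna")
        writefile(S, path)
        return readfile(path)

print(roundtrip("ATTGACGAC") == "ATTGACGAC")
```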


############################################################################

#   Part II:

Consult the notes on writing the substring function as an Exists-Forall nested
loop, and write a final, refined version bestmatchAt(A,B,mxe) which
finds the starting position in B where A is found as a substring with
the least number of errors.  The maximum number of errors allowed is mxe.
This function should return a pair of numbers, (ai,ae), where ai is the
position in B where the match was found, and ae is the number of errors in
the match.  If no match with at most mxe errors is found, the function
should return (-1,mxe+1).  Note that bestmatchAt(A,B,len(A)) should find
the best possible match, with no effective ceiling on errors.

For example, bestmatchAt("ATCG","AATGCACTATAGA",2) should return (8,1),
which represents the substring "ATAG".
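
The course notes are not reproduced here, so the following is only a sketch of
one way the nested loop could be arranged: the outer loop asks whether there
exists a starting position in B, and the inner loop compares all characters
of A at that position while counting mismatches.  Two choices in it are
assumptions, not part of the specs: when several positions tie, the earliest
one is kept, and the inner loop stops early once a position can no longer
beat the best found so far.

```python
def bestmatchAt(A, B, mxe):
    best_pos, best_err = -1, mxe + 1       # "no match" result by default
    for i in range(len(B) - len(A) + 1):   # every start where A fits in B
        errors = 0
        for k in range(len(A)):            # compare A against B[i:i+len(A)]
            if A[k] != B[i + k]:
                errors += 1
                if errors >= best_err:     # cannot improve on the best
                    break
        if errors < best_err:              # strictly better: remember it
            best_pos, best_err = i, errors
    return (best_pos, best_err)

print(bestmatchAt("ATCG", "AATGCACTATAGA", 2))   # the example from the text
```

If A is longer than B, the outer loop never runs and the function returns
(-1,mxe+1).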


###########  PART III #####################################################

Hofstra student P. Artyan Imal came from a family of distinguished
Hofstra alumni.  However, he has always felt that he was adopted.  His
personality is completely the opposite of his parents'.  His father
B. Arnan Imal and mother Farma N. Imal are both quiet and reserved.
Farma loves to tend her vegetable garden while "Arnan" prefers to work
indoors. Both were good students at Hofstra who never got into
trouble.  In contrast, Artyan is outgoing and fun loving.  He doesn't
care if he's indoors or out as long as he's having wild, crazy fun
with his friends.  He just can't believe that he's the offspring of
such boring parents, and he has managed to obtain samples of their DNA
(don't ask how).  Using these along with a sample of his own DNA, you are
to help finally unravel the Imal family's genetic riddle.

Find attached files
artyan.dna
farma.dna  (mother)
arnan.dna  (father)

Artyan is the offspring of the Imals if each of their DNA sequences is
embedded inside Artyan's with fewer than 30 errors.
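
One possible way to phrase this test in code, as a sketch under stated
assumptions: "fewer than 30 errors" is read as a maximum of 29 errors, and a
Part II style function is passed in as the parameter matcher so the example
does not depend on any particular implementation.  The names is_embedded,
is_offspring, matcher and limit are illustrative.

```python
def is_embedded(parent, child, matcher, limit=30):
    # matcher(A, B, mxe) is assumed to behave like bestmatchAt in Part II:
    # it returns (-1, mxe+1) when no match with at most mxe errors exists.
    pos, errors = matcher(parent, child, limit - 1)
    return pos != -1

def is_offspring(child, mother, father, matcher, limit=30):
    # both parents' sequences must be found inside the child's sequence
    return (is_embedded(mother, child, matcher, limit) and
            is_embedded(father, child, matcher, limit))
```

With the given file functions this might be called as
is_offspring(readfile("artyan.dna"), readfile("farma.dna"),
readfile("arnan.dna"), bestmatchAt).  If the files end with a newline
character, removing it first (for example with .strip()) may be necessary;
that is an assumption about the files, not something stated here.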

Take the "code" quiz to insert your answers (you can take the quiz three
times if you don't get it right the first time).