CSC 15 DNA Lab

The goal of this lab is to write some procedures for string matching. We
will apply these procedures to random DNA sequences. Procedures for file IO
will be given to you.

######### Prelims:

Some operations on strings that may be useful:

# Given some string s = "abcdefg",
# the expression s[a:b] will return the substring of s from
# s[a] to s[b-1], so s[2:5] will return "cde". The length of the substring
# is always b-a (when 0 <= a <= b <= len(s)). The expression s[:3] is the
# same as s[0:3] and s[3:] is the same as s[3:len(s)]. For example:

s = s[0:2] + 'x' + s[3:]   # will change 'c' to 'x', s becomes "abxdefg"

# However, for large strings, it is more efficient to change individual
# characters by first converting the string into an array of characters.

sl = list(s)   # starting again from s = "abcdefg", this will set sl to
               # ['a','b','c','d','e','f','g']

# Since arrays (also called lists in Python) are mutable, it's more efficient
# to make changes to an array than to reconstruct the entire string:

sl[1] = 'z'       # changes second char to z
t = "".join(sl)   # this operation will convert an array of chars to a string
                  # t will become "azcdefg"

######### PART I:

0. Write a procedure that generates a random DNA sequence of length n:

def randomDNA(n):   # should return a STRING such as "ATTGACGAC"

Think about the algorithm first: you'll start with an empty string "", then
run a loop n times. Inside the loop, you need to generate a random number
with randint(0,3) and, depending on its value, add either an "A", "C", "G"
or "T" to the end of your string. For example, to add the character "C" to
the end of string S, use the statement S = S+"C".

Alternatively, you can pre-allocate memory for a large array, then fill the
array with random DNA characters. At the end you will convert the array of
chars to a string using "".join(arrayname).

The function should return the string generated.

To use the randint function, put *from random import randint* at the top of
your program.
Then randint(a,b) generates a random integer between a and b, inclusive.

####### 1.
Write a function count(X,S) that returns the number of times that the
character X appears in string S. For example, count("A","GATCAATC") should
return 3. Test it on your random DNA sequences.

###### 2.
A DNA sequence mutates when some of its symbols change randomly. Write a
procedure that randomly mutates up to n of the symbols in the DNA sequence.
For example, mutate("ATTGCGA",2) may return "ATAGCGC".

def mutate(dna,n):

Algorithm: first convert the DNA string into an array of chars so that it
can be easily modified. Then run a loop n times. Inside the loop, generate a
random array index and replace the array content at that index with a random
letter. Calling your randomDNA function with an argument of 1 returns such a
random letter: "A", "C", "G" or "T". After the loop is finished, convert the
array back to a string and return the string.

Note that it's possible that a mutation could end up with the same letter as
before. That's why the function is specified to mutate "up to" n of the
symbols.

CHALLENGE: Write another version of this function that mutates EXACTLY n
symbols.

###### 3.
Write a function "splice": (READ THE SPECS CAREFULLY!)

def splice(A,B,j):

Here A is a DNA string, B is a DNA string and j is a number. The function
needs to replace a segment of B with the contents of A. j indicates the
starting index in B where this replacement will take place. For example,
splice("AAA","ATGTTGC",2) should return the string "ATAAAGC": the contents
of string B starting at index 2 (the third symbol) are replaced by string A.

Please note that you're not "inserting" A into B, but replacing a substring
of B with something else. The length of B should remain the same after the
operation.

Your function also needs to detect error conditions: it's possible that the
length of A is larger than the length of B.
It's also possible that j+len(A) is larger than the length of B. In either
case, your function should return B unchanged. This time, you'll have to
come up with the algorithm yourself.

###### 4.
Using your functions above, generate several DNA sequences of varying
lengths. Mutate and splice the smaller sequences into the larger ones. Write
the DNA sequences to file. Please do not create DNA sequences with lengths
longer than 2**20 (1 megabyte).

You can use the following function to write a string to a file:

# write a string S to a file:
def writefile(S, filename):
    fd = open(filename,"w")
    fd.write(S)
    fd.close()
# write to file (function does not return value, just writes to file)

...and the following function to read and return a string from a file:

def readfile(filename):   # return a string containing the file contents
    fd = open(filename,"r")
    S = fd.read()
    fd.close()
    return S
# read from file and returns string read (could be very big string)

Given string s,
    writefile(s,"file.txt")        will write the string to the named file
    t = readfile("somefile.txt")   will read the string in the named file
                                   and return it.

############################################################################
# Part II:

Consult notes on writing the substring function as an Exists-Forall nested
loop, and write a final, refined version

    bestmatchAt(A,B,mxe)

which finds the starting position in B where A is found as a substring with
the least number of errors. The maximum number of errors allowed is mxe.
This function should return a pair of numbers, (ai,ae), where ai is the
position in B where the match was found, and ae is the number of errors in
the match. If no match with at most mxe errors is found, the function should
return (-1,mxe+1). Note that bestmatchAt(A,B,len(A)) should find the best
possible match, with no effective ceiling on errors.

For example, bestmatchAt("ATCG","AATGCACTATAGA",2) should return (8,1),
which represents the substring "ATAG".
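
The Exists-Forall description above can be sketched as two nested loops: the outer loop asks whether there exists a start position in B, the inner loop checks all characters of A against that window. This is one possible shape, not the version from the course notes. It assumes that an "error" means a position where the two characters differ, and that ties go to the earliest position; the handout does not state either point explicitly.

```python
def bestmatchAt(A, B, mxe):
    """Return (position, errors) for the best match of A inside B.

    Assumption: an error is a position where A and the window of B differ.
    Assumption: on a tie, the earliest position wins.
    """
    best_pos, best_err = -1, mxe + 1        # the "no match" result from the spec
    for i in range(len(B) - len(A) + 1):    # Exists: try every start position
        errors = 0
        for k in range(len(A)):             # Forall: compare every character
            if A[k] != B[i + k]:
                errors += 1
                if errors >= best_err:      # cannot beat the best found so far
                    break
        if errors < best_err:
            best_pos, best_err = i, errors
    return (best_pos, best_err)


print(bestmatchAt("ATCG", "AATGCACTATAGA", 2))   # (8, 1), as in the example above
```

Stopping the inner loop as soon as the error count reaches the best count found so far is an optional refinement; it does not change the result but avoids useless comparisons on long sequences.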
########### PART III #####################################################

Hofstra student P. Artyan Imal came from a family of distinguished Hofstra
alumni. However, he has always felt that he was adopted. His personality is
completely the opposite of his parents'. His father B. Arnan Imal and mother
Farma N. Imal are both quiet and reserved. Farma loves to tend her vegetable
garden while "Arnan" prefers to work indoors. Both were good students at
Hofstra who never got into trouble. In contrast, Artyan is outgoing and fun
loving. He doesn't care if he's indoors or out as long as he's having wild,
crazy fun with his friends. He just can't believe that he's the offspring of
such boring parents, and he has managed to obtain samples of their DNA
(don't ask how). Using these along with a sample of his own DNA, you are to
help finally unravel the Imal family's genetic riddle.

Find attached files

    artyan.dna
    farma.dna   (mother)
    arnan.dna   (father)

Artyan is the offspring of the Imals if each parent's DNA sequence is
embedded inside Artyan's with fewer than 30 errors.

Take the "code" quiz to enter your answers (you can take the quiz three
times if you don't get it right the first time).
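
The offspring criterion above can be written down as a small check. The sketch below is only an illustration on made-up strings: the names fewest_errors and is_offspring are hypothetical, an "error" is again assumed to be a mismatched position, and the compact matcher is a stand-in for your own bestmatchAt.

```python
def fewest_errors(A, B):
    """Smallest number of mismatches between A and any len(A)-long window of B.

    Hypothetical helper; returns None when A is longer than B.
    """
    if len(A) > len(B):
        return None
    return min(
        sum(1 for x, y in zip(A, B[i:i + len(A)]) if x != y)
        for i in range(len(B) - len(A) + 1)
    )


def is_offspring(child, mother, father, limit=30):
    """True if both parents are embedded in child with fewer than `limit` errors."""
    for parent in (mother, father):
        errors = fewest_errors(parent, child)
        if errors is None or errors >= limit:
            return False
    return True


# Toy data only; the real sequences come from the attached files.
child = "TTGATTACAGG"
print(is_offspring(child, "GATTACA", "ACAGC", limit=2))   # True
print(is_offspring(child, "GATTACA", "ACAGC", limit=1))   # False
```

With the real data, the three strings would come from readfile("artyan.dna"), readfile("farma.dna") and readfile("arnan.dna"). If the files end with a newline character, it may need to be stripped first; that is an assumption about the file format worth checking. This compact matcher has no early exit, so it may be slow on long sequences.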