"""
CSC 15 DNA LAB PART II

Since there are only 4 DNA symbols, A, C, G and T, it's a waste to use
a character to represent each symbol. An ASCII char takes one byte, or
8 bits of memory, but a choice among 4 different symbols needs only 2 bits.
Large DNA sequences can have billions of symbols, so we definitely want to
reduce the memory requirements for storing them. We can 'pack' four symbols
into a single byte. Assign binary values to the symbols as follows:

    A : 00
    C : 01
    G : 10
    T : 11

In other words, each symbol is represented by a two-bit number, which can
be 0, 1, 2 or 3, representing A, C, G or T respectively. We can define this
mapping using the following python "associative array" or "hash map":
"""

DNAP = {'A':0, 'C':1, 'G':2, 'T':3}
# DNAP['C'] returns 1, DNAP['T'] returns 3, for example.

DNAN = "ACGT"  # use this string as the inverse map to DNAP: DNAN[1] returns 'C'

# The DNAP map is equivalent to the following function:
def DNACODE(x):
    if x=='A': return 0
    elif x=='C': return 1
    elif x=='G': return 2
    elif x=='T': return 3
    else: return "error"
#DNACODE

"""
To encode a DNA sequence into a sequence of 8-bit integers, we first have
to pad it with extra symbols so that its length is a multiple of 4.
Then we can encode the sequence 4 symbols at a time:
"""

def encodeDNA(seq): # encode DNA seq (string), returns array of numbers
    N = [] # array of numbers to be returned
    while (len(seq)%4>0): seq = seq+'A' # add dummy chars to the end.
    # we need to keep track of the true length of the sequence elsewhere
    i = 0
    while i