"""
               CSC 15 DNA LAB PART II

Since there are only 4 DNA symbols, A, C, G and T, it's a waste to use
a character to represent each symbol.  An ASCII char takes one byte,
or 8 bits of memory, but one of 4 different symbols should only take 2
bits.  Large DNA sequences can have billions of symbols, so we
definitely want to reduce the memory requirements for storing them.
We can 'pack' four symbols into a single byte.  Assign binary values
to the symbols as follows:

   A : 00
   C : 01
   G : 10
   T : 11

In other words, each symbol is represented by a two-bit number, which can be
0, 1, 2 or 3, representing A, C, G or T respectively.

We can define this mapping using a Python dictionary (also known as an
"associative array" or "hash map"):
"""

DNAP = {'A':0, 'C':1, 'G':2, 'T':3}
# DNAP['C'] returns 1, DNAP['T'] returns 3, for example.
DNAN = "ACGT"  # use this string as the inverse map to DNAP
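
# A quick check (a sketch, not part of the lab's required code) that the
# string "ACGT" really is the inverse of the symbol-to-number map: indexing
# the string with a symbol's code gives the symbol back.  The helper name
# maps_are_inverse is hypothetical, and the map values are repeated here
# only so that the check can run on its own.
def maps_are_inverse(code_of, symbol_of):
   # True if symbol_of[code_of[s]] == s for every symbol s in the map
   return all(symbol_of[code_of[s]] == s for s in code_of)

print(maps_are_inverse({'A':0, 'C':1, 'G':2, 'T':3}, "ACGT"))  # True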

#The DNAP map is equivalent to the following function:
def DNACODE(x):
   if x=='A': return 0
   elif x=='C': return 1
   elif x=='G': return 2
   elif x=='T': return 3
   else: raise ValueError("not a DNA symbol: " + repr(x))  # DNAP raises KeyError here
#DNACODE

"""
To encode a DNA sequence into a sequence of 8-bit integers, we first have to
pad it with extra symbols so that its length is a multiple of 4.
Then we can encode the sequence 4 symbols at a time:
"""

def encodeDNA(seq):  # encode DNA seq (string), returns array of numbers
   N = []  # array of numbers to be returned
   while (len(seq)%4>0):
      seq = seq+'A'  # add dummy chars to the end.
   # the padding looks just like real 'A' symbols, so we need to keep track
   # of the true length of the sequence elsewhere
   i = 0
   while i<len(seq):
      x = DNACODE(seq[i])   # add first symbol
      x = x*4+DNACODE(seq[i+1]) # *4 shifts x left 2 bits, then add second sym
      x = x*4+DNACODE(seq[i+2])
      x = x*4+DNACODE(seq[i+3])
      N.append(x)
      i += 4  # jump to next 4 chars
   #while i
   return N
#encodeDNA
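
# Why the true length matters (a sketch; pad_with_length is a hypothetical
# helper, not part of the lab): after padding, "ACG" and "ACGA" become the
# same string, so they would encode to the same numbers.  Returning the
# original length alongside the padded string is one way to keep track of it.
def pad_with_length(seq, group=4, filler='A'):
   true_length = len(seq)
   shortfall = (-true_length) % group   # symbols missing from the last group
   return seq + filler*shortfall, true_length

print(pad_with_length("ACG"))    # ('ACGA', 3)
print(pad_with_length("ACGA"))   # ('ACGA', 4)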

sampledna = "GGTAAATTCCAGTGACTGAACCTGACCG"
x = encodeDNA(sampledna)
print(x)
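
# The arithmetic for one group of four symbols, spelled out (a sketch; the
# function name pack_ctag is hypothetical).  Multiplying by 4 is the same as
# shifting left by 2 bits, and because every code fits in 2 bits, adding the
# next code is the same as OR-ing it in.  The group used here is "CTAG".
def pack_ctag():
   c, t, a, g = 1, 3, 0, 2   # codes of C, T, A and G, as in DNAP
   by_multiplying = ((c*4 + t)*4 + a)*4 + g
   by_shifting = (((((c << 2) | t) << 2) | a) << 2) | g
   return by_multiplying, by_shifting

print(pack_ctag(), format(pack_ctag()[0], '08b'))   # (114, 114) 01110010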


"""
1.  Rewrite encodeDNA so we can encode DNA sequences 16 bits at a time, i.e.,
    represent each sequence of 8 DNA symbols as a single 16-bit integer.
    This time, you should write a nested loop instead of using i+1, i+2, etc.

2.  Write the inverse of the above function: given an array of numbers of
    at most 16 bits each, return the original DNA sequence as a string.
"""


"""
Python integers, however, do not have a fixed size.  In other languages
such as C, integers have fixed lengths: 8, 16, 32, or 64 bits.  A Python
int can be arbitrarily large.  That makes it awkward to store a list of
numbers in a file: if we simply write them one after another, we won't
know where one number ends and the next one begins.  Representing each
number as a decimal string is not efficient either: "120" takes 3 bytes
(3 chars), but 120 should take only one byte (8 bits).  Python does
support binary IO (the bytes type and files opened in "rb" or "wb"
mode), but in this lab we stay with strings: instead of storing integers
as decimal strings, we convert each 8-bit value to a single char, or
each 16-bit value to a single Unicode char.  The chr function does
this, and the ord function converts the char back to a number.

For example, the DNA sequence "CTAG" is represented in binary as
01110010, which is the number 2+16+32+64 = 114.  We can represent this
sequence as a single ASCII char, chr(114), which is 'r'.  Please think
about this carefully: why do we need to represent "CTAG" as 'r'?
Because now we can store the sequence using 1 byte instead of 4 bytes,
and we can still use string IO instead of binary IO, which this lab
does not cover.  (How many bytes a char occupies in a file depends on
the text encoding: in UTF-8, only chars below 128 take a single byte.)

Many Python tutorials and references will tell you to write an array
of integers to a file using JSON, but JSON is a text format: it stores
each number as decimal digits plus separators, which is not compact
enough for very large sequences.  Instead we should just convert the
array of integers into a Unicode string:

"""
# encode DNA sequence string as a more compact string for file IO.
def DNAtoUnicode(D): # D is a DNA string consisting of A, C, G, T
   NS = encodeDNA(D) # first convert to array of ints
   return "".join(chr(n) for n in NS)  # one char per encoded number
# DNAtoUnicode
u = DNAtoUnicode(sampledna)
print(u)  # don't expect this to be readable
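
# chr and ord are inverses, which is what makes the string form reversible.
# The lines below are a sketch; the byte counts assume the string is
# eventually written with the UTF-8 encoding, which may not be the default
# on every system.
print(chr(114), ord('r'))                 # r 114
print(len(chr(114).encode('utf-8')))      # 1: codes below 128 take one byte
print(len(chr(200).encode('utf-8')))      # 2: codes from 128 to 2047 take two

# A caveat for 16-bit codes: values 0xD800 to 0xDFFF are "surrogates", which
# the UTF-8 codec refuses to encode.
try:
   chr(0xD800).encode('utf-8')
except UnicodeEncodeError:
   print("chr(0xD800) cannot be written as UTF-8")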


"""
3. Write the inverse to the above function, which converts a Unicode/ASCII
string to a string of DNA symbols (containing A, C, G and T).  You
should call the function you wrote for question 2 in the new function.


4.
The work of a former, famous geneticist, Dr. Farma N. Imal, has been
found.  Years ago she had discovered a gene that causes some computer
science students to doze off during class.  Because they also have a
tendency to drool while sleeping, our intrepid scientist was able to
collect numerous DNA samples from the drool for her research.  But
eventually it got to her.  She quit genetic research and now works as
a software developer.  She doesn't want to have anything to do with her
past work: just the thought of it disgusts her.

But you have to continue her work so we can eventually find a cure,
for the sake of better students and cleaner classrooms.
Unfortunately, she left the gene encoded as Unicode characters, with
each such 16-bit character representing 8 DNA symbols. The code for
the "dozer gene" is found in a file called "dozer.udna". The length of
the gene just happens to be a multiple of 8 (because computer science
students "have binary in their genes").

Decode the dozer gene and print it out as a string consisting of the symbols
A, C, G and T.  You can use the following functions to read a string from
a file, as well as write one to a file:

"""

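def readseq(filename):
   # newline="" keeps '\r' and '\n' chars as data instead of translating them
   with open(filename,"r",newline="") as fd:
      return fd.read()
#readseq

def writeseq(filename,seq):
   # newline="" writes '\n' chars unchanged on every platform
   with open(filename,"w",newline="") as fd:
      fd.write(seq)
#writeseq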
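
# A caution about text-mode files (a sketch; roundtrip is a hypothetical
# helper and the temporary file is only for illustration).  By default,
# Python translates line endings when reading text, so an encoded char that
# happens to be chr(13), '\r', comes back as chr(10), '\n'.  Passing
# newline="" to open turns the translation off.
import os
import tempfile

def roundtrip(text, **open_options):
   handle, path = tempfile.mkstemp()   # create an empty temporary file
   os.close(handle)
   try:
      with open(path, "w", **open_options) as fd:
         fd.write(text)
      with open(path, "r", **open_options) as fd:
         return fd.read()
   finally:
      os.remove(path)                  # always clean up the temporary file

print(ord(roundtrip(chr(13))))               # 10: the data changed
print(ord(roundtrip(chr(13), newline="")))   # 13: the data survived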