CSC 259 Hadoop-Map/Reduce Assignment 2


DNA sequences are represented using the symbols A, C, G and T.  As
part of a federally funded project, a group of geneticists collected
numerous DNA samples from lions, tigers, leopards, and other animals
of the feline family.  After years of painstaking analysis, they have
discovered, to their shock, a DNA strand that seems to be common to
all feline species, big or small.

The Feline DNA:

CATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCATCAT


Whether or not you believe in the scientific accuracy of this finding, you are
asked to write a program to help search for the presence of the feline DNA.
Since mutations may occur, we are also interested in matches that include
a few errors.  

Although one can efficiently represent DNA sequences using base-4
numbers, we will simply use Strings.  I am giving you the following basic
function to search for matches:

// find the best-matching substring of line for pattern
static int[] bestmatch(String pattern, String line)
{
    int errs = 0;                 // number of errors at the current position
    int plen = pattern.length();  // also the maximum possible number of errors
    int llen = line.length();
    int beste = plen+1;           // fewest errors found so far
    int besti = -1;               // position of best match so far
    int i, j;
    for(i=0;i<=llen-plen;i++)     // try the match starting at i
    {
        errs = 0;
        for(j=0;j<plen && errs<beste;j++)
            if (pattern.charAt(j) != line.charAt(i+j)) errs++;
        if (errs<beste) { beste=errs;  besti = i; }
    }//for i
    int[] answer = {besti,beste};
    return answer;
}//bestmatch

For example, a call to bestmatch("ATGC","ACGATCC") will return an array
{3,1}, indicating that the best match for the pattern "ATGC" occurs
at starting position 3 (counting from 0) and has one error.  Note that
there's also a match at position 0, "ACGA", but with two errors.  The function
returns the best match.  If there are multiple best matches, it returns the
position of the first one.  If the length of the pattern is greater than
the length of the 'line', it will return {-1,plen+1}, where plen is the
length of the pattern ({-1,5} for the pattern "ATGC"); this indicates an
error condition.  Otherwise, it is guaranteed to find a match with at most
plen errors.  You may of course assume that the length of the line fits
inside a 32-bit signed integer (at most 0x7fffffff).
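
The return values described above can be checked with a small driver.  This
is only a sketch: the class name BestMatchDemo is made up for illustration,
and bestmatch is the function given above with the same logic.

```java
import java.util.Arrays;

// Hypothetical driver class; only bestmatch comes from the assignment.
class BestMatchDemo {
    // Same logic as the bestmatch function given above.
    static int[] bestmatch(String pattern, String line) {
        int plen = pattern.length();
        int llen = line.length();
        int beste = plen + 1;   // sentinel: more errors than any real match
        int besti = -1;         // stays -1 if no starting position fits
        for (int i = 0; i <= llen - plen; i++) {
            int errs = 0;
            // stop early once this position cannot beat the best so far
            for (int j = 0; j < plen && errs < beste; j++)
                if (pattern.charAt(j) != line.charAt(i + j)) errs++;
            if (errs < beste) { beste = errs; besti = i; }
        }
        return new int[] {besti, beste};
    }

    public static void main(String[] args) {
        // best match at position 3 with one error
        System.out.println(Arrays.toString(bestmatch("ATGC", "ACGATCC"))); // [3, 1]
        // pattern longer than line: error condition {-1, plen+1}
        System.out.println(Arrays.toString(bestmatch("ATGC", "ATG")));     // [-1, 5]
        // every symbol wrong: still a "match", with plen errors
        System.out.println(Arrays.toString(bestmatch("CAT", "GGGG")));     // [0, 3]
    }
}
```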

---------------

First, download the file *gendna.class* from Blackboard.

Running this class file (java gendna) will create 10 files, dna0.txt
to dna9.txt, in the current directory (move gendna.class elsewhere
after running it).  Each file represents a DNA fragment of 3 million
symbols (the files are slightly larger because of the "\n" that
separates lines).  You should place all these files into one directory
so you can pass it to your hadoop application.  The data in each file
is separated by lines (by "\n").  Each line contains 20000 symbols.

The purpose of your program is to find the best possible match with
the feline sequence in each dna file, and report the position and
number of errors of the match.  This would be pretty easy EXCEPT that
the feline pattern could be spread across two successive lines.  The
default TextInputFormat passes a single line to each call of map.  It
does no good to create an input format that passes it two lines at a
time, because then you'll have the same problem between the second and
third lines.  You are also not permitted to feed each map the entire
file, or to change the file to get rid of the \n's, because that would
destroy the main potential for parallelism in this problem.
Additionally, you might be tempted to use the
Mapper.Context.nextKeyValue() method, but that doesn't work either
because it's destructive: other mappers won't get a duplicate (I
tried that!).

In general, communication between the different mappers is difficult in Hadoop.
Your processes are really only expected to communicate via input/output 
key,value pairs.  To use Hadoop for this purpose, you should consider using
a combiner stage.  Your map function needs to output enough information for
the combiner to detect matches spread across two lines (this is the most
important hint that I'm giving you).
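
To make the hint a little more concrete, here is a plain-Java sketch (no
Hadoop classes) of one possible reading of it.  For each line it keeps the
best match inside the line plus the last plen-1 symbols; joining that tail
with the first plen-1 symbols of the next line gives a short "seam" string
in which every match straddles the line boundary.  The names SeamSketch,
search and better are hypothetical, positions count symbols only (no "\n"),
and each line is assumed to hold at least plen-1 symbols.  How to carry this
information through map/combiner key,value pairs is left open.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not Hadoop code and not the reference solution.
class SeamSketch {
    // Same logic as the bestmatch function given earlier.
    static int[] bestmatch(String pattern, String line) {
        int plen = pattern.length(), llen = line.length();
        int beste = plen + 1, besti = -1;
        for (int i = 0; i <= llen - plen; i++) {
            int errs = 0;
            for (int j = 0; j < plen && errs < beste; j++)
                if (pattern.charAt(j) != line.charAt(i + j)) errs++;
            if (errs < beste) { beste = errs; besti = i; }
        }
        return new int[] {besti, beste};
    }

    // Prefer fewer errors, then the earlier position.
    static int[] better(int[] a, int[] b) {
        if (b[1] < a[1] || (b[1] == a[1] && b[0] < a[0])) return b;
        return a;
    }

    // Returns {position, errors}; position counts symbols only.
    static int[] search(String pattern, List<String> lines) {
        int plen = pattern.length();
        int[] best = {-1, plen + 1};
        int offset = 0;          // symbols that precede the current line
        String prevTail = null;  // last plen-1 symbols of the previous line
        for (String line : lines) {
            // matches lying entirely inside this line
            int[] m = bestmatch(pattern, line);
            if (m[0] >= 0) best = better(best, new int[] {offset + m[0], m[1]});
            // matches that straddle the boundary with the previous line
            if (prevTail != null) {
                String head = line.substring(0, Math.min(plen - 1, line.length()));
                int[] s = bestmatch(pattern, prevTail + head);
                if (s[0] >= 0)
                    best = better(best,
                        new int[] {offset - prevTail.length() + s[0], s[1]});
            }
            prevTail = line.substring(Math.max(0, line.length() - (plen - 1)));
            offset += line.length();
        }
        return best;
    }

    public static void main(String[] args) {
        // The five-line test file that appears later in this handout.
        List<String> lines = Arrays.asList(
            "ATCTGACATCATCATG", "ATCATGGGTGGGGCAT", "CATCATCATCATGGGC",
            "CATCATCATCATCGTG", "CATCATCATCATGTTG");
        System.out.println(Arrays.toString(search("CATCATCATCATCAT", lines))); // [29, 0]
    }
}
```

Because the seam is only 2*(plen-1) symbols long, it cannot contain a match
that lies entirely inside one line, so nothing is counted twice.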

For this assignment you may assume that each line is always 20000 symbols
long.  You can even fix the feline sequence as a constant, although I 
encourage you to use the distributed file cache.  You should already know
that a "file" is an operating system abstraction: it's not necessarily
something written to hard disk, it could be buffered in memory.  So using
the distributed cache is not necessarily as inefficient as you might think.
In any case, it's just part of the task startup cost.

I'm holding on to the source code of gendna so I can check your output.

However, I suggest you first use a smaller sequence and a small set of files
to make sure your program works.  For example, with the pattern sequence

CATCATCATCATCAT

and a file consisting of 

ATCTGACATCATCATG
ATCATGGGTGGGGCAT
CATCATCATCATGGGC
CATCATCATCATCGTG
CATCATCATCATGTTG

My own program produced the following final output, where <file1> is the
key and the rest is the string representation of the value.

<file1>	position 29, 0 errors]

If there were two copies of the same file in the directory, it would have
produced

<file1>	position 29, 0 errors]
<file2>	position 29, 0 errors]

Note that you don't need to worry about the pattern spread across
different files, just different lines in the same file.  The file
position takes into account the "\n" at the end of each line, which is
not read into the Text/String passed to map.  If you wish, you can
adjust the position value to discount these, so the position value can
be slightly different from what I expect.
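
If you want to convert between the two ways of counting, the arithmetic is
short.  The sketch below assumes every line holds exactly lineLen symbols
followed by a single "\n"; the class and method names are hypothetical.

```java
// Hypothetical helper; assumes fixed-length lines, each ended by one "\n".
class PositionConvert {
    // symbol index (newlines not counted) -> offset in the file
    static long toFileOffset(long symbolPos, int lineLen) {
        return symbolPos + symbolPos / lineLen;   // one "\n" per completed line
    }

    // offset of a symbol in the file -> symbol index (newlines not counted)
    static long toSymbolPos(long fileOffset, int lineLen) {
        return fileOffset - fileOffset / (lineLen + 1);
    }

    public static void main(String[] args) {
        System.out.println(toFileOffset(19999, 20000)); // 19999 (still on the first line)
        System.out.println(toFileOffset(40000, 20000)); // 40002 (two "\n" precede it)
        System.out.println(toSymbolPos(40002, 20000));  // 40000
    }
}
```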

Please submit both your source and (in comments) the output of the
program on the data produced by gendna.  You can use either separate files
or use static inner classes to fit everything into one file, like the example
in the Hadoop tutorial at hadoop.apache.org.

--------------
Additional hint:

Hadoop is still evolving and has some teething problems.  I encountered the
following nasty problem.  I had the following kind of loop in my reduce method:

reduce(... Iterable<ValType> Values ...)
{...

   ValType best = null;
   for(ValType v : Values)
      {
          // problem: best keeps a reference to v, not a copy
          if (best == null || v.betterthan(best)) { best = v; }
      }
...


This loop didn't work as expected because Hadoop reuses the same object
for every element of Values and overwrites its contents on each
iteration.  That is, if you point best to v inside the loop, don't
expect it to hold the same value after the loop.  Very annoying
indeed.  To work around it you should copy the object instead:
best = v.clone(), where .clone() is a method of your value type that
returns a new object with the same content as v.
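
The effect can be reproduced without Hadoop.  In the sketch below, the
Match class and the reusing() iterable are hypothetical stand-ins: reusing()
hands out one shared object and refills it for every element, which imitates
the behavior described above.  The clone() method is written by hand.

```java
import java.util.Iterator;

// Hypothetical simulation; none of these classes belong to Hadoop.
class ReuseDemo {
    static class Match implements Cloneable {
        int position, errors;
        boolean betterthan(Match o) { return errors < o.errors; }
        @Override public Match clone() {   // hand-written field-by-field copy
            Match c = new Match();
            c.position = position;
            c.errors = errors;
            return c;
        }
    }

    // One Match instance is refilled and returned for every element.
    static Iterable<Match> reusing(int[][] data) {
        Match shared = new Match();
        return () -> new Iterator<Match>() {
            int k = 0;
            public boolean hasNext() { return k < data.length; }
            public Match next() {
                shared.position = data[k][0];
                shared.errors = data[k][1];
                k++;
                return shared;
            }
        };
    }

    // Picks the value with the fewest errors, with or without copying.
    static Match pick(Iterable<Match> values, boolean copy) {
        Match best = null;
        for (Match v : values)
            if (best == null || v.betterthan(best)) best = copy ? v.clone() : v;
        return best;
    }

    public static void main(String[] args) {
        int[][] data = {{29, 0}, {48, 1}, {0, 6}};
        // without a copy, best is the shared object and shows the last element
        System.out.println(pick(reusing(data), false).errors); // 6
        // with a copy, best keeps the value it had when it was chosen
        System.out.println(pick(reusing(data), true).errors);  // 0
    }
}
```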