-------- THREAD POOL ASSIGNMENT PART II: --------

Consider the conventional quicksort algorithm (I'm a boring computer science professor, so what else do I know? :( ). I've attached an implementation of quicksort called quickdouble.java. Briefly, quicksort works as follows. Given an array such as

2 7 4 6 3 1 8

1. Find a "pivot element" to partition the list into two halves. In my implementation, the findpivot function returns the index of the first element out of order (the index of 4 in the above example). If no such pivot exists, then the array segment is already sorted.

2. Partition the array (via in-place swapping) into two halves: the left one <= pivot, and the right one > pivot. My partition function does this, and also returns the starting index of the second partition. With the above example, the two partitions will be

2 4 3 1   and   7 6 8

Both my findpivot and partition functions are parameterized by the start and end indices of the array segment they're responsible for.

3. "Recursively" quicksort both partitions.

But there is a problem if we implement quicksort (in a Java-type language) using direct recursion (having quicksort call itself): quicksort is only average-case O(n*log n); in the worst case (e.g., if given an array in reverse-sorted order to begin with), it's O(n*n). In the worst case, the depth of recursion also becomes O(n) instead of O(log n), which means that for large enough arrays (usually when the array contains more than about 8000 elements), direct recursion would overflow the runtime stack.

Thus, in my implementation of quicksort, I used a separate stack data structure. This stack replaces the runtime stack: instead of making recursive calls, I simply push a new pair of start,end indices onto it. It so happens that this way of implementing quicksort also suggests how it can be parallelized. Think of the stack as a sort of job queue for your thread pool.
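The stack-based scheme described above can be sketched roughly as follows. This is an illustration, not the attached quickdouble.java: the findpivot and partition bodies are my own reconstructions from the description, so the details (and class/parameter names) may differ from the real implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class StackQuicksort {
    // Index of the first out-of-order element, or -1 if the segment is sorted.
    static int findpivot(double[] A, int lo, int hi) {
        for (int i = lo; i < hi; i++)
            if (A[i] > A[i + 1]) return i + 1;
        return -1;
    }

    // In-place partition of A[lo..hi]: elements <= pivot go left, > pivot go right.
    // Returns the starting index of the second (right) partition.
    static int partition(double[] A, int lo, int hi, double pivot) {
        int i = lo;
        for (int j = lo; j <= hi; j++)
            if (A[j] <= pivot) {
                double t = A[i]; A[i] = A[j]; A[j] = t;
                i++;
            }
        return i;
    }

    static void sort(double[] A) {
        Deque<int[]> stack = new ArrayDeque<>();  // replaces the runtime stack
        stack.push(new int[]{0, A.length - 1});
        while (!stack.isEmpty()) {
            int[] seg = stack.pop();
            int lo = seg[0], hi = seg[1];
            int p = findpivot(A, lo, hi);
            if (p < 0) continue;                  // segment already sorted
            int mid = partition(A, lo, hi, A[p]);
            stack.push(new int[]{lo, mid - 1});   // instead of recursive calls,
            stack.push(new int[]{mid, hi});       // push start,end index pairs
        }
    }
}
```

On the example array 2 7 4 6 3 1 8, findpivot returns the index of 4, and partition produces 2 4 3 1 followed by 7 6 8, matching the description above.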
The findpivot and partition functions cannot be parallelized, but once we divide a partition into two smaller partitions, each can be sorted independently by a different thread. Importantly, the partitions will never overlap. It's also not important that we use a LIFO-style stack, so long as all partitions get sorted sooner or later. Thus, conceptually, instead of pushing a new partition onto the stack, we can just submit a new job to a thread pool, where each job carries the start and end indices of the partition it should sort.

Of course, the cost of creating the job objects and running the thread-pool management routines can be a significant overhead. It's not a good idea to submit a new job for a partition that contains only 2 or 3 elements. So keep the conventional implementation that I gave you: if a partition is too small, sort it in place using the conventional algorithm; otherwise, submit a new job. In my implementation, I submitted new jobs only when the partition size was greater than 50000.

Using my rather old 2.4 GHz 4-core Intel processor, I was able to observe a roughly 2X speedup when sorting an array of 50 million randomly generated doubles. First compile and run the quickdouble program as follows:

java -Xmx1024m quickdouble 50000000

The -Xmx1024m option increases the Java heap space to 1 gigabyte. If you don't have enough memory, use smaller arrays, or add delays to the quicksort routines to intentionally slow them down. On my quad core this took about 16 seconds. My parallelized version took about 8 seconds, but this could probably be improved with some fine-tuning. Uncomment the .print() line to print the sorted array (not recommended for very large arrays though!).

Implement the parallelized quicksort. First make sure it works on small arrays (print the sorted arrays). Then find a multi-core processor and run some experiments. Report your findings.
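The job-submission scheme above might be sketched like this. Note the assignment requires your own generic thread pool; purely to illustrate the decomposition, this sketch uses the standard java.util.concurrent ExecutorService in its place, plus a Phaser to wait for all outstanding jobs. The THRESHOLD value, helper names, and findpivot/partition bodies are my assumptions, not the attached code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class ParQuicksort {
    static final int THRESHOLD = 1000;  // partition size below which we sort sequentially
    static double[] A;                  // the array being sorted (static, as hinted)
    static ExecutorService pool;
    static Phaser pending;              // counts outstanding jobs

    // Index of the first out-of-order element, or -1 if the segment is sorted.
    static int findpivot(int lo, int hi) {
        for (int i = lo; i < hi; i++)
            if (A[i] > A[i + 1]) return i + 1;
        return -1;
    }

    // In-place partition: <= pivot left, > pivot right; returns start of right half.
    static int partition(int lo, int hi, double pivot) {
        int i = lo;
        for (int j = lo; j <= hi; j++)
            if (A[j] <= pivot) {
                double t = A[i]; A[i] = A[j]; A[j] = t;
                i++;
            }
        return i;
    }

    // Sequential stack-based quicksort, used for small partitions.
    static void seqSort(int lo, int hi) {
        Deque<int[]> stack = new ArrayDeque<>();
        stack.push(new int[]{lo, hi});
        while (!stack.isEmpty()) {
            int[] s = stack.pop();
            int p = findpivot(s[0], s[1]);
            if (p < 0) continue;
            int mid = partition(s[0], s[1], A[p]);
            stack.push(new int[]{s[0], mid - 1});
            stack.push(new int[]{mid, s[1]});
        }
    }

    // One job: sort the partition A[lo..hi].
    static void sortJob(int lo, int hi) {
        if (hi - lo <= THRESHOLD) {
            seqSort(lo, hi);            // too small: sort in place, no new jobs
        } else {
            int p = findpivot(lo, hi);
            if (p >= 0) {
                int mid = partition(lo, hi, A[p]);
                submit(lo, mid - 1);    // the two halves never overlap,
                submit(mid, hi);        // so each becomes an independent job
            }
        }
        pending.arriveAndDeregister();  // this job is finished
    }

    static void submit(int lo, int hi) {
        pending.register();             // count the job before it can finish
        pool.execute(() -> sortJob(lo, hi));
    }

    public static void sort(double[] arr) {
        A = arr;
        pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        pending = new Phaser(1);        // party 1 is the calling thread
        submit(0, A.length - 1);
        pending.arriveAndAwaitAdvance(); // block until every job has finished
        pool.shutdown();
    }
}
```

Because the partitions never overlap, the worker threads write to disjoint index ranges of the shared array, so no locking of the array itself is needed.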
------- Additional notes and hints on the parallel quicksort problem: -------

The best experimental result I was able to obtain was a 2.7X speedup while sorting 120 million random doubles. I created new thread-pool jobs when the partition size was greater than 1000, and used the conventional algorithm otherwise (the 50000 threshold I mentioned earlier turned out to be too large). The theoretical maximum speedup on my system is 4X since I have 4 cores, but reaching that is unlikely.

How do you know if your program has really sorted such a large array? Call the following function to check it:

public static boolean checksorted(double[] A)  // check if array is sorted
{
    boolean ok = true;
    for(int i=0; ok && i<A.length-1; i++)
        if (A[i] > A[i+1]) ok = false;
    return ok;
}

Don't factor this check into the timing measurements, however. Making the array double[] A a static member of your class may also help a little.

***** Observe the principles of abstraction as much as possible. You can't just tailor the threadpool to work with this example. I.e., if your threadpool class makes a reference to elements of the quicksort code, you will not receive credit even if it "works". The threadpool should stay generic - something that can be used by any program. *****

When running the timing experiments, make sure that your system is not running a lot of things in the background. Close browsers and antivirus programs, and take the machine off the network. On Windows, run the Task Manager and make sure that the "System Idle Process" shows close to 100%. You can also use the Task Manager to monitor CPU activity. I would be interested to know who can achieve the best results.
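To make the "stay generic" requirement concrete, a thread pool of the required kind might look something like the sketch below: it knows nothing about quicksort, only Runnable jobs, and the pendingJobs counter lets a caller wait until every submitted job (including jobs submitted by other jobs, as the quicksort jobs do) has finished. All names here are illustrative assumptions, not a prescribed design.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class SimpleThreadPool {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread[] workers;
    private final AtomicInteger pendingJobs = new AtomicInteger(0);

    public SimpleThreadPool(int nThreads) {
        workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        Runnable job = queue.take();     // block until a job arrives
                        try { job.run(); }
                        catch (RuntimeException e) { }   // a failed job must not kill the worker
                        finally {
                            if (pendingJobs.decrementAndGet() == 0)
                                synchronized (pendingJobs) { pendingJobs.notifyAll(); }
                        }
                    }
                } catch (InterruptedException e) { }     // shutdown() interrupts us
            });
            workers[i].setDaemon(true);
            workers[i].start();
        }
    }

    // Jobs may themselves submit further jobs.
    public void submit(Runnable job) {
        pendingJobs.incrementAndGet();   // count the job before queueing it
        queue.add(job);
    }

    // Block until every submitted job (including jobs submitted by jobs) is done.
    public void awaitAll() {
        synchronized (pendingJobs) {
            while (pendingJobs.get() > 0) {
                try { pendingJobs.wait(); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
            }
        }
    }

    public void shutdown() {
        for (Thread t : workers) t.interrupt();
    }
}
```

The quicksort code would then interact with this class only through submit() and awaitAll(), passing plain Runnables; nothing in the pool references arrays, pivots, or partitions.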