/*    MPI Assignment 1  */

You are to do this assignment in pairs.


0.  Write a program that compares the relative overhead of various forms
    of communication between two processors (-np 2).

    Declare and generate random 2D arrays of doubles, just as in the
    sample matrix multiplication program.
  
    Each process is to have 
   
      double A[N][N];  // send buffer
      double B[N][N];  // recieve buffer

    The rank 0 process needs to send A to the rank 1 process and
    receive the contents of B from the rank 1 process.  Correspondingly,
    the rank 1 process needs to receive its B from rank 0 and send its A
    to rank 0.

    The MPI_Wtime() call gets the current time in seconds and can
    be used to measure the elapsed time of a program:

    double t1, t2;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime() - t1; // records elapsed time.

    Measure the total time it takes to complete the exchange for the following
    cases:  First, use only synchronous send/receive

    1. Sending the entire 2D array with a single send operation.

    2. Sending the array row by row using a loop.

    3. Sending each element (double) of the array separately (for this one,
       you should start with very small values of N, like 10).

    4. If you know some networking, you probably know that the maximum ethernet
       packet size is around 1500 bytes.  Larger packets will have to be 
       fragmented,  either by you or by the system.  Write yet another 
       version of the send/receives that only sends out 150 doubles 
       (1200 bytes) at a time.   Hint: use (double*)A and (double*)B
       and pointer arithmetics.

    For these experiments, I recommend that you put them all into one
    program instead of in seperate programs.  Do not printf anything
    until the very end (you should know why).  Also, you'll need to
    vary the size of the matrix.  I used N=2000 (4 million doubles =
    32 megabytes) to get good measurements (except for version 3!).

    Record and report as to the times.

    5.  With synchronous send/receive, you must make sure that the rank 0
        processor sends first and the rank 1 processor recieves first,	
	otherwise you may get deadlock.  Asynchronous communication 
	can be used to avoid such situations.  Roughly, you can write
	both processes as follows

	start asynch send
	start asynch receive
	wait for sends to complete
	wait for recieves to complete.

	Please modify part 1 of your program to use this technique.


  ===================================================================

                           Part II:

/*
  A prime number is any number > 1 that's divisible only by 1 and itself.
  There are many sophisticated ways to test for primes.  A simple one is
  to test if the number is divisible by 2 or any odd number between
  3 and n/2.  

  Here is a conventional, 1-process implementation:
*/

int isprime(unsigned int n)
{
  unsigned int i;  
  if (n==2) return 1;
  if ((n<2) || (n%2 == 0)) return 0;
  for(i=3; i<=n/2 && ((n%i)>0); i+=2);
  return (i>n/2);
}


/*
    Is this algorithm a good candidate for parallel speedup?  You should
    show that it is by deriving the complexity of computation versus
    the complexity of communication.  Then you should implement a 
    multi-processor solution to demonstrate your hypothesis.

    The rank-0 process will either input or randomly generate the
    number to be tested.  Each process will then be responsible for
    checking if the number is divisible by a range of values, and send
    the result back to the rank-0 process.  For the final
    communication, when the results are gathered from each process,
    use MPI_REDUCE with operation MPI_LAND (logical and), instead of
    the normal send/receive operations.  MPI_REDUCE is similar to
    MPI_Bcast: all processes must call it.  Read the description of
    this function in the textbook and online docs carefully.

    The C datatype unsigned int corresponds to the MPI datatype
    MPI_UNSIGNED.

    Here are some large-enough primes for you to test:

    2147483647
    2147483629
    2147483587

    Also test your program on non-primes, and compare the answer with
    that returned by the conventional function above.  Compare the
    elapsed time of the conventional function and your multi-processor
    implementation.  You should see speed up for the large numbers above.


    -------------
    Additional hints:

    To take a command line argument and convert it to a numerical value:

    #include<stdlib.h>
    ...
    int main(int argc, char **argv)
    {
        unsigned int n;
	n =(unsigned int) atoi(*++argv);
        ...

    For you java-ish folks, *++argv is basically the same as argv[1], with
    an auto-increment of argv first.  Note that argv[1], not argv[0] is
    the first command-line argument.
*/