CSC 124 Assignment 2

Writing The MiniJava Lexical Analyzer with JLex
--

With this assignment we begin assembling the different components of a compiler for the "MiniJava" language described in the text. We may not have time to implement every feature of the language (such as inheritance), and we may want to add additional features. The first step is to build a lexical analyzer for the language, that is, a program that identifies all atomic tokens.

1. The two pieces of software we will use, JLex and Java_Cup, can be found on husun3 at /shared/csc/local, which should be mounted from the Adams204 workstations. You can also install JLex and Java_Cup from the links on the class homepage. For JLex, create a directory called JLex and download the source code (Main.java) into the directory. Then compile Main.java with javac. To use JLex with a filename.lex specification file, run

    java -cp /shared/csc/local JLex.Main filename.lex

This assumes that /shared/csc/local is not already on your permanent CLASSPATH.

1b. To install the sibling program java_cup on your own system (if you're not using your Hofstra account), download the java_cup-v11a.jar file from the java_cup site and put it inside your jdk1.x/jre/lib/ext directory. This is the best way. If you are using your Hofstra account, however, you'll have to include /shared/csc/local in your CLASSPATH, i.e.

    java -cp /shared/csc/local java_cup.Main < filename.cup

(We won't run this for this assignment, but you WILL need to have it installed.)

2. Find on our homepage a file called "sym.java". This file contains the constant definitions of all the tokens of MiniJava (generated by java_cup, which we will learn about very shortly). The token constants, represented as unique integers, are listed below. I've put the MiniJava tokens they're supposed to represent in comments on the right.
Note that sym.INTLIT represents an actual integer literal, such as 32 or 45, whereas sym.INTtok represents the keyword "int", and similarly for doubles and Strings. Doubles, Strings and System.out.printf are not in the MiniJava specs proper, but we're going to include them in our version of the language. If time allows, you will have the opportunity to implement new language features of your own design. Also note that with this set of tokens, "==" will be scanned as two separate sym.EQUAL tokens. We will have to make the distinction between "=" and "==" at the next stage of compiling (the parsing stage).

    /** CUP generated class containing symbol constants. */
    public class sym {
      /* terminals */
      public static final int INTtok = 28;      // int
      public static final int STRINGLIT = 4;    // "abcd..."
      public static final int SLASH = 9;        // / (divide)
      public static final int DOUBLEtok = 44;   // double
      public static final int SEMI = 38;        // ;
      public static final int LPAREN = 10;      // (
      public static final int PRINTLN = 30;     // System.out.println
      public static final int MINUS = 8;        // -
      public static final int STATIC = 22;      // static
      public static final int RPAREN = 11;      // )
      public static final int NOT = 33;         // !
      public static final int AND = 31;         // &&
      public static final int INTLIT = 3;       // 32 (an actual integer)
      public static final int LESSTHAN = 43;    // <
      public static final int OR = 32;          // ||
      public static final int COMMA = 39;       // ,
      public static final int CLASS = 20;       // class
      public static final int DOUBLELIT = 5;    // 3.14 (an actual num)
      public static final int PLUS = 6;         // +
      public static final int MAIN = 24;        // main
      public static final int IF = 34;          // if
      public static final int THIS = 18;        // this
      public static final int DOT = 37;         // .
      public static final int ID = 2;           // identifier (e.g., var name)
      public static final int EOF = 0;          // end-of-file (admin purpose)
      public static final int RETURN = 27;      // return
      public static final int EQUAL = 36;       // =
      public static final int TRUE = 12;        // true
      public static final int NEW = 19;         // new
      public static final int error = 1;        // (internal admin)
      public static final int VOID = 23;        // void
      public static final int PRINTF = 35;      // System.out.printf
      public static final int LBRACK = 14;      // [
      public static final int TIMES = 7;        // *
      public static final int ELSE = 42;        // else
      public static final int LBRACE = 16;      // {
      public static final int RBRACK = 15;      // ]
      public static final int WHILE = 40;       // while
      public static final int BOOLEANtok = 29;  // boolean
      public static final int PUBLIC = 21;      // public
      public static final int RBRACE = 17;      // }
      public static final int EXTENDS = 26;     // extends
      public static final int STRINGtok = 25;   // String
      public static final int FALSE = 13;       // false
      public static final int LENGTH = 41;      // length
    }

In addition to recognizing these tokens, your code should also be able to ignore comments, both those that start with // (hint: "//".*\n) and those that are surrounded by /* and */.

3. *** The following steps are recommended:

Create a directory for the compiler project, like "minijava", and put all sources, including sym.java, into this directory. You now need to write a .lex specification file to create a lexical analyzer for MiniJava. Study the example I gave you Tuesday as well as the one that came with the JLex documentation. Your lexical analyzer should include the following declarations.

At the very top of the file:

    import java_cup.runtime.*;

In between the two "%%" lines:

    %function next_token
    %type java_cup.runtime.Symbol
    %char
    %line

These declarations state that JLex will create a function

    class Yylex { ...
        java_cup.runtime.Symbol next_token() { ...
That is, calling next_token() will return the next token recognized, in the form of an instance of java_cup.runtime.Symbol. The constructor of this symbol should be called (from within your .lex file) like this:

    new java_cup.runtime.Symbol(sym.ELSE, yyline, yychar, yytext())

The first parameter expects an integer, which should be the sym class code representing the appropriate token, in this case the "else" token. The next two parameters fill the position fields of the Symbol (CUP calls them left and right). Here we pass yyline and yychar, i.e. the line number and the character offset at which the token starts, so that a token can be located in the source. The last parameter's type is Object, so it can be anything. For tokens such as "else", this parameter can be null. However, since we need to test whether the scanner works, it is better to return a printable string. yytext() always returns the text of the current token. You should pass yytext() to the constructor of Symbol EXCEPT for INTLIT, DOUBLELIT and STRINGLIT.

The Objects associated with INTLIT and DOUBLELIT should be an Integer and a Double respectively. That is, you would do:

    new java_cup.runtime.Symbol(sym.INTLIT, yyline, yychar, new Integer(yytext()))

and

    new java_cup.runtime.Symbol(sym.DOUBLELIT, yyline, yychar, new Double(yytext()))

(The constructors of Integer and Double can accept strings.)

For STRINGLIT, when you see a string literal such as "abcd", you want the Symbol object to record that. However, yytext() will also include the two double quotes as part of the string. To get rid of them, you should do:

    String s = yytext().substring(1, yytext().length()-1);
    return new java_cup.runtime.Symbol(sym.STRINGLIT, yyline, yychar, s);

To make the job of creating a token easier, you might want to declare utility procedures inside the %{ ... %} block (see the .lex examples).

4. Use JLex to produce a .lex.java file. Download the test program mjlexertest.java from the homepage. Study this program a bit. Note that given a java_cup.runtime.Symbol object A, A.sym is the integer token code (such as sym.ELSE) and A.value is the Object associated with the token.
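
To make the A.sym / A.value note concrete, here is a minimal sketch in plain Java of what the three literal tokens should carry. It is not the code for your .lex file: DemoSymbol is a hypothetical stand-in for java_cup.runtime.Symbol (so the sketch compiles without the CUP jar), and LiteralDemo, intLit, doubleLit, stringLit and describe are names invented for illustration. The token codes are copied from sym.java above. Integer.valueOf and Double.valueOf are used in place of the constructors; they accept the same strings and avoid deprecation warnings on newer JDKs.

```java
/** Hypothetical stand-in for java_cup.runtime.Symbol (same four constructor arguments). */
final class DemoSymbol {
    final int sym;       // token code, e.g. 3 for INTLIT
    final int left;      // here: the line number (yyline)
    final int right;     // here: the character offset (yychar)
    final Object value;  // the Object associated with the token

    DemoSymbol(int sym, int left, int right, Object value) {
        this.sym = sym;
        this.left = left;
        this.right = right;
        this.value = value;
    }
}

/** Hypothetical utility procedures of the kind you might put inside %{ ... %}. */
final class LiteralDemo {
    // Token codes copied from sym.java.
    static final int INTLIT = 3;
    static final int STRINGLIT = 4;
    static final int DOUBLELIT = 5;

    // text plays the role of yytext(); line and pos play yyline and yychar.
    static DemoSymbol intLit(String text, int line, int pos) {
        return new DemoSymbol(INTLIT, line, pos, Integer.valueOf(text));
    }

    static DemoSymbol doubleLit(String text, int line, int pos) {
        return new DemoSymbol(DOUBLELIT, line, pos, Double.valueOf(text));
    }

    static DemoSymbol stringLit(String text, int line, int pos) {
        // Drop the opening and closing double quotes that the match includes.
        String s = text.substring(1, text.length() - 1);
        return new DemoSymbol(STRINGLIT, line, pos, s);
    }

    /** Formats a token as "code ValueType value", similar in spirit to a test driver. */
    static String describe(DemoSymbol a) {
        return a.sym + " " + a.value.getClass().getSimpleName() + " " + a.value;
    }

    public static void main(String[] args) {
        System.out.println(describe(intLit("32", 0, 0)));        // 3 Integer 32
        System.out.println(describe(doubleLit("3.14", 0, 4)));   // 5 Double 3.14
        System.out.println(describe(stringLit("\"abcd\"", 1, 0))); // 4 String abcd
    }
}
```

If your own scanner reports a String where this sketch reports an Integer or Double, the literal rule is probably passing yytext() straight through.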
Compile everything:

    javac -cp .:/shared/csc/local *.java

Now follow the "textbook resources" link on our webpage and find the MiniJava project. On this page you'll find sample MiniJava programs such as Factorial.java and BinarySearch.java. Test your program on ALL of these examples, for instance:

    java -cp .:/shared/csc/local mjlexertest Factorial.java

Finally, here are three additional lines of code to get you started:

    \n { System.out.println(); }
    {NONNEWLINE_WHITE_SPACE}+ {}
    "int" { return new java_cup.runtime.Symbol(sym.INTtok,yyline,yychar,yytext()); }

------

BE CAREFUL AND THOROUGH. Check the output of your program. Were comments properly skipped? Was any token misrepresented?
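
One way to think through the two comment forms before writing the JLex rules is to try the idea in plain Java first. The sketch below is only an illustration under stated assumptions: CommentCheck and stripComments are hypothetical names, the pattern uses java.util.regex syntax rather than JLex syntax, and the reluctant quantifier *? is a java.util.regex feature that should not be assumed to exist in JLex. Unlike the "//".*\n hint, the line-comment branch here leaves the newline in place. The sketch also ignores the case of "//" or "/*" appearing inside a string literal, which a real scanner has to handle.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for experimenting with comment patterns; not part of the assignment files. */
final class CommentCheck {
    // A line comment runs from "//" up to (not including) the end of the line.
    // A block comment runs from "/*" to the first "*/" and may span lines,
    // which is why DOTALL is needed for the '.' in the second branch.
    private static final Pattern COMMENT =
        Pattern.compile("//[^\\n]*|/\\*.*?\\*/", Pattern.DOTALL);

    /** Returns src with both kinds of comments removed. */
    static String stripComments(String src) {
        return COMMENT.matcher(src).replaceAll("");
    }

    public static void main(String[] args) {
        String src = "int x; // first\nint y; /* spans\ntwo lines */ int z;";
        // Prints the two remaining lines of code without the comments.
        System.out.println(stripComments(src));
    }
}
```

Comparing the text that survives here against the tokens your scanner prints for the same input is a quick way to spot a comment that was only partly skipped.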