CSC 124: MiniJava+ Compiler Stage 4: Code Generation We will proceed incrementally in generating code. First, we want to be able to compile simple programs with just "main". If you can manage that, then you will find that compiling functions, objects and arrays follow naturally. ------------------------ Point 1: choosing an abstract representation of machine code. If you look at code produced by the online calculator example, a compiler, because of its uniform structure, can easily generate code such as push %eax pop %eax or mov %eax,%ebx mov %ebx,%eax Any kind of optimization routine that seeks to avoid or elminate such redundant code would need to be written at the abstract syntax level. Once we produce assembly or machine code, it will be too late. Therefore, instead of generating code directly, we shall first produce an abstract representation of x86 assembly code, that is, an assembly code that still exists as a java data structure that can be analyzed and transformed. Another important benefit of using an intermediate form is that it allows us the flexibility to produce code consistent with a number of different assemblers, such as MASM, or even straight to some loadable format. Thus you will find on the homepage "linabs.java", which contains my version of the abstract assembly. I tried to make these structures as simple as possible. But there is still one place where inheritance makes sense: operands are implemented by three subclasses: register, immediate, and memory. This will allow us to identify and compare operands of different instructions. Each machine instruction is encapsulated as an instance of the operation class. Memory labels are represented by the memlabel class, which also has a facility for generating new labels. Please study these structures carefully and make appropriate changes. The memlabel class is a bit of a hack and you may want to clean it up or build your own mechanisms for dealing with labels. To understand how the abstract structures are used, your code generation engine will most likely revolve around a visitor structure that's roughly: class generator implements visitor { // For 32bit x86 only: public static final register EAX = new register("eax"); public static final register EBX = new register("ebx"); public static final register ECX = new register("ecx"); public static final register EDX = new register("edx"); public static final register EDI = new register("edi"); public static final register ESI = new register("esi"); public static final register ESP = new register("esp"); public static final register EBP = new register("ebp"); // code to be generated: ArrayList Code = new ArrayList(); ..... Now, to produce code such as mov $2, %ebx mov -4(%ebp), %eax label2: add %eax,%ebx push %ebx call function1 ret you would do: Code.add(new operation("mov", new immediate(2),EBX)); Code.add(new operation("mov", new mem(EBP,-4),EAX)); Code.add(new operation(new memlabel("label"))); Code.add(new operation("add",EAX,EBX)); Code.add(new operation("push",EBX)); Code.add(new operation("call",new immediate("function1",true))); Code.add(new operation("ret")); Note that an immediate operand can be an integer constant or a string label preceeded with a "$", or in the case of a call or jmp label, no "$". ..... I've written a visitor "x86" (on operations and operands), which can then be used to print assembly code in gnu/at&t syntax: x86 filecoder = new x86(); pw.println("/* generated by minijava+ compiler */\n"); for (operation op : Code) System.out.print.print(op.accept(filecoder)); See "mjpc.java" on the homepage, which you should study and modify before using (or write your own using it as an example). You can easily write a visitor similar to "x86" that generates code in some other syntax, such as for MASM. ----------------------------- Point 2: dealing with string constants The minijava+ language contains string expressions. However, it does NOT contain string operators such as + for concatenation. This will make our compiler simpler because we would not have to worry about allocating space for string on the heap, but simple code up all string literals as static arrays. In other words, since our strings won't ever change, they can be treated as constants. So if you have three strings in a program, they can all be declared at the beginning: .section .rodata STR0: .string "abc" STR1: .string "%d\n" STR2: .string "hello world" ... How do we find all the strings in a program, we can either traverse the rather absyn datastructure, or make the following simple modifications: 1. In the Global class in Global.java, add the variable class Global { ... public static ArrayList Sconsts = new ArrayList(); ... } 2. In the stringexp class of mjabsyn.java, add the following item: class stringexp extends baseinfo implements exp { ... int labelind; // index of string in Global.Sconsts ... } 3. In your .cup file, modify the semantic action associated with string expressions. Whenever a string expression is seen by your parser, you should add the string to the Global.Sconsts lists, and set the labelind index parameter of the generated stringexp node to the appropriate value. Now when you visit a stringexp in your code generation visitor, you will know at what label the string is stored in by looking at the labelind. Before you generate other code, you will first use the Global.Sconsts structure to generate the .rodata section where string constants are declared. ------------------------------- Point 3: Stack and post-order traversal The basic idea of code generation is to execute a post-order traversal on complex expressions. The result of computing any complex expression should leave the result of the expression on top of the stack. For example, here's a visit method you might use to generate code to compute plus expressions: public Object visit(plusexp x) { x.e2.accept(this); // visit right branch x.e1.accept(this); // visit // pop elements into eax,ebx first Code.add(new operation("pop",EAX)); Code.add(new operation("pop",EBX)); Code.add(new operation("add",EBX,EAX)); Code.add(new operation("push",EAX)); return null; } Note implicit in this code is the assumption that x.e2.accept(this) and x.e1.accept(this) will use the stack in a clean way, and will leave ONLY the results of the their respective subexpressions on the stack when complete. Otherwise, you cannot assume that the values of the two subexpressions are right next to eachother on the stack. You must be very careful generating code to use the stack. Everything pushed must be popped off at some point. Always remember that expressions may be complex - don't think "1+2" think (1-(a*4)) + 2/(4-b); Be uniform in your approach to generating code. Don't take shortcuts in the hope of an easier way out (there isn't any!). --------------------------------- Point 4: dealing with identifiers (variables) When encountering a variable expression, you need to consult the symbol table, which will tell you where to look for the variable in memory. As you should recall, the symbol table contains entries for each variable in the form: class varentry implements tableentry { boolean isformal = false; boolean isinstance = false; String type; int position; // position : needed for later assembly tableentry previous; // pointer to previous stack frame public tableentry lookup(String s) { return previous.lookup(s); } } (on homepage - see typechecking assignment links). The most important field is "position". This will allow you to generate a memory index. For example, given varentry V, if V.isformal, and V.position is 1 (second parameter), then you would generate the instruction push 12(%ebp) In your code generation visitor if you keep track of three variables: programentry top : pointer to top-level symbol table classentry cclass : pointer to current class table entry funcentry cmethod : pointer to current method table entry. Then to look up the entry for a variable with name N, you can do something like: varentry V; if (cmethod!=null) V = (varentry)cmethod.lookup(N); // formalarg or local var else V = (varentry)cclass.lookup(N); // instance variable then if (V.isformal) then V represents an argument to the current method, and if (V.isinstance) then V represents an instance variable of the current class; else V represents a local variable of the current function. In addition to varexp, you'll also need to consult the symbol table when compiling assignment statements. ---------------------------------- Point 5: dealing with Objects the above notes should allow you to write a code generator for simple programs with just "main" - with no function calls and no objects. However, you still need to be aware of at least one factor: *** A.f(x,y) should really be thought of as f(A,x,y) *** That is, the instance that the function is called on is really compiled as the first parameter of the function. ---- more on objects and methods later... ============================================================================= Some code to help you get started: class generator2 implements visitor { // for easy reference to register operands: public static final register EAX = new register("eax"); public static final register EBX = new register("ebx"); public static final register ECX = new register("ecx"); public static final register EDX = new register("edx"); public static final register EDI = new register("edi"); public static final register ESI = new register("esi"); public static final register ESP = new register("esp"); public static final register EBP = new register("ebp"); // Arraylist to hold generated abstract code: ArrayList Code = new ArrayList(); // pointers to symbol table programentry top; // pointers to symbol table; classentry cclass = null; // pointer to current class entry in table funcentry cmethod = null; // method entry, may be null if not in method // constructor accepts pointer to symbol table's top-level entry public generator2(programentry st) {top=st;} // sample visit method: public Object visit(plusexp x) { x.e2.accept(this); // visit right branch x.e1.accept(this); // visit // pop elements into eax,ebx first Code.add(new operation("pop",EAX)); Code.add(new operation("pop",EBX)); Code.add(new operation("add",EBX,EAX)); Code.add(new operation("push",EAX)); return null; } }