A compiler is a special type of program. It validates. It optimizes. It transforms. Compilers teach us how to solve complex problems.
Here: We review compiler theory. And we look at some compiler-related features of the C# language.
Errors. Compile-time errors are special. With a runtime error, your program may be causing trouble in the world. But with a compile-time error, the problem never progresses to that point. These errors improve program quality.
Books. A dragon is a formidable beast. It breathes fire and might even eat you. Compiler theory is so complex that the classic textbook depicts it as a dragon on its cover. But we can fight this dragon with syntax-directed translation.
Note: Quotes from Aho on this site are taken from the dragon book, Compilers: Principles, Techniques, and Tools.
Note 2: Quotes from Abelson and Sussman are taken from Structure and Interpretation of Computer Programs.
Note 3: Quotes from Lidin are taken from Expert .NET 2.0 IL Assembler. This book describes low-level details.
Intro. Compiler theory divides the compilation of programs into separate phases. First, the program is read from its text file. Then groups of characters are recognized as lexemes.
Lexeme: The term lexeme is used to refer to the textual representation of a token.
A token is a structure that combines a lexeme and information about that lexeme. After the tokens are determined, the compiler uses internal data structures (intermediate representations) to improve the form of programs.
Note: Lexical refers to the text representation of programs. Lexeme refers to the text representation of keywords and more.
And: Tokens combine lexemes and symbolic information about lexemes. The symbol table stores information about tokens.
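To illustrate these terms, here is a minimal lexer sketch. The Lex method and the Token record are hypothetical names invented for this example; they are not part of any real compiler. Each lexeme (the raw text) is paired with symbolic information (the token kind).

```csharp
using System;
using System.Collections.Generic;

class Program
{
    // A token combines a lexeme (the text) with symbolic information (the kind).
    record Token(string Kind, string Lexeme);

    // A minimal lexer sketch: splits an expression into number and operator tokens.
    static List<Token> Lex(string source)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }
            if (char.IsDigit(c))
            {
                // Consume a run of digits: one lexeme, one NUMBER token.
                int start = i;
                while (i < source.Length && char.IsDigit(source[i])) i++;
                tokens.Add(new Token("NUMBER", source.Substring(start, i - start)));
            }
            else
            {
                tokens.Add(new Token("OPERATOR", c.ToString()));
                i++;
            }
        }
        return tokens;
    }

    static void Main()
    {
        foreach (var token in Lex("10 + 200"))
        {
            Console.WriteLine("{0}: {1}", token.Kind, token.Lexeme);
        }
    }
}
```

The expression "10 + 200" yields three tokens: two NUMBER tokens and one OPERATOR token.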
Phases. Let us walk through some compiler phases. These are used by the C# compiler and the .NET Framework. When you compile a C# program in Visual Studio, the csc.exe program is invoked on the program text.
Next: All the compilation units are combined in a preliminary step. The C# compiler then detects errors in your program.
Definite assignment. The C# compiler uses definite assignment analysis. Here it proves that variables are not used before they are initialized. This step reduces the number of security problems and bugs in C# programs.
Tip: Definite assignment analysis ensures higher program quality because the programs are tested more at compile-time.
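A minimal demonstration of definite assignment, with hypothetical variable names: reading the unassigned local (the commented-out line) would produce compiler error CS0165, "Use of unassigned local variable."

```csharp
using System;

class Program
{
    static void Main()
    {
        int assigned = 5;  // Definitely assigned: safe to read below.
        int unassigned;    // Declared but never assigned.

        Console.WriteLine(assigned);

        // The next line would not compile: error CS0165,
        // "Use of unassigned local variable 'unassigned'".
        // Console.WriteLine(unassigned);
    }
}
```

The error is reported at compile-time, so the faulty read can never happen at runtime.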
Overloads. The C# compiler applies inferential logic at compile-time. This has no penalty at execution. It finds the best overloaded method based on the arguments passed at each call site and the parameter types of each overload.
Tip: Overloaded methods can be used as a performance optimization. No runtime penalty is caused by using them.
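A sketch of overload resolution, using a hypothetical Describe method: the compiler binds each call to one overload at compile-time based on the argument's type. No dispatch logic runs at execution.

```csharp
using System;

class Program
{
    // Three overloads: the compiler selects one per call site at compile-time.
    static string Describe(int value) { return "int"; }
    static string Describe(double value) { return "double"; }
    static string Describe(object value) { return "object"; }

    static void Main()
    {
        Console.WriteLine(Describe(5));     // Binds to the int overload.
        Console.WriteLine(Describe(5.0));   // Binds to the double overload.
        Console.WriteLine(Describe("cat")); // A string binds to the object overload.
    }
}
```

Output

int
double
object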
Numbers. At the C# compilation stage, number transformations are applied. Numbers are "promoted" to larger representations to enable compilation with certain operators. And some casts not present in the program text are added.
Note: This is done to enable shorter and clearer high-level source code, and to ensure an accurate lower-level implementation.
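For example, byte operands are promoted to int before arithmetic, because C# defines no addition operator on byte. This small demonstration shows the promotion and the explicit cast needed to narrow the result back.

```csharp
using System;

class Program
{
    static void Main()
    {
        byte a = 100;
        byte b = 200;

        // The byte operands are promoted to int before the addition,
        // so the result type is int and 300 does not overflow.
        int sum = a + b;
        Console.WriteLine(sum);

        // An explicit cast is required to store the result in a byte.
        // The value is truncated: 300 mod 256 = 44.
        byte truncated = (byte)(a + b);
        Console.WriteLine(truncated);
    }
}
```

Output

300
44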
If. The compiler uses node-based logic to rearrange conditional statements and loops, which both use jump instructions. Code is often compiled to use branch instructions that do not exactly mirror the source text.
For example, the C# compiler will change while-loops into the same thing as certain for-loops. It has sophisticated logic, presumably based on graph theory, to transform your loops and nested expressions into efficient representations.
Constants. In compiler theory, some levels of indirection can be eliminated by actually injecting the constant values into the representation of the program directly. This is termed constant folding.
Note: My benchmarks have shown that constant values do provide performance benefits over variables.
And: If you look at your compiled program, all constants will be directly inside the parts of methods where they were referenced.
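A sketch of constant folding, with hypothetical const fields: the compiler can evaluate the constant expression 60 * 60 itself and embed the literal 3600 directly in the method body.

```csharp
using System;

class Program
{
    // const values are compile-time constants, eligible for folding.
    const int SecondsPerMinute = 60;
    const int MinutesPerHour = 60;

    static void Main()
    {
        // The compiler evaluates this constant expression at compile-time;
        // no multiplication needs to run here.
        int secondsPerHour = SecondsPerMinute * MinutesPerHour;
        Console.WriteLine(secondsPerHour);
    }
}
```

Output

3600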
Strings. String literals are pooled together. And constant references to the stream of string data in the compiled program are placed where you used the literals. So the literals themselves are not located where you use them in methods.
Instead: The string literal in your program is transformed into a pointer to pooled data.
Next: This program shows that the two string literals, declared separately, are actually the same string reference.
C# program that shows string pool

using System;

class Program
{
    static void Main()
    {
	string value = "Python";
	string value2 = "Python";
	// ... These are the same string!
	Console.WriteLine(string.ReferenceEquals(value, value2));
    }
}

Output

True
Metadata. A C# program is compiled into a relational database (metadata). The metadata is an abstract binary representation. It is an efficient encoding of the program. But it is not easy to read by humans.
Also: The metadata is stored on the disk. It contains no comments from your source code.
The metadata is divided into tables. These tables contain records that point to different tables and different records. It is not typically important to study the metadata format unless you are writing a compiler.
Methods. Structured programming represents logic as procedure calls. It uses methods. In the metadata, method bodies omit the names of their local variables. This information is lost at compile-time. But parameter names are retained.
Note: The goal was to improve the level of optimization on method bodies and eliminate unneeded information, reducing disk usage.
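We can observe this with reflection. In this sketch, the parameter names of a hypothetical Add method are recovered at runtime, while the local variable name (sum) is not stored in the metadata at all.

```csharp
using System;
using System.Reflection;

class Program
{
    static int Add(int left, int right)
    {
        int sum = left + right; // The name "sum" is discarded at compile-time.
        return sum;
    }

    static void Main()
    {
        // Parameter names survive compilation and are visible via reflection.
        MethodInfo method = typeof(Program).GetMethod("Add",
            BindingFlags.NonPublic | BindingFlags.Static);
        foreach (ParameterInfo parameter in method.GetParameters())
        {
            Console.WriteLine(parameter.Name);
        }
    }
}
```

Output

left
right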
Runtime. A high-level C# program is translated into a relational database called metadata. The Common Language Runtime (CLR) executes this metadata. This incurs some overhead. Startup time is affected.
Then: As you run the program, each method is read from the metadata. Intermediate language code is translated into machine-level code.
In just-in-time compilation, the CLR applies many optimizations to the methods. It sometimes (based on heuristics) inserts the methods at their call sites. This optimization is called function inlining.
It rewrites instruction layouts in memory to eliminate unnecessary indirections. Each pointer dereference costs time. By removing this dereference, fewer instructions (and clock cycles) are needed.
Note: The JIT system causes a slowdown when first used. It is most beneficial on long-running programs.
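Because inlining is driven by JIT heuristics, it cannot be forced from C# code. In .NET Framework 4.5 and later, the MethodImplOptions.AggressiveInlining flag can encourage it; this is only a hint, and the JIT may still decline. The Square method here is a hypothetical example.

```csharp
using System;
using System.Runtime.CompilerServices;

class Program
{
    // A hint (not a command) that the JIT should inline this method
    // at its call sites. Requires .NET Framework 4.5 or later.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Square(int x)
    {
        return x * x;
    }

    static void Main()
    {
        Console.WriteLine(Square(9));
    }
}
```

Output

81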
Induction. Compilers can apply many optimizations. But sometimes applying them manually is more effective. Code motion moves code outside of a loop. And induction variables are used to analyze data dependencies.
Tip: You can model the accesses of the array with vector sets. This tells you how the data accesses depend on each other.
Next: In this program, each element is accessed only once and in sequential order with the int i variable.
C# program that uses int i and loop

using System;

class Program
{
    static void Main()
    {
	// Create an array.
	// ... It has ten elements.
	int[] array = new int[10];
	array[0] = 1;
	array[9] = 1;
	// Loop through ten integers with int i.
	// ... This is an induction variable.
	for (int i = 0; i < 10; i++)
	{
	    int value = array[i]; // Affine index expression
	    Console.WriteLine("{0} {1}", i, value);
	}
    }
}

Output

0 1
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
In this program, the int i variable is an induction variable. Its value depends on the iteration space of the for-loop. The induction variable i, then, can be applied as an affine expression to index into the array of integers.
An affine expression is one that is based on the induction variable in a loop. The array access in this program would multiply the induction variable by 4, which is the byte size of an integer.
Then: A compiler optimization could apply strength reduction to translate this multiplication into a series of adds.
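Here is a sketch of strength reduction applied by hand: the multiplication in the affine expression i * 4 is replaced with a running addition, so each iteration performs a cheap add instead of a multiply. Both loops compute the same byte offsets.

```csharp
using System;

class Program
{
    static void Main()
    {
        // Affine index expression: offset = i * 4 on each iteration.
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine(i * 4);
        }

        // Strength-reduced form: the multiply becomes a running add.
        int offset = 0;
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine(offset);
            offset += 4;
        }
    }
}
```

Each loop prints the offsets 0, 4, 8, 12, 16.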
Tip: In compiler theory, arrays are a major focus for optimization opportunities because they are used so frequently.
Code motion is a machine-independent optimization. It can reduce the amount of work done in loops. By moving computations that do not depend on the iteration (loop-invariant computations) outside of loops, we improve performance.
First: This example program introduces two methods: Method1 and Method2. It benchmarks these two methods.
Note: In Method1, a statement is inside the loop. In Method2, that same statement is pulled out of the loop (hoisted).
Tip: The important point is that the statement does not depend on the induction variable in the loop (i).
C# program that shows code motion

using System;
using System.Diagnostics;

class Program
{
    const int _max = 100000;

    static void Main()
    {
	Console.WriteLine(Method1());
	Console.WriteLine(Method2());
	var s1 = Stopwatch.StartNew();
	for (int i = 0; i < _max; i++)
	{
	    Method1();
	}
	s1.Stop();
	var s2 = Stopwatch.StartNew();
	for (int i = 0; i < _max; i++)
	{
	    Method2();
	}
	s2.Stop();
	Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000 * 1000) /
	    _max).ToString("0.00 ns"));
	Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000 * 1000) /
	    _max).ToString("0.00 ns"));
	Console.Read();
    }

    static string Method1()
    {
	int i;
	string value = null;
	for (i = 0; i < 100; i++)
	{
	    value = "abc" + 5.ToString();
	}
	return value + i.ToString();
    }

    static string Method2()
    {
	int i;
	string value = null;
	value = "abc" + 5.ToString();
	for (i = 0; i < 100; i++)
	{
	}
	return value + i.ToString();
    }
}

Results

abc5100
abc5100
11324.32 ns
484.68 ns
In this program, we can see that pulling the loop-invariant statement out of the loop resulted in a dramatic performance increase. The results also show that both methods produce the same result.
And: Most interesting is that the compiler did not perform this optimization itself. With the C# language, you must do this by hand.
Note: Code motion can be performed by a compiler. But the compiler in the .NET Framework 4.0 misses opportunities for this optimization.
Optimization. The best way to compile a program is "undecidable." So no program can truly be considered optimal. But as programmers, trying to improve performance of critical paths is worth the effort.
JIT: In this form of compilation, methods are compiled as they are accessed by a program's control flow.
Summary. Compilers are complicated. They use an elaborate series of phases to transform program source. Modern computers, and all computer software, rely on compiler theory. It is at the core of all software.