A compiler is a special type of program. It validates. It optimizes. It transforms. Compilers teach us how to solve complex problems.
Here: We review compiler theory. And we look at some compiler-related features of the C# language.
Errors. Compile-time errors are special. With a runtime error, your program may be causing trouble in the world. But with a compile-time error, the problem never progresses to that point. These errors improve program quality.
Books. A dragon is a formidable beast. It breathes fire and might even eat you. Compiler theory is so complex that the classic textbook depicts it as a dragon on its cover. But we can fight this dragon with syntax-directed translation.
Note: Quotes from Aho on this site are taken from the dragon book, Compilers: Principles, Techniques, and Tools.
Note 2: Quotes from Abelson and Sussman are taken from Structure and Interpretation of Computer Programs.
Note 3: Quotes from Lidin are taken from Expert .NET 2.0 IL Assembler. This book describes low-level details.
Intro. Compiler theory divides the compilation of programs into separate phases. First, the program is read from its text file. Then groups of characters are recognized as lexemes.
Lexeme: The term lexeme is used to refer to the textual representation of a token.
A token is a structure that combines a lexeme and information about that lexeme. After the tokens are determined, the compiler uses internal data structures (intermediate representations) to improve the form of programs.
Note: Lexical refers to the text representation of programs. Lexeme refers to the text representation of keywords and more.
And: Tokens combine lexemes and symbolic information about lexemes. The symbol table stores information about tokens.
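To illustrate these terms, here is a minimal lexer sketch. The Lex method and the Token record are hypothetical names invented for this example; they are not part of any real compiler. Each lexeme (the raw text) is paired with symbolic information (the token kind).

```csharp
using System;
using System.Collections.Generic;

class Program
{
    // A token combines a lexeme (the text) with symbolic information (the kind).
    record Token(string Kind, string Lexeme);

    // A minimal lexer sketch: splits an expression into number and operator tokens.
    static List<Token> Lex(string source)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }
            if (char.IsDigit(c))
            {
                // Consume a run of digits: one lexeme, one NUMBER token.
                int start = i;
                while (i < source.Length && char.IsDigit(source[i])) i++;
                tokens.Add(new Token("NUMBER", source.Substring(start, i - start)));
            }
            else
            {
                tokens.Add(new Token("OPERATOR", c.ToString()));
                i++;
            }
        }
        return tokens;
    }

    static void Main()
    {
        foreach (var token in Lex("10 + 200"))
        {
            Console.WriteLine("{0}: {1}", token.Kind, token.Lexeme);
        }
    }
}
```

The expression "10 + 200" yields three tokens: two NUMBER tokens and one OPERATOR token.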
Phases. Let us walk through some compiler phases. These are used by the C# compiler and the .NET Framework. When you compile a C# program in Visual Studio, the csc.exe program is invoked on the program text.
Next: All the compilation units are combined in a preliminary step. The C# compiler then detects errors in your program.
Definite assignment. The C# compiler uses definite assignment analysis. Here it proves that variables are not used before they are initialized. This step reduces the number of security problems and bugs in C# programs.
Tip: Definite assignment analysis ensures higher program quality because the programs are tested more at compile-time.
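A minimal demonstration of definite assignment, with hypothetical variable names: reading the unassigned local (the commented-out line) would produce compiler error CS0165, "Use of unassigned local variable."

```csharp
using System;

class Program
{
    static void Main()
    {
        int assigned = 5;  // Definitely assigned: safe to read below.
        int unassigned;    // Declared but never assigned.

        Console.WriteLine(assigned);

        // The next line would not compile: error CS0165,
        // "Use of unassigned local variable 'unassigned'".
        // Console.WriteLine(unassigned);
    }
}
```

The error is reported at compile-time, so the faulty read can never happen at runtime.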
Overloads. The C# compiler applies inferential logic at compile-time. This has no penalty at execution. It finds the best overloaded method based on the arguments passed at each call site and the parameter types of each overload.
Tip: Overloaded methods can be used as a performance optimization. No runtime penalty is caused by using them.
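A sketch of overload resolution, using a hypothetical Describe method: the compiler binds each call to one overload at compile-time based on the argument's type. No dispatch logic runs at execution.

```csharp
using System;

class Program
{
    // Three overloads: the compiler selects one per call site at compile-time.
    static string Describe(int value) { return "int"; }
    static string Describe(double value) { return "double"; }
    static string Describe(object value) { return "object"; }

    static void Main()
    {
        Console.WriteLine(Describe(5));     // Binds to the int overload.
        Console.WriteLine(Describe(5.0));   // Binds to the double overload.
        Console.WriteLine(Describe("cat")); // A string binds to the object overload.
    }
}
```

Output

int
double
object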
Numbers. At the C# compilation stage, number transformations are applied. Numbers are "promoted" to larger representations to enable compilation with certain operators. And some casts not present in the program text are added.
Note: This is done to enable shorter and clearer high-level source code, and to ensure an accurate lower-level implementation.
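For example, byte operands are promoted to int before arithmetic, because C# defines no addition operator on byte. This small demonstration shows the promotion and the explicit cast needed to narrow the result back.

```csharp
using System;

class Program
{
    static void Main()
    {
        byte a = 100;
        byte b = 200;

        // The byte operands are promoted to int before the addition,
        // so the result type is int and 300 does not overflow.
        int sum = a + b;
        Console.WriteLine(sum);

        // An explicit cast is required to store the result in a byte.
        // The value is truncated: 300 mod 256 = 44.
        byte truncated = (byte)(a + b);
        Console.WriteLine(truncated);
    }
}
```

Output

300
44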
If. The compiler uses node-based logic to rearrange conditional statements and loops, which both use jump instructions. Code is often compiled to use branch instructions that do not exactly mirror the source text.
For example, the C# compiler will change while-loops into the same thing as certain for-loops. It has sophisticated logic, presumably based on graph theory, to transform your loops and nested expressions into efficient representations.
Constants. In compiler theory, some levels of indirection can be eliminated by actually injecting the constant values into the representation of the program directly. This is termed constant folding.
Note: My benchmarks have shown that constant values do provide performance benefits over variables.
And: If you look at your compiled program, all constants will be directly inside the parts of methods where they were referenced.
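A sketch of constant folding, with hypothetical const fields: the compiler can evaluate the constant expression 60 * 60 itself and embed the literal 3600 directly in the method body.

```csharp
using System;

class Program
{
    // const values are compile-time constants, eligible for folding.
    const int SecondsPerMinute = 60;
    const int MinutesPerHour = 60;

    static void Main()
    {
        // The compiler evaluates this constant expression at compile-time;
        // no multiplication needs to run here.
        int secondsPerHour = SecondsPerMinute * MinutesPerHour;
        Console.WriteLine(secondsPerHour);
    }
}
```

Output

3600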
Strings. String literals are pooled together. And constant references to the stream of string data in the compiled program are placed where you used the literals. So the literals themselves are not located where you use them in methods.
Instead: The string literal in your program is transformed into a pointer to pooled data.
Next: This program shows that the two string literals, declared separately, are actually the same string reference.
C# program that shows string pool

using System;

class Program
{
    static void Main()
    {
	string value = "Python";
	string value2 = "Python";
	// ... These are the same string!
	Console.WriteLine(string.ReferenceEquals(value, value2));
    }
}

Output

True
Metadata. A C# program is compiled into a relational database (metadata). The metadata is an abstract binary representation. It is an efficient encoding of the program. But it is not easy to read by humans.
Also: The metadata is stored on the disk. It contains no comments from your source code.
The metadata is divided into tables. These tables contain records that point to different tables and different records. It is not typically important to study the metadata format unless you are writing a compiler.
Methods. Structured programming represents logic as procedure calls. It uses methods. In the metadata, method bodies omit the names of their local variables. This information is lost at compile-time. But parameter names are retained.
Note: The goal was to improve the level of optimization on method bodies and eliminate unneeded information, reducing disk usage.
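We can observe this with reflection. In this sketch, the parameter names of a hypothetical Add method are recovered at runtime, while the local variable name (sum) is not stored in the metadata at all.

```csharp
using System;
using System.Reflection;

class Program
{
    static int Add(int left, int right)
    {
        int sum = left + right; // The name "sum" is discarded at compile-time.
        return sum;
    }

    static void Main()
    {
        // Parameter names survive compilation and are visible via reflection.
        MethodInfo method = typeof(Program).GetMethod("Add",
            BindingFlags.NonPublic | BindingFlags.Static);
        foreach (ParameterInfo parameter in method.GetParameters())
        {
            Console.WriteLine(parameter.Name);
        }
    }
}
```

Output

left
right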
Runtime. A high-level C# program is translated into a relational database called metadata. The Common Language Runtime (CLR) executes this metadata. This incurs some overhead. Startup time is affected.
Then: As you run the program, each method is read from the metadata. Intermediate language code is translated into machine-level code.
In just-in-time compilation, the CLR applies many optimizations to the methods. It sometimes (based on heuristics) inserts the methods at their call sites. This optimization is called function inlining.
It rewrites instruction layouts in memory to eliminate unnecessary indirections. Each pointer dereference costs time. By removing this dereference, fewer instructions (and clock cycles) are needed.
Note: The JIT system causes a slowdown when first used. It is most beneficial on long-running programs.
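Because inlining is driven by JIT heuristics, it cannot be forced from C# code. In .NET Framework 4.5 and later, the MethodImplOptions.AggressiveInlining flag can encourage it; this is only a hint, and the JIT may still decline. The Square method here is a hypothetical example.

```csharp
using System;
using System.Runtime.CompilerServices;

class Program
{
    // A hint (not a command) that the JIT should inline this method
    // at its call sites. Requires .NET Framework 4.5 or later.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Square(int x)
    {
        return x * x;
    }

    static void Main()
    {
        Console.WriteLine(Square(9));
    }
}
```

Output

81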
Induction. Compilers can apply many optimizations. But sometimes applying them manually is more effective. Code motion moves code outside of a loop. And induction variables are used to analyze data dependencies.
Tip: You can model the accesses of the array with vector sets. This tells you how the data accesses depend on each other.
Next: In this program, each element is accessed only once and in sequential order with the int i variable.
C# program that uses int i and loop

using System;

class Program
{
    static void Main()
    {
	// Create an array.
	// ... It has ten elements.
	int[] array = new int[10];
	array[0] = 1;
	array[9] = 1;
	// Loop through ten integers with int i.
	// ... This is an induction variable.
	for (int i = 0; i < 10; i++)
	{
	    int value = array[i]; // Affine index expression
	    Console.WriteLine("{0} {1}", i, value);
	}
    }
}

Output

0 1
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
In this program, the int i variable is an induction variable. Its value depends on the iteration space of the for-loop. The induction variable i, then, can be applied as an affine expression to index into the array of integers.
An affine expression is one that is based on the induction variable in a loop. The array access in this program would multiply the induction variable by 4, which is the byte size of an integer.
Then: A compiler optimization could apply strength reduction to translate this multiplication into a series of adds.
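Here is a sketch of strength reduction applied by hand: the multiplication in the affine expression i * 4 is replaced with a running addition, so each iteration performs a cheap add instead of a multiply. Both loops compute the same byte offsets.

```csharp
using System;

class Program
{
    static void Main()
    {
        // Affine index expression: offset = i * 4 on each iteration.
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine(i * 4);
        }

        // Strength-reduced form: the multiply becomes a running add.
        int offset = 0;
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine(offset);
            offset += 4;
        }
    }
}
```

Each loop prints the offsets 0, 4, 8, 12, 16.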
Tip: In compiler theory, arrays are a major focus for optimization opportunities because they are used so frequently.
Code motion is a machine-independent optimization. It can reduce the amount of work done in loops. By moving computations that do not depend on the iteration (loop-invariant computations) outside of loops, we improve performance.
First: This example program introduces two methods: Method1 and Method2. It benchmarks these two methods.
Note: In Method1, a statement is inside the loop. In Method2, that same statement is pulled out of the loop (hoisted).
Tip: The important point is that the statement does not depend on the induction variable in the loop (i).
C# program that shows code motion

using System;
using System.Diagnostics;

class Program
{
    const int _max = 100000;

    static void Main()
    {
	Console.WriteLine(Method1());
	Console.WriteLine(Method2());
	var s1 = Stopwatch.StartNew();
	for (int i = 0; i < _max; i++)
	{
	    Method1();
	}
	s1.Stop();
	var s2 = Stopwatch.StartNew();
	for (int i = 0; i < _max; i++)
	{
	    Method2();
	}
	s2.Stop();
	Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds * 1000 * 1000) /
	    _max).ToString("0.00 ns"));
	Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds * 1000 * 1000) /
	    _max).ToString("0.00 ns"));
	Console.Read();
    }

    static string Method1()
    {
	int i;
	string value = null;
	for (i = 0; i < 100; i++)
	{
	    value = "abc" + 5.ToString();
	}
	return value + i.ToString();
    }

    static string Method2()
    {
	int i;
	string value = null;
	value = "abc" + 5.ToString();
	for (i = 0; i < 100; i++)
	{
	}
	return value + i.ToString();
    }
}

Results

abc5100
abc5100
11324.32 ns
484.68 ns
In this program, we can see that pulling the loop-invariant statement out of the loop resulted in a dramatic performance increase. The results also show that both methods produce the same result.
And: Most interesting is that the compiler did not perform this optimization itself. With the C# language, you must do this by hand.
Note: Code motion can be performed by a compiler. But the compiler in the .NET Framework 4.0 misses opportunities for this optimization.
Optimization. The best way to compile a program is "undecidable." So no program can truly be considered optimal. But as programmers, trying to improve performance of critical paths is worth the effort.
JIT: In this form of compilation, methods are compiled as they are accessed by a program's control flow.
Summary. Compilers are complicated. They use an elaborate series of phases to transform program source. Modern computers, and all computer software, rely on compiler theory. It is at the core of all software.