Decompilation of CIL code into some high level code - do I need to introduce new variables during data flow analysis?

Question

I'm writing a compiler from .NET CIL code to some high level language. Process is similar to decompilation. I have done some control flow analysis - detecting loops, ifs, and so on. In terms of data flow analysis i've done simple expression propagation by simulating some instructions involving evaluation stack - I treat evaluation stack as hidden variable, push more complex expressions on it, and whenever there is any assignment instruction to some variable (for example starg or stloc) - I pop and assign propagated expression from stack to this variable and translate this expression into statement in high language code. Of course for now it is so simple that it generates failures. Consider a function written in C#:
int GCD(int n1, int n2)
{
    while(n2 != 0)
    {
        int c = n1 % n2;
        n1 = n2;
        n2 = c;
    }

return n1;
}

This function compiles to IL:
.method private hidebysig 
    instance int32 GCD (
        int32 n1,
        int32 n2
    ) cil managed 
{
    .maxstack 8

IL_0000: br.s IL_000a
        IL_0002: ldarg.1    // load n1 on eval stack
        IL_0003: ldarg.2    // load n2 on eval stack
        IL_0004: rem        // pop n1 and n2 from stack, compute n1 % n2 and push it on stack
        IL_0005: ldarg.2    // load n2 on stack
        IL_0006: starg.s n1 // pop n2 from stack and store it in n1
        IL_0008: starg.s n2 // pop n1 % n2 from stack and store it in n2

IL_000a: ldarg.2
        IL_000b: brtrue.s IL_0002

IL_000d: ldarg.1
    IL_000e: ret
}

With this simple propagation we push n1 % n2 on stack, then load n2 on stack, then we have starg instruction, so we pop expression from stack and translate assignment to statement. Then we pop again, and do the same. Result code looks like this:
GCD(n1, n2) {
    while (n2) { 
        n1 = n2;
        n2 = (n1 % n2); 
    }
    return n1;
}

This indicates that I have to do something inverse to dead code elimination, maybe called like "necessary code introduction". I searched for some sources about methods to introduce new variables in decompilation, but I did not find any. Do you have any ideas?

Ian Cook · Accepted Answer

The problem, as I think you realise, is that you are logically moving the calculation of an expression past the point whereby one of its input values gets changed.
Given your current implementation, I'd suggest the easiest approach to handle this is -

When an instruction writes to a variable, look to see if there are any expressions involving this variable on your expression stack.
If there are,

spill each such existing expression to a new temporary variable, and
replace the existing expressions on the stack with the respective new temporary variable.

Only then generate the write to the original variable

Applying this procedure to your code gives -
IL_0002: // stack = { n1 }

IL_0003: // stack = { n2, n1 }

IL_0004: // stack = { n1 % n2 }

IL_0005: // stack = { n2, n1 % n2 }

IL_0006: // this instruction will write to n1
         // inspection shows that we have an expression involving n1 on the stack so
         // (a) spill this expression to a new temporary variable and
         // (b) replace the expression on the stack with the new temporary variable

temp = n1 % n2;     // stack = { n2, temp }

// now the write to n1 can safely take place

n1 = n2;            // stack = { temp };

IL_0008: // this instruction will write to n2, 
         // inspection shows no expressions on the stack involving n2
         // the write to n2 can safely take place

n2 = temp;         // stack = {}

Removing the comments, the inner loop now looks like the original code.
temp = n1 % n2;
n1 = n2;
n2 = temp;

Decompilation of CIL code into some high level code - do I need to introduce new variables during data flow analysis?

One Answer

Add your own answers!

Ask a Question