Devirtualization

Question

I've recently become fascinated with virtualization-based protection and with recovering the original code from the randomly generated bytecode produced by protectors such as VMProtect, but I can't get a grasp on how it would actually be done.
Although I have read a few write-ups to understand virtualization better, including articles that specifically target the protector I'm trying to devirtualize, I can't relate them to my own sample. There are discrepancies between the two that confuse me.
I made a small program that takes a string as input and prints it, then applied virtualization to that function. The first thing I notice is that the function now pushes a 4-byte constant onto the stack and calls another function, and that what look like random junk instructions follow the call.
.vmp0:00440C0F 68 28 02 8E 51                          push    518E0228h
.vmp0:00440C14 E8 68 BB FC FF                          call    sub_40C781

Inside sub_40C781, I see it pushing the general-purpose registers onto the stack, along with a pushf, which pushes the flags. Between these pushes and the other meaningful instructions there are pointless instructions, which is a little odd since I only selected virtualization as the protection option.
Anyway, the meaningful instructions I managed to pick out of the called function are:
push    ebp
push    ebx
push    ecx
push    esi
push    edi
pushf
push    edx
push    eax
mov     eax, 0
push    eax
mov     esi, [esp+24h+arg_0]
lea     esi, [esi+eax]
lea     esp, [esp-0C0h]
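
One way to check which stack slot `[esp+24h+arg_0]` refers to is to replay the pushes on a toy stack. This is a minimal sketch, assuming the instructions listed above are the only ones that change esp before the `mov esi, ...`, and assuming IDA's usual convention that `arg_0` is the first dword above the return address (offset 4). The slot names are only labels for illustration.

```python
# Replay the esp-changing instructions from the listings on a toy stack.
RETURN_ADDRESS = 0x00440C14 + 5   # instruction after the 5-byte call
PUSHED_CONSTANT = 0x518E0228      # push 518E0228h

# One entry per 4-byte stack slot, in the order they are pushed.
pushes = [
    ("pushed constant", PUSHED_CONSTANT),  # push 518E0228h
    ("return address", RETURN_ADDRESS),    # call sub_40C781
    ("ebp", None), ("ebx", None), ("ecx", None), ("esi", None), ("edi", None),
    ("eflags", None),                      # pushf
    ("edx", None), ("eax", None),
    ("zero", 0),                           # mov eax, 0 / push eax
]

def slot_at(offset):
    """Return the (name, value) slot found at [esp + offset] after all pushes."""
    assert offset % 4 == 0
    # The most recent push sits at [esp+0], so index from the end of the list.
    return pushes[len(pushes) - 1 - offset // 4]

ARG_0 = 4  # assumption: IDA's arg_0 is the first dword above the return address
name, value = slot_at(0x24 + ARG_0)
print(name, hex(value))
```

Under those assumptions, nine pushes inside the function account for the 24h, and the load picks up the constant pushed before the call, so esi would start out as 518E0228h plus eax (0 here).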

But from there, I'm not sure how to proceed. The instructions after the last lea are:
.vmp0:0040E187                         loc_40E187:                             ; CODE XREF: .vmp0:00416516↓j
.vmp0:0040E187                                                                 ; .vmp0:loc_42E5EB↓j ...
.vmp0:0040E187 8B DE                                   mov     ebx, esi
.vmp0:0040E189 B8 00 00 00 00                          mov     eax, 0
.vmp0:0040E18E 0F BA F7 E7                             btr     edi, 0E7h ; 'ç'
.vmp0:0040E192 66 23 F8                                and     di, ax
.vmp0:0040E195 2B D8                                   sub     ebx, eax
.vmp0:0040E197
.vmp0:0040E197                         loc_40E197:                             ; DATA XREF: sub_40C781:loc_40E197↓o
.vmp0:0040E197 8D 3D 97 E1 40 00                       lea     edi, loc_40E197
.vmp0:0040E19D 66 0F A4 D0 09                          shld    ax, dx, 9
.vmp0:0040E1A2 C1 C8 49                                ror     eax, 49h
.vmp0:0040E1A5 81 EE 04 00 00 00                       sub     esi, 4
.vmp0:0040E1AB 0F A3 F8                                bt      eax, edi
.vmp0:0040E1AE 33 C6                                   xor     eax, esi
.vmp0:0040E1B0 C6 C4 27                                mov     ah, 27h ; '''
.vmp0:0040E1B3 8B 06                                   mov     eax, [esi]
.vmp0:0040E1B5 F9                                      stc
.vmp0:0040E1B6 F8                                      clc
.vmp0:0040E1B7 E9 79 12 05 00                          jmp     loc_45F435

But I can't see how this interprets the bytecode and maps it back to x86 behavior. From my understanding, esi is probably the VM's instruction pointer, and loc_40E197 seems to be the instruction 'dispatcher', for lack of a better word. But I can't grasp the inner workings: esi seems to be decremented by 4 on each loop, where I expected 1.
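
To make that mental model concrete, here is a toy fetch-and-dispatch loop. It is only a sketch of the general idea and not VMProtect's actual implementation: the handler table, the handler names, and the bytecode values are all hypothetical. It assumes the virtual instruction pointer walks backwards and that each fetch reads a full 32-bit word, which is one possible reading of the `sub esi, 4` followed by `mov eax, [esi]`.

```python
import struct

# Hypothetical handler table: values and names are made up for illustration.
HANDLERS = {0x10: "vm_push", 0x20: "vm_add", 0x30: "vm_exit"}

def trace_toy_vm(bytecode, start):
    """Walk the bytecode backwards, one 32-bit word per fetch."""
    vip = start                  # plays the role of esi
    trace = []
    while vip >= 4:
        vip -= 4                 # mirrors: sub esi, 4
        (word,) = struct.unpack_from("<I", bytecode, vip)  # mirrors: mov eax, [esi]
        name = HANDLERS[word]    # a real VM would jump to the handler here
        trace.append((vip, name))
        if name == "vm_exit":
            break
    return trace

# Execution order is vm_push, vm_add, vm_exit, so the stream is stored reversed.
program = struct.pack("<3I", 0x30, 0x20, 0x10)
for vip, name in trace_toy_vm(program, len(program)):
    print(vip, name)
```

In this model a step of 4 simply means that each fetched item is a dword rather than a one-byte opcode. Whether that dword is an operand, a handler offset, or something still encrypted in the real sample is something I have not confirmed.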
Any insight into how to proceed would be greatly appreciated.
