Theseus, a Static Windows Emulator

(neugierig.org)

112 points | by zdw 4 days ago

19 comments

  • jcranmer 2 days ago

    So I've definitely played around with a lot of these ideas, because the thing that I'm trying to emulate/port/reverse engineer is a 16-bit Windows application, which coincidentally means that a lot of the existing infrastructure just keels over and dies in supporting it.

    I originally started the emulation route, but gave that up when I realized just how annoying it was to emulate the libc startup routine that invoked WinMain, and from poking around at the disassembly, I really didn't need to actually support anything before WinMain. And then poking further, I realized that if I patched the calls to libc, I only had to emulate a call to malloc at the, well, malloc level, as opposed to implementing the DOS interrupt that's equivalent to sbrk. (And a side advantage to working with Win16 code is that, since every inter-segment call requires a relocation, you can very easily find all the calls to all libc methods, even if the disassembler is still struggling with all of the jump tables.)

    After realizing that the syscalls to modify the LDT still exist and work on Linux (they don't on Windows), I switched from trying to write my own 386 emulator to just executing the 16-bit code natively inside a 64-bit process, relying on the existing relocation mechanisms to insert the thunking between 16-bit and 64-bit code [1]. The side advantage was that this also made it very easy to call individual methods given their address rather than emulating from the beginning of main, which also lets me inspect what those methods are doing a lot better.

    Decompilation I've tried to throw in the mix a couple of times, but... 16-bit code really just screws with everything. Modern compilers don't have the ontology to really support it. Doing my own decompilation to LLVM IR runs into the issue that LLVM isn't good at optimizing all of the implemented-via-bitslicing logic, and I still have to write all of the patterns manually just to simplify the code for "x[i]" where x is a huge pointer. And because it's hard to reinsert recompiled decompiled code into the binary (as tools don't speak Win16 file formats anymore), it's difficult to verify that any decompilation is correct.

    [1] The thunking works by running 16-bit code in one thread and 64-bit code in another thread and using hand-rolled spinlocks to wait for responses to calls. It's a lot easier than trying to make sure that all the registers and the stack are in the appropriate state.

    • evmar 2 days ago

      [post author] I went down some similar paths in retrowin32, though 32-bit x86 is likely easier.

      I was also surprised by how much goop there is between startup and main. In retrowin32 I just implemented it all, though I wonder how much I could get away with not running it in the Theseus replace-some-parts model.

      I mostly relied on my own x86 emulator, but I also implemented the thunking between 64-bit and 32-bit mode just to see how it was. It definitely was some asm, but once I wrapped my head around it, it wasn't so bad; check out the 'trans64' and 'trans32' snippets in https://github.com/evmar/retrowin32/blob/ffd8665795ae6c6bdd7... for, I believe, all of it. One reframing that helped me (after a few false starts) was to put as much code as possible in my high-level language and use asm only to bridge to it.

      • jcranmer 2 days ago

        Yeah, 32-bit x86 is somewhat easier because everything's in the same flat address space, and you at least have a system-wide 32-bit code GDT entry, which means you can ignore futzing around with the LDT. 16-bit means you get to deal with segmented memory, and the cherry on top is that gdb just stops being useful, since it doesn't know anything about segmented memory. (To be fair, I don't think Linux even makes it possible to query the LDT of another process, even with ptrace.)

        As for skipping everything before main... well, the main benefit for me was being able to avoid emulating DOS interrupts entirely: skipping the calls that set up various global variables, stubbing out some of the libc implementations, and manually marking in the emulator that code page X was 32-bit (something else that sends tools into a tizzy: a function switching from 16-bit to 32-bit mid-assembly).

        16-bit is weird and kinda fun to work with at times... but there's also a reason that progress on this is incredibly slow for me.

        • evmar a day ago

          Slow progress is fine, it took me like two years to get where I got! (Not that I was working on it full time or anything, but also there were just many false starts and I had no idea what I was doing...)

    • trollbridge 2 days ago

      Interesting work. I’m also doing the same thing with a Win16 Visual Basic app (with some third party libraries).

      Eventually I just decided to run the app inside an accurate emulator with a full Windows 3.1 environment since nothing else was stable.

      • NetMageSCW a day ago

        Have you tried otvdm? (A backport of WineVDM to Windows x64.) I tried it with a Win16 application (apparently VB) that interfaces to a device through an optical transceiver on a serial port, and it worked fine.

    • charcircuit 2 days ago

      Have you tried just asking an LLM to reverse it to C and then work from there?

      • trollbridge 2 days ago

        I tried that, and ran into roadblocks since the app under test is old Visual Basic (which is half compiled and half interpreted) and then it used a third party library which has quite sophisticated anti-decompilation features.

  • NetMageSCW a day ago

    I find it interesting that the article doesn't seem to understand that Rosetta 2 essentially does most of this (it is an ahead-of-time translator/compiler), with a few exceptions for runtime dynamic code. It creates an ARM binary from an x64 binary the first time macOS on Apple Silicon runs an x64-only application. The article even mentions Rosetta.

  • mmastrac 2 days ago

    I think "recompiler" is the accepted term in the art, is it not?

    • evmar 2 days ago

      Looking through Wikipedia at least, it's not exactly clear to me. They have separate pages for 'binary recompiler' and 'binary translation' that link to each other, and the latter is more about going between architectures (which is the main objective here).

  • da-x 2 days ago

    The issue with static recompilation is edge cases: self-modifying code, code copying variations of itself onto the stack to be executed there, jump tables whose structure cannot be inferred from analysis without solving the halting problem, and all sorts of other tricks that eventually necessitate a JIT. I wrote a static recompiler for Win32 in 2012 for fun, and I've seen old 1990s programs do all of these things.

    • evmar a day ago

      Do you have any notes or other artifacts from your recompiler? I’d love to learn more.

  • nxobject 2 days ago

    A follow-up question I'd like to ask to the author: if I understand correctly, one of the ideas is that you can write "high-level", almost platform-independent translations like:

        regs.eax = 3;
        regs.eax = add(regs.eax, 4);
        windows_api();  // some native implementation of the API that was called
    
    and rely on optimizers to do as much "lowering" or specialization to a platform as possible. I'd like to ask: how reliable have optimizers been in getting those results (e.g. how much fiddling with various -Oxxx flags)?

    • evmar 2 days ago

      It depends on what results you're expecting! Relative to an interpreter, even the simplest unoptimized translation of that code is already significantly more efficient at runtime.

      In the post there is a godbolt link showing a compiler inlining a simple add, but a real implementation of x86 add would be much more complex.

      I have read other projects where the authors put some effort into getting exactly the machine code they wanted out. For example, maybe you want the virtual regs.eax to actually live in a machine register, and one way you might be able to convince a compiler to do that is by passing it around as a function parameter everywhere instead of as a struct field. I have not investigated this myself.

  • curtis0101001 a day ago

    You're essentially trading runtime flexibility for ahead-of-time correctness guarantees. The debugger benefit you mention is underrated: being able to use native tooling on translated code instead of building a custom debugger is a huge quality-of-life improvement that doesn't get talked about enough in emulator discussions. Have you considered a conservative superset approach, where you translate everything that looks like it could be code, even if some of it never executes? The overhead of dead translated code seems worth it compared to missing a live path.

    • evmar a day ago

      Yes, I agree that there is little harm in gathering too much code. I have tried out just scanning data memory for values that refer to addresses within the region marked as code and disassembling from those points, as well as scanning the instructions I traverse for any immediate values in the same range.

  • mysterydip 2 days ago

    That’s a really clever idea/solution I hadn’t thought of before, and yet it makes “of course, duh” sense as well.

  • pema99 2 days ago

    See also the author's previous project, retrowin32: https://github.com/evmar/retrowin32