Various Oddities - Graphics, Game Dev, Emulators, and other geeky stuff

21 Aug 2011

Building an Xbox 360 Emulator, part 6: Code Translation Techniques

One of the most important pieces of an emulator is the simulation of the guest processor - it drives every other subsystem and is often the performance bottleneck. I'm going to spend a few posts looking into how to build a (hopefully) fast and portable PowerPC simulator.

Overview

At the highest level the emulator needs to be able to take the original PowerPC instructions and run them on the host processor. These instructions come in the form of a giant assembled binary instruction stream loaded into memory - there is no concept of 'functions' or 'types' and no flow control information. Somehow, they must be quickly translated into a form the host processor can understand, either one at a time or in batches. There are many techniques for doing this, some easy (and slow) and many hard (and fastish). I've learned through experience that the first choice one makes is almost always the wrong one and a lot of iteration is required to get the right balance.

The techniques for doing this translation parallel the techniques used for systems programming. There's usually a component that looks like an optimizing compiler, a linker, some sort of module format (even if just in-memory), and an ABI (application binary interface - a.k.a. calling convention). The tricky part is not figuring out how these components fit together but instead deciding how much effort to put into each one to get the right performance/compatibility trade-off.

I recommend checking out Virtual Machines by James Smith and Ravi Nair for a much more in-depth overview of the field of machine emulation, but I'll briefly discuss the major types used in console emulators below.

Interpreters

Implementation: Easy
Speed: Slow
Examples: BASIC, most older emulators

Interpreters are the simplest of the bunch. They are basically exactly what you would build if you coded up the algorithm describing how a processor works. Typically, they look like this:

  • While running:
    • Get the next instruction from the stream
    • Decode the instruction
    • Execute the instruction (maybe updating the address of the next instruction to execute)
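The loop above can be sketched in C. Everything here is a toy - the opcodes, the encoding, and the register file are made up for illustration and are not real PowerPC:

```c
#include <stdint.h>

/* Toy guest: 32 registers and three made-up opcodes (not real PPC). */
enum { OP_ADDI = 0, OP_BRANCH = 1, OP_HALT = 2 };

typedef struct {
  uint32_t pc;
  int32_t  r[32];
} GuestCpu;

/* Fake encoding: opcode in the low byte, register next, immediate on top. */
static uint32_t encode(uint32_t op, uint32_t reg, uint32_t imm) {
  return op | (reg << 8) | (imm << 16);
}

static void interpret(GuestCpu *cpu, const uint32_t *code) {
  for (;;) {
    uint32_t insn = code[cpu->pc / 4];   /* get the next instruction */
    uint32_t op   = insn & 0xFF;         /* decode it */
    uint32_t reg  = (insn >> 8) & 0xFF;
    uint32_t imm  = insn >> 16;
    switch (op) {                        /* execute it */
      case OP_ADDI:   cpu->r[reg] += (int32_t)imm; cpu->pc += 4; break;
      case OP_BRANCH: cpu->pc = imm;                             break;
      case OP_HALT:   return;
    }
  }
}
```

The switch is the entire 'decode and execute' stage; a real PPC interpreter is the same shape with a far bigger table and much more careful per-instruction semantics.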

There are details that make real implementations a bit more complex than this, but they are generally specifics of the guest CPU (how it handles jumping around, etc). Ignoring the code that actually executes each instruction, a typical interpreter can be just a few dozen lines of easy-to-read, easy-to-maintain code. That's why when performance doesn't matter you see people go with interpreters - no one wants to waste time building insanely complex systems when a simple one will work (or at least, they shouldn't!).

Even when an interpreter isn't fast enough to run the guest code at sufficient speeds there are still many advantages to using one. Mainly, due to the simplicity of the implementation, it's often possible to ensure correctness without much trouble. A lot of emulation projects start off with interpreters just to ensure all the pieces are working and then later go back and add a more complex CPU core once things are all verified correct. Another great advantage is that it's much easier to build debuggers and other analysis tools, as things like stepping and register inspection can be one or two lines in the main execution loop.

Unfortunately for this project, performance does matter and an interpreter will not suffice. I would like to build one just to make it easier to verify the correctness of the instruction set; however, even as (relatively) simple as it is, there is still a large time investment that may never pay off.

Pros

  • Easy to implement
  • Easy to build debuggers/step-through
  • Can get something running fairly fast
  • Snapshotting/save states trivial to implement

Cons

  • Several orders of magnitude slower than the host machine (think 100-1000x)

JITs

Implementation: Sort-of easy
Speed: Sort-of fast
Examples: Modern JavaScript engines, .NET/JVM, most emulators

This technique is the most common one used in emulators today. It's much more complex than an interpreter but still relatively easy to implement. At a high level the flow is the same as in an interpreter, but instead of operating on single instructions the algorithm handles basic blocks: short sequences of instructions ending at the first flow control instruction (jumps and branches).

  • While running:
    • Get the next address to process
    • Lookup the address in the code cache
    • If cached basic block present:
      • Execute cached block
    • Otherwise:
      • Translate the basic block
      • Store in code cache
      • Execute block
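A minimal sketch of that loop in C, with translation stubbed out - the 'translated blocks' here are plain C functions standing in for emitted host machine code, and the direct-mapped cache is just one possible layout:

```c
#include <stdint.h>

#define HALT_PC    0xFFFFFFFFu
#define CACHE_SIZE 256

/* A translated basic block: in a real JIT this would point at emitted
   host machine code; here a C function stands in for it. Each block
   returns the guest address of the next block to run. */
typedef uint32_t (*BlockFn)(int32_t *regs);

typedef struct {
  uint32_t guest_pc;
  BlockFn  fn;
  int      valid;
} CacheEntry;

static CacheEntry cache[CACHE_SIZE];
static int translations;  /* counts how often we had to translate */

/* Hand-written stand-ins for two translated blocks. */
static uint32_t block_at_0(int32_t *r)   { r[3] += 42; return 0x100; }
static uint32_t block_at_100(int32_t *r) { (void)r; return HALT_PC; }

/* Stand-in for the translator itself. */
static BlockFn translate(uint32_t pc) {
  translations++;
  return pc == 0x000 ? block_at_0 : block_at_100;
}

static void run(uint32_t pc, int32_t *regs) {
  while (pc != HALT_PC) {
    CacheEntry *e = &cache[(pc / 4) % CACHE_SIZE];  /* look up in cache */
    if (!e->valid || e->guest_pc != pc) {           /* miss: translate */
      e->guest_pc = pc;
      e->fn = translate(pc);
      e->valid = 1;
    }
    pc = e->fn(regs);                               /* execute block */
  }
}
```

After the first pass through a stretch of code the loop degenerates into cache lookups and indirect calls, which is exactly the steady state described below.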

The emulator spins in this loop and in the steady state is doing very little other than lookups in the cache and calls of the cached code. Each basic block runs until the end and then returns back to this control method to let it decide where to go next.

There is a bit of optimization that can be done here, but because the unit of work is a basic block many optimizations don't make sense or can't be used to full advantage. The best optimizations often require much more context than a couple of instructions in isolation can give, and the most expensive code is usually the stuff in between basic blocks, not inside them (jumps/branches/etc).

Compared to the next technique this simple JIT does have a few advantages. Namely, if the guest system is allowed to dynamically modify code (think of a guest running a guest) this technique will get the best performance with as little penalty for regeneration as possible. On the Xbox 360, however, it is impossible to dynamically generate code so that's a whole big area that can be ignored.

In terms of the price/performance ratio, this is the way to go. The analysis is about as simple as the interpreter, the translation is straightforward, and the code is still pretty easy to debug. There's a performance penalty (that can often be eliminated with tricks) for the cache lookups and dynamic translation, but it's still much faster than interpreting.

That said, it's still not fast enough. To emulate three 3.2GHz PowerPC cores with an advanced instruction set on x86-64, every single cycle needs to count. It's also boring - I've built a few before, and building another would just be going through the motions.

Pros

  • Pretty fast (10-100x slower than native)
  • Pretty simple
  • Allows for recompilation if code is changed on-the-fly

Cons

  • Few optimizations possible
  • Debugging is harder

Recompilers / 'Advanced' JITs

Implementation: Hard
Speed: Fastest possible
Examples: V8, .NET ngen/AOT

Recompilers (often referred to as 'binary translation' in literature) are the holy grail of emulation. The advantages one gets from being able to do full program analysis, ahead-of-time compilation, and trivially debuggable code cannot be beat... except by how hard they usually are to write.

Whereas an interpreter works on individual instructions and a simple JIT works on basic blocks, a recompiler works on methods or even entire programs. This allows for complex optimizations to be used that, knowing the target host architecture, can yield even faster code than the original instruction stream.

I split this section between 'advanced' JITs and recompilers, but they are really two different things implemented in largely the same way. An advanced JIT (as I call it) is a simple JIT extended to work on methods, using some simple intermediate representation (IR) that allows for optimizations, but at the end of the day it is still dynamically compiling code at runtime on demand. A full recompiler is usually a system that does a lot of the same analysis and uses a similar intermediate representation, but does it all at once and before attempting to execute the code.

In some ways a JIT is easier to think about (still operating on small units of code, still doing it inside the program flow), but in many others the recompiler simplifies things. For example, being able to verify an entire program is correct and optimize as much as possible before the guest executes allows for much better tooling to be created. Depending on what libraries/tools are being used you can also get reflection, debugging, and things usually reserved for original source-level programming, like profile-guided optimization.

The hard part of this technique is actually analyzing the instruction stream to try to figure out how the code is organized. Tools like IDA Pro will do this to a point, but are not meant to be used to generate executable code. How this process is normally done is largely undocumented - either because it's considered a trade secret or no one cares - but I puzzled out some of the tricks and used them to build my PSP emulator. There I implemented an advanced JIT, generating code on the fly, but that's only because of the tools that I had at the time and not really knowing what I wanted.

Recompilers are very close to full-on decompilers. Decompilers are designed to take machine instructions up through various representations and (hopefully) yield human-readable high-level source code. They are constructed like a compiler but in reverse: compilers usually contain a frontend (input source code -> intermediate representation (IR)), optimizers/analyzers, and a backend (IR -> machine code (MC)); decompilers have a frontend (MC -> IR), optimizers/analyzers, and a backend (IR -> source, like C). The decompiler frontend is fairly trivial, the analysis much more complex, and the backend potentially unsolvable. What makes recompilers interesting is that at no point do they aim to emit human-readable output - instead, a recompiler has a frontend like a decompiler (MC -> IR), analyzers like a decompiler, optimizers like a compiler, and a backend like a compiler (IR -> MC).

Pros

  • As close to native as possible (1-10x slower, depending on architectures)
  • No translation overhead while running (unless desired)
  • Debuggable using real tools
  • Still pretty novel (read: fun!)

Cons

  • Incredibly hard to write
  • Can't handle dynamically modifiable code (well)

Building a Recompiler

There are many steps down the path of building a recompiler. Some people have tried building general-purpose binary translation toolkits, but that's a lot of work and requires a lot of good design and abstraction. For this project I just want to get something working, and I have learned that after I'm done I will never want to use the code again - by the time I attempt a project like this again, I will have learned enough to consider it all garbage ;)

I'll be focusing on a PowerPC frontend and reusing the PPC instruction set (+ tags) as my intermediate representation (IR) - this simplifies a lot of things and allows me to bake assumptions about the source platform into the entire stack without a lot of nasty hacks. One design concession I will be making is letting the backend (IR -> MC) be pluggable. From experience, the frontend rarely changes between different implementations while the backend varies highly - source PPC instructions are source PPC instructions regardless of whether you're implementing an interpreter, a JIT, or a recompiler. For now I'm planning on using LLVM for the backend (as it also gives me some nice optimizers), but I may re-evaluate this later on and would like not to have to reimplement the frontend.

Frontend (MC -> IR)

Assuming that the source module has already been loaded (see previous posts on loading XEX files), the frontend stage includes a few major components, each covered below.

From this stage a skeleton of the program can be generated that should fairly closely model the original structure of the code. This includes a (mostly correct) list of methods, tagged references to global variables, and enough information to identify the control flow in any given method.

When working on my PSP emulator I didn't factor out the frontend correctly and ended up having to reimplement it several times. For this project I'll be constructing this piece myself and ensuring that it is reusable, which will hopefully save a lot of time when experimenting with different backends.

Disassembler

A simple, run-of-the-mill disassembler that is able to take a byte stream and produce an instruction stream. There are a few table generation toolkits out there that can make this process much faster at runtime, but initially I'll be sticking with the tried-and-true chained table lookup. A useful feature to add to early disassemblers is pretty printing of the instructions, which enables much better output from later parts of the system and makes things significantly easier to debug.
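A chained table lookup for PPC might look like the following sketch. The primary opcode lives in the top 6 bits of each instruction word, and the opcode-31 group chains to a second decoder keyed on the extended opcode field; only a handful of instructions are wired up here, and the table layout is simplified for illustration:

```c
#include <stdint.h>

/* First-level table entry: either a direct mnemonic or a chained
   sub-decoder for opcode groups that need an extended opcode. */
typedef struct {
  const char *name;                    /* direct hit, or NULL */
  const char *(*chain)(uint32_t insn); /* sub-decoder, or NULL */
} OpEntry;

/* Opcode-31 group: extended opcode lives in bits 1..10. */
static const char *decode_op31(uint32_t insn) {
  uint32_t xo = (insn >> 1) & 0x3FF;
  switch (xo) {
    case 235: return "mullw";
    case 266: return "add";
    default:  return "unknown";
  }
}

static OpEntry primary_table[64];

static void init_tables(void) {
  primary_table[31].chain = decode_op31;
  primary_table[32].name  = "lwz";
  primary_table[36].name  = "stw";
}

static const char *disasm_mnemonic(uint32_t insn) {
  OpEntry *e = &primary_table[insn >> 26];  /* primary opcode: top 6 bits */
  if (e->chain) return e->chain(insn);
  return e->name ? e->name : "unknown";
}
```

A real table covers every opcode and also extracts operands and formats them; the chaining structure stays the same, it just gets much wider.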

Basic Block Slicing / Control Flow Graph Construction / Method Recognition

Basic blocks in this context refer to a sequence of instructions that starts at an instruction that is jumped to by another basic block and ends at the first flow control instruction. This gets the instruction stream into a form that enables the next step.

Once basic blocks are identified they can be linked together to form a Control Flow Graph (CFG). Using the CFG it is possible to identify unique entry and exit points of portions of the graph and call those 'methods'. Sometimes they match 1:1 with the original input code, but other times, due to optimizing compilers (inlining, etc.), they may not - it doesn't matter to a recompiler (but it does to a decompiler). Usually the process of CFG generation is combined with the basic block slicing step and executed in multiple passes until all edges have been identified.
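The leader-marking pass at the heart of block slicing can be sketched like this, assuming instructions have already been decoded into records exposing their branch targets (the record shape here is invented for illustration):

```c
#include <stdint.h>

/* Decoded-instruction record: just enough to slice blocks. A real
   frontend would expose this from the disassembler. */
typedef struct {
  int      is_branch;
  uint32_t target;   /* branch target as a byte address */
} Insn;

/* Mark leaders - the first instruction, every branch target, and every
   instruction following a branch - then count the resulting blocks.
   Each leader starts a basic block that runs until the next leader or
   flow control instruction. */
static int find_block_starts(const Insn *insns, int count,
                             uint8_t *leader /* out: count entries */) {
  int blocks = 0;
  for (int i = 0; i < count; i++) leader[i] = 0;
  if (count > 0) leader[0] = 1;
  for (int i = 0; i < count; i++) {
    if (!insns[i].is_branch) continue;
    uint32_t t = insns[i].target / 4;          /* byte address -> index */
    if (t < (uint32_t)count) leader[t] = 1;    /* target starts a block */
    if (i + 1 < count) leader[i + 1] = 1;      /* fall-through does too */
  }
  for (int i = 0; i < count; i++) blocks += leader[i];
  return blocks;
}
```

The multi-pass behavior mentioned above comes from targets discovered late (e.g. via computed branches) forcing earlier blocks to be re-split.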

There are some tricky details here that make this stage not 100% reliable, namely function pointers. Simple C function pointer passing (callbacks/etc) as well as things like C++ vtables can prevent a proper whole-program CFG from being constructed. In these cases, where the target code may appear to have never been called (as there were no jumps/branches into it), it is important to have a method recognition heuristic to identify such methods and bring them into the recompiled output.

The first stage of this is scanning for holes in the code address space: if after all processing has been done there are still regions that are not assigned to methods, they are suspicious - it's possible that the contents of the holes are actually data or padding, and it's important to have a set of rules to follow to identify that. Popular heuristics are looking for function prologues and ensuring a region contains all valid instructions.

Once found, the recompiler still isn't done: even if the method gets to the output, there is no way to connect it up to the original callers accessing it by pointer. One way to solve this is to make all call-by-pointer sites instead look up the function in a table of all functions in the module. It can be significantly slower than a native call, but caches can help.
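A prologue-scanning heuristic might look like this sketch. The patterns checked (`mflr rX` followed by `stwu r1, -N(r1)`) are a common PPC prologue shape, but real code varies, so hits are only candidates for further validation:

```c
#include <stdint.h>
#include <stddef.h>

/* A common PPC prologue shape:
     mflr rX          (0x7C0802A6 | rX << 21)
     stwu r1, -N(r1)  (0x9421xxxx)
   Matching it is only a hint, not proof of a function start. */
static int looks_like_prologue(const uint32_t *words, size_t remaining) {
  if (remaining < 2) return 0;
  int is_mflr    = (words[0] & 0xFC1FFFFF) == 0x7C0802A6; /* any rX */
  int is_stwu_sp = (words[1] & 0xFFFF0000) == 0x94210000; /* stwu r1, d(r1) */
  return is_mflr && is_stwu_sp;
}

/* Scan a hole in the code address space for candidate function starts,
   recording their byte offsets. */
static size_t scan_hole(const uint32_t *words, size_t count,
                        size_t *starts, size_t max_starts) {
  size_t found = 0;
  for (size_t i = 0; i < count && found < max_starts; i++) {
    if (looks_like_prologue(words + i, count - i))
      starts[found++] = i * 4;
  }
  return found;
}
```

A production heuristic would also check that everything from the candidate up to the next known method decodes as valid instructions, as the text above suggests.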

Analysis

The results of the frontend are already fairly usable, however to generate better output the recompiler needs to do a bit of analysis on the source instructions. What comes out of the frontend is a literal interpretation of the source and as such is missing a lot of the extra information that the backend can potentially use to optimize the output. There are hundreds of different things that can be done at this stage as required, but for recompilers there are a few important ones:

Data Flow Analysis

Since PPC is a RISC architecture there are often a large number of instructions that work on intermediate registers just for the sake of accomplishing one logical operation. For example, look at this bit of disassembly (what would come out of the frontend):

.text:8210E77C                 lwz     %r11, 0x54(%r31)
.text:8210E780                 lwz     %r9, 0x90+var_40(%sp)
.text:8210E784                 lwz     %r10, 0x50(%r31)
.text:8210E788                 mullw   %r11, %r11, %r9
.text:8210E78C                 add     %r11, %r11, %r10
.text:8210E790                 lwz     %r10, 0x90+var_3C(%sp)
.text:8210E794                 add     %r11, %r11, %r10
.text:8210E798                 stw     %r11, 0(%r29)

A simple decompilation of this is:

var x = (r31)[0x54];
var y = (sp)[0x90+var_40];
var z = (r31)[0x50];
x *= y;
x += z;
z = (sp)[0x90+var_3C];
x += z;
(r29)[0] = x;

If you used this as output in your recompiler, however, you would end up with many more instructions being executed than required. A simple data flow analysis would enable result propagation (x = x * y + z, etc) and SSA (knowing that the second lwz into %r10 is a different use of z). Performing this step would allow output that is much simpler and easier for downstream optimization passes to deal with:

(r29)[0] = ((r31)[0x54] * (sp)[0x90+var_40]) + (r31)[0x50] + (sp)[0x90+var_3C];
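The SSA half of this can be sketched as a renaming pass over a straight-line block: every register write gets a fresh version number, which is what lets a later pass tell the two loads into %r10 apart. The IR record shape here is invented for illustration:

```c
/* Three-address IR record with SSA version slots; -1 marks an unused
   operand. Invented shape, just enough to show the renaming. */
#define NUM_REGS 32

typedef struct {
  int dst, src_a, src_b;     /* guest register numbers, or -1 */
  int dst_ver, a_ver, b_ver; /* filled in by renaming */
} IrInsn;

/* Walk the block in order: reads see the current version of a
   register, and every write bumps the version, producing a fresh
   SSA name for each definition. */
static void rename_block(IrInsn *insns, int count) {
  int version[NUM_REGS] = {0};
  for (int i = 0; i < count; i++) {
    IrInsn *in = &insns[i];
    if (in->src_a >= 0) in->a_ver = version[in->src_a];
    if (in->src_b >= 0) in->b_ver = version[in->src_b];
    if (in->dst >= 0)   in->dst_ver = ++version[in->dst];
  }
}
```

Once every value has a unique name, folding the whole sequence into the single store shown above is a straightforward substitution.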

Control Flow Analysis

With a constructed Control Flow Graph it's possible to start trying to identify what the original flow control constructs were. It's not required to do this in a recompiler, as the output of the frontend will still be correct if passed through directly to the backend; however, the more information that can be provided to the backend the better. Compilers will often change for- and while-loops into post-tested loops (do-while) or specific processor forms, such as on PPC, which has a special branch counter instruction. By inspecting the structure of the graph it's possible to figure out which branches are loops vs. conditionals, where the loops are, and which basic blocks are included inside the body of each loop. By encoding this information in the IR the backend can do better local variable allocation by knowing which variables are accessible from which pieces of code, better code layout by knowing which side of a branch/loop is likely to be executed the most, etc.

CFA is also required for correct DFA - you could imagine scenarios where registers or locals are set before a jump and changed in the flow. You would not be able to perform the data propagation or collapsing without knowing for certain what the potential values of the register/local could be at any point in time.
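One way to find the loops CFA cares about is a depth-first search over the CFG looking for back edges - an edge to a block still on the DFS stack closes a loop. A minimal sketch using a fixed-size adjacency matrix:

```c
/* Loop discovery by DFS: an edge to a node that is still on the DFS
   stack is a back edge, i.e. it closes a loop. Fixed-size adjacency
   matrix keeps the sketch small; node 0 is assumed to be the entry. */
#define MAX_NODES 16

static int back_edges;

static void dfs(int n, int node_count, int edges[MAX_NODES][MAX_NODES],
                int *visited, int *on_stack) {
  visited[n] = 1;
  on_stack[n] = 1;
  for (int m = 0; m < node_count; m++) {
    if (!edges[n][m]) continue;
    if (on_stack[m]) back_edges++;          /* ancestor on stack: loop */
    else if (!visited[m]) dfs(m, node_count, edges, visited, on_stack);
  }
  on_stack[n] = 0;
}

static int count_back_edges(int node_count,
                            int edges[MAX_NODES][MAX_NODES]) {
  int visited[MAX_NODES] = {0}, on_stack[MAX_NODES] = {0};
  back_edges = 0;
  dfs(0, node_count, edges, visited, on_stack);
  return back_edges;
}
```

A real pass would record which edge closed which loop (and compute the loop body from it) rather than just counting, but the detection itself is this simple.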

Backend (IR -> MC)

I'll be doing an entire post (or more) about what I'm planning on doing here for this project, but for completeness here's an overview:

Backends can vary in both type and complexity. The simplest backend is an interpreter, executing the instruction stream produced by the frontend one instruction at a time (and ignoring most of the metadata attached). JITs can use the information to produce either simple basic block-based code chunks or entire method chunks. Or, as I'm aiming for, a compiler/linker can be used to do a whole bunch more.

Right now I'm targeting LLVM IR (plus some custom code) for the backend. This enables me to run the entire LLVM optimization pass over the IR to produce some of the best possible output, use the LLVM interpreter or JIT to get runtime translation, or the offline LLVM tools to generate executables that can be run on the host machine. The complex part of this process is getting the IR used in the rest of this process into LLVM IR, which is at a much higher level than machine instructions. Luckily others have already used LLVM for purposes like this, so at least it's possible!

 

21 Aug 2011

Emulation Books

I've been asked a few times if there are any good resources for learning about writing emulators, so I pulled some books off my shelf that I consider solid buys. Non-book resources are almost always code from real emulators, mainly recent projects for older systems (like JSNES).

Computer Architecture: A Quantitative Approach, John Hennessy and David Patterson

The classic computer architecture textbook. Having this on your bookshelf instantly makes you 10x cooler.

The Elements of Computing Systems, Noam Nisan and Shimon Schocken

A quick tour of an entire system, from gate logic through running user code. A bit brief, but it's well written and has a very nice flow.

Virtual Machines, James Smith and Ravi Nair

One of the few books specifically about virtual machines, this book touches on just about every type from the JVM to system level VM's like Xen. The balance of breadth and depth here is perfect - it's essentially a survey of the field with enough detail about common techniques and gotchas to really help in the research process.

Reversing: Secrets of Reverse Engineering, Eldad Eilam

The book on reverse engineering. Everything one needs to learn to reverse modern, complex applications, operating systems, and file formats. Can't recommend this book enough!

The IDA Pro Book, Chris Eagle

If you sink the money into a copy of IDA Pro, an extra bit of change for this book is well worth it to help get the most out of the tool. Covers some similar topics to Reversing, but with much more of a practical slant.

Hacker's Delight, Henry S. Warren, Jr.

The bible of perverse math hacks. When it comes to implementing some of the more bizarre instructions in a system (bit counting/swapping, numerical conversion, rounding, etc) a lot of the tricks covered in this book can help get the extra speed required.

 

On a related note, here's a book not to buy: Xbox 360 Forensics by Steven Bolt. It's horrible. Really bad. Avoid.

13 Aug 2011

Building an Xbox 360 Emulator, part 5: XEX Files

I thought it would be useful to walk through a XEX file and show what's all in there so that when I continue on and start talking about details of the loader, operating system, and CPU things make a bit more sense. There's also the selfish reason: it's been 6 months since I last looked at this stuff and I needed to remind myself ^_^

What's a XEX?

XEX files, or more accurately XEX2 files, are signed/encrypted/compressed content wrappers used by the 360. 'XEX' = Xbox EXecutable, and pretty close to 'EXE' - cute, huh? They can contain resources (everything from art assets to configuration files) and executable files (Windows PE format .exe files), and every game has at least one: default.xex.

When the Xbox goes to launch a title, it first looks for the default.xex file in the game root (either on the disc or hard drive) and reads out the metadata. Using a combination of a shared console key and a game-specific key, the executable contained within is decrypted and verified against an embedded checksum (and sometimes decompressed). The loader uses some extra bits of information in the XEX, such as import tables and section maps, to properly lay out the executable in memory and then jumps into the code.

Of interest to us here is:

  • What kind of metadata is in the XEX?
  • How do you get the executable out?
  • How do you load the executable into memory?
  • How are imports resolved?

The first part of this document will talk about these issues and then I'll follow on with a quick IDA walkthrough for reversing a few functions and making sure the world is sane.

Getting a XEX File

Disc Image

There are tons of ways to get a working disc image, and I'm not going to cover them here. The information is always changing, very configuration dependent, and unfortunately due to stupid US laws in a grey (or darker) area. Google will yield plenty of results, but in the end you'll need either a modded 360 or a very specific replacement drive and a decent external SATA adapter (cheap ones didn't seem to work for me).

The rest of this post assumes you have grabbed a valid and legally-owned disc image.

Extracting the XEX

Quite a few tools will let you browse around the disc filesystem and grab files, but the most reliable and easy to use that I've found is wx360. There are some command line ones out there, FUSE versions for OSX/etc, and some 360-specific disc imaging tools can optionally dump files. If you want some code that does this in a nice way, wait until I open up my github repo and look at XEGDFX.c.

You're looking for 'default.xex' in the root of the disc. For all games this is where the game metadata and executable code lies. A small number of games I've looked at have additional XEX files, but they are often little helper libraries or just resources with no actual code.

Once you've got the XEX file ready it's time to do some simple dumping of the contents.

Peeking into a XEX

The first tool you'll want to use is xorloser's XexTool. Grab it and throw both it and the included x360_imports.idc file into a directory with your default.xex.

02/25/2011  10:10 AM         3,244,032 default.xex
12/04/2010  12:25 AM           173,177 x360_imports.idc
12/07/2010  11:29 PM           185,856 xextool.exe

 
XexTool has a bunch of fun options. Start off by dumping the information contained in default.xex with the -l flag:

D:\XexTutorial>xextool.exe -l default.xex
XexTool v6.1  -  xorloser 2006-2010
Reading and parsing input xex file...

Xex Info
  Retail
  Compressed
  Encrypted
  Title Module
  XGD2 Only
  Uses Game Voice Channel
  Pal50 Incompatible
  Xbox360 Logo Data Present

Basefile Info
  Original PE Name:   [some awesome game].pe
  Load Address:       82000000
  Entry Point:        826B8B48
  Image Size:           B90000
  Page Size:             10000
  Checksum:           00000000
  Filetime:           4539666C - Fri Oct 20 17:14:36 2006
  Stack Size:            40000
.... a lot more ....

 
You'll find some interesting bits, such as the encryption keys (required for decrypting the contained executable), basic import information, contained resources, and all of the executable sections. Note that the XEX container has the information about the sections, not the executable. We will see later that most of the PE header in the executable is bogus (likely valid pre-packaging, but certainly not after).

Extracting the PE (.exe)

Next up we will want to crack out the PE executable contained within the XEX by using the '-b' flag. Since we will need it later anyway, also add on the '-i' flag to output the IDC file used by IDA.

D:\XexTutorial>xextool -b default.exe -i default.idc default.xex
XexTool v6.1  -  xorloser 2006-2010
Reading and parsing input xex file...
Successfully dumped basefile idc to default.idc
Successfully dumped basefile to default.exe

Load basefile into IDA with the following details
DO NOT load as a PE or EXE file as the format is not valid
File Type:       Binary file
Processor Type:  PowerPC: ppc
Load Address:    0x82000000
Entry Point:     0x826B8B48

 
Take a second to look at the output and note the size difference in the input .xex and output .exe:

02/25/2011  10:10 AM         3,244,032 default.xex
08/12/2011  11:04 PM        12,124,160 default.exe

 
In this case the XEX file was both encrypted and compressed. When XexTool spits out the .exe it not only decompresses it, but also pads in some of the data that was stripped when it was originally shoved into the .xex. Thinking about rotational speeds of DVDs and the data transfer rate, a 4x compression ratio is pretty impressive (and it makes me wonder why all PE's aren't packed like this...).

PE Info

You can try to peek at the executable, but the text section is PowerPC and most Microsoft tools shipped to the public don't even support looking at the headers in a PPC PE. Luckily there are some 3rd party tools that do a pretty good job of dumping most of the info. Using Matt Pietrek's pedump you can get the headers:

D:\XexTutorial>pedump /B /I /L /P /R /S default.exe > default.txt

 
Check out the results and see the PE headers. Note that most of them are incorrect and (I imagine) completely ignored by the real 360 loader. For example, the data directories and section table have incorrect addresses, although their sizes are fairly accurate. During XEX packing a lot of reorganizing and alignment is performed, and some sections are removed completely.

The two most interesting bits in the resulting file from a reversing perspective are the imports table and the runtime function table. The imports table data referenced here is pretty bogus, and instead the addresses from the XEX should be used. The Runtime Function Table, however, has valid addresses and provides a useful resource: addresses and lengths of most methods in the executable. I'll describe both of these structures in more detail later.

Loading a XEX

[I'll patch this up once I open the github repo, but for future reference XEXEX2LoadFromFile, XEXEX2ReadImage, and XEPEModuleLoadFromMemory are useful] Now that it's possible to get a XEX and the executable it contains, let's talk about how a XEX is loaded by the 360 kernel.

Sections

Take a peek at the '-l' output from XexTool and down near the bottom you'll see 'Sections'. What follows is a list of all blocks in the XEX, usually 64KB chunks (0x10000), and their type:

Sections
    0) 82000000 - 82010000 : Header/Resource
    1) 82010000 - 82020000 : Header/Resource
-- snip --
   12) 820C0000 - 820D0000 : Header/Resource
   13) 820D0000 - 820E0000 : Header/Resource
   14) 820E0000 - 820F0000 : Code
   15) 820F0000 - 82100000 : Code
-- snip --
  108) 826C0000 - 826D0000 : Code
  109) 826D0000 - 826E0000 : Code
  110) 826E0000 - 826F0000 : Data
  111) 826F0000 - 82700000 : Data
.... and many more ....

 
Usually they seem to be laid out as read-only data, code, and read-write data, always grouped together. All of the fancy sections in the PE are distilled down to these three types, likely because the 360 has these three memory protection modes. Everything is in 64KB chunks because that is the allocation granularity for memory pages. Those two things together explain why the headers in the PE don't match up with reality - the tool that takes .exe -> .xex and does all of this never fixes up the data after it butchers everything. When it comes to actually loading XEXs, all of these transformations make things easier: ignoring all of the decryption/decompression/checksum craziness, loading a XEX is usually as simple as a mapped file read/memcpy. Much, much faster than a normal PE file, and a very smart move on Microsoft's part.
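Under those assumptions, the core of a section loader reduces to something like this sketch. The structures are simplified stand-ins for the real XEX section records, and protection handling is elided:

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a XEX section record: real sections are
   64KB-aligned runs of one of three types. */
enum SectionType { SEC_RESOURCE, SEC_CODE, SEC_DATA };

typedef struct {
  uint32_t start;   /* guest address */
  uint32_t end;
  int      type;
} XexSection;

/* Copy each section's bytes from the decrypted image into guest
   memory; 'guest_base' is the host pointer backing 'load_address'
   (0x82000000 for a typical title). */
static void load_sections(uint8_t *guest_base, const uint8_t *image,
                          const XexSection *secs, size_t count,
                          uint32_t load_address) {
  for (size_t i = 0; i < count; i++) {
    uint32_t offset = secs[i].start - load_address;
    memcpy(guest_base + offset, image + offset,
           secs[i].end - secs[i].start);
    /* A real loader would now apply R / RX / RW protection per type. */
  }
}
```

This is the "mapped file read/memcpy" path: no relocation, no section reshuffling, just copies and page protections.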

Imports

Both the XEX and PE declare that they have imports, but the XEX ones are correct. Stored in the XEX is a multi-level table of import libraries (such as 'xboxkrnl.exe'), where each library is versioned and then references a set of import entries. You can see the import libraries in the XexTool output:

Import Libraries
    0) xboxkrnl.exe   v2.0.3529.0  (min v2.0.2858.0)
    1) xam.xex        v2.0.3529.0  (min v2.0.2858.0)

 
It's way out of scope here to talk about how the libraries are embedded, but you can check out XEXEX2.c in the XEX_HEADER_IMPORT_LIBRARIES bit and later on in the XEXEX2GetImportInfos method for more information. In essence, there is a list of ordinals exported by the import libraries and the addresses in the loaded executable where the resolved imports should be placed. For example, there may be an import for ordinal 0x26A from 'xboxkrnl.exe', which happens to correspond to 'VdRetrainEDRAMWorker'. The loader will place the real address of 'VdRetrainEDRAMWorker' in the location specified in the import table when the executable is loaded.

The best way to see the imports of an executable (without writing code) is to look at the default.idc file dumped previously by XexTool with the '-i' option. For each import library there will be a function containing several SetupImport* calls that give the location of the import table entry in the executable, the location to place the value in, the ordinal, and the library name. Cross reference x360_imports.idc to find the ordinal name and you can start to decode things:

SetupImportFunc(0x8200085C, 0x826BF554, 0x074, "xboxkrnl.exe"); --> KeInitializeSemaphore
SetupImportFunc(0x82000860, 0x826BF564, 0x110, "xboxkrnl.exe"); --> ObReferenceObjectByHandle
SetupImportFunc(0x82000864, 0x826BF574, 0x1F7, "xboxkrnl.exe"); --> XAudioGetVoiceCategoryVolumeChangeMask

 
An emulator would perform this process (a bit more efficiently than you or I) and resolve all imports to the corresponding functions in the emulated kernel.
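That resolution step can be sketched as a straightforward table walk; the structures here are simplified stand-ins for the real XEX import records and the emulated kernel's export list:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One import slot: an ordinal and the address the loader should patch. */
typedef struct {
  uint32_t ordinal;
  void    *slot;      /* where the resolved address gets written */
} ImportRecord;

/* One export of the emulated kernel: ordinal -> host implementation. */
typedef struct {
  uint32_t ordinal;
  void    *fn;
} KernelExport;

static void *lookup_export(const KernelExport *exports, size_t count,
                           uint32_t ordinal) {
  for (size_t i = 0; i < count; i++)
    if (exports[i].ordinal == ordinal) return exports[i].fn;
  return NULL;
}

/* Patch every import slot; returns how many ordinals had no export. */
static int resolve_imports(ImportRecord *imports, size_t import_count,
                           const KernelExport *exports,
                           size_t export_count) {
  int unresolved = 0;
  for (size_t i = 0; i < import_count; i++) {
    void *fn = lookup_export(exports, export_count, imports[i].ordinal);
    if (fn) memcpy(imports[i].slot, &fn, sizeof(fn));
    else    unresolved++;
  }
  return unresolved;
}
```

An emulator would keep the export list sorted or hashed rather than scanning linearly, but the shape of the operation is the same.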

Static Libraries

An interesting bit of information included in the XEX is the list of static libraries compiled into the executable and their version information:

Static Libraries
    0) D3DX9          v2.0.3529.0
    1) XGRAPHC        v2.0.3529.0
    2) XONLINE        v2.0.3529.0
.... plus a few more ....

 
In the future it may be possible to build a library of optimized routines commonly found in these static libraries by way of fingerprinting and version information. For example, being able to identify specific versions of memcpy or other expensive routines could allow for big speed-ups.
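One plausible fingerprinting scheme is to hash the raw bytes of each recognized function and look the hash up in a database keyed by library and version. This is just a sketch under that assumption — the database entries and names here are invented, and real matchers (IDA's FLIRT signatures, for instance) are fuzzier than an exact byte hash:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* 32-bit FNV-1a over raw function bytes -- one simple fingerprint choice. */
uint32_t fnv1a(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;           /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}

/* Hypothetical fingerprint database entry. */
typedef struct {
    const char *library;   /* e.g. "XGRAPHC"    */
    const char *version;   /* e.g. "2.0.3529.0" */
    uint32_t    hash;      /* fnv1a of the function body */
    const char *name;      /* routine to substitute an optimized host
                              implementation for                      */
} known_func_t;

/* Return the routine name if (library, version, hash) matches, else NULL. */
const char *identify(const known_func_t *db, size_t n,
                     const char *lib, const char *ver, uint32_t hash) {
    for (size_t i = 0; i < n; i++)
        if (db[i].hash == hash && !strcmp(db[i].library, lib) &&
            !strcmp(db[i].version, ver))
            return db[i].name;
    return NULL;
}
```

Since the XEX records the exact static library versions, the version field can prune the search: a hash only needs to be unique within one library release.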

IDA Pro

That's about it for what can be done by hand - now let's take a peek at the code!

Getting Setup

I'm using IDA Pro 6 with xorloser's PPCAltivec plugin (required) and xorloser's Xbox360 Xex Loader plugin (optional, but makes things much easier). If you don't want to use the Xex Loader plugin you can load the .exe and then run the .idc file to get pretty much the same thing, but I've had things go much smoother when using the plugin.

Go and grab a copy of IDA Pro 6, if you don't have it. No really, go get it. It's only a foreign money wire of a thousand dollars or two. Worth it. It takes a few days, so I'll wait...

Install the plugins and fire it up. You should be good to go! Close the wizard window and then drag/drop the default.xex file into the app. It'll automatically detect it as an XEX; don't touch anything, and let it go. Although you'll start seeing things right away, I've found it's best to let IDA crunch on things before moving around. Wait until 'AU: idle' appears in the bottom left status tray.

XEX Import Dialog

IDA Pro Initial View

Navigating

XEX files traditionally have the following major elements in this order (likely because they all come from the same linker):

  1. Import tables - a bunch of longs that will be filled with the addresses of imports on load
  2. Generic read-only data (string tables, constants/etc)
  3. Code (.text)
  4. Import function thunks (at the end of .text)
  5. Random security code

IDA and xorloser's XEX importer are nice enough to organize most things for you. Most functions are found correctly (although a few are missed or split incorrectly), you'll find the __savegpr/__restgpr blocks named for you, and all import thunks will be resolved.

The Anatomy of an Xbox Executable

I'm not going to do a full walkthrough of IDA (you either know already or can find it pretty easily), but I will show an example that reveals a bit about what the executables look like and how they function. It's fun to see how much you can glean about the tools and processes used to build the executable from the output!

Reversing Sleep

Starting simple and in a well-known space is generally a good idea. Scanning through the import thunks in my executable I noticed 'KeDelayExecutionThread'. That's an interesting function because it is fairly simple - the perfect place to get a grounding. Almost every game should call this, so look for it in yours (your addresses will differ).

.text:826BF9B4 KeDelayExecutionThread:                 # CODE XREF: sub_826B9AF8+5Cp
.text:826BF9B4                 li      %r3, 0x5A
.text:826BF9B8                 li      %r4, 0x5A
.text:826BF9BC                 mtspr   CTR, %r11
.text:826BF9C0                 bctr
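An emulator can pull those `li` immediates out of a thunk mechanically if it wants to identify which import a thunk refers to (what exactly the two 0x5A values encode isn't established here). As a minimal sketch, `li rD, IMM` is the simplified mnemonic for `addi rD, 0, IMM` — primary opcode 14 with the rA field zero:

```c
#include <stdint.h>
#include <assert.h>

/* Decode a PowerPC `li rD, IMM` instruction word.  `li` is really
   `addi rD, 0, IMM`: primary opcode 14, rA field = 0.
   Returns 1 and fills rd/imm on a match, 0 otherwise. */
int decode_li(uint32_t insn, unsigned *rd, int16_t *imm) {
    if ((insn >> 26) != 14)
        return 0;                        /* not addi */
    if (((insn >> 16) & 0x1Fu) != 0)
        return 0;                        /* rA != 0: plain addi, not li */
    *rd  = (insn >> 21) & 0x1Fu;         /* destination register */
    *imm = (int16_t)(insn & 0xFFFFu);    /* sign-extended immediate */
    return 1;
}
```

Running this over the words at a thunk's address (0x3860005A is the encoding of `li %r3, 0x5A` above) recovers the constants without any disassembler in the loop.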

 
All of the import thunks have this form - I'm assuming the loader overwrites their contents with the actual jump calls and none of the code here is used. Time to check out the documentation. MSDN shows KeDelayExecutionThread as taking 3 arguments and returning one:

NTSTATUS KeDelayExecutionThread(
  __in  KPROCESSOR_MODE WaitMode,
  __in  BOOLEAN Alertable,
  __in  PLARGE_INTEGER Interval
);

 
For a function as core as this it's highly likely that the signature has not changed between the documented NT kernel and the 360 kernel. This is not always the case (especially with some of the more complex calls), but is a good place to start. Right click and select 'Set function type' (or hit Y) and enter the signature:

int __cdecl KeDelayExecutionThread(int waitMode, int alertable, __int64* interval)

 
Because this is a kernel (Ke*) function it's unlikely that it is being called by user code directly. Instead, one of the many static libraries linked in is wrapping it up (my guess is 'xboxkrnl'). IDA shows that there is one referencing routine - almost definitely the wrapper. Double click it to jump to the call site.

Now we are looking at a much larger function wrapping the call to KeDelayExecutionThread:

IDA Pro View of KeDelayExecutionThread wrapper

Tracing back the parameters to KeDelayExecutionThread, waitMode is always constant but both interval and alertable come from the parameters to the function. Note that argument 0 ends up as 'interval' in KeDelayExecutionThread, has special handling for -1, and is scaled by a large multiplier, 10,000 (converting the interval from milliseconds to 100-ns units).  Down near the end you can see %r3 being set, which indicates that the function returns a value. From this, we can guess at the signature of the function and throw it into the 'Set function type' dialog:

int __cdecl DelayWrapper(int intervalMs, int alertable)

 
We can do one better than just the signature, though. This has to be some Microsoft user-mode API. Thinking about what KeDelayExecutionThread does and the arguments of this wrapper, the first thought is 'Sleep'. Win32 Sleep only takes one argument, but SleepEx takes two and matches our signature exactly!

Check out the documentation to confirm: MSDN SleepEx. Pretty close to what we got, right?

DWORD WINAPI SleepEx(
  __in  DWORD dwMilliseconds,
  __in  BOOL bAlertable
);
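The big multiplier spotted in the wrapper is the standard NT time conversion: KeDelayExecutionThread takes its interval as a count of 100-ns units, negative meaning "relative to now", so a millisecond timeout gets scaled by 10,000 and negated. A sketch of what the wrapper is presumably doing — the INFINITE handling here is an assumption, not read from the disassembly:

```c
#include <stdint.h>
#include <assert.h>

#define WIN32_INFINITE 0xFFFFFFFFu  /* Win32 INFINITE / the -1 special case */

/* Convert a Win32-style millisecond timeout into the relative interval
   KeDelayExecutionThread expects: negative = relative to now, in 100-ns
   units (1 ms = 10,000 * 100 ns).  Returning INT64_MIN for INFINITE is
   a placeholder sentinel; the real wrapper's -1 handling isn't shown. */
int64_t ms_to_nt_interval(uint32_t ms) {
    if (ms == WIN32_INFINITE)
        return INT64_MIN;               /* caller should wait forever */
    return -((int64_t)ms * 10000);
}
```

This is also why the wrapper takes an `__int64*`: the 100-ns interval overflows 32 bits for any timeout beyond a few minutes.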

 
Rename the function (hit N) and feel satisfied that you now have one of hundreds of thunks completed!

Reversed SleepEx


 

Now do all of the rest, and depth-first reverse large portions of the game code. You now know the hell of an emulator author. Hackers have it easy - they generally target very specific parts of an application to reverse... when writing an emulator, however, breadth often matters just as much as depth x_x

11Aug/111

Building an Xbox 360 Emulator, part 4: Tools and Information

Before going much further I figured it'd be useful to document the setup I've been using to do my reversing. It should make it easier to follow along, and if I forget myself I've got a good reference. Note that I'm going off of publicly searchable information here - everything I've been doing is sourced from Google (and surprisingly sometimes Bing, which ironically indexes certain 360 hacking related information better than Google does!).

I'll try to update this list if I find anything interesting.

Primary Sources

Various Internet Searches

There's a wealth of information out there on the interwebs, although it's sometimes hard to find. Google around for these and you'll find interesting things...

Most of the APIs I've been researching turn up hits on pastebin, providing snippets of sample code from hackers and what I assume to be legit developers. None of it is tagged or labeled, but the API names are unique enough to find exactly the right information. Most of the method signatures and usage information I've gathered so far have come from sources like this.

PowerPC references:

  • Altivec_PEM.pdf / AltiVec Technology Programming Environments Manual
  • MPCFPE_AD_R1.pdf / PowerPC Microprocessor Family: The Programming Environments
  • pem_64bit_v3.0.2005jul15.pdf / PowerPC Microprocessor Family: The Programming Environments Manual for 64-bit Microprocessors
  • PowerISA_V2.06B_V2_PUBLIC.pdf / Power ISA Version 2.06 Revision B
  • vector_simd_pem_v_2.07c_26Oct2006_cell.pdf / PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual
  • vmx128.txt
GPU references:
  • R700-Family_Instruction_Set_Architecture.pdf / ATI R700-Family Instruction Set Architecture
  • ParticleSystemSimulationAndRenderingOnTheXbox360GPU.pdf / interesting bits about memexport
Xbox 360 specific:
  • xbox360-file-reference.pdf / Xbox 360 File Specifications Reference

Free60

The Free60 project is probably the best source of information I've found. The awesome hackers there have reversed quite a bit of the system for the commendable purpose of running Linux/homebrew, and have got a robust enough toolchain to compile simple applications.

I haven't taken the time to mod and set up their stack on one of my 360s, but for the purpose of reversing and building an emulator it's invaluable to have documentation-through-code and the ability to generate my own executables that are far simpler than any real game. If not for the hard work of the ps2dev community I never would have been able to do my PSP emulator.

Microsoft

There's quite a bit of information out there if you know where to look and what to look for. Although not all specific to the 360, a lot of the details carry over.

mmlite

Microsoft Invisible Computing is a project that contains quite a bit of code providing a small RTOS for embedded systems. What makes it interesting (and how I found it) is that it includes several files supporting PowerPC architectures. It appears as though this code either came from the same place the Xbox team got theirs, or is itself the origin of some of the Xbox toolchain code.

Specifically, the src/crt/md/ppc/xxx.s file has the __savegpr/__restgpr magic that tools like XexTool patch in to IDA dumps. Instead of guessing what all those tiny functions do it's now possible to see the commented code and how it's used. I'm sure there's more interesting stuff in here too (possibly related to the CRT), but I haven't had the need for it yet.

DDI API

A complete listing of all Direct3D9 API calls down to driver mode, with all of the arguments to them. The user-mode D3D implementation on Windows provided by Microsoft calls down into this layer to pass on commands to the driver. On the 360, this API (minus all the fixed function calls) is the command buffer format! Although the exact command opcodes are not documented here, there's enough information (combined with tmbinc's code below) to ensure a somewhat-proper initial implementation.

Direct3D Shader Codes

The details of the compiled HLSL bytecode are all laid out in the Windows Driver Kit. Minus some of the Xenos-specific opcodes, it's all here. This should make the shader translation code much easier to write.

DirectX Effect Compiler

By using the command line effect compiler (fxc.exe) it is possible to both assemble and disassemble HLSL for the Xenos - with some caveats. After rummaging through some games I was able to get a few compiled shaders and by munging their headers get the standard fxc to dump their contents (as most shaders are just vs_3_0 and ps_3_0).

The effect compiler in XNA Game Studio does not include the disassembler, but does support direct output of HLSL to binaries the Xenos can run. This tool, combined with the shader opcode information, should be plenty to get simple shaders going.

tmbinc's gpu code / 'gpu-0.0.5.tar.gz'

Absolutely incredible piece of reversing here, amounting to the construction of an almost complete API for the Xenos GPU. Even includes information about how the Xbox-specific vfetch instruction is implemented. When it comes to processing the command buffers, this code will be critical to quickly getting things normalized.

Cxbx

Once thought to be a myth, Cxbx is an Xbox 1 emulator capable of running retail games. A very impressive piece of work, its code provides insights into the 360 by way of the Xbox 1 having a very similar NT-like kernel. Cxbx had an easier life by virtue of serving mainly as a runtime library and patching system (most of the x86 code is left untouched), but the details of a lot of the NT-like APIs are very interesting. Research suggests a lot has changed between the Xbox 1 and the 360 (almost enough to make me believe there was a complete rewrite in between), but some of the finer details, such as security handles, should be consistent.

abgx360

Although it may be some of the worst C I've ever seen, the raw amount of information contained within is mind-boggling - everything one needs to read discs and most of the files on those discs (and the meaning of all of that data) resides within a single 860KB, 16,000-line source file.

xextool (xor37h/Hitmen)

There are many 'xextool's out there, but this version written way back in 2006 is one of the few that does the full end-to-end XEX decryption and has source available. Having not been updated in half a decade (and having never been fully developed), it cannot decode modern XEX files but does have a lot of what abgx360 does in a much easier-to-read form.

XexTool (xorloser)

The best 'xextool' out there, this command line app does quite a bit. For my purposes it's interesting because it allows the dumping of the inner executable from XEX files along with a .idc file that IDA can use to resolve import names. By using this it's possible to get a pretty decent view of files in IDA and makes reversing much easier. Unfortunately (for me), xorloser has never released the code and it sounds like he never will.

PPCAltivec IDA plugin (xorloser)

Another tool released by xorloser (originally from Dean Ashton), this IDA plugin adds support for all of the AltiVec/VMX instructions used by the 360 (and various other PPC-based consoles). It's required when looking at real code, and when it comes time to implement a decoder for these instructions the included source will prove invaluable, as most of the opcodes are undocumented.