I've decided to name the project Xenia, so if you see that name thrown around that's what I'm referring to. I thought it was cute because of what it means in Greek (think host/guest as common terms in emulation) and it follows the Xenon/Xenos naming of the Xbox 360.
Xbox Operating System
The operating system on the Xbox 360 is commonly thought to be a pared-down version of the Windows 2000 kernel. Although it has a similar API, it is in fact a from-scratch implementation. The good news is that even though the implementation differs (although I'm sure there's some shared code) the API is really all that matters, and that API is largely documented and publicly reverse engineered.
Cross referencing some disassembled executables and xorloser's x360_imports file, there's a fairly even split of public vs. private APIs. For every KeDelayExecutionThread that has nice MSDN documentation there's a XamShowMessageBoxUIEx that is completely unknown. Scanning through many of the imported methods and their call signatures their behavior and function can be inferred, but others (like XeCryptBnQwNeModExpRoot) are going to require a bit more work. Some map to public counterparts that are documented, such as XamGetInputState being equivalent to XInputGetState.
One thing I've noticed while looking through a lot of the used API methods is that many are at a much lower level than one would expect. Since I know games aren't calling kernel methods directly, this must mean that the system libraries are actually statically compiled into the game executables. Let that sink in for a second. It'd be like if on Windows every single application had all of Win32 compiled into it. I can see why this would make sense from a versioning perspective (every game has the exact version of the library they built against always properly in sync), but it means that if a bug is found every game must be updated. What this means for an emulator, though, is that instead of having to implement potentially thousands of API calls there are instead a few hundred simple, low-level calls that are almost guaranteed to not differ between games. This simultaneously makes things easier and harder; on one hand, there are fewer API calls to implement and they should be easier to get right, but on the other hand there may be several methods that would have been much easier to emulate at a higher level (like D3D calls).
Xbox Kernel (xboxkrnl.exe)
Every Xbox has an xboxkrnl module on it, exporting an API similar to ntoskrnl on desktop Windows plus a bunch of additional APIs.
It provides quite a useful set of functionality:
- Program control
- Synchronization primitives (events, semaphores, critical sections)
- Memory management
- Common string routines (Unicode comparison, vsprintf, etc)
- Cryptographic providers (DES, HMAC, MD5, etc)
- Raw IO
- XEX file handling and module loading (like LoadLibrary)
- XAudio, XInput, XMA
- Video driver (Vd*)
Of the 800 or so methods present there are a good portion that are documented. Even better, Wine and ReactOS both have most method signatures and quite a few complete implementations.
Some methods are trivial to get emulated - for example, if the host emulator is built on Windows it can often pass down things like (Ex)CreateThread and RtlInitializeCriticalSection right down to the OS and utilize the optimized implementation there. Because the NT API is used there are a lot of these. Some aren't directly exposed to user code (as these are all kernel functions) but can be passed through the user-level functions with only a bit of rewriting. It's possible, with the right tricks, to make these calls directly on desktop Windows (it usually requires the Windows Device Driver Kit to be set up), which would be ideal.
The set that looks like it will be the hardest to properly get figured out are the video methods, like VdInitializeRingBuffer and VdRegisterGraphicsNotification, as it appears like the API is designed for direct writing to a command buffer instead of making calls. This means that, as far as the emulator is concerned, there are no methods that can be intercepted to do useful work - instead, at certain points in time, giant opaque data buffers must be processed to do interesting things. This can make things significantly faster by reducing the call overhead between guest code (the emulated PowerPC instructions) and host code, but makes reverse engineering what's going on much more difficult by taking away easily identifiable call sites and instead giving multi-megabyte data blobs. Ironically, if this buffer format can be reversed it may make building a D3D11/GL3 backend easier than if all the D3D9 state management had to be emulated perfectly.
XAM/Xbox ?? (xam.xex)
Besides the kernel there is a giant library that provides the bulk of the higher-level functionality in Xbox titles. Where the kernel is the tiny set of low-level methods required to build a functioning OS, XAM is the dumping ground for the rest.
- Winsock/XNet Xbox Live networking
- On-screen keyboard
- Message boxes/UI
- XContent package file handling
- Game metadata
- XUI (Xbox User Interface) loading/rendering/etc
- User-level file IO (OpenFile/GetFileSize/etc)
- Some of the C runtime
- Avatars and other user information
This is where things get interesting. Luckily it looks like XAM covers everything one can do on a 360, including what the dashboard and system use to do their work (like download queues and such), and not all games use all of the methods: most seem to only use a few dozen out of the 3000 exports.
In the context of getting an emulator up and running almost all of the methods can be ignored. Simple 'hello world' applications link in only about 4, and the games I've looked at largely depend on it for error messages and multiplayer functionality - if the emulator starts without any networking, most of those methods can be stubbed. I haven't seen a game yet that uses XUI for its user interface, so that can be skipped too.
Emulating the OS
Now that the Xbox OS is a bit more defined, let's sketch out how best to emulate it. There are two primary means of emulating system software: low-level emulation (LLE) and high-level emulation (HLE).
Most emulators for early systems (pre-1990s) use low-level emulation because the game systems didn't include an OS and often had a very minimal BIOS. The hardware was also simple enough (even though often undocumented) that it was easier to model the behavior of device registers than entire BIOS/OS systems - this is why early emulators often require a user to find a BIOS to work.
As hardware grew more complex and expensive to emulate high-level emulation started to take over. In HLE the BIOS/OS is implemented in the emulator code (so no original is required) and most of the underlying hardware is abstracted away. This allows for better native optimizations of complex routines and eliminates the need to rip the copyrighted firmware off a real console. The downside is that for well-documented hardware it's often easier to build hardware simulators than to match the exact behavior of complex operating systems.
I had good luck with building an HLE for my PSP emulator as the hardware was sufficiently complex as to make it impossible to support a perfect simulation of it. The Xbox 360 is the same way - no one outside of Microsoft (or ATI or whoever) knows how a lot of the hardware in the system operates (and may never know, as history has shown). We do know, however, how a lot of the NT kernel works and can stub out or mimic things like the dashboard UI when required. For performance reasons it also makes sense to not have a full-on simulation of things like atomic primitives and other crazy difficult constructs.
So there's one of the first pinned down pieces of the emulator: it will be a high-level emulator. Now let's dive in to some of the major areas that need to be implemented there. Note that these areas are what one would look at when building an operating system and that's essentially what we will be doing. These are roughly in the order of implementation, and I'll be covering them in detail in future posts.
Before code can be loaded into memory there has to be a way to allocate that memory. In an HLE this usually means implementing the common alloc/free methods of the guest operating system - in the Xbox's case this is the NtAllocateVirtualMemory method cluster. Using the same set of memory routines for all internal host emulator functions (like module loading, mentioned below) as well as requests from game code keeps things simple and reliable. Since the NT-like API of the 360 matches the Windows API it means that in almost all cases the emulator can use the host memory manager for all its requests. This ensures performance (as it can't get any faster than that) and safety (as read/write/execute permissions will be enforced). Since everything is working in a sane virtual address space it also means that debugging things is much easier - memory addresses as visible to the emulated game code will correspond to memory addresses in the host emulator space. Calling host routines (like memcpy or video decompression libraries) requires no fixups, and embedded pointers should work without the need for translation.
With my PSP emulator I made the mistake of not doing this first and ended up with two completely different memory spaces and ways of referencing addresses. Lesson learned: even though it's tempting to start loading code first, figuring out where to put it is more important.
There are two minor annoyances that make emulating the 360 a bit more difficult than it should be:
The 360 is big-endian and as such data structures will need to be byte swapped before and after system API calls. This isn't nearly as elegant as being able to just pass things around. It's possible to write some optimized swap routines for specific structures that are heavily used such that they get inserted directly into translated code as optimally as possible, but it's not free.
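To make the byte-swapping concrete, here's a minimal sketch of the kind of helpers involved. The X_CRITICAL_SECTION structure and its field names are hypothetical stand-ins for whatever guest structures actually cross the API boundary; the point is that every multi-byte field gets swapped in place before the host implementation touches it (and swapped back before returning to guest code):

```cpp
#include <cstdint>

// Generic 32/64-bit byte swaps; decent compilers reduce these to a single
// bswap instruction on x86.
inline uint32_t swap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) | (v << 24);
}
inline uint64_t swap64(uint64_t v) {
  return ((uint64_t)swap32((uint32_t)v) << 32) |
         (uint64_t)swap32((uint32_t)(v >> 32));
}

// Hypothetical guest structure crossing a system call boundary; every
// field lives big-endian in guest memory, so it is swapped in place
// before the host reads it and swapped back before control returns.
struct X_CRITICAL_SECTION {
  uint32_t lock_count;
  uint32_t recursion_count;
  uint32_t owning_thread;
};

inline void SwapCriticalSection(X_CRITICAL_SECTION* cs) {
  cs->lock_count = swap32(cs->lock_count);
  cs->recursion_count = swap32(cs->recursion_count);
  cs->owning_thread = swap32(cs->owning_thread);
}
```

For hot structures these per-field swaps could be specialized and inlined directly into the translated code, as mentioned above, but the shape of the work is the same.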
On 64-bit Windows all pointers are 64-bits. This makes sense: developers want the large address space that 64-bit pointers gives them so they can make use of tons of RAM. Most applications will be well within the 4GB (or really 2GB) memory limit and be wasting 4 bytes per pointer, but memory is cheap so no one cares. The Xbox 360, on the other hand, has only 512MB of memory to be shared between the OS, game, and video memory. Wasting 4 bytes per pointer when no address can ever fall outside the 32-bit range is too much, so the Xbox compiler team did the logical thing and made pointers 4 bytes in their 64-bit code.
This sucks for an emulator, though, as it means that host pointers cannot be safely round-tripped through the guest code. For example, if NtAllocateVirtualMemory returns a pointer that spills over 4 bytes things will explode when that 8 byte pointer is truncated and forced into a 4 byte guest pointer. There are a few ways around this, none of which are great, but the easiest I can think of is to reserve a large 512MB block that represents all of the Xbox memory at application start and ensure it is entirely within the 32-bit address range. This is easy with NtAllocateVirtualMemory (if I decide to use kernel calls in the host) but also possible with VirtualAlloc, although not as easy when talking about 512MB blocks. If all future allocations are made from this space it means that pointers can be passed between guest and host without worrying about them being outside the allowable range.
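The round-trip problem can be sketched in a few lines. The class below is illustrative only: a plain heap block stands in for the 512MB reservation (a real build would use VirtualAlloc or NtAllocateVirtualMemory to pin the block below the 4GB mark so guest and host addresses can be identical). As a fallback, base-relative guest addresses round-trip safely even when the host block lands above 4GB:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch: one contiguous block represents all guest memory.
// Guest addresses are offsets from the block base, so converting between
// a 64-bit host pointer and a 32-bit guest pointer never truncates.
class GuestMemory {
 public:
  explicit GuestMemory(size_t size) : block_(size) {}

  // Host pointer -> 32-bit guest address (offset into the block).
  uint32_t HostToGuest(const void* p) const {
    return (uint32_t)((const uint8_t*)p - block_.data());
  }
  // 32-bit guest address -> host pointer.
  void* GuestToHost(uint32_t addr) { return block_.data() + addr; }

 private:
  std::vector<uint8_t> block_;
};
```

With the reserved-below-4GB approach the two conversion functions collapse into plain casts, which is why it's worth fighting the allocator to get that placement.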
Executables and Modules
Operating systems need a way to load and execute bundles of code. On the 360 these are XEX files, which are packages that contain a bunch of resources detailing a game as well as an embedded PE-formatted EXE/DLL file containing the actual code. The emulator will then require a loader that can parse one of these files, extract the interesting content, and place it into memory. Any imports, like references to kernel methods implemented in the emulator, will be resolved and patched up and exports will be cataloged until later used. Finally the code can be submitted to the subsystem handling translation/JIT/etc for actual execution.
There are a few distinct components here:
XEX Parsing
This is fairly easy to do as the XEX file format is well documented and there are many tools out there that can load it. Basic documentation is available on the Free60 site but the best resource is working code and both the xextool_v01 and abgx360 code are great (although the abgx360 source is disgusting and ugly). Some things of note are that all metadata required by various Xbox API calls like XGetModuleSection are here, as well as fun things to pull out that the dashboard usually consumes like the game icon and title information.
PE (Portable Executable) Parsing
The PE file format is how Microsoft stores its binaries (equivalent to ELF on Unix) for both executables and dynamically linked libraries - the only difference between an EXE and a DLL is its extension, from a format perspective. Inside each XEX is a PE file in the raw. This is great, as the PE format is officially documented by Microsoft and kept up to date, and surprisingly they document the entire spec including the PowerPC variant.
A PE file is basically a bunch of regions of data (called sections); some are placed there by the linker (such as the TEXT section containing compiled code and DATA section containing static data) and others can be added by the user as custom resources (like localized strings/etc). Two other special sections are IDATA, describing what methods need to be imported from system libraries, and RELOC, containing all the symbols that must be relocated off of the base address of the library.
Once the PE is extracted from the XEX it's time to get it loaded. This requires placing each section in the PE at its appropriate location in the virtual address space and applying any relocations that are required based on the RELOC section. After that an ahead-of-time translator can run over the instructions in memory and translate them into the target machine format. The translator can use the IDATA section to patch imported routines and syscalls to their host emulator implementations and also log exported code if it will be used later on. This is a fairly complex dance and I'll be describing it in a future post. For now, the things to note are that the translated machine code lives outside of the memory space of the emulator and data references must be preserved in the virtual memory space. Think of the translated code and data as a shadow copy - this way, if the game wants to read itself (code or data) it would see exactly what it expects: PowerPC instructions matching those in the original XEX.
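The relocation step, at least, is simple enough to sketch. Real PE base relocations are packed per-page with type bits (and on the 360 the stored values are big-endian), but flattened out for clarity the loader's job looks like this:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Each entry names the offset of a 32-bit absolute address in the loaded
// image that assumed the preferred base address; rebasing means adding the
// delta between the actual and preferred bases to every such slot.
struct Relocation {
  uint32_t offset;  // byte offset of a 32-bit absolute address in the image
};

void ApplyRelocations(std::vector<uint8_t>& image,
                      const std::vector<Relocation>& relocs,
                      uint32_t preferred_base, uint32_t actual_base) {
  const uint32_t delta = actual_base - preferred_base;
  for (const auto& r : relocs) {
    uint32_t value;
    std::memcpy(&value, &image[r.offset], sizeof(value));
    value += delta;  // a real loader would swap big-endian values here
    std::memcpy(&image[r.offset], &value, sizeof(value));
  }
}
```
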
Module Data Structures
After loading and translating a module there is a lot of information that needs to stick around. Both the guest code and the emulator will need to check things in the future to handle calls like LoadLibrary (returning a cached load), GetProcAddress (getting an export), or XGetModuleSection (getting a raw section from the PE). This means that for everything loaded from a XEX and PE there will need to be in-memory counterparts that point at all the now-patched and translated resources.
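A rough shape for that bookkeeping, with illustrative field names (not the real kernel structures), might be:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Per-module state kept after load: enough to answer later queries like
// GetProcAddress (exports) and XGetModuleSection (raw PE sections).
struct Section {
  std::string name;
  uint32_t base;  // guest virtual address
  uint32_t size;
};

struct Export {
  std::string name;
  uint32_t address;  // guest virtual address of the (translated) routine
};

struct Module {
  std::string name;
  uint32_t base_address;
  std::vector<Section> sections;
  std::vector<Export> exports;

  // GetProcAddress-style lookup; returns 0 if not found.
  uint32_t FindExport(const std::string& n) const {
    for (const auto& e : exports) {
      if (e.name == n) return e.address;
    }
    return 0;
  }
};
```
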
Interface for Processor Subsystem
One thing I've been glossing over is how this all interacts with the subsystem that is doing the translation/JIT/etc of the PowerPC instructions. For now, let's just say that there has to be a decent way for it to interact with the loader so that it can get enough information to make good guesses about what code does, how the code interacts with the rest of the system, and notify the rest of the emulator about any fixes it performs on that code.
Threads and Synchronization
Because the 360 is a multi-core system it can be assumed that almost every game uses many software threads. This means that threading primitives like CreateThread, Sleep, etc will be required as well as all the synchronization primitives that support multi-threaded applications (locks and such). Because the source API is fairly high-level most of these should be easy to pass down to the host OS and not worry too much about except where the API differs.
This is in contrast to what I had to do when working on my PSP emulator. There, the Sony threading APIs differed enough from the normal POSIX or Win32 APIs that I had to actually implement a full thread scheduler. Luckily the PSP was single-core, meaning that only one thread could be running at a time and a lot of the synchronization primitives could be faked. It also greatly reduced the JIT complexity as only one thread could be generating code at a time and it was easy to ensure the coherency of the generated code.
A 360 emulator must fully utilize a multi-core host in order to get reasonable performance. This means that the code generator has to be able to handle multiple threads executing and potentially generating code at the same time. It also means that a robust thread scheduler has to be able to handle the load of several threads (maybe even a few dozen) running at the same time with decent performance. Because of this I'm deciding to try to use the host threading system instead of writing my own. The code generator will need to be thread safe, but all threading and synchronization primitives will defer to the host OS. Windows, as a host, will have a much better thread scheduler than I could ever write and will enable some fancy optimizations that would otherwise be unattainable, such as pinning threads to certain CPUs and spreading out threads such that they share cores when they would have on the original hardware to more closely match the performance characteristics of the 360.
File IO
Unlike threading primitives the IO system will need to be fully emulated. This is because on a real Xbox it's reading the physical DVD, whereas the emulator will be sourcing from a DVD image or some other file.
Most calls found in games come in two flavors:
Low-Level IO (Io* calls)
These kernel-level calls include such lovely methods as IoCreateDevice and IoBuildDeviceIoControlRequest. Since they are not usually exposed to user code my hope is that full general implementations won't be required and they will be called in predictable ways. Most likely they are used to access read-only game DVD data, so supporting custom drivers that direct requests down to the image files should be fairly easy (this is how tools on Windows that let you mount ISOs work). Once things like memory cards and hard drives are supported things get trickier, but it's not impossible and can be skipped initially.
High-Level IO (Nt* calls)
Roughly equivalent to the Win32 file API, NtOpenFile, NtReadFile, and various other functions allow for easier implementation of file IO. That said, if a full implementation of the low-level Io* routines needs to be implemented anyway it may make sense to implement these as calls onto that layer. The reason is that the Xbox DVD format uses a custom file system that will need to be built and kept in memory and calling down to the host OS file system won't really happen (although there are situations where I could imagine it being useful, such as hot-patching resources).
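A hedged sketch of that layering: the Nt* entry points resolve paths against an in-memory file system standing in for the parsed disc image, never the host OS. The names and signatures here are illustrative, not the real kernel API (which deals in handles, IO status blocks, and so on):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// In-memory stand-in for the parsed game disc file system.
class DiscFileSystem {
 public:
  void AddFile(const std::string& path, std::vector<uint8_t> data) {
    files_[path] = std::move(data);
  }
  const std::vector<uint8_t>* Lookup(const std::string& path) const {
    auto it = files_.find(path);
    return it == files_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, std::vector<uint8_t>> files_;
};

// NtReadFile-flavored entry point: copy out of the image-backed file
// instead of touching the host file system.
bool EmuReadFile(const DiscFileSystem& fs, const std::string& path,
                 uint32_t offset, void* buffer, uint32_t length,
                 uint32_t* bytes_read) {
  const auto* file = fs.Lookup(path);
  if (!file || offset >= file->size()) return false;
  uint32_t n = (uint32_t)std::min<size_t>(length, file->size() - offset);
  std::memcpy(buffer, file->data() + offset, n);
  *bytes_read = n;
  return true;
}
```

The low-level Io* layer would then just be another consumer of the same DiscFileSystem-style object, which is the sharing argument made below.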
Just like the memory management functions are best to be shared throughout both guest and host code, so are these IO functions. Getting them implemented early means less code later on and a more robust implementation.
A lot of the code for these methods can be found in ReactOS, but unfortunately they are GPL (ewww). That means some hack-tastic implementations will probably be written from scratch.
Audio and Video
Once more than 'hello world' applications are running things like audio and video will be required. Due to Microsoft pushing the XNA brand and libraries a lot of the technologies used by the Xbox are the same as they are on Windows. Video files are WMV and play just fine on the desktop, and audio is processed through XAudio2, which maps easily to the equivalent desktop APIs.
That said, the initial versions of the emulator will have to try hard to skip over all of this stuff. Games are still perfectly playable without cut-scenes or music, and it's enough to know it's possible to continue on with implementation.
Static Linking Verification
As mentioned above it looks like many system methods get linked in to applications at compile-time. To quickly verify that this is happening I disassembled some games and looked at the import for KeDelayExecutionThread (I figured it was super simple). In every game I looked at there was only one caller of this method and that caller was identical. Since KeDelayExecutionThread is essentially sleep I looked at the x360_imports file and found both Sleep and RtlSleep. Sleep, being a Win32 API, is most likely identical to the signature of the desktop version so I assumed it took 1 parameter. The parent method to KeDelayExecutionThread takes 2, which means it can't be Sleep but is likely RtlSleep. The parent of this RtlSleep method takes exactly one parameter, sets the second parameter to 0, and calls down - sounds like Sleep! So then even though xboxkrnl exports Sleep, RtlSleep, and KeDelayExecutionThread the code for both Sleep and RtlSleep are compiled into the game executable instead of deferring to xboxkrnl. I have no idea why xboxkrnl exports these methods if games won't use them (it would certainly make life easier for me if they weren't there), but since it seems like no one is using them they can probably be skipped in the initial implementation.
Patching high-level APIs
Not everything is best emulated at a low level; for both performance and implementation quality, some routines are worth handling at a higher level.
To see this clearly take memcpy, a C runtime method that copies large blocks of memory around. Right now this method is compiled into every game which makes it difficult to hook into, unlike CreateThread and other exports of the kernel. Of course it'll work just fine to emulate the compiled PowerPC code (as to the emulator it's just another block of instructions), but it won't be nearly as fast as it could be. I'll dive into this more in a future article, but the short of it is that an emulated memcpy will require thousands to tens of thousands of host instructions to handle what is basically a few hundred instruction method. That's because the emulator doesn't know about the semantics of the method: copy, bit for bit, memory block A to memory block B. Instead it sees a bunch of memory reads and writes and value manipulation and must preserve the validity of that data every step of the way. Knowing what the code is really trying to do (a copy) would enable some optimized host-specific code to do the work as fast as possible.
The problem is that identifying blocks of code is difficult. Every compiler (and every version of that compiler), every runtime (and every version of that runtime), and every compiler/optimizer/linker setting will subtly or not-so-subtly change the code in the executable such that it'll almost always differ. What is memcpy in Game A may be totally different from memcpy in Game B, even though they perform the same function and may have originated from the same source code.
There are three ways around this:
- Ignore it completely
- Specific signature matching
- Fuzzy signature matching
The first option isn't interesting (although it'll certainly be how the emulator starts out).
Matching specific signatures requires a database of subroutine hashes that map to some internal call. When a module is loaded the subroutines are discovered, the database is queried, and matching methods are patched up. The problem here is that building that database is incredibly difficult - remember the massive number of potential variations of this code? It's a good first step and the database/patching functionality may be required for other reasons (like skipping unimplemented code in certain games/etc), but it's far from optimal.
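The specific-signature approach is simple enough to sketch. This is an assumed shape, not Xenia's actual design: hash a discovered subroutine's raw PowerPC bytes (FNV-1a here, purely for simplicity) and look the hash up in a database mapping known routines to native handlers:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// FNV-1a over the subroutine's raw instruction bytes. Any stable hash
// works; the fragility is in the inputs (compiler/linker variations),
// not the hash function.
uint64_t HashSubroutine(const uint8_t* code, size_t length) {
  uint64_t h = 1469598103934665603ull;  // FNV-1a 64-bit offset basis
  for (size_t i = 0; i < length; ++i) {
    h ^= code[i];
    h *= 1099511628211ull;  // FNV 64-bit prime
  }
  return h;
}

using NativeHandler = std::function<void()>;

class SignatureDatabase {
 public:
  void Register(uint64_t hash, NativeHandler handler) {
    handlers_[hash] = std::move(handler);
  }
  // Returns the native replacement if this subroutine matches a known
  // signature, or nullptr to fall back to normal translation.
  const NativeHandler* Match(const uint8_t* code, size_t length) const {
    auto it = handlers_.find(HashSubroutine(code, length));
    return it == handlers_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<uint64_t, NativeHandler> handlers_;
};
```

The weakness is exactly as described: one changed instruction anywhere in the routine produces a different hash, which is what motivates the fuzzy approach next.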
The really interesting method is fuzzy signature matching. This is essentially what anti-virus applications do when trying to detect malicious code. It's easy for virus authors to obfuscate/vary their code on each version of their virus (or even each copy of it), so very sophisticated techniques have been developed for detecting these similar blocks of code. Instead of the above database containing specific subroutine hashes a more complex representation would allow for an analysis step to extract the matching methods. This is a hard problem, and would take some time, but it'd be a ton of fun.
We've now covered the 3 major areas of the emulator in some level of detail (CPU, GPU, and OS) and now it's getting to be time to write some code. Before starting on the actual emulator, though, one major detail needs to be nailed down: what does the code translator look like? In the next post I'll experiment with a few different ways of building a CPU emulator and detail their pros and cons.
Following up from last post, which dove into the Xbox 360 CPU, this post will look at the GPU.
- ATI R500 equivalent at 500MHz
- 48 shader pipeline processors
- 8 ROPs - 8-32 gigasamples/second
- 6 billion vertices/second, 500 million triangles/second, 48 billion shader ops/second
- Shader Model 3.0+
- 26 texture samplers, 16 streams, 4 render targets, 4096 VS/PS instructions
- VFETCH, MEMEXPORT
The Xenos GPU was derived from a desktop part right around the time when Direct3D 10 hardware was starting to develop. It's essentially Direct3D 9 with a few additions that enable 10-like features, such as VFETCH. Performance-wise it is quite slow compared to modern desktop parts as it was a bit too early to catch the massive explosion in generally programmable hardware.
The Xenos is great, but compared to modern hardware it's pretty puny. Whereas the Xenon (CPU) is a bit closer to desktop processors, the GPU hardware has been moving forward at an amazing pace and it's clearly visible when comparing the specs.
- Modern (high-end) GPUs run at 800+MHz - many more operations/second.
- Modern GPUs have orders of magnitude more shader processors (multiplying the clock speed change).
- 32-128 ROPs multiply out all the above even more.
- Most GPUs have shader ops measured in trillions of operations per second.
- All stats (for D3D10+) are way over the D3D9 version used by Xenos (plenty of render targets/etc).
Assuming Xenos operations could be run on a modern card there should be no problem completing them with time to spare. The host OS takes a bit of GPU time to do its compositing, but there is plenty of memory and spare cycles to handle a Xenos-maxing load.
One unique piece of Xenos is the vfetch shader instruction, available from both vertex and pixel shaders, which gives shader programs the ability to fetch arbitrary vertex data from samplers set to vertex buffers. This instruction is fairly well documented because it is usable from XNA Game Studio, and some hardcore demoscene guy actually reversed a lot of the patch up details (available here with a bunch of other goodies). It also looks like you can do arbitrary texture fetches (TFETCH?) in both vertex and pixel shaders - kind of tricky.
Unfortunately, the ability to sample from arbitrary buffers is not something possible in Direct3D 9 or GL 2. It's equivalent to the Buffer.Load call in HLSL SM 4+ (starting in Direct3D 10).
MEMEXPORT
Unlike vfetch, this shader instruction is not available in XNA Game Studio and as such is much less documented. There are a few documents and technical papers out there on the net describing what it does, which as far as I can tell is similar to the RWBuffer type in HLSL SM5 (starting in Direct3D 11). It basically allows structured write of resource buffers (textures, vertex buffers, etc) that can then be read back by the CPU or used by another shader.
This will be the hardest thing to fully support due to the lack of clear documentation and the fact that it's a badass instruction. I'm hoping it has some fatal flaw that makes it unusable in real games such that it won't need to be implemented...
Emulating a Xenos
So we know the performance exists in the hardware to push the triangles and fill the pixels, but it sounds tricky. VFETCH is useful enough to assume that every game is using it, while the hope is that MEMEXPORT is hard enough to use that no game is. There are several big open questions that need more investigation to say for sure just how feasible this project is:
- Is it possible to translate compiled shader code from xvs_3_0/xps_3_0 -> SM4/5? (Sure... but not trivial)
- Can VFETCH semantics be implemented in SM4/5? (I think yes, from what I've seen)
- Can MEMEXPORT semantics (whatever they are) be implemented in SM4/5?
- Special z-pass handling may be needed (seen as 'zpass' instruction) - may require replicating draw calls and splitting shaders!
Unlike the CPU, which I'm pretty confident can be emulated, the Xenos is a lot trickier. Ignoring the more advanced things like MEMEXPORT for a second there is a tremendous amount of work that will need to be done to get anything rendering once the rest of the emulator is going due to the need to translate shader bytecode. The XNA GS shader compiler can compile and disassemble shaders, which is a start for reversing, but it'll be a pain.
Because of all the GPGPU-ish stuff happening it seems like for at least an initial release Direct3D 11 (with feature level 10.1) is the way to go. I was really hoping to be cross-platform right away, but I'm not positive OpenGL has the necessary support.
So after a day of research I'm about 70% confident I could get something rendering. I'm about 20% confident with my current knowledge that a real game that fully utilized the hardware could be emulated. If someone like the guy who reversed the GPU hardware interface decided to play around, though, that number would probably go up a lot ^_^
The next post will talk about the operating system and the software side of the emulator.
Emulators are complex pieces of software and often push the bounds of what's possible by nature of having to simulate different architectures and jump through crazy hoops. When talking about the 360 this gets even crazier, as unlike when emulating an SNES the Xbox is a thoroughly modern piece of hardware and in some respects is still more powerful than most mainstream computers. So there's the first feasibility question: is there a computer powerful enough to emulate an Xbox 360? (sneak peek: I think so)
Now assume for a second that a sufficiently fast emulator could be built and all the hardware exists to run it: how would one even know what to emulate? Gaming hardware is almost always completely undocumented and very special-case stuff. There are decades-old systems that are just now being successfully emulated, and some may never be possible! Add to the potential hardware information void all of the system software, usually locked away under super strong NDA, and it looks worse. It's amazing what a skilled reverse engineer can do, but there are limits to everything. Is there enough information about the Xbox 360 to emulate it? (sneak peek: I think so)
The Xbox 360 is an embedded system, geared towards gaming and fairly specialized - but at the end of the day it's derived from the Windows NT kernel and draws with DirectX 9. The hardware is all totally custom (CPU/GPU/memory system/etc), but roughly equivalent to mainstream hardware with a 64-bit PPC chip like those shipped in Macs for a while and an ATI video chipset not too far removed from a desktop card. Although it's not going to be a piece of cake and there are some significant differences that may cause problems, this actually isn't the worst situation.
The next few posts will investigate each core component of the system and try to answer the two questions above. They'll cover the CPU, GPU, and operating system.
- 64-bit PowerPC w/ in-order execution and running big-endian
- 3.2GHz, 3 physical cores/6 logical cores
- L1: 32KB instruction/32KB data, L2: 1MB (shared)
- Each core has 32 integer, 32 floating-point, and 128 vector registers
- Altivec/VMX128 instructions for SIMD floating-point math
- ~96GFLOPS single-precision, ~58GFLOPS double-precision, ~9.6GFLOPS dot product
The PowerPC instruction set is RISC - this is a good thing, as it's a fairly small set of instructions (relative to x86) - though that alone doesn't make things much easier. Building a translator from PPC to x86-* is a non-trivial piece of work, but not that bad. There are some considerations to take into account when translating the instruction set and worrying about performance, highlighted below:
- Xenon is 64-bit - meaning that it uses instructions that operate on 64-bit integers. Emulating 64-bit on 32-bit instruction sets (such as x86) is not only significantly more code but also at least 2x slower. May mean x86-64 only, or letting some other layer do the work if 32-bit compatibility is a must.
- Xenon uses in-order execution - great for simple/cheap/power-efficient hardware, but bad for performance. Optimizing compilers can only do so much, and instruction streams meant for in-order processors should generally run faster on out-of-order processors like the x86.
- The shared L2 cache, at 1MB, is fairly small considering there is no L3 cache. General memory accesses on the 360 are fast, but not as fast as the 8MB+ L3 caches commonly found in desktop processors.
- PPC has a large register file at 32I/32F/128V relative to x86 at 6I/8F/8V and x86-64 at 12I/16F&V - assuming the PPC compiler is fully utilizing them (or the game developers are, and it's safe to bet they are) this could cause a lot of extra memory swaps.
- Being big-endian makes things slightly less elegant, as all loads and stores to memory must take this into account. Operations on registers are fine (a lot of the heavy math where perf really matters), but because of all the clever bit-twiddling hacks out there, values in memory must always be stored in correct big-endian byte order. This is the biggest potentially scary performance issue, I believe.
Luckily there is a tremendous amount of information out there on the PowerPC. There are many emulators that have been constructed, some of which run quite fast (or could with a bit of tuning). The only worrisome area is around the VMX128 instructions, but it turns out there are very few instructions that are unique to VMX128 and most are just the normal Altivec ones. (If curious, the v*128 named instructions are VMX128 - the good news is that they've been documented enough to reverse).
'6 cores' sounds like a lot, but the important thing to remember is that those are hardware threads, not physical cores. Comparing against a desktop processor it's really 3 physical cores at 3.2GHz. Modern Core i7s have 4-6 physical cores with 8-12 hardware threads - enough to pin each thread used on a 360 to its own dedicated hardware thread on the host.
There is of course extra overhead running on a desktop computer: you've got both other applications and the host OS itself fighting for control of the execution pipeline, caches, and disk. Having 2x the hardware resources, though, should be plenty from a raw computing standpoint:
- SetThreadAffinityMask/SetThreadIdealProcessor and equivalent functions can control hardware threading.
- The properties of out-of-order execution on the desktop processors should allow for better performance of hardware threads vs. the Xenon.
- The 3 hardware cores are sharing 1MB of L2 on the Xenon vs. 8-16MB L3 on the desktop so cache contention shouldn't happen nearly as often.
- Extra threads on the host can be used to offload tasks that on a real Xenon are sharing time with the game, such as decompression.
The Xbox marketing guys love to throw around their fancy GFLOP numbers, but in reality they are not all that impressive. Due to the aforementioned in-order execution and the strange performance characteristics of a lot of the VMX128 instructions it's almost impossible to hit the reported numbers in anything but carefully crafted synthetic benchmarks. This is excellent, as modern CPUs are exceeding the Xenon numbers by a decent margin (and sometimes by several multiples). The number of registers certainly helps the Xenon out but only experimentation will tell if they are crucial to the performance.
Emulating a Xenon
With all of the above analysis I think I can say that it's not only possible to emulate a Xenon, but it'll likely be sufficiently fast to run well.
To answer the first question above: by the time a Xenon emulation is up to 95% compatibility the target host processors will be plenty fast; in a few years it'll almost seem funny that it was ever questioned.
And is there enough information out there? So far, yes. I spent a lot of nights reverse engineering the special instructions on the PSP's processor, and the Xenon is at least as well documented today. The Free60 project has a decent toolchain but is lacking some of the VMX128 instructions, which will make testing things more difficult - not impossible, though.
Combined with some excellent community-published scripts for IDA Pro (which I have to buy a new license of... ack $$$ as much as a new MacBook) the publicly available information and some Redbull should be enough to get the Xbox 360 CPU running on the desktop.
The next post will focus on the GPU, detailing its performance characteristics, features, and relation to modern desktop hardware.
A while back I worked on the first PlayStation Portable emulator - it was a lot of fun and I learned a lot about embedded systems, graphics, and system design. After I got my first few commercial games running I considered the project a success and let it go dormant; the goal was not to build a full software product to run games but to use the exercise as a way to explore a really esoteric but deeply technical problem. It was surprising (and great!) to see a group of guys pick up some of my code and start work on JCPSP - they've been going now for 3 years and have made tremendous progress. It's fun to look at their screenshots and see my original UI/icons, and I'm happy to have contributed to the effort, even if not directly.
It's been a few years now since my last work at that level and tech has moved forward, hardware has improved, and (hopefully) I've grown a bit as well. But the craving is coming back, I'm feeling a bit rusty, and I want to again try my hand at an ambitious systems software project. It's got to be big, unproven, and interesting. When I set out on building PSP Player I didn't know if it would ever be possible to get a retail game running and I'm again looking for a challenge like that.
With all that said, the past week I've been looking at what it would take to emulate an Xbox 360. I have no code yet but have done a lot of research on the system, the software, and the state of the hacker community. The advantage of starting now, so many years after the release of the console, is that projects like Free60 have done a great job documenting the hardware and fleshing out the software toolchains. Reverse engineering of the file formats, CPU instructions, and other nasty details has progressed to the point where most areas are at least mapped out. And best of all - as far as I know - no one is looking at emulating it yet.
I'm still exploring exactly how I want to proceed, but I'll be documenting my initial research here and throwing up some tools on github as I write them. Even if I never make it to running code maybe having a consolidated source of this info will make it easier for someone else to do so.
I've got a lot of projects I'm playing with right now and to help keep myself motivated I'm going to start blogging about them. This will hopefully be a mix of things I do at work (primarily graphics related) as well as some fun personal projects.
The shortlist of projects I'm going to focus on:
- WebGL Inspector
- WebGL graphics/GPGPU demos
- OpenGL ES 2 graphics/GPGPU demos (focusing on iOS/etc)
- Xbox 360 emulation
There's a theme here: high-performance bleeding-edge gaming/graphics tech. There's really nothing sexier!
I recently switched jobs from Microsoft Live Labs to Google, where I am focusing on graphics and other GPU-related topics. I'm going to try to set my 20% project as blogging/building demos/samples/etc. Hopefully it'll mean a bunch of cool posts here!