JIT actually means that a function is converted to native machine code only when it's actually executed. Q3 takes the lazy choice and does Ahead-Of-Time compilation instead: it basically just converts the entire thing at load time instead of at run time. I've misused the term multiple times myself.
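To make the JIT vs AOT distinction concrete, here's a rough Python sketch. Everything in it is made up for illustration: compile_function just stands in for a real bytecode-to-native translator, and AotModule/JitModule are hypothetical names, not anything from Q3.

```python
# Hypothetical sketch: compile_function stands in for a real
# bytecode-to-native translator and only records when it was called.
compiled_log = []

def compile_function(name, bytecode):
    """Pretend to translate bytecode to native code."""
    compiled_log.append(name)
    return lambda: f"ran {name}"

class AotModule:
    """Ahead-of-time: every function is compiled when the module loads."""
    def __init__(self, functions):
        self.native = {n: compile_function(n, b) for n, b in functions.items()}

    def call(self, name):
        return self.native[name]()

class JitModule:
    """Just-in-time: a function is compiled the first time it is called."""
    def __init__(self, functions):
        self.bytecode = dict(functions)
        self.native = {}

    def call(self, name):
        if name not in self.native:
            self.native[name] = compile_function(name, self.bytecode[name])
        return self.native[name]()
```

With the AOT version you pay the whole compile cost up front, even for functions that never run. With the JIT version, functions that are never called are never compiled.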
The thing that's interesting about both Q3 and Java (from what I remember of it) is that they both logically have TWO stacks, as it were. One stack holds the locals, the return addresses, etc, while the other stack holds the temporaries, the intermediates, whatever you want to call them.
So you have some load instruction with a single argument that reads data from somewhere and then just pushes it to the intermediates stack, followed by another similar instruction. Then you have some simple 'mul' instruction without any arguments that pops the two values, multiplies them together, then pushes the result. This can then be followed by some store operation that shoves that data somewhere. When combined with a few rules like 'the intermediates stack must be empty before any jumps or comefroms (jump targets)', the AOT/JIT compiler is then free to allocate/remap whatever registers it likes for the various intermediates slots, and thus your simple 'mul' instruction can then become a simple x86 mul instruction (but with some extra magic either side to try to avoid needing extra movs to remap the registers to match x86's annoying specific-register requirements). If you're processing a function at a time then you can combine the two stacks back into a single block.
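Here's a rough sketch of that load/load/mul/store sequence in Python. The opcode names and the r0/r1 virtual registers are made up for illustration; they aren't Q3's or Java's actual instruction sets. The first function interprets the bytecode with an explicit intermediates stack. The second shows why the stack form is easy to compile: the stack depth is known statically at each instruction, so slot N can simply become register rN.

```python
# Hypothetical opcodes, not Q3's or Java's real instruction set.
PROGRAM = [
    ("load", "a"),   # push memory[a]
    ("load", "b"),   # push memory[b]
    ("mul",),        # pop two values, push their product
    ("store", "c"),  # pop into memory[c]
]

def interpret(program, memory):
    """Run the bytecode using an explicit intermediates stack."""
    stack = []
    for op, *args in program:
        if op == "load":
            stack.append(memory[args[0]])
        elif op == "mul":
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs * rhs)
        elif op == "store":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory

def to_registers(program):
    """Rewrite stack slots as virtual registers r0, r1, ...

    A real compiler would then map these onto machine registers.
    """
    depth = 0
    out = []
    for op, *args in program:
        if op == "load":
            out.append(f"mov r{depth}, [{args[0]}]")
            depth += 1
        elif op == "mul":
            depth -= 1  # two inputs collapse into one result slot
            out.append(f"mul r{depth - 1}, r{depth}")
        elif op == "store":
            depth -= 1
            out.append(f"mov [{args[0]}], r{depth}")
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return out
```

For PROGRAM above, to_registers gives 'mov r0, [a]', 'mov r1, [b]', 'mul r0, r1', 'mov [c], r0'. The 'stack must be empty at jumps' rule is what keeps the depth bookkeeping this trivial, since no slot ever has to survive across a branch.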
If you're generating native machine code then yes, x86 is horrible with all of its different extensions and additions over the years. ARM machine code 'should' be a little more straightforward, if only because it won't have quite so many restrictions on which register is used, nor different forms of instructions for different registers. I've not needed to deal with ARM machine code itself, and tbh I can never remember which operands are ins or outs with ARM asm, but the encoding itself should be more predictable, if I understand correctly.
Note that 'mov ebp,esp' has a totally different meaning from 'mov ebp,[esp]', so it's hardly surprising that the encoding changes to reflect a memory address instead of a register. Especially if your memory address is formed from multiple args like both a register AND an immediate, getting even more complex when you throw in a second register and a multiplier too... Handy though, you can get a lot of stuff done like that, hence the lea instruction, which can be quite handy for multiply-and-add type stuff, especially when the regular mul instruction is so awkward.
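As a rough illustration of what those address forms compute, here's a small Python model of base + index*scale + displacement. The function names are made up, and it assumes 32bit wraparound. The one hard fact baked in is that x86 only encodes scales of 1, 2, 4 or 8.

```python
def effective_address(base=0, index=0, scale=1, disp=0):
    """Model an x86 memory operand: base + index*scale + disp (32bit wrap)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scales of 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF

def lea_times5_plus3(x):
    """Models 'lea eax,[eax+eax*4+3]': computes x*5+3 without touching mul."""
    return effective_address(base=x, index=x, scale=4, disp=3)
```

lea just hands you that computed address in a register instead of dereferencing it, which is why it doubles as a cheap multiply-and-add.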
On the other hand, QuakeC logically has 65k+ different registers and no real indication about whether a variable is a local, an intermediate, or even a global, which makes it hard to keep track of their scopes, effectively resulting in all operations writing back into ram. That said, considering you would need to write an AOT compiler for 3 or 4 different instruction sets (even if only two main families), it's easier to just write a simple interpreter, in which case the performance difference won't be that significant. Besides, your engine already has a QC interpreter...
If you have different intermediate bytecode for 32bit or 64bit builds, you're probably doing it wrong. I would argue that a vm does not need to store native pointers, thus you should not need any 64bit intermediate bytecode for your interpreter. Even if you did, you could treat pointers specially: always reserve 64 bits for them (ie: always use the 64bit form of the bytecode) and just read+write them as 32bit on 32bit systems. This is roughly what 64bit processors do when running in compatibility mode anyway. Really though, any objects should be referenced via some table, thereby allowing you to orphan them safely or whatever, as well as free them at exit (even if you've no GC while actually running).
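A minimal sketch of the 'reference objects via some table' idea, in Python. HandleTable and its method names are hypothetical, not from any particular engine. The VM only ever stores the small integer handle, so the bytecode looks the same regardless of the host's pointer size.

```python
class HandleTable:
    """Maps small integer handles to host objects (hypothetical sketch).

    The VM never sees a native pointer, only the handle. None marks a
    free slot, so None itself cannot be stored as an object.
    """
    def __init__(self):
        self._slots = [None]  # slot 0 is reserved as the null handle
        self._free = []       # released handles available for reuse

    def alloc(self, obj):
        if self._free:
            handle = self._free.pop()
            self._slots[handle] = obj
        else:
            handle = len(self._slots)
            self._slots.append(obj)
        return handle

    def get(self, handle):
        if not 0 < handle < len(self._slots) or self._slots[handle] is None:
            raise KeyError(f"stale or invalid handle {handle}")
        return self._slots[handle]

    def release(self, handle):
        self.get(handle)  # raises if the handle is already invalid
        self._slots[handle] = None
        self._free.append(handle)

    def release_all(self):
        """Drop everything at VM shutdown, even with no GC during the run."""
        self._slots = [None]
        self._free = []
```

A bad handle from buggy VM code turns into a clean error instead of a wild pointer dereference. This sketch reuses handle numbers, so a stale handle can end up pointing at a newer object; a generation counter per slot would catch that, but it's left out here to keep it short.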
(obviously different ABIs need different native code, but that doesn't mean that the intermediate stuff should differ, because if it does then what's the point of it at all? might as well just compile directly to native code)