In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

Stack Overflow Asked by Cecil Ward on January 3, 2022

Is there a good way of optimising this code (x86-64) ?

mov dword ptr[rsp], 0;
mov dword ptr[rsp+4], 0

where the immediate values could be any values, not necessarily zero, but in this instance always immediate constants.

Is the original pair of stores even slow? Write-combining in the hardware and parallel operation of the μops might just make everything ridiculously fast anyway? I’m wondering if there is no problem to fix.

I’m thinking of something like (don’t know if the following instructions even exist)

mov  qword ptr[rsp], 0


mov  eax, 0;
mov  qword ptr[rsp], rax    ; assuming we can spare a register, a bad idea to corrupt one though

2 Answers

Yes, compile-time / asm-source-level write-coalescing is often a good idea, especially if both or the high half is zero (or -1) so you can do the whole qword with 1 instruction; modern x86 CPUs have efficient unaligned stores especially when it doesn't cross a cache line boundary.

You generally want to minimize total fused-domain uops (to get your code through the front-end as efficiently as possible), total code-size in bytes, and total unfused-domain uops (back end space in the scheduler / RS). In something like that priority. Also, there are uop-cache considerations for Sandybridge-family; 64-bit immediates, or 32-bit immediate + disp8/disp32, can need to borrow extra space from an adjacent entry in the uop cache line. (See Agner Fog's microarch pdf on, the Sandybridge chapter. That still applies to later uarches like Skylake)

Also minimizing pressure on certain back-end execution ports that surrounding code uses heavily is good. Ice Lake has 2 store-data and store-address ports so can run both stores in parallel, but before that all x86 CPUs are limited to 1 store per clock (having only a single store-data port to write data into the store buffer). Commit to L1d cache is also limited to 1 store per clock from the store buffer. Out-of-order exec does smooth that out so 2 stores back to back isn't a big problem, but 2x 4-byte immediate stores takes a lot of instruction size.

x86 unfortunately doesn't have a mov r/m32, sign_extended_imm8, only imm32. ( x86-64 does have mov r/m64, sign_extended_imm32, though, which is what you should use:

mov    qword [rsp], 0      ; 8 bytes, 1 fused-domain uop on modern Intel and AMD CPUs

vs. 7 bytes + 8 bytes and 2 uops for mov dword [rsp],0 / mov dword [rsp+4], 0. xor-zeroing EAX / storing RAX would be smaller (code size), but cost 2 uops instead of 1.

assuming we can spare a register, a bad idea to corrupt one though

Hardly; you often have a use for a zeroed register, and xor-zeroing is literally as cheap as a NOP on Sandybridge-family. (And also cheap on AMD.) If you can do this store somewhere that makes it useful to have a zeroed register, this is very cheap:

xor   eax, eax
mov  [rsp], rax         ; better if you have a use for RAX later

Or for non-zero 64-bit values where you want mov r64, imm64, it's typical that you have a spare register you could use as a scratch destination. If you would have to spill a register or save/restore an extra reg around the whole function, then it's probably better to just do 2 separate dword-immediate stores if you can't do a single sign-extended-imm32.

For non-zero constants, if the whole qword constant can be represented as a sign-extended 32-bit immediate, use mov qword [rsp], imm32. (Or push imm32 and optimize away an earlier sub rsp, 8.)

If you know your qword memory location is 8-byte aligned, then it's worth combining even for an arbitrary 8-byte constant that doesn't fit in a 32-bit immediate:

mov  rax, 0x123456789abcdef0        ; 10 bytes, 1 uop
mov  [rsp], rax                     ; 4 bytes, 1 micro-fused uop, for port 4 + port 2,3, or 7

It's only somewhat better than doing 2 separate dword stores, and could be slower in the (probably?) rare case where it crosses a 64-byte cache-line boundary

mov  dword [rsp], 0x9abcdef0        ; 7 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7
mov  dword [rsp+4], 0x12345678      ; 8 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7

Or if your constant happens to fit in a 32-bit value zero-extended to 64-bit, but not sign-extended, you can mov eax, 0x87654321 (5 bytes, very efficient) / mov [rsp], rax.

If you want to do a qword reload later, definitely do a single qword store so store-forwarding can work efficiently.

Write-combining in the hardware

That's not the major factor. More important is OoO exec and the store buffer decoupling store execution from the surrounding code.

If you were actually hoping to get more than 1 store (of any width) per clock executed, you're definitely out of luck on uarches before Ice Lake. On any uarch (even non-x86), hardware store-coalescing happens after stores execute.

You're also out of luck if you're hoping it will coalesce and take fewer entries in the store buffer so it has more time / room to absorb two cache-miss stores. We don't have any real evidence of any x86 doing that to save store-buffer-drain bandwidth, or free up store-buffer entries sooner. See Why doesn't RFO after retirement break memory ordering? for my current understanding of (lack-of) store buffer coalescing. There's some evidence that Intel at least can commit store misses to LFBs on cache-miss stores to free up space in the store buffer, but only with the limits of program order, and no evidence of committing multiple per clock.

Answered by Peter Cordes on January 3, 2022

Yes, you can combine your two 32-bit writes into a single 64-bit write, like so:

mov     QWORD PTR [rsp], 0

The immediate value is a 32-bit sign extended immediate, so it's not this simple if your second write is non-zero1, or if the MSB of the first write is 1. In that case, you can load a 64-bit constant using movabs and write that. E.g., to write 1 and 2,

movabs  rax,  0x200000001
mov     QWORD PTR [rsp], rax

The constant 0x200000001 results in the right values being written into each 32-bit half.

This trick is definitely worth it for the zero case and maybe worth it for the non-zero case and Peter's answer goes into much more detail on the tradeoffs in that latter case.

Compilers can also make this optimization (they call it "store combining" or something like that), meaning you can play with this on godbolt.

1 Except in the special case that the sign extension gives you exactly what you want. I.e., the second value is exactly 0xFFFFFFFF and the high bit of the first value is set.

Answered by BeeOnRope on January 3, 2022

Add your own answers!

Ask a Question

Get help from others!

© 2024 All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP