Code Review Asked on October 27, 2021
For practice, I wrote some NASM code that prints out the hailstone sequence of a (unfortunately, hardcoded) number.
This is by far the most complex code I’ve ever written in NASM. I’d like advice on anything, but specifically:
The multiplication part seems overly complicated. The problem is, mul doesn’t take an immediate, and the register that I want to multiply is ebx, not eax, so I need to do a couple movs before I can multiply.
hail.asm:
global _start
section .data
newline: db `\n`
end_str: db `1\n`
section .text
print_string: ; (char* string, int length)
    push ebp
    mov ebp, esp
    push ebx
    mov eax, 4
    mov ebx, 1
    mov ecx, [ebp + 8]
    mov edx, [ebp + 12]
    int 0x80
    pop ebx
    mov esp, ebp
    pop ebp
    ret

print_int: ; (int n_to_print)
    push ebp
    mov ebp, esp
    push ebx
    push esi
    mov esi, esp ; So we can calculate how many were pushed easily
    mov ecx, [ebp + 8]
.loop:
    mov edx, 0 ; Zeroing out edx for div
    mov eax, ecx ; Num to be divided
    mov ebx, 10 ; Divide by 10
    div ebx
    mov ecx, eax ; Quotient
    add edx, '0'
    push edx ; Remainder
    cmp ecx, 0
    jne .loop
    mov eax, 4 ; Write
    mov ebx, 1 ; STDOUT
    mov ecx, esp ; The string on the stack
    mov edx, esi
    sub edx, esp ; Calculate how many bytes were pushed
    int 0x80
    add esp, edx
    pop esi
    pop ebx
    mov esp, ebp
    pop ebp
    ret

main_loop: ; (int starting_n)
    push ebp
    mov ebp, esp
    push ebx
    mov ebx, [ebp + 8] ; ebx is the accumulator
.loop:
    push ebx
    call print_int
    add esp, 4
    push 1
    push newline
    call print_string
    add esp, 8
    test ebx, 1
    jz .even
.odd:
    mov eax, ebx
    mov ecx, 3 ; Because multiply needs a memory location
    mul ecx
    inc eax
    mov ebx, eax
    jmp .end
.even:
    shr ebx, 1
.end:
    cmp ebx, 1
    jnz .loop
    push 2
    push end_str
    call print_string
    add esp, 8
    pop ebx
    mov esp, ebp
    pop ebp
    ret

_start:
    push 1000 ; The starting number
    call main_loop
    add esp, 4
    mov eax, 1
    mov ebx, 0
    int 0x80
Makefile:
nasm hail.asm -g -f elf32 -Wall -o hail.o
ld hail.o -m elf_i386 -o hail
"The multiplication part seems overly complicated. The problem is, mul doesn't take an immediate, and the register that I want to multiply is ebx, not eax, so I need to do a couple movs before I can multiply."
This is all true, but based on the premise that the mul instruction must be used. Here are a couple of alternatives:
- imul ebx, ebx, 3, listed in the manual as a signed multiplication, but that makes no difference, because only the low half of the product is used.
- lea ebx, [ebx + 2*ebx]. Even the +1 can be merged into it: lea ebx, [ebx + 2*ebx + 1]. As a reminder, lea evaluates the address on the right and stores it in the destination register; it does not access memory despite the square-bracket syntax. A 3-component lea takes 3 cycles on some processors (e.g. Haswell, Skylake), making it slightly slower than a 2-component lea and a separate inc. A 3-component lea is good on Ryzen.
The simplest way to divide by 10 is of course to use the div instruction, but that's not the fastest way, and it's not what a compiler would do. Here is a faster way, similar to how compilers do it, based on multiplying by a fixed-point reciprocal of 10 (namely 2^35 / 10, rounded up; the difference between 2^35 and 2^32 is compensated for by shifting right by 3, and the remaining division by 2^32 is implicit in taking the high half of the output of mul).
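; calculate quotient ecx/10
mov eax, 0xCCCCCCCD ; fixed-point reciprocal of 10, scaled by 2^35
mul ecx ; edx:eax = ecx * 0xCCCCCCCD
shr edx, 3 ; edx = quotient
mov eax, ecx ; eax = original n
mov ecx, edx ; ecx = quotient, ready for the next iteration
; calculate remainder as n - 10*(n/10)
lea edx, [edx + 4*edx] ; edx = 5 * quotient
add edx, edx ; edx = 10 * quotient
sub eax, edx ; eax = remainder (note: in eax, not edx)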
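As a cross-check of the arithmetic above, here is a minimal C model of the same steps. The function names div10 and mod10 are made up for this sketch; the 64-bit product stands in for edx:eax after mul, and the shift by 35 combines "take the high half" with the shr by 3.

```c
#include <stdint.h>

/* Quotient n / 10 via multiplication by 0xCCCCCCCD, which is 2^35 / 10
   rounded up. The 64-bit product models edx:eax after `mul ecx`; shifting
   right by 35 is "take the high half (>> 32), then shr edx, 3". */
static uint32_t div10(uint32_t n)
{
    uint64_t product = (uint64_t)n * 0xCCCCCCCDu;
    return (uint32_t)(product >> 35);
}

/* Remainder as n - 10 * (n / 10), with 10q built as 5q + 5q,
   mirroring the lea / add pair in the assembly. */
static uint32_t mod10(uint32_t n)
{
    uint32_t q = div10(n);
    uint32_t five_q = q + 4 * q;   /* lea edx, [edx + 4*edx] */
    return n - (five_q + five_q);  /* add edx, edx ; sub eax, edx */
}
```

Comparing div10(n) against n / 10 over a range of inputs, including 0xFFFFFFFF, is a quick way to convince yourself that the constant and the shift count are right.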
push edx (in print_int)
This will put 4 bytes on the stack for every character of the decimal representation of the integer: 1 actual char and 3 zero bytes as filler. That looks fine when printed because a zero byte does not show up as anything visible, so I'm not sure if this should be classed as a bug, but it just seems like an odd thing to do. The characters could instead be written to some buffer byte-by-byte, with a store and a pointer decrement, and then there would not be zeroes mixed in. A similar "subtract pointers to find the length" trick could still be used; that's a good trick.
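The byte-by-byte buffer idea can be sketched in C as follows. The helper name format_u32 and its parameters are hypothetical, chosen only for this illustration; the assembly version would do the same thing with a byte store and a dec on a pointer register.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of n into the END of buf (cap bytes long),
   working backwards one byte per digit, and returns a pointer to the
   first digit. *len receives the digit count, found by subtracting
   pointers. Hypothetical helper: the caller must supply at least
   10 bytes, which is enough for any 32-bit value. */
static char *format_u32(uint32_t n, char *buf, size_t cap, size_t *len)
{
    char *end = buf + cap;
    char *p = end;
    do {
        *--p = (char)('0' + n % 10); /* store one digit, move the pointer down */
        n /= 10;
    } while (n != 0);
    *len = (size_t)(end - p);        /* "subtract pointers to find the length" */
    return p;
}
```

The returned pointer and length are exactly what the write system call needs, and no filler bytes end up in the output.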
mov edx, 0 ; Zeroing out edx for div
That's fine, but xor edx, edx is preferred, unless the flags must be preserved.
jmp .end
.even:
Given that n is odd, 3n+1 is even, so you could omit the jump and have the flow of execution fall straight into the "even" case. Of course that means that not all integers in the sequence are printed, so maybe you can't use this trick, depending on what you want from the program.
If skipping some numbers to accelerate the sequence is OK, here is another trick for that: skip a sequence of even numbers all at once by counting the trailing zeroes and shifting them all out.
tzcnt ecx, ebx
shr ebx, cl
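A rough C model of this accelerated iteration, for illustration only: strip_even and accelerated_steps are names invented for this sketch, and __builtin_ctz is a GCC/Clang builtin standing in for tzcnt (an assumption about the compiler). It presumes n is nonzero and that 3n+1 does not overflow 32 bits.

```c
#include <stdint.h>

/* Shifts out all trailing zero bits at once (tzcnt + shr).
   Requires n != 0, because __builtin_ctz(0) is undefined. */
static uint32_t strip_even(uint32_t n)
{
    return n >> __builtin_ctz(n);
}

/* Counts how many 3n+1 steps are needed to reach 1 when every run of
   halvings is collapsed into a single shift. Only odd values are ever
   visited, so intermediate even numbers are skipped entirely. */
static unsigned accelerated_steps(uint32_t n)
{
    unsigned steps = 0;
    n = strip_even(n);
    while (n != 1) {
        n = strip_even(3 * n + 1); /* odd n: 3n+1 is even, so strip right away */
        steps++;
    }
    return steps;
}
```

For example, strip_even(1000) yields 125, since 1000 = 8 * 125.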
mov esp, ebp
pop ebp
If you want (it doesn't make a significant difference, so it's mostly personal preference), you can use leave instead of this pair of instructions. Pairing the leave with enter is not recommended because enter is slow, but leave itself is OK. GCC likes to use leave when it makes sense, but Clang and MSVC don't.
cmp ecx, 0
jne .loop
That's fine, but there are a couple of alternatives that you may find interesting:
- test ecx, ecx followed by jne .loop. Saves a byte, thanks to not having to encode the zero explicitly.
- jecxz .loop. This special case can be used because ecx is used. Only 2 bytes instead of 5 or 4. Note that jecxz jumps when ecx is zero, the opposite of jne, so the branch has to be arranged accordingly. Also, unlike a fusible arith/branch pair, this costs 2 µops on Intel processors. On Ryzen there is no downside.
Answered by harold on October 27, 2021