x64 Assembly

Something that I have gotten really into recently is x64 Assembly programming. So, I thought I would jot down some of the notes that I’ve collected from developing in the language. I am using the MASM assembler in a Visual Studio environment, as its memory, register, and debugging tools work well for my needs. Note: I will use routine, sub-routine, procedure, and function interchangeably, since they all effectively mean the same thing.

JMP

Register quick tips
Fast-call procedure calling conventions
Fast-call procedure shadow space (home space)
Shadow space and function arguments
Stack 16 byte alignment
Setting up an x64-only project in Visual Studio
Code examples

Register quick tips

Below is a quick reference table of registers and their purpose. This isn’t 100% accurate to all the purposes of each register, but it is good enough to get started. Note that all the registers in a row are the same register, just fewer of the lower bits as you move to the right of the row. So 64-bit is the whole register, 32-bit is the lower half of the register, 16-bit is the lower half of the 32-bit part, and 8-bit is the lower half of the 16-bit part. Note: there are high registers (ah, bh, ch, dh), but not for the newer registers (r8-r15), so we are skipping those for now to make this table look pretty.

64-bit	32-bit	16-bit	8-bit	purpose
rax	eax	ax	al	general
rbx	ebx	bx	bl	general
rcx	ecx	cx	cl	general/counting
rdx	edx	dx	dl	general
rsi	esi	si	sil	source index
rdi	edi	di	dil	destination index
rbp	ebp	bp	bpl	base (frame) pointer
rsp	esp	sp	spl	stack pointer
rip	eip	ip		instruction pointer
r8	r8d	r8w	r8b	general
r9	r9d	r9w	r9b	general
r10	r10d	r10w	r10b	general
r11	r11d	r11w	r11b	general
r12	r12d	r12w	r12b	general
r13	r13d	r13w	r13b	general
r14	r14d	r14w	r14b	general
r15	r15d	r15w	r15b	general
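
One thing the table does not show is what happens to the rest of the register when you write to one of the smaller names. Below is a small sketch of that behavior. The values in the comments are what I would expect to see in the Visual Studio register window after each instruction.

```asm
mov rax, -1	; rax = FFFFFFFFFFFFFFFFh
mov al, 0	; rax = FFFFFFFFFFFFFF00h (upper 56 bits untouched)
mov ax, 0	; rax = FFFFFFFFFFFF0000h (upper 48 bits untouched)
mov eax, 0	; rax = 0000000000000000h (32-bit writes zero the upper half)
```

So writing to the 8-bit or 16-bit name leaves the upper bits alone, while writing to the 32-bit name clears the upper 32 bits of the 64-bit register.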

There are some registers that have special behavior based on the instruction that you are using. RCX combined with the loop instruction is one such set. Below is an example that will increment the value in the rax register 8 times.

mov rcx, 8
loop_8_times:
inc rax
loop loop_8_times

As you can see, the loop instruction decrements the rcx register and jumps back to the label as long as rcx is not 0. When rcx reaches 0 the loop ends. Below is the same code without the loop instruction to describe the behavior (one difference: loop itself does not modify the flags, while dec and cmp do).

mov rcx, 8
loop_8_times:
inc rax
dec rcx
cmp rcx, 0
jnz loop_8_times
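
One gotcha worth sketching: loop decrements rcx before it checks for 0, so entering the loop with rcx already at 0 wraps around and runs 2^64 times. A guard with jrcxz (jump if rcx is zero) avoids that. In this sketch I am assuming the count arrives in rdx, and the label names are just for illustration.

```asm
	mov rcx, rdx		; Assumption: the iteration count arrives in rdx
	jrcxz loop_done		; Skip the loop entirely when the count is 0
loop_n_times:
	inc rax
	loop loop_n_times
loop_done:
```

Both jrcxz and loop only support short jumps, so the target label needs to be within roughly 127 bytes of the instruction.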

; TODO: Make a table for the FPU registers

Fast-call procedure calling conventions

First of all, this document is very helpful for understanding Microsoft calling conventions.

In short, Microsoft uses RCX, RDX, R8, and R9 as the first four arguments for a procedure call and any remaining arguments should be pushed onto the stack (right to left). Below is a sample from their docs:

func1(int a, int b, int c, int d, int e);
// a in RCX, b in RDX, c in R8, d in R9, e pushed on stack

The following is the calling convention for using floats as arguments to functions. Note that if you mix integer and float arguments, the register is still picked by the argument’s position. That is to say if you have an int as the first argument and a float as the second argument, you should use RCX and XMM1 respectively (RDX and XMM0 go unused).

func2(float a, double b, float c, double d, float e);
// a in XMM0, b in XMM1, c in XMM2, d in XMM3, e pushed on stack

Lastly, when calling a procedure, the return value for the call (if any) will be put into RAX, or into XMM0 if it is a float or double.
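
To make the mixed argument rule concrete, here is a minimal sketch of a routine that could be called from C. The name mixedSum and its prototype are made up for this example. It takes an int, a double, an int, and a float, and returns the sum as a double.

```asm
; Hypothetical C prototype: double mixedSum(int a, double b, int c, float d);
; a arrives in ECX, b in XMM1, c in R8D, d in XMM3. The result goes back in XMM0.
mixedSum PROC
	cvtsi2sd xmm0, ecx	; xmm0 = (double)a
	addsd xmm0, xmm1	; xmm0 += b
	cvtsi2sd xmm2, r8d	; xmm2 = (double)c (xmm2 is volatile, so it is free to use)
	addsd xmm0, xmm2	; xmm0 += c
	cvtss2sd xmm3, xmm3	; Widen d from float to double
	addsd xmm0, xmm3	; xmm0 += d
	ret
mixedSum ENDP
```

Notice that RDX, R9, XMM0, and XMM2 carry no arguments here, because each argument slot uses either the integer register or the XMM register for that position, never both.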

Fast-call procedure shadow space (home space)

When using fast-call it is important to note that if the routine is either to be called from another language such as C or C++, or if you are calling a function that is in another language like C or C++, you need to make sure to support shadow space, also known as home space. I’ll call it shadow space from now on because it sounds cooler. This shadow space is 32 bytes long (room for the four register arguments at 8 bytes each). Basically what it boils down to is that you need to subtract 32 bytes from the stack pointer RSP before doing a call (keep in mind 16 byte alignment of the stack). Let’s take a look at Microsoft’s HeapAlloc function (basically malloc) as an example of how this would work. Below is our own implementation of malloc which we will call halloc and which uses the Windows API function HeapAlloc.

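halloc PROC
	push rbx		; Save non-volatile rbx (this also brings the stack to 16-byte alignment)
	sub rsp, 20h		; Shadow space, shared by both calls below
	mov rbx, rcx		; Keep the number of bytes safe across the call (r8 is volatile)
	call GetProcessHeap	; Store the process heap address in RAX
	mov rcx, rax		; The heap address is 1st arg
	mov rdx, 00h		; No flags to alter memory allocation
	mov r8, rbx		; The number of bytes to allocate is 3rd arg
	call HeapAlloc
	add rsp, 20h		; Remove shadow space
	pop rbx			; Restore rbx
	ret
halloc ENDP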

What you will notice in the above code are the instructions sub rsp, 20h and add rsp, 20h, which add and remove the shadow space respectively. This is a little bit annoying, so I personally skip the shadow space when I am calling my own routines that I don’t intend to expose to a higher level language like C. This means that I mainly only have to add it when I am calling into a function that comes from a higher level language library, such as the Windows API. For a short added reading on this, check out this Microsoft blog post.
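
It may help to see why the callee wants that space in the first place. The 32 bytes sit directly above the callee's return address, and the callee is free to use them to save ("home") its four register arguments. Below is a minimal sketch of a callee doing exactly that. The routine name addFour and its prototype are made up for this example.

```asm
; Hypothetical C prototype: __int64 addFour(__int64 a, __int64 b, __int64 c, __int64 d);
addFour PROC
	mov [rsp+8h], rcx	; Home slot for a (just above the return address)
	mov [rsp+10h], rdx	; Home slot for b
	mov [rsp+18h], r8	; Home slot for c
	mov [rsp+20h], r9	; Home slot for d
	mov rax, [rsp+8h]	; Read the arguments back from the shadow space
	add rax, [rsp+10h]
	add rax, [rsp+18h]
	add rax, [rsp+20h]
	ret
addFour ENDP
```

If the caller had not reserved the shadow space, those four stores would land on top of whatever the caller keeps on its own stack.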

Something I like to do is to have a routine that does the shadow space calling for me. Basically you pass the address of the function you want to call into rax (in my case) and the routine adds and removes the stack space around the call as you normally would.

;*********************************************;
; RAX = Function that should be shadow called ;
; Returns whatever the function call returns  ;
;*********************************************;
shadowCall PROC
	push rbp		; Save the caller's rbp (a non-volatile register)
	mov rbp, rsp		; Remember the stack pointer so it can be restored exactly
	and rsp, not 0Fh	; Make sure that the current stack is 16-byte aligned
	sub rsp, 20h		; Add the shadow space
	call rax		; Call the function
	mov rsp, rbp		; Remove the shadow space and any alignment padding
	pop rbp			; Restore the caller's rbp
	ret
shadowCall ENDP

The above instructions have a few things going on. The most interesting thing is that we save rbp and then copy rsp into it. Since rbp is a non-volatile register, the external function has to hand it back unchanged, which lets us restore the stack pointer exactly with mov rsp, rbp no matter how far the alignment step moved it. Our return address sits above the shadow space on the stack, so the external function will not overwrite it. The second thing is that we are using and rsp, not 0Fh. This clears the low 4 bits of rsp, which makes sure that the stack is 16-byte aligned before it does the external call, otherwise you’ll probably wind up with a memory access violation.

Shadow space and function arguments

At this point you might be wondering, if there are more than 4 arguments to a function call and the remaining arguments are put onto the stack, how does this work with shadow space? Since the fast-call calling convention requires the shadow space (whether or not the callee uses it) and that alters the stack, your question should be, “do I push the args to the stack before or after adding the shadow space?”. The answer is to push the args before you add the shadow space, so that they end up directly above it.

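mov rcx, 1		; Arg 1
mov rdx, 2		; Arg 2
mov r8, 3		; Arg 3
mov r9, 4		; Arg 4
sub rsp, 8		; Padding so the stack is 16-byte aligned at the call (assumes it was aligned here)
push 5			; Push the 5+ arguments onto the stack first
sub rsp, 20h		; Shadow space
call someFunction
add rsp, 30h		; Remove shadow space, the pushed argument, and the padding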
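
From the callee's side, the layout above means the 5th argument is found just past the return address and the shadow space. Here is a minimal sketch of a routine reading it. The name sum5 and its prototype are made up for this example.

```asm
; Hypothetical C prototype: int sum5(int a, int b, int c, int d, int e);
sum5 PROC
	mov eax, ecx			; a
	add eax, edx			; + b
	add eax, r8d			; + c
	add eax, r9d			; + d
	add eax, dword ptr [rsp+28h]	; + e: 8 bytes of return address + 32 bytes of shadow space
	ret
sum5 ENDP
```

The offset 28h only holds at the very start of the routine, before anything else is pushed or rsp is moved.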

Stack 16 byte alignment

Something I am aware of, but honestly haven’t fully explored, is that the stack needs to be 16-byte aligned (that is, RSP should be a multiple of 16 at the point where you make a call). That is to say that if you push only one 8-byte value onto the stack, you should pad it with another 8 bytes. You could push the value twice or, preferably, just move the stack pointer. Below is an example of this exact scenario.

mov rax, 99	; Some value from somewhere
push rax	; Push an 8-byte value onto the stack
sub rsp, 8	; Move the stack pointer by 8 bytes to keep it 16-byte aligned

Often you’ll want to start your program off on the right foot by aligning it. Believe it or not, it doesn’t always start off aligned.

.code
main PROC
	and rsp, not 0Fh	; Make sure that the stack is 16-byte aligned
	; ...
main ENDP
END
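
To round that out, here is a minimal sketch of a complete entry point that aligns the stack, reserves shadow space, and exits cleanly through the Windows API function ExitProcess. I am assuming kernel32.lib is linked, which is the default for a Visual Studio C++ project.

```asm
extern ExitProcess: PROC	; Windows API function, resolved from kernel32.lib

.code
main PROC
	and rsp, not 0Fh	; Make sure that the stack is 16-byte aligned
	sub rsp, 20h		; Shadow space for the call below
	xor ecx, ecx		; Exit code 0 is the 1st arg
	call ExitProcess	; Ends the process, so this call does not return
main ENDP
END
```

Exiting through ExitProcess also sidesteps the fact that, after the and instruction, rsp may no longer point at the original return address.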

Setting up an x64-only project in Visual Studio

You will need to create a C++ project as you normally would. Though you are selecting this to be a C++ project, we will not be creating any C/C++ file types, we will only be creating .asm files.

[Screenshot: create project]

Make sure to give your project a suitable name during the configuration step.

[Screenshot: configure project]

Something that I like to do is get rid of the normal Visual Studio solution explorer folders and just show all files so that I can set up the directories how I want them.

[Screenshot: show all files in Visual Studio]

Next we need to enable the MASM assembler in the build customizations.

[Screenshot: build customizations]

[Screenshot: MASM assembler build customization]

Now let’s create a src/main.asm file to make sure things are set up correctly. After you create the file, right click on it and go to the file’s properties.

[Screenshot: asm file properties]

You should see that the file type is set to Microsoft Macro Assembler.

[Screenshot: asm file item type]

Next, you need to set the label that will serve as your entry point in the Visual Studio project properties. To keep things simple, we will name our entry point label main. So to set this up you need to go to project properties.

[Screenshot: project properties]

Then you need to go to the Linker->Advanced settings and set the Entry Point value to main. Note: Make sure that you are in x64 mode and not x86.

[Screenshot: entry point label setting]

Now that you have done all that setup, turn your debugger to x64 mode (through the dropdown in Visual Studio next to the debug button) and test things out.

[Screenshot: assembly running]

NOTE: if you get an error when building sometime in the future that says something along the lines of unresolved external symbol __imp___CrtDbgReportW, the problem seems to be the multi-threaded debugging runtime library setting. Changing from “Multi-threaded Debug DLL (/MDd)” to “Multi-threaded DLL (/MD)” in the Visual Studio project settings seems to have done the trick. You can find it in Project Settings->C/C++->Code Generation->Runtime Library.

[Screenshot: unresolved external symbol __imp___CrtDbgReportW solution]

Code examples

What better way to learn something than through some code examples? Below are some ASCII string query routines that I have written in x64. Note: these routines are on the slower side, but they work well for the sake of example. I use faster versions of these routines in my personal code that account for cache lines and heap access.

strlen - Get the length of a string.

;****************************************;
; RAX = The string to get the length for ;
; Returns length of string in RAX        ;
;****************************************;
strleninline PROC
	push rbx		; Save the state of rbx since we are going to use bl
	push rcx		; Save the state of rcx since we are going to keep the start address in it
	mov rcx, rax		; Create a copy of rax to diff at end
strleninline_loop:
	mov bl, [rax]		; Copy the ascii letter at the rax address into bl
	inc rax			; Go to the next ascii letter at rax
	cmp bl, 0		; Check to see if the character is a \0
	jnz strleninline_loop	; If not \0 then continue through the loop
	dec rax			; We don't want to count \0 as part of the length
	sub rax, rcx		; Put the length in rax by subtracting address locations
	pop rcx			; Restore the state of rcx
	pop rbx			; Restore the state of rbx
	ret
strleninline ENDP
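
As a quick usage sketch, this is how I would expect a call into the routine above to look. The greeting string and the greetingLength routine are made up for this example.

```asm
.data
greeting db "Hello", 0		; Hypothetical test string, null terminated

.code
greetingLength PROC
	lea rax, greeting	; rax = address of the string
	call strleninline	; rax = 5 on return
	ret
greetingLength ENDP
```

Since strleninline takes its argument in rax rather than rcx, it follows my own convention and not fast-call, which is why there is no shadow space around the call.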

strstartswith - Determines if a string (haystack) starts with another string (needle)

;*******************************************************;
; RAX = Needle string (string should be in start)       ;
; RBX = Haystack string (string to check within)        ;
; Returns 0 in RAX if false, anything otherwise is true ;
;*******************************************************;
strstartswith PROC
	push rbx		; Save the state of rbx since we advance it through the haystack
	push rcx		; Save the state of rcx
	push rdx		; Save the state of rdx
	mov rdx, rax		; Copy rax to rdx since strleninline returns the length in rax
	call strleninline
	mov rcx, rax		; Move the len of the needle string into our counter register
	mov rax, 0		; Set the return to false
	jrcxz strstartswith_match	; An empty needle always matches (loop with rcx = 0 would wrap around)
strstartswith_loop:
	mov r8b, [rbx]		; Get the character from haystack string (r8 is not preserved)
	cmp r8b, [rdx]		; Compare character from the needle string
	jnz strstartswith_exit
	inc rbx			; Move to the next character in haystack string
	inc rdx			; Move to the next character in needle string
	loop strstartswith_loop
strstartswith_match:
	mov rax, 1		; The string starts with match!
strstartswith_exit:
	pop rdx			; Restore the state of rdx
	pop rcx			; Restore the state of rcx
	pop rbx			; Restore the state of rbx
	ret
strstartswith ENDP

strindexof - Get the index of a string (needle) within another string (haystack)

;*******************************************************;
; RAX = Haystack string (string to check within)        ;
; RBX = Needle string (string to search for)            ;
; Returns -1 in RAX if not found, otherwise RAX = index ;
; An empty needle is found at index 0                   ;
;*******************************************************;
strindexof PROC public
	push rcx		; Save the state of rcx
	push rdx		; Save the state of rdx
	push rsi		; Save the state of rsi
	push rdi		; Save the state of rdi
	mov rcx, rax		; rcx = candidate start position in the haystack
strindexof_loop:
	mov rsi, rcx		; Start comparing at the candidate position
	mov rdi, rbx		; Always compare from the start of the needle
strindexof_compare:
	mov dl, [rdi]		; Get the next character from the needle string
	cmp dl, 0		; If it is the 0 string terminator, the whole needle matched
	jz strindexof_found
	cmp dl, [rsi]		; Compare character from the haystack string
	jne strindexof_next	; Mismatch (this also catches the end of the haystack)
	inc rsi			; Move to the next character in haystack string
	inc rdi			; Move to the next character in needle string
	jmp strindexof_compare
strindexof_next:
	cmp byte ptr [rcx], 0	; Was this start position the end of the haystack?
	jz strindexof_notfound
	inc rcx			; Try again from the next start position
	jmp strindexof_loop
strindexof_found:
	sub rcx, rax		; Get the address difference of the match and the haystack
	mov rax, rcx		; Return the index in rax
	jmp strindexof_exit
strindexof_notfound:
	mov rax, -1		; The needle is not in the haystack
strindexof_exit:
	pop rdi			; Restore the state of rdi
	pop rsi			; Restore the state of rsi
	pop rdx			; Restore the state of rdx
	pop rcx			; Restore the state of rcx
	ret
strindexof ENDP
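
Since these routines take their arguments in RAX and RBX instead of following fast-call, they can’t be called directly from C. Below is a minimal sketch of a wrapper that bridges the two. The name strindexof_c and its prototype are made up for this example.

```asm
; Hypothetical C prototype: __int64 strindexof_c(const char* haystack, const char* needle);
strindexof_c PROC
	push rbx		; rbx is non-volatile for C callers, so preserve it
	mov rax, rcx		; Haystack: 1st C argument goes into rax
	mov rbx, rdx		; Needle: 2nd C argument goes into rbx
	call strindexof
	pop rbx			; Restore the caller's rbx
	ret			; The index (or -1) is already in rax
strindexof_c ENDP
```

No shadow space is needed around the inner call because strindexof is my own routine and never touches it.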