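This article is translated from GNU Assembler Examples.

The GNU Assembler, whose assembler program is called gas, is the default assembler of the GNU operating system. It works on many different architectures and supports several assembly language syntaxes. The examples below apply only to Linux on x86-64.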

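Contents:

  • Getting Started
  • Using a C Library
  • Calling Conventions for 64-bit C Code
  • Mixing C and Assembly Language
  • Command Line Arguments
  • Floating Point Instructions
  • Data Sections
  • Recursion
  • SIMD Parallelism
  • Saturated Arithmetic
  • Local Variables and Stack Frames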

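Getting Started

Here is the classic Hello World program for 64-bit Linux, using only the write and exit system calls (the instructions differ from the 32-bit version):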

# ----------------------------------------------------------------------------------------
# Writes "Hello, World" to the console using only system calls. Runs on 64-bit Linux only.
# To assemble and run:
#
# gcc -c hello.s && ld hello.o && ./a.out
#
# or
#
# gcc -nostdlib hello.s && ./a.out
# ----------------------------------------------------------------------------------------

        .global _start

        .text
_start:
        # write(1, message, 13)
        mov     $1, %rax                # system call 1 is write
        mov     $1, %rdi                # file handle 1 is stdout
        mov     $message, %rsi          # address of string to output
        mov     $13, %rdx               # number of bytes
        syscall                         # invoke operating system to do the write

        # exit(0)
        mov     $60, %rax               # system call 60 is exit
        xor     %rdi, %rdi              # we want return code 0
        syscall                         # invoke operating system to exit
message:
        .ascii  "Hello, world\n"

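Assemble and run it with gcc and ld. Note: if your gcc produces position-independent executables by default, as on many current distributions, add -no-pie whenever gcc does the linking (for example gcc -nostdlib -no-pie hello.s, or gcc -no-pie hola.s in the next section), because these examples load absolute addresses with instructions such as mov $message, %rsi.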

$ gcc -c hello.s && ld hello.o && ./a.out
Hello, world

If you are using macOS or Windows, the system call numbers and the registers used may differ (gcc and as do, of course, support both of those platforms).
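For comparison, here is a C sketch that makes the same write system call through glibc's syscall() wrapper (documented in syscall(2)); the function name hello_via_syscall is hypothetical. On x86-64 Linux, SYS_write is the system call number 1 used in the assembly above.

```c
#define _GNU_SOURCE           /* ensure <unistd.h> declares syscall() */
#include <sys/syscall.h>      /* SYS_write */
#include <unistd.h>           /* syscall() */

/* Hypothetical helper: performs the same write(1, message, 13) as the
 * assembly program, via glibc's syscall() wrapper, which loads the system
 * call number and arguments into the registers and executes syscall.
 * Returns the number of bytes written, or -1 on error. */
long hello_via_syscall(void) {
    static const char message[] = "Hello, world\n";  /* 13 bytes plus a NUL */
    return syscall(SYS_write, 1, message, sizeof message - 1);
}
```

Calling hello_via_syscall() should print Hello, world and return 13, the same byte count the assembly passes in %rdx.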

Using a C Library

In practice you will usually want to use a C library. Here is Hello World again, this time calling the C library function puts:

# ----------------------------------------------------------------------------------------
# Writes "Hola, mundo" to the console using a C library. Runs on Linux or any other system
# that does not use underscores for symbols in its C library. To assemble and run:
# 
#  filename: hola.s
#     gcc hola.s && ./a.out
# ----------------------------------------------------------------------------------------

        .global main

        .text
main:                                   # This is called by C library's startup code
        mov     $message, %rdi          # First integer (or pointer) parameter in %rdi
        call    puts                    # puts(message)
        ret                             # Return to C library code
message:
        .asciz "Hola, mundo"       

Run it:

$ gcc hola.s && ./a.out
Hola, mundo

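Calling Conventions for 64-bit C Code

The 64-bit calling conventions are a bit more detailed, and they are explained in full in the AMD64 ABI reference. You can also find information on them on Wikipedia. The most important points are (again, for 64-bit Linux, not Windows):

  • From left to right, pass as many parameters as will fit in registers. The order in which registers are allocated is:
    • For integers and pointers: rdi, rsi, rdx, rcx, r8, r9.
    • For floating-point values (float, double): xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7.
  • Additional parameters are pushed on the stack, right to left, and are removed by the caller after the call.
  • After the parameters are pushed, the call instruction is executed, so when the called function gets control, the return address is at (%rsp), the first memory parameter is at 8(%rsp), and so on.
  • The stack pointer %rsp must be aligned to a 16-byte boundary before making a call. Fine, but the call itself pushes an 8-byte return address onto the stack, so when the called function gets control, %rsp is not aligned. You have to make that extra space yourself, by pushing something or subtracting 8 from %rsp.
  • The only registers the called function is required to preserve (the callee-save registers) are rbp, rbx, r12, r13, r14, and r15. All others may be freely changed by the called function.
  • The callee is also supposed to preserve the control bits of the MXCSR and the x87 control word, but x87 instructions are rare in 64-bit code, so you probably don't need to worry about this.
  • Integers are returned in rax or rdx:rax, and floating-point values are returned in xmm0 or xmm1:xmm0.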
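As a concrete illustration of these rules, here is a hypothetical C function with eight integer and two double parameters; the comments give the location of each argument on entry under the System V AMD64 ABI. Integer and floating-point registers are allocated independently, so x and y do not use up integer registers, and only g and h spill to the stack.

```c
#include <stdint.h>

/* Hypothetical function; argument locations on entry (64-bit Linux):
 *   a..f -> rdi, rsi, rdx, rcx, r8, r9   (first six integer arguments)
 *   x, y -> xmm0, xmm1                    (first two floating-point arguments)
 *   g    -> 8(%rsp)                       (seventh integer argument, on the stack)
 *   h    -> 16(%rsp)                      (eighth integer argument, on the stack)
 * (%rsp) holds the return address; the double result comes back in xmm0. */
double ten_args(int64_t a, int64_t b, int64_t c, int64_t d, int64_t e,
                int64_t f, double x, double y, int64_t g, int64_t h) {
    return (double)(a + b + c + d + e + f + g + h) + x * y;
}
```

Compiling a function like this with gcc -O2 -S and reading the generated assembly is an easy way to confirm where each argument is read from.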

This program prints the first 90 Fibonacci numbers and illustrates how registers have to be saved and restored:

# -----------------------------------------------------------------------------
# A 64-bit Linux application that writes the first 90 Fibonacci numbers.  It
# needs to be linked with a C library.
#
# Assemble and Link:
#     gcc fib.s
# -----------------------------------------------------------------------------

        .global main

        .text
main:
        push    %rbx                    # we have to save this since we use it

        mov     $90, %ecx               # ecx will countdown to 0
        xor     %rax, %rax              # rax will hold the current number
        xor     %rbx, %rbx              # rbx will hold the next number
        inc     %rbx                    # rbx is originally 1
print:
        # We need to call printf, but we are using rax, rbx, and rcx.  printf
        # may destroy rax and rcx so we will save these before the call and
        # restore them afterwards.

        push    %rax                    # caller-save register
        push    %rcx                    # caller-save register

        mov     $format, %rdi           # set 1st parameter (format)
        mov     %rax, %rsi              # set 2nd parameter (current_number)
        xor     %rax, %rax              # because printf is varargs

        # Stack is already aligned because we pushed three 8 byte registers
        call    printf                  # printf(format, current_number)

        pop     %rcx                    # restore caller-save register
        pop     %rax                    # restore caller-save register

        mov     %rax, %rdx              # save the current number
        mov     %rbx, %rax              # next number is now current
        add     %rdx, %rbx              # get the new next number
        dec     %ecx                    # count down
        jnz     print                   # if not done counting, do some more

        pop     %rbx                    # restore rbx before returning
        ret
format:
        .asciz  "%20ld\n"

Run:

$ gcc fib.s && ./a.out
                   0
                   1
                   1
                   2
                   3
                 ...
  420196140727489673
  679891637638612258
 1100087778366101931
 1779979416004714189
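To make the register juggling easier to follow, here is a C sketch of the same loop; fib_printed is a hypothetical name, and its locals play the roles the comments in fib.s assign to %rax, %rbx and %rdx. It returns the value printed on iteration i (counting from 0), so fib_printed(89) is the last number in the output above.

```c
#include <stdint.h>

/* Hypothetical C mirror of the loop in fib.s: returns the number printed
 * on iteration i (0-based), i.e. the Fibonacci number F(i). */
uint64_t fib_printed(unsigned i) {
    uint64_t current = 0;   /* plays the role of %rax */
    uint64_t next = 1;      /* plays the role of %rbx */
    while (i-- > 0) {
        uint64_t saved = current;   /* mov %rax, %rdx */
        current = next;             /* mov %rbx, %rax */
        next += saved;              /* add %rdx, %rbx */
    }
    return current;
}
```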

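Mixing C and Assembly Language

This 64-bit example is a very simple function that takes three 64-bit integer parameters and returns the maximum value. It shows how integer parameters are received: the caller places them in registers, so on entry to the function they are already in rdi, rsi, and rdx, respectively. The return value is an integer, so it goes in rax.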

# -----------------------------------------------------------------------------
# A 64-bit function that returns the maximum value of its three 64-bit integer
# arguments.  The function has signature:
#
#   int64_t maxofthree(int64_t x, int64_t y, int64_t z)
#
# Note that the parameters have already been passed in rdi, rsi, and rdx.  We
# just have to return the value in rax.
# -----------------------------------------------------------------------------

        .globl  maxofthree
        
        .text
maxofthree:
        mov     %rdi, %rax              # result (rax) initially holds x
        cmp     %rsi, %rax              # is x less than y?
        cmovl   %rsi, %rax              # if so, set result to y
        cmp     %rdx, %rax              # is max(x,y) less than z?
        cmovl   %rdx, %rax              # if so, set result to z
        ret   

A C program that calls this assembly language function:

/*
 * callmaxofthree.c
 *
 * A small program that illustrates how to call the maxofthree function we wrote in
 * assembly language.
 */

#include <stdio.h>
#include <inttypes.h>

int64_t maxofthree(int64_t, int64_t, int64_t);

int main() {
    printf("%" PRId64 "\n", maxofthree(1, -4, -7));
    printf("%" PRId64 "\n", maxofthree(2, -6, 1));
    printf("%" PRId64 "\n", maxofthree(2, 3, 1));
    printf("%" PRId64 "\n", maxofthree(-2, 4, 3));
    printf("%" PRId64 "\n", maxofthree(2, -6, 5));
    printf("%" PRId64 "\n", maxofthree(2, 4, 6));
    return 0;
}

Assemble, link, and run the two files together:

$ gcc -std=c99 callmaxofthree.c maxofthree.s && ./a.out
1
2
3
4
5
6
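For comparison, here is a hypothetical C version of maxofthree that follows the assembly one instruction group at a time. The comparisons are signed, like cmovl; an optimizing compiler may well turn each if into a cmov, but that is up to the compiler.

```c
#include <stdint.h>

/* Hypothetical C rendering of maxofthree.s, step by step. */
int64_t maxofthree_c(int64_t x, int64_t y, int64_t z) {
    int64_t result = x;            /* mov   %rdi, %rax                   */
    if (result < y) result = y;    /* cmp %rsi, %rax; cmovl %rsi, %rax   */
    if (result < z) result = z;    /* cmp %rdx, %rax; cmovl %rdx, %rax   */
    return result;                 /* returned in %rax                   */
}
```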

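Command Line Arguments

As you know, in C, main is just an ordinary function, and it has a couple of parameters of its own:

 int main(int argc, char **argv)

The following program simply prints its command line arguments, one per line: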

# -----------------------------------------------------------------------------
# A 64-bit program that displays its commandline arguments, one per line.
#
# On entry, %rdi will contain argc and %rsi will contain argv.
# -----------------------------------------------------------------------------

        .global main

        .text
main:
        push    %rdi                    # save registers that puts uses
        push    %rsi
        sub     $8, %rsp                # must align stack before call

        mov     (%rsi), %rdi            # the argument string to display
        call    puts                    # print it

        add     $8, %rsp                # restore %rsp to pre-aligned value
        pop     %rsi                    # restore registers puts used
        pop     %rdi

        add     $8, %rsi                # point to next argument
        dec     %rdi                    # count down
        jnz     main                    # if not done counting keep going

        ret
format:
        .asciz  "%s\n"

Output:

$ gcc echo.s && ./a.out 25782 dog huh $$
./a.out
25782
dog
huh
9971
$ gcc echo.s && ./a.out 25782 dog huh '$$'
./a.out
25782
dog
huh
$$
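The same loop written in C, as a sketch (echo_args is a hypothetical name). Like the assembly, it uses argc as a countdown and advances argv by one pointer (8 bytes) per iteration; it also returns the number of lines printed so the behavior is easy to check.

```c
#include <stdio.h>

/* Hypothetical C version of echo.s: prints argv[0] .. argv[argc-1], one
 * per line, counting argc down the way the assembly counts %rdi down.
 * Returns how many lines were printed. */
int echo_args(int argc, char **argv) {
    int printed = 0;
    for (int remaining = argc; remaining > 0; remaining--) {  /* dec %rdi; jnz main */
        puts(*argv);                           /* mov (%rsi), %rdi; call puts */
        argv++;                                /* add $8, %rsi */
        printed++;
    }
    return printed;
}
```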

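Note that as far as the C library is concerned, command line arguments are always strings. If you want to treat them as integers, call atoi. Here is a small program that computes x to the power y. Another feature of this example is that it shows how to restrict values to 32 bits.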

# -----------------------------------------------------------------------------
# A 64-bit command line application to compute x^y.
#
# Syntax: power x y
# x and y are integers
# -----------------------------------------------------------------------------

        .global main

        .text
main:
        push    %r12                    # save callee-save registers
        push    %r13
        push    %r14
        # By pushing 3 registers our stack is already aligned for calls

        cmp     $3, %rdi                # must have exactly two arguments
        jne     error1

        mov     %rsi, %r12              # argv

# We will use r13d to count down from the exponent to zero, r14d to hold the
# value of the base, and eax to hold the running product.

        mov     16(%r12), %rdi          # argv[2]
        call    atoi                    # y in eax
        cmp     $0, %eax                # disallow negative exponents
        jl      error2
        mov     %eax, %r13d             # y in r13d

        mov     8(%r12), %rdi           # argv[1]
        call    atoi                    # x in eax
        mov     %eax, %r14d             # x in r14d

        mov     $1, %eax                # start with answer = 1
check:
        test    %r13d, %r13d            # we're counting y downto 0
        jz      gotit                   # done
        imul    %r14d, %eax             # multiply in another x
        dec     %r13d
        jmp     check
gotit:                                  # print report on success
        mov     $answer, %rdi
        movslq  %eax, %rsi
        xor     %rax, %rax
        call    printf
        jmp     done
error1:                                 # print error message
        mov     $badArgumentCount, %edi
        call    puts
        jmp     done
error2:                                 # print error message
        mov     $negativeExponent, %edi
        call    puts
done:                                   # restore saved registers
        pop     %r14
        pop     %r13
        pop     %r12
        ret

answer:
        .asciz  "%d\n"
badArgumentCount:
        .asciz  "Requires exactly two arguments\n"
negativeExponent:
        .asciz  "The exponent may not be negative\n"

Output:

$ ./power 2 19
524288
$ ./power 3 -8
The exponent may not be negative
$ ./power 1 500
1

Exercise: Rewrite this example to use 64-bit integers throughout. You will need to replace atoi with strtol.
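The 32-bit restriction means that each imul %r14d, %eax keeps only the low 32 bits of the product, so large results wrap around instead of growing. The following C sketch models that behavior; power32 is a hypothetical name, and it uses unsigned arithmetic because signed overflow is undefined in C. Like the assembly, it assumes the exponent is non-negative.

```c
#include <stdint.h>

/* Hypothetical model of the power.s loop. Only the low 32 bits of each
 * product are kept, as with the 32-bit imul. Unsigned arithmetic makes the
 * wraparound well defined; converting back to int32_t is
 * implementation-defined but is two's complement truncation with gcc.
 * Precondition: y >= 0 (main rejects negative exponents). */
int32_t power32(int32_t x, int32_t y) {
    uint32_t product = 1;          /* mov $1, %eax */
    while (y != 0) {               /* test %r13d, %r13d; jz gotit */
        product *= (uint32_t)x;    /* imul %r14d, %eax */
        y--;                       /* dec %r13d */
    }
    return (int32_t)product;
}
```

For example, power32(2, 31) yields INT32_MIN and power32(2, 32) yields 0, which is what the 32-bit assembly computes for those inputs.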

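Floating Point Instructions

Floating-point arguments are passed in the xmm registers. Here is a simple function that sums the values in a double array: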

# -----------------------------------------------------------------------------
# A 64-bit function that returns the sum of the elements in a floating-point
# array. The function has prototype:
#
#   double sum(double array[], unsigned length)
# -----------------------------------------------------------------------------

        .global sum
        .text
sum:
        xorpd   %xmm0, %xmm0            # initialize the sum to 0
        test    %esi, %esi              # special case for length = 0 (length is a 32-bit unsigned)
        je      done
next:
        addsd   (%rdi), %xmm0           # add in the current array element
        add     $8, %rdi                # move to next array element
        dec     %esi                    # count down
        jnz     next                    # if not done counting, continue
done:
        ret    

A C program that calls it:

/*
 * callsum.c
 *
 * Illustrates how to call the sum function we wrote in assembly language.
 */

#include <stdio.h>

double sum(double[], unsigned);

int main() {
    double test[] = {
        40.5, 26.7, 21.9, 1.5, -40.5, -23.4
    };
    printf("%20.7f\n", sum(test, 6));
    printf("%20.7f\n", sum(test, 2));
    printf("%20.7f\n", sum(test, 0));
    printf("%20.7f\n", sum(test, 3));
    return 0;
}

Run:

$ gcc callsum.c sum.s && ./a.out
          26.7000000
          67.2000000
           0.0000000
          89.1000000
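Here is the same loop as a C sketch (sum_c is a hypothetical name), with comments naming the instruction each line corresponds to. Because length is declared unsigned, it is a 32-bit value, and under the ABI only the low half of %rsi (that is, %esi) is guaranteed to hold it.

```c
/* Hypothetical C rendering of sum.s: total plays the role of %xmm0. */
double sum_c(const double array[], unsigned length) {
    double total = 0.0;                 /* xorpd %xmm0, %xmm0 */
    for (; length != 0; length--) {     /* count the length down to zero */
        total += *array;                /* addsd (%rdi), %xmm0 */
        array++;                        /* add $8, %rdi */
    }
    return total;                       /* returned in %xmm0 */
}
```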
          

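Data Sections

On most operating systems the text (code) section is read-only, so writable variables need a data section. The data section holds only initialized data, and there is a special .bss section for uninitialized data.

Here is a program that averages its command line arguments, which are expected to be integers, and displays the result as a floating point number. Note the two section directives in the code, .text and .data, which mark the start of the code section and the data section respectively.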

# -----------------------------------------------------------------------------
# 64-bit program that treats all its command line arguments as integers and
# displays their average as a floating point number.  This program uses a data
# section to store intermediate results, not that it has to, but only to
# illustrate how data sections are used.
# -----------------------------------------------------------------------------

        .globl  main

        .text
main:
        dec     %rdi                    # argc-1, since we don't count program name
        jz      nothingToAverage
        mov     %rdi, count             # save number of real arguments
accumulate:
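        push    %rdi                    # save registers across call to atoi
        push    %rsi
        sub     $8, %rsp                # two pushes left %rsp misaligned; realign before call
        mov     (%rsi,%rdi,8), %rdi     # argv[rdi]
        call    atoi                    # now eax has the int value of arg
        cltq                            # sign-extend eax into rax before the 64-bit add
        add     $8, %rsp                # undo the alignment adjustment
        pop     %rsi                    # restore registers after atoi call
        pop     %rdi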
        add     %rax, sum               # accumulate sum as we go
        dec     %rdi                    # count down
        jnz     accumulate              # more arguments?
average:
        cvtsi2sdq sum, %xmm0            # convert the 64-bit integer sum to double
        cvtsi2sdq count, %xmm1          # convert the 64-bit count to double
        divsd   %xmm1, %xmm0            # xmm0 is sum/count
        mov     $format, %rdi           # 1st arg to printf
        mov     $1, %rax                # printf is varargs, there is 1 non-int argument

        sub     $8, %rsp                # align stack pointer
        call    printf                  # printf(format, sum/count)
        add     $8, %rsp                # restore stack pointer

        ret

nothingToAverage:
        mov     $error, %rdi
        xor     %rax, %rax
        sub     $8, %rsp                # align stack pointer before the call
        call    printf
        add     $8, %rsp                # restore stack pointer
        ret

        .data
count:  .quad   0
sum:    .quad   0
format: .asciz  "%g\n"
error:  .asciz  "There are no command line arguments to average\n"
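A C sketch of the same computation, assuming hypothetical names average_args, count and sum. The two file-scope variables play the role of the .data entries; note that gcc would normally place zero-initialized variables like these in .bss rather than .data, which is exactly the distinction described above.

```c
#include <stdlib.h>   /* atoi */

/* File-scope storage, like count and sum in the .data section. */
static long count = 0;
static long sum = 0;

/* Hypothetical C rendering of the average program: returns the average
 * of argv[1..argc-1] read as integers. The caller must ensure argc > 1,
 * as the assembly does with its nothingToAverage branch. */
double average_args(int argc, char **argv) {
    count = argc - 1;                       /* dec %rdi; mov %rdi, count */
    sum = 0;
    for (int i = argc - 1; i > 0; i--) {    /* walks argv from the end */
        sum += atoi(argv[i]);               /* call atoi; add %rax, sum */
    }
    return (double)sum / (double)count;     /* cvtsi2sd, cvtsi2sd, divsd */
}
```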

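Recursion

Recursion may feel odd in assembly, but perhaps surprisingly, implementing a recursive function requires nothing special. All you have to do is save registers carefully, as usual. Here is an example.

The C version: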

uint64_t factorial(unsigned n) {
    return (n <= 1) ? 1 : n * factorial(n-1);
}

The assembly version:

# ----------------------------------------------------------------------------
# A 64-bit recursive implementation of the function
#
#     uint64_t factorial(unsigned n)
#
# implemented recursively
# ----------------------------------------------------------------------------

        .globl  factorial

        .text
factorial:
        cmp     $1, %rdi                # n <= 1?
        jnbe    L1                      # if not, go do a recursive call
        mov     $1, %rax                # otherwise return 1
        ret
L1:
        push    %rdi                    # save n on stack (also aligns %rsp!)
        dec     %rdi                    # n-1
        call    factorial               # factorial(n-1), result goes in %rax
        pop     %rdi                    # restore n
        imul    %rdi, %rax              # n * factorial(n-1), stored in %rax
        ret

Calling this recursive function from C:

/*
 * An application that illustrates calling the factorial function defined elsewhere.
 */

#include <stdio.h>
#include <inttypes.h>

uint64_t factorial(unsigned n);

int main() {
    for (unsigned i = 0; i < 20; i++) {
        printf("factorial(%2u) = %" PRIu64 "\n", i, factorial(i));
    }
}

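SIMD Parallelism

The XMM registers can do arithmetic on floating-point values one operation at a time or several operations at a time.

The instructions have the form:

operation xmmregister_or_memorylocation,xmmregister

For floating-point addition, the instructions are:

  • addpd: 2 double-precision additions
  • addps: 4 single-precision additions
  • addsd: 1 double-precision addition, using the low 64 bits of the register
  • addss: 1 single-precision addition, using the low 32 bits of the register

Exercise: Write a function that processes an array of floats four elements per instruction.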
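One way to try packed addition from C is through the SSE2 intrinsics in <emmintrin.h>, where _mm_add_pd performs the two double-precision additions of addpd. This sketch (add_arrays is a hypothetical name) adds two double arrays element by element, two lanes per packed addition, and finishes an odd leftover element with an ordinary scalar addition; which instructions are finally emitted is up to the compiler.

```c
#include <emmintrin.h>   /* SSE2: __m128d, _mm_loadu_pd, _mm_add_pd, _mm_storeu_pd */
#include <stddef.h>      /* size_t */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const double *a, const double *b, double *out, size_t n) {
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);            /* load 2 doubles (unaligned ok) */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));  /* 2 additions at once */
    }
    for (; i < n; i++) {
        out[i] = a[i] + b[i];                        /* leftover element, scalar */
    }
}
```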

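Saturated Arithmetic

Saturated arithmetic is arithmetic in which all operations, such as addition and multiplication, are clamped to a fixed range between a minimum and a maximum value.

The XMM registers support this as well: they can do integer arithmetic in the same registers used for floating point. The instructions have the form:

operation xmmregister_or_memorylocation,xmmregister

For integer addition, the instructions are as follows. The first four wrap around on overflow; the last four saturate to the range shown in parentheses:

  • paddb: 16 byte additions
  • paddw: 8 word additions
  • paddd: 4 dword additions
  • paddq: 2 qword additions
  • paddsb: 16 byte additions with signed saturation (80..7F)
  • paddsw: 8 word additions with signed saturation (8000..7FFF)
  • paddusb: 16 byte additions with unsigned saturation (00..FF)
  • paddusw: 8 word additions with unsigned saturation (0000..FFFF)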
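To make the difference between the wrapping and saturating forms concrete, here is a C sketch of what a single byte lane computes for paddb, paddusb and paddsb; the function names are hypothetical, and the real instructions apply this to all 16 byte lanes at once.

```c
#include <stdint.h>

/* One lane of paddb: plain wraparound, no saturation. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* One lane of paddusb: unsigned bytes, result clamped to 00..FF. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return s > 0xFF ? 0xFF : (uint8_t)s;
}

/* One lane of paddsb: signed bytes, result clamped to 80..7F (-128..127). */
int8_t add_sat_s8(int8_t a, int8_t b) {
    int s = a + b;
    if (s > INT8_MAX) return INT8_MAX;
    if (s < INT8_MIN) return INT8_MIN;
    return (int8_t)s;
}
```

For example, 200 + 100 wraps to 44 with paddb but sticks at 255 with paddusb.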

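Local Variables and Stack Frames

First, please read Eli Bendersky's article; its overview is more complete than the brief notes here.

When a function is called, the caller first puts the parameters in the correct registers and then issues the call instruction. Additional parameters that do not fit in the registers are pushed onto the stack before the call.

The call instruction puts the return address on the top of the stack.

For example, consider this function: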

long example(long x, long y) {
    long a, b, c;
    b = 7;
    return x * b + y;
}

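On entry to example, x will be in %rdi, y will be in %rsi, and the return address will be on top of the stack. Where can we put the local variables? The easiest choice is the stack itself, though if you have enough registers, you can use those instead.

If you are running on a system that follows the standard ABI, you can leave %rsp where it is and access the "extra parameters" and the local variables directly relative to %rsp, for example: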

                +----------+
         rsp-24 |    a     |
                +----------+
         rsp-16 |    b     |
                +----------+
         rsp-8  |    c     |
                +----------+
         rsp    | retaddr  |
                +----------+
         rsp+8  | caller's |
                | stack    |
                | frame    |
                | ...      |
                +----------+

The function then reduces to:

        .text
        .globl  example
example:
        movq    $7, -16(%rsp)           # b = 7 (b lives at -16(%rsp))
        mov     %rdi, %rax              # rax = x
        imul    -16(%rsp), %rax         # rax = x * b
        add     %rsi, %rax              # rax = x * b + y
        ret

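If the function were going to call another function, it would have to adjust %rsp at that point so the callee does not overwrite these locals.

You cannot use this approach on Windows, because if an interrupt occurs, everything beyond the stack pointer (at lower addresses) can be clobbered. This does not happen on most other operating systems, because the 128 bytes beyond the stack pointer form a "red zone" that is safe from such interruptions. Where no red zone is available, you can instead carve out space on the stack as soon as the function is entered:

example:
        sub     $24, %rsp

The stack now looks like this: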

                +----------+
         rsp    |    a     |
                +----------+
         rsp+8  |    b     |
                +----------+
         rsp+16 |    c     |
                +----------+
         rsp+24 | retaddr  |
                +----------+
         rsp+32 | caller's |
                | stack    |
                | frame    |
                | ...      |
                +----------+

Here is the final code. Note that the stack pointer must be restored (add $24, %rsp) before returning!

        .text
        .globl  example
example:
        sub     $24, %rsp
        movq    $7, 8(%rsp)             # b = 7; movq stores all 8 bytes that imul reads
        mov     %rdi, %rax
        imul    8(%rsp), %rax
        add     %rsi, %rax
        add     $24, %rsp
        ret
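The offsets in the second diagram can be checked with a hypothetical C struct that models the 32 bytes from the adjusted %rsp up to and including the return address. This is only a model for checking the arithmetic, assuming 8-byte long and pointer types as on 64-bit Linux; a compiler is free to lay out its own frame differently.

```c
#include <stddef.h>   /* offsetof */

/* Hypothetical model of the frame after "sub $24, %rsp": each member's
 * offset equals its displacement from %rsp in the diagram. */
struct example_frame {
    long a;          /* 0(%rsp)                  */
    long b;          /* 8(%rsp)                  */
    long c;          /* 16(%rsp)                 */
    void *retaddr;   /* 24(%rsp), pushed by call */
};

/* Same computation as example(): x * b + y with b = 7. */
long example_c(long x, long y) {
    struct example_frame f = {0};
    f.b = 7;                 /* movq $7, 8(%rsp) */
    return x * f.b + y;      /* imul 8(%rsp), %rax; add %rsi, %rax */
}
```

The 32-byte total is also why sub $24 realigns the stack: 8 bytes of return address plus 24 bytes of locals is a multiple of 16.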

(Original English article: GNU Assembler Examples)