A Brief Look at the Linux Kernel: System Calls

Posted on 2025-01-21 in Linux

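We can think of the Linux kernel as a library whose code runs in privileged mode. To use this library, a program has to execute special instructions provided by the hardware. On x86-64, for example, an ordinary library function is called with call and ret, whereas the Linux kernel is entered and left with syscall and sysret.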

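A system call is the interface through which userspace interacts with the kernel. It lets a user program ask the kernel to perform operations that need higher privileges, such as hardware-related operations (e.g. reading and writing files) or creating and running processes.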

Everything starts with Hello World

Following the example in the article Linux 核心設計: 賦予應用程式生命的系統呼叫, a simple program is enough to see how a system call works.

Given hello.c:

#include <stdio.h>
int main(void) {
    printf("hello, world!\n");
    return 0;
}

Compile hello.c with gcc, then trace it with ltrace:

$ gcc -o hello hello.c
$ ltrace ./hello
puts("hello, world!"hello, world!
)                                  = 14
+++ exited (status 0) +++

The output above tells us several things:

The string hello, world! is passed as the argument to puts
puts returns 14
The final status 0 is the exit status, that is, the value returned from main

As noted in 你所不知道的 C 語言：編譯器和最佳化原理篇, gcc optimizes printf("hello, world!\n"); into puts("hello, world!"), which avoids the cost of parsing the format string and the processing that goes with it.

We can add a second file, hello1.c, to see how gcc optimizes the code at different optimization levels:
#include <stdio.h>
int main(void) {
    char *s = "hello, world!\n";
    printf("%s", s);
    return 0;
}
Compile it at two optimization levels, -O0 and -O3, and trace each build with ltrace again:
$ gcc -o hello -O0 hello1.c
$ ltrace ./hello
printf("%s", "hello, world!\n"hello, world!
)                                  = 14
+++ exited (status 0) +++

$ gcc -o hello -O3 hello1.c
$ ltrace ./hello
puts("hello, world!"hello, world!
)                                  = 14
+++ exited (status 0) +++
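These results show that once the printf() call uses the %s format specifier and optimization is disabled at compile time, the printf call is no longer replaced with puts("hello, world!"). With -O3 the compiler can see the constant string and performs the replacement again.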

Next, trace the program with strace:

$ strace ./hello
write(1, "hello, world!\n", 14hello, world!
)         = 14
exit_group(0)                           = ?
+++ exited with 0 +++

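The strace output shows that the program ultimately prints the string through the write system call.

System Call Table

The Linux kernel assigns every system call a unique system call number. On x86_64, for example, the file arch/x86/entry/syscalls/syscall_64.tbl lists the number of each system call together with the entry point of the function that implements it.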

#
# 64-bit system call numbers and entry vectors
#
# The format is:
# <number> <abi> <name> <entry point>
#
# The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls
#
# The abi is "common", "64" or "x32" for this file.
#
0	common	read			sys_read
1	common	write			sys_write
2	common	open			sys_open
3	common	close			sys_close
4	common	stat			sys_newstat
5	common	fstat			sys_newfstat
6	common	lstat			sys_newlstat
7	common	poll			sys_poll
8	common	lseek			sys_lseek
9	common	mmap			sys_mmap
10	common	mprotect		sys_mprotect

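For example, the system call number of write is 1. These numbers are part of the interface between userspace and the kernel, so on every x86_64 system the number for write cannot be changed. The implementation of write ultimately lives in fs/read_write.c: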

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos, *ppos = file_ppos(f.file);
		if (ppos) {
			pos = *ppos;
			ppos = &pos;
		}
		ret = vfs_write(f.file, buf, count, ppos);
		if (ret >= 0 && ppos)
			f.file->f_pos = pos;
		fdput_pos(f);
	}

	return ret;
}

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}

SYSCALL_DEFINE3 belongs to a family of macros defined in include/linux/syscalls.h:

#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

#define SYSCALL_DEFINE_MAXARGS	6

#define SYSCALL_DEFINEx(x, sname, ...)				\
	SYSCALL_METADATA(sname, x, __VA_ARGS__)			\
	__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

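The macro ultimately expands into the sys_write() function (on x86_64, as the header comment of the table notes, __x64_sys_*() stubs are generated for these functions), whose prototype is: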

asmlinkage long sys_write(unsigned int fd, const char __user *buf,
			  size_t count);

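Using system calls through an API

In general, applications call an API (Application Programming Interface) in user space rather than invoking system calls directly. On Linux this API is usually provided by the C standard library, libc.

If we want to invoke a system call directly from a user program on Linux, we can call the syscall() function with the number of the system call we want.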

#include <unistd.h>
#include <sys/syscall.h>

long syscall(long number, ...);

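syscall() invokes a system call directly. Its first argument is the system call number (1 for write), and further arguments are supplied as that system call requires. Taking write as an example, it can be invoked directly like this:

#define NR_WRITE 1
syscall(NR_WRITE, fd, string, string_length);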

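NR_WRITE
The system call number; 1 for write.
fd
The file descriptor. I/O functions such as read() and write() operate on file descriptors, which functions such as open() return. For write, 1 refers to stdout and 2 to stderr.
string
The string to output.
string_length
The length of the string.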

See write.c below. To use the kernel's write system call, we can go through either syscall() or write(), both provided by the C library:


#include <unistd.h>
#include <sys/syscall.h>
int main(void){
    syscall(1, 1, "hello, world!\n", 14);
    write(1, "Hello, World!\n", 14);
}

Then compile it and trace it with strace:

$ gcc -o write write.c
$ strace ./write
write(1, "hello, world!\n", 14hello, world!
)         = 14
write(1, "Hello, World!\n", 14Hello, World!
)         = 14
exit_group(0)                           = ?
+++ exited with 0 +++

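Using system calls from assembly

Besides the API provided by the library, system calls can also be made from assembly. Consider the following code (for an x86-64 processor):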

#include <unistd.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    char *hello_str = "hello world\n";
    long len = strlen(hello_str); // do not write the terminating NUL
    long ret;

    __asm__ volatile (
        "mov $1, %%rax\n" // system call number (write)
        "mov $2, %%rdi\n" // unsigned int fd (stdout: 1, stderr: 2)
        "mov %1, %%rsi\n" // const char *buf
        "mov %2, %%rdx\n" // size_t count
        "syscall\n"
        "mov %%rax, %0"   // return value of the system call
        : "=m"(ret)
        : "g" (hello_str), "g" (len)
        : "rax", "rdi", "rsi", "rdx", "rcx", "r11", "memory");
    printf("return value: %ld\n", ret);
    return 0;
}

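Counting from the first #include, line 10 places the system call number in the rax register and line 11 writes the file descriptor number into rdi. Line 12 copies the first input operand from the operand lists on lines 16 to 18, "g" (hello_str), which is the string buffer, into rsi, and line 13 copies the second input operand, "g" (len), which is the string length, into rdx. After the syscall instruction on line 14 the return value is in rax, so line 15 stores it in the variable ret, and line 19 prints it.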

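The register in which each argument must be placed can be looked up in the system call table. On x86-64, rax holds the system call number and the arguments go in rdi, rsi, rdx, r10, r8 and r9, in that order.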

References

System Call Table