mirror of
				git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
				synced 2025-10-31 16:54:21 +00:00 
			
		
		
		
	docs: Document Syscall User Dispatch
Explain the interface, provide some background and security notes.

[ tglx: Add note about non-visibility, add it to the index and fix the
  kerneldoc warning ]

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201127193238.821364-8-krisman@collabora.com
		
							parent
							
								
									d87ae0fa21
								
							
						
					
					
						commit
						a4452e671c
					
				
					 2 changed files with 91 additions and 0 deletions
				
			
		|  | @@ -111,6 +111,7 @@ configure specific aspects of kernel behavior to your liking. | |||
|    rtc | ||||
|    serial-console | ||||
|    svga | ||||
|    syscall-user-dispatch | ||||
|    sysrq | ||||
|    thunderbolt | ||||
|    ufs | ||||
|  |  | |||
							
								
								
									
										90
									
								
								Documentation/admin-guide/syscall-user-dispatch.rst
									
										
									
									
									
										Normal file
									
								
							
							
						
						
									
							|  | @@ -0,0 +1,90 @@ | |||
| .. SPDX-License-Identifier: GPL-2.0 | ||||
| 
 | ||||
| ===================== | ||||
| Syscall User Dispatch | ||||
| ===================== | ||||
| 
 | ||||
| Background | ||||
| ---------- | ||||
| 
 | ||||
| Compatibility layers like Wine need a way to efficiently emulate system | ||||
| calls for only a part of their process - the part that runs the | ||||
| incompatible code - while still executing native syscalls without | ||||
| a high performance penalty in the native part of the process.  Seccomp | ||||
| falls short on this task, since it has limited support for efficiently | ||||
| filtering syscalls based on memory regions, and it doesn't support removing | ||||
| filters.  Therefore a new mechanism is necessary. | ||||
| 
 | ||||
| Syscall User Dispatch brings the filtering of the syscall dispatcher | ||||
| address back to userspace.  The application is in control of a flip | ||||
| switch, indicating the current personality of the process.  A | ||||
| multiple-personality application can then flip the switch without | ||||
| invoking the kernel, when crossing the compatibility layer API | ||||
| boundaries, to enable/disable the syscall redirection and execute | ||||
| syscalls directly (disabled) or send them to be emulated in userspace | ||||
| through a SIGSYS. | ||||
| 
 | ||||
| The goal of this design is to provide very quick compatibility layer | ||||
| boundary crossings, which is achieved by not executing a syscall to change | ||||
| personality every time the compatibility layer executes.  Instead, a | ||||
| userspace memory region exposed to the kernel indicates the current | ||||
| personality, and the application simply modifies that variable to | ||||
| configure the mechanism. | ||||
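As a rough illustration of how cheap the personality switch is meant to be, the sketch below flips a per-thread byte with a plain store. The variable and function names are hypothetical; the fallback constants are assumed to match the upstream UAPI header and are only used when the installed headers lack them.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (assumed upstream values). */
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Hypothetical per-thread selector byte.  Its address would be handed to
 * the kernel once, when the mechanism is set up for the thread. */
static __thread volatile char dispatch_selector = PR_SYS_DISPATCH_OFF;

/* Crossing into non-native code: syscalls issued from outside the
 * always-allowed region are redirected from now on. */
void enter_foreign_code(void)
{
    dispatch_selector = PR_SYS_DISPATCH_ON;
}

/* Crossing back into native code: syscalls execute directly again. */
void enter_native_code(void)
{
    dispatch_selector = PR_SYS_DISPATCH_OFF;
}
```

Neither function enters the kernel; each is a single one-byte store, which is the point of the design.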
| 
 | ||||
| There is a relatively high cost associated with handling signals on most | ||||
| architectures, like x86, but at least for Wine, syscalls issued by | ||||
| native Windows code are currently not known to be a performance problem, | ||||
| since they are quite rare, at least for modern gaming applications. | ||||
| 
 | ||||
| Since this mechanism is designed to capture syscalls issued by | ||||
| non-native applications, it must function on syscalls whose invocation | ||||
| ABI is completely unexpected to Linux.  Syscall User Dispatch therefore | ||||
| doesn't rely on any part of the syscall ABI to do the filtering.  It uses | ||||
| only the syscall dispatcher address and the userspace key. | ||||
| 
 | ||||
| As the ABI of these intercepted syscalls is unknown to Linux, these | ||||
| syscalls are not instrumentable via ptrace or the syscall tracepoints. | ||||
| 
 | ||||
| Interface | ||||
| --------- | ||||
| 
 | ||||
| A thread can set up this mechanism on supported kernels by executing the | ||||
| following prctl: | ||||
| 
 | ||||
|   prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector]) | ||||
| 
 | ||||
| <op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or | ||||
| disable the mechanism globally for that thread.  When | ||||
| PR_SYS_DISPATCH_OFF is used, the other fields must be zero. | ||||
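A minimal sketch of thin wrappers around this prctl is shown below. The wrapper names `sud_enable` and `sud_disable` are hypothetical, and the fallback constants are assumed to match the upstream UAPI header for systems whose headers predate the feature.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (assumed upstream values). */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Enable dispatch for the calling thread.  Syscalls issued from
 * [offset, offset + length) are always executed directly.
 * Returns 0 on success, -1 with errno set otherwise. */
int sud_enable(unsigned long offset, unsigned long length,
               volatile char *selector)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 offset, length, (unsigned long)selector);
}

/* Disable dispatch for the calling thread; all other fields must be zero. */
int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

On kernels without the feature the call is expected to fail, so callers should check the return value and fall back to another interception strategy.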
| 
 | ||||
| [<offset>, <offset>+<length>) delimits a memory region | ||||
| from which syscalls are always executed directly, regardless of the | ||||
| userspace selector.  This provides a fast path for the C library, which | ||||
| includes the most common syscall dispatchers in the native code | ||||
| applications, and also provides a way for the signal handler to return | ||||
| without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of this | ||||
| interface should make sure that at least the signal trampoline code is | ||||
| included in this region. In addition, for syscalls that implement the | ||||
| trampoline code on the vDSO, that trampoline is never intercepted. | ||||
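One possible way to obtain an <offset>/<length> pair is to look up the executable mapping that contains a known code address. The sketch below does this by scanning `/proc/self/maps`; the function name `find_exec_region` is hypothetical, and this is only one approach under the assumption that procfs is mounted.

```c
#include <stdio.h>

/* Find the executable mapping containing addr.  On success, store its
 * start in *offset and its size in *length and return 0; otherwise
 * return -1. */
int find_exec_region(unsigned long addr, unsigned long *offset,
                     unsigned long *length)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];
    int found = -1;

    if (!maps)
        return -1;

    while (fgets(line, sizeof(line), maps)) {
        unsigned long start, end;
        char perms[5];

        /* Each line starts with "start-end perms", e.g. "7f..-7f.. r-xp". */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (perms[2] == 'x' && addr >= start && addr < end) {
            *offset = start;
            *length = end - start;
            found = 0;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

Passing the address of a C library function would typically yield the library's text mapping, but the address of a library function may resolve to a stub in the main executable on some builds, and which mapping holds the signal trampoline depends on the C library and the architecture. Treat the result as a starting point to verify, not as a guarantee.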
| 
 | ||||
| [selector] is a pointer to a char-sized region in the process memory | ||||
| that provides a quick way to enable or disable syscall redirection | ||||
| thread-wide, without the need to invoke the kernel directly.  selector | ||||
| can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other | ||||
| value should terminate the program with a SIGSYS. | ||||
| 
 | ||||
| Security Notes | ||||
| -------------- | ||||
| 
 | ||||
| Syscall User Dispatch provides functionality for compatibility layers to | ||||
| quickly capture system calls issued by a non-native part of the | ||||
| application, while not impacting the Linux native regions of the | ||||
| process.  It is not a mechanism for sandboxing system calls, and it | ||||
| should not be seen as a security mechanism, since it is trivial for a | ||||
| malicious application to subvert the mechanism by jumping to an allowed | ||||
| dispatcher region prior to executing the syscall, or to discover the | ||||
| address and modify the selector value.  If the use case requires any | ||||
| kind of security sandboxing, Seccomp should be used instead. | ||||
| 
 | ||||
| Any fork or exec of the existing process resets the mechanism to | ||||
| PR_SYS_DISPATCH_OFF. | ||||
	 Gabriel Krisman Bertazi