Home » What is difference between User space and Kernel space?

What is difference between User space and Kernel space?

Solutons:


Is Kernel space used when Kernel is executing on the behalf of the user program i.e. System Call? Or is it the address space for all the Kernel threads (for example scheduler)?

Yes and yes.

Before we go any further, we should state this about memory.

Memory get’s divided into two distinct areas:

  • The user space, which is a set of locations where normal user processes run (i.e everything other than the kernel). The role of the kernel is to manage applications running in this space from messing with each other, and the machine.
  • The kernel space, which is the location where the code of the kernel is stored, and executes under.

Processes running under the user space have access only to a limited part of memory, whereas the kernel has access to all of the memory. Processes running in user space also don’t have access to the kernel space. User space processes can only access a small part of the kernel via an interface exposed by the kernel – the system calls. If a process performs a system call, a software interrupt is sent to the kernel, which then dispatches the appropriate interrupt handler and continues its work after the handler has finished.

Kernel space code has the property to run in “kernel mode”, which (in your typical desktop -x86- computer) is what you call code that executes under ring 0. Typically in x86 architecture, there are 4 rings of protection. Ring 0 (kernel mode), Ring 1 (may be used by virtual machine hypervisors or drivers), Ring 2 (may be used by drivers, I am not so sure about that though). Ring 3 is what typical applications run under. It is the least privileged ring, and applications running on it have access to a subset of the processor’s instructions. Ring 0 (kernel space) is the most privileged ring, and has access to all of the machine’s instructions. For example to this, a “plain” application (like a browser) can not use x86 assembly instructions lgdt to load the global descriptor table or hlt to halt a processor.

If it is the first one, than does it mean that normal user program cannot have more than 3GB of memory (if the division is 3GB + 1GB)? Also, in that case how can kernel use High Memory, because to what virtual memory address will the pages from high memory be mapped to, as 1GB of kernel space will be logically mapped?

For an answer to this, please refer to the excellent answer by wag here

CPU rings are the most clear distinction

In x86 protected mode, the CPU is always in one of 4 rings. The Linux kernel only uses 0 and 3:

  • 0 for kernel
  • 3 for users

This is the most hard and fast definition of kernel vs userland.

Why Linux does not use rings 1 and 2: https://stackoverflow.com/questions/6710040/cpu-privilege-rings-why-rings-1-and-2-arent-used

How is the current ring determined?

The current ring is selected by a combination of:

  • global descriptor table: a in-memory table of GDT entries, and each entry has a field Privl which encodes the ring.

    The LGDT instruction sets the address to the current descriptor table.

    See also: http://wiki.osdev.org/Global_Descriptor_Table

  • the segment registers CS, DS, etc., which point to the index of an entry in the GDT.

    For example, CS = 0 means the first entry of the GDT is currently active for the executing code.

What can each ring do?

The CPU chip is physically built so that:

  • ring 0 can do anything

  • ring 3 cannot run several instructions and write to several registers, most notably:

    • cannot change its own ring! Otherwise, it could set itself to ring 0 and rings would be useless.

      In other words, cannot modify the current segment descriptor, which determines the current ring.

    • cannot modify the page tables: https://stackoverflow.com/questions/18431261/how-does-x86-paging-work

      In other words, cannot modify the CR3 register, and paging itself prevents modification of the page tables.

      This prevents one process from seeing the memory of other processes for security / ease of programming reasons.

    • cannot register interrupt handlers. Those are configured by writing to memory locations, which is also prevented by paging.

      Handlers run in ring 0, and would break the security model.

      In other words, cannot use the LGDT and LIDT instructions.

    • cannot do IO instructions like in and out, and thus have arbitrary hardware accesses.

      Otherwise, for example, file permissions would be useless if any program could directly read from disk.

      More precisely thanks to Michael Petch: it is actually possible for the OS to allow IO instructions on ring 3, this is actually controlled by the Task state segment.

      What is not possible is for ring 3 to give itself permission to do so if it didn’t have it in the first place.

      Linux always disallows it. See also: https://stackoverflow.com/questions/2711044/why-doesnt-linux-use-the-hardware-context-switch-via-the-tss

How do programs and operating systems transition between rings?

  • when the CPU is turned on, it starts running the initial program in ring 0 (well kind of, but it is a good approximation). You can think this initial program as being the kernel (but it is normally a bootloader that then calls the kernel still in ring 0).

  • when a userland process wants the kernel to do something for it like write to a file, it uses an instruction that generates an interrupt such as int 0x80 or syscall to signal the kernel. x86-64 Linux syscall hello world example:

    .data
    hello_world:
        .ascii "hello worldn"
        hello_world_len = . - hello_world
    .text
    .global _start
    _start:
        /* write */
        mov $1, %rax
        mov $1, %rdi
        mov $hello_world, %rsi
        mov $hello_world_len, %rdx
        syscall
    
        /* exit */
        mov $60, %rax
        mov $0, %rdi
        syscall
    

    compile and run:

    as -o hello_world.o hello_world.S
    ld -o hello_world.out hello_world.o
    ./hello_world.out
    

    GitHub upstream.

    When this happens, the CPU calls an interrupt callback handler which the kernel registered at boot time. Here is a concrete baremetal example that registers a handler and uses it.

    This handler runs in ring 0, which decides if the kernel will allow this action, do the action, and restart the userland program in ring 3. x86_64

  • when the exec system call is used (or when the kernel will start /init), the kernel prepares the registers and memory of the new userland process, then it jumps to the entry point and switches the CPU to ring 3

  • If the program tries to do something naughty like write to a forbidden register or memory address (because of paging), the CPU also calls some kernel callback handler in ring 0.

    But since the userland was naughty, the kernel might kill the process this time, or give it a warning with a signal.

  • When the kernel boots, it setups a hardware clock with some fixed frequency, which generates interrupts periodically.

    This hardware clock generates interrupts that run ring 0, and allow it to schedule which userland processes to wake up.

    This way, scheduling can happen even if the processes are not making any system calls.

What is the point of having multiple rings?

There are two major advantages of separating kernel and userland:

  • it is easier to make programs as you are more certain one won’t interfere with the other. E.g., one userland process does not have to worry about overwriting the memory of another program because of paging, nor about putting hardware in an invalid state for another process.
  • it is more secure. E.g. file permissions and memory separation could prevent a hacking app from reading your bank data. This supposes, of course, that you trust the kernel.

How to play around with it?

I’ve created a bare metal setup that should be a good way to manipulate rings directly: https://github.com/cirosantilli/x86-bare-metal-examples

I didn’t have the patience to make a userland example unfortunately, but I did go as far as paging setup, so userland should be feasible. I’d love to see a pull request.

Alternatively, Linux kernel modules run in ring 0, so you can use them to try out privileged operations, e.g. read the control registers: https://stackoverflow.com/questions/7415515/how-to-access-the-control-registers-cr0-cr2-cr3-from-a-program-getting-segmenta/7419306#7419306

Here is a convenient QEMU + Buildroot setup to try it out without killing your host.

The downside of kernel modules is that other kthreads are running and could interfere with your experiments. But in theory you can take over all interrupt handlers with your kernel module and own the system, that would be an interesting project actually.

Negative rings

While negative rings are not actually referenced in the Intel manual, there are actually CPU modes which have further capabilities than ring 0 itself, and so are a good fit for the “negative ring” name.

One example is the hypervisor mode used in virtualization.

For further details see:

  • https://security.stackexchange.com/questions/129098/what-is-protection-ring-1
  • https://security.stackexchange.com/questions/216527/ring-3-exploits-and-existence-of-other-rings

ARM

In ARM, the rings are called Exception Levels instead, but the main ideas remain the same.

There exist 4 exception levels in ARMv8, commonly used as:

  • EL0: userland

  • EL1: kernel (“supervisor” in ARM terminology).

    Entered with the svc instruction (SuperVisor Call), previously known as swi before unified assembly, which is the instruction used to make Linux system calls. Hello world ARMv8 example:

    hello.S

    .text
    .global _start
    _start:
        /* write */
        mov x0, 1
        ldr x1, =msg
        ldr x2, =len
        mov x8, 64
        svc 0
    
        /* exit */
        mov x0, 0
        mov x8, 93
        svc 0
    msg:
        .ascii "hello syscall v8n"
    len = . - msg
    

    GitHub upstream.

    Test it out with QEMU on Ubuntu 16.04:

    sudo apt-get install qemu-user gcc-arm-linux-gnueabihf
    arm-linux-gnueabihf-as -o hello.o hello.S
    arm-linux-gnueabihf-ld -o hello hello.o
    qemu-arm hello
    

    Here is a concrete baremetal example that registers an SVC handler and does an SVC call.

  • EL2: hypervisors, for example Xen.

    Entered with the hvc instruction (HyperVisor Call).

    A hypervisor is to an OS, what an OS is to userland.

    For example, Xen allows you to run multiple OSes such as Linux or Windows on the same system at the same time, and it isolates the OSes from one another for security and ease of debug, just like Linux does for userland programs.

    Hypervisors are a key part of today’s cloud infrastructure: they allow multiple servers to run on a single hardware, keeping hardware usage always close to 100% and saving a lot of money.

    AWS for example used Xen until 2017 when its move to KVM made the news.

  • EL3: yet another level. TODO example.

    Entered with the smc instruction (Secure Mode Call)

The ARMv8 Architecture Reference Model DDI 0487C.a – Chapter D1 – The AArch64 System Level Programmer’s Model – Figure D1-1 illustrates this beautifully:

enter image description here

The ARM situation changed a bit with the advent of ARMv8.1 Virtualization Host Extensions (VHE). This extension allows the kernel to run in EL2 efficiently:

enter image description here

VHE was created because in-Linux-kernel virtualization solutions such as KVM have gained ground over Xen (see e.g. AWS’ move to KVM mentioned above), because most clients only need Linux VMs, and as you can imagine, being all in a single project, KVM is simpler and potentially more efficient than Xen. So now the host Linux kernel acts as the hypervisor in those cases.

Note how ARM, maybe due to the benefit of hindsight, has a better naming convention for the privilege levels than x86, without the need for negative levels: 0 being the lower and 3 highest. Higher levels tend to be created more often than lower ones.

The current EL can be queried with the MRS instruction: https://stackoverflow.com/questions/31787617/what-is-the-current-execution-mode-exception-level-etc

ARM does not require all exception levels to be present to allow for implementations that don’t need the feature to save chip area. ARMv8 “Exception levels” says:

An implementation might not include all of the Exception levels. All implementations must include EL0 and EL1.
EL2 and EL3 are optional.

QEMU for example defaults to EL1, but EL2 and EL3 can be enabled with command line options: https://stackoverflow.com/questions/42824706/qemu-system-aarch64-entering-el1-when-emulating-a53-power-up

Code snippets tested on Ubuntu 18.10.

If it is the first one, than does it mean that normal user program cannot have more than 3GB of memory (if the division is 3GB + 1GB)?

Yes this is the case on a normal linux system. There were a set of “4G/4G” patches floating around at one point that made the user and kernel address spaces completely independent (at a performance cost because it made it harder for the kernel to access user memory) but I don’t think they were ever merged upstream and interest waned with the rise of x86-64

Also, in that case how can kernel use High Memory, because to what virtual memory address will the pages from high memory be mapped to, as 1GB of kernel space will be logically mapped?

The way linux used to work (and still does on systems where the memory is small compared to the address space) was that the whole of physical memory was permanently mapped into the kernel part of the address space. This allowed the kernel to access all of physical memory with no remapping but clearly it doesn’t scale to 32-bit machines with lots of physical memory.

So the concept of low and high memory was born. “low” memory is permanently mapped into the kernels address space. “high” memory is not.

When the processor is running a system call it is running in kernel mode but still in the context of the current process. So it can directly access both kernel address space and the user address space of the current process (assuming you aren’t using the aforementioned 4G/4G patches). This means it is no problem for “high” memory to be allocated to a userland process.

Using “high” memory for kernel purposes is more of a problem. To access high memory that is not mapped to the current process it must be temporally mapped into the kernel’s address space. That means extra code and a performance penalty.

Related Solutions

Joining bash arguments into single string with spaces

[*] I believe that this does what you want. It will put all the arguments in one string, separated by spaces, with single quotes around all: str="'$*'" $* produces all the scripts arguments separated by the first character of $IFS which, by default, is a space....

AddTransient, AddScoped and AddSingleton Services Differences

TL;DR Transient objects are always different; a new instance is provided to every controller and every service. Scoped objects are the same within a request, but different across different requests. Singleton objects are the same for every object and every...

How to download package not install it with apt-get command?

Use --download-only: sudo apt-get install --download-only pppoe This will download pppoe and any dependencies you need, and place them in /var/cache/apt/archives. That way a subsequent apt-get install pppoe will be able to complete without any extra downloads....

What defines the maximum size for a command single argument?

Answers Definitely not a bug. The parameter which defines the maximum size for one argument is MAX_ARG_STRLEN. There is no documentation for this parameter other than the comments in binfmts.h: /* * These are the maximum length and maximum number of strings...

Bulk rename, change prefix

I'd say the simplest it to just use the rename command which is common on many Linux distributions. There are two common versions of this command so check its man page to find which one you have: ## rename from Perl (common in Debian systems -- Ubuntu, Mint,...

Output from ls has newlines but displays on a single line. Why?

When you pipe the output, ls acts differently. This fact is hidden away in the info documentation: If standard output is a terminal, the output is in columns (sorted vertically) and control characters are output as question marks; otherwise, the output is...

mv: Move file only if destination does not exist

mv -vn file1 file2. This command will do what you want. You can skip -v if you want. -v makes it verbose - mv will tell you that it moved file if it moves it(useful, since there is possibility that file will not be moved) -n moves only if file2 does not exist....

Is it possible to store and query JSON in SQLite?

SQLite 3.9 introduced a new extension (JSON1) that allows you to easily work with JSON data . Also, it introduced support for indexes on expressions, which (in my understanding) should allow you to define indexes on your JSON data as well. PostgreSQL has some...

Combining tail && journalctl

You could use: journalctl -u service-name -f -f, --follow Show only the most recent journal entries, and continuously print new entries as they are appended to the journal. Here I've added "service-name" to distinguish this answer from others; you substitute...

how can shellshock be exploited over SSH?

One example where this can be exploited is on servers with an authorized_keys forced command. When adding an entry to ~/.ssh/authorized_keys, you can prefix the line with command="foo" to force foo to be run any time that ssh public key is used. With this...

Why doesn’t the tilde (~) expand inside double quotes?

The reason, because inside double quotes, tilde ~ has no special meaning, it's treated as literal. POSIX defines Double-Quotes as: Enclosing characters in double-quotes ( "" ) shall preserve the literal value of all characters within the double-quotes, with the...

What is GNU Info for?

GNU Info was designed to offer documentation that was comprehensive, hyperlinked, and possible to output to multiple formats. Man pages were available, and they were great at providing printed output. However, they were designed such that each man page had a...

Set systemd service to execute after fstab mount

a CIFS network location is mounted via /etc/fstab to /mnt/ on boot-up. No, it is not. Get this right, and the rest falls into place naturally. The mount is handled by a (generated) systemd mount unit that will be named something like mnt-wibble.mount. You can...

Merge two video clips into one, placing them next to each other

To be honest, using the accepted answer resulted in a lot of dropped frames for me. However, using the hstack filter_complex produced perfectly fluid output: ffmpeg -i left.mp4 -i right.mp4 -filter_complex hstack output.mp4 ffmpeg -i input1.mp4 -i input2.mp4...

How portable are /dev/stdin, /dev/stdout and /dev/stderr?

It's been available on Linux back into its prehistory. It is not POSIX, although many actual shells (including AT&T ksh and bash) will simulate it if it's not present in the OS; note that this simulation only works at the shell level (i.e. redirection or...

How can I increase the number of inodes in an ext4 filesystem?

It seems that you have a lot more files than normal expectation. I don't know whether there is a solution to change the inode table size dynamically. I'm afraid that you need to back-up your data, and create new filesystem, and restore your data. To create new...

Why doesn’t cp have a progress bar like wget?

The tradition in unix tools is to display messages only if something goes wrong. I think this is both for design and practical reasons. The design is intended to make it obvious when something goes wrong: you get an error message, and it's not drowned in...