Normal view

There are new articles available, click to refresh the page.

Before yesterdayBlog - Atredis Partners

Blog - Atredis Partners
A LibAFL Introductory WorkshopJordan Whitehead
4 December 2023 at 23:54

A LibAFL Introductory Workshop

Blog - Atredis Partners

By: Jordan Whitehead

4 December 2023 at 23:54

Intro

Why LibAFL

Fuzzing is great! Throwing randomized inputs at a target really fast can have unreasonable effectiveness with the right setup. When starting with a new target a fuzzing harness can iterate along with your reversing/auditing efforts and you can sleep well at night knowing your cores are taking the night watch. When looking for bugs our time is often limited; any effort spent on tooling needs to be time well spent. LibAFL is a great library that can let us quickly adapt a fuzzer to our specific target. Not every target fits nicely into the "Command-line program that parses a file" category, so LibAFL lets us craft fuzzers for our specific situations. This adaptability opens up the power of fuzzing for a wider range of targets.

Why a workshop

The following material comes from an internal workshop used as an introduction to LibAFL. This post is a summary of the workshop, and includes a repository of exercises and examples for following along at home. It expects some existing understanding of rust and fuzzing concepts. (If you need a refresher on rust: google's comprehensive rust is great.)

There are already a few good resources for learning about LibAFL.

The "LibAFL Book" was created by a few of the LibAFL maintainers, and is a wonderful resource. https://aflplus.plus/libafl-book/
epi has a great series of posts that delve into the creation of a few example fuzzers using LibAFL. https://epi052.gitlab.io/notes-to-self/blog/2021-11-01-fuzzing-101-with-libafl/
The LibAFL repository itself contains a number of useful examples that can be a reference for your own fuzzers. https://github.com/AFLplusplus/LibAFL/tree/main/fuzzers

This workshop seeks to add to the existing corpus of example fuzzers built with LibAFL, with a focus on customizing fuzzers to our targets. You will also find a few starter problems for getting hands on experience with LibAFL. Throughout the workshop we try to highlight the versatility and power of the library, letting you see where you can fit a fuzzer in your flow.

Course Teaser

As an aside, if you are interested in this kind of thing (security tooling, bugs, fuzzing), you may be interested in our Symbolic Execution course. We have a virtual session planned for Febuary 2024 with ringzer0. There is more information at the end of this post.

The Fuzzers

The target

Throughout the workshop we beat up on a simple target that runs on Linux. This target is not very interesting, but acts as a good example target for our fuzzers. It takes in some text, line by line, and replaces certain identifiers (like {{XXd3sMRBIGGGz5b2}}) with names. To do so, it contains a function with a very large lookup tree. In this function many lookup cases can result in a segmentation fault.

//...
    const char* uid_to_name(const char* uid) {
        /*...*/ // big nested mess of switch statements
                        switch (nbuf[14]) {
                        case 'b':
                            // regular case, no segfault
                            addr = &names[0x4b9];
                            LOG("UID matches known name at %p", addr);
                            return *addr;
                        /*...*/
                        case '7':
                            // a bad case
                            addr = ((const char**)0x68c2);
                            // SEGFAULT here
                            LOG("UID matches known name at %p", addr); 
                            return *addr;
                        /*...*/

This gives us a target that has many diverting code paths, and many reachable "bugs" to find. As we progress we will adapt our fuzzers to this target, showing off some common ways we can mold a fuzzer to a target with LibAFL.

You can find our target here, and the repository includes a couple variations that will be useful for later examples. ./fuzz_target/target.c

Pieces of a Fuzzer

Before we dive into the examples, let's establish an quick understanding of modern fuzzer internals. LibAFL breaks a fuzzer down into pieces that can be swapped out or changed. LibAFL makes great use of rust's trait system to do this. Below we have a diagram of a very simple fuzzer.

The script for this fuzzer could be as simple as the following.

while ! [ -f ./core.* ]
do
    head -c 900 /dev/urandom > ./testfile
    cat ./testfile | ./target
done

The simple fuzzer above follows three core steps.

1) Makes a randomized input

2) Runs the target with the new input

3) Keeps the created input if it causes a "win" (in this case a win is crash that produces a core file)

If you miss any of the above pieces, you won't have a very good fuzzer. We all have heard the sad tale of researchers who piped random inputs into their target, got an exciting crash, but were unable to ever reproduce the bug because they didn't save off the test case.

Even with the above pieces, that simple fuzzer will struggle to make any real progress toward finding bugs. It does not even have a notion of what progress means! Below we have a diagram of what a more modern fuzzer might look like.

A block diagram of a fuzzer with feedback

This fuzzer works off a set of existing inputs, which are randomly mutated to create the new test cases. The "mutations" are just a simple set of modifications to the input that can be quickly applied to generate new exciting inputs. Importantly, this fuzzer also uses observations from the executing target to know if a inputs was "interesting". Instead of only caring out crashes, a fuzzer with feedback can route mutated test cases back into the set of inputs to be mutated. This allows a fuzzer to progress by iterating on an input, tracking down interesting features in the target.

LibAFL provides tools for each of these "pieces" of a fuzzer.

Implementors of the Executor trait will run a target with a given test case.
- https://docs.rs/libafl/latest/libafl/executors/trait.Executor.html
The Corpus trait is for items that hold a collection of test cases, usually for inputs or solutions.
- https://docs.rs/libafl/latest/libafl/corpus/trait.Corpus.html
Something implementing the Scheduler trait is in charge of selecting inputs from a corpus to make the next test cases.
- https://docs.rs/libafl/latest/libafl/schedulers/trait.Scheduler.html
The Mutator trait implementors provide the modifications to change an input before a run.
- https://docs.rs/libafl/latest/libafl/mutators/trait.Mutator.html
The Feedback and Observer traits provide useful tools for deciding if an input was useful.
- https://docs.rs/libafl/latest/libafl/feedbacks/trait.Feedback.html
- https://docs.rs/libafl/latest/libafl/observers/trait.Observer.html

There are other important traits we will see as well. Be sure to look at the "Implementors" section of the trait documentation to see useful implementations provided by the library.

Exec fuzzer

Which brings us to our first example! Let's walk through a bare-bones fuzzer using LibAFL.

./exec_fuzzer/src/main.rs

The source is well-commented, and you should read through it. Here we just highlight a few key sections of this simple fuzzer.

//...
        let mut executor = CommandExecutor::builder()
            .program("../fuzz_target/target")
            .build(tuple_list!())
            .unwrap();

        let mut state = StdState::new(
            StdRand::with_seed(current_nanos()),
            InMemoryCorpus::<BytesInput>::new(),
            OnDiskCorpus::new(PathBuf::from("./solutions")).unwrap(),
            &mut feedback,
            &mut objective,
        ).unwrap();

Our fuzzer uses a "state" object which tracks the set of input test cases, any solution test cases, and other metadata. Notice we are choosing to keep our inputs in memory, but save out the solution test cases to disk.

We use a CommandExecutor for executing our target program, which will run the target process and pass in the test case.

//...
        let mutator = StdScheduledMutator::with_max_stack_pow(
            havoc_mutations(),
            9,                 // maximum mutation iterations
        );

        let mut stages = tuple_list!(StdMutationalStage::new(mutator));

We build a very simple pipeline for our inputs. This pipeline only has one stage, which will randomly select from a set of mutations for each test case.

//...
        let scheduler = RandScheduler::new();
        let mut fuzzer = StdFuzzer::new(scheduler, feedback, objective);

        // load the initial corpus in our state
        // since we lack feedback in this fuzzer, we have to force this,
        state.load_initial_inputs_forced(&mut fuzzer, &mut executor, &mut mgr, &[PathBuf::from("../fuzz_target/corpus/")]).unwrap();

        // fuzz
        fuzzer.fuzz_loop(&mut stages, &mut executor, &mut state, &mut mgr).expect("Error in fuzz loop");

With a fuzzer built from a scheduler and some feedbacks (here we use a ConstFeedback::False to not have any feedback except for the objective feedback which is a CrashFeedback), we can load our initial entries and start to fuzz. We use the created stages, chosen executor, the state, and an event manager to start fuzzing. Our event manager will let us know when we start to get "wins".

[jordan exec_fuzzer]$ ./target/release/exec_fuzzer/

[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 1, objectives: 0, executions: 1, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 2, objectives: 0, executions: 2, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 3, objectives: 0, executions: 3, exec/sec: 0.000
[Objective #0] run time: 0h-0m-1s, clients: 1, corpus: 3, objectives: 1, executions: 3, exec/sec: 2.932
[Stats #0] run time: 0h-0m-15s, clients: 1, corpus: 3, objectives: 1, executions: 38863, exec/sec: 2.590k
[Objective #0] run time: 0h-0m-20s, clients: 1, corpus: 3, objectives: 2, executions: 38863, exec/sec: 1.885k
...

Our fragile target quickly starts giving us crashes, even with no feedback. Working from a small set of useful inputs helps our mutations be able to find crashing inputs.

This simple execution fuzzer gives us a good base to work from as we add features to our fuzzer.

Exec fuzzer with custom feedback

We can't effectively iterate on interesting inputs without feedback. Currently our random mutations must generate a crashing case in one go. If we can add feedback to our fuzzer, then we can identify test cases that did something interesting. We will loop those interesting test cases back into our set of cases for further mutation.

There are many different sources we could turn to for this information. For this example, let's use the fuzz_target/target_dbg binary which is a build of our target with some debug output on stderr. By looking at this debug output we can start to identify interesting cases. If a test case gets us some debug output we hadn't previously seen before, then we can say it is interesting and worth iterating on.

There isn't an existing implementation of this kind of feedback in the LibAFL library, so we will have to make our own! If you want to try this yourself, we have provided a template file in the repository.

./exec_fuzzer_stderr_template/

The LibAFL repo provides a StdErrObserver structure we can use with our CommandExecutor. This observer will allow our custom feedback structure to receive the stderr output from our run. All we need to do is create a structure that implements the is_interesting method of the Feedback trait, and we should be good to go. In that method we are provided with the state, the mutated input, the observers. We just have to get the debug output from the StdErrObserver and determine if we reached somewhere new.

impl<S> Feedback<S> for NewOutputFeedback
where
    S: UsesInput + HasClientPerfMonitor,
{
    fn is_interesting<EM, OT>(
        &mut self,
        _state: &mut S,
        _manager: &mut EM,
        _input: &S::Input,
        observers: &OT,
        _exit_kind: &ExitKind
    ) -> Result<bool, Error>
       where EM: EventFirer<State = S>,
             OT: ObserversTuple<S>
    {
        // return Ok(false) for uninteresting inputs
        // return Ok(true) for interesting ones
        Ok(false)
    }
}

I encourage you to try implementing this feedback yourself. You may want to find some heuristics to ignore unhelpful debug messages. We want to avoid reporting too many inputs as useful, so we don't overfill our input corpus. The input corpus is the set of inputs we use for generating new test cases. We will waste lots of time when there are inputs in that set that are not actually helping us dig towards a win. Ideally we want each of these inputs to be as small and quick to run as possible, while exercising a unique path in our target.

In our solution, we simply keep a set of seen hashes. We report an input to be interesting if we see it caused a unique hash.

./exec_fuzzer_stderr/src/main.rs

//...
        fn is_interesting<EM, OT>(
            &mut self,
            _state: &mut S,
            _manager: &mut EM,
            _input: &S::Input,
            observers: &OT,
            _exit_kind: &ExitKind
        ) -> Result<bool, Error>
           where EM: EventFirer<State = S>,
                 OT: ObserversTuple<S>
        {
            let observer = observers.match_name::<StdErrObserver>(&self.observer_name)
                .expect("A NewOutputFeedback needs a StdErrObserver");

            let mut hasher = DefaultHasher::new();
            hasher.write(&observer.stderr.clone().unwrap());
            let hash = hasher.finish();

            if self.hash_set.contains(&hash) {
                Ok(false)
            } else {
                self.hash_set.insert(hash);
                Ok(true)
            }
        }

This ends up finding "interesting" inputs very quickly, and blowing up our input corpus.

...
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 308, objectives: 0, executions: 4388, exec/sec: 2.520k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 309, objectives: 0, executions: 4423, exec/sec: 2.520k
[Objective #0] run time: 0h-0m-1s, clients: 1, corpus: 309, objectives: 1, executions: 4423, exec/sec: 2.497k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 310, objectives: 1, executions: 4532, exec/sec: 2.520k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 311, objectives: 1, executions: 4629, exec/sec: 2.521k
...

Code Coverage Feedback

Relying on the normal side-effects of a program (like debug output, system interactions, etc) is not very reliable for deeply exploring a target. There may be many interesting features that we miss using this kind of feedback. The feedback of choice for many modern fuzzers is "code coverage". By observing what blocks of code are being executed, we can gain insight into what inputs are exposing interesting logic.

Being able to collect that information, however, is not always straight forward. If you have access to the source code, you may be able to use a compiler to instrument the code with this information. If not, you may have to find ways to dynamically instrument your target either through binary modification, emulation, or other sources.

AFL++ provides a version of clang with compiler-level instrumentation for providing code coverage feedback. LibAFL can observe the information produced by this instrumentation, and we can use it for feedback. We have a build of our target using afl-clang-fast. With this build (target_instrumented), we can use the LibAFL ForkserverExecutor to communicate with our instrumented target. The HitcountsMapObserver can use shared memory for receiving our coverage information each run.

You can see our fuzzer's code here.

./aflcc_fuzzer/src/main.rs

//...
        let mut shmem_provider = UnixShMemProvider::new().unwrap();
        let mut shmem = shmem_provider.new_shmem(MAP_SIZE).unwrap();
        // write the id to the env var for the forkserver
        shmem.write_to_env("__AFL_SHM_ID").unwrap();
        let shmembuf = shmem.as_mut_slice();
        // build an observer based on that buffer shared with the target
        let edges_observer = unsafe {HitcountsMapObserver::new(StdMapObserver::new("shared_mem", shmembuf))};
        // use that observed coverage to feedback based on obtaining maximum coverage
        let mut feedback = MaxMapFeedback::tracking(&edges_observer, true, false);

        // This time we can use a fork server executor, which uses a instrumented in fork server
        // it gets a greater number of execs per sec by not having to init the process for each run
        let mut executor = ForkserverExecutor::builder()
            .program("../fuzz_target/target_instrumented")
            .shmem_provider(&mut shmem_provider)
            .coverage_map_size(MAP_SIZE)
            .build(tuple_list!(edges_observer))
            .unwrap();

The compiled-in fork server should also reduce our time needed to instantiate a run, by forking off partially instantiated processes instead of starting from scratch each time. This should offset some of the cost of our instrumentation.

When executed, our fuzzer quickly finds new paths through the process, building up our corpus of interesting cases and guiding our fuzzer.

[jordan aflcc_fuzzer]$ ./target/release/aflcc_fuzzer 

[Stats #0] run time: 0h-0m-0s, clients: 1, corpus: 0, objectives: 0, executions: 0, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 1, objectives: 0, executions: 1, exec/sec: 0.000
[Stats #0] run time: 0h-0m-0s, clients: 1, corpus: 1, objectives: 0, executions: 1, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 2, objectives: 0, executions: 2, exec/sec: 0.000
[Stats #0] run time: 0h-0m-0s, clients: 1, corpus: 2, objectives: 0, executions: 2, exec/sec: 0.000
...
[Testcase #0] run time: 0h-0m-10s, clients: 1, corpus: 100, objectives: 0, executions: 19152, exec/sec: 1.823k
[Objective #0] run time: 0h-0m-10s, clients: 1, corpus: 100, objectives: 1, executions: 19152, exec/sec: 1.762k
[Stats #0] run time: 0h-0m-11s, clients: 1, corpus: 100, objectives: 1, executions: 19152, exec/sec: 1.723k
[Testcase #0] run time: 0h-0m-11s, clients: 1, corpus: 101, objectives: 1, executions: 20250, exec/sec: 1.821k
...

Custom Mutation

So far we have been using the havoc_mutations, which you can see here is a set of mutations that are pretty good for lots of targets.

https://github.com/AFLplusplus/LibAFL/blob/bd12e060ca263ea650ece0a51a355ac714e7ce75/libafl/src/mutators/scheduled.rs#L296

Many of these mutations are wasteful for our target. In order to get to the vulnerable uid_to_name function, the input must first pass a valid_uid check. In this check, characters outside of the range A-Za-z0-9\-_ are rejected. Many of the havoc_mutations, such as the BytesRandInsertMutator, will introduce characters that are not in this range. This results in many test cases that are wasted.

With this knowledge about our target, we can use a custom mutator that will insert new bytes only in the desired range. Implementing the Mutator trait is simple, we just have to provide a mutate function.

//...
    impl<I, S> Mutator<I, S> for AlphaByteSwapMutator
    where
        I: HasBytesVec,
        S: HasRand,
    {
        fn mutate(
            &mut self,
            state: &mut S,
            input: &mut I,
            _stage_idx: i32,
        ) -> Result<MutationResult, Error> {

            /*
                return Ok(MutationResult::Mutated) when you mutate the input
                or Ok(MutationResult::Skipped) when you don't
            */

            Ok(MutationResult::Skipped)
        }
    }

If you want to try this for yourself, feel free to use the aflcc_custom_mut_template as a template to get started.

./aflcc_custom_mut_template/

In our solution we use a set of mutators, including our new AlphaByteSwapMutator and a few existing mutators. This set should hopefully result in a greater number of valid test cases that make it to the uid_to_name function.

//...
        // we will specify our custom mutator, as well as two other helpful mutators for growing or shrinking
        let mutator = StdScheduledMutator::with_max_stack_pow(
            tuple_list!(
                AlphaByteSwapMutator::new(),
                BytesDeleteMutator::new(),
                BytesInsertMutator::new(),
            ),
            9,
        );

Then in our mutator we use the state's source of random to choose a location, and a new byte from a set of valid characters.

//...
        fn mutate(
            &mut self,
            state: &mut S,
            input: &mut I,
            _stage_idx: i32,
        ) -> Result<MutationResult, Error> {
            // here we apply our random mutation
            // for our target, simply swapping a byte should be effective
            // so long as our new byte is 0-9A-Za-z or '-' or '_'

            // skip empty inputs
            if input.bytes().is_empty() {
                return Ok(MutationResult::Skipped)
            }

            // choose a random byte
            let byte: &mut u8 = state.rand_mut().choose(input.bytes_mut());

            // don't replace tag chars '{{}}'
            if *byte == b'{' || *byte == b'}' {
                return Ok(MutationResult::Skipped)
            }

            // now we can replace that byte with a known good byte
            *byte = *state.rand_mut().choose(&self.good_bytes);

            // technically we should say "skipped" if we replaced a byte with itself, but this is fine for now
            Ok(MutationResult::Mutated)
        }

And that is it! The custom mutator works seamlessly with the rest of the system. Being able to quickly tweak fuzzers like this is great for adapting to your target. Experiments like this can help us quickly iterate when combined with performance measurements.

...
[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 76, objectives: 1, executions: 2339, exec/sec: 1.895k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 77, objectives: 1, executions: 2386, exec/sec: 1.933k
[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 77, objectives: 1, executions: 2386, exec/sec: 1.928k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 78, objectives: 1, executions: 2392, exec/sec: 1.933k
...

Example Problem

At this point, we have a separate target you may want to experiment with! It is a program that contains a small maze, and gives you a chance to create a fuzzer with some custom feedback or mutations to better traverse the maze and discover a crash. Play around with some of the concepts we have introduced here, and see how fast your fuzzer can solve the maze.

./maze_target/

[jordan maze_target]$ ./maze -p

██████████████
█.██......█ ██
█....██ █.☺  █
██████  █ ██ █
██   ██████  █
█  █  █     ██
█ ███   ██████
█  ███ ██   ██
██   ███  █  █
████ ██  ███ █
█    █  ██ █ █
█ ████ ███ █ █
█          █  
████████████

Found:

  ############
  #          #
# # ### #### #
# # ##  #...@#
# ###  ##.####
#  #  ###...##
##   ## ###..#
######...###.#
##.....#..#..#
#..######...##
#.##.#  ######
#....# ##....#
## #......##.#
[Testcase #0] run time: 0h-0m-2s, clients: 1, corpus: 49, objectives: 0, executions: 5745, exec/sec: 2.585k
Found:

  ############
  #          #
# # ### ####@#
# # ##  #....#
# ###  ##.####
#  #  ###...##
##   ## ###..#
######...###.#
##.....#..#..#
#..######...##
#.##.#  ######
#....# ##....#
## #......##.#
[Testcase #0] run time: 0h-0m-3s, clients: 1, corpus: 50, objectives: 0, executions: 8892, exec/sec: 2.587k

Going Faster

Persistent Fuzzer

In previous examples, we have made use of the ForkserverExecutor which works with the forkserver that afl-clang-fast inserted into our target. While the fork server does give us a great speed boost by reducing the start-up time for each target process, we still require a new process for each test case. If we can instead run multiple test cases in one process, we can speed up our fuzzing greatly. Running multiple testcases per target process is often called "persistent mode" fuzzing.

As they say in the AFL++ documentation:

Basically, if you do not fuzz a target in persistent mode, then you are just doing it for a hobby and not professionally :-).

Some targets do not play well with persistent mode. Anything that changes lots of global state each run can have trouble, as we want each test case to run in isolation as much as possible. Even for targets well suited for persistent mode, we usually will have to create a harness around the target code. This harness is just a bit of code we write to call in to the target for fuzzing. The AFL++ documentation on persistent mode with LLVM is a great reference for writing these kinds of harnesses.

When we have created such a harness, the inserted fork server will detect the ability to persist, and can even use shared memory to provide the test cases. LibAFL's ForkserverExecutor can let us make use of these persistent harnesses.

Our fuzzer using a persistent harness is not much changed from our previous fuzzers.

./persistent_fuzzer/src/main.rs

The main change is in telling our ForkServerExecutor that it is_persistent(true).

//...
        let mut executor = ForkserverExecutor::builder()
            .program("../fuzz_target/target_persistent")
            .is_persistent(true)
            .shmem_provider(&mut shmem_provider)
            .coverage_map_size(MAP_SIZE)
            .build(tuple_list!(edges_observer))
            .unwrap();

The ForkserverExecutor takes care of the magic to make this all happen. Most of our work goes into actually creating an effective harness! If you want to try and craft your own, we have a bit of a template ready for you to get started.

./fuzz_target/target_persistent_template.c

In our harness we want to be careful to reset state each round, so we remain as true to our original as possible. Any modified global variables, heap allocations, or side-effects from a run that could change the behavior of future runs needs to be undone. Failure to clean the program state can result in false positives or instability. If we want our winning test cases from this fuzzer to also be able to crash the original target, then we need to emulate the original target's behavior as close as possible.

Sometimes it is not worth it to emulate the original, and instead use our harness to target deeper surface. For example in our target we could directly target the uid_to_name function, and then convert the solutions into solutions for our original target later. We would want to also call valid_uid in our harness, to ensure we don't report false positives that would never work against our original target.

You can inspect our persistent harness here; we choose to repeatedly call process_line for each line and take care to clean up after ourselves.

./fuzz_target/target_persistent.c

Where previously we saw around 2k executions per second for our fuzzers with code coverage feedback, we are now seeing around 5k or 6k, still with just one client.

[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 171, objectives: 4, executions: 95677, exec/sec: 5.826k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 172, objectives: 4, executions: 96236, exec/sec: 5.860k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 172, objectives: 4, executions: 96236, exec/sec: 5.821k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 173, objectives: 4, executions: 96933, exec/sec: 5.863k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 173, objectives: 4, executions: 96933, exec/sec: 5.798k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 174, objectives: 4, executions: 98077, exec/sec: 5.866k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 174, objectives: 4, executions: 98077, exec/sec: 5.855k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 175, objectives: 4, executions: 98283, exec/sec: 5.867k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 175, objectives: 4, executions: 98283, exec/sec: 5.853k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 176, objectives: 4, executions: 98488, exec/sec: 5.866k

In-Process Fuzzer

Using AFL++'s compiler and fork server is not the only way to achieve multiple test cases in one process. LibAFL is an extremely flexible library, and supports all sorts of scenarios. The InProcessExecutor allows us to run test cases directly in the same process as our fuzzing logic. This means if we can link with our target somehow, we can fuzz in the same process.

The versatility of LibAFL means we can build our entire fuzzer as a library, which we can link into our target, or even preload into our target dynamically. LibAFL even supports nostd (compilation without dependency on an OS or standard library), so we can treat our entire fuzzer as a blob to inject into our target's environment. As long as execution reaches our fuzzing code, we can fuzz.

In our example we build our fuzzer and link with our target built as a static library, calling into the C code directly using rust's FFI.

Building our fuzzer and causing it to link with our target is done by providing a build.rs file, which the rust compilation will use.

./inproc_fuzzer/build.rs

//...
    fn main() {
        let target_dir = "../fuzz_target".to_string();
        let target_lib = "target_libfuzzer".to_string();

        // force us to link with the file 'libtarget_libfuzzer.a'
        println!("cargo:rustc-link-search=native={}", &target_dir);
        println!("cargo:rustc-link-lib=static:+whole-archive={}", &target_lib);

        println!("cargo:rerun-if-changed=build.rs");
    }

LibAFL also provides tools to wrap the clang compiler, if you wish to create a compiler that will automatically inject your fuzzer into the target. You can see examples of this in the LibAFL examples.

We will want a harness for this target as well, so we can pass our test cases in as a buffer instead of having the target read lines from stdin. We will use the common interface used by libfuzzer, which has us create a function called LLVMFuzzerTestOneInput. LibAFL even has some helpers that will do the FFI calls for us.

Our harness can be very similar to the one we created for persistent mode fuzzing. We also have to watch out for the same kinds of global state or memory leaks that could make our fuzzing unstable. Again, we have a template for you if you want to craft the harness yourself.

./fuzz_target/target_libfuzzer_template.c

With LLVMFuzzerTestOneInput defined in our target, and a static library made, our fuzzer can directly call into the harness for each test case. We define a harness function which our executor will call with the test case data.

//...
        // our executor will be just a wrapper around a harness
        // that calls out the the libfuzzer style harness
        let mut harness = |input: &BytesInput| {
            let target = input.target_bytes();
            let buf = target.as_slice();
            // this is just some niceness to call the libfuzzer C function
            // but we don't need to use a libfuzzer harness to do inproc fuzzing
            // we can call whatever function we want in a harness, as long as it is linked
            libfuzzer_test_one_input(buf);
            return ExitKind::Ok;
        };

        let mut executor = InProcessExecutor::new(
            &mut harness,
            tuple_list!(edges_observer),
            &mut fuzzer,
            &mut state,
            &mut restarting_mgr,
        ).unwrap();

This easy interoperability with libfuzzer harnesses is nice, and again we see a huge speed improvement over our previous fuzzers.

[jordan inproc_fuzzer]$ ./target/release/inproc_fuzzer 

Starting up
[Stats       #1]  (GLOBAL) run time: 0h-0m-16s, clients: 2, corpus: 0, objectives: 0, executions: 0, exec/sec: 0.000
                  (CLIENT) corpus: 0, objectives: 0, executions: 0, exec/sec: 0.000, edges: 0/37494 (0%)
...
[Testcase    #1]  (GLOBAL) run time: 0h-0m-19s, clients: 2, corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.79k
                  (CLIENT) corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.79k, edges: 136/37494 (0%)
[Stats       #1]  (GLOBAL) run time: 0h-0m-19s, clients: 2, corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.75k
                  (CLIENT) corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.75k, edges: 137/37494 (0%)
[Testcase    #1]  (GLOBAL) run time: 0h-0m-19s, clients: 2, corpus: 103, objectives: 5, executions: 106626, exec/sec: 30.88k
                  (CLIENT) corpus: 103, objectives: 5, executions: 106626, exec/sec: 30.88k, edges: 137/37494 (0%)
[Objective   #1]  (GLOBAL) run time: 0h-0m-20s, clients: 2, corpus: 103, objectives: 6, executions: 106626, exec/sec: 28.32k
...

In this fuzzer we are also making use of a very important tool offered by LibAFL: the Low Level Message Passing (LLMP). This provides quick communication between multiple clients and lets us effectively scale our fuzzing to multiple cores or even multiple machines. The setup_restarting_mgr_std helper function creates an event manager that will manage the clients and restart them when they encounter crashes.

//...
        let monitor = MultiMonitor::new(|s| println!("{s}"));

        println!("Starting up");

        // we use a restarting manager which will restart
        // our process each time it crashes
        // this will set up a host manager, and we will have to start the other processes
        let (state, mut restarting_mgr) = setup_restarting_mgr_std(monitor, 1337, EventConfig::from_name("default"))
            .expect("Failed to setup the restarter!");

        // only clients will return from the above call
        println!("We are a client!");

This speed gain is important, and can make the difference between finding the juicy bug or not. Plus, it feels good to use all your cores and heat up your room a bit in the winter.

Emulation

Of course, not all targets are so easy to nicely link with or instrument with a compiler. In those cases, LibAFL provides a number of interesting tools like libafl_frida or libafl_nyx. In this next example we are going to use LibAFL's modified version of QEMU to give us code coverage feedback on a binary with no built in instrumentation. The modified version of QEMU will expose code coverage information to our fuzzer for feedback.

The setup will be similar to our in-process fuzzer, except now our harness will be in charge of running the emulator at the desired location in the target. By default the emulator state is not reset for you, and you will want to reset any global state changed between runs.

If you want to try it out for yourself, consult the Emulator documentation, and feel free to start with our template.

./qemu_fuzzer_template/

In our solution we first execute some initialization until a breakpoint, then save off the stack and return address. We will have to reset the stack each run, and put a breakpoint on the return address so that we can stop after our call. We also map an area in our target where we can place our input.

//...
        emu.set_breakpoint(mainptr);
        unsafe { emu.run() };

        let pc: GuestReg = emu.read_reg(Regs::Pc).unwrap();
        emu.remove_breakpoint(mainptr);

        // save the ret addr, so we can use it and stop
        let retaddr: GuestAddr = emu.read_return_address().unwrap();
        emu.set_breakpoint(retaddr);

        let savedsp: GuestAddr = emu.read_reg(Regs::Sp).unwrap();

        // now let's map an area in the target we will use for the input.
        let inputaddr = emu.map_private(0, 0x1000, MmapPerms::ReadWrite).unwrap();
        println!("Input page @ {inputaddr:#x}");

Now in the harness itself we will take the input and write it into the target, then start execution at the target function. This time we are executing the uid_to_name function directly, and using a mutator that will not add any invalid characters that valid_uid would have stopped.

//...
        let mut harness = |input: &BytesInput| {
            let target = input.target_bytes();
            let mut buf = target.as_slice();
            let mut len = buf.len();

            // limit out input size
            if len > 1024 {
                buf = &buf[0..1024];
                len = 1024;
            }

            // write our testcase into memory, null terminated
            unsafe {
                emu.write_mem(inputaddr, buf);
                emu.write_mem(inputaddr + (len as u64), b"\0\0\0\0");
            };
            // reset the registers as needed
            emu.write_reg(Regs::Pc, parseptr).unwrap();
            emu.write_reg(Regs::Sp, savedsp).unwrap();
            emu.write_return_address(retaddr).unwrap();
            emu.write_reg(Regs::Rdi, inputaddr).unwrap();

            // run until our breakpoint at the return address
            // or a crash
            unsafe { emu.run() };

            // if we didn't crash, we are okay
            ExitKind::Ok
        };

This emulation can be very quick, especially if we can get away without having to reset a lot of state each run. By targeting a deeper function here we are likely to reach crashes quickly.

...
[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 54, objectives: 0, executions: 33349, exec/sec: 31.56k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 55, objectives: 0, executions: 34717, exec/sec: 32.85k
[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 55, objectives: 0, executions: 34717, exec/sec: 31.59k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 56, objectives: 0, executions: 36124, exec/sec: 32.87k
[2023-11-25T20:24:02Z ERROR libafl::executors::inprocess::unix_signal_handler] Crashed with SIGSEGV
[2023-11-25T20:24:02Z ERROR libafl::executors::inprocess::unix_signal_handler] Child crashed! 
[Objective #0] run time: 0h-0m-1s, clients: 1, corpus: 56, objectives: 1, executions: 36124, exec/sec: 28.73k
...

LibAFL also provides some useful helpers such as QemuAsanHelper and QemuSnapshotHelper. There is even support for full system emulation, as opposed to usermode emulation. Being able to use emulators effectively when fuzzing opens up a whole new world of targets.

Generation

Our method of starting with some initial inputs and simply mutating them can be very effective for certain targets, but less so for more complicated inputs. If we start with an input of some javascript like:

if (a < b) {
    somefunc(a);
}

Our existing mutations might result in the following:

if\x00 (a << b) {
    somefu(a;;;;
}

Which might find some bugs in parsers, but is unlikely to find deeper bugs in any javascript engine. If we want to exercise the engine itself, we will want to mostly produce valid javascript. This is a good use case for generation! By defining a grammar of what valid javascript looks like, we can generate lots of test cases to throw against the engine.

A block diagram of a basic generative fuzzer

As you can see in the diagram above, with just generation alone we are no longer using a mutation+feedback loop. There are lots of successful fuzzers that have gotten wins off generation alone (domato, boofuzz, a bunch of weird midi files), but we would like to have some form of feedback and progress in our fuzzing.

In order to make use of feedback in our generation, we can create an intermediate representation (IR) of our generated data. Then we can feed back the interesting cases into our inputs to be further mutated.

So our earlier javascript could be expressed as tokens like:

(if
    (cond_lt (var a), (var b)),
    (code_block
        (func_call some_func,
            (arg_list (var a))
        )
    )
)

Our mutations on this tokenized version can do things like replace tokens with other valid tokens or add more nodes to the tree, creating a slightly different input. We can then use these IR inputs and mutations as we did earlier with code coverage feedback.

A block diagram of a generative fuzzer with mutation feedback

Now mutations on the IR could produce something like so:

(if
    (cond_lt (const 0), (var b)),
    (code_block
        (func_call some_func
            (arg_list
                (func_call some_func,
                    (arg_list ((var a), (var a)))
                )
            )
        )
    )
)

Which would render to valid javascript, and can be further mutated upon if it produces interesting feedback.

if (0 < b) {
    somefunc(somefunc(a,a));
}

LibAFL provides some great tools for getting your own generational fuzzer with feedback going. A version of the Nautilus fuzzer is included in LibAFL. To use it with our example, we first define a grammar describing what a valid input to our target looks like.

./aflcc_custom_gen/grammar.json

With LibAFL we can load this grammar into a NautilusContext that we can use for generation. We use a InProcessExecutor and in our harness we take in a NautilusInput which we render to bytes and pass to our LLVMFuzzerTestOneInput.

./aflcc_custom_gen/src/main.rs

//...
    // our executor will be just a wrapper around a harness closure
    let mut harness = |input: &NautilusInput| {
        // we need to convert our input from a natilus tree
        // into actual bytes
        input.unparse(&genctx, &mut bytes);

        let s = std::str::from_utf8(&bytes).unwrap();
        println!("Trying:\n{:?}", s);

        let buf = bytes.as_mut_slice();

        libfuzzer_test_one_input(&buf);

        return ExitKind::Ok;
    };

We also need to generate a few initial IR inputs and specify what mutations to use.

//...
    if state.must_load_initial_inputs() {
        // instead of loading from an inital corpus, we will generate our initial corpus of 9 NautilusInputs
        let mut generator = NautilusGenerator::new(&genctx);
        state.generate_initial_inputs_forced(&mut fuzzer, &mut executor, &mut generator, &mut restarting_mgr, 9).unwrap();
        println!("Created initial inputs");
    }

    // we can't use normal byte mutations, so we use mutations that work on our generator trees
    let mutator = StdScheduledMutator::with_max_stack_pow(
        tuple_list!(
            NautilusRandomMutator::new(&genctx),
            NautilusRandomMutator::new(&genctx),
            NautilusRandomMutator::new(&genctx),
            NautilusRecursionMutator::new(&genctx),
            NautilusSpliceMutator::new(&genctx),
            NautilusSpliceMutator::new(&genctx),
        ),
        3,
    );

With this all in place, we can run and get the combined benefits of generation, code coverage, and in-process execution. To iterate on this, we can further improve our grammar as we better understand our target.

//...
                  (CLIENT) corpus: 145, objectives: 2, executions: 40968, exec/sec: 1.800k, edges: 167/37494 (0%)
[Testcase    #1]  (GLOBAL) run time: 0h-0m-26s, clients: 2, corpus: 146, objectives: 2, executions: 41229, exec/sec: 1.811k
                  (CLIENT) corpus: 146, objectives: 2, executions: 41229, exec/sec: 1.811k, edges: 167/37494 (0%)
[Objective   #1]  (GLOBAL) run time: 0h-0m-26s, clients: 2, corpus: 146, objectives: 3, executions: 41229, exec/sec: 1.780k
                  (CLIENT) corpus: 146, objectives: 3, executions: 41229, exec/sec: 1.780k, edges: 167/37494 (0%)
[Stats       #1]  (GLOBAL) run time: 0h-0m-27s, clients: 2, corpus: 146, objectives: 3, executions: 41229, exec/sec: 1.755k

Note that our saved solutions are just serialized NautilusInputs and will not work when used against our original target. We have created a separate project that will render these solutions out to bytes with our grammar.

./gen_solution_render/src/main.rs

//...
    let input: NautilusInput = NautilusInput::from_file(path).unwrap();
    let mut b = vec![];

    let tree_depth = 0x45;
    let genctx = NautilusContext::from_file(tree_depth, grammarpath);

    input.unparse(&genctx, &mut b);

    let s = std::str::from_utf8(&b).unwrap();
    println!("{s}");

[jordan gen_solution_render]$ ./target/release/gen_solution_render ../aflcc_custom_gen/solutions/id\:0

bar{{PLvkLizOcGccywcS}}foo

{{EGgkWs-PxeqpwBZK}}foo

bar{{hlNeoKiwMTNfqO_h}}

[jordan gen_solution_render]$ ./target/release/gen_solution_render ../aflcc_custom_gen/solutions/id\:0 | ../fuzz_target/target

Segmentation fault (core dumped)

Example Problem 2

This brings us to our second take home problem! We have a chat client that is vulnerable to a number of issues. Fuzzing this binary could be made easier though good use of generation and/or emulation. As you find some noisy bugs you may wish to either avoid those paths in your fuzzer, or patch the bugs in your target. Bugs can often mask other bugs. You can find the target here.

./chat_target/

As well as one example solution that can fuzz the chat client.

./chat_solution/src/main.rs

-- Ping from    16937944: D�DAAAATt'AAAAPt'%222�%%%%%%9999'pRR9&&&%%%%%2Tt�{�''pRt�'%99999999'pRR9&&&&&&999AATt'%&'pRt�'%TTTTTTTTTTTTTT9999999'a%''AAA��TTt�'% --
-- Error sending message: Bad file descriptor --
[Stats #0] run time: 0h-0m-5s, clients: 1, corpus: 531, objectives: 13, executions: 26752, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-5s, clients: 1, corpus: 532, objectives: 13, executions: 26760, exec/sec: 0.000
-- Ping from    16937944: D�DAAAATT'%'aRt�'%9999'pRR����������T'%'LLLLLLLLLLLa%'nnnnnmnnnT'AA''��'A�'%'p%''A9999'pRR����������'pRR��R�� --
[2023-11-25T21:29:19Z ERROR libafl::executors::inprocess::unix_signal_handler] Crashed with SIGSEGV
[2023-11-25T21:29:19Z ERROR libafl::executors::inprocess::unix_signal_handler] Child crashed!

Conclusion

The goal of this workshop is to show the versatility of LibAFL and encourage its use. Hopefully these examples have sparked some ideas of how you can encorporate custom fuzzers against some of your targets. Let us know if you have any questions or spot any issues with our examples. Alternately, if you have an interesting target and want us to find bugs in it for you, please contact us.

Course Plug

Thanks again for reading! If you like this kind of stuff, you may be interested in our course "Practical Symbolic Execution for VR and RE" where you will learn to create your own symbolic execution harnesses for: reverse engineering, deobfuscation, vulnerability detection, exploit development, and more. The next public offering is in Febuary 2024 as part of ringzer0's BOOTSTRAP24. We are also available for private offerings on request.

More info here. https://ringzer0.training/trainings/practical-symbolic-execution.html

Blog - Atredis Partners
Symbolic Triage: Making the Best of a Good SituationJordan Whitehead
31 October 2022 at 18:40

Symbolic Triage: Making the Best of a Good Situation

Blog - Atredis Partners

By: Jordan Whitehead

31 October 2022 at 18:40

Symbolic Execution can get a bad rap. Generic symbex tools have a hard time proving their worth when confronted with a sufficiently complex target. However, I have found symbolic execution can be very helpful in certain targeted situations. One of those situations is when triaging a large number of crashes coming out of a fuzzer, especially in cases where dealing with a complicated or opaque target. This is the "Good Situation" I have found myself in before, where my fuzzer handed me a large load of crashes that resisted normal minimization and de-duplication. By building a small symbolic debugger I managed a much faster turnaround time from fuzz-case to full understanding.

In this post I want to share my process for writing symbolic execution tooling for triaging crashes, and try to highlight tricks I use to make the tooling effective and flexible. The examples here all use the great Triton library for our symbolic execution and solving. The examples all use code hosted at: github.com/atredis-jordan/SymbolicTriagePost

(Oh BTW, we have a course!) Do you reverse engineer and symbolically execute in your workflows, or want to?
Are you using fuzzing today but want to find more opportunities to improve it and find deeper and more interesting bugs?
Can you jam with the console cowboys in cyberspace?
We've developed a 4-day course called "Practical Symbolic Execution for VR and RE" that's tailored towards these exact goals. It’s fun and practical, with lots of demos and labs to practice applying these concepts in creative ways. If that sounds interesting to you, there is more information at the bottom of this post. Hope to see you there!

We will be using a bunch of crashes in Procmon64.exe for our examples. Procmon's parsing of PML (Process Monitor Log) files is pretty easy to knock over, and we can quickly get lots of crashes out of a short fuzzing session. It is a large opaque binary with some non-determinism to the crashes, so useful tooling here will help us to speed up our reverse engineering efforts. Note that we weren't exhaustive in trying to find bugs in Procmon; so although these bugs we will talk about here don't appear super useful to an attacker, I won't be opening any untrusted PML files any time soon.

I gathered a bunch of crashes by making a few very small PML files and throwing Jackalope at the target. After a few hours we had 200ish odd crashes to play with. Many of the crashes were unstable, and only reproduced occasionally.

..\Jackalope\Release\fuzzer.exe -iterations_per_round 30 -minimize_samples false -crash_retry 0 -nthreads 32 -in - -resume -out .\out -t 5000 -file_extension PML -instrument_module procmon64.exe -- procmon64.exe /OpenLog @@ /Quiet /Runtime 1 /NoFilter /NoConnect

Fuzzing Procmon's PML paser with Jackalope

A Simple Debugger

With all this hyping up symbolic execution, our first step is to not use Symbolic Execution! Knowing when to turn to symbolic execution and when just to use emulation or a debugger is a good skill to have. In this case, we are going to write a very simple debugger using the Windows debugging API. This debugger can be used to re-run our crashing inputs, find out how stable they are, see if they all happen in the main thread, gather stack traces, etc.

Also, having a programmatic debugger will be very useful when we start symbolically executing. We will talk about that here in a second, first let's get our debugger off the ground.

Quick aside. All my code examples here are in python, because I like being able to pop into IPython in my debuggers. I defined a bunch of ctypes structures in the win_types.py file. I recommend having some programmatic way to to generate the types you need. Look into PDBRipper or cvdump as a good place to start.

Okay, so first we want a debugger that can run the process until it crashes and collect the exception information. The basic premise is we start a process as debugged (our connect_debugger function in triage.py), and then wait on it until we get an unhandled exception. Like so:

    handle, main_tid = connect_debugger(cmd)

    log("process", 3, f": -- ")

    event = dbg_wait(handle, None)
    code = event.dwDebugEventCode

    if code == EXIT_PROCESS_DEBUG_EVENT:
        log("crash", 1, f" Closed with no crash")
    elif code == EXCEPTION_DEBUG_EVENT:
        # exception to investigate
        log("crash", 1, f" crashed:")
        er = event.u.Exception.ExceptionRecord
        log("crash", 1, exceptionstr(handle, er, event.dwThreadId))

    else:
        log("process", 1, f" hit unexpected Debug Event ")

    dbg_kill(handle)

A piece of triage.py's handle_case, running a single test case

*Running the above code to get the exception information from a crash*

Many of the crashes will not happen every time due to some non-determinism. Running through all our test cases multiple times in our debugger, we can build a picture of which crashes are the most stable, if they stay in the main thread, and what kind of exception is happening.

.\crsh\access_violation_0000xxxxxxxxx008_00000xxxxxxxx5AA_1.PML -- 100% (18) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x520 read at 0x5aa
         EXCEPTION_STACK_BUFFER_OVERRUN(0xc0000409) @ 0x83c
.\crsh\access_violation_0000xxxxxxxxx008_00000xxxxxxxx5AA_2.PML -- 100% (34) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x520 read at 0x5aa
.\crsh\access_violation_0000xxxxxxxxx063_00000xxxxxxxx3ED_1.PML -- 100% (34) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x520 read at 0x3ed
...
.\crsh\access_violation_0000xxxxxxxxx3D4_00000xxxxxxxxED1_2.PML -- 52% (23) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed1
.\crsh\access_violation_0000xxxxxxxxx234_00000xxxxxxxxED4_3.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed4
.\crsh\access_violation_0000xxxxxxxxx3CA_00000xxxxxxxxED1_1.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed1
.\crsh\access_violation_0000xxxxxxxxx5EC_00000xxxxxxxx0A2_1.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed4
.\crsh\access_violation_0000xxxxxxxxx5EF_00000xxxxxxxxF27_1.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xecb
.\crsh\access_violation_0000xxxxxxxxxB46_00000xxxxxxxxFF4_1.PML -- 44% (18) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed4
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
.\crsh\access_violation_0000xxxxxxxxx25A_00000xxxxxxxxED4_1.PML -- 38% (21) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xecb
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x19d read at 0x184
...

Gathered information from multiple runs

Let me have another quick aside. Windows exceptions are nice because they can contain extra information. The exception record tells us if an access violation is a read or a write, as well as the pointer that lead to the fault. On Linux, it can be hard to get that information programmatically, as a SEGFAULT is just a SEGFAULT. Here we can use our symbolic execution engine to lift only the faulting instruction. The engine will provide us the missing information on what loads or stores happened, letting us differentiate between a boring NULL read and an exciting write past the end of a page.

Getting our Symbolic Execution Running, and a Few Tricks

We now have a simple debugger working using the Windows debugger API (or ptrace or whatever). Now we can add our symbolic engine into the mix. The game plan is to use our debugger to run the target until our input is in memory somewhere. Then we will mark the input as symbolic and trace through the rest of the instructions in our symbolic engine.

Marking input as “symbolic” here means we are telling our engine that these values are to be tracked as variables, instead of just numbers. This will let the expressions we see all be in terms of our input variables, like “rax: (add INPUT_12 0x12)” instead of just “rax: 0x53”. A better term would be “concolic” (concrete-symbolic) because we are still using the actual value of these input bytes, just adding the symbolic information on top of them. I just use the term symbolic in this post, though.

Our debugger will tell us when we reach the exception. From there we should be able to inspect the state at the crash in terms of our symbolic input. For an access violation we hope to see that the pointer dereferenced is symbolically "(0xwhatever + INPUT_3c)" or some other symbolic expression, showing us what in our input caused the crash.

This information is useful for root causing the crash (we will see a couple cool tricks for working with this information in the next section). We gather this symbolic info so we can take the constraints that kept us on the crashing path, along with our own constraints, and send those to a solver. Using the solver we can ask "What input would make this pointer be X instead?" This lets us quickly identify a Write-What-Where from a Read8-AroundHere, or a Write-That-ThereGiveOrTake100. We can break our symbolic debugger at any point in a trace and use the solver to answer our questions.

*Stopping at a breakpoint, and seeing what input would make RBX equal 0x12340000 here*

Note: I should point out that it isn't strictly necessary to use a debugger at all. We could just load procmon64.exe and it's libraries into our symbolic execution engine, and then emulate the instructions without a debugger's help. If you see the great examples in the Triton repo, you will notice that none of them step along with a debugger. I like using a symbolic execution engine alongside a debugger for a couple of reasons. I’ll highlight a few of those reasons in the following paragraphs.

The main reason is probably to avoid gaslighting myself. With a debugger or a concrete execution trace I have a ground truth I can follow along with. Without that it is easy to make a mistake when setting up our execution environment and not realize until much later. Things like improperly loading libraries, handling relocations, or setting up the TEB and PEB on windows. By using a debugger, we can just setup our execution environment by pulling in chunks of memory from the actual process. We can also load the memory on demand, so we can save time on very large processes. In our example we load the memory lazily with Triton's GET/SET_CONCRETE_MEMORY_VALUE callbacks.

def tri_init(handle, onlyonsym=False, memarray=False):
    # do the base initialization of a TritonContext

    ctx = TritonContext(ARCH.X86_64)
    ctx.setMode(MODE.ONLY_ON_SYMBOLIZED, onlyonsym)
    if memarray:
        ctx.setMode(MODE.MEMORY_ARRAY, True)
    else:
        ctx.setMode(MODE.ALIGNED_MEMORY, True)
    ctx.setMode(MODE.AST_OPTIMIZATIONS, True)

    # set lazy memory loading
    def getmemcb(ctx, ma):
        addr = ma.getAddress()
        sz = ma.getSize()
        # will only load pages that have not been previously loaded
        tri_load_dbg_mem(ctx, handle, addr, sz, False)

    def setmemcb(ctx, ma, val):
        addr = ma.getAddress()
        sz = ma.getSize()
        # will only load pages that have not been previously loaded
        tri_load_dbg_mem(ctx, handle, addr, sz, True)

    ctx.addCallback(CALLBACK.GET_CONCRETE_MEMORY_VALUE, getmemcb)
    ctx.addCallback(CALLBACK.SET_CONCRETE_MEMORY_VALUE, setmemcb)

    return ctx

Setting up Triton in triage.py

The debugger also lets us handle instructions that are unknown to our symbolic execution engine. For example, Triton does not have a definition for the 'rdrand' instruction. By single stepping alongside our debugger, we can simply fix up any changed registers when we encounter unknown instructions. This could lead to a loss of symbolic information if the instruction is doing something with our symbolic inputs, but for the most part we can get away with just ignoring these instructions.

Lastly, using our debugger gives us another really nice benefit; we can just skip over whole swaths of irrelevant code! We have to be very careful with what we mark as irrelevant, because getting it wrong can mean we lose a bunch of symbolic information. With procmon, I marked most of the drawing code as irrelevant. When hitting one of these imports from user32 or gdi32, we place a breakpoint and let the debugger step over those calls, then resume single stepping with Triton. This saves a ton of time, as symbolic execution is orders of magnitude slower than actual execution. Any irrelevant code we can step over can make a huge difference.

Without a debugger we can still do this, but it usually involves writing hooks that will handle any important return values or side effects from those calls, instead of just bypassing them with our debugger. Building profiling into our tooling can help us identify those areas of concern, and adjust our tooling to gain back some of that speed.

    # skip drawing code
    if skip_imports:
        impfuncs = dbg_get_imports_from(handle, base, ["user32.dll", "gdi32.dll", "comdlg32.dll", "comctl32.dll"])
        for name in impfuncs:
            addr = impfuncs[name]

            # don't skip a few user32 ones
            skip = True
            for ds in ["PostMessage", "DefWindowProc", "PostQuitMessage", "GetMessagePos", "PeekMessage", "DispatchMessage", "GetMessage", "TranslateMessage", "SendMessage", "CallWindowProc", "CallNextHook"]:
                if ds.lower() in name.lower():
                    skip = False
                    break

            if skip:
                hooks[addr] = (skipfunc_hook, name)

Skipping unneeded imports in triage.py

For our target, skipping imports wasn't enough. We were still spending lots of time in loops inside the procmon binary. A quick look confirmed that these were a statically included memset and memcpy. We can't just skip over memcpy because we will lose the symbolic information being copied. So for these two, we wrote a hook that would handle the operation symbolically in our python, without having to emulate each instruction. We made sure that copied bytes got a copy of the symbolic expression in the source data.

    for i in range(size):
        sa = MemoryAccess(src + i, 1)
        da = MemoryAccess(dst + i, 1)
        cell = ctx.getMemoryAst(sa)
        expr = ctx.newSymbolicExpression(cell, "memcpy byte")
        ctx.assignSymbolicExpressionToMemory(expr, da)

Transfering symbolic information in our memcpy hook

These kind of hooks not only save us time, but they are a great opportunity to check the symbolic arguments going into the memcpy or memset. Even if the current trace is not going to crash inside of the memcpy, we have the ability to look at those symbolic arguments and ask "Could this memcpy reach unmapped memory?" or "Could the size argument be unreasonably large?". This can help us find other vulnerabilities, or other expressions of the issues we are already tracing. Below is a small check that tries to see if the end of a memcpy's destination could be some large amount away.

        astctx = ctx.getAstContext()
        cond = ctx.getPathPredicate()
        # dst + size
        dstendast = ctx.getRegisterAst(ctx.registers.rcx) + ctx.getRegisterAst(ctx.registers.r8)
        # concrete value of the dst + size
        dstendcon = dst + size
        testpast = 0x414141
        cond = astctx.land([cond, (dstendcon + testpast) <= dstendast])

        log("hook", 5, "Trying to solve for a big memcpy")
        model, status, _ = ctx.getModel(cond, True)
        if status == SOLVER_STATE.SAT:
            # can go that far
            # this may not be the cause of our crash though, so let's just report it, not raise it
            log("crash", 2, "Symbolic memcpy could go really far!")

A simple check in our memcpy hook from triage.py

The tradeoff of these checks is that invoking the solver often can add to our runtime, and you probably don't want them enabled all the time.

At this point we have most of what we need to run through our crashing cases and start de-duplicating and root-causing. However, some of our access violations were still saying that the bad dereference did not depend on our input. This didn't make sense to me, so I suspected we were losing symbolic information along the way somehow. Sometimes this can happen due to concretization of pointers, so turning on Triton's new MEMORY_ARRAY mode can help us recover that information (at the cost of a lot of speed).

In this case, however, I had my tooling print out all the imported functions being called along the trace. I wanted to see if any of the system calls on the path were causing a loss of symbolic information. Or if there was a call that re-introduced the input without it being symbolized. I found that there was another second call to MapViewOfFile that was remapping our input file into memory in a different location. With a hook added to symbolize the remapped input, all our crashes were now reporting their symbolic relation to the input correctly!

*Our tooling showing an AST for one of the bad dereferences*

Using our Symbolic Debugger

Cool! Now we have symbolic information for our crashes. What do we do with it?

Well first, we can quickly group our crashes by what input they depend on. This is quite helpful; even though some issues can lead to crashes in multiple locations, we can still group them together by what exact input is lacking bounds checks. This can help us understand a bug better, and also see different ways the bug can interact with the system.

By grouping our crashes, it looked like our 200ish crashes boil down to four distinct groups: three controlled pointers being read from and one call to __fastfail.

One neat tool Triton gives us is backward slicing! Because Triton can keep a reference of the associated instruction when building it's symbolic expressions, we can generate a instruction trace that only contains the instructions relevant to our final expression. I used this to cut out most code along the trace as irrelevant, and be able to walk just the pieces of code between the input and the crash that were relevant. Below we gather relevant instructions that created the bad pointer that was dereferenced in one of our crashes.

def backslice_expr(ctx, symbexp, print_expr=True):
    # sort by refId to put things temporal
    # to get a symbolic expression from a load access, do something like:
    # symbexp = inst.getLoadAccess()[0][0].getLeaAst().getSymbolicExpression()
    items = sorted(ctx.sliceExpressions(symbexp).items(), key=lambda x: x[0])
    for _, expr in items:
        if print_expr:
            print(expr)
        da = expr.getDisassembly()
        if len(da) > 0:
            print("\t" if print_expr else "", da)

A back-slicing helper in triage.py-

Being able to drop into the IPython REPL at any point of the trace and see the program state in terms of my input is very helpful during my RE process.

For the call to __fastfail (kinda like an abort for Windows), we don't have a bad dereference to back-slice here, instead we have the path constraints our engine gathered. These constraints are grabbed any time the engine sees that we could symbolically go either way at a junction. To stay tied to our concrete path, the engine notes down the condition required for our path. For example: if we take a jne branch after having compared the INPUT_5 byte against 0, the engine will add a path constraint saying "If you want to stay on the path we took, make sure INPUT_5 is not 0", or "(not (= INPUT_5 (_ bv0 8))" in AST-speak.

These path constraints are super useful. We can use them to generate other inputs that would go down unexplored paths. There are lots of nice symbolic execution tools that use this to help a fuzzer by generating interesting new inputs. (SymCC, KLEE, Driller, to name three)

In our case, we can inspect them to find out why we ended up at the __fastfail. By just looking at the most recent path constraint, we can see where our path forked off most recently due to our input.

*The most recent path constraint leading to the __fastfail*

*The disassembly around the most recent path constraint*

The path constraint tells us that the conditional jump at 0x7FF7A3F43517 in the above disassembly is where our path last forked due to one of our input values. When we follow the trace after this fork, we can see that it always leads directly to our fatal condition. To get more information on why the compare before the fork failed, I dropped into an IPython shell for us at that junction. Our tooling makes it easy to determine the control we have over the pointer in RCX being dereferenced before the branch. That makes this another crash due to a controlled pointer read.

Showing the AST for the dereferenced register before the branch

Where To Go From Here

So from here we have a pretty good understanding of why these issues exist in procmon64.exe. Digging in a little deeper into the crashes shows that they are probably not useful for crafting a malicious PML file. If I wanted to keep going down this road, my next steps would include:

Generating interesting test cases for our fuzzer based off the known unchecked areas in the input
Identifying juicy looking functions in our exploit path. With our tooling we can gather information on what control we have in those functions. With this information we can start to path explore or generate fuzz cases that follow our intuition about what areas look interesting.
Patch out the uninteresting crash locations, and let our fuzzer find better paths without being stopped by the low-hanging fruit.
Generate a really small crashing input for fun.

The official policy of Microsoft is "All Sysinternals tools are offered 'as is' with no official Microsoft support." We were unable to find a suitable place to report these issues. If anyone with ties to the Sysinternals Suite wants more information, please contact us.

Hope this Helped! Come Take the Course!

I hope this post helped you see useful ways in which creative symbolic execution tooling could help their workflow! If anyone has questions or wants to talk about it, you can message me @jordan9001 or at [email protected].

If you got all the way to the end of this post, you would probably like our course!

"Practical Symbolic Execution for VR and RE" is hands-on. I had a lot of fun making it, and we look at a variety of ways to apply these concepts. Students spend time becoming comfortable with a few frameworks, deobfuscating binaries, detecting time-of-check time-of-use bugs, and other interesting stuff.

You can give us your information below, and we will email you when we next offer a public course. (No spam or anything else, I promise.)

If you have a group that would be interested in receiving this kind of training privately, feel free to contact us about that as well! I’d like to see you in a class sometime!

Thanks!

Subscribe to Atredis Training News

Email Address

We respect your privacy, and will not use your contact information for anything other than news about Atredis Trainings.

Thank you!

Symbolic Triage: Making the Best of a Good Situation

Blog - Atredis Partners
QEMU and U: Whole-system tracing with QEMU customizationJordan Whitehead
15 April 2021 at 18:06

QEMU and U: Whole-system tracing with QEMU customization

Blog - Atredis Partners

By: Jordan Whitehead

15 April 2021 at 18:06

Introduction

QEMU is a key tool for anyone searching for bugs in diverse places. Besides just opening the doors to expensive or opaque platforms, QEMU has several internal tools available to enable developer’s further insight and control. Researchers comfortable modifying QEMU have access to powerful inspection capabilities. We will walk through a recent custom addition to QEMU to highlight some helpful internal tools and demonstrate the power of a hackable emulator.

The target was a SoC that had an interesting system spread across multiple processes and libraries. We could communicate with this system from the external network, and we wanted to know the extent of our reach before authentication. Because of the design of the system, it was not simple to track down all the places our influence reached without valid credentials. A better map of that surface area would be helpful for further findings. We had done the prior work to get the target up and running in QEMU, so why not just have the emulator tell us?

Tracing in QEMU

Tracing guest execution in QEMU is not as simple as calling printf(“%p\n”, pc); for every instruction. The thing that puts the Q in QEMU is the TCG. The TCG (Tiny Code Generator) is a just in time compiler that will translate blocks of guest instructions to code that can run on the host. While it would be simple to trace each new block translated, once they are translated the blocks can run multiple times unimpeded and untracked by QEMU code. If all that is needed is a trace of when each block is translated, there is built-in tracing in QEMU that can give that information. (The event is translate_block. See the docs for more details.)

Once a block is translated, the emitted code may be used and reused many times. For our target we wanted to be able to start our trace when the system was in a steady state, when many blocks would have already been translated. If we want to trace every time some basic block is executed in the guest, we need to emit our own operations in front of the rest of the translated block.

There are lots of great references we can turn to for emitting custom operations alongside the translated code. QEMU itself can place instructions before each basic block that are used to count the number of instructions executed. We can follow the call to get_tb_start here in translator.c, which leads here. Operations to check the instruction count are added so execution can be halted if a limit is reached.

/*...*/
    tcg_gen_ld_i32(count, cpu_env,
                   offsetof(ArchCPU, neg.icount_decr.u32) -
                   offsetof(ArchCPU, env));

    if (tb_cflags(tb) & CF_USE_ICOUNT) {
        /*
         * We emit a sub with a dummy immediate argument. Keep the insn index
         * of the sub so that we later (when we know the actual insn count)
         * can update the argument with the actual insn count.
         */
        tcg_gen_sub_i32(count, count, tcg_constant_i32(0));
        icount_start_insn = tcg_last_op();
    }

    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx->exitreq_label);

The piece of gen_tb_start emitting the conditional branch

Thankfully, we do not have to specify individual operations like the icount code does. To simplify things, QEMU can generate “helper” functions which will generate operations to call out to a native function from within the translated blocks. This is what AFL++’s fork of qemu uses for its tracing instrumentation without having to modify the guest binary. AFL++’s qemu has to track unique paths taken, and the code makes for a good example for our use case. The In the AFL++ fork, the function afl_gen_trace is called immediately before a basic block is translated.

tcg_ctx->cpu = env_cpu(env);
    afl_gen_trace(pc);
    gen_intermediate_code(cpu, tb, max_insns);

In tb_gen_code where afl emits operations to trace execution

They there call gen_helper_afl_maybe_log, but searching the source we can find no definition for that function. This is a helper function. The build system will create a definition that will emit operations to perform a call to the function HELPER(afl_maybe_log).

void HELPER(afl_maybe_log)(target_ulong cur_loc) {

  register uintptr_t afl_idx = cur_loc ^ afl_prev_loc;

  INC_AFL_AREA(afl_idx);

  afl_prev_loc = cur_loc >> 1;

}

AFL++'s trace helper, adjusting a map in shared memory

The function was declared here as DEF_HELPER_FLAGS_1(afl_maybe_log, TCG_CALL_NO_RWG, void, t1). QEMU’s build system will handle generating the code to create TCG operations to call the helper function. The “_1” indicates it takes one argument, and the last two arguments to the macro are the return type, and the argument type. tl indicates target_ulong. Another helpful argument type is env which passes an CPUArchState * argument to the helper function. ptr, i64, f32, and such all do what they say on the tin.

For our tool, we used a helper function to call to call out at the beginning of every code block. In target/arm/translate.c we added gen_helper_bb_enter(cpu_env, tcg_constant_i32(4)) in the function arm_tr_tb_start which is called at the beginning of translating every block for an ARM guest. This will generate code for each basic block that will call our function HELPER(bb_enter).

This leads us to another problem we encounter when trying to trace such a complex system. On our target, if we implement HELPER(bb_enter) with fprintf(logfile, “@%p\n”, env->regs[15]) we are quickly going to slow our emulator to a crawl, and be left with huge unreasonable files. In our case, we did not care too much about the order in which these basic blocks were hit, we just cared what basic blocks were uniquely hit when we interact with the system over the network. For this we implemented a form of Differential Debugging.

We had to communicate to QEMU when to start and stop a trace, so we could take separate recordings. A recording of area covered while running the system without interacting with it over the network, and a separate recording of lots of various non-authenticated interaction with the system over the network. We then found the area covered in the second recording that was not covered in the first. Then we had our tool report this as surface area to be further tested and reviewed for vulnerabilities.

To do this we implemented the tracing as a bitmap of the address space we cared about. We adjusted the granularity of our map so that every entry accounted for 0x10 bytes of code, which for our 32-bit arm target produced perfectly manageable file sizes.

// paddr to start watching
#define MAP_START_PADDR 0x80000000
// size of memory region
#define MAP_SIZE        0x20000000
#define MAP_END_PADDR   (MAP_START_PADDR + MAP_SIZE)
#define MAP_GRAN_SHF    4   // 0x10 granularity

#define INDX_OFF        (MAP_START_PADDR >> MAP_GRAN_SHF)
#define BB_MAP_INDEX(addr)  ((addr - INDX_OFF) >> 3)
#define BB_MAP_BIT(addr)    (addr & ((1<<3)-1))

unsigned char bb_map[(MAP_SIZE >> (MAP_GRAN_SHF + 3))];
void HELPER(bb_enter)(CPUARMState *env, int blksz)
{
    /* ... */
    pend = pstart + blksz - 1;

    if ((pstart < MAP_START_PADDR) || (pstart >= MAP_END_PADDR)) {
        // not in region
        return;
    }

    if (pend >= MAP_END_PADDR) {
        pend = MAP_END_PADDR-1;
    }

    pstart >>= MAP_GRAN_SHF;
    pend >>= MAP_GRAN_SHF;

    while (pstart <= pend) {
        bb_map[BB_MAP_INDEX(pstart)] |= (1 << BB_MAP_BIT(pstart));
        pstart++;
    }

    return;
}

Piece of relevant code for implementing our tracing bitmap

We also had to add some method to communicate to our emulator when to start, stop, clear, or write out a coverage map. QEMU provides a nice way to implement commands such as these in its HMP (human monitor) system. The documentation contains instructions on how to add monitor commands. The basic process involves adding an entry in the hmp-commands.hx file describing the command names, the arguments expected, and a bit of info about the command. The handler declarations can go in include/monitor/hmp.h, and the definitions typically go in monitor/hmp-cmds.c.

We implemented a clear, start, stop, and write command for our tracing.

For many targets this would be enough, and we could move on to writing tooling to convert our coverage information to file offsets. The system we wanted to gather info on was running in usermode code across multiple processes on our target. If we had logged based on the instruction pointer, we would have a trace of virtual addresses across all processes. Most of these virtual addresses are not going to be unique across processes, rendering our system coverage mostly meaningless.

We got around this issue with a bit of a hack. By translating the virtual address of the instruction pointer to a physical address, we can avoid aliasing between processes. This works for the system we were testing because the relevant processes all remained running the whole time. For an extra measure we turned swap off, keeping our pages from moving around underneath us.

This is probably not a Good Idea™ for most tracing use cases, but it worked well for our setup and we were able to implement it quickly. We made use of a function in QEMU called get_phys_addr that exists for ARM targets. We probably would have been better off using something that made use of the TLB, as the constant translation slowed down the emulator noticeably when our tracing was enabled.

/*...*/
    target_ulong start;
    hwaddr pstart;
    hwaddr pend;
    MemTxAttrs attrs = { 0 };
    int prot = 0;
    target_ulong page_size = { 0 };
    ARMMMUFaultInfo fi = { 0 };
    ARMCacheAttrs cacheattrs = { 0 };

    start = env->regs[15];
    // convert to physical address

    // >:|
    // returns bool, but 0 means success
    if (get_phys_addr(
                    env,
                    start,
                    MMU_INST_FETCH,
                    arm_mmu_idx(env),
                    &pstart,
                    &attrs,
                    &prot,
                    &page_size,
                    &fi,
                    &cacheattrs
    )) {
        printf("DBG Could not get phys addr for %x\n", start);
        return;
    }

Our call to get_phys_addr to translate the instruction pointer into a physical address

To work with physical addresses, we confined our coverage map to the part of the physical address space that we knew was correlated with RAM. Before and after obtaining our two coverage recordings we took physical memory dumps of our system using the existing QEMU monitor command pmemsave. To parse the coverage date for unique coverage, we made a small script that evaluated the dumps, the coverage maps, and any relevant binary files. Upon finding bits in the bitmap that are unique to the second recording, the script checks if the memory dumps show this to be in one of the relevant binary files. We cannot do exact matching on the binary files because relocations will have changed the contents, so we simply align the text section and check if it is “near enough” a match. With a good threshold for “near enough” we obtained accurate results. From there we translated the unique bit locations to file offsets and generate coverage data that could be used with IDA, Binary Ninja, or Ghidra.

(Lighthouse is our favorite coverage plugin for IDA and Binary Ninja. Dragon Dance is a good alternative for Ghidra. Lighthouse’s modoff format is very simple to implement. If coverage compatible with Dragon Dance is needed, the drcov format is simple enough to implement, and Qiling framework has some good example code for generating it.)

Conclusion

This solution worked well for our target and gave us some areas to dig into that would have otherwise been difficult to find quickly. The purpose of this post is not to introduce some new fork of QEMU with this tool built in. There are already too many unmaintained forks of QEMU for vulnerability research, and this tool would be a lot less effective in other situations. This is meant as more of a love note to QEMU, and hopefully inspires other researchers to make better use of their favorite emulator. The internals of QEMU made so our tracing tool could be developed quickly, and we could return focus to finding vulnerabilities.