NVISO Labs

Analyzing VSTO Office Files

29 April 2022 at 09:25

VSTO Office files are Office document files linked to a Visual Studio Tools for Office (VSTO) application. When opened, they launch a custom .NET application. There are various ways to achieve this, including methods to serve the VSTO files via an external web server.

An article was recently published on the creation of these document files for phishing purposes, and since then we have observed some VSTO Office files on VirusTotal.

Analysis Method (OOXML)

Sample Trusted Updater.docx (0/60 detections) first appeared on VirusTotal on 20/04/2022, 6 days after the publication of said article. It is a .docx file and, as can be expected, it does not contain VBA macros (by definition, .docm files contain VBA macros, .docx files do not):

Figure 1: typical VSTO document does not contain VBA code

Taking a look at the ZIP container (a .docx file is an OOXML file, i.e. a ZIP container containing XML files and other file types), there are some aspects that we don’t usually see in “classic” .docx files:

Figure 2: content of sample file

Worth noting is the following:

  1. The presence of files in a folder called vstoDataStore. These files contain metadata for the execution of the VSTO file.
  2. The timestamp of some of the files is not 1980-01-01, as it should be with documents created with Microsoft Office applications like Word.
  3. The presence of a docProps/custom.xml file.

Checking the content of the custom document properties file, we find 2 VSTO related properties: _AssemblyLocation and _AssemblyName:

Figure 3: custom properties _AssemblyLocation and _AssemblyName

The _AssemblyLocation in this sample is a URL to download a VSTO file from the Internet. We were not able to download the VSTO file, and neither was VirusTotal at the time of scanning. Thus we cannot determine whether this sample is a PoC, part of a red team engagement or truly malicious. It is a fact, though, that this technique was known and used by red teams like ours prior to the publication of said article.

There’s little information regarding domain login03k[.]com, except that it appeared last year in a potential phishing domain list, and that VirusTotal tags it as DGA.

If the document uses a local VSTO file, then the _AssemblyLocation is not a URL:

Figure 4: referencing a local VSTO file
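The check for these two custom properties can also be scripted. What follows is a minimal sketch using only Python's standard library, not the method used in this post: the function names vsto_properties and is_remote are hypothetical, and the sketch assumes the properties are stored in docProps/custom.xml using the standard custom-properties namespace.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_PROPS = "docProps/custom.xml"
NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"
VSTO_NAMES = ("_AssemblyLocation", "_AssemblyName")


def vsto_properties(docx_bytes):
    """Return the VSTO-related custom properties found in an OOXML file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as container:
        if CUSTOM_PROPS not in container.namelist():
            return {}
        root = ET.fromstring(container.read(CUSTOM_PROPS))
    found = {}
    for prop in root.iter(NS + "property"):
        if prop.get("name") in VSTO_NAMES:
            # The value sits in a typed child element such as vt:lpwstr.
            found[prop.get("name")] = "".join(prop.itertext()).strip()
    return found


def is_remote(location):
    """True when _AssemblyLocation is a URL rather than a local path."""
    return location.lower().startswith(("http://", "https://"))
```

Calling vsto_properties on the raw bytes of a .docx file returns an empty dictionary for a classic document, and is_remote then separates the two cases shown in figures 3 and 4.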

Analysis Method (OLE)

OLE files (the default Office document format prior to Office 2007) can also be associated with VSTO applications. We have found several examples on VirusTotal, but none that are malicious.
Therefore, to illustrate how to analyze such a sample, we converted the .docx maldoc from our first analysis, to a .doc maldoc.

Figure 5: analysis of .doc file

Taking a look at the metadata with oledump‘s plugin_metadata, we find the _AssemblyLocation and _AssemblyName properties (with the URL):

Figure 6: custom properties _AssemblyLocation and _AssemblyName

Notice that this metadata does not appear when you use oledump’s option -M:

Figure 7: olefile’s metadata result

Option -M extracts the metadata using olefile’s methods, and this olefile Python module (on which oledump relies) does not (yet) parse user-defined properties.

Conclusion

To analyze Office documents linked with VSTO apps, search for custom properties _AssemblyLocation and _AssemblyName.

To detect Office documents like these, we have created some YARA rules for our VirusTotal hunting. You can find them on our GitHub here. Some of them are rather generic by design, and will generate too many hits for use in a production environment. They were originally designed for hunting on VT.

We will discuss these rules in detail in a follow-up blog post, but we already wanted to share these with you.

About the authors

Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and Microsoft MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.

You can follow NVISO Labs on Twitter to stay up to date on all our future research and publications.

Analyzing a “multilayer” Maldoc: A Beginner’s Guide

6 April 2022 at 08:21

In this blog post, we will not only analyze an interesting malicious document, but we will also demonstrate the steps required to get you up and running with the necessary analysis tools. There is also a howto video for this blog post.

I was asked to help with the analysis of a PDF document containing a DOCX file.

The PDF is REMMITANCE INVOICE.pdf, and can be found on VirusTotal, MalwareBazaar and Malshare (you don’t need a subscription to download from MalwareBazaar or Malshare, so everybody that wants to, can follow along).

The sample is interesting for analysis, because it involves 3 different types of malicious documents.
And this blog post will also be different from other maldoc analysis blog posts we have written, because we show how to do the analysis on a machine with a pristine OS and without any preinstalled analysis tools.


To follow along, you just need to be familiar with operating systems and their command-line interface.
We start with an Ubuntu 20.04 LTS virtual machine (make sure that it is up-to-date by issuing the “sudo apt update” and “sudo apt upgrade” commands). We create a folder for the analysis: /home/testuser1/Malware (we usually create a folder per sample, with the current date in the folder name, like this: 20220324_twitter_pdf). testuser1 is the account we use; you will have another account name.

Inside that folder, we copy the malicious sample. To clearly mark the sample as (potentially) malicious, we give it the extension .vir. This also prevents accidental launching/execution of the sample. If you want to know more about handling malware samples, take a look at this SANS ISC diary entry.

Figure 1: The analysis machine with the PDF sample

The original name of the PDF document is REMMITANCE INVOICE.pdf, and we renamed it to REMMITANCE INVOICE.pdf.vir.
To conduct the analysis, we need tools that I develop and maintain. These are free, open-source tools, designed for static analysis of malware. Most of them are written in Python (a free, open-source programming language).
These tools can be found here and on GitHub.

PDF Analysis

To analyze a malicious PDF document like this one, we are not opening the PDF document with a PDF reader like Adobe Reader. Instead, we are using dedicated tools to dissect the document and find malicious code. This is known as static analysis.
Opening the malicious PDF document with a reader, and observing its behavior, is known as dynamic analysis.

Both are popular analysis techniques, and they are often combined. In this blog post, we are performing static analysis.

To install the tools from GitHub on our machine, we issue the following “git clone” command:

Figure 2: The “git clone” command fails to execute

As can be seen, this command fails, because on our pristine machine, git is not yet installed. Ubuntu is helpful and suggests the command to execute to install git:

sudo apt install git

Figure 3: Installing git
Figure 4: Installing git

When the DidierStevensSuite repository has been cloned, we will find a folder DidierStevensSuite in our working folder:

Figure 5: Folder DidierStevensSuite is the result of the clone command

With this repository of tools, we have different maldoc analysis tools at our disposal, like PDF analysis tools.
pdfid.py and pdf-parser.py are two PDF analysis tools found in Didier Stevens’ Suite. pdfid is a simple triage tool, that looks for known keywords inside the PDF file, that are regularly associated with malicious activity. pdf-parser.py is able to parse a PDF file and identify basic building blocks of the PDF language, like objects.

To run pdfid.py on our Ubuntu machine, we can start the Python interpreter (python3), and give it the pdfid.py program as first parameter, followed by options and parameters specific for pdfid. The first parameter we provide for pdfid, is the name of the PDF document to analyze. Like this:

Figure 6: pdfid’s analysis report

In the report provided as output by pdfid, we see a bunch of keywords (first column) and a counter (second column). This counter simply indicates the frequency of the keyword: how many times does it appear in the analyzed PDF document?

As you can see, many counters are zero: keywords with zero counter do not appear in the analyzed PDF document. To make the report shorter, we can use option -n. This option excludes zero counters (n = no zeroes) from the report, like this:

Figure 7: pdfid’s condensed analysis report
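To make the idea of a keyword counter concrete, here is a deliberately simplified sketch in Python. It is not pdfid's implementation (for instance, it ignores obfuscated keyword names), and the names KEYWORDS, count_keywords and without_zeroes are our own, chosen for illustration.

```python
import re

# A small selection of the keywords that appear in this post.
KEYWORDS = [b"/Page", b"/ObjStm", b"/JS", b"/JavaScript",
            b"/OpenAction", b"/EmbeddedFile"]


def count_keywords(pdf_bytes):
    """Count how many times each keyword appears in the raw PDF bytes."""
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead prevents /Page from also matching /Pages.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode()] = len(re.findall(pattern, pdf_bytes))
    return counts


def without_zeroes(counts):
    """Mimic the effect of a condensed report: drop zero counters."""
    return {name: n for name, n in counts.items() if n > 0}
```

Like pdfid, such a counter only sees what is present in clear text in the file: keywords hidden inside compressed stream objects are not counted.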

The keywords that interest us the most, are the ones after the /Page keyword.
Keyword /EmbeddedFile means that the PDF contains an embedded file. This feature can be used for benign and malicious purposes. So we need to look into it.
Keyword /OpenAction means that the PDF reader should do something automatically, when the document is opened. Like launching a script.
Keyword /ObjStm means that there are stream objects inside the PDF document. Stream objects are special objects, that contain other objects. These contained objects are compressed. pdfid is in nature a simple tool, that is not able to recognize and handle compressed data. This has to be done with pdf-parser.py. Whenever you see stream objects in pdfid’s report (e.g., /ObjStm with counter greater than zero), you have to realize that pdfid is unable to give you a complete report, and that you need to use pdf-parser to get the full picture. This is what we do with the following command:

Figure 8: pdf-parser’s statistical report

Option -a is used to have pdf-parser.py produce a report of all the different elements found inside the PDF document, together with keywords like pdfid.py produces.
Option -O is used to instruct pdf-parser to decompress stream objects (/ObjStm) and include the contained objects into the statistical report. If this option is omitted, then pdf-parser’s report will be similar to pdfid’s report. To know more about this subject, we recommend this blog post.

In this report, we see again keywords like /EmbeddedFile. 1 is the counter (i.e., there is one embedded file) and 28 is the index of the PDF object for this embedded file.
New keywords that did appear, are /JS and /JavaScript. They indicate the presence of scripts (code) in the PDF document. The objects that represent these scripts, are found (compressed) inside the stream objects (/ObjStm). That is why they did not appear in pdfid’s report, and why they do in pdf-parser’s report (when option -O is used).
JavaScript inside a PDF document is restricted in its interactions with the operating system resources: it can not access the file system, the registry, … .
Nevertheless, the included JavaScript can be malicious code (a legitimate reason for the inclusion of JavaScript in a PDF document, is input validation for PDF forms).
But we will first take a look at the embedded file. We do this by searching for the /EmbeddedFile keyword, like this:

Figure 9: Searching for embedded files

Notice that the search option -s is not case sensitive, and that you do not need to include the leading slash (/).
pdf-parser found one object that represents an embedded file: the object with index 28.
Notice the keywords /Filter /FlateDecode: this means that the embedded file is not included into the PDF document as-is, but that it has been “filtered” first (i.e., transformed). /FlateDecode indicates which transformation was applied: “deflation”, i.e., zlib compression.
To obtain the embedded file in its original form, we need to decompress the contained data (stream), by applying the necessary filters. This is done with option -f:

Figure 10: Decompressing the embedded file

The long string of data (it looks random) produced by pdf-parser when option -f is used, is the decompressed stream data in Python’s byte string representation. Notice that this data starts with PK: this is a strong indication that the embedded file is a ZIP container.
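The /FlateDecode filter itself is easy to reproduce with Python's zlib module. The sketch below only illustrates the principle on already extracted stream bytes (locating the stream inside the PDF is the job of pdf-parser); the function names are hypothetical.

```python
import io
import zipfile
import zlib


def apply_flate_decode(stream_bytes):
    """Undo the /FlateDecode filter: the stream is zlib-compressed data."""
    return zlib.decompress(stream_bytes)


def looks_like_zip(data):
    """ZIP containers (and thus OOXML files) start with the letters PK."""
    return data.startswith(b"PK")
```

If apply_flate_decode is given the raw stream of an embedded .docx file, looks_like_zip returns True for the result, which is the same observation we made by eye in the output of option -f.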
We will now use option -d to dump (write) the contained file to disk. Since it is (potentially) malicious, we use again extension .vir.

Figure 11: Extracting the embedded file to disk

File embedded.vir is the embedded file.

Office document analysis

Since I was told that the embedded file is an Office document, we use a tool I developed for Office documents: oledump.py
But if you would not know what type the embedded file is, you would first want to determine this. We will actually have to do that later, with a downloaded file.

Now we run oledump.py on the embedded file we extracted: embedded.vir

Figure 12: No ole file was found

The output of oledump here is a warning: no ole file was found.
A bit of background can help understand what is happening here. Microsoft Office document files come in 2 major formats: ole files and OOXML files.
Ole files (official name: Compound File Binary Format) are the “old” file format: the binary format that was default until Office 2007 was released. Documents using this internal format have extensions like .doc, .xls, .ppt, …
OOXML files (Office Open XML) are the “new” file format. It’s the default since Office 2007. Its internal format is a ZIP container containing mostly XML files. Other contained file types that can appear are pictures (.png, .jpeg, …) and ole (for VBA macros for example). OOXML files have extensions like .docx, .xlsx, .docm, .xlsm, …
OOXML is based on another format: OPC.
oledump.py is a tool to analyze ole files. Most malicious Office documents nowadays use VBA macros. VBA macros are always stored inside ole files, even with the “new” format OOXML. OOXML documents that contain macros (like .docm), have one ole file inside the ZIP container (often named vbaProject.bin) that contains the actual VBA macros.
Now, let’s get back to the analysis of our embedded file: oledump tells us that it found no ole file inside the ZIP container (OPC).
This tells us 1) that the file is a ZIP container, and more precisely, an OPC file (thus most likely an OOXML file) and 2) that it does not contain VBA macros.
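When the type of a file is not known in advance, the first bytes are a good hint. The following sketch is a simplification under our own assumptions (a real identification tool checks much more than a prefix); the table MAGICS and the function sniff_type are hypothetical names.

```python
# Well-known prefixes of the file types encountered in this post.
MAGICS = [
    (bytes.fromhex("D0CF11E0A1B11AE1"), "ole"),  # Compound File Binary Format
    (b"PK", "zip"),                               # ZIP container, e.g. OOXML
    (b"%PDF", "pdf"),
    (b"{\\rt", "rtf"),                            # also matches a proper {\rtf1
]


def sniff_type(data):
    """Return a coarse file type based on the first bytes of the data."""
    for magic, label in MAGICS:
        if data.startswith(magic):
            return label
    return "unknown"
```

A result of "zip" does not yet prove that a file is an OOXML document: for that, the files contained in the ZIP container have to be inspected.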
If the Office document contains no VBA macros, we need to look at the files that are present inside the ZIP container. This can be done with a dedicated tool for the analysis of ZIP files: zipdump.py
We just need to pass the embedded file as parameter to zipdump, like this:

Figure 13: Looking inside the ZIP container

Every line of output produced by zipdump, represents a contained file.
The presence of folder “word” tells us that this is a Word file, thus extension .docx (because it does not contain VBA macros).
When an OOXML file is created/modified with Microsoft Office, the timestamp of the contained files will always be 1980-01-01.
In the result we see here, there are many files that have a different timestamp: this tells us, that this .docx file has been altered with a ZIP tool (like WinZip, 7zip, …) after it was saved with Office.
This is often an indicator of malicious intent.
If we are presented with an Office document that has been altered, it is recommended to take a look at the contained files that were most recently changed, as this is likely the file that has been tampered with for malicious purposes.
In our extracted sample, that contained file is the file with timestamp 2022-03-23 (that’s just a day ago, time of writing): file document.xml.rels.
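The timestamp check can be automated with Python's zipfile module. This is a minimal sketch of the reasoning above, not a replacement for zipdump.py; OFFICE_DATE and altered_entries are names we made up for the example.

```python
import io
import zipfile

# Files written by Microsoft Office carry this date inside the container.
OFFICE_DATE = (1980, 1, 1)


def altered_entries(zip_bytes):
    """List (name, date_time) for entries whose date is not 1980-01-01."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as container:
        odd = [(info.filename, info.date_time)
               for info in container.infolist()
               if info.date_time[:3] != OFFICE_DATE]
    # Most recently modified first: the likeliest place to start looking.
    return sorted(odd, key=lambda item: item[1], reverse=True)
```

An empty list means that all timestamps are as expected; otherwise, the first entry of the list is the most recently changed file.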
We can use zipdump.py to take a closer look at this file. We do not need to type its full name to select it, we can just use its index: 14 (this index is produced by zipdump, it is not metadata).
Using option -s, we can select a particular file for analysis, and with option -a, we can produce a hexadecimal/ascii dump of the file content. We start with this type of dump, so that we can first inspect the data and assure ourselves that the file is indeed XML (it should be pure XML, but since it has been altered, we must be careful).

Figure 14: Hexadecimal/ascii dump of file document.xml.rels

This does indeed look like XML: thus we can use option -d to dump the file to the console (stdout):

Figure 15: Using option -d to dump the file content

There are many URLs in this output, and XML is readable to us humans, so we can search for suspicious URLs. But since this is XML without any newlines, it’s not easy to read. We might easily miss one URL.
Therefore, we will use a tool to help us extract the URLs: re-search.py
re-search.py is a tool that uses regular expressions to search through text files. And it comes with a small embedded library of regular expressions, for URLs, email addresses, …
If we want to use the embedded regular expression for URLs, we use option -n url.
Like this:

Figure 16: Extracting URLs

Notice that we use option -u to produce a list of unique URLs (remove duplicates from the output) and that we are piping 2 commands together. The output of command zipdump is provided as input to command re-search by using a pipe (|).
Many tools in Didier Stevens’ Suite accept input from stdin and produce output to stdout: this allows them to be piped together.
Most URLs in the output of re-search have schemas.openxmlformats.org as FQDN: these are normal URLs, to be expected in OOXML files. To help filtering out URLs that are expected to be found in OOXML files, re-search has an option to filter out these URLs. This is option -F with value officeurls.

Figure 17: Filtered URLs

One URL remains: this is suspicious, and we should try to download the file for that URL.
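The same extract-deduplicate-filter logic can be sketched in a few lines of Python. Mind that the regular expression and the list of expected hosts below are our own simplified assumptions, and not the ones embedded in re-search.py.

```python
import re

# Simplified URL pattern: stop at whitespace, quotes and angle brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

# Hosts that are normal to find inside OOXML files (assumed, incomplete list).
EXPECTED_HOSTS = ("schemas.openxmlformats.org", "schemas.microsoft.com")


def unique_urls(text):
    """Return all URLs found in the text, without duplicates, in order."""
    seen = []
    for url in URL_RE.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


def unexpected_urls(text):
    """Keep only the URLs whose host is not an expected OOXML host."""
    keep = []
    for url in unique_urls(text):
        host = url.split("/")[2].lower()
        if host not in EXPECTED_HOSTS:
            keep.append(url)
    return keep
```

Whatever remains after filtering deserves a closer look, exactly like the single URL that remains in figure 17.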

Before we do that, we want to introduce another tool that can be helpful with the analysis of XML files: xmldump.py. xmldump parses XML files with Python’s built-in XML parser, and can represent the parsed output in different formats. One format is “pretty printing”: this makes the XML file more readable, by adding newlines and indentations. Pretty printing is achieved by passing parameter pretty to tool xmldump.py, like this:

Figure 18: Pretty print of file document.xml.rels

Notice that the <Relationship> element with the suspicious URL, is the only one with attribute TargetMode=”External”.
This is an indication that this is an external template, that is loaded from the suspicious URL when the Office document is opened.
It is therefore important to retrieve this file.
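Finding relationships with attribute TargetMode="External" can also be done programmatically. Here is a minimal sketch with Python's built-in XML parser; it assumes the standard OPC relationships namespace, and the function name external_targets is hypothetical.

```python
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_targets(rels_xml):
    """Return the Target of every relationship marked as External."""
    root = ET.fromstring(rels_xml)
    return [rel.get("Target")
            for rel in root.iter(REL_NS + "Relationship")
            if rel.get("TargetMode") == "External"]
```

The input is the content of a .rels file, for example the output of zipdump's option -d for file document.xml.rels.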

Downloading a malicious file

We will download the file with curl. Curl is a very flexible tool to perform all kinds of web requests.
By default, curl is not installed in Ubuntu:

Figure 19: Curl is missing

But it can of course be installed:

Figure 20: Installing curl

And then we can use it to try to download the template. Often, we do not want to download that file using an IP address that can be linked to us or our organisation. We often use the Tor network to hide behind. We use option -x 127.0.0.1:9050 to direct curl to use a proxy, namely the Tor service running on our machine. And then we like to use option -D to save the headers to disk, and option -o to save the downloaded file to disk with a name of our choosing and extension .vir.
Notice that we also number the header and download files, as we know from experience, that often several attempts will be necessary to download the file, and that we want to keep the data of all attempts.

Figure 21: Downloading with curl over Tor fails

This fails: the connection is refused. That’s because port 9050 is not open: the Tor service is not installed. We need to install it first:

Figure 22: Installing Tor

Next, we try again to download over Tor:

Figure 23: The download still fails

The download still fails, but with another error. The CONNECT keyword tells us that curl is trying to use an HTTP proxy, and Tor uses a SOCKS5 proxy. I used the wrong option: instead of option -x, I should be using option --socks5 (-x is for HTTP proxies).

Figure 24: The download seems to succeed

But taking a closer look at the downloaded file, we see that it is empty:

Figure 25: The downloaded file is empty, and the headers indicate status 301

The content of the headers file indicates status 301: the file was permanently moved.
Curl will not automatically follow redirections. This has to be enabled with option -L, let’s try again:

Figure 26: Using option -L

And now we have indeed downloaded a file:

Figure 27: Download result

Notice that we are using index 2 for the downloaded files, so as not to overwrite the first downloaded files.
Downloading over Tor will not always work: some servers will refuse to serve the file to Tor clients.
And downloading with Curl can also fail, because of the User Agent String. The User Agent String is a header that Curl includes whenever it performs a request: this header indicates that the request was done by curl. Some servers are configured to only serve files to clients with the “proper” User Agent String, like the ones used by Office or common web browsers.
If you suspect that this is the case, you can use option -A to provide an appropriate User Agent String.

As the downloaded file is a template, we expect it is an Office document, and we use oledump.py to analyze it:

Figure 28: Analyzing the downloaded file with oledump fails

But this fails. Oledump does not recognize the file type: the file is not an ole file or an OOXML file.
We can use Linux command file to try to identify the file type based on its content:

Figure 29: Command file tells us this is pure text

If we are to believe this output, the file is a pure text file.
Let’s do a hexadecimal/ascii dump with command xxd. Since this will produce many pages of output, we pipe the output to the head command, to limit the output to the first 10 lines:

Figure 30: Hexadecimal/ascii dump of the downloaded file

RTF document analysis

The file starts with {\rt : this is a deliberately malformed RTF file. Rich Text Format is a file format for Word documents that is pure text. The format does not support VBA macros. Most of the time, malicious RTF files perform malicious actions through exploits.
Proper RTF files should start with {\rtf1. The fact that this file starts with {\rt. is a clear indication that the file has been tampered with (or generated with a maldoc generator): Word will not produce files like this. However, Word’s RTF parser is forgiving enough to accept files like this.

Didier Stevens’ Suite contains a tool to analyze RTF files: rtfdump.py
By default, running rtfdump.py on an RTF file produces a lot of output:

Figure 31: Parsing the RTF file

The most important fact we know from this output is that this is indeed an RTF file, since rtfdump was able to parse it.
As RTF files often contain exploits, they often use embedded objects. Filtering rtfdump’s output for embedded objects can be done with option -O:

Figure 32: There are no embedded objects

No embedded objects were found. Then we need to look at the hexadecimal data: since RTF is a text format, binary data is encoded with hexadecimal digits. Looking back at figure 31, we see that the second entry (number 2) contains 8349 hexadecimal digits (h=8349). That’s the first entry we will inspect further.
Notice that 8349 is an odd number, and that encoding a single byte requires 2 hexadecimal digits. This is an indication that the RTF file is obfuscated, to thwart analysis.
Using option -s, we can select entry 2:

Figure 33: Selecting the second entry

If you are familiar with the internals of RTF files, you would notice that the long, uninterrupted sequences of curly braces are suspicious: it’s another sign of obfuscation.
Let’s try to decode the hexadecimal data inside entry 2, by using option -H

Figure 34: Hexadecimal decoding

After some randomly looking bytes and a series of NULL bytes, we see a lot of FF bytes. This is typical of ole files. Ole files start with a specific set of bytes, known as a magic header: D0 CF 11 E0 A1 B1 1A E1.
We cannot find this sequence in the data; however, we find a sequence that looks similar: 0D 0C F1 1E 0A 1B 11 AE 10 (starting at position 0x46)
This is almost the same as the magic header, but shifted by one hexadecimal digit. This means that the RTF file is obfuscated with a method that has not been foreseen in the deobfuscation routines of rtfdump. Remember that the number of hexadecimal digits is odd: this is the result. Should rtfdump be able to properly deobfuscate this RTF file, then the number would be even.
But that is not a problem: I’ve foreseen this, and there is an option in rtfdump to shift all hexadecimal strings with one digit. This is option -S:

Figure 35: Using option -S to manually deobfuscate the file

We have different output now. Starting at position 0x47, we now see the correct magic header: D0 CF 11 E0 A1 B1 1A E1
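The effect of a one-digit shift is easy to demonstrate in Python. The sketch below is only an illustration: it realigns the data by padding one digit in front, which is an assumption on our part and not necessarily what rtfdump does internally. The names decode_hex and cut_from_magic are hypothetical.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def decode_hex(digits, shift=False):
    """Decode a string of hexadecimal digits, two digits per byte."""
    if shift:
        # Realign by one digit (assumed approach: pad a zero in front).
        digits = "0" + digits
    if len(digits) % 2:
        # A lone trailing digit cannot form a byte: ignore it.
        digits = digits[:-1]
    return bytes.fromhex(digits)


def cut_from_magic(data):
    """Return the data starting at the ole magic header, or b'' if absent."""
    start = data.find(OLE_MAGIC)
    return data[start:] if start >= 0 else b""
```

With an odd number of digits in front of the magic header, decode_hex without shift produces bytes like 0C F1 1E 0A 1B 11 AE, while decode_hex with shift=True makes the proper magic header appear, so that cut_from_magic can isolate the ole file.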
And scrolling down, we see the following:

Figure 36: ole file directory entries (UNICODE)

We see UNICODE strings RootEntry and ole10nAtiVE.
Every ole file contains a RootEntry.
And ole10native is an entry for embedded data. It should all be lower case: the mixing of uppercase and lowercase is another indicator of malicious intent.

As we have now managed to direct rtfdump to properly decode this embedded olefile, we can use option -i to help with the extraction:

Figure 37: Extraction of the olefile fails

Unfortunately, this fails: there is still some unresolved obfuscation. But that is not a problem, we can perform the extraction manually. For that, we locate the start of the ole file (position 0x47) and use option -c to “cut” it out of the decoded data, like this:

Figure 38: Hexadecimal/ascii dump of the embedded ole file

With option -d, we can perform a dump (binary data) of the ole file and write it to disk:

Figure 39: Writing the embedded ole file to disk

We use oledump to analyze the extracted ole file (ole.vir):

Figure 40: Analysis of the extracted ole file

It succeeds: it contains one stream.
Let’s select it for further analysis:

Figure 41: Content of the stream

This binary data looks random.
Let’s use option -S to extract strings (this option is like the strings command) from this binary data:

Figure 42: Extracting strings

There’s nothing recognizable here.
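For readers who wonder what string extraction amounts to, here is a minimal Python sketch that looks for runs of printable ASCII characters. It is a simplification (for example, it ignores UNICODE strings), and the function name and the default minimum length of 4 are our own choices.

```python
import re


def ascii_strings(data, min_len=4):
    """Return all runs of at least min_len printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]
```

When data is obfuscated or encrypted, such a search only yields short, meaningless fragments, as is the case for our stream.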

Let’s summarize where we are: we extracted an ole file from an RTF file that was downloaded by a .docx file embedded in a PDF file. When we say it like this, we can only think that this is malicious.

Shellcode analysis

Remember that malicious RTF files very often contain exploits? Exploits often use shellcode. Let’s see if we can find shellcode.
To achieve this, we are going to use scdbg, a shellcode emulator developed by David Zimmer.
First we are going to write the content of the stream to a file:

Figure 43: Writing the (potential) shellcode to disk

scdbg is a free, open source tool that emulates 32-bit shellcode designed to run on the Windows operating system. Started as a project running on Windows and Linux, it is now further developed for Windows only.

Figure 44: Scdbg

We download Windows binaries for scdbg:

Figure 45: Scdbg binary files

And extract executable scdbg.exe to our working directory:

Figure 46: Extracting scdbg.exe
Figure 47: Extracting scdbg.exe

Although scdbg.exe is a Windows executable, we can run it on Ubuntu via Wine:

Figure 48: Trying to use wine

Wine is not installed, but by now, we know how to install tools like this:

Figure 49: Installing wine
Figure 50: Tasting wine 😊

We can now run scdbg.exe like this:

wine scdbg.exe

scdbg requires some options: -f sc.vir to provide it with the file to analyze

Shellcode has an entry point: the address from where it starts to execute. By default, scdbg starts to emulate from address 0. Since this is an exploit (we have not yet recognized which exploit, but that does not prevent us from trying to analyze the shellcode), its entry point will not be address 0. At address 0, we should find a data structure (that we have not identified) that is exploited.
To summarize: we don’t know the entry point, but it’s important to know it.
Solution: scdbg.exe has an option to try out all possible entry points. Option -findsc.
And we add one more option to produce a report: -r.

Let’s try this:

Figure 51: Running scdbg via wine

This looks good: after a bunch of messages and warnings from Wine that we can ignore, scdbg presents us with 8 (0 through 7) possible entry points. We select the first one: 0

Figure 52: Trying entry point 0 (address 0x95)

And we are successful: scdbg.exe was able to emulate the shellcode, and show the different Windows API calls performed by the shellcode. The most important one for us analysts, is URLDownloadToFile. This tells us that the shellcode downloads a file and writes it to disk (name vbc.exe).
Notice that scdbg did emulate the shellcode: it did not actually execute the API calls, no files were downloaded or written to disk.

Although we don’t know which exploit we are dealing with, scdbg was able to find the shellcode and emulate it, providing us with an overview of the actions executed by the shellcode.
The shellcode is obfuscated: that is why we did not see strings like the URL and filename when extracting the strings (see figure 42). But by emulating the shellcode, scdbg also deobfuscates it.

We can now use curl again to try to download the file:

Figure 53: Downloading the executable

And it is indeed a Windows executable (.NET):

Figure 54: Headers
Figure 55: Running command file on the downloaded file

To determine what we are dealing with, we try to look it up on VirusTotal.
First we calculate its hash:

Figure 56: Calculating the MD5 hash
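Hashes can also be calculated with Python's hashlib module, which is convenient when several hash types are needed at once (the IOCs in this post are SHA256 hashes). This is a minimal sketch; the function name file_hashes is hypothetical.

```python
import hashlib


def file_hashes(path, chunk_size=65536):
    """Calculate the MD5 and SHA256 hashes of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}
```

The file is only read, never executed, so this is safe to do on a sample with extension .vir.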

And then we look it up through its hash on VirusTotal:

Figure 57: VirusTotal report

From this report, we conclude that the executable is Snake Keylogger.

If the file were not present on VirusTotal, we could upload it for analysis, provided we accept the fact that we can potentially alert the criminals that we have discovered their malware.

In the video for this blog post, there’s a small bonus at the end, where we identify the exploit: CVE-2017-11882.

Conclusion
This is a long blog post, not only because of the different layers of malware in this sample, but also because we provide more context and explanations than usual.
We explained how to install the different tools that we used.
We explained why we chose each tool, and why we execute each command.
There are many possible variations of this analysis, and other tools that can be used to achieve similar results. I for example, would pipe more commands together.
The important aspect to static analysis like this one, is to use dedicated tools. Don’t use a PDF reader to open the PDF, don’t use Office to open the Word document, … Because if you do, you might execute the malicious code.
We have seen malicious documents like this before, and written blog posts about them, like this one. The sample we analyzed here has more “layers” than these older maldocs, making the analysis more challenging.

In that blog post, we also explain how this kind of malicious document “works”, by also showing the JavaScript and by opening the document inside a sandbox.

IOCs

Type Value
PDF sha256: 05dc0792a89e18f5485d9127d2063b343cfd2a5d497c9b5df91dc687f9a1341d
RTF sha256: 165305d6744591b745661e93dc9feaea73ee0a8ce4dbe93fde8f76d0fc2f8c3f
EXE sha256: 20a3e59a047b8a05c7fd31b62ee57ed3510787a979a23ce1fde4996514fae803
URL hxxps://vtaurl[.]com/IHytw
URL hxxp://192[.]227[.]196[.]211/FRESH/fresh[.]exe

These files can be found on VirusTotal, MalwareBazaar and Malshare.

About the authors

Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and Microsoft MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.

You can follow NVISO Labs on Twitter to stay up to date on all our future research and publications.

Cobalt Strike: Overview – Part 7

22 March 2022 at 09:04

This is an overview of a series of 6 blog posts we dedicated to the analysis and decryption of Cobalt Strike traffic. We include videos for different analysis methods.

In part 1, we explain that Cobalt Strike traffic is encrypted using RSA and AES cryptography, and that we found private RSA keys that can help with decryption of Cobalt Strike traffic.

In part 2, we actually decrypt traffic using private keys. Notice that one of the free, open source tools that we created to decrypt Cobalt Strike traffic, cs-parse-http-traffic.py, was a beta release. It has now been replaced by tool cs-parse-traffic.py. This tool is capable of decrypting HTTP(S) and DNS traffic. For HTTP(S), it’s a drop-in replacement for cs-parse-http-traffic.py.

In part 3, we use process memory dumps to extract the decryption keys. This is for use cases where we don’t have the private keys.

In part 4, we deal with some specific obfuscation: data transforms of encrypted traffic, and sleep mode in beacons’ process memory.

In part 5, we handle Cobalt Strike DNS traffic.

And finally, in part 6, we provide some tips to make memory dumps of Cobalt Strike beacons.

The tools used in these blog posts are free and open source, and can be found here.

Here are a couple of videos that illustrate the methods discussed in this series:

YouTube playlist “Cobalt Strike: Decrypting Traffic”

Blog posts in this series:

About the authors

Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and Microsoft MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.

Cobalt Strike: Memory Dumps – Part 6

11 March 2022 at 05:59

This is an overview of different methods to create and analyze memory dumps of Cobalt Strike beacons.

This series of blog posts describes different methods to decrypt Cobalt Strike traffic. In part 1 of this series, we revealed private encryption keys found in rogue Cobalt Strike packages. In part 2, we decrypted Cobalt Strike traffic starting with a private RSA key. In part 3, we explain how to decrypt Cobalt Strike traffic if you don’t know the private RSA key but do have a process memory dump. In part 4, we deal with traffic obfuscated with malleable C2 data transforms. And in part 5, we deal with Cobalt Strike DNS traffic.

For some of the Cobalt Strike analysis methods discussed in previous blog posts, it is useful to have a memory dump: either a memory dump of the system RAM, or a process memory dump of the process hosting the Cobalt Strike beacon.

We provide an overview of different methods to make and/or use memory dumps.

Full system memory dump

Several methods exist to obtain a full system memory dump of a Windows machine. As most of these methods involve commercial software, we will not go into the details of obtaining a full memory dump.

When you have an uncompressed full system memory dump, the first thing to check is whether a Cobalt Strike beacon is present in memory. This can be done with 1768.py, a tool to extract and analyze the configuration of Cobalt Strike beacons. Make sure to use a 64-bit version of Python, as uncompressed full memory dumps are huge.

Issue the following command:

1768.py -r memorydump

Example:

Figure 1: Using 1768.py on a full system memory dump

In this example, we are lucky: not only does 1768.py detect the presence of a beacon configuration, but that configuration is also contained in a single memory page. That is why we get the full configuration. Often, the configuration will span several memory pages, and then you get a partial result, sometimes even Python errors. But the most important piece of information we get from this command is that there is a beacon running on the system of which we took a full memory dump.

Let’s assume that our command produced partial results. What we have to do then, to obtain the full configuration, is to use Volatility to produce a process memory dump of the process(es) hosting the beacon. Since we don’t know which process(es) hosts the beacon, we will create process memory dumps for all processes.

We do that with the following command:

vol.exe -f memorydump -o procdumps windows.memmap.Memmap --dump

Example:

Figure 2: using Volatility to extract process memory dumps – start of command
Figure 3: using Volatility to extract process memory dumps – end of command


procdumps is the folder where all process memory dumps will be written to.

This command takes some time to complete, depending on the size of the memory dump and the number of processes.

Once the command has completed, we use tool 1768.py again, to analyze each process dump:

Figure 4: using 1768.py to analyze all extracted process memory dumps – start of command
Figure 5: using 1768.py to analyze all extracted process memory dumps – detection for process ID 2760

We see that file pid.2760.dmp contains a beacon configuration: this means that the process with process ID 2760 hosts a beacon. We can use this process memory dump if we need to extract more information, like encryption keys for example (see blog post 3 of this series).
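
The command shown in the figures is not reproduced in the text, so here is a minimal sketch of one way to run 1768.py over every extracted dump. It assumes that the -r option used above for the full memory dump is also suitable for process dumps, that the dumps have the .dmp extension, and that 1768.py sits in the current directory; the function names are illustrative and not part of any tool.

```python
import subprocess
import sys
from pathlib import Path


def build_1768_commands(dump_dir, tool="1768.py"):
    """Return one command line per .dmp file found in dump_dir."""
    dumps = sorted(Path(dump_dir).glob("*.dmp"))
    # Assumption: -r is used for process dumps just like for the full memory dump.
    return [[sys.executable, tool, "-r", str(dump)] for dump in dumps]


def analyze_dumps(dump_dir, tool="1768.py"):
    """Run the tool on every dump and print its output, one block per file."""
    for command in build_1768_commands(dump_dir, tool):
        result = subprocess.run(command, capture_output=True, text=True)
        print(f"=== {command[-1]} ===")
        print(result.stdout)
```

Calling analyze_dumps("procdumps") would then print the output of 1768.py for each dump file, so that the dump with a beacon configuration can be spotted.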


Process memory dumps
Different methods exist to obtain process memory dumps on a Windows machine. We will explain several methods that do not require commercial software.

Task Manager
A full process memory dump can be made with the built-in Windows’ Task Manager.
Such a process memory dump contains all the process memory of the selected process.

To use this method, you have to know which process is hosting a beacon. Then select this process in Task Manager, right-click, and select “Create dump file”:

Figure 6: Task Manager: selecting the process hosting the beacon
Figure 7: creating a full process memory dump


The process memory dump will be written to a temporary folder:

Figure 8: Task Manager’s dialog after the completion of the process memory dump
Figure 9: the temporary folder containing the dump file (.DMP)

Sysinternals’ Process Explorer
Process Explorer can make process memory dumps, just like Task Manager. Select the process hosting the beacon, right-click and select “Create Dump / Create Full Dump”.

Figure 10: using Process Explorer to create a full process memory dump

Do not select “Create Minidump”, as a process memory dump created with this option does not contain process memory.

With Process Explorer, you can select the location to save the dump:

Figure 11: with Process Explorer, you can choose the location to save the dump file

Sysinternals’ ProcDump
ProcDump is a tool to create process memory dumps from the command-line. You provide it with a process name or process ID, and it creates a dump. Make sure to use option -ma to create a full process memory dump, otherwise the dump will not contain process memory.

Figure 12: using procdump to create a full process memory dump


With ProcDump, the dump is written to the current directory.

Using process memory dumps
Just like with full system memory dumps, tool 1768.py can be used to analyze process memory dumps and to extract the beacon configuration.
As explained in part 3 of this series, tool cs-extract-key.py can be used to extract the secret keys from process memory dumps.
And if the secret keys are obfuscated, tool cs-analyze-processdump.py can be used to try to defeat the obfuscation, as explained in part 4 of this series.

Conclusion
Memory dumps can be used to detect and analyze beacons.
We developed tools to extract the beacon configuration and the secret keys from memory dumps.

About the authors

Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and Microsoft MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.

Cobalt Strike: Decrypting DNS Traffic – Part 5

29 November 2021 at 11:14

Cobalt Strike beacons can communicate over DNS. We show how to decode and decrypt DNS traffic in this blog post.

This series of blog posts describes different methods to decrypt Cobalt Strike traffic. In part 1 of this series, we revealed private encryption keys found in rogue Cobalt Strike packages. In part 2, we decrypted Cobalt Strike traffic starting with a private RSA key. In part 3, we explain how to decrypt Cobalt Strike traffic if you don’t know the private RSA key but do have a process memory dump. And in part 4, we deal with traffic obfuscated with malleable C2 data transforms.

In the first 4 parts of this series, we have always looked at traffic over HTTP (or HTTPS). A beacon can also be configured to communicate over DNS, by performing DNS requests for A, AAAA and/or TXT records. Data flowing from the beacon to the team server is encoded with hexadecimal digits that make up labels of the queried name, and data flowing from the team server to the beacon is contained in the answers of A, AAAA and/or TXT records.

The data needs to be extracted from DNS queries, and then it can be decrypted (with the same cryptographic methods as for traffic over HTTP).

DNS C2 protocol

We use a challenge from the 2021 edition of the Cyber Security Rumble to illustrate what Cobalt Strike DNS traffic looks like.

First we need to take a look at the beacon configuration with tool 1768.py:

Figure 1: configuration of a DNS beacon

Field “payload type” confirms that this is a DNS beacon, and the field “server” tells us what domain is used for the DNS queries: wallet[.]thedarkestside[.]org.

And then a third block of DNS configuration parameters is highlighted in figure 1: maxdns, DNS_idle, … We will explain them when they appear in the DNS traffic we are going to analyze.

Seen in Wireshark, that DNS traffic looks like this:

Figure 2: Wireshark view of Cobalt Strike DNS traffic

We condensed this information (field Info) into this textual representation of DNS queries and replies:

Figure 3: Textual representation of Cobalt Strike DNS traffic

Let’s start with the first set of queries:

Figure 4: DNS_beacon queries and replies

At regular intervals (determined by the sleep settings), the beacon issues an A record DNS query for name 19997cf2[.]wallet[.]thedarkestside[.]org. wallet[.]thedarkestside[.]org are the root labels of every query that this beacon will issue, and this is set inside the config. 19997cf2 is the hexadecimal representation of the beacon ID (bid) of this particular beacon instance. Each running beacon generates a 32-bit number that is used to identify the beacon to the team server. It is different for each running beacon, even when the same beacon executable is started several times. All DNS requests for this particular beacon will have root labels 19997cf2[.]wallet[.]thedarkestside[.]org.

To determine the purpose of a set of DNS queries like above, we need to consult the configuration of the beacon:

Figure 5: zooming in on the DNS settings of the configuration of this beacon (Figure 1)

The following settings define the top label per type of query:

  1. DNS_beacon
  2. DNS_A
  3. DNS_AAAA
  4. DNS_TXT
  5. DNS_metadata
  6. DNS_output

Notice that the values seen in figure 5 for these settings, are the default Cobalt Strike profile settings.

For example, if DNS queries issued by this beacon have a name starting with www., then we know that these are queries to send the metadata to the team server.

In the configuration of our beacon, the value of DNS_beacon is (NULL …): that’s an empty string, and it means that no label is put in front of the root labels. Thus, with this, we know that queries with name 19997cf2[.]wallet[.]thedarkestside[.]org are DNS_beacon queries. DNS_beacon queries are what a beacon uses to inquire if the team server has tasks for the beacon in its queue. The reply to this A record DNS query is an IPv4 address, and that address instructs the beacon what to do. To understand what the instruction is, we first need to XOR this replied address with the value of setting DNS_Idle. In our beacon, that DNS_Idle value is 8.8.4.4 (the default DNS_Idle value is 0.0.0.0).
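
To make the role of the top labels concrete, here is a minimal sketch that classifies a query name by its first label. The label values are the ones named in this post for this beacon (cdn., www6., api., post., www. for metadata, and an empty DNS_beacon label); the function names and the refanging helper are illustrative assumptions, not part of any tool.

```python
# Top labels as named in this post (see figure 5); DNS_beacon is the empty string.
TOP_LABELS = {
    "cdn": "DNS_A",
    "www6": "DNS_AAAA",
    "api": "DNS_TXT",
    "www": "DNS_metadata",
    "post": "DNS_output",
}


def refang(name):
    """Turn a defanged name (with [.]) back into a plain lower-case DNS name."""
    return name.replace("[.]", ".").rstrip(".").lower()


def classify_query(name, root):
    """Return the query type for name, or None if it is not below root.

    root is the beacon ID label followed by the domain set in the config.
    """
    name, root = refang(name), refang(root)
    if name == root:
        # Empty DNS_beacon label: nothing is put in front of the root labels.
        return "DNS_beacon"
    if not name.endswith("." + root):
        return None
    first_label = name.split(".", 1)[0]
    return TOP_LABELS.get(first_label, "unknown")


ROOT = "19997cf2[.]wallet[.]thedarkestside[.]org"
print(classify_query("19997cf2[.]wallet[.]thedarkestside[.]org", ROOT))
print(classify_query("api.07311917.19997cf2[.]wallet[.]thedarkestside[.]org", ROOT))
```

A beacon with a customized profile would use different labels, so the dictionary has to be filled in from the output of 1768.py for the sample at hand.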

Looking at figure 4, we see that the replies to the first requests are 8.8.4.4. These have to be XORed with DNS_Idle value 8.8.4.4: thus the result is 0.0.0.0. A reply equal to 0.0.0.0 means that there are no tasks inside the team server queue for this beacon, and that it should sleep and check again later. So for the first 5 queries in figure 4, the beacon has to do nothing.

That changes with the 6th query: the reply is IPv4 address 8.8.4.246, and when we XOR that value with 8.8.4.4, we end up with 0.0.0.242. Value 0.0.0.242 instructs the beacon to check for tasks using TXT record queries.

Here are the possible values that determine how a beacon should interact with the team server:

Figure 6: possible DNS_Beacon replies

If the least significant bit (bit 1) is set, the beacon should do a checkin (with a DNS_metadata query).

If bits 4 to 2 are cleared, communication should be done with A records.

If bit 2 is set, communication should be done with TXT records.

And if bit 3 is set, communication should be done with AAAA records.

Value 242 is 11110010, thus no checkin has to be performed but tasks should be retrieved via TXT records.
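
The XOR step and the bit tests can be sketched as follows. This assumes that bits are numbered from 1 starting at the least significant bit, which is the reading that matches the values 240 (A records) and 242 (TXT records) given in this post. The function names, the returned dictionary and the precedence given to AAAA when both record bits are set are illustrative assumptions.

```python
import ipaddress


def xor_ipv4(address, dns_idle):
    """XOR two dotted-quad IPv4 addresses and return the result as a string."""
    a = int(ipaddress.IPv4Address(address))
    b = int(ipaddress.IPv4Address(dns_idle))
    return str(ipaddress.IPv4Address(a ^ b))


def decode_beacon_reply(address, dns_idle="0.0.0.0"):
    """Interpret the A record reply to a DNS_beacon query."""
    value = int(ipaddress.IPv4Address(xor_ipv4(address, dns_idle)))
    if value == 0:
        # 0.0.0.0 after the XOR: nothing queued, the beacon sleeps.
        return {"idle": True, "checkin": False, "records": None}
    flags = value & 0xFF
    if flags & 0b100:      # bit 3 (counting from 1): AAAA records
        records = "AAAA"
    elif flags & 0b010:    # bit 2 (counting from 1): TXT records
        records = "TXT"
    else:                  # neither bit set: A records
        records = "A"
    return {"idle": False, "checkin": bool(flags & 0b001), "records": records}


print(xor_ipv4("8.8.4.246", "8.8.4.4"))
print(decode_beacon_reply("8.8.4.246", "8.8.4.4"))
```

With the replies from figure 4, 8.8.4.4 decodes to an idle instruction and 8.8.4.246 decodes to "no checkin, retrieve tasks over TXT records".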

The next set of DNS queries are performed by the beacon because of the instructions (0.0.0.242) it received:

Figure 7: DNS_TXT queries

Notice that the names in these queries start with api., thus they are DNS_TXT queries, according to the configuration (see figure 5). And that is per the instruction of the team server (0.0.0.242).

Although DNS_TXT queries should use TXT records, the very first DNS query of a DNS_TXT query is an A record query. The reply, an IPv4 address, has to be XORed with the DNS_Idle value. So here in our example, 8.8.4.68 XORed with 8.8.4.4 gives 0.0.0.64. This specifies the length (64 bytes) of the encrypted data that will be transmitted over TXT records. Notice that for DNS_A and DNS_AAAA queries, the first query will be an A record query too. It also encodes the length of the encrypted data to be received.

Next the beacon issues as many TXT record queries as necessary. The value of each TXT record is a BASE64 string, that has to be concatenated together before decoding. The beacon stops issuing TXT record requests once the decoded data has reached the length specified in the A record reply (64 bytes in our example).

Since the beacon can issue these TXT record queries very quickly (depending on the sleep settings), a mechanism is introduced to prevent cached DNS results from interfering with the communication. This is done by making each name in the DNS queries unique with an extra hexadecimal label.

Notice that there is a hexadecimal label between the top label (api in our example) and the root labels (19997cf2[.]wallet[.]thedarkestside[.]org in our example). That hexadecimal label is 07311917 for the first DNS query and 17311917 for the second DNS query. That hexadecimal label consists of a counter and a random number: COUNTER + RANDOMNUMBER.

In our example, the random number is 7311917, and the counter always starts with 0 and increments by 1. That is how each query is made unique, and it also helps to process the replies in the correct order, in case the DNS replies arrive out of order.
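
As a small illustration, the counter can be separated from the random number when the random part is already known (7311917 in our example). The sketch assumes that the counter is written as hexadecimal digits in front of the random number; the function name is hypothetical.

```python
def counter_of(label, random_part):
    """Return the counter that precedes the known random part of a label."""
    if not label.endswith(random_part) or len(label) == len(random_part):
        raise ValueError("label does not consist of a counter plus " + random_part)
    # Assumption: the counter is written as hexadecimal digits.
    return int(label[: -len(random_part)], 16)


# Labels as they could arrive, here deliberately in the wrong order.
labels = ["17311917", "07311917"]
ordered = sorted(labels, key=lambda label: counter_of(label, "7311917"))
print(ordered)
```

Sorting on the counter like this puts replies that arrive out of order back in sequence before their data is concatenated.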

Thus, when all the DNS TXT replies have been received (there is only one in our example), the BASE64 string (ZUZBozZmBi10KvISBcqS0nxp32b7h6WxUBw4n70cOLP13eN7PgcnUVOWdO+tDCbeElzdrp0b0N5DIEhB7eQ9Yg== in our example) is decoded and decrypted (we will do this with a tool at the end of this blog post).
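
The decoding step (not the decryption) can be sketched like this, using the values from our example. The sketch assumes that the XORed address, read as a 32-bit integer, is the announced length; the function names are illustrative.

```python
import base64
import ipaddress


def announced_length(a_reply, dns_idle):
    """Length of the encrypted data, taken from the first A record reply."""
    return int(ipaddress.IPv4Address(a_reply)) ^ int(ipaddress.IPv4Address(dns_idle))


def decode_txt_replies(txt_values, length):
    """Concatenate the BASE64 TXT values, decode them and check the length."""
    data = base64.b64decode("".join(txt_values))
    if len(data) < length:
        raise ValueError("fewer bytes received than announced")
    return data[:length]


length = announced_length("8.8.4.68", "8.8.4.4")
txt_values = [
    "ZUZBozZmBi10KvISBcqS0nxp32b7h6WxUBw4n70cOLP13eN7PgcnUVOWdO+tDCbeElzdrp0b0N5DIEhB7eQ9Yg==",
]
encrypted = decode_txt_replies(txt_values, length)
print(length, len(encrypted))
```

The resulting bytes are still encrypted: they are the input for the decryption described at the end of this blog post.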

This is how DNS beacons receive their instructions (tasks) from the team server. The encrypted bytes are transmitted via DNS A, DNS AAAA or DNS TXT record replies.

When the communication has to be done over DNS A records (0.0.0.240 reply), the traffic looks like this:

Figure 8: DNS_A queries

cdn. is the top label for DNS_A requests (see config figure 5).

The first reply is 8.8.4.116. XORed with 8.8.4.4, this gives 0.0.0.112. Thus 112 bytes of encrypted data have to be received: that’s 112 / 4 = 28 DNS A record replies.

The encrypted data is just taken from the IPv4 addresses in the DNS A record replies. In our example, that’s: 19, 64, 240, 89, 241, 225, …

And for DNS_AAAA queries, the method is exactly the same, except that the top label is www6. in our example (see config figure 5) and that each IPv6 address contains 16 bytes of encrypted data.
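
A minimal sketch of this extraction for both record types follows. It takes the bytes of the replied addresses as they are, as described above; the function names are illustrative, and only the first four bytes of our example (19, 64, 240, 89) are used as sample data.

```python
import ipaddress


def replies_needed(length, record_type):
    """Number of A (4 bytes each) or AAAA (16 bytes each) replies for length bytes."""
    per_reply = 4 if record_type == "A" else 16
    return -(-length // per_reply)  # ceiling division


def data_from_address_replies(addresses, length):
    """Concatenate the raw bytes of the replied addresses, in query order."""
    data = b"".join(ipaddress.ip_address(address).packed for address in addresses)
    if len(data) < length:
        raise ValueError("fewer bytes received than announced")
    return data[:length]


print(replies_needed(112, "A"))
print(list(data_from_address_replies(["19.64.240.89"], 4)))
```

Because ipaddress.ip_address accepts both IPv4 and IPv6 strings, the same function covers DNS_A and DNS_AAAA replies.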

The encrypted data transmitted via DNS records from the team server to the beacon (e.g., the tasks) has exactly the same format as the encrypted tasks transmitted with http or https. Thus the decryption process is exactly the same.

When the beacon has to transmit its results (output of the tasks) to the team server, it uses DNS_output queries. In our example, these queries start with top label post. Here is an example:

Figure 9: beacon sending results to the team server with DNS_output queries

Each name in a DNS_output query has a unique hexadecimal counter label, just like DNS_A, DNS_AAAA and DNS_TXT queries. The data to be transmitted is encoded with hexadecimal digits in labels that are added to the name.

Let’s take the first DNS query (figure 9): post.140.09842910.19997cf2[.]wallet[.]thedarkestside[.]org.

This name breaks down into the following labels:

  • post: DNS_output query
  • 140: transmitted data
  • 09842910: counter + random number
  • 19997cf2: beacon ID
  • wallet[.]thedarkestside[.]org: domain chosen by the operator

The transmitted data of the first query is actually the length of the encrypted data to be transmitted. It has to be decoded as follows: 140 -> 1 40.

The first hexadecimal digit (1 in our example) is a counter that specifies the number of labels that are used to contain the hexadecimal data. Since a DNS label is limited to 63 characters, more than one label needs to be used when 32 bytes or more need to be encoded. That explains the use of a counter. 40 is the hexadecimal data, thus the encrypted data is 64 bytes long (hexadecimal 40 is decimal 64).

The second DNS query (figure 9) is: post.2942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b3.4adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42.19842910.19997cf2[.]wallet[.]thedarkestside[.]org.

The name in this query contains the encrypted data (partially) encoded with hexadecimal digits inside labels.

These are the transmitted data labels: 2942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b3.4adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42

The first digit, 2, indicates that 2 labels were used to encode the encrypted data: 942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b3 and 4adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42.

The third DNS query (figure 9) is: post.1debfa06ab4786477.29842910.19997cf2[.]wallet[.]thedarkestside[.]org.

The counter for the labels is 1, and the transmitted data is debfa06ab4786477.

Putting all these labels together in the right order, gives the following hexadecimal data:

942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b34adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42debfa06ab4786477. That’s 128 hexadecimal digits long, or 64 bytes, exactly as specified by the length (40 hexadecimal) in the first query.
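
The three queries above can be reassembled with a short sketch like the following. It assumes that the queries are processed in the order of their counters, as in figure 9, and that the top label is post; the function names are illustrative and not taken from any tool.

```python
def parse_output_query(name, top_label="post"):
    """Split a DNS_output query name into (counter label, hexadecimal data)."""
    labels = name.replace("[.]", ".").rstrip(".").split(".")
    if labels[0] != top_label:
        raise ValueError("not a DNS_output query")
    # The first hexadecimal digit is the number of labels that carry data.
    label_count = int(labels[1][0], 16)
    data_labels = labels[1:1 + label_count]
    hex_data = data_labels[0][1:] + "".join(data_labels[1:])
    return labels[1 + label_count], hex_data


def reassemble(queries):
    """Return the encrypted bytes carried by a list of DNS_output queries."""
    parsed = [parse_output_query(query) for query in queries]
    length = int(parsed[0][1], 16)  # the first query carries the length
    data = bytes.fromhex("".join(hex_data for _, hex_data in parsed[1:]))
    if len(data) != length:
        raise ValueError("received data does not match the announced length")
    return data


QUERIES = [
    "post.140.09842910.19997cf2[.]wallet[.]thedarkestside[.]org",
    "post.2942880f933a45cf2d048b0c14917493df0cd10a0de26ea103d0eb1b3"
    ".4adf28c63a97deb5cbe4e20b26902d1ef427957323967835f7d18a42"
    ".19842910.19997cf2[.]wallet[.]thedarkestside[.]org",
    "post.1debfa06ab4786477.29842910.19997cf2[.]wallet[.]thedarkestside[.]org",
]
data = reassemble(QUERIES)
print(len(data), data.hex())
```

The length check at the end of reassemble is a cheap way to notice missing or duplicated queries in a capture.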

The hexadecimal data above is the encrypted data transmitted via DNS records from the beacon to the team server (e.g., the task results or output) and it has almost the same format as the encrypted output transmitted with http or https. The difference is the following: with http or https traffic, the format starts with an unencrypted size field (size of the encrypted data). That size field is not present in the format of the DNS_output data.

Decryption

We have developed a tool, cs-parse-traffic.py, that can decrypt and parse both DNS and HTTP(S) traffic. Similar to what we did with encrypted HTTP traffic, we will decode encrypted data from DNS queries, use it to find cryptographic keys inside the beacon’s process memory, and then decrypt the DNS traffic.

First we run the tool with an unknown key (-k unknown) to extract the encrypted data from the DNS queries and replies in the capture file:

Figure 10: extracting encrypted data from DNS queries

Option -f dns is required to process DNS traffic, and option -i 8.8.4.4 is used to provide the DNS_Idle value. This value is needed to properly decode DNS replies (it is not needed for DNS queries).

The encrypted data (red rectangle) can then be used to find the AES and HMAC keys inside the process memory dump of the running beacon:

Figure 11: extracting cryptographic keys from process memory

That key can then be used to decrypt the DNS traffic:

Figure 12: decrypting DNS traffic

This traffic was used in a CTF challenge of the Cyber Security Rumble 2021. To find the flag, grep for CSR in the decrypted traffic:

Figure 13: finding the flag inside the decrypted traffic

Conclusion

The major difference between DNS Cobalt Strike traffic and HTTP Cobalt Strike traffic is how the encrypted data is encoded. Once encrypted data is recovered, decrypting it is very similar for DNS and HTTP.

About the authors

Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and Microsoft MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.
