Flaw in PuTTY P-521 ECDSA signature generation leaks SSH private keys

16 April 2024 at 14:08

This article provides a technical analysis of CVE-2024-31497, a vulnerability in PuTTY discovered by Fabian Bäumer and Marcus Brinkmann of the Ruhr University Bochum.

PuTTY, a popular Windows SSH client, contains a flaw in its P-521 ECDSA implementation. This vulnerability is known to affect versions 0.68 through 0.80, which span the last 7 years. This potentially affects anyone who has used a P-521 ECDSA SSH key with an affected version, regardless of whether the ECDSA key was generated by PuTTY or another application. Other applications that utilise PuTTY for SSH or other purposes, such as FileZilla, are also affected.

An attacker who compromises an SSH server may be able to leverage this vulnerability to compromise the user’s private key. Attackers may also be able to compromise the SSH private keys of anyone who used git+ssh with commit signing and a P-521 SSH key, simply by collecting public commit signatures.

Background

Elliptic Curve Digital Signature Algorithm (ECDSA) is a cryptographic signing algorithm. It fulfils a similar role to RSA for message signing – an ECDSA public and private key pair are generated, and signatures generated with the private key can be validated using the public key. ECDSA can operate over a number of different elliptic curves, with common examples being P-256, P-384, and P-521. The numbers represent the size of the prime field in bits, with the security level (i.e. the comparable key size for a symmetric cipher) being roughly half of that number, e.g. P-256 offers roughly a 128-bit security level. This is a significant improvement over RSA, where the key size grows nonlinearly and a 3072-bit key is needed to achieve a 128-bit security level, making it much more expensive to compute. As such, RSA is largely being phased out in favour of EC signature algorithms such as ECDSA and EdDSA/Ed25519.

In the SSH protocol, ECDSA may be used to authenticate users. The server stores the user’s ECDSA public key in its authorised keys file (e.g. ~/.ssh/authorized_keys on OpenSSH servers), and the client signs a message with the user’s private key in order to prove the user’s identity to that server. In a well-implemented system, a malicious server cannot use this signed message to compromise the user’s credentials.

Vulnerability Details

ECDSA signatures are (normally) non-deterministic and rely on a secure random number, referred to as a nonce (“number used once”) or the variable k in the mathematical description of ECDSA, which must be generated for each new signature. The same nonce must never be used twice with the same ECDSA key for different messages, and every single bit of the nonce must be completely unpredictable. An unfortunate property of ECDSA is that the private key can be compromised if a nonce is reused with the same key and a different message, or if the nonce generation is predictable.
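To make the nonce-reuse risk concrete, here is a minimal sketch of the algebra using deliberately tiny numbers. It assumes the standard ECDSA signing equation s = k⁻¹(h + r·x) mod q; the modulus q = 101, the helper names, and the idea of treating r as an opaque value (rather than deriving it from a curve point) are all simplifications made for illustration, not anything taken from PuTTY.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for the prime group order. Real ECDSA uses a 256- to 521-bit q
// and derives r from the curve point k*G; the algebra below never looks inside
// r, so a fixed made-up value is enough to show the key recovery.
constexpr std::uint64_t q = 101;

// (a - b) mod q for inputs already reduced mod q.
std::uint64_t sub_mod(std::uint64_t a, std::uint64_t b) { return (a + q - b) % q; }

// Square-and-multiply; every intermediate stays below q*q, so 64 bits suffice.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp) {
    std::uint64_t result = 1;
    base %= q;
    while (exp > 0) {
        if (exp & 1) result = result * base % q;
        base = base * base % q;
        exp >>= 1;
    }
    return result;
}

// Inverse via Fermat's little theorem (valid because q is prime and a != 0).
std::uint64_t inv_mod(std::uint64_t a) {
    assert(a % q != 0);
    return pow_mod(a, q - 2);
}

// Signing equation: s = k^-1 * (h + r*x) mod q.
std::uint64_t sign_s(std::uint64_t h, std::uint64_t r, std::uint64_t x, std::uint64_t k) {
    return inv_mod(k) * ((h + r * x) % q) % q;
}

// Two signatures (r, s1) and (r, s2) over different hashes h1 and h2 that share
// a nonce give away the nonce first and then the private key.
std::uint64_t recover_private_key(std::uint64_t h1, std::uint64_t s1,
                                  std::uint64_t h2, std::uint64_t s2,
                                  std::uint64_t r) {
    std::uint64_t k = sub_mod(h1, h2) * inv_mod(sub_mod(s1, s2)) % q;  // k = (h1-h2)/(s1-s2)
    return sub_mod(s1 * k % q, h1) * inv_mod(r) % q;                   // x = (s1*k - h1)/r
}
```

Signing two different hashes with the same k and passing both signatures to recover_private_key returns the original x. The attack on biased (rather than repeated) nonces is considerably more involved, but it exploits the same linear relationship between k and x.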

Ordinarily the nonce is generated with a cryptographically secure pseudorandom number generator (CSPRNG). However, PuTTY’s implementation of DSA dates back to September 2001, around a month before Windows XP was released. Windows 95 and 98 did not provide a CSPRNG and there was no reliable way to generate cryptographically secure numbers on those operating systems. The PuTTY developers did not trust any of the available options, recognising that a weak CSPRNG would not be sufficient due to DSA’s strong reliance on the security of the random number generator. In response they chose to implement an alternative nonce generation scheme. Instead of generating a random number, their scheme utilised SHA512 to generate a 512-bit number based on the private key and the message.

The code comes with the following comment:

* [...] we must be pretty careful about how we
* generate our k. Since this code runs on Windows, with no
* particularly good system entropy sources, we can't trust our
* RNG itself to produce properly unpredictable data. Hence, we
* use a totally different scheme instead.
*
* What we do is to take a SHA-512 (_big_) hash of the private
* key x, and then feed this into another SHA-512 hash that
* also includes the message hash being signed. That is:
*
*   proto_k = SHA512 ( SHA512(x) || SHA160(message) )
*
* This number is 512 bits long, so reducing it mod q won't be
* noticeably non-uniform. So
*
*   k = proto_k mod q
*
* This has the interesting property that it's _deterministic_:
* signing the same hash twice with the same key yields the
* same signature.
*
* Despite this determinism, it's still not predictable to an
* attacker, because in order to repeat the SHA-512
* construction that created it, the attacker would have to
* know the private key value x - and by assumption he doesn't,
* because if he knew that he wouldn't be attacking k!

This is a clever trick in principle: since the attacker doesn’t know the private key, it isn’t possible to predict the output of SHA512 even if the message is known ahead of time, and thus the generated number is unpredictable. Since SHA512 is a cryptographically secure hash, it is computationally infeasible to guess any bit of its output until you compute the hash. When the PuTTY developers implemented ECDSA, they re-used this DSA implementation, resulting in a somewhat odd deterministic implementation of ECDSA where signing the same message twice results in the same nonce and signature. This is certainly unusual, but it does not count as nonce reuse in a compromising sense – you’re essentially just redoing the same maths and getting the same result.

Unfortunately, when the PuTTY developers repurposed this DSA implementation for ECDSA in 2017, they made an oversight. Prior usage for DSA did not utilise keys larger than 512 bits, but P-521 in ECDSA needs 521 bits. Recall that ECDSA is only secure when every single bit of the nonce is unpredictable. In PuTTY’s implementation, though, only 512 bits of the nonce are generated using SHA512, leaving the most significant 9 bits as zero. This results in a nonce bias that can be exploited to compromise the private key. If an attacker has access to the public key and around 60 different signatures they can recover the private key. A detailed description of this key recovery attack can be found in this cryptopals writeup.
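The shape of the bug can be shown with a scaled-down sketch. The numbers here are assumptions chosen purely for illustration: a 25-bit value stands in for the 521-bit P-521 group order, and a 16-bit value stands in for the 512-bit SHA-512 output, preserving the 9-bit gap. No real hashing or curve arithmetic is involved.

```cpp
#include <cassert>
#include <cstdint>

// Made-up stand-ins: only the 9-bit gap between the two widths matters.
constexpr std::uint32_t kToyOrder = 0x1FFFFD9;  // fits in 25 bits, top bit set
constexpr unsigned kHashBits = 16;              // width of the toy "hash"

// Mirrors "k = proto_k mod q". Because proto_k < 2^16 < q, the reduction is a
// no-op and the nonce is simply the hash value.
std::uint32_t toy_nonce(std::uint16_t proto_k) {
    return static_cast<std::uint32_t>(proto_k) % kToyOrder;
}

// Extracts the 9 bits of a 25-bit nonce that the hash output can never reach.
std::uint32_t top_bits(std::uint32_t nonce) {
    return nonce >> kHashBits;
}

// Walks every possible hash value and confirms the top bits are always zero.
void check_all_nonces() {
    for (std::uint32_t p = 0; p <= 0xFFFF; ++p) {
        assert(top_bits(toy_nonce(static_cast<std::uint16_t>(p))) == 0);
    }
}
```

However random the lower bits are, an attacker knows the top 9 bits of every nonce in advance, and that partial knowledge is what the key recovery attack builds on.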

Had the PuTTY developers extended their solution to fill all 521 bits of the nonce, e.g. with one additional hash function call to fill the remaining 9 bits, their deterministic nonce generation scheme would have remained secure. Given the constraint of not having access to a CSPRNG, it is actually a clever solution to the problem. RFC6979 was later released as a standard method for implementing deterministic ECDSA signatures, but this was not implemented by PuTTY as their implementation predated that RFC.

Windows XP, released a few months after PuTTY wrote their DSA implementation, introduced a CSPRNG API, CryptGenRandom, which can be used for standard non-deterministic implementations of ECDSA. While one could postulate that the PuTTY developers might have used this API had they written their DSA implementation just a few months later, the developers have made several statements about their distrust in Windows’ random number generator APIs of that era and their preference for deterministic implementations. This distrust may have been well-founded at the time, but such concerns are certainly unfounded on modern versions of Windows.

Impact

This vulnerability exists specifically in the P-521 ECDSA signature generation code in PuTTY, so it only affects P-521 and not other curves such as P-256 and P-384. Because the flaw is in signature generation rather than key generation, any P-521 key that was used with a vulnerable version of PuTTY may be compromised, regardless of whether that key was generated by PuTTY or something else. Other implementations of P-521 in SSH or other protocols are not affected; this vulnerability is specific to PuTTY.

An attacker cannot leverage this vulnerability by passively sniffing SSH traffic on the network. The SSH protocol first creates a secure tunnel to the server, in a similar manner to connecting to a HTTPS server, authenticating the server by checking the server key fingerprint against the cached fingerprint. The server then prompts the client for authentication, which is sent through this secure tunnel. As such, the ECDSA signatures are encrypted before transmission in this context, so an attacker cannot get access to the signatures needed for this attack through passive network sniffing.

However, an attacker who performs an active man-in-the-middle attack (e.g. via DNS spoofing) to redirect the user to a malicious SSH server would be able to capture signatures in order to exploit this vulnerability if the user ignores the SSH key fingerprint change warning. Alternatively, an attacker who compromised an SSH server could also use it to capture signatures to exploit this vulnerability, then recover the user’s private key in order to compromise other systems. This also applies to other applications (e.g. FileZilla, WinSCP, TortoiseGit, TortoiseSVN) which leverage PuTTY for SSH functionality.

A more concerning issue is the use of PuTTY for git+ssh, which is a way of interacting with a git repository over SSH. PuTTY is commonly used as an SSH client by development tools that support git+ssh. Users can digitally sign git commits with their SSH key, and these signatures are published alongside the commit as a way of authenticating that the commit was made by that user. These commit logs are publicly available on the internet, alongside the user’s public key, so an attacker could search for git repositories with P-521 ECDSA commit signatures. If those signatures were generated by a vulnerable version of PuTTY, the user’s private key could be compromised and used to compromise the server or make fraudulent signed commits under that user’s identity.

Fortunately, users who use P-521 ECDSA SSH keys, git+ssh via PuTTY, and commit signing represent a very small fraction of the population. However, given how widely PuTTY is used, there are bound to be a few out there who end up being vulnerable to this attack. In addition, informal observations suggest that users may be more likely to select P-521 when offered a choice of P-256, P-384, or P-521, likely due to the perception that the larger key size offers more security. Somewhat ironically, P-521 ended up being the only curve implementation in PuTTY that was insecure.

Remediation

The PuTTY developers have resolved this issue by reimplementing the deterministic nonce generation using the approach described in the RFC6979 standard.

Any P-521 keys that have ever been used with any of the following software should be treated as compromised:

  • PuTTY 0.68 – 0.80
  • FileZilla 3.24.1 – 3.66.5
  • WinSCP 5.9.5 – 6.3.2
  • TortoiseGit 2.4.0.2 – 2.15.0
  • TortoiseSVN 1.10.0 – 1.14.6

Users should update their software to the latest version.

If a P-521 key has ever been used for git commit signing with development tools on Windows, it is advisable to assume that the key may be compromised and change it immediately.

The post Flaw in PuTTY P-521 ECDSA signature generation leaks SSH private keys appeared first on LRQA Nettitude Labs.

Preventing Type Confusion with CastGuard

18 October 2023 at 08:00

Built into the Microsoft C++ compiler and runtime, CastGuard is a pivotal security enhancement designed to significantly reduce the number of exploitable Type Confusion vulnerabilities in applications. Joe Bialek gave a talk about CastGuard at BHUSA2022 (slides) that explains the overall goals of the feature, how it was developed, and how it works at a high level. This article offers a journey into my discovery of CastGuard – delving into a technical evaluation of its mechanics, exploring illustrative examples, and highlighting relevant compiler flags.

While looking into new control flow guard feature support in the Windows PE load config directory a while back, I stumbled across a newly added field called CastGuardOsDeterminedFailureMode, added in Windows 21H2. I had never heard of CastGuard before so, naturally, I wondered what it did.

To give a brief overview, CastGuard is intended to solve Type Confusion problems such as the following:

#include <iostream>

struct Organism {
    virtual void Speak() { std::cout << "..."; }
};

struct Animal : public Organism {
    virtual void Speak() { std::cout << "Uh... hi?"; }
};

struct Dog : public Animal {
    virtual void Speak() { std::cout << "Woof!"; }
};

struct Cat : public Animal {
    virtual void Speak() { std::cout << "Meow!"; }
};

void SayMeow(Animal* animal) {
    // Unchecked downcast: the compiler trusts that animal really is a Cat.
    static_cast<Cat*>(animal)->Speak();
}

int main() {
    Animal* dog = new Dog();
    SayMeow(dog);  // the object is actually a Dog, not a Cat
}

In this application, SayMeow will print “Woof!”, in a classic example of type confusion through an illegal downcast. The compiler is unable to infer that the Dog type being passed to SayMeow is a problem, because the function takes an Animal type, so no contract is broken there. The cast within SayMeow is also valid from the compiler’s perspective, because a Cat is an Animal, so it is entirely valid to downcast if you, the developer who wrote the code, know that the object being passed is in fact a Cat or a descendent type thereof. This is why this bug class is so pernicious – it’s easy to violate the type contract, especially in complex codebases.

Ordinarily this can be solved with dynamic_cast and RTTI, which tags each object with type information, but this has its own problems (see the talk linked above for full details) and it’s non-trivial to replace static_cast with dynamic_cast across a large codebase, especially in the case where your code has to coexist with 3rd party / user code (e.g. in the case of runtime libraries) where you can’t even enforce that RTTI is enabled. Furthermore, RTTI causes significant codegen bloat and performance penalties – a static cast is free (you’re interpreting the memory natively as if it were the type being cast to) whereas a dynamic cast with RTTI requires a handful of stores, loads, jumps, and calls on every cast.

CastGuard acts as an additional layer of protection against type confusion, or, more specifically, against cases where type confusion is the first-order memory vulnerability; it is not designed to protect against cases where an additional memory corruption issue is leveraged first. Its goal is to offer this protection with minimal codegen bloat and performance overhead, without modifying the (near-universally relied upon) ABI for C++ objects.

CastGuard leverages the fact that vftables (aka vtables) uniquely identify types. As long as the types on the left- and right-hand side of the cast have at least one vftable, and both types were declared within the binary being compiled, the object types can be consistently and uniquely determined by their vftable address (with one caveat: comdat folding for identical vftables must be disabled in the linker). This allows the vftable pointer to be used as a unique type identifier on each object, avoiding the need for RTTI bloat and expensive runtime checks. Since an object’s vftable pointer is almost certainly being accessed around the same time as any cast involving that object, the memory being accessed is probably already in cache (or is otherwise about to benefit from being cached) so the performance impact of accessing that data is negligible.

Initially, Microsoft explored the idea of creating bitmaps that describe which types’ vftables are compatible with each other, so that each type that was observed to be down-cast to had a bitvector that described which of the other vftables were valid for casting. However, this turns out to be inefficient in a bunch of ways, and they came up with a much more elegant solution.

The type vftables are enumerated during link time code generation (LTCG). A type inheritance hierarchy is produced, and that hierarchy is flattened into a top-down depth-first list of vftables. These are stored contiguously in memory.

To use the above code as an example, if we assume that each vftable is 8 bytes in size, the CastGuard section would end up looking like this:

Offset  Name
0x00    __CastGuardVftableStart
0x08    Organism::$vftable@
0x10    Animal::$vftable@
0x18    Dog::$vftable@
0x20    Cat::$vftable@
0x28    __CastGuardVftableEnd

Notice that parent types are always before child types in the table. Siblings can be in any order, but a sibling’s descendants would come immediately after it. For example, if we added a WolfHound class that inherited from Dog, its vftable would appear between Dog::$vftable@ and Cat::$vftable@ in the above table.
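The flattening step can be sketched as a pre-order depth-first walk. This is only a model of the layout described above, not the linker's actual implementation: the Hierarchy map, the Slot struct, and the flatten function are hypothetical names, and a fixed slot size is assumed for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a type hierarchy: each type maps to its direct children.
using Hierarchy = std::map<std::string, std::vector<std::string>>;

struct Slot {
    std::string type;        // type whose vftable occupies this slot
    std::size_t offset;      // byte offset of the vftable within the section
    std::size_t validRange;  // bytes spanned by the vftables of all descendants
};

// Emits `type` followed by all of its descendants (parents before children,
// each subtree contiguous). Returns the number of slots emitted for the subtree.
std::size_t flatten(const Hierarchy& h, const std::string& type,
                    std::size_t slotSize, std::size_t base, std::vector<Slot>& out) {
    assert(slotSize > 0);
    const std::size_t index = out.size();
    out.push_back(Slot{type, base + index * slotSize, 0});
    std::size_t count = 1;
    const auto it = h.find(type);
    if (it != h.end()) {
        for (const std::string& child : it->second) {
            count += flatten(h, child, slotSize, base, out);
        }
    }
    // Everything emitted after this slot within the subtree is a descendant.
    out[index].validRange = (count - 1) * slotSize;
    return count;
}
```

Feeding in the example hierarchy plus WolfHound, with 8-byte slots starting at offset 0x08, yields the order Organism, Animal, Dog, WolfHound, Cat, with WolfHound landing between Dog and Cat as described.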

At any given static_cast<T> site the compiler knows how many other types inherit from T. Given that child types appear sequentially after the parent type in the CastGuard section, the compiler knows that there are a certain number of child type vftables appearing immediately afterward.

For example, Animal has two child types – Cat and Dog – and both of these types are allowed to be cast to Animal. So, if you do static_cast<Animal*>(foo), CastGuard checks to see if foo’s vftable pointer lands within two vftable slots downward of Animal::$vftable@, which in this case would be any offset between 0x10 and 0x20 inclusively, i.e. the vftables of Animal, Dog, and Cat. These are all valid. If you try to cast an Organism object to the Animal type, CastGuard’s check detects this as being invalid because the Organism object vftable pointer is to offset 0x08, which is outside the valid range.

Looking back again at the example code, the cast being done is static_cast<Cat*> on a Dog object. The Cat type has no descendants, so the range size of valid vftables is zero. The Cat type’s vftable, Cat::$vftable@, is at offset 0x20, whereas the Dog object vftable pointer points to offset 0x18, so it therefore fails the CastGuard range check. Casting a Cat object to the Cat type works, on the other hand, because a Cat object’s vftable pointer points to 0x20, which is within a 0 byte range of Cat::$vftable@.

This check is optimised even further by computing the valid range size at compile time, instead of storing the count of descendent types and multiplying that by the CastGuard vftable alignment size on every check. At each static cast site, the compiler simply subtracts the left-hand side type’s vftable address from the right-hand side object’s vftable pointer, and checks to see if it is less than or equal to the valid range. This not only reduces the computational complexity of each check, but it also means that the alignment of vftables within the CastGuard section can be arbitrarily decided by the linker on a per-build basis, based on the maximum vftable size being stored, without needing to include any additional metadata or codegen. In fact, the vftables don’t even need to be padded to all have the same alignment, as long as the compiler computes the valid range based on the sum of the vftable sizes of the child types.
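As a rough illustration of that subtract-and-compare check, here is a sketch using the offsets from the example table. The function and constant names are mine; the real check is emitted inline by the compiler and is not available as source.

```cpp
#include <cassert>
#include <cstdint>

// Offsets taken from the example table above (8-byte vftables).
constexpr std::uintptr_t kOrganism = 0x08;
constexpr std::uintptr_t kAnimal = 0x10;
constexpr std::uintptr_t kDog = 0x18;
constexpr std::uintptr_t kCat = 0x20;

// One subtraction and one unsigned comparison. If the object's vftable sits
// before the target type's vftable, the subtraction wraps around to a huge
// value and the comparison fails, so no separate lower-bound test is needed.
bool castguard_range_check(std::uintptr_t targetVftable,
                           std::uintptr_t objectVftable,
                           std::uintptr_t validRange) {
    return objectVftable - targetVftable <= validRange;
}

// The cases walked through in the text.
void range_check_examples() {
    assert(castguard_range_check(kAnimal, kDog, 0x10));        // Dog -> Animal*: ok
    assert(castguard_range_check(kAnimal, kCat, 0x10));        // Cat -> Animal*: ok
    assert(!castguard_range_check(kAnimal, kOrganism, 0x10));  // Organism -> Animal*: fails
    assert(!castguard_range_check(kCat, kDog, 0x00));          // Dog -> Cat*: fails
    assert(castguard_range_check(kCat, kCat, 0x00));           // Cat -> Cat*: ok
}
```

Because validRange is a compile-time constant at each cast site, the whole check reduces to a load, a subtract, and a compare against an immediate.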

I mentioned earlier that CastGuard only protects casts for types within the same binary. The CastGuard range check described above will always fail if a type from another binary is cast to a type from the current binary, because the vftable pointers will be out of range. This is obviously unacceptable – it’d break almost every program that uses types from a DLL – so CastGuard includes an extra compatibility check. This is where the __CastGuardVftableStart and __CastGuardVftableEnd symbols come in. If the vftable for an object being cast lands outside of the CastGuard section range, the check fails open and allows the cast because it is outside the scope of protection offered by the CastGuard feature.
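Putting the range check and the fail-open behaviour together gives something like the following sketch. The CastGuardRegion struct, the CastVerdict enum, and check_cast are hypothetical names used for illustration, and the sketch assumes the section-bounds test is only consulted once the range check has already failed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the CastGuard vftable region in the current binary.
struct CastGuardRegion {
    std::uintptr_t firstVftable;  // address of the first vftable in the region
    std::uintptr_t end;           // address of the end marker
};

enum class CastVerdict { Allowed, Blocked };

CastVerdict check_cast(const CastGuardRegion& region,
                       std::uintptr_t targetVftable,
                       std::uintptr_t validRange,
                       std::uintptr_t objectVftable) {
    assert(region.end >= region.firstVftable);
    // Fast path: the object's vftable is the target type's or a descendant's.
    if (objectVftable - targetVftable <= validRange) {
        return CastVerdict::Allowed;
    }
    // Slow path: only treat the failure as real if the vftable belongs to this
    // binary's CastGuard region. Anything else is out of scope and fails open.
    const bool inRegion =
        objectVftable - region.firstVftable <= region.end - region.firstVftable;
    return inRegion ? CastVerdict::Blocked : CastVerdict::Allowed;
}
```

With the example layout (first vftable at 0x08, end marker at 0x28), casting a Dog at 0x18 to Cat is blocked, while an object whose vftable lives at some unrelated address, such as one from another DLL, is allowed through.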

This approach is much faster than dynamic casting with RTTI and adds very little extra bloat in the compiled binary (caveat: see the talk for details on where they had to optimise this a bit further for things like CRTP). As such, CastGuard is suitable to be enabled everywhere, including in performance-critical paths where dynamic casting would be far too expensive.

Pretty cool, right? I thought so too.

Let’s now go back to the original reason for me discovering CastGuard in the first place: the CastGuardOsDeterminedFailureMode field that was added to the PE load config structure in 21H2. It’s pretty clear that this field has something to do with CastGuard (the name rather gives it away) but it isn’t clear what the field actually does.

My first approach to figure this out was to enumerate every single PE file on my computer (and a Windows 11 Pro VM), parse it, and look for nonzero values in the CastGuardOsDeterminedFailureMode field. I found a bunch! This field is documented as containing a virtual address (VA). I wrote some code to parse out the CastGuardOsDeterminedFailureMode field from the load config, attempt to resolve the VA to an offset, then read the data at that offset.

I found three overall classes of PE file through this scan method:

  • PE files where the CastGuardOsDeterminedFailureMode field is zero.
  • PE files where the CastGuardOsDeterminedFailureMode field contains a valid VA which points to eight zero bytes in the .rdata section.
  • PE files where the CastGuardOsDeterminedFailureMode field contains what looks like a valid VA, but is in fact an invalid VA.

The third type of result is a bit confusing. The VA looks valid at first glance – it starts with the same few nibbles as other valid VAs – but it doesn’t point within any of the sections. At first I thought my VA translation code was broken, but I confirmed that the VAs were indeed invalid when translated by other tools such as CFF Explorer and PE-Bear. We’ll come back to this later.
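For reference, the VA translation I am describing boils down to something like the sketch below. It is not the code I used for the scan: the Section struct is a minimal stand-in for the relevant PE section header fields, and the sample values in the example function are invented.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal stand-in for the PE section header fields that matter here.
struct Section {
    std::uint32_t virtualAddress;    // RVA of the section when mapped
    std::uint32_t virtualSize;       // size of the section in memory
    std::uint32_t pointerToRawData;  // file offset of the section's data
    std::uint32_t sizeOfRawData;     // size of the section's data on disk
};

// Translates a VA to a file offset. Returns false if the VA does not fall
// inside the file-backed part of any section, i.e. the VA is "invalid".
bool va_to_file_offset(std::uint64_t va, std::uint64_t imageBase,
                       const std::vector<Section>& sections,
                       std::uint64_t& fileOffset) {
    if (va < imageBase) return false;
    const std::uint64_t rva = va - imageBase;
    for (const Section& s : sections) {
        if (rva < s.virtualAddress) continue;
        const std::uint64_t delta = rva - s.virtualAddress;
        if (delta < s.virtualSize && delta < s.sizeOfRawData) {
            fileOffset = s.pointerToRawData + delta;
            return true;
        }
    }
    return false;
}

// Invented sample layout: two sections in an image based at 0x140000000.
void va_translation_example() {
    const std::vector<Section> sections = {
        {0x1000, 0x2000, 0x400, 0x2000},
        {0x3000, 0x800, 0x2400, 0x800},
    };
    std::uint64_t offset = 0;
    assert(va_to_file_offset(0x140003010ULL, 0x140000000ULL, sections, offset));
    assert(offset == 0x2410);
    // Looks plausible (same high nibbles) but lands in no section.
    assert(!va_to_file_offset(0x140009000ULL, 0x140000000ULL, sections, offset));
}
```

The second lookup in the example is the pattern seen in the third class of file: the VA shares its upper bits with valid addresses but resolves to nothing.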

I loaded a few of the binaries with valid VAs into Ghidra and applied debugging symbols. I found that these binaries contained a symbol named __castguard_check_failure_os_handled_fptr in the .rdata section, and that the CastGuardOsDeterminedFailureMode VA pointed to the address of this symbol. I additionally found that the binaries included a fast-fail code called FAST_FAIL_CAST_GUARD (65) which is used when the process fast-fails due to a CastGuard range check failure. However, I couldn’t find the __CastGuardVftableStart or __CastGuardVftableEnd symbols for the CastGuard vftable region that had been mentioned in Joe’s talk.

Searching for these symbol names online led me to pieces of vcruntime source code included in SDKs as part of Visual Studio. The relevant source file is guard_support.c and it can be found in the following path:

[VisualStudio]/VC/Tools/MSVC/[version]/crt/src/vcruntime/guard_support.c

It appears that the CastGuard feature was added somewhere around version 14.28.29333, and minor changes have been made in later versions.

Comments in this file explain how the table alignment works. As of 14.34.31933, the start of the CastGuard section is aligned to a size of 16*sizeof(void*), i.e. 128-byte aligned on 64-bit platforms and 64-byte aligned on 32-bit platforms.

There are three parts to the table, and they are allocated as .rdata subsections: .rdata$CastGuardVftablesA, .rdata$CastGuardVftablesB, and .rdata$CastGuardVftablesC.

Parts A and C store the __CastGuardVftablesStart and __CastGuardVftablesEnd symbols. Both of these are defined as a CastGuardVftables struct type that contains a padding field of the alignment size. This means that the first vftable in the CastGuard section is placed at __CastGuardVftablesStart + sizeof(struct CastGuardVftables).

Part B is generated automatically by the compiler. It contains the vftables, and these are automatically aligned to whatever size makes sense during compilation. If no vftables are generated, part B is essentially missing, and you end up with __CastGuardVftablesEnd placed 64/128 bytes after __CastGuardVftablesStart.
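The alignment arithmetic is simple enough to state as code. This is a sketch of the numbers described above, with function names of my own choosing, and it assumes the start marker's padding struct is exactly one alignment unit in size.

```cpp
#include <cstddef>
#include <cstdint>

// Alignment of the CastGuard section for a given pointer size: 16 * sizeof(void*).
constexpr std::size_t castguard_alignment(std::size_t pointerSize) {
    return 16 * pointerSize;
}

// Address of the first vftable: the start marker plus one padding struct.
constexpr std::uintptr_t first_vftable(std::uintptr_t startMarker, std::size_t pointerSize) {
    return startMarker + castguard_alignment(pointerSize);
}

static_assert(castguard_alignment(8) == 128, "64-bit platforms: 128-byte alignment");
static_assert(castguard_alignment(4) == 64, "32-bit platforms: 64-byte alignment");
```

When part B is empty, first_vftable also gives the address of the end marker, which matches the 64/128-byte gap noted above.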

The guard_support.c code does not contain the CastGuard checks themselves; these are emitted as part of the compiler itself rather than being represented in a public source file. However, guard_support.c does contain the failure routines and the AppCompat check routine.

When a CastGuard check at a static_cast site fails, it calls into one of four failure routines:

  1. __castguard_check_failure_nop – does nothing.
  2. __castguard_check_failure_debugbreak – raises a breakpoint by calling __debugbreak()
  3. __castguard_check_failure_fastfail – fast-fails using __fastfail(FAST_FAIL_CAST_GUARD)
  4. __castguard_check_failure_os_handled – calls an OS handler function

Rather than calling the AppCompat check routine at every static_cast site, the check is instead deferred until a CastGuard check fails. Each of the check failure routines above, with the exception of nop, first calls into the AppCompat check routine to see if the failure should be ignored.

The AppCompat check routine is implemented in __castguard_compat_check, and it looks like this:

static
inline
BOOL
__cdecl __castguard_compat_check(PVOID rhsVftablePtr)
{
    ULONG_PTR realVftableRangeStart = (ULONG_PTR)&__CastGuardVftablesStart + sizeof(struct CastGuardVftables);
    ULONG_PTR realVftableRangeEnd = (ULONG_PTR)&__CastGuardVftablesEnd;
    ULONG_PTR vftableRangeSize = realVftableRangeEnd - realVftableRangeStart;

    return (ULONG_PTR)rhsVftablePtr - realVftableRangeStart <= vftableRangeSize;
}

This routine is responsible for checking whether the right-hand side (object being cast) vftable pointer is pointing somewhere between the first vftable in the CastGuard section and __CastGuardVftablesEnd. If it is, the AppCompat check returns true (i.e. this is a valid case that CastGuard should protect against), otherwise it returns false.

In the case of __castguard_check_failure_os_handled, the handler code looks like this:

extern
inline
void __cdecl __castguard_check_failure_os_handled(PVOID rhsVftablePtr)
{
    if (__castguard_compat_check(rhsVftablePtr))
    {
        __castguard_check_failure_os_handled_wrapper(rhsVftablePtr);
    }

    return;
}

If the AppCompat routine says that the failed check should be honoured, it calls an OS handler wrapper. The wrapper function looks like this:

static inline void
__declspec(guard(nocf))
__cdecl __castguard_check_failure_os_handled_wrapper(PVOID rhsVftablePtr)
{
    // This function is opted out of CFG because the OS handled function pointer
    // is allocated within ".00cfg" section. This section benefits from the same
    // level of protection as a CFG pointer would.

    if (__castguard_check_failure_os_handled_fptr != NULL)
    {
        __castguard_check_failure_os_handled_fptr(rhsVftablePtr);
    }
    return;
}

The __castguard_check_failure_os_handled_fptr function pointer being referred to here is the symbol that CastGuardOsDeterminedFailureMode points to in the load config table – the exact one I was trying to figure out the purpose of!

That function pointer is defined as:

__declspec(allocate(".00cfg"))
DECLSPEC_SELECTANY
VOID (* volatile __castguard_check_failure_os_handled_fptr)(PVOID rhsVftablePtr) = NULL;

The declspec is important here – it places __castguard_check_failure_os_handled_fptr in the same section as CFG/XFG pointers, which means (as the code comment above points out) that the OS handler function pointer is protected in the same way as the CFG/XFG pointers. Control flow from the CastGuard check site to the check failure function to the AppCompat check function can be protected by control flow guard, but flow from the failure routine to the OS handled function pointer cannot because its value is (presumably always) unknown at compile time. This is why the wrapper function above is required, with guard(nocf) applied – it disables CFG for the flow from the check failure function to the OS handler function, since CFG would likely disallow the indirect call, but since the pointer itself is protected it doesn’t actually matter.

This indicates that CastGuardOsDeterminedFailureMode is intended to be used to specify the location of the __castguard_check_failure_os_handled_fptr symbol, which in turn points to an OS handler function that is called when a check failure occurs.

None of this is documented but, given that Joe’s BHUSA2022 talk included an anecdote about Microsoft starting the CastGuard feature off in a report-only mode, I can only presume that CastGuardOsDeterminedFailureMode was designed to provide the binaries with this reporting feature.

At this point we still have three open questions, though. First, how does the compiler pick between the four different failure handlers? Second, how are the CastGuard checks themselves implemented? And third, why do a lot of the binaries have invalid VAs in CastGuardOsDeterminedFailureMode?

To answer the first question, we have to take a look at c2.dll in the MSVC compiler, which is where CastGuard is implemented under the hood. This DLL contains a class called CastGuard which, unsurprisingly, is responsible for most of the heavy lifting. One of the functions in this class, called InsertCastGuardCompatCheck, refers to a field of some unknown object in thread-local storage and picks which of the four check functions to insert a call to based on that value:

Value  Call
1      __castguard_check_failure_fastfail
2      __castguard_check_failure_debugbreak
3      __castguard_check_failure_os_handled
4      __castguard_check_failure_nop

From prior reverse engineering expeditions into the MSVC compiler, I remembered that config flags passed to the compiler are typically stored in a big structure in TLS. From there I was able to find the hidden compiler flags that enable CastGuard and control its behaviour.

Hidden flags can be passed to each stage of the compiler using a special /d command line argument. The format of the argument is /dN… where N specifies which DLL the hidden flag should be passed to (1 for the front-end compiler, c1.dll, or 2 for the code generator, c2.dll). The flag is then appended to the argument.

The known hidden compiler flags for CastGuard are:

Flag                                   Description
/d2CastGuard-                          Disables CastGuard.
/d2CastGuard                           Enables CastGuard.
/d2CastGuardFailureMode:fastfail       Sets the failure mode to fast-fail.
/d2CastGuardFailureMode:nop            Sets the failure mode to nop.
/d2CastGuardFailureMode:os_handled     Sets the failure mode to OS handled.
/d2CastGuardFailureMode:debugbreak     Sets the failure mode to debug break.
/d2CastGuardOption:dump_layout_info    Dumps the CastGuard layout info in the build output.
/d2CastGuardOption:force_type_system   Forces type system analysis, even if the binary is too big for fast analysis. This is intended to be used with the linker, rather than the compiler, so warning C5066 is raised if you pass it.
/d2CastGuardTestFlags:#                Sets various test flags for the CastGuard implementation, as a bitwise numeric value. Hex numbers are valid.

So now we know how the different failure modes are set: at build time, with a compiler flag.

If we rebuild the example code with some extra compiler flags, we can try CastGuard out:

/d2CastGuard /d2CastGuardFailureMode:debugbreak /d2CastGuardOption:dump_layout_info

The compiler then prints layout information for CastGuard:

1>***** CastGuard Region ******
1>Offset:0x00000 RTTIBias:0x8 Size:0x010 Alignment:0x08 VftableName:??_7Dog@@6B@
1>
1>
1>***** CastGuard Compatibility Info ******
1>

When executed, the static cast in SayMeow has a CastGuard check applied and raises a debug break in __castguard_check_failure_debugbreak.

We can also learn a little more about CastGuard from the warnings and errors that are known to be associated with it, by looking at the string tables in the compiler binaries:

  • C5064: “CastGuard has been disabled because the binary is too big for fast type system analysis and compiler throughput will be degraded. To override this behavior and force enable the type system so CastGuard can be used, specify the flag /d2:-CastGuardOption:force_type_system to the linker.”
  • C5065: “The CastGuard subsystem could not be enabled.”
  • C5066: “CastGuardOption:force_type_system should not be passed to the compiler, it should only be passed to the linker via /d2:-CastGuardOption:force_type_system. Passing this flag to the compiler directly will force the type system for all binaries this ltcg module is linked in to.”
  • C5067: “CastGuard is not compatible with d2notypeopt”
  • C5068: “CastGuard is not compatible with incremental linking”
  • C5069: “CastGuard cannot initialize the type system. An object is being used that was built with a compiler that did not include the necessary vftable type information (I_VFTABLETIS) which prevents the type system from loading. Object: %s”
  • C5070: “CastGuard cannot initialize the type system. An object is being used that was built with a compiler that did not include the necessary type information (I_TIS) which prevents the type system from loading. Object: %s”
  • C5071: “CastGuard cannot initialize the type system. An error occurred while trying to read the type information from the debug il. Object: %s”

Digging even further into the implementation, it appears that Microsoft added a new C++ attribute called nocastguard, which can be used to exclude a type from CastGuard checks. Based on my experimentation, this attribute is applied to types (applying the attribute to an argument or variable causes a compiler crash!) and disables checks when a static cast is performed to that type.

Changing our example code to the following causes the CastGuard check to be eliminated, and the type confusion bug returns:

struct [[msvc::nocastguard]] Cat : Animal {
    virtual void Speak() { std::cout << "Meow!\n"; }
};

If nocastguard is applied to the Dog or Animal type instead, the CastGuard check returns and the type confusion bug is prevented. This indicates that, at least in this unreleased implementation, the attribute is specifically used to prevent CastGuard checks on casts to the target type.

This newly CastGuard-enabled development environment makes it easy to experiment and disassemble the binary and see what the code looks like. In the simplest version of our example program, the result is actually quite amusing: the program does nothing except initialise a Dog object and immediately unconditionally call the failure routine in main. This is because the CastGuard check is injected into the IL during the optimisation phase. You can see this in practice: turning off optimisations causes the CastGuard pass to be skipped entirely. Since the check is part of the IL, it is subject to optimisation passes. The optimiser sees that the check essentially boils down to if (Cat::$vftable@ != Dog::$vftable@) { fail; }, whose expression is always true, which results in the branch being taken and the entire rest of the code being eliminated. Since SayMeow is only called once, it gets inlined, and the entire program ends up as a call to the CastGuard failure routine. This implies that it could technically be possible for a future release to identify such a scenario at build time and raise an error or warning.

To study things a little better, let’s expand the program in a way that introduces uncertainty and tricks the compiler into not optimising the routines. (Note: we can’t turn off optimisations to avoid all the inlining and elimination because that also turns off CastGuard.)

int main()
{
    for (int i = 0; i < 20; i++)
    {
        int idx = rand() % 3;
        Animal* animal = nullptr;
        switch (idx)
        {
        case 0:
            std::cout << "Making an animal...\n";
            animal = new Animal();
            break;
        case 1:
            std::cout << "Making a dog...\n";
            animal = new Dog();
            break;
        default:
            std::cout << "Making a cat...\n";
            animal = new Cat();
            break;
        }
        SayMeow(animal);
    }
}

This results in a program with an entirely normal looking main function, with no references to CastGuard routines. SayMeow looks like the following:

void SayMeow(Animal *animal)
{
    if (animal != nullptr && *animal != Cat::$vftable@)
    {
        __castguard_check_failure_debugbreak((void*)*animal);
    }
    animal->Speak();
}

This is pretty much expected: *animal dereferences the passed pointer to get to the vftable for the object, and, since the Cat type has no descendent types, the range check just turns into a straight equality check.

To make things more interesting, let’s add a WolfHound type that inherits from Dog, and a function called SayWoof that works just like SayMeow but with a cast to Dog instead of Cat. We’ll also update main so that it can create an Animal, Cat, Dog, or WolfHound.

Upon building this new program, the compiler dumps the CastGuard layout:

***** CastGuard Region ******
Offset:0x00000 RTTIBias:0x8 Size:0x010 Alignment:0x08 VftableName:??_7Animal@@6B@
Offset:0x00010 RTTIBias:0x8 Size:0x010 Alignment:0x08 VftableName:??_7Dog@@6B@
Offset:0x00020 RTTIBias:0x8 Size:0x010 Alignment:0x08 VftableName:??_7WolfHound@@6B@
Offset:0x00030 RTTIBias:0x8 Size:0x010 Alignment:0x08 VftableName:??_7Cat@@6B@

***** CastGuard Compatibility Info ******
Vftable:??_7Dog@@6B@ RangeCheck ComparisonBaseVftable:??_7Dog@@6B@ Size:0x10 ObjectCreated
    CompatibleVftable: Offset:0x00010 RTTIBias:0x8 Vftable:??_7Dog@@6B@
    CompatibleVftable: Offset:0x00020 RTTIBias:0x8 Vftable:??_7WolfHound@@6B@

We can see that the WolfHound vftable is placed immediately after the Dog vftable, and that the Dog type is compatible with the Dog and WolfHound types. We can also see that the size of the range check is 0x10, which makes sense because WolfHound’s vftable comes 0x10 bytes after Dog’s vftable.

The CastGuard check in SayWoof now ends up looking something like this:

void SayWoof(Animal* animal)
{
    if (animal != nullptr)
    {
        if (*animal - Dog::$vftable@ > 0x10)
        {
            __castguard_check_failure_debugbreak((void*)*animal);
        }
    }
    animal->Speak();
}

Let’s enumerate the possible flows here:

  • If the type being passed is Dog, then *animal is equal to Dog::$vftable@, which makes *animal - Dog::$vftable@ equal zero, so the check passes.
  • If the type being passed is WolfHound, then *animal is equal to WolfHound::$vftable@, which is positioned 0x10 bytes after Dog::$vftable@. As such, *animal - Dog::$vftable@ will equal 0x10, and the check passes.
  • If the type being passed is Cat, then *animal is equal to Cat::$vftable@, which makes *animal - Dog::$vftable@ equal 0x20, and the check fails.
  • If the type being passed is Animal, then *animal is equal to Animal::$vftable@. Since Animal::$vftable@ is positioned before Dog::$vftable@ in the table, the result of the unsigned subtraction will wrap, causing the result to be greater than 0x10, and the check fails.
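
The four flows above can be reproduced with plain unsigned arithmetic. The following is a minimal sketch rather than the compiler's actual output: the constants are the region offsets from the layout dump (the real vftable addresses differ, but their differences do not), and PassesRangeCheck is a hypothetical name introduced here for illustration.

```cpp
#include <cstdint>

// Offsets within the CastGuard region, copied from the layout dump.
constexpr std::uintptr_t kAnimalVftable    = 0x00;
constexpr std::uintptr_t kDogVftable       = 0x10;
constexpr std::uintptr_t kWolfHoundVftable = 0x20;
constexpr std::uintptr_t kCatVftable       = 0x30;

// Hypothetical model of the check in SayWoof: the unsigned difference between
// the object's vftable and the base vftable must not exceed the range size.
// A vftable located before the base wraps to a very large value and is rejected.
constexpr bool PassesRangeCheck(std::uintptr_t vftable,
                                std::uintptr_t base,
                                std::uintptr_t rangeSize)
{
    return (vftable - base) <= rangeSize;
}
```

Calling PassesRangeCheck with kDogVftable as the base and 0x10 as the range size accepts Dog and WolfHound and rejects Cat and Animal, which is why a single unsigned comparison is enough to cover both ends of the range.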

This shows CastGuard in action quite nicely!

For completeness, let’s go back and wrap up a small loose end relating to the hidden compiler flags: test flags. The /d2CastGuardTestFlags option takes a hexadecimal number value representing a set of bitwise flags. The test flags value is written to a symbol called CastGuardTestFlags inside c2.dll, and this value is used in roughly ten different locations in the code as of version 14.34.31933.

In the process of reverse engineering this code, I discovered that four separate check approaches are implemented – RangeCheck (0x01, the default), ROLCheck (0x02), ConstantBitmapCheck (0x03), and BitmapCheck (0x04) – presumably following the sequence of approaches and optimisations that were mentioned in the talk.

Here’s what I was able to figure out about these flags:

Flag Value Notes
0x01 Switches the check type to ROLCheck (0x02), as long as neither 0x02 nor 0x40 are also set.
0x02 Switches the check type to ConstantBitmapCheck (0x03), as long as 0x40 is not also set.
0x04 Appears to enable an alternative strategy for selecting the most appropriate vftable for a type with multiple inheritance.
0x08 Forces CastGuard::IsCastGuardCheckNeeded to default to true instead of false when no condition explicitly prevents a check, which appears to force the generation of CastGuard checks even if a codegen pass was not performed.
0x10 Forces generation of metadata for all types in the inheritance tree. Types that are never part of a cast check, either as a cast target or valid source type, do not normally end up as part of the CastGuard section. For example, Organism is ignored by CastGuard in our example programs because it never ends up being relevant at a static cast site. When this flag is enabled, all types in the inheritance tree are treated as relevant, and their vftables are placed into the CastGuard section. A type which is never part of a static cast, and whose parent and child types (if there are any) are never part of a static cast, is still kept separate and doesn’t end up in the CastGuard section.
0x20 Exact behaviour is unclear, but it seems to force the CastGuard subsystem to be enabled in a situation where error C5065 would be otherwise raised, and forces the TypeSystem::Builder::ProcessILRecord function to continue working even if an internal boolean named OneModuleEnablesCastGuard is false.
0x40 Switches the check type to BitmapCheck (0x04) and, if /d2CastGuardOption:dump_layout_info is also set, prints the bitmap in the build output.
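
The precedence between the three check-selecting flags can be written out as a short function. This is only a sketch of the behaviour described in the notes above, not code recovered from c2.dll: the enum and the SelectCheckType name are hypothetical, and the other test flags (0x04, 0x08, 0x10, 0x20) are assumed not to influence the check type.

```cpp
#include <cstdint>

// Check type identifiers as listed above.
enum class CheckType : std::uint32_t {
    RangeCheck          = 0x01,  // the default
    ROLCheck            = 0x02,
    ConstantBitmapCheck = 0x03,
    BitmapCheck         = 0x04,
};

// Hypothetical helper: 0x40 takes precedence over 0x02, which takes
// precedence over 0x01. With none of them set, the default is used.
inline CheckType SelectCheckType(std::uint32_t testFlags)
{
    if (testFlags & 0x40) {
        return CheckType::BitmapCheck;
    }
    if (testFlags & 0x02) {
        return CheckType::ConstantBitmapCheck;
    }
    if (testFlags & 0x01) {
        return CheckType::ROLCheck;
    }
    return CheckType::RangeCheck;
}
```

For example, a test flags value of 0x03 selects ConstantBitmapCheck rather than ROLCheck, because 0x01 only takes effect when 0x02 and 0x40 are both clear.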

The three alternative check patterns function exactly as was explained in the BHUSA2022 talk, so I won’t go into them any further.

Unless I missed anything, we appear to be down to just one final question: why am I seeing invalid VAs in CastGuardOsDeterminedFailureMode on a bunch of Windows executables?

At first I thought that there might be some kind of masking going on, with certain bits guaranteed to be zero in the VA due to alignment requirements, with those bit positions being reused to set or indicate the failure mode or check type. This doesn’t make much sense, though, and I can find no supporting evidence. It appears that this is a bug from an earlier implementation of CastGuard, when Microsoft were trialling rolling out notify-only protection on certain components. I couldn’t concretely confirm this theory, but I did manage to have a quick chat with someone who worked on the feature, and they were as surprised to see the invalid VAs as I was.

It takes time to get these compiler-level bug class mitigations implemented correctly. The analysis in this article was originally performed in February 2023, but CastGuard remains unofficial and undocumented as of October 2023. Given the unfathomable quantity of existing code that interacts with COM interfaces, all of which might be affected by this feature, and the politically fractious intersection between C++ language standards and implementation-specific language features, it isn’t particularly surprising that it’s taking Microsoft a while to roll this mitigation out.

The post Preventing Type Confusion with CastGuard appeared first on LRQA Nettitude Labs.

Zenbleed – AMD Side-Channel Attack Targets Vectorised Functions

30 August 2023 at 09:00

This article provides a technical analysis of Zenbleed, a side-channel attack affecting all AMD Zen 2 processors. Tavis Ormandy reported this vulnerability to AMD on 15 May 2023 and it was assigned CVE-2023-20593. The vulnerability is of particular concern for shared hosting providers, virtualisation platforms, and other shared-tenant systems. However, any scenario where a malicious actor can execute code potentially poses a threat, including in contexts such as privilege escalation, sandbox escape, and possibly even malicious JavaScript executing in a web browser.

While AMD has historically enjoyed relative respite from side-channel attack publications, this past disparity was largely due to Intel’s processors being a more attractive research target, with a greater depth of information available around engineering features (e.g. red unlock) and internals (e.g. microcode structure), and a greater share of the server market at the time. In the five years since Meltdown and Spectre, researchers have been busy closing the knowledge gap around AMD’s processors, making it easier to discover impactful security issues.

The Zenbleed vulnerability exploits incorrect recovery behaviour after a branch misprediction involving optimised vector instructions, resulting in information within floating point unit (FPU) registers being leaked. Vectorisation is frequently utilised in common library functions (e.g. memcpy, memcmp, strlen) for performance reasons, making this a very wide-reaching vulnerability in terms of the types of data that can be extracted.

To understand Zenbleed, we need to dig into modern processor design. Modern x86_64 processors do not simply execute one instruction after the next. Instead, they operate in a superscalar manner, essentially executing multiple instructions at once using techniques such as instruction-level parallelism (ILP) and out-of-order execution. While the processor outwardly appears to have a small number of general purpose registers (e.g. rax, rbx, r12, etc.) and a bank of SIMD registers (e.g. xmm0, ymm3, etc.), each processor core actually has a far larger number of internal registers. The named registers aren’t uniquely represented by a single physical hardware register each, but are rather dynamically allocated in a register file. This enables some very important optimisations.

For example, if you were to execute the instruction xchg rax, rcx, the processor almost certainly doesn’t move any values between physical hardware registers within the register file. Instead, it performs a register rename, essentially swapping the labels on the register file entries. This also happens with SIMD registers, allowing for complex behaviours and optimisations relating to the “nesting” of registers (e.g. xmm1 being one half of ymm1, which in turn is one half of zmm1).

When we think of a classical processor design, we typically think of it having an instruction decoder, an arithmetic logic unit (ALU), a floating point unit (FPU), etc. However, a superscalar processor actually has several of these per core, and uses a complex scheduling system to execute many operations at the same time. By identifying data dependencies between instructions, the processor can identify cases where later instructions do not depend upon the results of previous instructions, allowing it to execute those instructions at the same time.

For example, consider the following sequence of instructions:

mov rcx, [rbp+0x8]
lea rcx, [rcx*0x4]
sub rax, 0x8
add rcx, rax
xor rax, rax
mov [rbp+0x8], rax
mov [rbp+0x10], rcx

Rather than executing the first instruction, stalling while waiting for the memory fetch to complete, then working on the next instructions, the processor can instead look ahead and see that sub rax, 0x8 does not depend upon the results of the first two instructions and choose to execute it simultaneously. It may also recognise that xor rax, rax sets rax to zero, thus not depending on the value of rax before that time, allowing it to start working on further instructions too, as long as memory accesses are correctly ordered. Not only this, but if the processor’s register allocation scheme keeps track of which entries in the register file are zero, then it does not need to explicitly zero a register to represent rax, but can simply reuse an already-zeroed entry.

By carefully accounting for data dependencies and memory access ordering, the processor can parallelise operations across multiple physical ALUs and other units at the same time, re-ordering operations to try to ensure maximum utilisation of parallel units at all times. This also occurs with SIMD instructions, with special accounting for the upper and lower halves of the SIMD registers (xmm*, ymm*, zmm*) to help identify data dependencies when independent pieces of data are simultaneously processed in a vectorised manner.

This behaviour also interacts with speculative execution, where the processor tries to guess what the result of a branch instruction will be and continues execution as if the guess was correct, then rolls back to the previous state if the guess was incorrect. For example:

cmp rax, [rcx]
je skip
add rcx, 4
lea rax, [rcx*2+8]
mov [rcx], rax
skip:
add rcx, 8

When the processor hits je skip, the memory fetch from the first instruction is still in flight, so it doesn’t yet know whether the branch will be taken or not. Without speculative execution this results in a pipeline stall while the memory fetch completes. To avoid this stall, the processor makes a branch prediction (i.e. an informed guess based on various metadata and prior observations) and saves a checkpoint. It then continues execution as if its prediction was correct (i.e. either after the branch or at the branch target, depending on what the prediction was) and either commits or rolls back its state depending on whether its prediction later turns out to be correct.

Let’s say that the processor guesses that the branch is not taken. It executes the code immediately after the branch (i.e. add rcx, 4, …) and continues until it hits the write hazard at mov [rcx], rax. It may also look ahead and see that it would execute add rcx, 8, which is not dependent on the write hazard, and execute that too. ILP also applies here, so some of these operations can be done in parallel.

When the memory fetch issued by cmp rax, [rcx] comes back, the processor now knows whether or not its prediction was correct. If it was, it commits the speculatively executed state and carries on. If it wasn’t, it has to roll back the state to an earlier checkpoint.

The Zenbleed vulnerability arises from faulty behaviour when a branch misprediction rollback occurs immediately after a special SIMD register optimisation and register rename occur.

The optimisation in question is called the XMM Register Merge Optimization. AMD Zen 2 processors keep track of SIMD registers whose upper halves have been zeroed, using a z-bit in their Register Allocation Table (RAT). When an instruction writes non-zero data to the upper half of a register, the z-bit is cleared, indicating that there is data present and any subsequent instructions that might be affected by that data cannot be executed until the data dependency is resolved. However, if the upper half is known to be zero, instructions that do not modify that upper half can proceed without waiting, avoiding the data dependency and the resulting pipeline stall.

Tavis Ormandy’s writeup of Zenbleed demonstrates this optimisation using the AVX2 optimised strlen function from glibc:

vpxor xmm0, xmm0, xmm0 ; xor xmm0 with xmm0 and store it in xmm0 (extends to ymm0)
vpcmpeqb ymm1, ymm0, [rdi] ; compare the memory at rdi to ymm0, store result in ymm1
vpmovmskb eax, ymm1 ; set eax to a 32-bit bitmap of null bytes in the ymm1 register
tzcnt eax, eax ; count the trailing zeroes
vzeroupper ; zero the upper 128 bits of ymm0-ymm15

The first instruction zeroes the 128-bit SIMD register xmm0 (similar to xor rax, rax) and, in the process, also zeroes the 256-bit SIMD register ymm0 which encompasses it, since xmm0 is the lower half of ymm0.

The second instruction, vpcmpeqb (vector compare equal bytes), treats the ymm0 register as 32 packed bytes and compares those to the 32 bytes of memory pointed to by rdi. Bytes that are equal produce a corresponding byte of all 1s in the ymm1 destination register, whereas bytes that are not equal produce a corresponding byte of all 0s.

The third instruction, vpmovmskb (vector move byte mask), takes the most significant bit of each packed byte in the ymm1 register and writes it to the corresponding bit in eax. This results in MSBs from 32 separate bytes in ymm1 being packed into a single 32-bit general purpose register.

The fourth instruction, tzcnt, counts the trailing zero bits in eax. Since each set bit in eax marks a byte in the source memory that was zero, the count is the index of the first \0 byte in the 32-byte chunk, i.e. the length of the string within that chunk.

The fifth instruction, vzeroupper, is not functionally required – the code has already finished locating the first \0 byte – but its presence is important for performance. The instruction zeroes the upper halves of all ymm registers (and zmm registers too) – or, rather, what this actually does is set the corresponding z-bits in the RAT to indicate that the upper halves of each register are zero, without actually zeroing any underlying entries in the register file. The lower half of the ymm register (accessible via xmm*) is still allocated in the register file, but it is merged with an upper half that is unallocated and marked as zero via its z-bit.

This is why the vzeroupper instruction helps prevent the processor from falsely assuming data dependencies in subsequent instructions that use the ymm registers. The XMM Register Merge Optimization allows the processor to identify instructions which do not write to the upper portion of the register, thus letting them execute without treating the upper (zero) portion of the register as a data dependency. This uncouples the data dependency between overlapping xmm and ymm registers.
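
To make the arithmetic in the strlen snippet concrete, the compare, mask, and count steps can be modelled in portable scalar C++. This is a sketch for illustration only: it does not use AVX2, the function names are hypothetical, and it assumes the chunk is exactly 32 bytes, as in the ymm case.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of vpcmpeqb followed by vpmovmskb against an all-zero ymm0:
// bit i of the result is set when byte i of the chunk is zero.
inline std::uint32_t NulByteMask(const unsigned char (&chunk)[32])
{
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < 32; ++i) {
        if (chunk[i] == 0) {
            mask |= (std::uint32_t{1} << i);
        }
    }
    return mask;
}

// Scalar model of tzcnt on a 32-bit operand: counts the zero bits below the
// lowest set bit, and returns 32 when no bit is set at all.
inline unsigned CountTrailingZeros(std::uint32_t value)
{
    unsigned count = 0;
    while (count < 32 && (value & (std::uint32_t{1} << count)) == 0) {
        ++count;
    }
    return count;
}
```

For a chunk holding the bytes 'c', 'a', 't' followed by zeroes, the mask has every bit from position 3 upwards set, so the trailing zero count is 3, the index of the first null byte.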

Unfortunately it seems that AMD Zen 2 processors do not correctly handle the case when a vzeroupper instruction is speculatively executed and then rolled back due to branch misprediction. The scenario is as follows:

  1. SIMD instructions that support the XMM Register Merge Optimization are executed, using xmm operands.
  2. A register rename is triggered on the overlapping ymm operand, e.g. by the vmovdqa instruction.
  3. A branch is reached and the CPU speculatively executes past it.
  4. A vzeroupper instruction is speculatively executed, which sets the z-bit on the upper halves of all ymm registers and deallocates their respective entries in the register file.
  5. The branch condition is resolved and misprediction is detected.
  6. The processor rolls back the vzeroupper instruction by clearing the z-bits and re-allocating the entries.
  7. Execution continues from the correct branch path.

However, when the rollback occurs, the processor resets the z-bit to zero, leaving the register in an undefined state, with the upper half of the ymm register pointing at an uninitialised entry in the register file. This is comparable to a use-after-free bug, but in the processor’s register file instead of system memory.
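
The use-after-free comparison can be illustrated with a deliberately simplified software model. Everything in this sketch is an assumption made for illustration: the real Zen 2 register file and RAT are not publicly documented at this level, and ToyCore, its members, and RunToyLeak are hypothetical names. The model only captures the idea that the z-bit is cleared on rollback while the mapping still names an entry that has since been handed to someone else.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy model only. One register's upper half, a tiny register file, and a z-bit.
struct ToyCore {
    std::array<std::uint64_t, 4> registerFile{};  // physical entries
    std::array<bool, 4> inUse{};                  // allocation state per entry
    std::size_t upperHalfEntry = 0;               // entry backing the upper half
    bool zBit = false;                            // upper half known to be zero

    // Speculative vzeroupper: set the z-bit and free the backing entry.
    void SpeculativeVzeroupper() { zBit = true; inUse[upperHalfEntry] = false; }

    // Another hardware thread takes the first free entry for its own data.
    void AllocateFor(std::uint64_t value) {
        for (std::size_t i = 0; i < registerFile.size(); ++i) {
            if (!inUse[i]) { inUse[i] = true; registerFile[i] = value; return; }
        }
    }

    // Faulty rollback: the z-bit is cleared, but the stale mapping is kept.
    void FaultyRollback() { zBit = false; }

    std::uint64_t ReadUpperHalf() const {
        return zBit ? 0 : registerFile[upperHalfEntry];
    }
};

// Runs the sequence from the list above and returns what the attacker reads.
inline std::uint64_t RunToyLeak(std::uint64_t secret)
{
    ToyCore core;
    core.inUse[0] = true;            // attacker's upper half lives in entry 0
    core.registerFile[0] = 0x1111;
    core.SpeculativeVzeroupper();    // step 4
    core.AllocateFor(secret);        // sibling thread reuses the freed entry
    core.FaultyRollback();           // steps 5 and 6
    return core.ReadUpperHalf();     // stale mapping now exposes the secret
}
```

In this model RunToyLeak returns whatever value the other thread stored, which mirrors how a dangling pointer in software can expose a later allocation.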

Since the register file is shared between the SMT threads of a core, this can be used to snoop on data in the SIMD registers across hyperthreads. This isn’t the only attack scenario, though – the same attack can be leveraged for privilege escalation.

While it might initially seem like SIMD registers aren’t particularly interesting, they are used in optimised versions of almost all string and memory manipulation functions in standard libraries. This means they are constantly handling sensitive data like passwords, keys, configuration files, etc. making all this data vulnerable to leakage.

There is a PoC exploit for Zenbleed on GitHub which is capable of dumping data across hyperthreads. The code is also nicely commented and quite easy to follow.

AMD released Bulletin AMD-SB-7008 “Cross-Process Information Leak” to track the issue. They also released a microcode patch to address the issue on Family 17h Model 31h (EPYC 7002 series) and Family 17h Model A0h (Sabrina SoCs). So far there are no microcode updates for consumer products, meaning that AMD’s desktop, mobile, HEDT, and workstation (Threadripper) processors remain vulnerable. AGESA firmware updates are scheduled for release in October and December 2023, which should contain new microcode for those products. It seems that the coordinated disclosure process for Zenbleed went a little off the rails, possibly due to AMD accidentally publishing information several months ahead of the agreed embargo date, resulting in the bug being disclosed 3-4 months ahead of patch availability.

On systems where the microcode or firmware updates cannot be applied, a workaround is possible using a chicken bit in the DE_CFG register at MSR 0xC0011029. Setting bit 9 in this register enables a backup fix, but has additional performance impact compared to the microcode update. Linux’s name for this workaround bit is MSR_AMD64_DE_CFG_ZEN2_FP_BACKUP_FIX_BIT, which it should automatically apply on affected platforms when no microcode update is present. The bit can manually be set on Linux using msr-tools, or on FreeBSD with cpucontrol.
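
The bit manipulation behind the workaround is small enough to sketch. The MSR address and bit position below come from the text; the helper names are hypothetical, and the sketch deliberately leaves out the privileged read and write of the MSR itself, which is what tools such as msr-tools or cpucontrol provide.

```cpp
#include <cstdint>

constexpr std::uint32_t kMsrAmd64DeCfg = 0xC0011029;  // DE_CFG MSR address
constexpr unsigned kZen2FpBackupFixBit = 9;           // DE_CFG[9] chicken bit

// Returns the DE_CFG value with the workaround bit set and all other bits
// left untouched, as a read-modify-write of the MSR would need to do.
constexpr std::uint64_t WithFpBackupFix(std::uint64_t deCfg)
{
    return deCfg | (std::uint64_t{1} << kZen2FpBackupFixBit);
}

// Reports whether the workaround bit is already set in a DE_CFG value.
constexpr bool HasFpBackupFix(std::uint64_t deCfg)
{
    return ((deCfg >> kZen2FpBackupFixBit) & 1u) != 0;
}
```

Bit 9 corresponds to a mask of 0x200, so a DE_CFG value of 0x1000 becomes 0x1200 once the workaround bit is applied. The value would need to be written on every logical processor for the workaround to be effective system-wide.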

At the time of writing, Microsoft do not appear to have a security update that applies the DE_CFG[9] chicken bit workaround. You can modify MSRs using RWEverything on Windows, although that comes with its own risks and is probably not a sensible thing to do in production.

It is possible to query which microcode version is currently loaded, and so test whether an update has been applied, although the method is OS specific. On Windows, the microcode version information is found in the following registry key:

HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System\CentralProcessor\0

The Update Revision value describes the microcode version that has been loaded into the processor, and the Previous Update Revision describes the microcode version that was loaded into the processor by the system firmware (UEFI / BIOS) at boot.

On Linux, /proc/cpuinfo will list the microcode version alongside other processor details:

processor : 127
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD EPYC 7601 32-Core Processor
stepping : 2
microcode : 0x8001206

The same info can also usually be found in the kernel boot log.

For Zen 2 architecture EPYC processors, a microcode version of 0x0830107a or higher indicates that a fix was applied. For Zen 2 architecture Sabrina SoCs, a microcode version of 0x08a00008 or higher indicates that a fix was applied. As noted above, all other processor families, including desktop Ryzen processors, are yet to receive a microcode update with a patch, so we don’t yet know what the fixed microcode versions will be.
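
Checking a reported revision against these thresholds can be sketched as follows. The two constants are the versions quoted above; the function names are hypothetical, and the parser assumes a line in the same "microcode : 0x..." shape as the /proc/cpuinfo excerpt. A comparison is only meaningful against the threshold for the matching processor model.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Minimum fixed microcode revisions quoted in the text.
constexpr std::uint32_t kEpyc7002FixedRevision = 0x0830107a;
constexpr std::uint32_t kSabrinaFixedRevision  = 0x08a00008;

// Extracts the revision from a /proc/cpuinfo line such as
// "microcode : 0x8001206". Returns false if the line is not a microcode line
// or the value cannot be parsed as hexadecimal.
inline bool ParseMicrocodeLine(const std::string& line, std::uint32_t& revision)
{
    if (line.rfind("microcode", 0) != 0) {
        return false;  // line does not start with "microcode"
    }
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) {
        return false;
    }
    try {
        // Base 16 accepts leading whitespace and an optional 0x prefix.
        revision = static_cast<std::uint32_t>(
            std::stoul(line.substr(colon + 1), nullptr, 16));
        return true;
    } catch (const std::exception&) {
        return false;
    }
}

// True when the loaded revision is at or above the fixed revision.
constexpr bool MeetsFixedRevision(std::uint32_t revision,
                                  std::uint32_t minimumFixed)
{
    return revision >= minimumFixed;
}
```

Applied to the example line above, the parser yields 0x8001206, which is below the EPYC 7002 threshold of 0x0830107a.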

In the interim, Linux should automatically apply software mitigations for Zenbleed. You can query the status of these mitigations through the sysfs interface, under the following directory:

/sys/devices/system/cpu/vulnerabilities/

If you’re running a server with a Zen 2 EPYC processor, you should update your firmware and install all OS patches to help ensure that Zenbleed is patched. If your system vendor has yet to release firmware updates to address this issue, it is possible that your OS will still load the new microcode blobs at boot, so make sure to check that first before trying to implement any manual workarounds. As always, refer to vendor guidance for good practice mitigation strategies.

The post Zenbleed – AMD Side-Channel Attack Targets Vectorised Functions appeared first on LRQA Nettitude Labs.
