
5 reasons to strive for better disclosure processes

15 April 2024 at 13:00

By Max Ammann

This blog post showcases five examples of real-world vulnerabilities that we disclosed to maintainers over the past year but have not written about publicly until now. We also share the frustrations we faced in disclosing them to illustrate the need for effective disclosure processes.

Here are the five bugs:

  • Undefined behavior in the borsh-rs Rust library
  • DoS vector in Rust libraries for parsing the Ethereum ABI
  • Missing limit on authentication tag length in Expo
  • DoS vector in the num-bigint Rust library
  • Insertion of the MMKV database encryption key into the Android system log with react-native-mmkv

Discovering a vulnerability in an open-source project necessitates a careful approach, as publicly reporting it (also known as full disclosure) can alert attackers before a fix is ready. Coordinated vulnerability disclosure (CVD) uses a safer, structured reporting framework to minimize risks. Our five example cases demonstrate how the lack of a CVD process unnecessarily complicated reporting these bugs and ensuring their remediation in a timely manner.

In the Takeaways section, we show you how to set up your project for success by providing a basic security policy you can use and walking you through a streamlined disclosure process called GitHub private reporting. GitHub’s feature has several benefits:

  • Discreet and secure alerts to developers: no need for PGP-encrypted emails
  • Streamlined process: no playing hide-and-seek with company email addresses
  • Simple CVE issuance: no need to file a CVE form at MITRE

Time for action: If you own well-known projects on GitHub, use private reporting today! Read more on Configuring private vulnerability reporting for a repository, or skip to the Takeaways section of this post.

Case 1: Undefined behavior in borsh-rs Rust library

The first case, and reason for implementing a thorough security policy, concerned a bug in a cryptographic serialization library called borsh-rs that was not fixed for two years.

During an audit, I discovered unsafe Rust code that could cause undefined behavior if used with zero-sized types that don’t implement the Copy trait. Even though somebody else reported this bug previously, it was left unfixed because it was unclear to the developers how to avoid the undefined behavior in the code and keep the same properties (e.g., resistance against a DoS attack). During that time, the library’s users were not informed about the bug.

The whole process could have been streamlined using GitHub’s private reporting feature. If project developers cannot address a vulnerability when it is reported privately, they can still notify Dependabot users about it with a single click. Releasing an actual fix is optional when reporting vulnerabilities privately on GitHub.

I reached out to the borsh-rs developers about notifying users while there was no fix available. The developers decided that it was best to notify users because only certain uses of the library caused undefined behavior. We filed the notification RUSTSEC-2023-0033, which created a GitHub advisory. A few months later, the developers fixed the bug, and the major release 1.0.0 was published. I then updated the RustSec advisory to reflect that it was fixed.

The following code contained the bug that caused undefined behavior:

impl<T> BorshDeserialize for Vec<T>
where
    T: BorshDeserialize,
{
    #[inline]
    fn deserialize<R: Read>(reader: &mut R) -> Result<Self, Error> {
        let len = u32::deserialize(reader)?;
        if size_of::<T>() == 0 {
            let mut result = Vec::new();
            result.push(T::deserialize(reader)?);

            let p = result.as_mut_ptr();
            unsafe {
                forget(result);
                let len = len as usize;
                let result = Vec::from_raw_parts(p, len, len);
                Ok(result)
            }
        } else {
            // TODO(16): return capacity allocation when we can safely do that.
            let mut result = Vec::with_capacity(hint::cautious::<T>(len));
            for _ in 0..len {
                result.push(T::deserialize(reader)?);
            }
            Ok(result)
        }
    }
}

Figure 1: Use of unsafe Rust (borsh-rs/borsh-rs/borsh/src/de/mod.rs#123–150)

The code in figure 1 deserializes bytes to a vector of some generic data type T. If the type T is a zero-sized type, then unsafe Rust code is executed. The code first reads the requested length for the vector as u32. After that, the code allocates an empty Vec type. Then it pushes a single instance of T into it. Later, it temporarily leaks the memory of the just-allocated Vec by calling the forget function and reconstructs it by setting the length and capacity of Vec to the requested length. As a result, the unsafe Rust code assumes that T is copyable.

The unsafe Rust code protects against a DoS attack where the deserialized in-memory representation is significantly larger than the serialized on-disk representation. The attack works by setting the vector length to a large number and using zero-sized types. An instance of this bug is described in our blog post Billion times emptiness.
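
To see why the Copy requirement matters, consider a zero-sized type that is not Copy and has a Drop implementation. The following contrived sketch (our own illustration, not borsh-rs code) mirrors the pattern in figure 1: only one value is ever constructed, yet the reconstructed Vec claims to own many of them, so Drop runs for values that never existed, which is undefined behavior under the Vec::from_raw_parts safety contract.

use std::mem::forget;

struct Token; // zero-sized, but not Copy

impl Drop for Token {
    fn drop(&mut self) {
        println!("dropping a Token");
    }
}

fn main() {
    let mut result = vec![Token]; // exactly one Token is ever constructed
    let p = result.as_mut_ptr();
    forget(result);
    // Undefined behavior: this Vec claims 1,000 initialized elements, so Drop
    // will run 1,000 times for values that were never created.
    let fake: Vec<Token> = unsafe { Vec::from_raw_parts(p, 1000, 1000) };
    drop(fake);
}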

Case 2: DoS vector in Rust libraries for parsing the Ethereum ABI

In July, I disclosed multiple DoS vulnerabilities in four Ethereum ABI–parsing libraries, which were difficult to report because I had to reach out to multiple parties.

The bug affected four GitHub-hosted projects. Only the Python project eth_abi had GitHub private reporting enabled. For the other three projects (ethabi, alloy-rs, and ethereumjs-abi), I had to research who was maintaining them, which can be error-prone. For instance, I had to resort to the trick of getting email addresses from maintainers by appending the suffix .patch to GitHub commit URLs. The following link shows the non-work email address I used for committing:

https://github.com/trailofbits/publications/commit/a2ab5a1cab59b52c4fa71b40dae1f597bc063bdf.patch

In summary, as the group of affected vendors grows, the burden on the reporter grows as well. Because you typically need to synchronize between vendors, the effort does not grow linearly but exponentially. Having more projects use the GitHub private reporting feature, a security policy with contact information, or simply an email in the README file would streamline communication and reduce effort.

Read more about the technical details of this bug in the blog post Billion times emptiness.

Case 3: Missing limit on authentication tag length in Expo

In late 2022, Joop van de Pol, a security engineer at Trail of Bits, discovered a cryptographic vulnerability in expo-secure-store. In this case, the vendor, Expo, failed to follow up with us about whether they acknowledged or had fixed the bug, which left us in the dark. Even worse, trying to follow up with the vendor consumed a lot of time that could have been spent finding more bugs in open-source software.

When we initially emailed Expo about the vulnerability through the email address listed on its GitHub, [email protected], an Expo employee responded within one day and confirmed that they would forward the report to their technical team. However, after that response, we never heard back from Expo despite two gentle reminders over the course of a year.

Unfortunately, Expo did not allow private reporting through GitHub, so the email was the only contact address we had.

Now to the specifics of the bug: on Android above API level 23, SecureStore uses AES-GCM keys from the KeyStore to encrypt stored values. During encryption, the tag length and initialization vector (IV) are generated by the underlying Java crypto library as part of the Cipher class and are stored with the ciphertext:

/* package */ JSONObject createEncryptedItem(
    Promise promise, String plaintextValue, Cipher cipher, GCMParameterSpec gcmSpec,
    PostEncryptionCallback postEncryptionCallback)
    throws GeneralSecurityException, JSONException {

  byte[] plaintextBytes = plaintextValue.getBytes(StandardCharsets.UTF_8);
  byte[] ciphertextBytes = cipher.doFinal(plaintextBytes);
  String ciphertext = Base64.encodeToString(ciphertextBytes, Base64.NO_WRAP);

  String ivString = Base64.encodeToString(gcmSpec.getIV(), Base64.NO_WRAP);
  int authenticationTagLength = gcmSpec.getTLen();

  JSONObject result = new JSONObject()
    .put(CIPHERTEXT_PROPERTY, ciphertext)
    .put(IV_PROPERTY, ivString)
    .put(GCM_AUTHENTICATION_TAG_LENGTH_PROPERTY, authenticationTagLength);

  postEncryptionCallback.run(promise, result);

  return result;
}

Figure 2: Code for encrypting an item in the store, where the tag length is stored next to the cipher text (SecureStoreModule.java)

For decryption, the ciphertext, tag length, and IV are read and then decrypted using the AES-GCM key from the KeyStore.

An attacker with access to the storage can change an existing AES-GCM ciphertext to have a shorter authentication tag. Depending on the underlying Java cryptographic service provider implementation, the minimum tag length is 32 bits in the best case (this is the minimum allowed by the NIST specification), but it could be even lower (e.g., 8 bits or even 1 bit) in the worst case. So in the best case, the attacker has a small but non-negligible probability that the same tag will be accepted for a modified ciphertext, but in the worst case, this probability can be substantial. In either case, the success probability grows depending on the number of ciphertext blocks. Also, both repeated decryption failures and successes will eventually disclose the authentication key. For details on how this attack may be performed, see Authentication weaknesses in GCM from NIST.

From a cryptographic point of view, this is an issue. However, due to the required storage access, it may be difficult to exploit this issue in practice. Based on our findings, we recommended fixing the tag length to 128 bits instead of writing it to storage and reading it from there.
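
One way to implement that recommendation is to hard-code the tag length during decryption instead of trusting the value read from storage. The following Java sketch shows the general shape; the method and variable names are ours, not Expo’s.

import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Fragment of a hypothetical helper class: pin the GCM tag length to 128 bits
// rather than reading it from storage alongside the ciphertext.
private static final int GCM_TAG_LENGTH_BITS = 128;

byte[] decryptItem(SecretKey key, byte[] iv, byte[] ciphertext) throws GeneralSecurityException {
  Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
  GCMParameterSpec gcmSpec = new GCMParameterSpec(GCM_TAG_LENGTH_BITS, iv);
  cipher.init(Cipher.DECRYPT_MODE, key, gcmSpec);
  // Throws AEADBadTagException if the ciphertext or tag has been tampered with.
  return cipher.doFinal(ciphertext);
}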

The story would have ended here since we didn’t receive any responses from Expo after the initial exchange. But in our second email reminder, we mentioned that we were going to publicly disclose this issue. One week later, the bug was silently fixed by limiting the minimum tag length to 96 bits. Practically, 96 bits offers sufficient security. However, there is also no reason not to go with the higher 128 bits.

The fix was created exactly one week after our last reminder. We suspect that our previous email reminder led to the fix, but we don’t know for sure. Unfortunately, we were never credited appropriately.

Case 4: DoS vector in the num-bigint Rust library

In July 2023, Sam Moelius, a security engineer at Trail of Bits, encountered a DoS vector in the well-known num-bigint Rust library. Even though the disclosure through email worked very well, users were never informed about this bug through, for example, a GitHub advisory or CVE.

The num-bigint project is hosted on GitHub, but GitHub private reporting is not set up, so there was no quick way for the library author or us to create an advisory. Sam reported this bug to the developer of num-bigint by sending an email. But finding the developer’s email is error-prone and takes time. Instead of sending the bug report directly, you must first confirm that you’ve reached the correct person via email and only then send out the bug details. With GitHub private reporting or a security policy in the repository, the channel to send vulnerabilities through would be clear.

But now let’s discuss the vulnerability itself. The library implements very large integers that no longer fit into primitive data types like i128. On top of that, the library can also serialize and deserialize those data types. The vulnerability Sam discovered was hidden in that serialization feature. Specifically, deserialization can crash the program through excessive memory consumption, or the requested memory allocation can be so large that it fails outright.

The num-bigint types implement traits from Serde. This means that any type in the crate can be serialized and deserialized using an arbitrary file format like JSON or the binary format used by the bincode crate. The following example program shows how to use this deserialization feature:

use num_bigint::BigUint;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let mut buf = Vec::new();
    let _ = std::io::stdin().read_to_end(&mut buf)?;
    let _: BigUint = bincode::deserialize(&buf).unwrap_or_default();
    Ok(())
}

Figure 3: Example program using the deserialization feature

It turns out that certain inputs cause the above program to crash. This is because the Visitor trait implementation uses untrusted user input to choose the capacity of a vector allocation. The following figure shows the lines that can cause the program to crash with the message memory allocation of 2893606913523067072 bytes failed.

impl<'de> Visitor<'de> for U32Visitor {
    type Value = BigUint;

    {...omitted for brevity...}

    #[cfg(not(u64_digit))]
    fn visit_seq<S>(self, mut seq: S) -> Result<Self::Value, S::Error>
    where
        S: SeqAccess<'de>,
    {
        let len = seq.size_hint().unwrap_or(0);
        let mut data = Vec::with_capacity(len);

        {...omitted for brevity...}
    }

    #[cfg(u64_digit)]
    fn visit_seq<S>(self, mut seq: S) -> Result<Self::Value, S::Error>
    where
        S: SeqAccess<'de>,
    {
        use crate::big_digit::BigDigit;
        use num_integer::Integer;

        let u32_len = seq.size_hint().unwrap_or(0);
        let len = Integer::div_ceil(&u32_len, &2);
        let mut data = Vec::with_capacity(len);

        {...omitted for brevity...}
    }
}

Figure 4: Code that allocates memory based on user input (num-bigint/src/biguint/serde.rs#61–108)
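
A common defensive pattern for this class of bug is to treat the size hint as untrusted: cap the pre-allocated capacity and let the vector grow as elements actually arrive. The sketch below only illustrates the idea; it is not the actual fix from commit 44c87c1, and the bound is arbitrary.

const MAX_PREALLOCATED: usize = 4096; // arbitrary illustrative bound

fn visit_seq<S>(mut seq: S) -> Result<Vec<u64>, S::Error>
where
    S: serde::de::SeqAccess<'static>,
{
    // Use the untrusted size hint only up to a fixed bound.
    let hint = seq.size_hint().unwrap_or(0);
    let mut data = Vec::with_capacity(hint.min(MAX_PREALLOCATED));
    while let Some(digit) = seq.next_element::<u64>()? {
        data.push(digit);
    }
    Ok(data)
}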

We initially contacted the author on July 20, 2023, and the bug was fixed in commit 44c87c1 on August 22, 2023. The fixed version was released the next day as 0.4.4.

Case 5: Insertion of MMKV database encryption key into Android system log with react-native-mmkv

The last case concerns the disclosure of a plaintext encryption key in the react-native-mmkv library, which was fixed in September 2023. During a secure code review for a client, I discovered a commit that fixed an untracked vulnerability in a critical dependency. Because there was no security advisory or CVE ID, neither I nor the client had been informed about the vulnerability. This lack of vulnerability management created a situation in which attackers could learn about the vulnerability from the public fix while users were left in the dark.

During the client engagement, I wanted to validate how the encryption key was used and handled. The commit “fix: Don’t leak encryption key in logs” in the react-native-mmkv library caught my attention. The following code shows the problematic log statement:

MmkvHostObject::MmkvHostObject(const std::string& instanceId, std::string path,
                               std::string cryptKey) {
  __android_log_print(ANDROID_LOG_INFO, "RNMMKV",
                      "Creating MMKV instance \"%s\"... (Path: %s, Encryption-Key: %s)",
                      instanceId.c_str(), path.c_str(), cryptKey.c_str());
  std::string* pathPtr = path.size() > 0 ? &path : nullptr;
  {...omitted for brevity...}

Figure 5: Code that initializes MMKV and also logs the encryption key

Before that fix, the encryption key I was investigating was printed in plaintext to the Android system log. This breaks the threat model because this encryption key should not be extractable from the device, even with Android debugging features enabled.
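
The fix is conceptually simple: keep the log line but drop the key from it. A sketch of that kind of change (ours, not the library’s exact diff) looks like this:

__android_log_print(ANDROID_LOG_INFO, "RNMMKV",
                    "Creating MMKV instance \"%s\"... (Path: %s)",
                    instanceId.c_str(), path.c_str());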

With the client’s agreement, I notified the author of react-native-mmkv, and the author and I concluded that the library users should be informed about the vulnerability. So the author enabled private reporting and together we published a GitHub advisory. The ID CVE-2024-21668 was assigned to the bug. The advisory now alerts developers if they use a vulnerable version of react-native-mmkv when running npm audit or npm install.

This case highlights that there is basically no way around GitHub advisories when it comes to npm packages. The only way to get a vulnerability into the output of the npm audit command is to create a GitHub advisory. Using private reporting streamlines that process.

Takeaways

GitHub’s private reporting feature contributes to securing the software ecosystem. If used correctly, the feature saves time for vulnerability reporters and software maintainers. The biggest impact of private reporting is that it is linked to the GitHub advisory database—a link that is missing, for example, when using confidential issues in GitLab. With GitHub’s private reporting feature, there is now a process for security researchers to publish to that database (with the approval of the repository maintainers).

The disclosure process also becomes clearer with a private report on GitHub. When using email, it is unclear whether you should encrypt the email and who you should send it to. If you’ve ever encrypted an email, you know that there are endless pitfalls.

However, you may still want to send an email notification to developers or a security contact, as maintainers might miss GitHub notifications. A basic email with a link to the created advisory is usually enough to raise awareness.

Step 1: Add a security policy

Publishing a security policy is the first step towards owning a vulnerability reporting process. To avoid confusion, a good policy clearly defines what to do if you find a vulnerability.

GitHub has two ways to publish a security policy. Either you can create a SECURITY.md file in the repository root, or you can create a user- or organization-wide policy by creating a .github repository and putting a SECURITY.md file in its root.

We recommend starting with a policy generated using the Policymaker by disclose.io (see this example), but replace the Official Channels section with the following:

We have multiple channels for receiving reports:

* If you discover any security-related issues with a specific GitHub project, click the *Report a vulnerability* button on the *Security* tab in the relevant GitHub project: https://github.com/[YOUR_ORG]/[YOUR_PROJECT].
* Send an email to [email protected]

Always make sure to include at least two points of contact. If one fails, the reporter still has another option before falling back to messaging developers directly.

Step 2: Enable private reporting

Now that the security policy is set up, check out the referenced GitHub private reporting feature, a tool that allows discreet communication of vulnerabilities to maintainers so they can fix the issue before it’s publicly disclosed. It also notifies the broader community, such as npm, Crates.io, or Go users, about potential security issues in their dependencies.

Enabling and using the feature is easy and requires almost no maintenance. The key is to make sure you have configured your GitHub notifications correctly: reports are sent via email only if you have email notifications enabled. Private reporting is not enabled by default because it requires active monitoring of your GitHub notifications; otherwise, reports may not get the attention they require.

After configuring the notifications, go to the “Security” tab of your repository and click “Enable vulnerability reporting.”

Emails about reported vulnerabilities have the subject line “(org/repo) Summary (GHSA-0000-0000-0000).” If you use website notifications, you will see a similar notification there.

If you want to enable private reporting for your whole organization, then check out this documentation.

A benefit of using private reporting is that vulnerabilities are published in the GitHub advisory database (see the GitHub documentation for more information). If dependent repositories have Dependabot enabled, then their dependencies on your project can be updated automatically.

On top of that, GitHub can also automatically issue a CVE ID that can be used to reference the bug outside of GitHub.

This private reporting feature is still officially in beta on GitHub. We encountered minor issues like the lack of message templates and the inability of reporters to add collaborators. We reported the latter as a bug to GitHub, but they claimed that this was by design.

Step 3: Get notifications via webhooks

If you want notifications in a messaging platform of your choice, such as Slack, you can create a repository- or organization-wide webhook on GitHub and enable the repository advisories event type for it.

After creating the webhook, repository_advisory events will be sent to the set webhook URL. The event includes the summary and description of the reported vulnerability.

How to make security researchers happy

If you want to increase your chances of getting high-quality vulnerability reports from security researchers and are already using GitHub, then set up a security policy and enable private reporting. Simplifying the process of reporting security bugs is important for the security of your software. It also helps avoid researchers becoming annoyed and deciding not to report a bug or, even worse, deciding to turn the vulnerability into an exploit or release it as a 0-day.

If you use GitHub, this is your call to action to prioritize security, protect the public software ecosystem’s security, and foster a safer development environment for everyone by setting up a basic security policy and enabling private reporting.

If you’re not a GitHub user, similar features also exist on other issue-tracking systems, such as confidential issues in GitLab. However, not all systems have this option; for instance, Gitea is missing such a feature. The reason we focused on GitHub in this post is because the platform is in a unique position due to its advisory database, which feeds into, for example, the npm package repository. But regardless of which platform you use, make sure that you have a visible security policy and reliable channels set up.

Introducing Ruzzy, a coverage-guided Ruby fuzzer

29 March 2024 at 13:30

By Matt Schwager

Trail of Bits is excited to introduce Ruzzy, a coverage-guided fuzzer for pure Ruby code and Ruby C extensions. Fuzzing helps find bugs in software that processes untrusted input. In pure Ruby, these bugs may result in unexpected exceptions that could lead to denial of service, and in Ruby C extensions, they may result in memory corruption. Notably, the Ruby community has been missing a tool it can use to fuzz code for such bugs. We decided to fill that gap by building Ruzzy.

Ruzzy is heavily inspired by Google’s Atheris, a Python fuzzer. Like Atheris, Ruzzy uses libFuzzer for its coverage instrumentation and fuzzing engine. Ruzzy also supports AddressSanitizer and UndefinedBehaviorSanitizer when fuzzing C extensions.

This post will go over our motivation behind building Ruzzy, provide a brief overview of installing and running the tool, and discuss some of its interesting implementation details. Ruby revelers rejoice, Ruzzy* is here to reveal a new era of resilient Ruby repositories.

* If you’re curious, Ruzzy is simply a portmanteau of Ruby and fuzz, or fuzzer.

Bringing fuzz testing to Ruby

The Trail of Bits Testing Handbook provides the following definition of fuzzing:

Fuzzing represents a dynamic testing method that inputs malformed or unpredictable data to a system to detect security issues, bugs, or system failures. We consider it an essential tool to include in your testing suite.

Fuzzing is an important testing methodology when developing high-assurance software, even in Ruby. Consider AFL’s extensive trophy case, rust-fuzz’s trophy case, and OSS-Fuzz’s claim that it’s helped find and fix over 10,000 security vulnerabilities and 36,000 bugs with fuzzing. As mentioned previously, Python has Atheris. Java has Jazzer. The Ruby community deserves a high-quality, modern fuzzing tool too.

This isn’t to say that Ruby fuzzers haven’t been built before. They have: kisaten, afl-ruby, FuzzBert, and perhaps some we’ve missed. However, all these tools appear to be either unmaintained, difficult to use, lacking features, or all of the above. To address these challenges, Ruzzy is built on three principles:

  1. Fuzz pure Ruby code and Ruby C extensions
  2. Make fuzzing easy by providing a RubyGems installation process and simple interface
  3. Integrate with the extensive libFuzzer ecosystem

With that, let’s give this thing a test drive.

Installing and running Ruzzy

The Ruzzy repository is well documented, so this post will provide an abridged version of installing and running the tool. The goal here is to provide a quick overview of what using Ruzzy looks like. For more information, check out the repository.

First things first, Ruzzy requires a Linux environment and a recent version of Clang (we’ve tested back to version 14.0.0). Releases of Clang can be found on its GitHub releases page. If you’re on a Mac or Windows computer, you can use Docker Desktop as your Linux environment and then use Ruzzy’s Docker development environment to run the tool. With that out of the way, let’s get started.

Run the following command to install Ruzzy from RubyGems:

MAKE="make --environment-overrides V=1" \
CC="/path/to/clang" \
CXX="/path/to/clang++" \
LDSHARED="/path/to/clang -shared" \
LDSHAREDXX="/path/to/clang++ -shared" \
    gem install ruzzy

These environment variables ensure the tool is compiled and installed correctly. They will be explored in greater detail later in this post. Make sure to update the /path/to portions to point to your clang installation.

Fuzzing Ruby C extensions

To facilitate testing the tool, Ruzzy includes a “dummy” C extension with a heap-use-after-free bug. This section will demonstrate using Ruzzy to fuzz this vulnerable C extension.

First, we need to configure Ruzzy’s required sanitizer options:

export ASAN_OPTIONS="allocator_may_return_null=1:detect_leaks=0:use_sigaltstack=0"

(See the Ruzzy README for why these options are necessary in this context.)

Next, start fuzzing:

LD_PRELOAD=$(ruby -e 'require "ruzzy"; print Ruzzy::ASAN_PATH') \
    ruby -e 'require "ruzzy"; Ruzzy.dummy'

LD_PRELOAD is required for the same reason that Atheris requires it. That is, it uses a special shared object that provides access to libFuzzer’s sanitizers. Now that Ruzzy is fuzzing, it should quickly produce a crash like the following:

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2527961537
...
==45==ERROR: AddressSanitizer: heap-use-after-free on address 0x50c0009bab80 at pc 0xffff99ea1b44 bp 0xffffce8a67d0 sp 0xffffce8a67c8
...
SUMMARY: AddressSanitizer: heap-use-after-free /var/lib/gems/3.1.0/gems/ruzzy-0.7.0/ext/dummy/dummy.c:18:24 in _c_dummy_test_one_input
...
==45==ABORTING
MS: 4 EraseBytes-CopyPart-CopyPart-ChangeBit-; base unit: 410e5346bca8ee150ffd507311dd85789f2e171e
0x48,0x49,
HI
artifact_prefix='./'; Test unit written to ./crash-253420c1158bc6382093d409ce2e9cff5806e980
Base64: SEk=

Fuzzing pure Ruby code

Fuzzing pure Ruby code requires two Ruby scripts: a tracer script and a fuzzing harness. The tracer script is required due to an implementation detail of the Ruby interpreter. Every tracer script will look nearly identical. The only difference will be the name of the Ruby script you’re tracing.

First, the tracer script. Let’s call it test_tracer.rb:

require 'ruzzy'

Ruzzy.trace('test_harness.rb')

Next, the fuzzing harness. A fuzzing harness wraps a fuzzing target and passes it to the fuzzing engine. In this case, we have a simple fuzzing target that crashes when it receives the input “FUZZ.” It’s a contrived example, but it demonstrates Ruzzy’s ability to find inputs that maximize code coverage and produce crashes. Let’s call this harness test_harness.rb:

require 'ruzzy'

def fuzzing_target(input)
  if input.length == 4
    if input[0] == 'F'
      if input[1] == 'U'
        if input[2] == 'Z'
          if input[3] == 'Z'
            raise
          end
        end
      end
    end
  end
end

test_one_input = lambda do |data|
  fuzzing_target(data) # Your fuzzing target would go here
  return 0
end

Ruzzy.fuzz(test_one_input)

You can start the fuzzing process with the following command:

LD_PRELOAD=$(ruby -e 'require "ruzzy"; print Ruzzy::ASAN_PATH') \
    ruby test_tracer.rb

This should quickly produce a crash like the following:

INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2311041000
...
/app/ruzzy/bin/test_harness.rb:12:in `block in <main>': unhandled exception
    from /var/lib/gems/3.1.0/gems/ruzzy-0.7.0/lib/ruzzy.rb:15:in `c_fuzz'
    from /var/lib/gems/3.1.0/gems/ruzzy-0.7.0/lib/ruzzy.rb:15:in `fuzz'
    from /app/ruzzy/bin/test_harness.rb:35:in `<main>'
    from bin/test_tracer.rb:7:in `require_relative'
    from bin/test_tracer.rb:7:in `<main>'
...
SUMMARY: libFuzzer: fuzz target exited
MS: 1 CopyPart-; base unit: 24b4b428cf94c21616893d6f94b30398a49d27cc
0x46,0x55,0x5a,0x5a,
FUZZ
artifact_prefix='./'; Test unit written to ./crash-aea2e3923af219a8956f626558ef32f30a914ebc
Base64: RlVaWg==

Ruzzy used libFuzzer’s coverage-guided instrumentation to discover the input (“FUZZ”) that produces a crash. This is one of Ruzzy’s key contributions: coverage-guided support for pure Ruby code. We will discuss coverage support and more in the next section.

Interesting implementation details

You don’t need to understand this section to use Ruzzy, but fuzzing can often be more art than science, so we wanted to share some details to help demystify this dark art. We certainly learned a lot from the blog posts describing Atheris and Jazzer, so we figured we’d pay it forward. Of course, there are many interesting details that go into creating a tool like this but we’ll focus on three: creating a Ruby fuzzing harness, compiling Ruby C extensions with libFuzzer, and adding coverage support for pure Ruby code.

Creating a Ruby fuzzing harness

One of the first things you need when embarking on a fuzzing campaign is a fuzzing harness. The Trail of Bits Testing Handbook defines a fuzzing harness as follows:

A harness handles the test setup for a given target. The harness wraps the software and initializes it such that it is ready for executing test cases. A harness integrates a target into a testing environment.

When fuzzing Ruby code, naturally we want to write our fuzzing harness in Ruby, too. This speaks to goal number 2 from the beginning of this post: make fuzzing Ruby simple and easy. However, a problem arises when we consider that libFuzzer is written in C/C++. When using libFuzzer as a library, we need to pass a C function pointer to LLVMFuzzerRunDriver to initiate the fuzzing process. How can we pass arbitrary Ruby code to a C/C++ library?

Using a foreign function interface (FFI) like Ruby-FFI is one possibility. However, FFIs are generally used to go the other direction: calling C/C++ code from Ruby. Ruby C extensions seem like another possibility, but we still need to figure out a way to pass arbitrary Ruby code to a C extension. After much digging around in the Ruby C extension API, we discovered the rb_proc_call function. This function allowed us to use Ruby C extensions to bridge the gap between Ruby code and the libFuzzer C/C++ implementation.

In Ruby, a Proc is “an encapsulation of a block of code, which can be stored in a local variable, passed to a method or another Proc, and can be called. Proc is an essential concept in Ruby and a core of its functional programming features.” Perfect, this is exactly what we needed. In Ruby, all lambda functions are also Procs, so we can write fuzzing harnesses like the following:

require 'json'
require 'ruzzy'

json_target = lambda do |data|
  JSON.parse(data)
  return 0
end

Ruzzy.fuzz(json_target)

In this example, the json_target lambda function is passed to Ruzzy.fuzz. Behind the scenes Ruzzy uses two language features to bridge the gap between Ruby code and a C interface: Ruby Procs and C function pointers. First, Ruzzy calls LLVMFuzzerRunDriver with a function pointer. Then, every time that function pointer is invoked, it calls rb_proc_call to execute the Ruby target. This allows the C/C++ fuzzing engine to repeatedly call the Ruby target with fuzzed data. Considering the example above, since all lambda functions are Procs, this accomplishes the goal of calling arbitrary Ruby code from a C/C++ library.
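
In C extension terms, the bridge looks roughly like the following sketch (heavily simplified, with function and variable names of our own choosing):

#include <stddef.h>
#include <stdint.h>
#include <ruby.h>

// The Ruby Proc passed to Ruzzy.fuzz, stashed so the C callback can reach it.
static VALUE stored_proc;

// libFuzzer repeatedly invokes this callback with fuzzed bytes; we wrap the
// bytes in a Ruby string and forward them to the Proc via rb_proc_call.
static int c_test_one_input(const uint8_t *data, size_t size) {
    VALUE args = rb_ary_new3(1, rb_str_new((const char *)data, (long)size));
    return NUM2INT(rb_proc_call(stored_proc, args));
}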

As with all good, high-level overviews, this is an oversimplification of how Ruzzy works. You can see the exact implementation in cruzzy.c.

Compiling Ruby C extensions with libFuzzer

Before we proceed, it’s important to understand that there are two Ruby C extensions we are considering: the Ruzzy C extension that hooks into the libFuzzer fuzzing engine and the Ruby C extensions that become our fuzzing targets. The previous section discussed the Ruzzy C extension implementation. This section discusses Ruby C extension targets. These are third-party libraries that use Ruby C extensions that we’d like to fuzz.

To fuzz a Ruby C extension, we need a way to compile the extension with libFuzzer and its associated sanitizers. Compiling C/C++ code for fuzzing requires special compile-time flags, so we need a way to inject these flags into the C extension compilation process. Dynamically adding these flags is important because we’d like to install and fuzz Ruby gems without having to modify the underlying code.

The mkmf, or MakeMakefile, module is the primary interface for compiling Ruby C extensions. The gem install process calls a gem-specific Ruby script, typically named extconf.rb, which calls the mkmf module. The process looks roughly like this:

gem install -> extconf.rb -> mkmf -> Makefile -> gcc/clang/CC -> extension.so

Unfortunately, by default mkmf does not respect common C/C++ compilation environment variables like CC, CXX, and CFLAGS. However, we can force this behavior by setting the following environment variable: MAKE="make --environment-overrides". This tells make that environment variables override Makefile variables. With that, we can use the following command to install Ruby gems containing C extensions with the appropriate fuzzing flags:

MAKE="make --environment-overrides V=1" \
CC="/path/to/clang" \
CXX="/path/to/clang++" \
LDSHARED="/path/to/clang -shared" \
LDSHAREDXX="/path/to/clang++ -shared" \
CFLAGS="-fsanitize=address,fuzzer-no-link -fno-omit-frame-pointer -fno-common -fPIC -g" \
CXXFLAGS="-fsanitize=address,fuzzer-no-link -fno-omit-frame-pointer -fno-common -fPIC -g" \
    gem install msgpack

The gem we’re installing is msgpack, an example of a gem containing a C extension component. Since it deserializes binary data, it makes a great fuzzing target. From here, if we wanted to fuzz msgpack, we would create an msgpack fuzzing harness and initiate the fuzzing process.
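
Such a harness follows the same Proc-based pattern shown earlier. A sketch might look like the following (the rescue clause is illustrative; tune it to the exceptions you consider expected for your target):

require 'msgpack'
require 'ruzzy'

msgpack_target = lambda do |data|
  begin
    MessagePack.unpack(data)
  rescue MessagePack::MalformedFormatError
    # Malformed input is expected while fuzzing; we only care about crashes
    # and sanitizer reports from the C extension.
  end
  return 0
end

Ruzzy.fuzz(msgpack_target)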

If you’d like to find more fuzzing targets, searching GitHub for extconf.rb files is one of the best ways we’ve found to identify good C extension candidates.

Adding coverage support for pure Ruby code

Instead of Ruby C extensions, what if we want to fuzz pure Ruby code? That is, Ruby projects that do not contain a C extension component. If modifying install-time functionality via lengthy, not-officially-supported environment variables is a hacky solution, then what follows is not for the faint of heart. But, hey, a working solution with a little artistic freedom is better than no solution at all.

First, we need to cover the motivation for coverage support. Fuzzers derive some of their “smarts” from analyzing coverage information. This is a lot like code coverage information provided by unit and integration tests. While fuzzing, most fuzzers prioritize inputs that unlock new code branches. This increases the likelihood that they will find crashes and bugs. When fuzzing Ruby C extensions, Ruzzy can punt coverage instrumentation for C code to Clang. With pure Ruby code, we have no such luxury.

While implementing Ruzzy, we discovered one supremely useful piece of functionality: the Ruby Coverage module. The problem is that it cannot easily be called in real time by C extensions. If you recall, Ruzzy uses its own C extension to pass fuzz harness code to LLVMFuzzerRunDriver. To implement our pure Ruby coverage “smarts,” we need to pass in Ruby coverage information to libFuzzer in real time as the fuzzing engine executes. The Coverage module is great if you have a known start and stop point of execution, but not if you need to continuously gather coverage information and pass it to libFuzzer. However, we know the Coverage module must be implemented somehow, so we dug into the Ruby interpreter’s C implementation to learn more.

Enter Ruby event hooking. The TracePoint module is the official Ruby API for listening for certain types of events like calling a function, returning from a routine, executing a line of code, and many more. When these events fire, you can execute a callback function to handle the event however you’d like. So, this sounds great, and exactly like what we need. When we’re trying to track coverage information, what we’d really like to do is listen for branching events. This is what the Coverage module is doing, so we know it must exist under the hood somewhere.

Fortunately, the public Ruby C API provides access to this event hooking functionality via the rb_add_event_hook2 function. This function takes a list of events to hook and a callback function to execute whenever one of those events fires. By digging around in the source code a bit, we find that the list of possible events looks very similar to the list in the TracePoint module:

#define RUBY_EVENT_NONE      0x0000
#define RUBY_EVENT_LINE      0x0001
#define RUBY_EVENT_CLASS     0x0002
#define RUBY_EVENT_END       0x0004
...

Ruby event hook types

If you keep digging, you’ll notice a distinct lack of one type of event: coverage events. But why? The Coverage module appears to be handling these events. If you continue digging, you’ll find that there are in fact coverage events, and that is how the Coverage module works, but you don’t have access to them. They’re defined as part of a private, internal-only portion of the Ruby C API:

/* #define RUBY_EVENT_RESERVED_FOR_INTERNAL_USE 0x030000 */ /* from vm_core.h */
#define RUBY_EVENT_COVERAGE_LINE                0x010000
#define RUBY_EVENT_COVERAGE_BRANCH              0x020000

Private coverage event hook types

That’s the bad news. The good news is that we can define the RUBY_EVENT_COVERAGE_BRANCH event hook ourselves and set it to the correct, constant value in our code, and rb_add_event_hook2 will still respect it. So we can use Ruby’s built-in coverage tracking after all! We can feed this data into libFuzzer in real time and it will fuzz accordingly. Discussing how to feed this data into libFuzzer is beyond the scope of this post, but if you’d like to learn more, we use SanitizerCoverage’s inline 8-bit counters, PC-Table, and data flow tracing.

There’s just one more thing.

During our testing, even though we added the correct event hook, we still weren’t successfully hooking coverage events. The Coverage module must be doing something we’re not seeing. If we call Coverage.start(branches: true), per the Coverage documentation, then things work as expected. The details here involve a lot of sleuthing in the Ruby interpreter source code, so we’ll cut to the chase. As best we can tell, it appears that calling Coverage.start, which effectively calls Coverage.setup, initializes some global state in the Ruby interpreter that allows for hooking coverage events. This initialization functionality is also part of a private, internal-only API. The easiest solution we could come up with was calling Coverage.setup(branches: true) before we start fuzzing. With that, we began successfully hooking coverage events as expected.
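
Concretely, this means the fuzzing setup performs roughly the following steps before handing control to libFuzzer (a simplification of what ruzzy.rb actually does internally):

require 'coverage'

# Initialize the interpreter's internal coverage state so that branch
# coverage events fire for our registered event hook.
Coverage.setup(branches: true)

# From here, Ruzzy loads the harness and starts the libFuzzer driver.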

Having coverage events included in the standard library made our lives a lot easier. Without it, we may have had to resort to much more invasive and cumbersome solutions like modifying the Ruby code the interpreter sees in real time. However, it would have made our lives even easier if hooking coverage events were part of the official, public Ruby C API. We’re currently tracking this request at trailofbits/ruzzy#9.

Again, the information presented here is a slight oversimplification of the implementation details; if you’d like to learn more, then cruzzy.c and ruzzy.rb are great places to start.

Find more Ruby bugs with Ruzzy

We faced some interesting challenges while building this tool and attempted to hide much of the complexity behind a simple, easy to use interface. When using the tool, the implementation details should not become a hindrance or an annoyance. However, discussing them here in detail may spur the next fuzzer implementation or step forward in the fuzzing community. As mentioned previously, the Atheris and Jazzer posts were a great inspiration to us, so we figured we’d pay it forward.

Building the tool is just the beginning. The real value comes when we start using the tool to find bugs. Like Atheris for Python, and Jazzer for Java before it, Ruzzy is an attempt to bring a higher level of software assurance to the Ruby community. If you find a bug using Ruzzy, feel free to open a PR against our trophy case with a link to the issue.

If you’d like to read more about our work on fuzzing, check out the following posts:

Contact us if you’re interested in custom fuzzing for your project.

Why fuzzing over formal verification?

22 March 2024 at 13:00

By Tarun Bansal, Gustavo Grieco, and Josselin Feist

We recently introduced our new offering, invariant development as a service. A recurring question that we are asked is, “Why fuzzing instead of formal verification?” And the answer is, “It’s complicated.”

We use fuzzing for most of our audits but have used formal verification methods in the past. In particular, we found symbolic execution useful in audits such as Sai, Computable, and Balancer. However, we realized through experience that fuzzing tools produce similar results but require significantly less skill and time.

In this blog post, we will examine why the two principal arguments in favor of formal verification often fall short in practice: proving the absence of bugs is typically unattainable, and the bugs formal verification uncovers can usually also be found by a fuzzer.

Proving the absence of bugs

One of the key selling points of formal verification over fuzzing is its ability to prove the absence of bugs. To do that, formal verification tools use mathematical representations to check whether a given invariant holds for all input values and states of the system.

While such a claim can be attainable on a simple codebase, it’s not always achievable in practice, especially with complex codebases, for the following reasons:

  • The code may need to be rewritten to be amenable to formal verification. This leads to the verification of a pseudo-copy of the target instead of the target itself. For example, the Runtime Verification team verified the pseudocode of the deposit contract for the ETH2.0 upgrade, as mentioned in this excerpt from their blog post:

    Specifically, we first rigorously formalized the incremental Merkle tree algorithm. Then, we extracted a pseudocode implementation of the algorithm employed in the deposit contract, and formally proved the correctness of the pseudocode implementation.

  • Complex code may require a custom summary of some functionality to be analyzed. In these situations, the verification relies on the custom summary to be correct, which shifts the responsibility of correctness to that summary. To build such a summary, users might need to use an additional custom language, such as CVL, which increases the complexity.
  • Loops and recursion may require adding manual constraints (e.g., unrolling the loop for only a given amount of time) to help the prover. For example, the Certora prover might unroll some loops for a fixed number of iterations and report any additional iteration as a violation, forcing further involvement from the user.
  • The solver can time out. If the tool relies on a solver for equations, finding a solution in a reasonable time may not be possible. In particular, proving code with a high number of nonlinear arithmetic operations or updates to storage or memory is challenging. If the solver times out, no guarantee can be provided.

So while proving the absence of bugs is a benefit of formal verification methods in theory, it may not be the case in practice.

Finding bugs

When formally verifying the code is not possible, formal verification tools can still be used as bug finding tools. However, the question remains, “Can formal verification find real bugs that cannot be found by a fuzzer?” At this point, wouldn’t it just be easier to use a fuzzer?

To answer this question, we looked at two bugs found using formal verification in MakerDAO and Compound and then attempted to find these same bugs with only a fuzzer. Spoiler alert: we succeeded.

We selected these two bugs because they were widely advertised as having been discovered through formal verification, and they affected two popular protocols. To our surprise, it was difficult to find public issues discovered solely through formal verification, in contrast with the many bugs found by fuzzing (see our security reviews).

Our fuzzer found both bugs in a matter of minutes, running on a typical development laptop. The bugs we evaluated, as well as the formal verification and fuzz testing harnesses we used to discover them, are available on our GitHub page about fuzzing formally verified contracts to reproduce popular security issues.

Fundamental invariant of DAI

MakerDAO found a bug in its live code after four years. You can read more about the bug in When Invariants Aren’t: DAI’s Certora Surprise. Using the Certora prover, MakerDAO found that the fundamental invariant of DAI, which is that the sum of all collateral-backed debt and unbacked debt should equal the sum of all DAI balances, could be violated in a specific case. The core issue is that calling the init function when a vault’s rate state variable is zero and its Art state variable is nonzero changes the vault’s total debt, which violates the invariant relating the total debt to the total DAI supply. The MakerDAO team concluded that calling the init function after calling the fold function is a path to break the invariant.

function sumOfDebt() public view returns (uint256) {
    uint256 length = ilkIds.length;
    uint256 sum = 0;
    for (uint256 i=0; i < length; ++i){
        sum = sum + ilks[ilkIds[i]].Art * ilks[ilkIds[i]].rate;
    }
    return sum;
}

function echidna_fund_eq() public view returns (bool) {
    return debt == vice + sumOfDebt();
}

Figure 1: Fundamental equation of DAI invariant in Solidity

We implemented the same invariant in Solidity, as shown in figure 1, and checked it with Echidna. To our surprise, Echidna violated the invariant and found a unique path to trigger the violation. Our implementation is available in the Testvat.sol file of the repository. Implementing the invariant was easy because the source code under test was small and required only logic to compute the sum of all debts. Echidna took less than a minute on an i5 Linux machine with 12 GB of RAM to violate the invariant.
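
For reference, running such a harness boils down to a single Echidna invocation along the lines of the following (the contract name and any extra configuration are assumptions; check the repository for the exact command):

echidna Testvat.sol --contract TestVat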

Liquidation of collateralized account in Compound V3 Comet

The Certora team used their Certora Prover to identify an interesting issue in the Compound V3 Comet smart contracts that allowed a fully collateralized account to be liquidated. The root cause of this issue was using an 8-bit mask for a 16-bit vector. The mask remains zero for the higher bits in the vector, which skips assets while calculating total collateral and results in the liquidation of the collateralized account. More on this issue can be found in the Formal Verification Report of Compound V3 (Comet).
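
The following Solidity sketch is our own simplified reconstruction of the root cause, not Comet’s actual code:

// The mask is computed in a uint8, so for asset offsets 8 through 15 the shift
// wraps to zero and those assets are silently skipped.
function isInAsset(uint16 assetsIn, uint8 assetOffset) internal pure returns (bool) {
    return (assetsIn & (uint8(1) << assetOffset)) > 0;
}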

function echidna_used_collateral() public view returns (bool) {
    for (uint8 i = 0; i < assets.length; ++i) {
        address asset = assets[i].asset;
        uint256 userColl = sumUserCollateral(asset, true);
        uint256 totalColl = comet.getTotalCollateral(asset);
        if (userColl != totalColl) {
            return false;
        }
    }
    return true;
}

function echidna_total_collateral_per_asset() public view returns (bool) {
    for (uint8 i = 0; i < assets.length; ++i) {
        address asset = assets[i].asset;
        uint256 userColl = sumUserCollateral(asset, false);
        uint256 totalColl = comet.getTotalCollateral(asset);
        if (userColl != totalColl) {
            return false;
        }
    }
    return true;
}

Figure 2: Compound V3 Comet invariant in Solidity

Echidna discovered the issue with the implementation of the invariant in Solidity, as shown in figure 2. This implementation is available in the TestComet.sol file in the repository. Implementing the invariant was easy; it required limiting the number of users interacting with the test contract and adding a method to calculate the sum of all user collateral. Echidna broke the invariant within minutes by generating random transaction sequences to deposit collateral and checking invariants.

Is formal verification doomed?

Formal verification tools require a lot of domain-specific knowledge to be used effectively and require significant engineering efforts to apply. Grigore Rosu, Runtime Verification’s CEO, summarized it as follows:

Figure 3: A tweet from the founder of Runtime Verification Inc.

While formal verification tools are constantly improving, which reduces the engineering effort, none of the existing tools reach the ease of use of existing fuzzers. For example, the Certora Prover makes formal verification more accessible than ever, but it is still far less user-friendly than a fuzzer for complex codebases. With the rapid development of these tools, we hope for a future where formal verification tools become as accessible as other dynamic analysis tools.

So does that mean we should never use formal verification? Absolutely not. In some cases, formally verifying a contract can provide additional confidence, but these situations are rare and context-specific.

Consider formal verification for your code only if the following are true:

  • You are following an invariant-driven development approach.
  • You have already tested many invariants with fuzzing.
  • You have a good understanding of which remaining invariants and components would benefit from formal methods.
  • You have solved all the other issues that would decrease your code maturity.

Writing good invariants is the key

Over the years, we have observed that the quality of invariants is paramount. Writing good invariants is 80% of the work; the tool used to check/verify them is important but secondary. Therefore, we recommend starting with the easiest and most effective technique—fuzzing—and relying on formal verification methods only when appropriate.

If you’re eager to refine your approach to invariants and integrate them into your development process, contact us to leverage our expertise.

Streamline your static analysis triage with SARIF Explorer

20 March 2024 at 13:30

By Vasco Franco

Today, we’re releasing SARIF Explorer, the VSCode extension that we developed to streamline how we triage static analysis results. We make heavy use of static analysis tools during our audits, but the process of triaging them was always a pain. We designed SARIF Explorer to provide an intuitive UI inside VSCode, with features that make this process less painful:

  • Open multiple SARIF files: Triage all your results at once.
  • Browse results: Browse results by clicking on them to open their associated location in VSCode. You can also browse a result’s dataflow steps, if present.
  • Classify results: Add metadata to each result by classifying it as a “bug,” “false positive,” or “TODO” and adding a custom text comment. Keyboard shortcuts are supported.
  • Filter results: Filter results by keyword, path (to include or exclude), level (“error,” “warning,” “note,” or “none”), and status (“bug,” “false positive,” or “TODO”).
  • Open GitHub issues: Copy GitHub permalinks to locations associated with results and create GitHub issues directly from SARIF Explorer.
  • Send bugs to weAudit: Send all bugs to weAudit once you’ve finished triaging them and continue with the weAudit workflow.
  • Collaborate: Share the .sarifexplorer file with your colleagues (e.g., on GitHub) to share your comments and classified results.

You can install it through the VSCode marketplace and find its code in our vscode-sarif-explorer repo.

Why we built SARIF Explorer

Have you ever had to triage hundreds of static analysis results, many of which were likely to be false positives? At Trail of Bits, we extensively use static analysis tools such as Semgrep and CodeQL, sometimes with rules that produce many false positives, so this is an experience we’re all too familiar with. As security engineers, we use these low-precision rules because if there’s a bug we can detect automatically, we want to know about it, even if it means sieving through loads of false positive results.

Long ago, you would have found me triaging these results by painstakingly going over a text file or looking into a tiny terminal window. This was grueling work that I did not enjoy at all. You read the result’s description, you copy the path to the code, you go to that file, and you analyze the code. Then, you annotate your conclusions in some other text file, and you repeat.

A few years ago, we started using SARIF Viewer at Trail of Bits. This was a tremendous improvement, as it allowed us to browse a neat list of results organized by rule and click on each one to jump to the corresponding code. Still, it lacked several features that we wanted:

  • The ability to classify results as bugs or false positives directly in the UI
  • Better result filtering
  • The ability to export results as GitHub issues
  • Better integration with weAudit—our tool for bookmarking code regions, marking files as reviewed, and more (check out our recent blog post announcing the release of this tool!)

This is why we built SARIF Explorer!

SARIF Explorer was designed with user efficiency in mind, providing an intuitive interface so that users can easily access all of the features we built into it, as well as support for keyboard shortcuts to move through and classify results.

The SARIF Explorer static analysis workflow

But why did we want all these new features, and how do we use them? At Trail of Bits, we follow this workflow when using static analysis tools:

  1. Run all static analysis tools (configured to output SARIF files; see the example command after this list).
  2. Open SARIF Explorer and open all of the SARIF files generated in step 1.
  3. Filter out the noisy results.
    • Are there rules that you are not interested in seeing? Hide them!
    • Are there folders for which you don’t care about the results (e.g., the ./third_party folder)? Filter them out!
  4. Classify the results.
    • Determine if each result is a false positive or a bug.
    • Swipe left or right accordingly (i.e., click the left or right arrow).
    • Add additional context with a comment if necessary.
  5. Working with other team members? Share your progress by committing the .sarifexplorer file to GitHub.
  6. Send all results marked as bugs to weAudit and proceed with the weAudit workflow.
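
For step 1, most analysis tools can emit SARIF directly. For example, with Semgrep (an illustrative command; adjust the rules and output path to your project):

semgrep scan --config auto --sarif --output results.sarif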

SARIF Explorer features

Now, let’s take a closer look at the SARIF Explorer features that enable this workflow:

  • Open multiple SARIF files: You can open and browse the results of multiple SARIF files simultaneously. Use the “Sarif files” tab to browse the list of opened SARIF files and to close or reload any of them. If you open a SARIF file in your workspace, SARIF Explorer will also automatically open it.

  • Browse results: You can navigate to the locations of the results by clicking on them in the “Results” tab. The detailed view of the result, among other data, includes dataflow information, which you can navigate from source to sink (if available). In the GIF below, the user follows the XSS vulnerability from the source (an event message) to the sink (a DOM parser).

GIF showing how to browse results

  • Classify results: You can add metadata to each result by classifying it as a “bug,” “false positive,” or “TODO” and adding a custom text comment. You can use either the mouse or keyboard to do this:
    • Using the mouse: With a result selected, click one of the “bug,” “false positive,” or “TODO” buttons to classify it as such. These buttons appear next to the result and in the result’s detailed view.
    • Using the keyboard: With a result selected, press the right arrow key to classify it as a bug, the left arrow key to classify it as a false positive, and the backspace key to reset the classification to a TODO. This method is more efficient.

  • Filter results: You can filter results by keyword, path (to include or exclude), level (“error,” “warning,” “note,” or “none”), and status (“bug,” “false positive,” or “TODO”). You can also hide all results from a specific SARIF file or from a specific rule. For example, if you want to remove all results from the test and extensions folders and to see only results classified as TODOs, you should:
    • Set “Exclude Paths Containing” to “/test/, /extensions/”
    • Check the “Todo” box and uncheck the “Bug” and “False Positive” boxes in the “Status” section

  • Copy GitHub permalinks: You can copy a GitHub permalink to the location associated with a result. This requires having weAudit installed.

  • Create GitHub issues: You can create formatted GitHub issues for a specific result or for all unfiltered results under a given rule. This requires having weAudit installed.

  • Send bugs to weAudit: You can send all results classified as bugs to weAudit (results are automatically de-duplicated if you send them twice). This requires having weAudit installed.

  • Collaborate: You can share the .sarifexplorer file with your colleagues (e.g., on GitHub) to share your comments and classified results. The file is a prettified JSON file, which helps resolve conflicts if more than one person writes to the file in parallel.

You can find even more details about these features in our README.

Try it!

SARIF Explorer and weAudit greatly improved our efficiency when auditing code, and we hope it improves yours too.

Go try both of these tools out and let us know what you think! We welcome any bug reports, feature requests, and contributions in our vscode-sarif-explorer and vscode-weaudit repos.

If you’re interested in VSCode extension security, check out our “Escaping misconfigured VSCode extensions” and “Escaping well-configured VSCode extensions (for profit)” blog posts.

Contact us if you need help securing your VSCode extensions or any other application.

Read code like a pro with our weAudit VSCode extension

19 March 2024 at 13:30

By Filipe Casal

Today, we’re releasing weAudit, the collaborative code-reviewing tool that we use during our security audits. With weAudit, we review code more efficiently by taking notes and tracking bugs in a codebase directly inside VSCode, reducing our reliance on external tools, ensuring we never lose track of bugs we find, and enabling us to share that information with teammates.

We designed weAudit with features that are crucial to our auditing process:

  • Bookmarks for findings and notes: Bookmark code regions to identify findings or add audit notes.
  • Tracking of audited files: Mark entire files as reviewed.
  • Collaboration: View and share findings with multiple users.
  • Creation of GitHub issues: Fill in detailed information about a finding and create a preformatted GitHub issue right from weAudit.

You can install it through the VSCode marketplace and find its code in our vscode-weaudit repo.

Why we built weAudit

When we review complex codebases, we often compile detailed notes about both the high-level structure and specific low-level implementation details to share with our project team. For high-level notes, standard document sharing tools more than suffice. But those tools are not ideal for sharing low-level, code-specific notes. For those, we need a tool that allows us to share notes that are more tightly coupled with the codebase itself, almost like using post-it notes to navigate through a complex book. Specifically, we need a tool that allows us to do the following:

  • Quickly navigate through areas of interest in the codebase
  • Visually highlight significant areas of the code
  • Add audit notes to certain parts of the codebase

For some time, I used a very simple extension for VSCode called “Bookmarks”, which allowed me to add basic notes to lines of code. However, I was never satisfied with this extension, as it was missing crucial features:

  • The highlighted code did not display the notes I had written next to the code.
  • I had no way of sharing code coverage information with my client or fellow engineers auditing the codebase.
  • I had no way of sharing my notes and bookmarks. During an audit with a team of engineers, I need to be able to share these things with my team so that my knowledge is their knowledge, and vice versa.

All of us engineers at Trail of Bits agreed that we needed a better tool for this purpose. We realized that if we wanted an extension tailored to our needs, we would need to create it. That is why we built weAudit.

weAudit’s main features

The features we built into weAudit streamline our process of bookmarking, annotating, and tracking code files under audit, sharing our notes, and creating GitHub issues for findings we discover.

Bookmarks

The extension supports two types of bookmarks: findings, which represent buggy or suspicious regions of code, and notes, which represent personal annotations about the code.

You can add findings and notes to the current code snippet selection by running the corresponding VSCode commands or using the keyboard shortcuts:

  • “weAudit: New Finding from Selection” (shortcut: Cmd + J)
  • “weAudit: New Note from Selection” (shortcut: Cmd + K)

These commands will highlight the code in the editor and create a new bookmark in the “List of Findings” view in the sidebar.

By clicking on an item in the “List of Findings” view, you can navigate to the corresponding region of code.

Files with a finding will have a “!” annotation next to the file name in both the file tree of VSCode’s default “Explorer” view and in the tab above the editor, making it immediately clear which files have findings.

The highlight colors can be customized in the extension settings.

Tracking audited files

After reviewing a file, you can mark it as audited by running the “weAudit: Mark File as Reviewed” command or its keyboard shortcut, Cmd + 7. The whole file will be highlighted, and the file name in both the file tree and the tab above the editor will be annotated with a ✓.

The highlight color can be customized in the extension settings.

Daily log

Have you ever had trouble remembering which files you reviewed the previous week? Or do you just really like meaningless statistics such as the number of lines of code you read in a single day? You can see these stats by showing the daily log, accessible from the “List of Findings” panel.

You can also view the daily log by running the “weAudit: Show Daily Log” command in the command palette.

Collaboration with multiple users

You can share weAudit files (located in the .vscode folder) with your co-auditors to share findings and notes about the code. In the “weAudit Files” panel, you can toggle to show or hide the findings from each user by clicking on each entry. The colors for other users’ findings and notes and for your own findings and notes are customizable in the extension settings.

Detailed findings

You can fill in detailed information about a finding by clicking on it in the “List of Findings” view in the sidebar, where you can add all the information we include in our audit reports: title, severity, difficulty, description, exploit scenario, and recommendations for resolving the issue.

This information is then used to prefill a template, allowing you to quickly open a GitHub issue with all of the relevant details for the finding.

You can find more details and information about other features in our README.

Try it out for yourself!

If you use VSCode to navigate through large codebases, we invite you to try weAudit—even if you are not looking for bugs—and let us know what you think!

We welcome any bug reports, feature requests, and contributions in our vscode-weaudit repo.

If you’re interested in VSCode extension security, check out our “Escaping misconfigured VSCode extensions” and “Escaping well-configured VSCode extensions (for profit)” blog posts.

Contact us if you need help securing your VSCode extensions or any other application.

Releasing the Attacknet: A new tool for finding bugs in blockchain nodes using chaos testing

18 March 2024 at 13:00

By Benjamin Samuels (@thebensams)

Today, Trail of Bits is publishing Attacknet, a new tool that addresses the limitations of traditional runtime verification tools, built in collaboration with the Ethereum Foundation. Attacknet is intended to augment the EF’s current test methods by subjecting their execution and consensus clients to some of the most challenging network conditions imaginable.

Blockchain nodes must be held to the highest level of security assurance possible. Historically, the primary tools used to achieve this goal have been exhaustive specification, tests, client diversity, manual audits, and testnets. While these tools have traditionally done their job well, they collectively have serious limitations that can lead to critical bugs manifesting in a production environment, such as the May 2023 finality incident that occurred on Ethereum mainnet. Attacknet addresses these limitations by subjecting devnets to a much wider range of network conditions and misconfigurations than is possible on a conventional testnet.

How Attacknet works

Attacknet uses chaos engineering, a testing methodology that proactively injects faults into a production environment to verify that the system is tolerant to certain failures. These faults reproduce real-world problem scenarios and misconfigurations, and can be used to create exaggerated scenarios to test the boundary conditions of the blockchain.

Attacknet uses Chaos Mesh to inject faults into a devnet environment generated by Kurtosis. By building on top of Kurtosis and Chaos Mesh, Attacknet can create various network topologies with ensembles of different kinds of faults to push a blockchain network to its most extreme edge cases.

Some of the faults include:

  • Clock skew, where a node’s clock is skewed forwards or backwards for a specific duration. Trail of Bits was able to reproduce the Ethereum finality incident using a clock skew fault, as detailed in our TrustX talk last year.
  • Network latency, where a node’s connection to the network (or its corresponding EL/CL client) is delayed by a certain amount of time. This fault can help reproduce global latency conditions or help detect unintentional synchronicity assumptions in the blockchain’s consensus.
  • Network partition, where the network is split into two or more segments that cannot communicate with each other. This fault can test the network’s fork choice rule, ability to re-org, and other edge cases.
  • Network packet drop/corruption, where gossip packets are dropped or have their contents corrupted by a certain amount. This fault can test a node’s gossip validation and test the robustness of the network under hostile network conditions.
  • Forced node crashes/offlining, where a certain client or type of client is ungracefully shut down. This fault can test the network’s resilience to validator inactivity, and test the ability of clients to re-sync to the network.
  • I/O disk faults/latency, where a certain amount of latency or error rate is applied to all I/O operations a node makes. This fault can help profile nodes to understand their resource requirements, as I/O is often the largest limiting factor of node performance.

Once the fault concludes, Attacknet performs a battery of health checks against each node in the network to verify that they were able to recover from the fault. If all nodes recover from the fault, Attacknet moves on to the next configured fault. If one or more nodes fail health checks, Attacknet will generate an artifact of logs and test information to allow debugging.

Future work

In this first release, Attacknet supports two run modes: one with a manually configured network topology and fault parameters, and a “planner mode” where a range of faults are run against a specific client with loosely defined topology parameters. In the future, we plan on adding an “Exploration mode” that will dynamically define fault parameters, inject them, and monitor network health repeatedly, similar to a fuzzer.

Attacknet is currently being used to test the Dencun hard fork, and is being regularly updated to improve coverage, performance, and debugging UX. However, Attacknet is not an Ethereum-specific tool, and was designed to be modular and easily extended to support other types of chains with drastically different designs and topologies. In the future, we plan on extending Attacknet to target other chains, including other types of blockchain systems such as L2s.

If you’re interested in integrating Attacknet with your chain/L2’s testing process, please contact us.

Secure your blockchain project from the start

13 March 2024 at 13:00

Systemic security issues in blockchain projects often appear early in development. Without an initial focus on security, projects may choose flawed architectures or make insecure design or development choices that result in hard-to-maintain or vulnerable solutions. Traditional security reviews can be used to identify some security issues, but by the time they are complete, it may be too late to fix some of the issues that could have been addressed at the design and development stages.

To help clients identify and address potential security issues earlier in the project, Trail of Bits is rolling out a new service: Early Stage Security Review. The service, already requested by many of our clients, is ideal for early-stage projects seeking feedback, where code, documentation, testing, and technical solutions are still evolving. As part of the service, Trail of Bits engineers will perform a thorough review of a project, including:

  • Architectural components review
  • Risk mitigation analysis
  • Identification of gaps in security practices
  • Code maturity evaluation
  • Tailored design recommendations
  • Lightweight code review of critical project areas
  • Actionable advice, recommendations, and next steps to improve the project’s security

Fix potential issues before they become real problems

An early-stage security review provides an all-encompassing assessment of your project’s design and structure, guiding developers and informing security decisions throughout the project’s lifecycle. We leverage years of code review experience accumulated across various domains—including smart contracts, bridges, decentralized finance, and gaming applications—to guide your project’s development with security as a primary focus. We’ll also apply our deep expertise in blockchain nodes (L1 and L2), especially those based on geth.

Our early-stage review of your project will focus on identifying areas of improvement that will include:

  • Architectural components review. We will assess architectural choices for risks, review access controls for proper privilege separation, propose changes to simplify code complexity, ensure the advertised degree of decentralization is accurate, recommend on-chain/off-chain logic separation, and evaluate the upgradeability process, including migration and pausable mechanisms.
  • Risk mitigation analysis. We will identify existing risks and suggest mitigations, ensuring that MEV and Oracle risks are considered. We will assess the protocol’s reliance on blockchain risks (e.g., reorgs). We will examine the handling of common ERCs, and evaluate third-party component integration risks.
  • Identification of gaps in security practices. We will pinpoint security practice gaps, including issues identified in documentation, and assess whether the project’s testing is sufficient for the long-term health of the project. We will evaluate the monitoring plan, and recommend improvements in automated security tool usage.
  • Code maturity evaluation. Through our reviews, we will evaluate the maturity of the protocol and offer actionable security improvement recommendations.
  • Tailored design recommendations. We will adapt our review based on the project’s unique needs and requirements and provide recommendations tailored toward the protocol business logic.
  • Lightweight code review of critical project areas. We will review the code to understand and assess the technical solution for potential security issues or concerns. However, we won’t look for in-depth vulnerabilities during an early-stage review, as the code review is intended to identify surface-level bugs.

Clients using our Early Stage Security Review will get preferential scheduling and pricing for blockchain and other Trail of Bits services. Insights from the initial review will also help reduce the effort required for a comprehensive review once substantial development is complete.

Get ahead of security issues

The early-stage security review service will enable you to:

  • Set a strong security foundation. Early feedback sets your solutions on a path to success, minimizing potential security oversights.
  • Receive expert recommendations earlier. Tailored guidance for your unique codebase empowers you to make informed decisions and enhance your protocol’s security.
  • Reduce cost by preventing late refactoring. A proactive security approach from inception avoids costly late-stage refactoring and streamlines the development cycle.

Don’t wait until your project is code complete to prioritize security. Contact us to take advantage of our experience to help you secure your project from the start.

DARPA awards $1 million to Trail of Bits for AI Cyber Challenge

11 March 2024 at 17:46

By Michael D. Brown

We’re excited to share that Trail of Bits has been selected as one of the seven teams to participate in the small business track for DARPA’s AI Cyber Challenge (AIxCC). Our team will receive a $1 million award to create a Cyber Reasoning System (CRS) and compete in the AIxCC Semifinal Competition later this summer. This recognition highlights our dedication to advancing cybersecurity and marks a significant milestone in our commitment to pushing the boundaries of what’s possible: a future where cybersecurity challenges are met with innovative, AI-powered solutions.

It’s official: Trail of Bits was selected as one of the seven exclusive teams for the AIxCC small business track.

As we move beyond the initial phase of the competition, we’re eager to offer a sneak peek into the driving forces behind our approach, without spilling all of our secrets, of course. In a field where competitors often hold their cards close to their chests, we at Trail of Bits believe in the value of openness and sharing. Our motivation stems from more than just the desire to compete; it’s about contributing to a broader understanding and development within the cybersecurity community. While we navigate through this challenge with an eye on victory, our aim is also to foster a culture of transparency and collaboration, aligning with our deep-rooted open-source ethos.

For background on the challenge, see our two previous posts on the AIxCC.

*** Disclaimer: Information about AIxCC’s rules, structure, and events referenced in this document are subject to change. This post is NOT an authoritative document. Please refer to DARPA’s website and official documents for first-hand information. ***

Congrats to the 7 companies that will receive $1 million each to develop AI-enabled cyber reasoning systems that automatically find and fix software vulnerabilities as part of the #AIxCC Small Business Track! Full announcement: https://t.co/SC6yEFsooy

— DARPA (@DARPA) March 11, 2024

The guiding principles for building our CRS

In addition to competing in the AIxCC’s spiritual predecessor, the Cyber Grand Challenge (CGC), our team at Trail of Bits has been working to apply AI/ML techniques to critical cybersecurity problems for many years. These experiences have heavily influenced our approach to the AIxCC. While we’ll be waiting until later in the competition to share specific details, we would like to share the guiding principles for building our AI/ML-driven CRS that have come from this work:

CRS architecture is key to achieving scalability, resiliency, and versatility

DARPA’s CGC, like the AIxCC, tasked competitors with developing CRSs that find vulnerabilities at scale (i.e., that scan many challenge programs in a limited period of time) without any human intervention. The CRS Trail of Bits created to compete in the CGC, Cyberdyne, addressed these problems with a distributed system architecture. Cyberdyne provisioned many independent nodes, each capable of performing key tasks such as fuzzing and symbolic execution. Each node was tasked with one or more challenge problems, and could even cooperate with other nodes on the same challenge.

This design had several advantages. First, the CRS maximized coverage of the 131 challenges via parallel processing. This allowed the CRS to both achieve the scale needed to succeed in the competition and avoid being bogged down with particularly challenging problems. Second, the CRS was resilient to localized failures. If nodes experienced a catastrophic error while analyzing a challenge problem, the operation of other independent nodes was not affected, limiting the damage to the CRS’s overall score. The care taken in this design paid off in the competition: Cyberdyne ranked second among all CRSs in terms of the total number of verified bugs found!

The format of the AIxCC bears a strong resemblance to that of the CGC, so the CRS we build for the AIxCC will also need to be scalable and resilient to failures. However, the AIxCC has an additional wrinkle—challenge diversity. The AIxCC’s challenge problem set will include programs written in languages other than C/C++, including many interpreted languages such as Java and Python. This will require a successful CRS to be highly versatile. Fortunately, the distributed architecture used in Cyberdyne can be adapted for the AIxCC to address versatility in a manner similar to scalability and resiliency. The key difference is that problem-solving nodes used for AIxCC challenges will need to be specialized for different types of challenge problems.

AI/ML is best for complementing conventional techniques, not replacing them

I, along with my co-authors from Georgia Tech, recently presented work at the USENIX Security Symposium on an ML-based static analysis tool we built called VulChecker. VulChecker uses graph-based ML algorithms to locate and classify vulnerabilities in program source code. We evaluated VulChecker against a commercial static analysis tool and found that VulChecker outperformed the commercial tool at detecting certain vulnerability types that rule-based tools typically struggle with, such as integer overflow/underflow vulnerabilities. However, for vulnerabilities that are amenable to rule-based checks (e.g., stack buffer overflow vulnerabilities), VulChecker was effective but did not outperform conventional static analysis.

Considering that rule-based checks are generally less costly to implement than ML models, it doesn’t make sense to replace conventional analysis entirely with AI/ML. Rather, AI/ML is best suited to complement conventional approaches by addressing the problem instances that they struggle with. In the context of the AIxCC, our experience suggests that an AI/ML-only approach is a losing proposition due to high compute costs and the effect of compounding false positives, inaccuracies, and/or confabulations at each step. With that in mind, we plan to use AI/ML in our CRS only where it is best suited or where no conventional options exist. For now, we are planning to use AI/ML approaches primarily for vulnerability detection/classification, patch generation, and input generation tasks in our CRS.

Use the right AI/ML models for the job!

LLMs have been demonstrated to have many emergent capabilities due to the sheer size of their training sets. Among the tasks a CRS must complete in the AIxCC that are suitable for AI/ML, several are tailor-made for LLMs, such as generating code snippets and seed inputs for fuzzing. However, based on our past research, we’ve found that LLMs may not actually be the best option for such tasks.

Last fall, our team supported the United Kingdom’s Frontier AI Taskforce’s efforts to evaluate the risks posed by frontier AI models. We created a framework for rigorously assessing the offensive cyber capabilities of LLMs, which allowed us to 1) rate the model’s independent capabilities relative to human skill levels (i.e., novice, intermediate, expert) and 2) rate the model’s ability to upskill a novice or intermediate human operator. We used this framework to assess different LLMs’ abilities to handle several distinct tasks, including those highly relevant to AIxCC (e.g., vulnerability discovery and contextualization).

We found that LLMs could perform only as well as experts or significantly upskill novices for tasks that were reducible to natural language processing, such as writing phishing emails and conducting misinformation campaigns. For other cyber tasks (including those relevant to the AIxCC) such as creating malicious software, finding vulnerabilities in source code, and creating exploits, current-generation LLMs had novice-like capabilities and could only marginally upskill novice users. These results speak to the lack of reasoning and planning capabilities in LLMs, which has been well documented.

Because LLMs will struggle greatly with reasoning-intensive tasks, such as identifying novel instances of vulnerabilities in source code or classifying vulnerabilities, we’ll avoid using them for those tasks in our CRS; other types of AI/ML models with narrower scopes are a better option. Expecting LLMs to perform well on these tasks risks high levels of inaccuracy or false positives that can derail later tasks (e.g., generating patches).

What’s next?

Next month, DARPA will hold its AIxCC kickoff event where we should learn more about the infrastructure DARPA will provide for the competition. Once released, we expect this information will allow us (and other competing teams) to make more concrete progress toward building our CRS.

Out of the kernel, into the tokens

8 March 2024 at 14:00

By Max Ammann and Emilio López

We’re digging up the archives of vulnerabilities that Trail of Bits has reported over the years. This post shares the story of two such issues: a denial-of-service (DoS) vulnerability hidden in JSON Web Tokens (JWTs), and an oversight in the Linux kernel that could enable circumvention of critical kernel security mechanisms (KASLR).

Unraveling a DoS vulnerability in JOSE libraries

JWT and JSON Object Signing and Encoding (JOSE) are expansive standards that describe the creation and use of encrypted and/or signed JSON-based tokens. While these standards are widely used and represent a significant improvement over previous solutions for identity claims, they are not without drawbacks, and have several well-known footguns, like the JWT “none” signature algorithm.

Our finding concerns an attack that was part of a lineup of new JWT attacks presented by Tom Tervoort at Black Hat USA 2023: “Three New Attacks Against JSON Web Tokens.” The “billion hashes attack,” which results in a denial of service due to a lack of validation in JWT key encryption, caught our colleague Matt Schwager’s attention. Upon further examination, he discovered that it applied to several more libraries in the Go and Rust ecosystems: go-jose, jose2go, square/go-jose, and josekit-rs.

These libraries all support key encryption with PBES2, a feature meant to allow for password-based encryption of the Content Encryption Key (CEK) in JSON Web Encryption (JWE). A key is first derived from a password using a PBES2 scheme, which executes a number of PBKDF2 iterations. That derived key is then used to encrypt (and later decrypt) the CEK, which in turn protects the token contents.

This wouldn’t normally be an issue, but unfortunately, the number of iterations is carried in the token itself, in the p2c header parameter, which an attacker can easily manipulate. Consider, for example, the token header shown below:

Figure 1: A JWE token header indicating PBES2 key encryption with a large number of iterations
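
As an illustrative reconstruction of such a header (the alg, enc, and p2s values below are our own assumptions; only the p2c value matches the attack described next), the decoded protected header could look like this, built here in Python:

import base64
import json

# Hypothetical JWE protected header requesting PBES2 key wrapping with an
# attacker-chosen PBKDF2 iteration count. The alg/enc/p2s values are
# illustrative; p2c is the attacker-controlled iteration count.
header = {
    "alg": "PBES2-HS512+A256KW",
    "enc": "A256GCM",
    "p2s": base64.urlsafe_b64encode(b"attacker-chosen-salt").rstrip(b"=").decode(),
    "p2c": 2147483647,
}

print(json.dumps(header, indent=2))
# The base64url-encoded form is what actually appears in the token.
print(base64.urlsafe_b64encode(json.dumps(header).encode()).rstrip(b"=").decode())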

By using a very large iteration count in the p2c field, an attacker can cause a DoS on any application that attempts to process this token. Whoever receives and attempts to verify this token will first need to perform 2,147,483,647 PBKDF2 iterations to derive the CEK before they can even verify if the token is valid, costing significant amounts of compute time.
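
To get a rough sense of that cost (a back-of-the-envelope benchmark of our own, not a figure from the original research; timings vary by machine), you can time a modest PBKDF2 run and extrapolate linearly to the attacker-supplied count:

import hashlib
import time

# Time a modest PBKDF2-HMAC-SHA512 run and extrapolate to the 2,147,483,647
# iterations an attacker can request via p2c.
baseline_iterations = 100_000
start = time.perf_counter()
hashlib.pbkdf2_hmac("sha512", b"password", b"salt", baseline_iterations)
elapsed = time.perf_counter() - start

attacker_p2c = 2_147_483_647
estimate_seconds = elapsed * attacker_p2c / baseline_iterations
print(f"{baseline_iterations} iterations took {elapsed:.3f}s")
print(f"estimated cost at p2c={attacker_p2c}: ~{estimate_seconds / 60:.0f} minutes per token")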

We reported the issue to the go-jose, jose2go, and josekit-rs library maintainers, and it has been fixed by limiting the maximum value usable for p2c in go-jose/go-jose on version 3.0.1 (commit 65351c27657d); on dvsekhvalnov/jose2go on version 1.6.0 (commits a4584e9dd712 and 8e9e0d1c6b39); and on hidekatsu-izuno/josekit-rs on version 0.8.5 (commits 1f3278a33f0e, 8b60bd0ea8ce, and 7e448ce66c1c). square/go-jose remains unfixed, as the library is deprecated, and users are encouraged to migrate to go-jose/go-jose.

Alternatively, the risk can be mitigated by not blindly trusting the token’s alg parameter: if your application does not expect to receive tokens that use PBES2 or any other lesser-used algorithm, there is no reason to try to process one. jose2go already allows opt-in stricter validation of the alg and enc parameters, and go-jose’s next major version will require passing a list of acceptable algorithms when processing a token, allowing developers to explicitly list the algorithms they expect.
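
As a minimal sketch of these defenses (the constants and helper below are hypothetical, not the actual API of go-jose, jose2go, or josekit-rs), an application can reject unexpected algorithms and oversized p2c values before deriving any key:

# Hypothetical pre-validation of a decoded JWE protected header: allow only the
# algorithms the application expects, and cap the PBKDF2 iteration count.
ALLOWED_ALGS = {"PBES2-HS512+A256KW"}  # whatever your application actually expects
MAX_PBES2_ITERATIONS = 100_000         # example cap; choose a value that fits your use case

def validate_jwe_header(header: dict) -> None:
    alg = header.get("alg", "")
    if alg not in ALLOWED_ALGS:
        raise ValueError(f"unexpected key management algorithm: {alg}")
    if alg.startswith("PBES2") and header.get("p2c", 0) > MAX_PBES2_ITERATIONS:
        raise ValueError("p2c exceeds the allowed PBKDF2 iteration count")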

KASLR bypass in privilege-less containers

Next is a vulnerability that has been fixed since 2020 but never got a CVE assigned by the Linux kernel maintainers. In the following paragraphs, we’ll go into the details of a previously undisclosed but since-fixed KASLR bypass.

Back in 2020, Trail of Bits engineer Dominik Czarnota (aka disconnect3d) discovered a vulnerability in the Linux kernel that could expose internal pointer addresses within unprivileged Docker containers, allowing a malicious actor to bypass Kernel Address Space Layout Randomization (KASLR) for kernel modules.

KASLR is an important operating system defense that deters exploitation by randomizing kernel memory addresses on every boot. For the mitigation to be meaningful, kernel addresses must also be hidden from userspace; otherwise, a single address disclosure effectively bypasses KASLR.

While there are places where kernel addresses are shown to userspace programs, on many systems they should be available only when the user has the CAP_SYSLOG Linux capability. (Capabilities split root user privileges so it is possible to be the root user, or a user with uid 0, while having a limited set of privileges.) In particular, the manual page for the CAP_SYSLOG capability reads: “View kernel addresses exposed via /proc and other interfaces when /proc/sys/kernel/kptr_restrict has the value 1.” This means that only processes that are executed with the capability CAP_SYSLOG should be able to read kernel addresses.

However, Dominik discovered that this was not the case inside Docker containers, where processes running as root without CAP_SYSLOG were still able to observe kernel addresses. By default, Docker containers are unprivileged, which means that root users are restricted in what they can do (e.g., they cannot perform actions that require CAP_SYSLOG). The same behavior can be demonstrated without Docker by using the capsh tool as root to drop the CAP_SYSLOG capability and then checking whether kernel addresses are still exposed.
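
As a rough stand-in for a terminal demonstration (a minimal sketch that assumes a Linux host; the capsh invocation in the comment is illustrative, not the original command), the following checks whether module load addresses are exposed:

# Compare kptr_restrict with what /proc/modules actually exposes. Run as root
# with CAP_SYSLOG dropped, for example:
#   capsh --drop=cap_syslog -- -c 'python3 check_kptr.py'
# With the kernel fix in place, the load addresses should print as
# 0x0000000000000000 when kptr_restrict is 1 and CAP_SYSLOG is absent.
from pathlib import Path

restrict = Path("/proc/sys/kernel/kptr_restrict").read_text().strip()
print(f"kptr_restrict = {restrict}")

for line in Path("/proc/modules").read_text().splitlines()[:3]:
    name, *_rest, load_address = line.split()
    print(f"{name}: {load_address}")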

The underlying cause of the issue was that the credentials were checked incorrectly. The sysctl toggle kernel.kptr_restrict indicates whether restrictions are placed on exposing kernel addresses: the value “2” means that the addresses are always hidden; “1” means that they are shown only if the user has CAP_SYSLOG; and “0” means that they are always shown. Instead of ensuring that the user had the CAP_SYSLOG capability before showing the addresses, only the value of kptr_restrict was being considered to decide whether to show or hide the addresses. The addresses were always exposed if kptr_restrict was 1, while they should have been hidden if the user did not have CAP_SYSLOG. The issue was fixed in commit b25a7c5af905.

After discovering this vulnerability, we followed a coordinated disclosure process with Docker and the Linux kernel security team. Dominik initially notified the Docker team, since he thought the vulnerability originated in Docker, and also reported other sysfs filesystem leaks (where other sysfs paths leaked information such as the names of services running outside the container, other container IDs, and information about devices). The disclosure timeline is provided at the end of this post.

Although we received only silence from Docker despite multiple requests for updates, the Linux community swiftly fixed the issue in the kernel. The KASLR bypass fix was backported to various Ubuntu LTS versions, while the other sysfs leaks in Docker were not fixed at all. However, Linux kernel releases before 4.19 remain vulnerable to the KASLR bypass; Ubuntu 18.04, which uses kernel 4.15, is still vulnerable because the fix was not backported.

Disclosure timeline for KASLR bypass in privilege-less containers

  • June 6, 2020: Reported the vulnerability to Docker.
  • June 11, 2020: Docker replied that they would probably block the sysfs paths that leak information via the “masked paths” feature, and that the memory address disclosure should be reported to the Linux kernel developers.
  • June 11, 2020: Informed Docker of our intent to contact the Linux kernel security team (security@kernel.org) about the KASLR bypass.
  • June 11 to June 18, 2020: Performed a deeper analysis of the KASLR bypass.
  • June 18, 2020: Reported the bug to security@kernel.org.
  • June 18, 2020: Bug confirmed by Kees Cook.
  • June 19 to June 21, 2020: Kernel developers discussed how to patch the issue.
  • June 30, 2020: Requested an update from Docker.
  • July 3 to July 14, 2020: Patches that fix the issue land in the Linux kernel.
  • July 11, 2020: Requested an update from Docker again about the other sysfs leaks, and informed them that the KASLR bypass had been fixed in the Linux 4.19, 5.4, and 5.7 kernels.
  • December 3, 2020: Requested an update from Docker once again and informed them of our intent to disclose the issues publicly. Docker did not reply.

Do you need audits in 2024?

These two vulnerabilities are quite different: The DoS issue relates to parsing and interpreting user input, while the kernel vulnerability is an information leak (strictly speaking, it is an access control vulnerability). These differences affect the detectability of bugs: if you cause a DoS, you’ll likely notice right away because the availability of your service will be compromised. By contrast, if an attacker exploits an access control vulnerability, you probably won’t notice when your service is exploited.

This difference in detectability is important for automated testing. For instance, fuzzing, as showcased in the Trail of Bits Testing Handbook, typically requires the program to crash or hang. Therefore, we mostly find DoS bugs in the memory-safe programs we fuzz. Automatically finding access control bugs through fuzzing is more challenging because it requires the implementation of fuzzing invariants.
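
As a toy illustration of what such an invariant looks like (entirely our own example, unrelated to either disclosure above), the harness below asserts a property so that a silent access control bug becomes a crash that a fuzzer can detect:

from dataclasses import dataclass

@dataclass
class Request:
    is_admin: bool
    admin_only: bool

def access_granted(req: Request) -> bool:
    # Deliberately buggy target: forgets to check is_admin for admin-only resources.
    return True

def fuzz_one(data: bytes) -> None:
    if len(data) < 2:
        return
    req = Request(is_admin=bool(data[0] & 1), admin_only=bool(data[1] & 1))
    granted = access_granted(req)
    # Invariant: non-admins must never be granted admin-only resources.
    # Without this assertion, the bug never crashes and the fuzzer sees nothing.
    assert not (granted and req.admin_only and not req.is_admin)

An input as small as bytes([0, 1]) trips the assertion, so the fuzzer reports a concrete failing case instead of silently missing the logic bug.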

Security audits are still indispensable tools for finding vulnerabilities, just like fuzzing is! Our audits integrate fuzzing whenever possible, and we look for opportunities to enforce invariants to catch nasty logic bugs.

Cryptographic design review of Ockam

5 March 2024 at 14:00

By Marc Ilunga, Jim Miller, Fredrik Dahlgren, and Joop van de Pol

In October 2023, Ockam hired Trail of Bits to review the design of its product, a set of protocols that aims to enable secure communication (i.e., end-to-end encrypted and mutually authenticated channels) across various heterogeneous networks. A secure system starts at the design phase, which lays the foundation for secure implementation and deployment, particularly in cryptography, where a secure design can prevent entire vulnerabilities.

In this blog post, we give some insight into our cryptographic design review of Ockam’s protocols, highlight several positive aspects of the initial design, and describe the recommendations we made to further strengthen the system’s security. For anyone considering working with us to improve their design, this blog post also gives a general behind-the-scenes look at our cryptographic design review offerings, including how we use formal modeling to prove that a protocol satisfies certain security properties.

Here is what Ockam’s CTO, Mrinal Wadhwa, had to say about working with Trail of Bits:

Trail of Bits brought tremendous protocol design expertise, careful scrutiny, and attention to detail to our review. In depth and nuanced discussions with them helped us further bolster our confidence in our design choices, improve our documentation, and ensure that we’ve carefully considered all risks to our customers’ data.

Overview of the Ockam system and Ockam Identities

Ockam is a set of protocols and managed infrastructure enabling secure communication. Users may also deploy Ockam on their premises, removing the need to trust Ockam’s infrastructure completely. Our review was based on two use cases of Ockam:

  1. TCP portals: secure TCP communication spanning various networks and traversing NATs
  2. Kafka portals: secure data streaming through Apache Kafka

A key design feature of Ockam is that secure channels are established using an instantiation of the Noise framework’s XX pattern in a way that is agnostic to the networking layer (i.e., the channels can be established for both TCP and Kafka networking, as well as others).

A major component of an Ockam deployment is the concept of Ockam Identities. Identities uniquely identify a node in an Ockam deployment. Each node has a self-generated identifier and an associated primary key pair that is rotated over time. Each rotation is cryptographically attested to with the current and next primary keys, thereby creating a change history. An identity is therefore defined by an identifier and the associated signed change history. The concrete constructions are shown in figure 1.

Diagram of an Ockam Identity showing an example of a signed change history with three blocks

Figure 1: Ockam Identities

Primary keys are not used directly for authentication or session key establishment in the Noise protocol. Rather, they are used to attest to purpose keys used for secure channel establishment and credential issuance. These credentials play a role akin to certificates in traditional PKI systems to enable mutual trust and enforce attribute-based access control policies.
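
To make the structure above concrete, here is a conceptual sketch (our own illustration using Ed25519 from the cryptography package; it is not Ockam’s wire format or API) of a change history in which each rotation is attested by both the previous and the new primary key:

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def append_change(history: list, prev_key, payload: bytes):
    # Each block carries a freshly generated primary key, a signature by that
    # new key, and, for every block after the first, a signature by the
    # previous primary key authorizing the rotation.
    new_key = Ed25519PrivateKey.generate()
    history.append({
        "payload": payload,
        "public_key": new_key.public_key(),
        "self_signature": new_key.sign(payload),
        "prev_signature": prev_key.sign(payload) if prev_key else None,
    })
    return new_key

history = []
key0 = append_change(history, None, b"initial block: binds the first key to the identifier")
key1 = append_change(history, key0, b"rotation: attests to the next primary key")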

The manual assessment process

We conducted a manual review of the Ockam design specification, including the secure channels, routing and transports, identities, and credentials, focusing on the cryptographic threats that we typically see in similar communication protocols. The manual review identified five issues, mostly related to insufficient documentation of assumptions and expected security guarantees. These findings indicate that missing information in the specifications, such as a threat model, may lead Ockam users to make security-critical decisions based on an incomplete understanding of the protocol.

We also raised a few issues related to discrepancies between the specifications and the implementation that we identified from a cursory review of the implementation. Even though the implementation was not in scope for this review, we often find that it serves as a ground truth in cases when the design documentation is unclear and can be interpreted in different ways.

Formal verification with Verifpal and CryptoVerif

In addition to reviewing the Ockam design manually, we used formal modeling tools to verify specific security properties automatically. Our formal modeling efforts primarily focused on Ockam Identities, a critical element of the Ockam system. To achieve comprehensive automated analysis, we used the protocol analyzers Verifpal and CryptoVerif.

Verifpal works in the symbolic model, whereas CryptoVerif works in the computational model, making them a complementary set of tools. Verifpal finds potential high-level attacks against protocols, enabling quick iterations on a protocol until a secure design is found, while CryptoVerif provides more low-level analysis and can more precisely relate the security of the protocol to the cryptographic security guarantees of the individual primitives used in the implementation.

Using Verifpal’s convenient modeling capabilities and built-in primitives, we modeled a (simplified) scenario for Ockam Identities where Alice proves to Bob that she owns the primary key associated with the peer identifier Bob is currently trying to verify. We also modeled a scenario where Bob verifies a new change initiated by Alice.

Modeling the protocol using Verifpal shows that the design of Ockam Identities achieves the expected security guarantees. For a given identifier, only the primary key holder may produce a valid initial change block that binds the public key to the identifier. Any subsequent changes are guaranteed to be generated by an entity holding the previous and current primary keys. Despite the ease of modeling, proving security guarantees with Verifpal requires a few tricks to prevent the tool from identifying trivial or invalid attacks. We discuss these considerations in our comprehensive report.

The current implementation of Ockam Identities can be instantiated with either of two signature schemes, ECDSA or Ed25519, which have different security properties. CryptoVerif highlighted that ECDSA and Ed25519 will not necessarily provide the same security guarantees, depending on what is expected from the protocol. However, this is not explicitly mentioned in the documentation.

Ed25519 is the preferred scheme, but ECDSA is also accepted because it is currently supported by the majority of cloud hardware security modules (HSMs). For the current design of Ockam Identities, ECDSA and Ed25519 theoretically offer the same guarantees. However, future changes to Ockam Identities may require other security guarantees that are provided only by Ed25519.

Occasionally, protocols require stronger properties than what is usually expected from the signature schemes’ properties (see Seems Legit: Automated Analysis of Subtle Attacks on Protocols that Use Signatures). Therefore, from a design perspective, it is desirable that properties expected from a protocol’s building blocks be well understood and explicitly stated.

Our recommendations for strengthening Ockam

Our review did not uncover any issues in the in-scope use cases that would pose an immediate risk to the confidentiality and integrity of data handled by Ockam. But we made several recommendations to strengthen the security of Ockam’s protocols. Our recommendations aim at enabling defense in depth, future-proofing the protocols, improving threat modeling, expanding documentation, and clearly defining the security guarantees of Ockam’s protocols. For example, one of our recommendations describes important considerations for protecting against “store now, decrypt later” attacks from future quantum computers.

We also worked with the Ockam team to flesh out information missing from the specification, such as documenting the exact meaning of certain primary key fields and creating a formal threat model. This information is important to allow Ockam users to make sound decisions when deploying Ockam’s protocols.

Generally, we recommended that Ockam explicitly document the assumptions made about cryptographic protocols and the expected security guarantees of each component of the Ockam system. Doing so will ensure that future development of the protocols builds upon well-understood and explicit assumptions. Good examples of what should be documented are the theoretical ECDSA-versus-EdDSA distinction that we identified with CryptoVerif and the reasoning for why using primitives with lower security margins will not significantly impact security.

Ockam’s CTO responded to the above recommendations with the following statement:

We believe that easy to understand and open documentation of Ockam’s protocols and implementation is essential to continuously improve the security and privacy offered by our products. Trail of Bits’ thorough third-party review of our protocol documentation and formal modeling of our protocols has helped make our documentation much more approachable for continuous scrutiny and improvement by our open source community.

Lastly, we strongly recommended an (internal or external) assessment of the Ockam protocols implementation, as a secure design does not imply a secure implementation. Issues in the deployment of a protocol may arise from discrepancies between the design and the implementation, or from specific implementation choices that violate the assumptions in the design.

Security is an ongoing process

At the start of the assessment, we observed that the Ockam design follows best practices, such as using robust primitives that are well accepted in the industry (e.g., the Noise XX protocol with AES-GCM and ChaCha20-Poly1305 as AEADs and with Ed25519 and ECDSA for signatures). Furthermore, the design reflects that Ockam considered many aspects of the system’s security and reliability, including, for instance, various relevant threat models and the root of trust for identities. Moreover, by open-sourcing its implementation and publishing the assessment result, the Ockam team creates a transparent environment and invites further scrutiny from the community.

Our review identified some areas for improvement, and we provided recommendations to strengthen the security of the product, which already stands on a good foundation. You can find more detailed information about the assessment, our findings, and our recommendations in the comprehensive report.

This project also demonstrates that security is an ongoing process, and including security considerations early in the design phase establishes a strong footing that the implementation can safely rely on. But it is always necessary to continuously work on improving the system’s security posture while responding adequately to newer threats. Assessing the design and the implementation are two of the most crucial steps in ensuring a system’s security.

Please contact us if you want to work with our cryptography team to help improve your design—we’d love to work with you!

Relishing new Fickling features for securing ML systems

4 March 2024 at 14:00

By Suha S. Hussain

We’ve added new features to Fickling to offer enhanced threat detection and analysis across a broad spectrum of machine learning (ML) workflows. Fickling is a decompiler, static analyzer, and bytecode rewriter for the Python pickle module that can help you detect, analyze, or create malicious pickle files.

While the ML community has seen the rise of safer serialization methods such as the safetensors file format, the security risk posed by the prevalence of pickle is far from resolved. The persistent widespread adoption of pickle in the ML ecosystem allows ML model files to be attack vectors for backdoors, ransomware, reverse shells, and other malicious payloads, making it important that we effectively identify and mitigate this issue.

To that end, we’ve added the following new features:

  1. Modular analysis API: Generate detailed results analyzing pickle files for malicious behaviors, with convenient JSON outputs.
  2. PyTorch module: Statically analyze and inject code into PyTorch files.
  3. Polyglot module: Differentiate, identify, and create polyglots for the different PyTorch file formats.

Pickle overlaying Python code snippet for the fickling tool

ICYMI: To our knowledge, Fickling was the first pickle security tool tailored for ML use cases. Our original blog post detailed why ML pickle files are exploitable and how Fickling specifically addresses this issue. We highlighted that Fickling is safe to run on potentially malicious files because it symbolically executes code using its own implementation of the Pickle Machine (PM). This enables Fickling to be used and deployed by incident response and ML infrastructure engineers to integrate novel ML threat detection and analysis into their pipelines. For instance, Fickling has been used to analyze malicious ML models found in the wild.

Modular analysis API

Malicious pickle files can incorporate obfuscation mechanisms to bypass direct scanning. However, Fickling facilitates a thorough analysis of such files by performing static analysis on the decompiled representations using its modular analysis API.

This API offers a detailed, systematic approach that dissects the analysis into specific categories of malicious behavior so that it’s easy to determine how and why a file was flagged. This makes Fickling an effective tool for inspecting and evaluating model artifacts whether you want to examine a model before using it in a project or investigate artifacts post-compromise.

The analysis is encapsulated in an easy-to-use JSON output format, accessible from both the CLI and Python API. The output details the severity of the file, provides a rationale for its assessment, and pinpoints specific analysis classes that were triggered, along with any relevant artifacts. This unified output format improves the usability of the modular analysis API, making it easy to customize and integrate the detection process across different tools and workflows.

Take, for example, the output from a sample malicious pickle file, generated by Fickling’s Numpy PoC (figure 1):

  • The severity field indicates that Fickling has labeled this file LIKELY_OVERTLY_MALICIOUS.
  • The analysis field explains why: Fickling detected both an unsafe import and an unused variable. The former is a much stronger determinant of severity than the latter. Still, detecting the unused variable provides more insight into how the unsafe import is used, and including such granular elements in the analysis is especially useful for artifacts that are designed to evade detection.
  • The detailed_results field, expanding upon the analysis field, clearly indicates that the UnsafeImports and UnusedVariables analysis classes were triggered by this file and includes the artifact that triggered both classes. This information can help users make informed decisions based on Fickling’s analysis.
{
    "severity": "LIKELY_OVERTLY_MALICIOUS",
    "analysis": "`from posix import system` is suspicious and indicative of an overtly malicious pickle file. Variable `_var0` is assigned value `system(...)` but unused afterward; this is suspicious and indicative of a malicious pickle file",
    "detailed_results": {
        "AnalysisResult": {
            "UnsafeImports": "from posix import system",
            "UnusedVariables": [
                "_var0",
                "system(...)"
            ]
        }
    }
}

Figure 1: The JSON output of Fickling’s analysis of the malicious pickle file from numpy_poc.py

PyTorch module

PyTorch, one of the most popular frameworks for ML, is an integral component of ML workflows. This framework is dependent on pickle, which makes Fickling an excellent choice for carrying the torch. Fickling’s PyTorch module can help you dill with these files. More concretely, this module extends Fickling’s decompilation, static analysis, and injection capabilities to PyTorch files so you can apply the modular analysis API and other features. This broadens Fickling’s capacity to assess the impact of pickles in production systems.

In figure 2, we demonstrate how this PyTorch module can be used. An ML model saved as a PyTorch file is transformed and serialized into a malicious file using Fickling. This example illustrates just one of the many use cases made possible by this module—injections.

import torch
import torchvision.models as models

from fickling.pytorch import PyTorchModelWrapper

# Load example PyTorch model
model = models.mobilenet_v2()
torch.save(model, "mobilenet.pth")

# Wrap model file into fickling
result = PyTorchModelWrapper("mobilenet.pth")

# Inject payload, overwriting the existing file instead of creating a new one
temp_filename = "temp_filename.pt"
result.inject_payload(
    "print('!!!!!!Never trust a pickle!!!!!!')",
    temp_filename,
    injection="insertion",
    overwrite=True,
)

# Load file with injected payload
# This outputs “!!!!!!Never trust a pickle!!!!!!”. 
torch.load("mobilenet.pth")

Figure 2: Fickling injects arbitrary code into a PyTorch model file.

Polyglot module

What are PyTorch files?

Before we dive into the Polyglot module, let’s talk a bit more about PyTorch files. PyTorch files encompass multiple different file formats. It is a common misconception, however, that a PyTorch file refers to only one specific file format. Improper differentiation between formats hampers detection and analysis efforts and aids exploits that use these files. Fickling can differentiate these formats so that they can be effectively analyzed when used in real-world deployments.

Fickling can identify the following file formats:

  1. PyTorch v0.1.1: Tar file with the sys_info, pickle, storages, and tensors directories
  2. PyTorch v0.1.10: Stacked pickle files
  3. TorchScript v1.0: ZIP file with the model.json file
  4. TorchScript v1.1: ZIP file with the model.json and attributes.pkl files (one pickle file)
  5. TorchScript v1.3: ZIP file with the data.pkl and constants.pkl files (two pickle files)
  6. TorchScript v1.4: ZIP file with the data.pkl, constants.pkl, and version files set at 2 or higher (two pickle files)
  7. PyTorch v1.3: ZIP file containing data.pkl (one pickle file)
  8. PyTorch model archive format [ZIP]: ZIP file that includes Python code files and pickle files

This list is subject to change and we’re continually adding more file formats as needed. If you’re interested in exploring the space of ML file formats beyond PyTorch files, check out our comprehensive list of ML file formats.

The PyTorch file formats differ both in structure and in the contexts where they appear. The Polyglot module’s file format identification feature can help you ensure that the correct files are being used in the correct contexts:

  • The torch.load function parses PyTorch v1.3, TorchScript v1.4, PyTorch v0.1.10, and PyTorch v0.1.1 files. The PyTorch v1.3 file format is the most common format of these and is typically deemed the canonical file format.
  • Meanwhile, TorchServe systems rely on the PyTorch model archive format.
  • Deprecated file formats such as TorchScript v1.1 are deliberately included in Fickling because these formats can still be compatible with external parsers and potentially exploitable.

Figure 3 showcases how Fickling can identify different PyTorch file formats. We used torch.save to serialize a PyTorch model as a PyTorch v1.3 file and a PyTorch v0.1.10 file. Fickling can clearly distinguish these two different formats.

> import torch
> import torchvision.models as models
> import fickling.polyglot as polyglot
> model = models.mobilenet_v2()
> torch.save(model, "mobilenet.pth")
> polyglot.identify_pytorch_file_format("mobilenet.pth", print_results=True)
Your file is most likely of this format:  PyTorch v1.3 
> torch.save(model, "legacy_mobilenet.pth", _use_new_zipfile_serialization=False)
> polyglot.identify_pytorch_file_format("legacy_mobilenet.pth",
print_results=True)
Your file is most likely of this format:  PyTorch v0.1.10

Figure 3: Fickling distinguishes between a PyTorch v1.3 file and a PyTorch v0.1.10 file.

Polyglots? In my PyTorch? It’s more likely than you think

Polyglot files are files that can be validly interpreted as more than one file format. They have been used to bypass code-signing checks and distribute malware, among many other unwanted behaviors. You can learn more about polyglot files and other byproducts of unruly parsers in our blog post on PolyFile and PolyTracker. Fickling’s identification of PyTorch file formats is polyglot-aware because you can make polyglots between these files. This raises the question: Why should we care about polyglots for ML model files?

Polyglot ML model files can bypass checks in ML tools and infiltrate model hubs to mislead consumers of that model. Specifically, in the context of ML model files, polyglot files can be a vector for backdoored ML models. You can construct a polyglot file so that it is a benign model when parsed as one file format but a backdoored model when parsed as another file format. During our audit of safetensors, a now resolved finding allowed us to create multiple polyglots with safetensors files (fun fact: the report itself is a PDF/ZIP polyglot with the ZIP file containing the polyglots from the audit).

It’s important to be able to identify this threat, whether you’re analyzing a model artifact for polyglottery after a compromise or building strict, well-defined parsers for MLOps tools that handle model files. Broadly, Fickling’s Polyglot module can help us begin to determine the potential impact of polyglot files on the ML ecosystem.

Fickling also supports the creation of these polyglot files for testing and demonstration. For instance, we can use Fickling to make a file that can be validly interpreted as both a PyTorch v0.1.10 file and a PyTorch model archive (MAR) file.

Since pickle is a streaming format that stops parsing as soon as it reaches the STOP opcode, we can append arbitrary data to a pickle file without disrupting valid parsing. In a similar vein, many ZIP parsers don’t enforce the specified magic to start at offset 0, which allows us to prepend data to a ZIP file while preserving valid parsing. These two capabilities, when combined, allow us to construct a file that is both a valid pickle file and a valid ZIP file—a pickle/ZIP polyglot!
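
To see how little machinery this requires, here is a minimal standalone sketch (independent of Fickling; the file names and contents are hypothetical) that builds a pickle/ZIP polyglot by plain concatenation:

import pickle
import zipfile

# Write a small pickle payload and a small ZIP archive.
with open("payload.pkl", "wb") as f:
    pickle.dump({"hello": "world"}, f)
with zipfile.ZipFile("archive.zip", "w") as zf:
    zf.writestr("note.txt", "ZIP side of the polyglot")

# Concatenate them: pickle parsing stops at the STOP opcode, and Python's
# zipfile (like many ZIP parsers) tolerates prepended data.
with open("payload.pkl", "rb") as p, open("archive.zip", "rb") as z:
    blob = p.read() + z.read()
with open("polyglot.bin", "wb") as out:
    out.write(blob)

# Both interpretations still succeed.
print(pickle.loads(blob))                  # {'hello': 'world'}
with zipfile.ZipFile("polyglot.bin") as zf:
    print(zf.read("note.txt"))             # b'ZIP side of the polyglot'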

Recall that a PyTorch v0.1.10 file is composed of stacked pickles. The PyTorch MAR parser is one of many ZIP parsers that accepts files with prepended data. This means that we can build on the pickle/ZIP polyglot to make a PyTorch v0.1.10 / PyTorch MAR polyglot by appending the MAR file to the PyTorch v0.1.10 file. This process is captured in Fickling, as shown in this example:

> import fickling.polyglot as polyglot 
> polyglot.create_polyglot("mar_example.mar","legacy_example.pt") 
Making a PyTorch v0.1.10/PyTorch MAR polyglot
The polyglot is contained in polyglot.mar.pt

Figure 4: Fickling creates a PyTorch v0.1.10 / PyTorch MAR polyglot.

The resulting file can be accurately identified using Fickling, as shown below:

> import fickling.polyglot as polyglot 
> polyglot.identify_pytorch_file_format('polyglot.mar.pt',print_results=True)
Your file is most likely of this format: PyTorch v0.1.10
It is also possible that your file can be validly interpreted as: ['PyTorch model archive format']

Figure 5: Fickling identifies a PyTorch v0.1.10 / PyTorch MAR polyglot.

Contribute to Fickling

We are actively maintaining and adding new capabilities to Fickling, including new injection methods, analysis classes, and polyglot combinations. We want Fickling to be a usable tool for both offensive and defensive security, so we invite you to share your feedback by raising an issue on our GitHub or reaching out directly on our Contact us page.

Beyond Fickling

While Fickling can help you identify threats to ML systems caused by malicious pickle files, we recommend moving away from pickle entirely. Restricted unpicklers may seem useful, but they are not a foolproof solution. To help the ecosystem move forward from pickles, we’ve audited a safer alternative, safetensors; reported pickle vulnerabilities in open-source codebases; and written Semgrep rules to catch instances of pickling under the hood in ML libraries.

We’re dedicated to improving the overall security and integrity of the ML ecosystem. Keep an eye out for upcoming blog posts on securing ML systems.

How we applied advanced fuzzing techniques to cURL

1 March 2024 at 14:30

By Shaun Mirani

Near the end of 2022, Trail of Bits was hired by the Open Source Technology Improvement Fund (OSTIF) to perform a security assessment of the cURL file transfer command-line utility and its library, libcurl. The scope of our engagement included a code review, a threat model, and the subject of this blog post: an engineering effort to analyze and improve cURL’s fuzzing code.

We’ll discuss several elements of this process, including how we identified important areas of the codebase that lacked coverage and then modified the fuzzing code to hit those missed areas. For example, by setting certain libcurl options during fuzzer initialization and introducing new seed files, we doubled the line coverage of the HTTP Strict Transport Security (HSTS) handling code and quintupled it for the Alt-Svc header. We also expanded the set of fuzzed protocols to include WebSocket and enabled the fuzzing of many new libcurl options. We’ll conclude this post by explaining some more sophisticated fuzzing techniques the cURL team could adopt to increase coverage even further, bring fuzzing to the cURL command line, and reduce inefficiencies intrinsic to the current test case format.

How is cURL fuzzed?

OSS-Fuzz, a free service provided by Google for open-source projects, serves as the continuous fuzzing infrastructure for cURL. It supports C/C++, Rust, Go, Python, and Java codebases, and uses the coverage-guided libFuzzer, AFL++, and Honggfuzz fuzzing engines. OSS-Fuzz adopted cURL on July 1, 2017, and the incorporated code lives in the curl-fuzzer repository on GitHub, which was our focus for this part of the engagement.

The repository contains the code (setup scripts, test case generators, harnesses, etc.) and corpora (the sets of initial test cases) needed to fuzz cURL and libcurl. It’s designed to fuzz individual targets, which are protocols supported by libcurl, such as HTTP(S), WebSocket, and FTP. curl-fuzzer downloads the latest copy of cURL and its dependencies, compiles them, and builds binaries for these targets against them.

Each target takes a specially structured input file, processes it using the appropriate calls to libcurl, and exits. Associated with each target is a corpus directory that contains interesting seed files for the protocol to be fuzzed. These files are structured using a custom type-length-value (TLV) format that encodes not only the raw protocol data, but also specific fields and metadata for the protocol. For example, the fuzzer for the HTTP protocol includes options for the version of the protocol, custom headers, and whether libcurl should follow redirects.
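
As a rough sketch of how such a seed might be assembled (the record type names, numbers, and byte order below are placeholders rather than curl-fuzzer’s actual definitions; the 2-byte type and 4-byte length fields match the layout described later in this post):

import struct

# Hypothetical record types; curl-fuzzer defines its own set.
TLV_TYPE_URL = 1
TLV_TYPE_RESPONSE = 2

def tlv(record_type: int, data: bytes) -> bytes:
    # A 2-byte type and a 4-byte length, followed by the raw data.
    return struct.pack(">HI", record_type, len(data)) + data

# A seed encoding a URL to fetch and a canned server response.
seed = (
    tlv(TLV_TYPE_URL, b"http://example.com/")
    + tlv(TLV_TYPE_RESPONSE, b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
)

with open("http_seed", "wb") as f:
    f.write(seed)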

First impressions: HSTS and Alt-Svc

We’d been tasked with analyzing and improving the fuzzer’s coverage of libcurl, the library providing curl’s internals. The obvious first question that came to mind was: what does the current coverage look like? To answer this, we wanted to peek at the latest coverage data given in the reports periodically generated by OSS-Fuzz. After some poking around at the URL for the publicly accessible oss-fuzz-coverage Google Cloud Storage bucket, we were able to find the coverage reports for cURL (for future reference, you can get there through the OSS-Fuzz introspector page). Here’s a report from September 28, 2022, at the start of our engagement.

Reading the report, we quickly noticed that several source files were receiving almost no coverage, including some files that implemented security features or were responsible for handling untrusted data. For instance, hsts.c, which provides functions for parsing and handling the Strict-Transport-Security response header, had only 4.46% line coverage, 18.75% function coverage, and 2.56% region coverage after over five years on OSS-Fuzz:

The file responsible for processing the Alt-Svc response header, altsvc.c, was similarly coverage-deficient:

An investigation of the fuzzing code revealed why these numbers were so low. The first problem was that the corpora directory was missing test cases that included the Strict-Transport-Security and Alt-Svc headers, which meant there was no way for the fuzzer to quickly jump into testing these regions of the codebase for bugs; it would have to use coverage feedback to construct these test cases by itself, which is usually a slow(er) process.

The second issue was that the fuzzer never set the CURLOPT_HSTS option, which instructs libcurl to use an HSTS cache file. As a result, HSTS was never enabled during runs of the fuzzer, and most code paths in hsts.c were never hit.

The final impediment to achieving good coverage of HSTS stemmed from its specification, which tells user agents to ignore the Strict-Transport-Security header when it is sent over unencrypted HTTP. However, this creates a problem in the context of fuzzing: from the perspective of our fuzzing target, which never stood up an actual TLS connection, every connection was unencrypted, and Strict-Transport-Security was always ignored. For Alt-Svc, libcurl already included a workaround to relax the HTTPS requirement for debug builds when a certain environment variable was set (although curl-fuzzer did not set this variable). So, resolving this issue was just a matter of adding a similar feature for HSTS to libcurl and ensuring that curl-fuzzer set all necessary environment variables.

Our changes to address these issues were as follows:

  1. We added seed files for Strict-Transport-Security and Alt-Svc to curl-fuzzer (ee7fad2).
  2. We enabled CURLOPT_HSTS in curl-fuzzer (0dc42e4).
  3. We added a check to allow debug builds of libcurl to bypass the HTTPS restriction for HSTS when the CURL_HSTS_HTTP environment variable is set, and we set the CURL_HSTS_HTTP and CURL_ALTSVC_HTTP environment variables in curl-fuzzer (6efb6b1 and 937597c).

The day after our changes were merged upstream, OSS-Fuzz reported a significant bump in coverage for both files:

A little over a year of fuzzing later (on January 29, 2024), our three fixes had doubled the line coverage for hsts.c and nearly quintupled it for altsvc.c:

Sowing the seeds of bugs

Exploring curl-fuzzer further, we saw a number of other opportunities to boost coverage. One low-hanging fruit we spotted was the set of seed files found in the corpora directory. While libcurl supports numerous protocols (some of which surprised us!) and features, not all of them were represented as seed files in the corpora. This is important: as we alluded to earlier, a comprehensive set of initial test cases, touching on as much major functionality as possible, acts as a shortcut to attaining coverage and significantly cuts down on the time spent fuzzing before bugs are found.

The functionality we created new seed files for, with the hope of promoting new coverage, included (ee7fad2):

  • CURLOPT_LOGIN_OPTIONS: Sets protocol-specific login options for IMAP, LDAP, POP3, and SMTP
  • CURLOPT_XOAUTH2_BEARER: Specifies an OAuth 2.0 Bearer Access Token to use with HTTP, IMAP, LDAP, POP3, and SMTP servers
  • CURLOPT_USERPWD: Specifies a username and password to use for authentication
  • CURLOPT_USERAGENT: Specifies the value of the User-Agent header
  • CURLOPT_SSH_HOST_PUBLIC_KEY_SHA256: Sets the expected SHA256 hash of the remote server for an SSH connection
  • CURLOPT_HTTPPOST: Sets POST request data. curl-fuzzer had been using only the CURLOPT_MIMEPOST option to achieve this, while the similar but deprecated CURLOPT_HTTPPOST option wasn’t exercised. We also added support for this older method.

Certain other CURLOPTs, as with CURLOPT_HSTS in the previous section, made more sense to set globally in the fuzzer’s initialization function. These included:

  • CURLOPT_COOKIEFILE: Points to a filename to read cookies from. It also enables fuzzing of the cookie engine, which parses cookies from responses and includes them in future requests.
  • CURLOPT_COOKIEJAR: Allows fuzzing the code responsible for saving in-memory cookies to a file
  • CURLOPT_CRLFILE: Specifies the certificate revocation list file to read for TLS connections

Where to go from here

As we started to understand more about curl-fuzzer’s internals, we drew up several strategic recommendations to improve the fuzzer’s efficacy that the timeline of our engagement didn’t allow us to implement ourselves. We presented these recommendations to the cURL team in our final report, and expand on a few of them below.

Dictionaries

Dictionaries are a feature of libFuzzer that can be especially useful for the text-based protocols spoken by libcurl. The dictionary for a protocol is a file enumerating the strings that are interesting in the context of the protocol, such as keywords, delimiters, and escape characters. Providing a dictionary to libFuzzer may increase its search speed and lead to the faster discovery of new bugs.

curl-fuzzer already takes advantage of this feature for the HTTP target, but currently supplies no dictionaries for the numerous other protocols supported by libcurl. We recommend that the cURL team create dictionaries for these protocols to boost the fuzzer’s speed. This may be a good use case for an LLM; ChatGPT can generate a starting point dictionary in response to the following prompt (replace <PROTOCOL> with the name of the target protocol):

A dictionary can be used to guide the fuzzer. A dictionary is passed as a file to the fuzzer. The simplest input accepted by libFuzzer is an ASCII text file where each line consists of a quoted string. Strings can contain escaped byte sequences like "\xF7\xF8". Optionally, a key-value pair can be used like hex_value="\xF7\xF8" for documentation purposes. Comments are supported by starting a line with #. Write me an example dictionary file for a <PROTOCOL> parser.

argv fuzzing

During our first engagement with curl, one of us joked, “Have we tried curl AAAAAAAAAA… yet?” There turned out to be a lot of wisdom behind this quip; it spurred us to fuzz curl’s command-line interface (CLI), which yielded multiple vulnerabilities (see our blog post, cURL audit: How a joke led to significant findings).

This CLI fuzzing was performed using AFL++’s argv-fuzz-inl.h header file. The header defines macros that allow a target program to build the argv array containing command-line arguments from fuzzer-provided data on standard input. We recommend that the cURL team use this feature from AFL++ to continuously fuzz cURL’s CLI (implementation details can be found in the blog post linked above).

Structure-aware fuzzing

One of curl-fuzzer’s weaknesses is intrinsic to the way it currently structures its inputs: a custom type-length-value (TLV) format. A TLV scheme (or something similar) can be useful for fuzzing a project like libcurl, which supports a wealth of global and protocol-specific options and parameters that need to be encoded in test cases.

However, the brittleness of this binary format makes the fuzzer inefficient. This is because libFuzzer has no idea about the structure that inputs are supposed to adhere to. curl-fuzzer expects input data in a strict format: a 2-byte field for the record type (of which only 52 were valid at the time of our engagement), a 4-byte field for the length of the data, and finally the data itself. Because libFuzzer doesn’t take this format into account, most of the mutations it generates wind up being invalid at the TLV-unpacking stage and have to be thrown out. Google’s fuzzing guidance warns about using TLV inputs for this reason.

As a result, the coverage feedback used to guide mutations toward interesting code paths performs much worse than it would if we dealt only with raw data. In fact, libcurl may contain bugs that will never be found with the current naive TLV strategy.

So, how can the cURL team address this issue while keeping the flexibility of a TLV format? Enter structure-aware fuzzing.

The idea with structure-aware fuzzing is to assist libFuzzer by writing a custom mutator. At a high level, the custom mutator’s job comprises just three steps:

  1. Try to unpack the input data coming from libFuzzer as a TLV.
  2. If the data can’t be parsed into a valid TLV, instead of throwing it away, return a syntactically correct dummy TLV. This can be anything, as long as it can be successfully unpacked.
  3. If the data does constitute a valid TLV, mutate the fields parsed out in step 1 by calling the LLVMFuzzerMutate function. Then, serialize the mutated fields and return the resultant TLV.

With this approach, no time is wasted discarding inputs because every input is valid; the mutator only ever creates correctly structured TLVs. Performing mutations at the level of the decoded data (rather than at the level of the encoding scheme) allows better coverage feedback, which leads to a faster and more effective fuzzer.
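
Here is a rough sketch of those three steps, written in Python for readability (curl-fuzzer’s actual mutator would be C++ registered through libFuzzer’s LLVMFuzzerCustomMutator hook, and the record layout below is simplified):

import random
import struct

HEADER = struct.Struct(">HI")  # 2-byte type, 4-byte length (layout simplified)
DUMMY = HEADER.pack(1, 0)      # a syntactically valid, empty record

def unpack_tlv(data: bytes):
    # Step 1: try to parse the input into (type, value) records.
    records, offset = [], 0
    while offset < len(data):
        if len(data) - offset < HEADER.size:
            return None
        rtype, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        if len(data) - offset < length:
            return None
        records.append((rtype, data[offset:offset + length]))
        offset += length
    return records

def mutate_bytes(value: bytes) -> bytes:
    # Stand-in for the fuzzing engine's own mutator (e.g., LLVMFuzzerMutate).
    buf = bytearray(value or b"\x00")
    buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)

def custom_mutator(data: bytes) -> bytes:
    records = unpack_tlv(data)
    if records is None:
        # Step 2: invalid input; return a well-formed dummy instead of discarding it.
        return DUMMY
    # Step 3: mutate the decoded fields, then re-serialize them.
    mutated = [(rtype, mutate_bytes(value)) for rtype, value in records]
    return b"".join(HEADER.pack(rtype, len(value)) + value for rtype, value in mutated)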

An open issue on curl-fuzzer proposes several changes, including an implementation of structure-aware fuzzing, but there hasn’t been any movement on it since 2019. We strongly recommend that the cURL team revisit the subject, as it has the potential to significantly improve the fuzzer’s ability to find bugs.

Our 2023 follow-up

At the end of 2023, we had the chance to revisit cURL and its fuzzing code in another audit supported by OSTIF. Stay tuned for the highlights of our follow-up work in a future blog post.

When try, try, try again leads to out-of-order execution bugs

1 March 2024 at 12:00

By Troy Sargent

Have you ever wondered how a rollup and its base chain—the chain that the rollup commits state checkpoints to—communicate and interact? How can a user with funds only on the base chain interact with contracts on the rollup?

In Arbitrum Nitro, one way to call a method on a contract deployed on the rollup from the base chain is by using retryable transactions (a.k.a. retryable tickets). While this feature enables these interactions, it does not come without its pitfalls. During our reviews of Arbitrum and contracts integrating with it, we identified footguns in the use of retryable tickets that are not widely known and should be considered when creating such transactions. In this post, we’ll share how using retryable tickets may allow unexpected race conditions and result in out-of-order execution bugs. What’s more, we’ve created a new Slither detector for this issue. Now you’ll be able to not only recognize these footguns in your code, but test for them too.

Retryable tickets

In Arbitrum Nitro, retryable tickets facilitate communication between the Ethereum mainnet, or Layer 1 (L1), and the Arbitrum Nitro rollup, or Layer 2 (L2). To create retryable tickets, users can call createRetryableTicket on the L1 Inbox contract of the Arbitrum rollup, as shown in the code snippet below. When retryable tickets are created and queued, ArbOS will attempt to automatically “redeem” them by executing them one after another on L2.

/**
 * @notice Put a message in the L2 inbox that can be reexecuted for some fixed amount of time if it reverts
 * @dev all msg.value will deposited to callValueRefundAddress on L2
 * @dev Gas limit and maxFeePerGas should not be set to 1 as that is used to trigger the RetryableData error
 * @param to destination L2 contract address
 * @param l2CallValue call value for retryable L2 message
 * @param maxSubmissionCost Max gas deducted from user's L2 balance to cover base submission fee
 * @param excessFeeRefundAddress gasLimit x maxFeePerGas - execution cost gets credited here on L2 balance
 * @param callValueRefundAddress l2Callvalue gets credited here on L2 if retryable txn times out or gets cancelled
 * @param gasLimit Max gas deducted from user's L2 balance to cover L2 execution. Should not be set to 1 (magic value used to trigger the RetryableData error)
 * @param maxFeePerGas price bid for L2 execution. Should not be set to 1 (magic value used to trigger the RetryableData error)
 * @param data ABI encoded data of L2 message
 * @return unique message number of the retryable transaction
 */
function createRetryableTicket(
    address to,
    uint256 l2CallValue,
    uint256 maxSubmissionCost,
    address excessFeeRefundAddress,
    address callValueRefundAddress,
    uint256 gasLimit,
    uint256 maxFeePerGas,
    bytes calldata data
) external payable returns (uint256);

The createRetryableTicket function interface

Assuming the gas costs are covered by the sender and no failures occur, the transactions will be executed sequentially, and the final state results from applying transaction B immediately following transaction A.

Figure 1: The happy path is when the transactions are all executed in order.

Wait, what does “retryable” mean?

Because any transaction may fail (e.g., the L2 gas price rises significantly following the creation of a transaction, and the user has insufficient gas to cover the new cost), Arbitrum created these types of transactions so that users can “retry” them by supplying additional gas. Failing retryable tickets will be persisted in memory and may be re-executed by any user who manually calls the redeem method of the ArbRetryableTx precompiled contract, sponsoring the gas costs. A retryable ticket that fails is different from a normal transaction that reverts, in that it does not require a new transaction to be signed to be executed again.

Additionally, retryable tickets in memory can be redeemed up to one week after they are created. A retryable ticket’s lifetime can be extended for another week by paying an additional fee for storing it; otherwise, it will be discarded after its expiration date.

Where things go wrong

While these types of transactions are useful—in that they facilitate L2-to-L1 communication and allow users to retry their transactions if failures occur—they come with pitfalls, risks that users and developers may not be aware of. Specifically, retryable tickets are expected to execute in the order they are submitted, but this is not always guaranteed to happen.

Consider the three scenarios below in which two retryable tickets are created within the same transaction.

In scenario 1, both transactions A and B fail and enter the memory region. The state of the application is left unchanged.

Figure 2: Two retryable tickets are created in the same transaction, but both fail and enter the memory region.

However, anyone can manually redeem transaction B before transaction A, which means that the transactions will be executed out of order unexpectedly.

Figure 3: Anyone can manually redeem transactions in the memory region out of order.

In scenario 2, transaction A fails and enters the memory region, but transaction B succeeds. Once again, the transactions are executed out of order (i.e., transaction A is not executed at all), and the final state is not what was expected.

Figure 4: Only transaction B is included in the final state.

In scenario 3, transaction A succeeds, but transaction B does not. That means transaction B must be re-executed manually. Transactions can be created more than once, which means that a second set of transactions A and B could be submitted before the first transaction B is re-executed. If developers of a protocol using the Arbitrum rollup system don’t account for the possibility that the protocol could receive a second transaction A prior to transaction B’s success, the protocol may not handle this case correctly.

Figure 5: Only transaction A is included in the final state.

The out-of-order execution vulnerability

In light of these scenarios, developers should consider that transactions may execute out of order. For instance, if the second transaction in a queue relies on the completion of the first but executes ahead of it (e.g., because the first failed due to insufficient gas), it may revert or behave incorrectly. It’s important that the callee, or message recipient, on the rollup can robustly handle situations such as the receipt of transactions in a different order than they were created and smaller subsets of transactions due to failures. If a protocol does not anticipate reorderings and failures of retryable tickets, the protocol could break or be hacked.

Let’s consider the following L2 contract, which users can call to claim rewards based on some staked tokens. When they decide to unstake their tokens, any rewards that they haven’t yet claimed are lost:

function claim_rewards(address user) public onlyFromL1 {
    // rewards is computed based on balance and staking period
    uint unclaimed_rewards = _compute_and_update_rewards(user);
    token.safeTransfer(user, unclaimed_rewards);
}


// Call claim_rewards before unstaking, otherwise you lose your rewards
function unstake(address user) public onlyFromL1 {
    _free_rewards(user); // clean up rewards related variables
    uint256 user_balance = balance[user];
    balance[user] = 0;
    staked_token.safeTransfer(user, user_balance);
}

Users can submit retryable tickets for such operations with the following logic in the L1 handler:

// Retryable A
IInbox(inbox).createRetryableTicket({
    to: l2contract,
    l2CallValue: 0,
    maxSubmissionCost: maxSubmissionCost,
    excessFeeRefundAddress: msg.sender,
    callValueRefundAddress: msg.sender,
    gasLimit: gasLimit,
    maxFeePerGas: maxFeePerGas,
    data: abi.encodeCall(l2contract.claim_rewards, (msg.sender))
});
// Retryable B
IInbox(inbox).createRetryableTicket({
    to: l2contract,
    l2CallValue: 0,
    maxSubmissionCost: maxSubmissionCost,
    excessFeeRefundAddress: msg.sender,
    callValueRefundAddress: msg.sender,
    gasLimit: gasLimit,
    maxFeePerGas: maxFeePerGas,
    data: abi.encodeCall(l2contract.unstake, (msg.sender))
});

Here it is expected that claim_rewards will be called before unstake. However, as we’ve seen, the claim_rewards transaction is not guaranteed to execute before the unstake transaction. As covered in scenario 1 and shown in figure 3, an attacker can make it so that unstake is executed before claim_rewards if both transactions fail, causing the user to lose their rewards. It’s also possible that only the second transaction, unstake, succeeds, as shown in scenario 2.

To mitigate such risks, it’s essential to design protocols so that retryable tickets are order-independent: the success of each ticket should not depend on the order or outcome of the others. How independent ordering is implemented depends on the protocol and the given operations. In this example, claim_rewards could be called within unstake.

Slither to the rescue

As security researchers, we always look for ways to automatically detect these sorts of issues and flag them early in the development cycle, such as during code review. To that end, we’ve written a Slither detector that flags functions that create multiple retryable tickets via the Arbitrum Nitro Inbox contract, alerting developers to this pitfall. Following its release, you can use this detector by installing Slither and running the following command in the root of a Solidity project: python3 -m pip install slither-analyzer==0.10.1 && slither . --detect out-of-order-retryable. On our example contract, Slither provides the following diagnostic:

Multiple retryable tickets created in the same function:
         -IInbox(inbox).createRetryableTicket({to:address(l2contract),l2CallValue:0,maxSubmissionCost:maxSubmissionCost,excessFeeRefundAddress:msg.sender,callValueRefundAddress:msg.sender,gasLimit:gasLimit,maxFeePerGas:maxFeePerGas,data:abi.encodeCall(l2contract.claim_rewards,(msg.sender))}) (out_of_order_retryable.sol#25-34)
         -IInbox(inbox).createRetryableTicket({to:address(l2contract),l2CallValue:0,maxSubmissionCost:maxSubmissionCost,excessFeeRefundAddress:msg.sender,callValueRefundAddress:msg.sender,gasLimit:gasLimit,maxFeePerGas:maxFeePerGas,data:abi.encodeCall(l2contract.unstake,(msg.sender))}) (out_of_order_retryable.sol#36-45)
Reference: https://github.com/crytic/slither/wiki/Detector-Documentation#out-of-order-retryable-transactions
INFO:Slither:out_of_order_retryable.sol analyzed (3 contracts with 1 detectors), 1 result(s) found

Conclusion

If you are developing a protocol that uses retryable tickets, ensure that your protocol is equipped to handle the scenarios we’ve outlined here. Specifically, the use of retryable tickets shouldn’t rely on their order or on successful execution. You can spot potential out-of-order execution bugs using our new Slither detector!

If your application interacts with Arbitrum Nitro components or you’re building software that features rollup–base chain communication, contact us to see how we help.

Our response to the US Army’s RFI on developing AIBOM tools

28 February 2024 at 16:30

By Michael Brown and Adelin Travers

The US Army’s Program Executive Office for Intelligence, Electronic Warfare and Sensors (PEO IEW&S) recently issued a request for information (RFI) on methods to implement and automate production of an artificial intelligence bill of materials (AIBOM) as part of Project Linchpin. The RFI describes the AIBOM as a detailed list of the components necessary to build, train, validate, and configure AI models and their supply chain relationships. As with the software bill of materials (SBOM) concept, the goal of the AIBOM concept is to allow providers and consumers of AI models to effectively address supply chain vulnerabilities. In this blog post, we summarize our response, which includes our recommendations for improving the concept, ensuring AI model security, and effectively implementing an AIBOM tool.

Background details and initial impressions

While the US Army is leading research efforts into adopting this technology, our responses to this RFI could be useful to any organization that is using AI/ML models and is looking to assess the security of these models, their components and architecture, and their supply chains.

Project Linchpin is a PEO IEW&S initiative to create an operational pipeline for developing and deploying AI/ML capabilities to intelligence, cyber, and electronic warfare systems. The proposed AIBOM concept for Project Linchpin will detail the components and supply chain relationships involved in creating AI/ML models and will be used to assess such models for vulnerabilities. As currently proposed, the US Army’s AIBOM concept includes the following:

  1. An SBOM detailing the components used to build and validate the given AI model
  2. A component for detailing the model’s properties, architecture, training data, hyperparameters, and intended use
  3. A component for detailing the lineage and pedigree of the data used to create the model

AIBOMs are a natural extension of bill of materials (BOM) concepts used to document and audit the software and hardware components that make up complex systems. As AI/ML models become more widespread, developing effective AIBOM tools presents an opportunity to proactively ensure the security and performance of such systems before they become pervasive. However, we argue that the currently proposed AIBOM concept has drawbacks that need to be addressed to ensure that the unique aspects of AI/ML systems are taken into account when implementing AIBOM tools.

Pros and cons of the AIBOM concept

An AIBOM would be ideal for enumerating an AI/ML model’s components that SBOM tools would miss, such as raw data sets, interfaces to ML frameworks and traditional software, and AI/ML model types, hyperparameters, algorithms, and loss functions. However, the proposed AIBOM concept has some significant shortcomings.

First, it would not be able to provide a complete security audit of the given AI model because certain aspects of model training and usage cannot be captured statically; the AIBOM tool would have to be complemented by other security auditing approaches. (We cover these approaches in more detail in the next section.) Second, the proposed concept does not account for AI/ML-specific hardware components via a hardware bill of materials (HBOM). Like the rest of the ML supply chain, specialized hardware components commonly used in deployed AI/ML systems, such as GPUs, may have unique vulnerabilities (e.g., data leakage) and should thus be captured by the AIBOM.

Additionally, the AIBOM tool would miss important AI/ML-specific downstream system dependencies and supply chain paradigms like machine-learning-as-a-service prediction APIs (common with LLMs). For instance, an AI model provider may be subject to attack vectors that would be difficult or impossible to detect, such as poisoning of web-scale training datasets and “sleeper agents” within LLMs.

Ensuring AI/ML model security

Many aspects of AI/ML model training and use cannot be captured statically and thus would limit the proposed AIBOM concept’s ability to provide a complete security audit. For example, it would not capture whether attackers had control over the order in which a model ingests training data, a potential data poisoning attack vector. To ensure that the given AI model has strong supply chain security, the AIBOM concept should be complemented by other security techniques, such as data cleaning/normalization tools, anomaly detection and integrity checks, and verification of training and inference environment configurations.

Additionally, we recommend extending the AIBOM concept to account for data and model transformation components in AI/ML models. For example, the AIBOM concept should be able to obtain detailed information about the data labels and labeling procedure in use, data transformation procedures in the model pipeline, model construction process, and infrastructure security configuration for the data pipeline. Capturing such items could help detect and address vulnerabilities in the AI/ML model supply chain.

Implementing the AIBOM concept

There are several barriers to building effective and automated tools to conduct AIBOM-based security audits today. First, a robust database of weaknesses and vulnerabilities specific to AI/ML models and their data pipelines (e.g., model hyperparameters, data transformation procedures) is sorely needed. Proposed databases do not provide a strong definition of what an AI/ML vulnerability is and thus do not provide the ground truth needed for security auditing. This database should define a unique abstraction for AI/ML weaknesses and enforce a machine-readable format so that the abstraction can be used as a data source for AIBOM security auditing.

AIBOM tools must be used during the data collection/transformation and model configuration/creation stages of the AI/ML operations pipeline. In many cases (e.g., tools built on ChatGPT), these stages may be controlled by a third party. We advocate for third-party AI/ML-as-a-service providers to adopt transparent, open-source principles for their models to help ensure the safety and security of tools built using their platforms.

Finally, further research and development is needed to create tools for automatically tracing data lineage and provenance. Security and safety concerns with advanced AI/ML models have started to highlight the need for such capabilities, but practical tools are still years away.

Once these key research problems are solved, we anticipate that implementing AIBOM tools and auditing programs will require similar effort to implementing SBOM tools and programs. There will, however, be several key differences that will require specialized knowledge and skills. Today’s developers, security engineers, and IT teams will need to upskill in technical domains such as data science, data management, and AI/ML-specific frameworks and hardware.

Final thoughts

We’re excited to continue discussing and developing techniques and automated tools that support high-fidelity AIBOM-based security auditing. We plan to continue engaging with the community and invite you to read our full response for more details.

Circomspect has been integrated into the Sindri CLI

26 February 2024 at 14:00

By Jim Miller

Our tool Circomspect is now integrated into the Sindri command-line interface (CLI)! We designed Circomspect to help developers build Circom circuits more securely, particularly given the limited tooling support available for this novel programming framework. Integrating this tool into a development environment like that provided by Sindri is a significant step toward more widespread use of Circomspect and thus better support for developers writing Circom circuits.

Developing zero-knowledge proof circuits is a difficult task. Even putting aside technical complexities, running non-trivial circuits for platforms like Circom is extremely computationally intensive: running basic tests can take several minutes (or longer), which could massively increase development time. Sindri aims to help alleviate this problem by giving users access to dedicated hardware that significantly accelerates the execution of these circuits. Their simple API and CLI tool allow developers to integrate their circuits with this dedicated hardware without having to manage any of their own infrastructure.

Stasia Carson, the CEO of Sindri Labs, had this to say about the announcement:

Our ongoing focus with the Sindri CLI is to make it more generally and widely useful for circuit developers independent of whether or not they use the Sindri service. The key to this is a unified cross-framework interface over tools for static analysis, linting, compiling, and proving coupled with installation-free tool distribution using optimized Docker containers. Circomspect is a crucial tool for developing secure Circom circuits, and honestly probably the best such tool across all of the frameworks, so we see it as one of the most vital integrations.

Being integrated into the Sindri CLI is an important step for Circomspect. With now even more users, we plan to extend Circomspect with more analysis ideas, which we will reveal throughout the year. Stay tuned to our blog for future updates about Circomspect and zero-knowledge circuit development generally!

Continuously fuzzing Python C extensions

23 February 2024 at 14:30

By Matt Schwager

Deserializing, decoding, and processing untrusted input are telltale signs that your project would benefit from fuzzing. Yes, even Python projects. Fuzzing helps reduce bugs in high-assurance software developed in all programming languages. Fortunately for the Python ecosystem, Google has released Atheris, a coverage-guided fuzzer for both pure Python code and Python C extensions. When it comes to Python projects, Atheris is really the only game in town if you’re looking for a mature fuzzer. Fuzzing pure Python code typically uncovers unexpected exceptions, which can ultimately lead to denial of service. Fuzzing Python C extensions may uncover memory errors, data races, undefined behavior, and other classes of bugs. Side effects include: memory corruption, remote code execution, and, more generally, all the headaches we’ve come to know and love about C. This post will focus on fuzzing Python C extensions.

We’ll walk you through using Atheris to fuzz Python C extensions, adding a Python project to OSS-Fuzz, and setting up continuous fuzzing through OSS-Fuzz’s integrated CIFuzz tool. OSS-Fuzz is Google’s continuous fuzzing service for open-source projects, making it a valuable tool for open-source developers; as of August 2023, it has helped find and fix over 10,000 vulnerabilities and 36,000 bugs. We will target the cbor2 Python library in our fuzzing campaign. This library is the perfect target because it performs serialization and deserialization of a JSON-like, binary format and has an optional C extension implementation for improved performance. Additionally, Concise Binary Object Representation (CBOR) is used heavily within the blockchain community, which tends to have high assurance and security requirements.

In the end, we found multiple memory corruption bugs in cbor2 that could become security vulnerabilities under the right circumstances.

Fuzzing Python C extensions

Under the hood, Atheris uses libFuzzer to perform its fuzzing. Since libFuzzer is built on top of LLVM and Clang, we will need a Clang installation to fuzz our target. To simplify the installation process, I wrote a Dockerfile to package up all the necessary components into a single Docker image. This creates a repeatable process for fuzzing the current target and an easily extensible artifact for fuzzing future targets. The resulting Docker image includes a Python fuzzing harness to initiate the fuzzing process.

First, we’ll discuss some interesting parts of this Dockerfile, then we’ll investigate the fuzz.py fuzzing harness, and finally we’ll build and run the Docker image and find some memory corruption bugs!

Fuzzing environment

Dockerfiles are a great way to create a self-documenting, reproducible environment. Since fuzzing can often be more art than science, this section will also include some discussion on interesting and non-obvious bits in the Dockerfile. The following Dockerfile was used to fuzz cbor2:

FROM debian:12-slim

RUN apt update && apt install -y \
    git \
    python3-full \
    python3-pip \
    wget \
    xz-utils \
    && rm -rf /var/lib/apt/lists/*

RUN python3 --version

ENV APP_DIR "/app"
ENV CLANG_DIR "$APP_DIR/clang"
RUN mkdir $APP_DIR
RUN mkdir $CLANG_DIR
WORKDIR $APP_DIR

ENV VIRTUAL_ENV "/opt/venv"
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH "$VIRTUAL_ENV/bin:$PATH"

ARG CLANG_URL=https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.6/clang+llvm-17.0.6-aarch64-linux-gnu.tar.xz
ARG CLANG_CHECKSUM=6dd62762285326f223f40b8e4f2864b5c372de3f7de0731cb7cd55ca5287b75a

ENV CLANG_FILE clang.tar.xz
RUN wget -q -O $CLANG_FILE $CLANG_URL && \
    echo "$CLANG_CHECKSUM  $CLANG_FILE" | sha256sum -c - && \
    tar xf $CLANG_FILE -C $CLANG_DIR --strip-components 1 && \
    rm $CLANG_FILE

# https://github.com/google/atheris#building-from-source
RUN LIBFUZZER_LIB=$($CLANG_DIR/bin/clang -print-file-name=libclang_rt.fuzzer_no_main.a) \
    python3 -m pip install --no-binary atheris atheris

# https://github.com/google/atheris/blob/master/native_extension_fuzzing.md#step-1-compiling-your-extension
ENV CC "$CLANG_DIR/bin/clang"
ENV CFLAGS "-fsanitize=address,undefined,fuzzer-no-link"
ENV CXX "$CLANG_DIR/bin/clang++"
ENV CXXFLAGS "-fsanitize=address,undefined,fuzzer-no-link"
ENV LDSHARED "$CLANG_DIR/bin/clang -shared"

ARG BRANCH=master

# https://github.com/agronholm/cbor2
ENV CBOR2_BUILD_C_EXTENSION "1"
RUN git clone --branch $BRANCH https://github.com/agronholm/cbor2.git
RUN python3 -m pip install cbor2/

# Allow Atheris to find fuzzer sanitizer shared libs
# https://github.com/google/atheris/blob/master/native_extension_fuzzing.md#option-a-sanitizerlibfuzzer-preloads
ENV LD_PRELOAD "$VIRTUAL_ENV/lib/python3.11/site-packages/asan_with_fuzzer.so"

# Subject to change by upstream, but it's just a sanity check
RUN nm $(python3 -c "import _cbor2; print(_cbor2.__file__)") | grep asan \
    && echo "Found ASAN" \
    || echo "Missing ASAN"

# 1. Skip allocation failures and memory leaks for now, they are common, and low impact (DoS)
# 2. https://github.com/google/atheris/blob/master/native_extension_fuzzing.md#leak-detection
# 3. Provide the symbolizer to turn virtual addresses to file/line locations
ENV ASAN_OPTIONS "allocator_may_return_null=1,detect_leaks=0,external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer"

COPY fuzz.py fuzz.py

ENTRYPOINT ["python3", "fuzz.py"]
CMD ["-help=1"]

The following bits of the Dockerfile are relevant for customizations or future projects and are worth discussing further:

  1. Installing Clang from the llvm-project repository
  2. Customizing the image at build-time using Docker build arguments (e.g., ARG)
  3. Installing the cbor2 project
  4. Sanity checking the compiled cbor2 C extension for AddressSanitizer (ASan) symbols using nm
  5. Using ASAN_OPTIONS to customize the fuzzing process

First, installing Clang from the llvm-project repository:

ENV APP_DIR "/app"
ENV CLANG_DIR "$APP_DIR/clang"
...
RUN mkdir $CLANG_DIR
...
ARG CLANG_URL=https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.6/clang+llvm-17.0.6-aarch64-linux-gnu.tar.xz
ARG CLANG_CHECKSUM=6dd62762285326f223f40b8e4f2864b5c372de3f7de0731cb7cd55ca5287b75a
...
ENV CLANG_FILE clang.tar.xz
RUN wget -q -O $CLANG_FILE $CLANG_URL && \
    echo "$CLANG_CHECKSUM  $CLANG_FILE" | sha256sum -c - && \
    tar xf $CLANG_FILE -C $CLANG_DIR --strip-components 1 && \
    rm $CLANG_FILE

This code installs the 17.0.6-aarch64-linux-gnu tarball of Clang. There is nothing particularly special about this tarball other than the fact that it is built for AArch64 and Linux. If you are running this Docker container on a different architecture, you will need to use the corresponding release tarball. You can then specify the CLANG_URL and CLANG_CHECKSUM build arguments as necessary or simply modify the Dockerfile according to your system’s requirements.

The Dockerfile also provides a BRANCH build argument. This allows the builder to specify a Git branch or tag that they would like to fuzz against. For example, if you’re working on a pull request and want to fuzz its corresponding branch, you can use this build argument to do so.

Next up, installing the cbor2 project:

ENV CBOR2_BUILD_C_EXTENSION "1"
RUN git clone --branch $BRANCH https://github.com/agronholm/cbor2.git
RUN python3 -m pip install cbor2/

This installs the cbor2 package from GitHub rather than from PyPI. This is necessary because we need to compile the underlying C extension. We could install the package from the PyPI source distribution, but using Git provides us more control over which branch, tag, or commit we install.

The CBOR2_BUILD_C_EXTENSION environment variable instructs setup.py to ensure the C extension is built:

 30    cpython = platform.python_implementation() == "CPython"
 31    windows = sys.platform.startswith("win")
 32    use_c_ext = os.environ.get("CBOR2_BUILD_C_EXTENSION", None)
 33    if use_c_ext == "1":
 34        build_c_ext = True
 35    elif use_c_ext == "0":
 36        build_c_ext = False
 37    else:
 38        build_c_ext = cpython and (windows or check_libc())

The environment flag for building the C extension (setup.py#30–38)

This is a common pattern for Python packages with C extensions. Investigating a project’s setup.py is a great way to better understand how a C extension is built. For more information, see the setuptools documentation on building extension modules.

On to sanity checking the compiled C extension:

RUN nm $(python3 -c "import _cbor2; print(_cbor2.__file__)") | grep asan \
    && echo "Found ASAN" \
    || echo "Missing ASAN"

This command searches the compiled C extension symbol table for ASan symbols. If they exist, then we know the C extension was compiled correctly. It is interesting to note that the __file__ attribute also works for shared objects in Python and thus enables this check:

$ python3 -c "import _cbor2; print(_cbor2.__file__)"
/opt/venv/lib/python3.11/site-packages/_cbor2.cpython-311-aarch64-linux-gnu.so

Finally, let’s dig into ASAN_OPTIONS:

ENV ASAN_OPTIONS "allocator_may_return_null=1,detect_leaks=0,external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer"

We are specifying three options:

  1. allocator_may_return_null=1: We set this flag so that oversized allocation requests surface as Python MemoryError exceptions (which our harness ignores) rather than as ASan crashes. We’re only looking for C memory corruption bugs, not Python exceptions.
  2. detect_leaks=0: This option is recommended by the Atheris documentation.
  3. external_symbolizer_path=$CLANG_DIR/bin/llvm-symbolizer: This enables the LLVM symbolizer to turn virtual addresses to file/line locations in fuzzing output.

You can find the full list of ASan sanitizer flags and common sanitizer options in Google’s sanitizers repository.

Fuzzing harness

The fuzzing harness used for cbor2 was largely inspired by the harness used by ujson in Google’s oss-fuzz repository. There are hundreds of projects being fuzzed in this repository. Reading through their fuzzing harnesses is a great way to gather ideas for your fuzzing project.

The following is the Python code used as the fuzzing harness:

#!/usr/bin/python3

import sys
import atheris

# _cbor2 ensures the C library is imported
from _cbor2 import loads

def test_one_input(data: bytes):
    try:
        loads(data)
    except Exception:
        # We're searching for memory corruption, not Python exceptions
        pass

def main():
    atheris.Setup(sys.argv, test_one_input)
    atheris.Fuzz()

if __name__ == "__main__":
    main()

Remember, we are fuzzing only the C extension, not the Python code. Two features of the harness enable that behavior: importing _cbor2 instead of cbor2, and the try/except block around the loads call. Looking again at setup.py, we see that _cbor2 is the Python module name for the C extension:

 47    if build_c_ext:
 48        _cbor2 = Extension(
 49            "_cbor2",
 50            # math.h routines are built-in to MSVCRT
 51            libraries=["m"] if not windows else [],
 52            extra_compile_args=["-std=c99"] + gnu_flag,
 53            sources=[
 54                "source/module.c",
 55                "source/encoder.c",
 56                "source/decoder.c",
 57                "source/tags.c",
 58                "source/halffloat.c",
 59            ],
 60            optional=True,
 61        )
 62        kwargs = {"ext_modules": [_cbor2]}
 63    else:
 64        kwargs = {}

The _cbor2 Python module name (setup.py#47–64)

That is how we know to import _cbor2 instead of cbor2. In addition to the import, the try/except block effectively ignores crashes caused by Python exceptions.

With the fuzzing environment provided by the Docker image and the fuzzing harness provided by the Python code, we are ready to do some fuzzing!

Running the fuzzer

First, copy the Dockerfile and Python code to files named Dockerfile and fuzz.py, respectively. You can then build the Docker image with the following command:

$ docker build --build-arg BRANCH=5.5.1 -t cbor2-fuzz -f Dockerfile .

Note that the APT packages and Clang installation require large downloads, so the build may take a while. Since version 5.5.1 was the latest cbor2 release when these bugs were found, we are building against that Git tag to reproduce the crashes. When the build is done, you can start the fuzzing process with the following command:

$ docker run -v $(pwd):/tmp/output/ cbor2-fuzz -artifact_prefix=/tmp/output/

Specifying /tmp/output as both a Docker volume and the libFuzzer artifact_prefix will cause any crash output files to persist to the host’s filesystem rather than the container’s ephemeral filesystem. See the libFuzzer options documentation for more information on flags that can be passed at runtime.

Running the fuzzer should quickly produce the following crash:

/usr/include/python3.11/object.h:537:15: runtime error: member access within null pointer of type 'PyObject' (aka 'struct _object')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/include/python3.11/object.h:537:15 in 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0xffff921a94b4 bp 0xffffe8dc8ce0 sp 0xffffe8dc8ca0 T0)
==1==The signal is caused by a READ memory access.
==1==Hint: address points to the zero page.
    #0 0xffff921a94b4 in Py_DECREF /usr/include/python3.11/object.h:537:9
    #1 0xffff921a94b4 in decode_definite_string /app/cbor2/source/decoder.c:653:9
    #2 0xffff921a94b4 in decode_string /app/cbor2/source/decoder.c:718:15
    #3 0xffff921a5cc8 in decode /app/cbor2/source/decoder.c:1735:27
    #4 0xffff921b1d98 in CBORDecoder_decode_stringref_ns /app/cbor2/source/decoder.c:1456:15
    #5 0xffff921ab90c in decode_semantic /app/cbor2/source/decoder.c:973:31
    #6 0xffff921a5d48 in decode /app/cbor2/source/decoder.c:1738:27
    #7 0xffff921aac90 in decode_map /app/cbor2/source/decoder.c:909:27
    #8 0xffff921a5d28 in decode /app/cbor2/source/decoder.c:1737:27
    #9 0xffff921d4e28 in CBOR2_load /app/cbor2/source/module.c:318:19
    #10 0xffff921d4e28 in CBOR2_loads /app/cbor2/source/module.c:367:19
    ...
==1==ABORTING
MS: 1 ChangeByte-; base unit: 096adbe21e6ccdcdaf3b466eae0eecc042a4ce48
0xa9,0xd9,0x1,0x0,0x67,0x0,0xfa,0xfa,0x0,0x0,0x4,0x4,
\251\331\001\000g\000\372\372\000\000\004\004
artifact_prefix='/tmp/output/'; Test unit written to /tmp/output/crash-092ce4a82026ba5ca35d4ee4ef5c9ba41623d61d
Base64: qdkBAGcA+voAAAQE

The output gives us the full stack trace and a crash file to reproduce the issue:

$ python -m cbor2.tool -p crash-092ce4a82026ba5ca35d4ee4ef5c9ba41623d61d 
Segmentation fault: 11

The crash happens in the Py_DECREF call in decode_definite_string:

 640    PyObject *ret = NULL;
 641    char *buf;
 642    
 643    buf = PyMem_Malloc(length);
 644    if (!buf)
 645       return PyErr_NoMemory();
 646    
 647    if (fp_read(self, buf, length) == 0)
 648       ret = PyUnicode_DecodeUTF8(
 649               buf, length, PyBytes_AS_STRING(self->str_errors));
 650    PyMem_Free(buf);
 651    
 652    if (string_namespace_add(self, ret, length) == -1) {
 653       Py_DECREF(ret);
 654       return NULL;
 655    }
 656    return ret;

The Py_DECREF call (source/decoder.c#640–656)

A NULL pointer dereference in the Python standard library produces the crash. Since the Py_DECREF documentation states that the passed object must not be NULL, the cbor2 developers fixed this bug by adding code that will detect a NULL pointer and return an error before Py_DECREF is reached.

Integrating a project into OSS-Fuzz

Google created OSS-Fuzz to improve the state of security for open-source projects. The service describes itself as “… a free service that runs fuzzers for open source projects and privately alerts developers to the bugs detected.” Integrating a project into OSS-Fuzz is a straightforward process. However, be aware that acceptance into OSS-Fuzz is ultimately at the discretion of the OSS-Fuzz team. There is no guarantee that a project will be accepted. OSS-Fuzz gives each new project proposal a criticality score and uses this value to determine if a project should be accepted.

Integrating a project into OSS-Fuzz requires four files:

  1. project.yaml: This file contains metadata about your project like contact information, repository location, programming language, and fuzzing engine.
  2. Dockerfile: This file clones your project and copies any necessary fuzzing resources like corpora or dictionaries into a Docker image. OSS-Fuzz will then run the Docker image as part of the fuzzing process.
  3. build.sh: This file installs your project and any of its dependencies into the Docker image fuzzing environment.
  4. A fuzzing harness file: This initiates the fuzzing process against a target. For example, to fuzz a specific Python function, the harness would be a Python script that initializes the fuzzing process with the target function.

If you would like to learn more about any of these files and their respective options, see the OSS-Fuzz documentation on setting up a new project. Once your project has been accepted to OSS-Fuzz, you will be granted access to the ClusterFuzz web interface, which provides access to crashes, coverage information, and fuzzer statistics. OSS-Fuzz will then fuzz your project in the background and notify you when it produces findings.

As part of our work fuzzing the cbor2 project, we integrated it into OSS-Fuzz in this pull request: google/oss-fuzz#11444. cbor2 will now be continuously fuzzed for bugs as development proceeds. To get a better idea of what this looks like in practice, see the cbor2 project in OSS-Fuzz.

Continuous fuzzing with CIFuzz

There’s continuous, and then there’s continuous. OSS-Fuzz fuzzes your project about once a day. If you need something more continuous than that, like, say, on every commit, then you will have to reach for another tool. Fortunately, Google and the OSS-Fuzz ecosystem have you covered with CIFuzz. CIFuzz integrates into the OSS-Fuzz ecosystem to fuzz your project on every commit. It does require a project to already be accepted and integrated in OSS-Fuzz, but non-OSS-Fuzz projects can use ClusterFuzzLite.

To take our cbor2 fuzzing one step further, we added a CIFuzz job to the project’s GitHub Actions. This will fuzz the project on every commit and every pull request. Using OSS-Fuzz and CIFuzz allows for both faster fuzz feedback on proposed changes and deeper fuzz testing as part of a scheduled nightly job. The best of both worlds. Think of it like the testing pyramid: unit tests are fast and run on every commit, whereas end-to-end tests are slow and may be run only as part of a lengthier, nightly CI job.

Once your project is integrated into OSS-Fuzz, adding CIFuzz is as simple as adding a GitHub Actions workflow to your project. This workflow file specifies metadata similar to that in the project’s project.yaml file, such as the project’s programming language, the libFuzzer sanitizers to use, and the fuzzing duration.

You may be asking yourself, “how long should I be fuzzing my project for?” The answer often ends up being more art than science. CIFuzz’s default duration is 600 seconds, or 10 minutes. This is a great starting point. In this situation, bigger is not always better. Remember, you could be waiting for this job to complete on every commit. How long would you and your teammates like to wait for a CI job? A good rule of thumb is that continuous fuzzing on every commit should be run for minutes, not hours or days, and that scheduled, nightly fuzzing should be run for hours, or even days. Start with something reasonable and be prepared to tweak it as necessary.

As part of our work fuzzing the cbor2 project, we added a CIFuzz workflow in this pull request: agronholm/cbor2#212. This should complement the scheduled OSS-Fuzz job nicely.

Build your own trophy case with fuzzing

Fuzzing is a great testing methodology for uncovering hard-to-find bugs and security vulnerabilities. It is particularly useful for projects performing decoding or deserialization functionality or taking in untrusted input. It has a proven track record, considering AFL’s extensive trophy case, rust-fuzz’s trophy case, and OSS-Fuzz’s claim of over 10,000 security vulnerabilities and 36,000 bugs found. Fuzzing is an advanced testing methodology, so it is not the first tool you should reach for when looking to improve your project’s robustness, but it is unquestionably a useful tool when you are looking to go to the next level.

In this post, we walked you through setting up a fuzzing environment and harness for Python C extensions and then went over the process of integrating a project into OSS-Fuzz and adding a CIFuzz GitHub Actions workflow. In the end, we found some interesting memory corruption bugs in the cbor2 Python library and made the open-source software community a little bit more secure.

If you’d like to read more about our work on fuzzing, we have used its capabilities in several ways, such as fuzzing x86_64 instruction decoders, breaking the Solidity compiler with a fuzzer, and fuzzing wolfSSL with tlspuffin.

Contact us if you’re interested in custom fuzzing for your project.

Breaking the shared key in threshold signature schemes

20 February 2024 at 14:30

By Fredrik Dahlgren

Today we are disclosing a denial-of-service vulnerability that affects the Pedersen distributed key generation (DKG) phase of a number of threshold signature scheme implementations based on the Frost, DMZ21, GG20, and GG18 protocols. The vulnerability allows a single malicious participant to surreptitiously raise the threshold required to reconstruct the shared key, which could cause signatures generated using the shared key to be invalid.

We first became aware of this vulnerability on a client engagement with Chainflip last year. When we reviewed Chainflip’s implementation of the Frost threshold signature scheme, we noticed it was doing something unusual—something that we had never seen before. Usually, these kinds of observations are an indication that there is a weakness or vulnerability in the codebase, but in this case, Chainflip’s defensive coding practices actually ended up protecting its implementation from a vulnerability. By being extra cautious, Chainflip also avoided introducing a vulnerability into the codebase that could be used by a single party to break the shared key created during the key-generation phase of the protocol. When we realized this, we became curious if other implementations were vulnerable to this issue. This started a long investigation that resulted in ten separate vulnerability disclosures.

What is the Pedersen DKG protocol?

The vulnerability is actually very easy to understand, but to be able to explain it we need to go through some of the mathy details behind the Pedersen DKG protocol. Don’t worry—if you understand what a polynomial is, you should be fine, and if you’ve heard about Shamir’s secret sharing before, you’re most of the way there already.

The Pedersen DKG protocol is based on Feldman’s verifiable secret sharing (VSS) scheme, which is an extension of Shamir’s secret sharing scheme. Shamir’s scheme allows n parties to share a key that can then be reconstructed by t + 1 parties. (Here, we assume that the group has somehow agreed on the threshold t and group size n in advance.) Shamir’s scheme assumes a trusted dealer and is not suitable for multi-party computation schemes where participants may be compromised and act maliciously. This is where Feldman’s VSS scheme comes in. Building on Shamir’s scheme, it allows participants to verify that shares are generated honestly.

Let G be a commutative group where the discrete logarithm problem is hard, and let g be a generator of G. In a (t, n)-Feldman VSS context, the dealer generates a random degree-t polynomial p(x) = a_0 + a_1 x + … + a_t x^t, where a_0 represents the original secret to be shared. She then computes the individual secret shares as s_1 = p(1), s_2 = p(2), …, s_n = p(n). This part is exactly identical to Shamir’s scheme. To allow other participants to verify their secret shares, the dealer publishes the values A_0 = g^{a_0}, A_1 = g^{a_1}, …, A_t = g^{a_t}. Participants can then use the coefficient commitments (A_0, A_1, …, A_t) to verify their secret share s_i by recomputing p(i) “in the exponent” as follows:

  • Compute V = g^{s_i} from the received share s_i.
  • Compute V’ = g^{p(i)} = g^{a_0 + a_1 i + … + a_t i^t} = ∏_k (g^{a_k})^{i^k} = ∏_k A_k^{i^k} from the published commitments.
  • Check that V = V’.

As in Shamir’s secret sharing, the secret s = a_0 can be recovered with t + 1 shares using Lagrange interpolation.
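To make the share-verification step concrete, here is a toy Python sketch of Feldman’s VSS. It is illustrative only: it uses a tiny multiplicative group and Python’s random module, whereas real implementations use elliptic-curve groups and a cryptographically secure source of randomness.

    import random

    # Toy parameters: p = 2q + 1 is a safe prime, and g generates the order-q subgroup.
    q = 11            # order of the subgroup (exponents live mod q)
    p = 2 * q + 1     # 23
    g = 4             # generator of the order-q subgroup of Z_p*


    def deal(secret: int, t: int, n: int):
        """Feldman VSS: share `secret` among n parties with threshold t."""
        coeffs = [secret % q] + [random.randrange(q) for _ in range(t)]
        shares = {i: sum(a * pow(i, k, q) for k, a in enumerate(coeffs)) % q
                  for i in range(1, n + 1)}
        commitments = [pow(g, a, p) for a in coeffs]   # A_k = g^{a_k}
        return shares, commitments


    def verify_share(i: int, s_i: int, commitments: list[int]) -> bool:
        """Check g^{s_i} == prod_k A_k^{i^k}, i.e., the share lies on the committed polynomial."""
        v = pow(g, s_i, p)
        v_prime = 1
        for k, a_k_commitment in enumerate(commitments):
            v_prime = (v_prime * pow(a_k_commitment, pow(i, k), p)) % p
        return v == v_prime


    shares, commitments = deal(secret=7, t=2, n=5)
    assert all(verify_share(i, s, commitments) for i, s in shares.items())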

In Feldman’s VSS scheme, the shared secret is known to the dealer. To generate a shared key that is unknown to all participants of the protocol, the Pedersen DKG protocol essentially runs n instances of Feldman’s VSS scheme in parallel. The result is a (t, n)-Shamir’s secret sharing of a value that is unknown to all participants: each participant P_i starts by generating a random polynomial p_i(x) = a_{i,0} + a_{i,1} x + … + a_{i,t} x^t of degree t. She publishes the coefficient commitments (A_{i,0} = g^{a_{i,0}}, A_{i,1} = g^{a_{i,1}}, …, A_{i,t} = g^{a_{i,t}}) and then sends the secret share s_{i,j} = p_i(j) to P_j. (Note that the index j must start at 1; otherwise, P_i ends up revealing her secret value a_{i,0} = p_i(0).) P_j can check that the secret share s_{i,j} was computed correctly by computing V and V’ as above and checking that they agree. To obtain their secret share s_j, each participant P_j simply sums the secret shares obtained from the other participants. That is, they compute their secret share as

s_j = s_{1,j} + s_{2,j} + … + s_{n,j} = p_1(j) + p_2(j) + … + p_n(j)

Notice that if we define p(x) as the polynomial p(x) = p_1(x) + p_2(x) + … + p_n(x), it is easy to see that what we obtain in the end is a Shamir’s secret sharing of the constant term of p(x), s = p(0) = a_{1,0} + a_{2,0} + … + a_{n,0}. Since the degree of each polynomial p_i(x) is t, the degree of p(x) is also t, and we can recover the secret s with t + 1 shares using Lagrange interpolation as before.

(There are a few more considerations that need to be made when implementing the Pedersen DKG protocol, but they are not relevant here. For more detail, refer to any of the papers linked in the introduction section.)

Moving the goalposts in the Pedersen DKG

Now, we are ready to come back to the engagement with Chainflip that started all of this. While reviewing Chainflip’s implementation of the Frost signature scheme, we noticed that the implementation was summing the commitments for the highest coefficient, A_{1,t} + A_{2,t} + … + A_{n,t}, and checking whether the result was equal to the identity element in G, which would mean that the highest coefficient of the resulting polynomial p(x) was 0. This is clearly undesirable since it would allow fewer than t + 1 participants to recover the shared key, but the probability of this happening is cryptographically negligible (even with actively malicious participants). By performing this check, Chainflip reduced that probability to 0.

This made us wonder, what would happen if a participant used a polynomial p_i(x) of a different degree than t in the Pedersen DKG protocol? In particular, what would happen if a participant used a polynomial p_i(x) of degree T greater than t? Since p(x) is equal to the sum p_1(x) + p_2(x) + … + p_n(x), the degree of p(x) would then be T rather than t, meaning that the signing protocol would require T + 1 rather than t + 1 participants to complete successfully. If this change were not detected by other participants, it would allow any of the participants to surreptitiously render the shared key unusable by choosing a threshold that was strictly greater than the total number of participants. If the DKG protocol were used to generate a shared key as part of a threshold signature scheme (like one of the schemes referenced in the introduction), any attempt to sign a message with t + 1 participants would fail. Depending on the implementation, this could also cause the system to misattribute malicious behavior to honest participants when the failure is detected. More seriously, this attack could also be used to render the shared key unusable and unrecoverable in most key-resharing schemes based on Feldman’s VSS. This includes the key resharing schemes described in CGGMP21 and earlier versions of Lindell22. In this case, the shared key may already control large sums of money or tokens, which would then be irrevocably lost.

Clearly, this type of malicious behavior could be prevented by simply checking the length of the coefficient commitment vector (A_{i,0}, A_{i,1}, …, A_{i,T}) published by each participant and aborting if any of the lengths is found to be different from t + 1. It turned out that Chainflip already checked for this, but we were curious if other implementations did as well. All in all, we found ten implementations that were vulnerable to this attack in the sense that they allowed a single participant to raise the threshold of the shared key generated using the Pedersen DKG without detection. (We did not find any vulnerable implementations of key-resharing schemes.)
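As a sketch of what that defensive check might look like in code (the function and variable names are hypothetical; each receiving participant would run it on every peer’s commitment vector before accepting any shares):

    def check_commitment_vector(commitments: list, t: int) -> None:
        """Abort the DKG if a participant published commitments for a polynomial
        whose degree differs from the agreed-upon threshold t."""
        # A degree-t polynomial has exactly t + 1 coefficients, so exactly
        # t + 1 coefficient commitments (A_{i,0}, ..., A_{i,t}) are expected.
        if len(commitments) != t + 1:
            raise ValueError(
                f"expected {t + 1} coefficient commitments, got {len(commitments)}; "
                "aborting key generation"
            )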

Disclosure process

We reached out to the maintainers of the following vulnerable codebases on January 3, 2024:

Seven of the maintainers responded to acknowledge that they had received the disclosure. Four of those maintainers (Chelsea Komlo, Jesse Possner, Safeheron, and the ZCash Foundation) also reported that they either already have, or are planning to resolve the issue.

We reached out again to the three unresponsive maintainers (Toposware, Trust Machines, and LatticeX) on February 7, 2024. Following this, Toposware also responded to acknowledge that they had received our disclosure.

A few notes on AWS Nitro Enclaves: Images and attestation

16 February 2024 at 14:30

By Paweł Płatek (GrosQuildu)

AWS Nitro Enclaves are locked-down virtual machines with support for attestation. They are Trusted Execution Environments (TEEs), similar to Intel SGX, making them useful for running highly security-critical code.

However, the AWS Nitro Enclaves platform lacks thorough documentation and mature tooling. So we decided to do some deep research into it to fill in some of the documentation gaps and, most importantly, to find security footguns and offer some advice for avoiding them.

This blog post focuses specifically on enclave images and the attestation process.

First, here’s a tl;dr on our recommendations to avoid security footguns while building and signing an enclave:

Running an enclave

To run an enclave, use SSH to connect to an AWS EC2 instance and use the nitro-cli tool to do the following:

  1. Build an enclave image from a Docker image and a few pre-compiled files.
     • Docker is used to create an archive of files for the enclave’s user space.
     • The pre-compiled binaries are described later in this blog post.
  2. Start the enclave from the enclave image.
     • The enclave image is a binary blob in the enclave image file (EIF) format.

    Figure 1: The flow of building an enclave

    This is what’s happening under the hood when an enclave is started:

    1. Memory and CPUs are freed from the EC2 instance and reserved for the enclave.
    2. The EIF is copied to the newly reserved memory.
    3. The EC2 instance asks the Nitro Hypervisor to start the enclave.

    The Nitro Hypervisor is responsible for securing the enclave (e.g., clearing memory before it’s returned to the EC2 instance). The enclave is attached to its parent EC2 instance and cannot be moved between EC2 instances. All of the code that is executed inside the enclave is provided in the EIF. So what does the EIF look like?

    The EIF format

    The best “specification” for the EIF format that we have is the code in the aws-nitro-enclaves-image-format repo. The EIF format is rather simple: a header and an array of sections. Each section is a header and a binary blob.

    Figure 2: The header and sections of an EIF

    The CRC32 checksum is computed over the header (minus 4 bytes reserved for the checksum itself) and all of the sections (including the headers).

    There are five types of EIF sections:

    Section type | Format | Description
    Kernel       | Binary | A bzImage file
    Cmdline      | String | The boot command line for the kernel
    Metadata     | JSON   | The build information, such as the kernel configuration and the Cargo and Docker versions used
    Ramdisk      | cpio   | The bootstrap ramfs, which includes the NSM driver and init file, and the user space ramfs, which includes files from the Docker image
    Signature    | CBOR   | A vector of tuples in the form (certificate, signature)

    So with an EIF, we have all that’s needed to run a VM: a kernel image, a command line for it, bootstrap binaries (the NSM driver and init executable), and a user space filesystem.

    But where does this data come from, and can you trust it?

    Who do you trust?

    Before we get into the details, you should know that there are quite a few implicit trust relationships involved in the data that flows into an EIF when it is created. For that reason, it is important to verify how data gets into your EIF images.

    To verify dataflows into an EIF image, we need to look into the enclave_build package that is used by the nitro-cli tool.

    A kernel image (which is a bzImage file), the init executable, and the NSM driver are pre-compiled and stored in the /usr/share/nitro_enclaves/blobs/ folder (on an EC2 instance). They are pulled to the instance when the aws-nitro-enclaves-cli-devel package is installed.

    Figure 3: Part of the Nitro Enclaves CLI installation documentation

The pre-compiled binaries of the kernel image, the init executable, and the NSM driver are generated by the code in the aws-nitro-enclaves-sdk-bootstrap repo, according to the repo’s README (though we have no way to verify this claim). That code builds those binaries from source.

    The binaries can also be found in the aws-nitro-enclaves-cli repo. We can compare SHA-384 hashes of the pre-compiled binaries from the three sources—the EC2 instance, the aws-nitro-enclaves-cli repo, and those generated by the aws-nitro-enclaves-sdk-bootstrap repo (for nitro-cli version 1.2.2):

    Component     | In the EC2 instance | In aws-nitro-enclaves-cli | Built with aws-nitro-enclaves-sdk-bootstrap
    Kernel        | 127b32...9821c4     | 127b32...9821c4           | 4b3719...016c58
    Kernel config | e9704c...7d9d35     | e9704c...7d9d35           | 9e634d...663f99
    Cmdline       | cefb92...ab0b0f     | cefb92...ab0b0f           | N/A
    init          | 7680fd...a435bb     | e23a90...4272ea           | 601ec5...d4b25e
    NSM driver    | 2357cb...8192c      | 993d1f...657b50           | 96d0df...4f5306
    linuxkit      | 31ed3c...035664     | 581ddc...2ee024           | N/A

    The kernel source code is obtained securely and the hashes are consistent. A manually built kernel has a different hash than that of the pre-compiled kernel probably because its configuration is different. We can manually verify the kernel’s configuration and boot command line, so their hashes are not so important.

    Interestingly, the hashes of the init and the NSM driver are completely off. To ensure that these executables were not maliciously modified, we would have to build them from the source code and debug the differences between the freshly built and pre-compiled versions (with a tool like GDB or Ghidra). Alternatively, we have to trust that the pre-compiled files are safe to use.

    Next, there are the ramdisk sections, which are simply cpio archives that store binary files. There are at least two ramdisks in every EIF:

    • The first ramdisk contains the init executable and the NSM driver.
    • The second ramdisk is created from the Docker image provided to the nitro-cli command.
      • It stores a command that init uses to pivot (in the .cmd file), environment variables (in the .env file), and all files from the Docker image (in the rootfs/ directory).
      • The command and environment variables are parsed from the Dockerfile.

    To construct cpio archives for ramdisks, the nitro-cli tool uses the linuxkit tool, which is downloaded along with the other pre-compiled files. AWS uses “a slightly modified” version of the tool (that’s why the hashes don’t match). linuxkit downloads the Docker image and extracts files from it, trying to make identical, reproducible copies of them. Notably, nitro-cli uses version 0.8 of linuxkit, which is outdated.

    Figure 4: A depiction of how an EIF is created

    Here’s how nitro-cli gets the Docker image used to build an EIF:

    1. nitro-cli builds the image locally if the --docker-dir command line option is provided.
    2. Otherwise, nitro-cli checks if the image is locally available.
    3. If it’s not, then it pulls the image using the shiplift library and credentials from a local file.
    4. linuxkit also tries to use locally available images; if images are not locally available, it pulls them from a remote registry using credentials obtained through the docker login command.

    Producing enclaves from Docker files in a reproducible, transparent, and easy-to-audit way is tricky—you can read more about that fact in Artur Cygan’s “Enhancing trust for SGX enclaves” blog post. When building EIFs, you should at least make sure that nitro-cli uses the right image. To do so, consult the Docker build logs (as Docker images and the daemon do not store information about image origin).

    What do you attest?

    The main feature of AWS Nitro Enclaves is cryptographic attestation. A running enclave can ask the Nitro Hypervisor to compute (measure) hashes of the enclave’s code and sign them with AWS’s private key, or more precisely with a certificate that is signed by a certificate that is signed by a certificate… that is signed by the AWS root certificate.

    You can use the cryptographic attestation feature to establish trust between an enclave’s source code and the code that is actually executed. Just make sure to get the AWS root certificate from a trusted source and to verify its hash.

    What’s important is the fact that AWS owns both the attestation key and the infrastructure. This means that you must completely trust AWS. If AWS is compromised or acts maliciously, it’s game over. This security model is different from the SGX architecture, where trust is divided between Intel (the attestation key owner) and a cloud provider.

    When the Hypervisor signs an enclave’s hashes, it’s specifically signing a CBOR-encoded document specified in the aws-nitro-enclaves-nsm-api repo. There are a few items in the document, but for now we are interested in the platform configuration registers (PCRs), which are measurements (cryptographic hashes) associated with the enclave. The first three PCRs are the hashes of the enclave’s code.

    Figure 5: The first three PCRs of an enclave

    PCRs 0 through 2 are just SHA-384 hashes over the sections’ data:

    • PCR-0: sha384(‘\0’*48 | sha384(Kernel | Cmdline | Ramdisk[:]))
    • PCR-1: sha384(‘\0’*48 | sha384(Kernel | Cmdline | Ramdisk[0]))
    • PCR-2: sha384(‘\0’*48 | sha384(Ramdisk[1:]))

    As you can see, there is no domain separation between the sections’ data—sections are simply concatenated. Moreover, PCR hashes do not include the section headers. This means that we can move bytes between adjacent sections without changing PCRs. For example, if we strip bytes from the beginning of the second ramdisk and append them to the first one, the PCR-0 measurement won’t change. That’s a ticking pipe bomb, but it is currently not exploitable. Regardless, we recommend checking PCR-1 and PCR-2 in addition to PCR-0 whenever possible.
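Based on the formulas above, here is a sketch of how PCR-0 through PCR-2 can be recomputed from an EIF’s raw section payloads. It assumes you have already extracted the section data yourself; it mirrors the hashing scheme described in this post and is not an official AWS tool.

    import hashlib


    def extend(data: bytes) -> str:
        """PCR = sha384('\\0' * 48 || sha384(data)), as described above."""
        inner = hashlib.sha384(data).digest()
        return hashlib.sha384(b"\x00" * 48 + inner).hexdigest()


    def compute_pcrs(kernel: bytes, cmdline: bytes, ramdisks: list) -> dict:
        # Section payloads are simply concatenated: no headers, no domain separation.
        return {
            "PCR0": extend(kernel + cmdline + b"".join(ramdisks)),
            "PCR1": extend(kernel + cmdline + ramdisks[0]),
            "PCR2": extend(b"".join(ramdisks[1:])),
        }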

    One more observation is that the metadata section of the EIF is not attested. It’s unspecified how and when users should use that section, so it’s hard to imagine an exploit scenario for this property. Just make sure your system’s security doesn’t depend on content from that section.

    Where do you sign?

    Finally, we’ll discuss the signature section of the EIF. This section contains a CBOR-encoded vector of tuples, each of which is a certificate-signature pair. The signature is a CBOR-encoded COSE_Sign1 structure that contains the encoded payload (tuples of PCR index-value pairs), the actual signature over the payload, and some metadata. The certificate is in PEM format.

    Section = [(certificate, COSE structure), (certificate, COSE structure), …]
    COSE structure = COSE_Sign1([(PCR index, PCR value), (PCR index, PCR value), …])
    COSE_Sign1(payload) = structure {
        payload = payload
        signature = sign(payload)
        metadata = signing algorithm (etc)
    }
    

    In the current version of the EIF format, the section contains only the signature for PCR-0, the hash of the entire enclave image. (But note that you can make an EIF with many signature elements; it will still be run by the Hypervisor, but it won’t validate signatures after the first one.)

    The signing code is implemented by the aws-nitro-enclaves-cose library.

PCR-8 is a hash of the EIF file’s signing certificate and is computed as follows: the certificate is first decoded from its original PEM format and re-encoded as DER.

    PCR-8 = sha384(‘\0’*48 | sha384(SignatureSection[0].certificate))
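A sketch of the corresponding PCR-8 computation, assuming a recent version of the pyca/cryptography library for the PEM-to-DER conversion:

    import hashlib

    from cryptography import x509
    from cryptography.hazmat.primitives.serialization import Encoding


    def compute_pcr8(signing_cert_pem: bytes) -> str:
        # The certificate is re-encoded as DER before hashing.
        cert = x509.load_pem_x509_certificate(signing_cert_pem)
        der = cert.public_bytes(Encoding.DER)
        inner = hashlib.sha384(der).digest()
        return hashlib.sha384(b"\x00" * 48 + inner).hexdigest()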
    

    Now, how do you validate the signature? The documentation instructs users to decrypt the payload from the COSE_Sign1 object to get the PCR index-value pair and compare the PCR value with the expected PCR. We think there is a terminology issue here and that they mean to verify the actual signature, and then extract the PCR from the payload and compare it with the expected one. However, we instead recommend reconstructing the COSE_Sign1 payload from the expected PCR and verifying the signature against that. That should save you from encountering bugs due to invalid parsing. (We discuss such bugs in the next section.)

    The official way to sign an enclave is to use the nitro-cli tool on an EC2 instance (figure 6). That forces you to push a private key to the instance (figure 7). That’s really not an ideal way to handle private keys. Even worse, the AWS documentation doesn’t instruct users to protect their keys with passphrases…

But there’s nothing stopping you from running nitro-cli outside of an EC2 instance, or even from running it in an offline environment. After all, the EIF is just a bunch of headers and binary blobs—the Nitro Hypervisor is not required to build and sign the image. The AWS repository even has an example of building an EIF in a Docker container. Moreover, there is a pending PR in the aws-nitro-enclaves-cli repository that will enable EIFs to be signed with KMS once merged.

    Figure 6: The AWS documentation states that nitro-cli must be run on an EC2 instance.

    nitro-cli build-enclave --docker-uri hello-world:latest --output-file 
    hello-signed.eif --private-key key_name.pem --signing-certificate certificate.pem
    

    Figure 7: Private keys must be stored in a local file.

    Overall, we recommend not following the AWS documentation when it comes to signing EIFs. Instead, here are a few options to ensure that EIFs are signed securely (in order of recommendation):

    • Push your private key and Docker image to an offline environment and sign the EIF there.
    • Modify nitro-cli to enable more secure signing (with HSM, KMS, keyring, etc.).
    • Wait for the nitro-cli PR that will enable EIFs to be signed with KMS to be merged; that way, you won’t have to modify nitro-cli yourself to do so.
    • Push your private key to your EC2 instance and sign the EIF there, as AWS recommends, but protect the key with a passphrase first. (nitro-cli will ask for the passphrase while building the EIF.)

    How do you parse?

    Now that we know what an enclave image looks like, we’ll discuss how it is parsed. If you are familiar with security bugs in file format parsers, you’ve probably already spotted ambiguities and potential issues in the parsing process.

    There are two EIF parsers:

    1. Public one: The nitro-cli describe-eif command
    2. Private one: Used by the Nitro Hypervisor to start an enclave

    The parser we care about is the private one—it provides the Hypervisor with an actual view of the EIF. However, it is not open sourced, and there is no specification on the EIF format, so we don’t have any insight into how the private parser actually works. To get some understanding of the private parser’s behavior, we have to treat it as a black box and run experiments on it. By modifying valid EIFs and trying to run them on the Hypervisor, I came up with some answers to the following questions, some of which I included in an issue I submitted to the aws-nitro-enclaves-image-format repo:

    • Is the CRC32 checksum verified? Yes. The enclave does not boot if the CRC32 checksum is invalid.
    • Can an EIF have more than two ramdisk sections? Yes. All ramdisk sections are just concatenated together.
    • Can you truncate (corrupt) a cpio archive in a ramdisk section? Yes! Some cpio errors are ignored by the Hypervisor.
    • Can an EIF have more than a single kernel or cmdline section? Probably not, but it’s hard to ensure that something is not possible.
    • Can you swap sections of different types (e.g., put the cmdline section before the kernel section)? Yes. Doing so changes the PCR-0 measurement.
    • Are the section sizes indicated in the EIF header metadata validated against the sizes indicated in the sections’ actual headers? Yes.
    • Can an EIF contain data between its sections? Yes. If so, the CRC32 checksum is also computed over that data.
    • Is an EIF header’s num_sections field validated against items in the section sizes and section offsets? No. Items after num_sections are ignored.
    • Do the sizes in the section_sizes array include section headers? No. The array stores data lengths only.
    • Can an EIF have more than one PCR index-value tuple in the signature section? No.
    • Can an EIF use an empty PCR index-value vector? No.
    • Can you sign a PCR other than PCR-0? It’s complicated, but no. The PCR index can be arbitrary data (not even a number), but the value must be a PCR-0 value.
    • Can an EIF store more than one certificate-signature pair in the signature section? Yes.
    • Are all certificate-signature pairs validated? No. Only the first pair is validated.

If you compare the findings above with the nitro-cli parser code, you will see that the two parsers work differently. Perhaps the most important difference is that the nitro-cli parser does not respect the header metadata like num_sections and the section offsets. Therefore, the nitro-cli parser may produce different measurements than the Hypervisor parser. We recommend not using the nitro-cli describe-eif command to learn the PCRs of untrusted EIFs. Instead, build your EIFs from sources, or run them and use the nitro-cli describe-enclaves command, which consults the Hypervisor for measurements.

    Why is this relevant?

    We run code in TEEs like AWS Nitro Enclaves when that code is highly security-critical, so we have to get the details right. But the documentation on AWS Nitro Enclaves is severely lacking, making it hard to understand those details. The feature also lacks mature tooling and contains several security footguns. So if you’re going to use AWS Nitro Enclaves, be sure to follow the checklist provided in the beginning of this post! And if you need further guidance, our AppSec team holds regular office hours. Contact us to schedule a meeting where you can ask our experts any questions.

    To learn more about AWS, check out Scott Arciszewski’s blog post “Cloud cryptography demystified: Amazon Web Services” and Joop van de Pol’s blog post “A trail of flipping bits” about TEE-specific issues.

    Cloud cryptography demystified: Amazon Web Services

    14 February 2024 at 14:00

    By Scott Arciszewski

    This post, part of a series on cryptography in the cloud, provides an overview of the cloud cryptography services offered within Amazon Web Services (AWS): when to use them, when not to use them, and important usage considerations. Stay tuned for future posts covering other cloud services.

    At Trail of Bits, we frequently encounter products and services that make use of cloud providers’ cryptography offerings to satisfy their security goals. However, some cloud providers’ cryptography tools and services have opaque names or non-obvious use cases. This is particularly true for AWS, whose huge variety of services are tailored for a multitude of use cases but can be overwhelming for developers with limited experience. This guide—informed by Trail of Bits’ extensive auditing experience as well as my own experience as a developer at AWS—dives into the differences between these services and explains important considerations, helping you choose the right solution to enhance your project’s security.

    Introduction

    The cryptography offered by cloud computing providers can be parceled into two broad categories with some overlap: cryptography services and client-side cryptography software. In the case of AWS, the demarcation between the two is mostly clear.

By client-side, we mean that the cryptography runs in your application (the client) rather than inside the AWS service in question. This doesn’t mean that the code necessarily runs in a web browser or on your users’ devices. Even if the client is running on a virtual machine in EC2, the cryptography is not happening at the back-end service level and is therefore client-side.

    Some examples of AWS cryptography services include the Key Management Service (KMS) and Cloud Hardware Security Module (CloudHSM). In the other corner, AWS’s client-side cryptography software (i.e., tools) includes the AWS Encryption SDK, the AWS Database Encryption SDK, and the S3 Encryption Client.

    One product from AWS that blurs the line between both categories is the Cryptographic Computing for Clean Rooms (C3R): a client-side tool tightly integrated into the AWS Clean Rooms service. Another is Secrets Manager, which runs client-side but is its own service. (Some powerful features that use cryptography, such as AWS Nitro, will be explored in detail in a future blog post.)

    Let’s explore some of these AWS offerings, including when they’re the most useful and some sharp edges that we often discover in our audits.

    AWS cryptography services

    AWS CloudHSM

    You want to use CloudHSM: If industry or government regulations require you to use an HSM directly for a specific use case. Otherwise, prioritize KMS.

    You don’t want to use CloudHSM: If KMS is acceptable instead.

    CloudHSM is simply an AWS-provisioned HSM accessible in your cloud environment. If you don’t have a legal requirement to use an HSM directly in your architecture, you can skip CloudHSM entirely.

    AWS KMS

    You want to use KMS: Any time you use Amazon’s services (even non-cryptographic services) or client-side libraries.

    You don’t want to use KMS: For encrypting or decrypting large messages (use key-wrapping with KMS instead).

AWS KMS can be thought of as a usability wrapper around FIPS-validated HSMs. It offers digital signatures, symmetric HMAC, and encryption/decryption capabilities with keys that never leave the HSM. However, KMS encryption is intended for key-wrapping in an envelope encryption setup, rather than for encrypting or decrypting the data itself.
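As a rough sketch of that envelope-encryption pattern, using boto3 and the pyca/cryptography library (the key ARN is hypothetical, and in practice the AWS Encryption SDK described below handles this for you):

    import os

    import boto3
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    kms = boto3.client("kms")
    KEY_ID = "arn:aws:kms:us-east-1:111122223333:key/example"  # hypothetical key ARN


    def envelope_encrypt(plaintext: bytes) -> dict:
        # Ask KMS for a fresh data key: Plaintext for local use, CiphertextBlob to store.
        data_key = kms.generate_data_key(KeyId=KEY_ID, KeySpec="AES_256")
        nonce = os.urandom(12)
        ciphertext = AESGCM(data_key["Plaintext"]).encrypt(nonce, plaintext, None)
        # Only the wrapped key is persisted; the plaintext key is discarded.
        return {
            "wrapped_key": data_key["CiphertextBlob"],
            "nonce": nonce,
            "ciphertext": ciphertext,
        }


    def envelope_decrypt(blob: dict) -> bytes:
        data_key = kms.decrypt(CiphertextBlob=blob["wrapped_key"])["Plaintext"]
        return AESGCM(data_key).decrypt(blob["nonce"], blob["ciphertext"], None)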

    One important, but under-emphasized, feature of KMS is Encryption Context. When you pass Encryption Context to KMS during an Encrypt call, it logs the Encryption Context in CloudTrail, and the encrypted data is valid only if the identical Encryption Context is provided on the later Decrypt call.

    It’s important to note that the Encryption Context is not stored as part of the encrypted data in KMS. If you’re working with KMS directly, you’re responsible for storing and managing this additional data.
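A minimal sketch of how Encryption Context is supplied, again assuming boto3 and a hypothetical key ARN; the caller must remember the context and pass the identical dictionary when decrypting:

    import boto3

    kms = boto3.client("kms")
    KEY_ID = "arn:aws:kms:us-east-1:111122223333:key/example"  # hypothetical key ARN
    context = {"tenant": "acme", "purpose": "session-token"}   # logged in CloudTrail

    encrypted = kms.encrypt(KeyId=KEY_ID, Plaintext=b"secret", EncryptionContext=context)

    # Decryption fails unless the same context is provided.
    decrypted = kms.decrypt(
        CiphertextBlob=encrypted["CiphertextBlob"],
        EncryptionContext=context,
    )
    assert decrypted["Plaintext"] == b"secret"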

Both considerations are addressed by the client-side cryptography software for AWS discussed below.

    Recently, KMS added support for external key stores, where KMS will call an HSM in your data center as part of its normal operation. This feature exists to comply with some countries’ data sovereignty requirements, and should be used only if legally required. What you gain in compliance with this feature, you lose in durability, availability, and performance. It’s generally not worth the trade-off.

    AWS client-side cryptography software

    AWS Encryption SDK

    You want to use the AWS Encryption SDK: For encrypting arbitrary-length secrets in a cloud-based application.

    You don’t want to use the AWS Encryption SDK: If you’re working with encrypting data for relational or NoSQL databases. The AWS Database Encryption SDK should be used instead.

    The AWS Encryption SDK is a general-purpose encryption utility for applications running in the cloud. Its feature set can be as simple as “wraps KMS to encrypt blobs of text” with no further considerations, if that’s all you need, or as flexible as supporting hierarchical key management to minimize network calls to KMS in a multi-keyring setup.

    Regardless of how your cryptographic materials are managed, the AWS Encryption SDK stores the Encryption Context passed to KMS in the encrypted message header, so you don’t need to remember to store it separately.
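For illustration, here is a sketch of the common usage pattern with the Python AWS Encryption SDK. The key ARN is hypothetical, and the exact API may vary between SDK versions, so check the documentation for the version you use.

    import aws_encryption_sdk
    from aws_encryption_sdk.identifiers import CommitmentPolicy

    client = aws_encryption_sdk.EncryptionSDKClient(
        commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
    )
    # Specify wrapping keys explicitly; avoid KMS Discovery mode.
    key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(
        key_ids=["arn:aws:kms:us-east-1:111122223333:key/example"]  # hypothetical
    )

    ciphertext, encrypt_header = client.encrypt(
        source=b"attack at dawn",
        key_provider=key_provider,
        encryption_context={"purpose": "demo"},  # carried in the message header
    )

    plaintext, decrypt_header = client.decrypt(source=ciphertext, key_provider=key_provider)
    # The context round-trips with the message, so it can be checked after decryption.
    assert decrypt_header.encryption_context["purpose"] == "demo"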

    Additionally, if you use an Algorithm Suite that includes ECDSA, it will generate an ephemeral keypair for each message, and the public key will be stored in the Encryption Context. This has two implications:

    1. Because Encryption Context is logged in CloudTrail by KMS, service operators can track the flow of messages through their fleet without ever decrypting them.
    2. Because each ECDSA keypair is used only once and then the secret key discarded, you can guarantee that a given message was never mutated after its creation, even if multiple keyrings are used.

    One important consideration for AWS Encryption SDK users is to ensure that you’re specifying your wrapping keys and not using KMS Discovery. Discovery is an anti-pattern that exists only for backwards compatibility.

    If you’re not using the hierarchical keyring, you’ll also want to look at data key caching to reduce the number of KMS calls and reduce latency in your cloud applications.

    AWS Database Encryption SDK

    You want to use the AWS Database Encryption SDK: If you’re storing sensitive data in a database, and would prefer to never reveal plaintext to the database.

    You don’t want to use the AWS Database Encryption SDK: If you’re not doing the above.

    As of this writing, the AWS Database Encryption SDK exists only for DynamoDB in Java. The documentation implies that support for more languages and database back ends is coming in the future.

    The AWS Database Encryption SDK (DB-ESDK) is the successor to the DynamoDB Encryption Client. Although it is backwards compatible, the new message format offers significant improvements and the ability to perform queries against encrypted fields without revealing your plaintext to the database service, using a mechanism called Beacons.

    At their core, Beacons are a truncated instance of the HMAC function. Given the same key and plaintext, HMAC is deterministic. If you truncate the output of the HMAC to a few bits, you can reduce the lookup time from a full table scan to a small, tolerable number of false positives.
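Conceptually, a beacon is computed like this (an illustration of the idea, not the DB-ESDK’s actual beacon construction):

    import hashlib
    import hmac


    def beacon(key: bytes, plaintext: bytes, bits: int) -> int:
        """Truncate an HMAC to `bits` bits: deterministic per (key, plaintext),
        but short enough that many plaintexts collide into the same beacon."""
        digest = hmac.new(key, plaintext, hashlib.sha384).digest()
        return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)


    # The same value always maps to the same beacon, so equality queries work...
    assert beacon(b"k", b"alice@example.com", 8) == beacon(b"k", b"alice@example.com", 8)
    # ...while unrelated values may collide, which limits what an attacker can infer.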

    Extra caution should be taken when using Beacons. If you cut them too short, you can waste a lot of resources on false positive rejection. If you don’t cut them short enough, an attacker with access to your encrypted database may be able to infer relationships between the beacons—and, in turn, the plaintext values they were calculated from. (Note that the risk of relationship leakage isn’t unique to Beacons, but to any techniques that allow an encrypted database to be queried.)

    AWS provides guidance for planning your Beacons, based on the birthday bound of PRFs to ensure a healthy distribution of false positives in a dataset.

    Disclaimer: I designed the cryptography used by the AWS Database Encryption SDK while employed at Amazon.

    Other libraries and services

    AWS Secrets Manager

    You want to use AWS Secrets Manager: If you need to manage and rotate service passwords (e.g., to access a relational database).

    You don’t want to use AWS Secrets Manager: If you’re looking to store your online banking passwords.

    AWS Secrets Manager can be thought of as a password manager like 1Password, but intended for cloud applications. Unlike consumer-facing password managers, Secrets Manager’s security model is predicated on access to AWS credentials rather than a master password or other client-managed secret. Furthermore, your secrets are versioned to prevent operational issues during rotation.

    Secrets Manager can be configured to automatically rotate some AWS passwords at a regular interval.

    In addition to database credentials, AWS Secrets Manager can be used for API keys and other sensitive values that might otherwise be committed into source code.
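A typical usage sketch with boto3 (the secret name is hypothetical):

    import boto3

    secrets = boto3.client("secretsmanager")

    # Fetch the current version of a secret at runtime instead of hard-coding it.
    response = secrets.get_secret_value(SecretId="prod/billing/db-password")
    db_password = response["SecretString"]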

    AWS Cryptographic Computing for Clean Rooms (C3R)

    You want to use AWS C3R: If you and several industry partners want to figure out how many database entries you have in common without revealing the contents of your exclusive database entries to each other.

    You don’t want to use AWS C3R: If you’re not doing that.

    C3R uses server-aided Private Set Intersection to allow multiple participants to figure out how many records they have in common, without revealing unrelated records to each other.

    For example: If two or more medical providers wanted to figure out if they have any patients in common (i.e., because they provide services that are not clinically safe together, but are generally safe separately), they could use C3R to calculate the intersection of their private sets and not violate the privacy of the patients that only one provider services.

    The main downside of C3R is that it has a rather narrow use-case.

    Wrapping up

    We hope that this brief overview has clarified some of AWS’s cryptography offerings and will help you choose the best one for your project. Stay tuned for upcoming posts in this blog series that will cover other cloud cryptography services!

    In the meantime, if you’d like a deeper dive into these products and services to evaluate whether they’re appropriate for your security goals, feel free to contact our cryptography team. We regularly hold office hours, where we schedule around an hour to give you a chance to meet with our cryptographers and ask any questions.

    Why Windows can’t follow WSL symlinks

    12 February 2024 at 14:30

    By Yarden Shafir

    Did you know that symbolic links (or symlinks) created through Windows Subsystem for Linux (WSL) can’t be followed by Windows?

    I recently encountered this rather frustrating issue as I’ve been using WSL for my everyday work over the last few months. No doubt others have noticed it as well, so I wanted to document it for anyone who may be seeking answers.

    Let’s look at an example of the issue. I’ll use Ubuntu as my Linux client with WSL2 and create a file followed by a symlink to a file in the same directory (via ln -s):

    echo "this is a symlink test" > test_symlink.txt
    ln -s test_symlink.txt targetfile.txt
    

    In WSL, I can easily read both the original file (test_symlink.txt) and the symlink (targetfile.txt). But when I try to open the symlink from the Windows file explorer, an error occurs:

    The Windows file explorer error

    The same error occurs when I try to access targetfile.txt from the command line:

    The command line error

    Looking at the directory, I can see the target file, but it has a size of 0 KB:

    The symlink in the directory with a size of 0 KB

    And when I run dir, I can see that Windows recognizes targetfile.txt as an NTFS junction but can’t find where the link points to, like it would for a native Windows symlink:

    Windows can’t find where the link points to.

    When I asked about this behavior on Twitter, Bill Demirkapi had an answer—the link that is created by WSL is an “LX symlink,” which isn’t recognized by Windows. That’s because symlinks on Linux are implemented differently than symlinks on Windows: on Windows, a symlink is an object, implemented and interpreted by the kernel. On Linux, a symlink is simply a file with a special flag, whose content is a path to the destination. The path doesn’t even have to be valid!

    Using FileTest, we can easily verify that this is a Linux symlink, not a Windows link. If you look carefully, you can even see the path to the destination file in the file’s DataBuffer:

    FileTest verifies the link as a Linux symlink.

    FileTest can also provide a more specific error message regarding the file open failure:

    FileTest’s file open failure error message

    It turns out that trying to open this file with NtCreateFile fails with an STATUS_IO_REPARSE_TAG_NOT_HANDLED error, meaning that Windows recognizes this file as a reparse point but can’t identify the LX symlink tag and can’t follow it. Windows knows how to handle some parts of the Linux filesystem, as explained by Microsoft, but that doesn’t include the Linux symlink format.

    If I go back to WSL, the symlink works just fine—the system can see the symlink target and open the file as expected:

    The symlink works in WSL.

    It’s interesting to note that symlinks created on Windows work normally on WSL. I can create a new file in the same directory and create a symlink for it using the Windows command line (cmd.exe):

    echo "this is a test for windows symlink" > test_win_symlink.txt
    mklink win_targetfile.txt test_win_symlink.txt
    

    Now Windows treats this as a regular symlink that it can identify and follow:

    Windows can follow symlinks created on Windows.

    But the Windows symlink works just as well if we access it from within WSL:

    The Windows symlink can also be accessed from WSL.

    We get the same result if we create a file junction using the Windows command line and try to open it with WSL:

    echo "this is a test for windows junctions" > test_win_junction.txt
    mklink /J junction_targetfile.txt test_win_junction.txt
    

    This is how the directory now looks from Windows’s point of view:

    The directory from Windows’s point of view

    And this is how it looks from WSL’s point of view:

    The directory from WSL’s point of view

    Hard links created by WSL do work normally on Windows, so this issue applies only to symlinks.

    To summarize, Windows handles only symlinks that were created by Windows, using its standard tags, and fails to process WSL symlinks of the “LX symlink” type. However, WSL handles both types of symlinks with no issues. If you use Windows and WSL to access the same files, it’s worth paying attention to your symlinks and how they are created to avoid the same issues I ran into.

    One last thing to point out is that when Bill Demirkapi tested this behavior, he noticed that Windows could follow WSL’s symlinks when they were created with a relative path but not with an absolute path. On all systems I tested, Windows couldn’t follow any symlinks created by WSL. So there is still some mystery left here to investigate.

    Master fuzzing with our new Testing Handbook chapter

    9 February 2024 at 14:00

    Our latest addition to the Trail of Bits Testing Handbook is a comprehensive guide to fuzzing: an essential, effective, low-effort method to find bugs in software that involves repeatedly running a program with random inputs to cause unexpected results.

    At Trail of Bits, we don’t just rely on standard static analysis. We tailor our approach to each project, fine-tuning our methods to rigorously fuzz critical code segments. We’ve seen how challenging it can be to start with fuzzing; it’s a field with diverse methodologies and no one-size-fits-all solution. We believe that distilling our knowledge into this handbook will help those seeking to integrate fuzzing into their methodology do so quickly and easily, with better results.

    Designed for developers eager to integrate fuzzing into their workflow, this chapter demystifies the fuzzing process. Within a jungle of fuzzer forks, each with numerous variations, it’s easy to get lost. Our guide focuses on the most proven and widely used fuzzers, providing a solid foundation to get you results.

    This chapter focuses on how to fuzz C/C++ and Rust projects. We describe how to install and start using three of the most mature fuzzers commonly used for C/C++ and Rust projects: libFuzzer, AFL++, and cargo-fuzz. We discuss common challenges when fuzzing, using an example C/C++ project. One of the challenges of starting your fuzzing is that there is no uniform way to set up fuzzing; some developers use CMake, while others use Autotools or plain Makefiles. We will also go through several real-world examples that use different build systems to demonstrate how to fuzz real projects.

    For every language and technology stack, and throughout the chapter, we will show you how to discover the following exemplary bug using each of the discussed fuzzers.

    void check_buf(char *buf, size_t buf_len) {
        if(buf_len > 0 && buf[0] == 'a') {
            if(buf_len > 1 && buf[1] == 'b') {
                if(buf_len > 2 && buf[2] == 'c') {
                    abort();
                }
            }
        }
    }
    

We also describe more advanced techniques, like using AddressSanitizer, a memory sanitizer that detects memory corruption bugs, with each fuzzer, as well as how to use fuzzing dictionaries efficiently and how to write good fuzzing harnesses.

    Our goal is to continuously update the handbook—including this chapter— so that it remains a key resource for security practitioners and developers in configuring, deploying, and automating the tools we use at Trail of Bits. We plan on keeping this chapter updated to reflect future changes to the fuzzing ecosystem and to include the most advanced fuzzing techniques.

    Binary type inference in Ghidra

    7 February 2024 at 14:00

    By Ian Smith

    Trail of Bits is releasing BTIGhidra, a Ghidra extension that helps reverse engineers by inferring type information from binaries. The analysis is inter-procedural, propagating and resolving type constraints between functions while consuming user input to recover additional type information. This refined type information produces more idiomatic decompilation, enhancing reverse engineering comprehension. The figures below demonstrate how BTIGhidra improves decompilation readability without any user interaction:

    Figure 1: Default Ghidra decompiler output

    Figure 2: Ghidra output after running BTIGhidra

    Precise typing information transforms odd pointer arithmetic into field accesses and void* into the appropriate structure type; introduces array indexing where appropriate; and reduces the clutter of void* casts and dereferences. While type information is essential to high-quality decompilation, the recovery of precise type information unfortunately presents a major challenge for decompilers and reverse engineers. Information about a variable’s type is spread throughout the program wherever the variable is used. For reverse engineers, it is difficult to keep a variable’s dispersed usages in their heads while reasoning about a local type. We created BTIGhidra in an effort to make this challenge a thing of the past.

    A simple example

    Let’s see how BTIGhidra can improve decompiler output for an example binary taken from a CTF challenge called mooosl (figure 3). (Note: Our GitHub repository has directions for using the plugin to reproduce this demo.) The target function, called lookup, iterates over nodes in a linked list until it finds a node with a matching key in a hashmap stored in list_heads.1 This function hashes the queried key, then selects the linked list that stores all nodes that have a key equal to that hash. Next, it traverses the linked list looking for a key that is equal to the key parameter.

    Figure 3: Linked-list lookup function from mooosl

    The structure for linked list nodes (figure 4) is particularly relevant to this example. The structure has buffers for the key and value stored in the node, along with sizes for each buffer. Additionally, each node has a next pointer that is either null or points to the next node in the linked list.

    Figure 4: Linked list node structure definition

    Figure 5 shows Ghidra’s initial decompiler output for the lookup function (FUN_001014fb). The overall decompilation quality is low due to poor type information across the function. For example, the recursive pointer next in the source code causes Ghidra to emit a void** type for the local variable (local_18), and the return type. Also, the type of the key_size function parameter, referred to as param_2 in the output, is treated as a void* type despite not being loaded from. Finally, the access to the global variable that holds linked list head nodes, referred to as DAT_00104010, is not treated as an array indexing operation.

    Figure 5: Ghidra decompiler output for the lookup function without type inference.
    Highlighted red text is changed after running type inference.

    Figure 6 shows a diff against the code in figure 5 after running BTIGhidra. Notice that the output now captures the node structure and the recursive type for the next pointer, typed as struct_for_node_0_9* instead of void**. BTIGhidra also resolves the return type to the same type. Additionally, the key_size parameter (param_2) is no longer treated as a pointer. Finally, the type of the global variable is updated to a pointer to linked list node pointers (PTR_00104040), causing Ghidra to treat the load as an array indexing operation.

    Figure 6: Ghidra decompiler output for the lookup function with type inference.
    Highlighted green text was added by type inference.

BTIGhidra infers types by collecting a set of subtyping constraints and then solving those constraints. Usages of known function signatures act as sources for type constraints. For instance, the call to memcmp in figure 5 results in a constraint declaring that param_2 must be a subtype of size_t. Notice in the figure that BTIGhidra also successfully identifies the four fields used in this function, while also recovering the additional fields used elsewhere in the binary.

    Additionally, users can supply a known function signature to provide additional type information for the type inference algorithm to propagate across the decompiled program. Figure 6 demonstrates how new type information from a known function signature (value_dump in this case) flows from a call site to the return type from the lookup function (referred to as FUN_001014fb in the decompiled output) in figure 5. The red line depicts how the user-defined function signature for value_dump is used to infer the types of field_at_8 and field_at_24 for the returned struct_for_node_0_9 from the original function FUN_001014fb. The type information derived from this call is combined with all other call sites to FUN_001014fb in order to remain conservative in the presence of polymorphism.

    Figure 7: Back-propagation of type information derived from value_dump function signature

    Ultimately, BTIGhidra fills in the type information for the recovered structure’s used fields, shown in figure 8. Here, we see that the types for field_at_8 and field_at_24 are inferred via the invocation of value_dump. However, the fields with type undefined8 indicate that the field was not sufficiently constrained by the added function signature to derive an atomic type for the field (i.e., there are no usages that relate the field to known type information); the inference algorithm has determined only that the field must be eight bytes.

    Figure 8: Struct type information table for decompiled linked list nodes

    Ghidra’s decompiler does perform some type propagation using known function signatures provided by its predefined type databases that cover common libraries such as libc. When decompiling the binary’s functions that call known library functions, these type signatures are used to guess likely types for the variables and parameters of the calling function. This approach has several limitations. Ghidra does not attempt to synthesize composite types (i.e., structs and unions) without user intervention; it is up to the user to define when and where structs are created. Additionally, this best-effort type propagation approach has limited inter-procedural power. As shown in figure 9, Ghidra’s default type inference results in conflicting types for FUN_1014fb and FUN_001013db (void* versus long and ulong), even though parameters are passed directly between the two functions.

    Figure 9: Default decompiler output using Ghidra’s basic type inference

    Our primary motivation for developing BTIGhidra is the need for a type inference algorithm in Ghidra that can propagate user-provided type information inter-procedurally. For such an algorithm to be useful, it should not guess a “wrong” type. If the user submits precise and correct type information, then the type inference algorithm should not derive conflicting type information that prevents user-provided types from being used. For instance, if the user provides a correct type float and we infer a type int, then these types will conflict resulting in a type error (represented formally by a bottom lattice value). Therefore, inferred types must be conservative; the algorithm should not derive a type for a program variable that conflicts with its source-level type. In a type system with subtyping, this property can be phrased more precisely as “an inferred type for a program variable should always be a supertype of the actual type of the program variable.”

    In addition to support for user-provided types, BTIGhidra overcomes many other shortcomings of Ghidra’s built-in type inference algorithm. Namely, BTIGhidra can operate over stripped binaries, synthesize composite types, ingest user-provided type constraints, derive conservative typing judgments, and collect a well-defined global view of a binary’s types.

    Bringing type-inference to binaries

    At the source level, type inference algorithms work by collecting type constraints on program terms that are expressed in the program text, which are then solved to produce a type for each term. BTIGhidra operates on similar principles, but needs to compensate for information loss introduced by compilation and C’s permissive types. BTIGhidra uses an expressive type system that supports subtyping, polymorphism, and recursive types to reason about common programming idioms in C that take advantage of the language’s weak types to emulate these type system features. Also, subtyping, when combined with reaching definitions analysis, allows the type inference algorithm to handle compiler-introduced behavior, such as register and stack variable reuse.

    Binary type inference proceeds similarly, but information lost during compilation increases the difficulty of collecting type constraints. To meet this challenge, BTIGhidra runs various flow-sensitive data-flow analyses (e.g., value-set analysis) provided by and implemented using FKIE-CAD’s cwe_checker to track how values flow between program variables. These flows inform which variables or memory objects must be subtypes of other objects. Abstractly, if a value flows from a variable x into a variable y, then we can conservatively conclude that x is a subtype of y.
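As a toy illustration of that idea (this is not BTIGhidra’s actual constraint engine), data-flow facts can be turned into subtyping constraints and closed under transitivity:

    from itertools import product

    # Each data-flow fact "x flows into y" becomes a constraint "x <: y".
    # The variable names here are hypothetical.
    flows = [("param_2", "memcmp.arg2"), ("memcmp.arg2", "size_t")]
    subtype_of = set(flows)

    # Close the relation under transitivity: x <: y and y <: z imply x <: z.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(subtype_of), repeat=2):
            if b == c and (a, d) not in subtype_of:
                subtype_of.add((a, d))
                changed = True

    assert ("param_2", "size_t") in subtype_of  # param_2 must be usable as a size_t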

    Using this data-flow information, BTIGhidra independently generates subtyping constraints for each strongly connected component (SCC)2 of functions in the binary’s call graph. Next, BTIGhidra simplifies signatures by using a set of proof rules to solve for all derivable relationships between interesting variables (i.e., type constants like int and size_t, functions, and global variables) within an SCC. These signatures act as a summary of the function’s typing effects when it is called. Finally, BTIGhidra solves for the type sketch of each SCC, using the signatures of called SCCs as needed.

    Type sketches are our representation of recursively constrained types. They represent a type as a directed graph, with edges labeled by fields that represent the capabilities of a type and nodes labeled by a bound [lb,ub]. Figure 10 shows an example of a type sketch for the value_dump function signature. As an example, the path from node 3 to 8 can be read as “the type with ID 3 is a function that has a second in parameter which is an atomic type that is a subtype of size_t and a supertype of bottom.” These sketches provide a convenient representation of types when lowering to C types through a fairly straightforward graph traversal. Type sketches also form a lattice with a join and meet operator defined by language intersection and union, respectively. These operations are useful for manipulating types while determining the most precise polymorphic type we can infer for each function in the binary. Join allows the algorithm to determine the least supertype of two sketches, and meet allows the algorithm to determine the greatest subtype of two sketches.

    Figure 10: Type sketch for the value_dump function signature
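
    The sketch in figure 10 can be read as a labeled directed graph. Purely as an illustration (this is not BTIGhidra’s internal representation, and the "in_1" edge label is our own shorthand for “second in parameter”), a type sketch could be modeled in Go like this, with a bound on each node and capability-labeled edges:

    package main

    import "fmt"

    // bound is the [lb, ub] lattice interval attached to each sketch node,
    // e.g., lb = "bottom" and ub = "size_t" for an atomic parameter type.
    type bound struct {
        lb, ub string
    }

    // sketchNode is a node in the type sketch graph; edges are labeled with
    // capabilities such as "in_1" (second in parameter) or "load".
    type sketchNode struct {
        id    int
        bound bound
        edges map[string]*sketchNode
    }

    func main() {
        // A fragment of a function sketch: the function's second in parameter
        // is an atomic type bounded above by size_t and below by bottom.
        param := &sketchNode{id: 8, bound: bound{lb: "bottom", ub: "size_t"}}
        fn := &sketchNode{id: 3, edges: map[string]*sketchNode{"in_1": param}}
        p := fn.edges["in_1"]
        fmt.Printf("node %d.in_1 -> node %d with bound [%s, %s]\n",
            fn.id, p.id, p.bound.lb, p.bound.ub)
    }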

    The importance of polymorphic type inference

    Using a type system that supports polymorphism may seem odd for inferring C types when C has no explicit support for polymorphism. However, polymorphism is critical for maintaining conservative types in the presence of C idioms, such as handling multiple types in a function by dispatching over a void pointer. Perhaps the most canonical examples of polymorphic functions in C are malloc and free.

    Figure 11: Example program that uses free polymorphically

    In the example above, we consider a simple (albeit contrived) program that passes two structs to free. We access the fields of both foo and bar to reveal field information to the type inference algorithm. To demonstrate the importance of polymorphism, I modified the constraint generation phase of type inference to generate a single formal type variable for each function, rather than a type variable per call site. This change has the effect of unifying all constraints on free, regardless of the calling context.

    The resulting unsound decompilation is as follows:

    struct_for_node_0_13 * produce(struct_for_node_0_13 *param_1, struct_for_node_0_13 *param_2)
    
    {
      param_1->field_at_0 = param_2->field_at_8;
      param_1->field_at_8 = param_2->field_at_0;
      param_1->field_at_16 = param_2->field_at_0;
      free(param_1);
      free(param_2);
      return param_1;
    }
    

    Figure 12: Unsound inferred type for the parameters to produce

    The assumption that function calls are non-polymorphic leads to inferring an over-precise type for the function’s parameters (shown in figure 12), causing both parameters to have the same type with three fields.

    Instead of unifying all call sites of a function, BTIGhidra generates a type variable per call site and unifies the actual parameter type with the formal parameter type only if the inferred type is structurally equal after a refinement pass. This conservative assumption allows BTIGhidra to remain sound and derive the two separate types for the parameters to the function in figure 11:

    struct_for_node_0_16 * produce(struct_for_node_0_16 *param_1, struct_for_node_0_20 *param_2)
    
    {
      param_1->field_at_0 = param_2->field_at_8;
      param_1->field_at_8 = param_2->field_at_0;
      param_1->field_at_16 = param_2->field_at_0;
      free(param_1);
      free(param_2);
      return param_1;
    } 
    

    Evaluating BTIGhidra

    Inter-procedural type inference on binaries operates over a vast set of information collected on the target program. Each analysis involved is a hard computational problem in its own right. Ghidra and our flow-sensitive analyses use heuristics related to control flow, ABI information, and other constructs. These heuristics can lead to incorrect type constraints, which can have wide-ranging effects when propagated.

    Mitigating these issues requires a strong testing and validation strategy. In addition to BTIGhidra itself, we also released BTIEval, a tool for evaluating the precision of type inference on binaries with known ground-truth types. BTIEval takes a binary with debug information and compares the types recovered by BTIGhidra to those in the debug information (the debug info is ignored during type inference). The evaluation utility aggregates soundness and precision metrics. Utilizing BTIEval more heavily and over more test binaries will help us provide better correctness guarantees to users. BTIEval also collects timing information, allowing us to evaluate the performance impacts of changes.

    Give BTIGhidra a try

    The pre-built Ghidra plugin is located here or can be built from the source. The walkthrough instructions are helpful for learning how to run the analysis and update it with new type signatures. We look forward to getting feedback on the tool and welcome any contributions!

    Acknowledgments

    BTIGhidra’s underlying type inference algorithm was inspired by and is based on an algorithm proposed by Noonan et al. The methods described in the paper are patented under process patent US10423397B2 held by GrammaTech, Inc. Any opinions, findings, conclusions, or recommendations expressed in this blog post are those of the author(s) and do not necessarily reflect the views of GrammaTech, Inc.

    We would also like to thank the team at FKIE-CAD behind CWE Checker. Their static analysis platform over Ghidra PCode provided an excellent base set of capabilities in our analysis.

    This research was conducted by Trail of Bits based upon work supported by DARPA under Contract No. HR001121C0111 (Distribution Statement A, Approved for Public Release: Distribution Unlimited). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Government or DARPA.

    1Instructions for how to use the plugin to reproduce this demo are available here.
2A strongly connected component of a graph is a set of nodes in a directed graph where there exists a path from each node in the set to every other node in the set. Conceptually, computing SCCs separates the call graph into groups of mutually recursive functions; functions in different groups do not recursively call each other.

    Improving the state of Cosmos fuzzing

    5 February 2024 at 14:00

    By Gustavo Grieco

    Cosmos is a platform enabling the creation of blockchains in Go (or other languages). Its reference implementation, Cosmos SDK, leverages strong fuzz testing extensively, following two approaches: smart fuzzing for low-level code, and dumb fuzzing for high-level simulation.

    In this blog post, we explain the differences between these approaches and show how we added smart fuzzing on top of the high-level simulation framework. As a bonus, our smart fuzzer integration led us to identify and fix three minor issues in Cosmos SDK.

    Laying low

    The first approach to Cosmos code fuzzing leverages well-known smart fuzzers such as AFL, go-fuzz, or Go native fuzzing for specific parts of the code. These tools rely on source code instrumentation to extract useful information to guide a fuzzing campaign. This is essential to explore the input space of a program efficiently.

    Using fuzzing for low-level testing of Go functions in Cosmos SDK is very straightforward. First, we select a suitable target function, usually stateless code, such as testing the parsing of normalized coins:

    func FuzzTypesParseCoin(f *testing.F) {
        f.Fuzz(func(t *testing.T, data []byte) {
            _, _ = types.ParseCoinNormalized(string(data))
        })
    }

    Figure 1: A small fuzz test for testing the parsing of normalized coins

Smart fuzzers can quickly find issues in stateless code like this; however, applying them only to low-level code will not uncover the more complex and interesting issues that arise during cosmos-sdk execution.

    Moving up!

    If we want to catch more interesting bugs, we need to go beyond low-level fuzz testing in Cosmos SDK. Fortunately, there is already a high-level approach for testing: this works from the top down, instead of the bottom up. Specifically, cosmos-sdk provides the Cosmos Blockchain Simulator, a high-level, end-to-end transaction fuzzer, to uncover issues in Cosmos applications.

This tool executes transactions built from random operations, starting either from a random genesis state or a predefined one. To get it to work, application developers must implement several important functions that generate both a random genesis state and random transactions. Fortunately for us, these functions are already implemented for all the cosmos-sdk features.

    For instance, to test the MsgSend operation from the x/nft module, the developers defined the SimulateMsgSend function to generate a random NFT transfer:

    // SimulateMsgSend generates a MsgSend with random values.
    func SimulateMsgSend(
            cdc *codec.ProtoCodec,
            ak nft.AccountKeeper,
            bk nft.BankKeeper,
            k keeper.Keeper,
    ) simtypes.Operation {
            return func(
                    r *rand.Rand, app *baseapp.BaseApp, ctx sdk.Context, accs []simtypes.Account, chainID string,
            ) (simtypes.OperationMsg, []simtypes.FutureOperation, error) {
                    sender, _ := simtypes.RandomAcc(r, accs)
                    receiver, _ := simtypes.RandomAcc(r, accs)
                    …

    Figure 2: Header of the SimulateMsgSend function from the x/nft module

While the simulator can produce end-to-end executions of transaction sequences, there is an important difference from smart fuzzers such as go-fuzz: when the simulator is invoked, it uses only a single source of randomness for producing values. This source is configured when the simulation starts:

    func SimulateFromSeed(
            tb testing.TB,
            w io.Writer,
            app *baseapp.BaseApp,
            appStateFn simulation.AppStateFn,
            randAccFn simulation.RandomAccountFn,
            ops WeightedOperations,
            blockedAddrs map[string]bool,
            config simulation.Config,
            cdc codec.JSONCodec,
    ) (stopEarly bool, exportedParams Params, err error) {
            // in case we have to end early, don't os.Exit so that we can run cleanup code.
            testingMode, _, b := getTestingMode(tb)
    
            fmt.Fprintf(w, "Starting SimulateFromSeed with randomness created with seed %d\n", int(config.Seed))
            r := rand.New(rand.NewSource(config.Seed))
            params := RandomParams(r)
            …

    Figure 3: Header of the SimulateFromSeed function

    Since the simulation mode will only loop through a number of purely random transactions, it is pure random testing (also called dumb fuzzing).

    Why don’t we have both?

    It turns out, there is a simple way to combine these approaches, allowing the native Go fuzzing engine to randomly explore the cosmos-sdk genesis, the generation of transactions, and the block creation. The first step is to create a fuzz test that invokes the simulator. We based this code on the unit tests in the same file:

    func FuzzFullAppSimulation(f *testing.F) {
        f.Fuzz(func(t *testing.T, input [] byte) {
           …
           config.ChainID = SimAppChainID
    
           appOptions := make(simtestutil.AppOptionsMap, 0)
           appOptions[flags.FlagHome] = DefaultNodeHome
           appOptions[server.FlagInvCheckPeriod] = simcli.FlagPeriodValue
            db := dbm.NewMemDB()
           logger := log.NewNopLogger()
    
           app := NewSimApp(logger, db, nil, true, appOptions, interBlockCacheOpt(), baseapp.SetChainID(SimAppChainID))
           require.Equal(t, "SimApp", app.Name())
    
           // run randomized simulation
           _,_, err := simulation.SimulateFromSeed(
                   t,
                   os.Stdout,
                   app.BaseApp,
                   simtestutil.AppStateFn(app.AppCodec(), app.SimulationManager(), app.DefaultGenesis()),
                   simtypes.RandomAccounts,
                   simtestutil.SimulationOperations(app, app.AppCodec(), config),
                   BlockedAddresses(),
                   config,
                   app.AppCodec(),
           )
    
           if err != nil {
                   panic(err)
           }
    })
}

    Figure 4: Template of a Go fuzz test running a full simulation of cosmos-sdk

    We still need a way to let the fuzzer control possible inputs. A simple approach would be to let the smart fuzzer directly control the seed of the random value generator:

    func FuzzFullAppSimulation(f *testing.F) {
        f.Fuzz(func(t *testing.T, input [] byte) {
           config.Seed = IntFromBytes(input)
           …

    Figure 5: A fuzz test that receives a single seed as input

    func SimulateFromSeed(
            …
            config simulation.Config,
            … 
    ) (stopEarly bool, exportedParams Params, err error) {
            …
            r := rand.New(rand.NewSource(config.Seed))
            …

    Figure 6: Lines modified in SimulateFromSeed to load a seed from the fuzz test

However, there is an important flaw in this: changing the seed directly gives the fuzzer only a very limited amount of control over the input, so its smart mutations will be very ineffective. Instead, we need to allow the fuzzer to control the values produced by the random number generator, but without refactoring every simulated function in every module. 😱

    Against all odds

The Go standard library already ships a variety of general-purpose functions and data structures. In that sense, Go has “batteries included.” In particular, it provides a random number generator in the math/rand package:

    // A Rand is a source of random numbers.
    type Rand struct {
        src Source
        s64 Source64 // non-nil if src is source64
    
        // readVal contains remainder of 63-bit integer used for bytes
        // generation during most recent Read call.
        // It is saved so next Read call can start where the previous
        // one finished.
        readVal int64
        // readPos indicates the number of low-order bytes of readVal
        // that are still valid.
        readPos int8
    }
    
    … 
    // Seed uses the provided seed value to initialize the generator to a deterministic state.
    // Seed should not be called concurrently with any other Rand method.
    func (r *Rand) Seed(seed int64) {
        if lk, ok := r.src.(*lockedSource); ok {
            lk.seedPos(seed, &r.readPos)
            return
        }
    
        r.src.Seed(seed)
        r.readPos = 0
    }
    
    // Int63 returns a non-negative pseudo-random 63-bit integer as an int64.
    func (r *Rand) Int63() int64 { return r.src.Int63() }
    
    // Uint32 returns a pseudo-random 32-bit value as a uint32.
    func (r *Rand) Uint32() uint32 { return uint32(r.Int63() >> 31) }
    …

    Figure 7: Rand data struct and some of its implementation code

However, we can’t easily provide an alternative implementation of Rand itself, because it was declared as a concrete type and not as an interface. But we can still provide our own implementation of its randomness source (Source/Source64):

    // A Source64 is a Source that can also generate
    // uniformly-distributed pseudo-random uint64 values in
    // the range [0, 1<<64) directly.
    // If a Rand r's underlying Source s implements Source64,
    // then r.Uint64 returns the result of one call to s.Uint64
    // instead of making two calls to s.Int63.
    type Source64 interface {
        Source
        Uint64() uint64
    }

    Figure 8: Source64 data type

    Let’s replace the default Source with a new one that uses the input from the fuzzer (e.g., an array of int64) as a deterministic source of randomness (arraySource):

    type arraySource struct {
        pos int
        arr []int64
        src *rand.Rand
    }

    // Uint64 returns a non-negative pseudo-random 64-bit integer as an uint64,
    // taking values from the fuzzer-provided array while any remain and
    // falling back to the wrapped random source afterwards.
    func (rng *arraySource) Uint64() uint64 {
        if rng.pos >= len(rng.arr) {
            return rng.src.Uint64()
        }
        val := rng.arr[rng.pos]
        rng.pos = rng.pos + 1
        if val < 0 {
            // The array stores signed integers; flip negative values so the
            // result is non-negative before converting.
            return uint64(-val)
        }
        return uint64(val)
    }


Figure 9: An implementation of Uint64() that draws values from our deterministic source of randomness (an array of signed integers)

    This new type of source either pops a number from the array or produces a random value from a standard random source if the array was fully consumed. This allows the fuzzer to continue even if all the deterministic values were consumed.
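
    For arraySource to satisfy math/rand’s Source64 interface, it also needs Int63 and Seed methods, and we need a way to turn the raw fuzzer input into the int64 array. The sketch below (which assumes the "encoding/binary" and "math/rand" imports) shows one way to do this; the newArrayRand helper and the little-endian decoding scheme are our own illustrative choices rather than the exact code from our patch.

    // Int63 derives a non-negative 63-bit integer from Uint64, mirroring how
    // math/rand's built-in sources behave.
    func (rng *arraySource) Int63() int64 {
        return int64(rng.Uint64() & (1<<63 - 1))
    }

    // Seed is required by rand.Source but is a no-op here: the output is fully
    // determined by the fuzzer-provided array plus the fallback source.
    func (rng *arraySource) Seed(seed int64) {}

    // newArrayRand decodes the fuzzer's []byte input into int64 values (8 bytes
    // each, little-endian) and wraps them in a *rand.Rand for the simulator.
    func newArrayRand(input []byte, fallbackSeed int64) *rand.Rand {
        vals := make([]int64, 0, len(input)/8)
        for len(input) >= 8 {
            vals = append(vals, int64(binary.LittleEndian.Uint64(input[:8])))
            input = input[8:]
        }
        src := &arraySource{
            arr: vals,
            src: rand.New(rand.NewSource(fallbackSeed)),
        }
        return rand.New(src)
    }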

    Ready, Set, Go!

    Once we have modified the code to properly control the random source, we can leverage Go fuzzing like this:

    $ go test -mod=readonly -run=_ -fuzz=FuzzFullAppSimulation -GenesisTime=1688995849 -Enabled=true -NumBlocks=2 -BlockSize=5 -Commit=true -Seed=0 -Period=1 -Verbose=1 -parallel=15
    fuzz: elapsed: 0s, gathering baseline coverage: 0/1 completed
    fuzz: elapsed: 1s, gathering baseline coverage: 1/1 completed, now fuzzing with 15 workers
    fuzz: elapsed: 3s, execs: 16 (5/sec), new interesting: 0 (total: 1)
    fuzz: elapsed: 6s, execs: 22 (2/sec), new interesting: 0 (total: 1)
    …
    fuzz: elapsed: 54s, execs: 23 (0/sec), new interesting: 0 (total: 1)
    fuzz: elapsed: 57s, execs: 23 (0/sec), new interesting: 0 (total: 1)
    fuzz: elapsed: 1m0s, execs: 23 (0/sec), new interesting: 0 (total: 1)
    fuzz: elapsed: 1m3s, execs: 23 (0/sec), new interesting: 5 (total: 6)
    fuzz: elapsed: 1m6s, execs: 30 (2/sec), new interesting: 10 (total: 11)
    fuzz: elapsed: 1m9s, execs: 38 (3/sec), new interesting: 11 (total: 12)

    Figure 10: A short fuzzing campaign using the new approach

After running this code for a few hours, we had collected a small trophy case of low-severity bugs: the three minor issues in Cosmos SDK mentioned at the beginning of this post.

We provided the Cosmos SDK team with our patch for improving the simulation tests, and we are in the process of discussing how to better integrate it into the master branch.

    Chaos Communication Congress (37C3) recap

    2 February 2024 at 14:00

    Last month, two of our engineers attended the 37th Chaos Communication Congress (37C3) in Hamburg, joining thousands of hackers who gather each year to exchange the latest research and achievements in technology and security. Unlike other tech conferences, this annual gathering focuses on the interaction of technology and society, covering such topics as politics, entertainment, art, sustainability—and, most importantly, security. At the first Congress in the 80s, hackers showcased weaknesses in banking applications over the German BTX system; this year’s theme, “Unlocked,” highlighted breaking technological barriers and exploring new frontiers in digital rights and privacy.

In this blog post, we will review our contributions to 37C3—spanning binary exploitation, binary analysis, and fuzzing—before highlighting several talks we attended that we recommend listening to.

    PWNing meetups

    Trail of Bits engineer Dominik Czarnota self-organized two sessions about PWNing, also known as binary exploitation. These meetups showcased Pwndbg and Pwntools, popular tools used during CTF competitions, reverse engineering, and vulnerability research work.

At the first session, Dominik presented Pwndbg, a plugin for GDB that enhances the debugging of low-level code by displaying useful context on each program stop. This context includes the state of the debugged program (its registers, executable code, and stack memory) and dereferenced pointers, which help the user understand the program’s behavior. The presentation showed some of Pwndbg’s features and commands, such as listing memory mappings (vmmap), displaying process information (procinfo), searching memory (search), finding pointers to specific memory mappings (p2p), identifying stack canary values (canary), and controlling the process execution (nextsyscall, stepuntilasm, etc.). The presentation concluded with a release of Pwndbg cheatsheets and details on upcoming features, such as tracking GOT function executions and glibc heap use-after-free analysis. These features have been developed as part of Trail of Bits’s winternship program, now in its thirteenth year of welcoming interns who spend time working and doing research on the industry’s most challenging problems.



At the second session, Arusekk and Peace-Maker showcased advanced features of Pwntools, a Swiss-army-knife Python library useful for exploit development. They demonstrated expert methods for receiving and sending data (e.g., io.recvregex or io.recvpred); command-line tricks when running exploit scripts (cool environment variables or arguments like DEBUG, NOASLR, or LOG_FILE that set certain config options); and other neat features like the libcdb command-line tool, the shellcraft module, and the ROP (return-oriented programming) helper. For those who missed it, the slides can be found here.

    Next generation fuzzing

In Fuzz Everything, Everywhere, All at Once, the AFL++ and LibAFL team showcased new features in the LibAFL fuzzer. They presented QEMU-based instrumentation to fuzz binary-only targets and used QEMU hooks to enable sanitizers that help find bugs. In addition to QASan—the team’s QEMU-based AddressSanitizer implementation—the team developed an injection sanitizer that goes beyond finding just memory corruption bugs. Using QEMU hooks, injections such as SQL, LDAP, XSS, or OS command injection can be detected by defining certain rules in a TOML configuration file. Examination of the config file suggests it should be easily extensible to other injections; we just need to know which functions to hook and which payloads to look for.

    Although memory corruption bugs will decline with the deployment of memory-safe languages like Rust, fuzzing will continue to play an important role in uncovering other bug classes like injections or logic bugs, so it’s great to see new tools created to detect them.

    This presentation’s Q&A session reminded us that oss-fuzz already has a SystemSanitizer that leverages the ptrace syscall, which helped to find a command injection vulnerability in the past.

    In the past, Trail of Bits has used LibAFL in our collaboration with Inria on an academic research project called tlspuffin. The goal of the project was to fuzz various TLS implementations, which uncovered several bugs in wolfSSL.

    Side channels everywhere

    In a talk titled Full AACSess: Exposing and exploiting AACSv2 UHD DRM for your viewing pleasure, Adam Batori presented a concept for side-channel attacks on Intel SGX. Since Trail of Bits frequently conducts audits on projects that use trusted execution environments like Intel SGX (e.g., Mobilecoin), this presentation was particularly intriguing to us.

After providing an overview of the history of DRM for physical media, Adam went into detail on how the team of researchers behind sgx.fail extracted cryptographic key material from the SGX enclave to break DRM on UHD Blu-ray disks, proving the feasibility of real-world side-channel attacks on secure enclaves. Along the way, he discussed many technological features of SGX.

    The work and talk prompted discussion about Intel’s decision to discontinue SGX on consumer hardware. Due to the high risk of side channels on low-cost consumer devices, we believe that using Intel SGX for DRM purposes is already dead on arrival. Side-channel attacks are just one example of the often-overlooked challenges that accompany the secure use of enclaves to protect data.

    New challenges: Reverse-engineering Rust

Trail of Bits engineers frequently audit software written in Rust. In Rust Binary Analysis, Feature by Feature, Ben Herzog discussed the compilation output of the Rust compiler. Understanding how Rust builds binaries is important, for example, to optimize Rust programs or to understand the interaction between safe and unsafe Rust code. The talk focused on the debug compilation mode to showcase how the Rust compiler generates code for iterating over ranges, how it handles iterators, and how it optimizes the layout of Rust enums. The presenter also noted that strings in Rust are not null-terminated, which can cause some reverse-engineering tools like Ghidra to produce hard-to-understand output.

    The talk author posed four questions that should be answered when encountering function calls related to traits:

    • What is the name of the function being called (e.g., next)?
    • On what type is the function defined (e.g., Values<String, Person>)?
    • Which type is returned from the function (e.g., Option)?
    • What trait is the function part of (e.g., Iterator<Type=Person>)?

    More details can be found in the blog post by Ben Herzog.

    Proprietary cryptography is considered harmful

Under the name TETRA:BURST, researchers disclosed multiple vulnerabilities in the TETRA radio protocol, which is used by government agencies, police, the military, and critical infrastructure across Europe and other areas. It is striking how proprietary cryptography is still the default in some industries: hiding the specification from security researchers by requiring them to sign an NDA greatly limits a system’s reviewability.

Due to export controls, several classes of algorithms exist in TETRA. One of the older ones, TEA1, uses a key length of only 32 bits. Even though the specifiers no longer recommend using it, TEA1 is still actively deployed in the field, which is especially problematic given that these weak algorithms are counted on to protect critical infrastructure.

    The researchers demonstrated the exploitability of the vulnerabilities by acquiring radio hardware from online resellers.

    Are you sure you own your train? Do you own your car?

    In Breaking “DRM” in Polish trains, researchers reported the challenges they encountered after they were recruited by an independent train repair company to determine why some trains no longer operated after being serviced.
Using reverse engineering, the researchers uncovered several anti-features in the trains that made them stop working in various situations (e.g., after they didn’t move for a certain time or when they were located at GPS locations of competitors’ service shops). The talk covers interesting technical details about train software and how the researchers reverse-engineered the firmware, and it questions the extent to which users should have control over the vehicles or devices they own.

    What can we learn from hackers as developers and auditors?

Hackers possess a unique problem-solving mindset, showing developers and auditors the importance of creative and unconventional thinking in cybersecurity. The event highlighted the necessity of securing systems correctly, starting with a well-understood threat model. Incorrect or proprietary approaches that rely on obfuscation do not adequately protect the end products. Controls such as hiding cryptographic primitives behind an NDA only obfuscate how the protocol works; they do not make the system more secure, and they make security researchers’ jobs harder.

    Emphasizing continuous learning, the congress demonstrated the ever-evolving nature of cybersecurity, urging professionals to stay abreast of the latest threats and technologies. Ethical considerations were a focal point, stressing the responsibility of developers and auditors to respect user privacy and data security in their work.

    The collaborative spirit of the hacker community, as seen at 37C3, serves as a model for open communication and mutual learning within the tech industry. At Trail of Bits, we are committed to demonstrating these values by sharing knowledge publicly through publishing blog posts like this one, resources like the Testing Handbook that help developers secure their code, and documentation about our research into zero-knowledge proofs.

    Closing words

We highly recommend attending 37C3 in person, even though the event is unfortunately scheduled between Christmas and New Year’s and most talks are live-streamed and available online. The congress includes many self-organized sessions, workshops, and assemblies, making it especially helpful for security researchers. We had initially planned to disclose our recently published LeftoverLocals bug, a vulnerability that affects notable GPU vendors like AMD, Qualcomm, and Apple, at 37C3, but we held off our release date to give GPU vendors more time to fix the bug. The bug disclosure was finally published on January 16; we may report our experience finding and disclosing the bug at next year’s 38C3!

    Introducing DIFFER, a new tool for testing and validating transformed programs

    31 January 2024 at 14:30

    By Michael Brown

    We recently released a new differential testing tool, called DIFFER, for finding bugs and soundness violations in transformed programs. DIFFER combines elements from differential, regression, and fuzz testing to help users find bugs in programs that have been altered by software rewriting, debloating, and hardening tools. We used DIFFER to evaluate 10 software debloating tools, and it discovered debloating failures or soundness violations in 71% of the transformed programs produced by these tools.

    DIFFER fills a critical need in post-transformation software validation. Program transformation tools usually leave this task entirely to users, who typically have few (if any) tools beyond regression testing via existing unit/integration tests and fuzzers. These approaches do not naturally support testing transformed programs against their original versions, which can allow subtle and novel bugs to find their way into the modified programs.

    We’ll provide some background research that motivated us to create DIFFER, describe how it works in more detail, and discuss its future.

    If you prefer to go straight to the code, check out DIFFER on GitHub.

    Background

    Software transformation has been a hot research area over the past decade and has primarily been motivated by the need to secure legacy software. In many cases, this must be done without the software’s source code (binary only) because it has been lost, is vendor-locked, or cannot be rebuilt due to an obsolete build chain. Among the more popular research topics that have emerged in this area are binary lifting, recompiling, rewriting, patching, hardening, and debloating.

    While tools built to accomplish these goals have demonstrated some successes, they carry significant risks. When compilers lower source code to binaries, they discard contextual information once it is no longer needed. Once a program has been lowered to binary, the contextual information necessary to safely modify the original program generally cannot be fully recovered. As a result, tools that modify program binaries directly may inadvertently break them and introduce new bugs and vulnerabilities.

    While DIFFER is application-agnostic, we originally built this tool to help us find bugs in programs that have had unnecessary features removed with a debloating tool (e.g., Carve, Trimmer, Razor). In general, software debloaters try to minimize a program’s attack surface by removing unnecessary code that may contain latent vulnerabilities or be reused by an attacker using code-reuse exploit patterns. Debloating tools typically perform an analysis pass over the program to map features to the code necessary to execute them. These mappings are then used to cut code that corresponds to features the user doesn’t want. However, these cuts will likely be imprecise because generating the mappings relies on imprecise analysis steps like binary recovery. As a result, new bugs and vulnerabilities can be introduced into debloated programs during cutting, which is exactly what we have designed DIFFER to detect.

    How does DIFFER work?

    At a high level, DIFFER (shown in figure 1) is used to test an unmodified version of the program against one or more modified variants of the program. DIFFER allows users to specify seed inputs that correspond to both unmodified and modified program behaviors and features. It then runs the original program and the transformed variants with these inputs and compares the outputs. Additionally, DIFFER supports template-based mutation fuzzing of these seed inputs. By providing mutation templates, DIFFER can maximize its coverage of the input space and avoid missing bugs (i.e., false negatives).

    DIFFER expects to see the same outputs for the original and variant programs when given inputs that correspond to unmodified features. Conversely, it expects to see different outputs when it executes the programs with inputs corresponding to modified features. If DIFFER detects unexpected matching, differing, or crashing outputs, it reports them to the user. These reports help the user identify errors in the modified program resulting from the transformation process or its configuration.

    Figure 1: Overview of DIFFER
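
    DIFFER itself is implemented in Python and driven by configuration files; purely to illustrate the core differential loop, here is a minimal Go sketch (the binary paths, the stdin-based input model, and the function names are our own assumptions, not DIFFER’s API):

    package main

    import (
        "bytes"
        "fmt"
        "os"
        "os/exec"
    )

    // runOnce executes a binary with the given input on stdin and returns its
    // stdout and exit code.
    func runOnce(path string, input []byte) ([]byte, int, error) {
        cmd := exec.Command(path)
        cmd.Stdin = bytes.NewReader(input)
        var out bytes.Buffer
        cmd.Stdout = &out
        err := cmd.Run()
        if exitErr, ok := err.(*exec.ExitError); ok {
            return out.Bytes(), exitErr.ExitCode(), nil
        }
        return out.Bytes(), 0, err
    }

    func main() {
        // Hypothetical paths: the original binary and a transformed variant.
        input := []byte("some seed input\n")
        origOut, origCode, err1 := runOnce("./target-original", input)
        varOut, varCode, err2 := runOnce("./target-variant", input)
        if err1 != nil || err2 != nil {
            fmt.Fprintln(os.Stderr, "execution failed:", err1, err2)
            os.Exit(1)
        }
        // For an unmodified feature we expect identical outputs; a mismatch
        // is reported for the user to triage.
        if origCode != varCode || !bytes.Equal(origOut, varOut) {
            fmt.Println("divergence detected: possible transformation bug")
        }
    }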

    When configuring DIFFER, the user selects one or more comparators to use when comparing outputs. While DIFFER provides many built-in comparators that check basic outputs such as return codes, console text, and output files, more advanced comparators are often needed. For this purpose, DIFFER allows users to add custom comparators for complex outputs like packet captures. Custom comparators are also useful for reducing false-positive reports by defining allowable differences in outputs (such as timestamps in console output). Our open-source release of DIFFER contains many useful comparator implementations to help users easily write their own comparators.
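
    Custom comparators essentially define what counts as an allowable difference. As a toy illustration of the idea (in Go rather than DIFFER’s Python comparator API), a console comparator that masks timestamps before comparing could look like this:

    package main

    import (
        "fmt"
        "regexp"
    )

    // timestampRe matches HH:MM:SS timestamps, a common source of benign
    // differences in console output.
    var timestampRe = regexp.MustCompile(`\b\d{2}:\d{2}:\d{2}\b`)

    // consoleEqual treats two console outputs as equivalent if they match
    // after masking timestamps, avoiding false-positive reports.
    func consoleEqual(a, b string) bool {
        mask := func(s string) string { return timestampRe.ReplaceAllString(s, "<TS>") }
        return mask(a) == mask(b)
    }

    func main() {
        fmt.Println(consoleEqual("12:01:13 server started", "12:04:59 server started")) // true
        fmt.Println(consoleEqual("server started", "server crashed"))                   // false
    }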

    However, DIFFER does not and cannot provide formal guarantees of soundness in transformation tools or the modified programs they produce. Like other dynamic analysis testing approaches, DIFFER cannot exhaustively test the input space for complex programs in the general case.

    Use case: evaluating software debloaters

    In a recent research study we conducted in collaboration with our friends at GrammaTech, we used DIFFER to evaluate debloated programs created by 10 different software debloating tools. We used these tools to remove unnecessary features from 20 different programs of varying size, complexity, and purpose. Collectively, the tools created 90 debloated variant programs that we then validated with DIFFER. DIFFER discovered that 39 (~43%) of these variants still had features that debloating tools failed to remove. Even worse, DIFFER found that 25 (~28%) of the variants either crashed or produced incorrect outputs in retained features after debloating.

    By discovering these failures, DIFFER has proven itself as a useful post-transformation validation tool. Although this study was focused on debloating transformations, we want to emphasize that DIFFER is general enough to test other transformation tools such as those used for software hardening (e.g., CFI, stack protections), translation (e.g., C-to-Rust transformers), and surrogacy (e.g., ML surrogate generators).

    What’s next?

    With DIFFER now available as open-source software, we invite the security research community to use, extend, and help maintain DIFFER via pull requests. We have several specific improvements planned as we continue to research and develop DIFFER, including the following:

    • Support running binaries in Docker containers to reduce environmental burdens.
    • Add new built-in comparators.
    • Add support for targets that require superuser privileges.
    • Support monitoring multiple processes that make up distributed systems.
    • Add runtime comparators (via instrumentation, etc.) for “deep” equivalence checks.

    Acknowledgements

    This material is based on work supported by the Office of Naval Research (ONR) under Contract No. N00014-21-C-1032. Any opinions, findings and conclusions, or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the ONR.

    Enhancing trust for SGX enclaves

    26 January 2024 at 14:00

    By Artur Cygan

    Creating reproducible builds for SGX enclaves used in privacy-oriented deployments is a difficult task that lacks a convenient and robust solution. We propose using Nix to achieve reproducible and transparent enclave builds so that anyone can audit whether the enclave is running the source code it claims, thereby enhancing the security of SGX systems.

    In this blog post, we will explain how we enhanced trust for SGX enclaves through the following steps:

    • Analyzed reproducible builds of Signal and MobileCoin enclaves
    • Analyzed a reproducible SGX SDK build from Intel
    • Packaged the SGX SDK in Nixpkgs
    • Prepared a reproducible-sgx repository demonstrating how to build SGX enclaves with Nix

    Background

    Introduced in 2015, Intel SGX is an implementation of confidential (or trusted) computing. More specifically, it is a trusted execution environment (TEE) that allows users to run confidential computations on a remote computer owned and maintained by an untrusted party. Users trust the manufacturer (Intel, in this case) of a piece of hardware (CPU) to protect the execution environment from tampering, even by the highest privilege–level code, such as kernel malware. SGX code and data live in special encrypted and authenticated memory areas called enclaves.

During my work at Trail of Bits, I observed a poorly addressed trust gap in systems that use SGX, where the user of an enclave doesn’t necessarily trust the enclave author. Instead, the user is free to audit the enclave’s open-source code to verify its functionality and security. This setting can be observed, for instance, in privacy-oriented deployments such as Signal’s contact discovery service or MobileCoin’s consensus protocol. To validate trust, the user must check whether the enclave was built from trusted code. Unfortunately, this turns out to be a difficult task because the builds tend to be difficult to reproduce and rely on a substantial amount of precompiled binary code. In practice, hardly anyone verifies the builds, so users have no option but to trust the enclave author.

    To give another perspective—a similar situation happens in the blockchain world, where smart contracts are deployed as bytecode. For instance, Etherscan will try to reproduce on-chain EVM bytecode to attest that it was compiled from the claimed Solidity source code. Users are free to perform the same operation if they don’t trust Etherscan.

    A solution to this problem is to build SGX enclaves in a reproducible and transparent way so that multiple parties can independently arrive at the same result and audit the build for any supply chain–related issues. To achieve this goal, I helped port Intel’s SGX SDK to Nixpkgs, which allows building SGX enclaves with the Nix package manager in a fully reproducible way so any user can verify that the build is based on trusted code.

    To see how reproducible builds complete the trust chain, it is important to first understand what guarantees SGX provides.

    How does an enclave prove its identity?

Apart from the above-mentioned TEE protection (nothing leaks out and execution can’t be altered), SGX can remotely prove the enclave’s identity, including its code hash, signature, and runtime configuration. This feature is called remote attestation and can be a bit foreign for someone unfamiliar with this type of technology.

When an enclave is loaded, its initial state (including code) is hashed by the CPU into a measurement hash, also known as MRENCLAVE. The hash changes only if the enclave’s code changes. This hash, along with other data such as the signer and environment details, is placed in a special memory area accessible only to the SGX implementation. The enclave can ask the CPU to produce a report containing all this data, including a piece of enclave-defined data (called report_data), and then pass it to the special Intel-signed quoting enclave, which signs the report (called a quote from now on) so that it can be delivered to the remote party and verified.

    Next, the verifier checks the quote’s authenticity with Intel and the relevant information from the quote. Although there are a few additional checks and steps at this point, in our case, the most important thing to check is the measurement hash, which is a key component of trust verification.

    What do we verify the hash against? The simplest solution is to hard code a trusted MRENCLAVE value into the client application itself. This solution is used, for instance, by Signal, where MRENCLAVE is placed in the client’s build config and verified against the hash from the signed quote sent by the Signal server. Bundling the client and MRENCLAVE makes sense; after all, we need to audit and trust the client application code too. The downside is that the client application has to be rebuilt and re-released when the enclave code changes. If the enclave modifications are expected to be frequent or if it is important to quickly move clients to another enclave—for instance, in the event of a security issue—clients can use a more dynamic approach and fetch MRENCLAVE values from trusted third parties.

    Secure communication channel

SGX can prove the identity of an enclave and a piece of report_data that was produced by it, but it’s up to the enclave and verifier to establish a trusted and secure communication channel. Since SGX enclaves are flexible and can freely communicate with the outside world over the network through ECALLs and OCALLs, SGX itself doesn’t impose any specific protocol or implementation for the channel. The enclave developer is free to decide, as long as the channel is encrypted, is authenticated, and terminates inside the enclave.

    For instance, the SGX SDK implements an example of an authenticated key exchange scheme for remote attestation. However, the scheme assumes a DRM-like system where the enclave’s signer is trusted and the server’s public key is hard coded in the enclave’s code, so it’s unsuitable for use in a privacy-oriented deployment of SGX such as Signal.

    If we don’t trust the enclave’s author, we can leverage the report_data to establish such a channel. This is where the SGX guarantees essentially end, and from now on, we have to trust the enclave’s source code to do the right thing. This fact is not obvious at first but becomes evident if we look, for instance, at the RA-TLS paper on how to establish a secure TLS channel that terminates inside an enclave:

    The enclave generates a new public-private RA-TLS key pair at every startup. The RA-TLS key need not be persisted since generating a fresh key on startup is reasonably cheap. Not persisting the key reduces the key’s exposure and avoids common problems related to persistence such as state rollback protection. Interested parties can inspect the source code to convince themselves that the key is never exposed outside of the enclave.

    To maintain the trust chain, RA-TLS uses the report_data from the quote that commits to the enclave’s public key hash. A similar method can be observed in the Signal protocol implementing Noise Pipes and committing to the handshake hash in the report_data.
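
    Putting the two checks together, the verifier’s job (after the quote’s signature has been validated with Intel) reduces to comparing the measurement against a trusted value and checking that report_data commits to the channel key. Below is a minimal conceptual sketch in Go, assuming a SHA-256 commitment as in RA-TLS; the constant and function names are placeholders of ours, not an SGX SDK API.

    package attestation

    import (
        "bytes"
        "crypto/sha256"
        "errors"
    )

    // trustedMRENCLAVE would be the measurement hash baked into the client at
    // build time (a zeroed placeholder here, not a real measurement).
    var trustedMRENCLAVE = make([]byte, 32)

    // verifyQuote checks that (1) the enclave code is the audited one and
    // (2) report_data commits to the public key used for the secure channel.
    func verifyQuote(mrenclave, reportData, channelPubKey []byte) error {
        if !bytes.Equal(mrenclave, trustedMRENCLAVE) {
            return errors.New("unexpected enclave measurement")
        }
        digest := sha256.Sum256(channelPubKey)
        if len(reportData) < len(digest) || !bytes.Equal(reportData[:len(digest)], digest[:]) {
            return errors.New("report_data does not commit to the channel key")
        }
        return nil
    }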

    SGX encrypts and authenticates the enclave’s memory, but it’s up to the code running in the enclave to protect the data. Nothing stops the enclave code from disclosing any information to the outside world. If we don’t know what code runs in the enclave, anything can happen.

    Fortunately, we know the code because it’s open source, but how do we make sure that the code at a particular Git commit maps to the MRENCLAVE hash an enclave is presenting? We have to reproduce the enclave build, calculate its MRENCLAVE hash, and compare it with the hash obtained from the quote. If the build can’t be reproduced, our remaining options are either to trust someone who confirmed the enclave is safe to use or to audit the enclave’s binary code.

    Why are reproducible builds hard?

    The reproducibility type we care about is bit-for-bit reproducibility. Some software might be semantically identical despite minor differences in their artifacts. SGX enclaves are built into .dll or .so files using the SGX SDK and must be signed with the author’s RSA key. Since we calculate hashes of artifacts, even a one-bit difference will produce a different hash. We might get away with minor differences, as the measurement process omits some details from the enclave executable file (such as the signer), but having full file reproducibility is desirable. This is a non-trivial task and can be implemented in multiple ways.

    Both Signal and MobileCoin treat this task seriously and aim to provide a reproducible build for their enclaves. For example, Signal claims the following:

    The enclave code builds reproducibly, so anyone can verify that the published source code corresponds to the MRENCLAVE value of the remote enclave.

    The initial version of Signal’s contact discovery service build (archived early 2023) is based on Debian and uses a .buildinfo file to lock down the system dependencies; however, locking is done based on versions rather than hashes. This is a limitation of Debian, as we read on the BuildinfoFiles page. The SGX SDK and a few other software packages are built from sources fetched without checking the hash of downloaded data. While those are not necessarily red flags, more trust than necessary is placed in third parties (Debian and GitHub).

    From the README, it is unclear how the .buildinfo file is produced because there is no source for the mentioned derebuild.pl script. Most likely, the .buildinfo file is generated during the original build of the enclave’s Debian package and checked into the repository. It is unclear whether this mechanism guarantees capture of all the build inputs and doesn’t let any implicit dependencies fall through the cracks. Unfortunately, I couldn’t reproduce the build because both the Docker and Debian instructions from the README failed, and shortly after that, I noticed that Signal moved to a new iteration of the contact discovery service.

    The current version of Signal’s contact discovery service build is slightly different. Although I didn’t test the build, it’s based on a Docker image that suffers from similar issues such as installing dependencies from a system package manager with network access, which doesn’t guarantee reproducibility.

    Another example is MobileCoin, which provides a prebuilt Docker image with a build environment for the enclave. Building the same image from Dockerfile most likely won’t result in a reproducible hash we can validate, so the image provided by MobileCoin must be used to reproduce the enclave. The problem here is that it’s quite difficult to audit Docker images that are hundreds of megabytes large, and we essentially need to trust MobileCoin that the image is safe.

    Docker is a popular choice for reproducing environments, but it doesn’t come with any tools to support bit-for-bit reproducibility and instead focuses on delivering functionally similar environments. A complex Docker image might reproduce the build for a limited time, but the builds will inevitably diverge, if no special care is taken, due to filesystem timestamps, randomness, and unrestricted network access.

    Why Nix can do it better

    Nix is a cross-platform source-based package manager that features the Nix language to describe packages and a large collection of community-maintained packages called Nixpkgs. NixOS is a Linux distribution built on top of Nix and Nixpkgs, and is designed from the ground up to focus on reproducibility. It is very different from the conventional package managers. For instance, it doesn’t install anything into regular system paths like /bin or /usr/lib. Instead, it uses its own /nix/store directory and symlinks to the packages installed there. Every package is prefixed with a hash capturing all the build inputs like dependency graph or compilation options. This means that it is possible to have the same package installed in multiple variants differing only by build options; from Nix’s perspective, it is a different package.

    Nix does a great job at surfacing most of the issues that could render the build unreproducible. For example, a Nix build will most likely break during development when an impurity (i.e., a dependency that is not explicitly declared as input to the build) is encountered, forcing the developer to fix it. Impurities are often captured from the environment, which includes environment variables or hard-coded system-wide directories like /usr/lib. Nix aims to address all those issues by sandboxing the builds and fixing the filesystem timestamps. Nix also requires all inputs that are fetched from the network to be pinned. On top of that, Nixpkgs contain many patches (gnumake, for instance) to fix reproducibility issues in common software such as compilers or build systems.

    Reducing impurities increases the chance of build reproducibility, which in turn increases the trust in source-to-artifact correspondence. However, ultimately, reproducibility is not something that can be proven or guaranteed. Under the hood, a typical Nix build runs compilers that could rely on some source of randomness that could leak into the compiled artifacts. Ideally, reproducibility should be tracked on an ongoing basis. An example of such a setup is the r13y.com site, which tracks reproducibility of the NixOS image itself.

    Apart from strong reproducibility properties, Nix also shines when it comes to dependency transparency. While Nix caches the build outputs by default, every package can be built from source, and the dependency graph is rooted in an easily auditable stage0 bootstrap, which reduces trust in precompiled binary code to the minimum.

    Issues in Intel’s use of Nix

    Remember the quoting enclave that signs attestation reports? To deliver all SGX features, Intel needed to create a set of privileged architectural enclaves, signed by Intel, that perform tasks too complex to implement in CPU microcode. The quoting enclave is one of them. These enclaves are a critical piece of SGX because they have access to hardware keys burned into the CPU and perform trusted tasks such as remote attestation. However, a bug in the quoting enclave’s code could invalidate the security guarantees of the whole remote attestation protocol.

    Being aware of that, Intel prepared a reproducible Nix-based build that builds SGX SDK (required to build any enclave) and all architectural enclaves. The solution uses Nix inside a Docker container. I was able to reproduce the build, but after a closer examination, I identified a number of issues with it.

    First, the build doesn’t pin the Docker image or the SDK source hashes. The SDK can be built from source, but the architectural enclaves build downloads a precompiled SDK installer from Intel and doesn’t even check the hash. Although Nix is used, there are many steps that happen outside the Nix build.

    The Nix part of the build is unfortunately incorrect and doesn’t deliver much value. The dependencies are hand picked from the prebuilt cache, which circumvents the build transparency Nix provides. The build runs in a nix-shell that should be used only for development purposes. The shell doesn’t provide the same sandboxing features as the regular Nix build and allows different kinds of impurities. In fact, I discovered some impurities when porting the SDK build to Nixpkgs. Some of those issues were also noticed by another researcher but remain unaddressed.

    Bringing SGX SDK to Nixpkgs

    I concluded that the SGX SDK should belong to Nixpkgs to achieve truly reproducible and transparent enclave builds. It turned out there was already an ongoing effort, which I joined and helped finish. The work has been expanded and maintained by the community since then. Now, any SGX enclave can be easily built with Nix by using the sgx-sdk package. I hope that once this solution matures, Nixpkgs maintainers can maintain it together with Intel and bring it into the official SGX SDK repository.

    We prepared the reproducible-sgx GitHub repository to show how to build Intel’s sample enclaves with Nix and the ported SDK. While this shows the basics, SGX enclaves can be almost arbitrarily complex and use different libraries and programming languages. If you wish to see another example, feel free to open an issue or a pull request.

    In this blog post, we discussed only a slice of the possible security issues concerning SGX enclaves. For example, numerous security side-channel attacks have been demonstrated on SGX, such as the recent attack on Blu-ray DRM. If you need help with security of a system that uses SGX or Nix, don’t hesitate to contact us.


    We build X.509 chains so you don’t have to

    25 January 2024 at 14:00

    By William Woodruff

    For the past eight months, Trail of Bits has worked with the Python Cryptographic Authority to build cryptography-x509-verification, a brand-new, pure-Rust implementation of the X.509 path validation algorithm that TLS and other encryption and authentication protocols are built on. Our implementation is fast, standards-conforming, and memory-safe, giving the Python ecosystem a modern alternative to OpenSSL’s misuse- and vulnerability-prone X.509 APIs for HTTPS certificate verification, among other protocols. This is a foundational security improvement that will benefit every Python network programmer and, consequently, the internet as a whole.

    Our implementation has been exposed as a Python API and is included in Cryptography’s 42.0.0 release series, meaning that Python developers can take advantage of it today! Here’s an example usage, demonstrating its interaction with certifi as a root CA bundle:

    As part of our design we also developed x509-limbo, a test vector and harness suite for evaluating the standards conformance and consistent behavior of various X.509 path validation implementations. x509-limbo is permissively licensed and reusable, and has already found validation differentials across Go’s crypto/x509, OpenSSL, and two popular pre-existing Rust X.509 validators.

    X.509 path validation

    X.509 and path validation are both too expansive to reasonably summarize in a single post. Instead, we’ll grossly oversimplify X.509 to two basic facts:

    1. X.509 is a certificate format: it binds a public key and some metadata for that key (what it can be used for, the subject it identifies) to a signature, which is produced by a private key. The subject of a certificate can be a domain name, or some other relevant identifier.
    2. Verifying an X.509 certificate entails obtaining the public key for its signature, using that public key to check the signature, and (finally) validating the associated metadata against a set of validity rules (sometimes called an X.509 profile). In the context of the public web, there are two profiles that matter: RFC 5280 and the CA/B Forum Baseline Requirements (“CABF BRs”).

    These two facts make X.509 certificates chainable: an X.509 certificate’s signature can be verified by finding the parent certificate containing the appropriate public key; the parent, in turn, has its own parent. This chain building process continues until an a priori trusted certificate is encountered, typically because of trust asserted in the host OS itself (which maintains a pre-configured set of trusted certificates).

    Chain building (also called “path validation”) is the cornerstone of TLS’s authentication guarantees: it allows a web server (like x509-limbo.com) to serve an untrusted “leaf” certificate along with zero or more untrusted parents (called intermediates), which must ultimately chain to a root certificate that the connecting client already knows and trusts.

    As a visualization, here is a valid certificate chain for x509-limbo.com, with arrows representing the “signed by” relationship:

    In this scenario, x509-limbo.com serves us two initially untrusted certificates: the leaf certificate for x509-limbo.com itself, along with an intermediate (Let’s Encrypt R3) that signs for the leaf.

    The intermediate in turn is signed for by a root certificate (ISRG Root X1) that’s already trusted (by virtue of being in our OS or runtime trust store), giving us confidence in the complete chain, and thus the leaf’s public key for the purposes of TLS session initiation.
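
    Our new verifier is exposed through a Python API, but the chain-building step is conceptually the same across implementations. For a concrete feel, here is how this scenario looks with Go’s crypto/x509 (one of the implementations we compare against with x509-limbo), assuming the leaf and intermediate certificates have already been parsed; the function name is ours:

    package chains

    import "crypto/x509"

    // buildChains asks crypto/x509 to construct all valid chains from the
    // served leaf and intermediates up to the trusted roots (the system trust
    // store when VerifyOptions.Roots is left nil).
    func buildChains(leaf *x509.Certificate, intermediates []*x509.Certificate) ([][]*x509.Certificate, error) {
        pool := x509.NewCertPool()
        for _, c := range intermediates {
            pool.AddCert(c)
        }
        return leaf.Verify(x509.VerifyOptions{
            DNSName:       "x509-limbo.com", // bind the chain to the expected subject
            Intermediates: pool,
        })
    }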

    What can go wrong?

    The above explanation of X.509 and path validation paints a bucolic picture: to build the chain, we simply iterate through our parent candidates at each step, terminating on success once we reach a root of trust or with failure upon exhausting all candidates. Simple, right?

    Unfortunately, the reality is far messier:

    • The abstraction above (“one certificate, one public key”) is a gross oversimplification. In reality, a single public key (corresponding to a single “logical” issuing authority) may have multiple “physical” certificates, for cross-issuance purposes.
    • Because the trusted set is defined by the host OS or language runtime, there is no “one true” chain for a given leaf certificate. In reality, most (leaf, [intermediates]) tuples have several candidate solutions, of which any is a valid chain.
      • This is the “why” for the first bullet: a web server can’t guarantee that any particular client has any particular set of trusted roots, so intermediate issuers typically have multiple certificates for a single public key to maximize the likelihood of a successfully built chain.
    • Not all certificates are made equal: certificates (including different “physical” certificates for the same “logical” issuing authority) can contain constraints that prevent otherwise valid paths: name restrictions, overall length restrictions, usage restrictions, and so forth. In other words, a correct path building implementation must be able to backtrack after encountering a constraint that eliminates the current candidate chain.
    • The X.509 profile itself can impose constraints on both the overall chain and its constituent members: the CABF BRs, for example, forbid known-weak signature algorithms and public key types, and many path validation libraries additionally allow users to constrain valid chain constructions below a configurable maximum length.

    In practice, these (non-exhaustive) complications mean that our simple recursive linear scan for chain building is really a depth-first graph search with both static and dynamic constraints. Failing to treat it as such has catastrophic consequences:

    • Failing to implement a dynamic search typically results in overly conservative chain constructions, sometimes with Internet-breaking outcomes. OpenSSL 1.0.x’s inability to build the “chain of pain” in 2020 is one recent example of this.
    • Failing to honor the interior constraints and profile-wide certificate requirements can result in overly permissive chain constructions. CVE-2021-3450 is one recent example of this, causing some configurations of OpenSSL 1.1.x to accept chains built with non-CA certificates.

    Consequently, building an X.509 path validator that is both correct and maximal (in the sense of finding any valid chain) is of the utmost importance, both for availability and security.
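
    To make that search structure concrete, here is a deliberately simplified sketch of chain building as a depth-first search with backtracking, written in Python. The Cert type, the single “CA bit” check, and the toy depth limit are stand-ins of our own invention; a real validator verifies signatures and applies the full set of RFC 5280 and CABF BR constraints at every step.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Cert:
        subject: str
        issuer: str          # subject of the certificate that signed this one
        is_ca: bool = True   # toy stand-in for a single static constraint

    def build_chain(cert, untrusted, roots, depth=0, max_depth=8):
        """Depth-first search with backtracking: return one chain ending in a
        trusted root, or None once every candidate path has been exhausted."""
        if depth > max_depth:
            return None
        # Success: the current certificate chains directly to a trusted root.
        for root in roots:
            if cert.issuer == root.subject:
                return [cert, root]
        # Otherwise, try each untrusted parent candidate in turn. A candidate
        # that fails a constraint is skipped; a dead end falls through to the
        # next candidate -- that fall-through is the backtracking step.
        for parent in untrusted:
            if cert.issuer == parent.subject and parent.is_ca:
                rest = build_chain(parent, untrusted - {parent}, roots, depth + 1, max_depth)
                if rest is not None:
                    return [cert] + rest
        return None

    leaf = Cert("x509-limbo.com", "Let's Encrypt R3", is_ca=False)
    intermediate = Cert("Let's Encrypt R3", "ISRG Root X1")
    root = Cert("ISRG Root X1", "ISRG Root X1")
    print(build_chain(leaf, frozenset({intermediate}), {root}))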

    Quirks, surprises, and ambiguities

    Despite underpinning the Web PKI and other critical pieces of Internet infrastructure, there are relatively few independent implementations of X.509 path validation: most platforms and languages reuse one of a small handful of common implementations (OpenSSL and its forks, NSS, Go’s crypto/x509, GnuTLS, etc.) or the host OS’s implementation (CryptoAPI on Windows, Security on macOS). This manifests as a few recurring quirks and ambiguities:

    • A lack of implementation diversity means that mistakes and design decisions (such as overly or insufficiently conservative profile checks) leak into other implementations: users complain when a PKI deployment that was only tested on OpenSSL fails to work against crypto/x509, so implementations frequently bend their specification adherence to accommodate real-world certificates.
    • The specifications often mandate surprising behavior that (virtually) no client implements correctly. RFC 5280, for example, stipulates that path length and name constraints do not apply to self-issued intermediates, but this is widely ignored in practice.
    • Because the specifications themselves are so infrequently interpreted, they contain still-unresolved ambiguities: treating roots as “trust anchors” versus policy-bearing certificates, handling of serial numbers that are 20 bytes long but DER-encoded with 21 bytes, and so forth.

    Our implementation needed to handle each of these families of quirks. To do so consistently, we leaned on four basic strategies:

    • Test first, then implement: To give ourselves confidence in our designs, we built x509-limbo and pre-validated it against other implementations. This gave us both a coverage baseline for our own implementation, and empirical justification for relaxing various policy-level checks, where necessary.
    • Keep everything in Rust: Rust’s performance, strong type system and safety properties meant that we could make rapid iterations to our design while focusing on algorithmic correctness rather than memory safety. It certainly didn’t hurt that PyCA Cryptography’s X.509 parsing is already done in Rust, of course.
    • Obey Sleevi’s Laws: Our implementation treats path construction and path validation as a single unified step with no “one” true chain, meaning that the entire graph is always searched before giving up and returning a failure to the user.
    • Compromise where necessary: As mentioned above, implementations frequently maintain compatibility with OpenSSL, even where doing so violates the profiles defined in RFC 5280 and the CABF BRs. This situation has improved dramatically over the years (and improvements have accelerated in pace, as certificate issuance periods have shortened on the Web PKI), but some compromises are still necessary.

    Looking forward

    Our initial implementation is production-ready and comes in at around 2,500 lines of Rust, not counting the relatively small Python-only API surfaces or x509-limbo.
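
    For a sense of the user-facing surface, here is a minimal sketch of driving the Python-level verification API. The module and method names follow the cryptography 42.x x509.verification API as we understand it; the file paths, hostname, and chain depth below are placeholders.

    from cryptography.x509 import DNSName, load_pem_x509_certificates
    from cryptography.x509.verification import PolicyBuilder, Store

    # Placeholder inputs: a PEM bundle of trusted roots, and the server-provided
    # leaf followed by its untrusted intermediates.
    with open("roots.pem", "rb") as f:
        store = Store(load_pem_x509_certificates(f.read()))
    with open("peer-chain.pem", "rb") as f:
        leaf, *untrusted_intermediates = load_pem_x509_certificates(f.read())

    # The builder exposes the knobs discussed below: the expected subject, the
    # validation time (defaulting to the current time), and the maximum chain depth.
    verifier = (
        PolicyBuilder()
        .store(store)
        .max_chain_depth(8)
        .build_server_verifier(DNSName("x509-limbo.com"))
    )

    # verify() returns a validated chain (leaf through a trusted root), or
    # raises if no valid chain can be built.
    chain = verifier.verify(leaf, untrusted_intermediates)
    print([cert.subject.rfc4514_string() for cert in chain])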

    From here, there’s much that could be done. Some ideas we have include:

    • Expose APIs for client certificate path validation. To expedite things, we’ve focused the initial implementation on server validation (verifying that a leaf certificate attesting to a specific DNS name or IP address chains up to a root of trust). This ignores client validation, wherein the client side of a connection presents its own certificate for the server to verify against a set of known principals. Client path validation shares the same fundamental chain building algorithm as server validation, but has a slightly different ideal public API (since the client’s identity needs to be matched against a potentially arbitrary number of identities known to the server).
    • Expose different X.509 profiles (and more configuration knobs). The current APIs expose very little configuration; the only things a user of the Python API can change are the certificate subject, the validation time, and the maximum chain depth. Going forward, we’ll look into exposing additional knobs, including pieces of state that will allow users to perform verifications with the RFC 5280 certificate profile and other common profiles (like Microsoft’s Authenticode profile). Long term, this will help bespoke (such as corporate) PKI use cases to migrate to Cryptography’s X.509 APIs and lessen their dependency on OpenSSL.
    • Carcinize existing C and C++ X.509 users. One of Rust’s greatest strengths is its native, zero-cost compatibility with C and C++. Given that C and C++ implementations of X.509 and path validation have historically been significant sources of exploitable memory corruption bugs, we believe that a thin “native” wrapper around cryptography-x509-verification could have an outsized positive impact on the security of major C and C++ codebases.
    • Spread the gospel of x509-limbo. x509-limbo was an instrumental component in our ability to confidently ship an X.509 path validator. We’ve written it in such a way that should make integration into other path validation implementations as simple as downloading and consuming a single JSON file. We look forward to helping other implementations (such as rustls-webpki) integrate it directly into their own testing regimens!
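
    As a sketch of what that integration could look like, the following consumes a downloaded test-vector file and compares verdicts. The URL, the field names, and the validate() stub are all assumptions for illustration; consult the x509-limbo repository for the authoritative schema.

    import json
    import urllib.request

    LIMBO_JSON = "https://x509-limbo.com/limbo.json"  # assumed location

    def validate(case: dict) -> bool:
        """Stub: run the implementation under test against one case and return
        True if it accepts the chain. Replace with a real integration."""
        return False

    with urllib.request.urlopen(LIMBO_JSON) as resp:
        limbo = json.load(resp)

    for case in limbo.get("testcases", []):                   # assumed field name
        expected = case.get("expected_result") == "SUCCESS"   # assumed field name
        if validate(case) != expected:
            print("mismatch:", case.get("id"))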

    If any of these ideas interests you (or you have any of your own), please get in touch! Open source is key to our mission at Trail of Bits, and we’d love to hear about how we can help you and your team take the fullest advantage of and further secure the open-source ecosystem.

    Acknowledgments

    This work required the coordination of multiple independent parties. We would like to express our sincere gratitude to each of the following groups and individuals:

    • The Sovereign Tech Fund, whose vision for OSS security and funding made this work possible.
    • The PyCA Cryptography maintainers (Paul Kehrer and Alex Gaynor), who scoped this work from the very beginning and offered constant feedback and review throughout the development process.
    • The BetterTLS development team, who both reviewed and merged patches that enabled x509-limbo to vendor and reuse their (extensive) testsuite.

    Celebrating our 2023 open-source contributions

    24 January 2024 at 14:00

    At Trail of Bits, we pride ourselves on making our best tools open source, such as Slither, PolyTracker, and RPC Investigator. But while this post is about open source, it’s not about our tools…

    In 2023, our employees submitted over 450 pull requests (PRs) that were merged into non-Trail of Bits repositories. This demonstrates our commitment to securing the software ecosystem as a whole and to improving software quality for everyone. A representative list of contributions appears at the end of this post, but here are some highlights:

    • Sigstore-conformance, a vital component of our Sigstore work in open-source engineering, is an integration test suite for diverse Sigstore client implementations. It exercises overall client behavior across critical scenarios, tracks the ongoing effort to establish an official Sigstore client specification, and drops into a client's CI workflow with minimal configuration.
    • Protobuf-specs is another of our open-source engineering initiatives: a collaborative repository of standardized data models and protocols shared across the various Sigstore clients, housing the specifications for Sigstore messages. To update the protobuf definitions, contributors use Docker to regenerate the protobuf stubs by running $ make all, which produces Go and Python files under the gen/ directory.
    • pyOpenSSL is the predominant Python library for accessing OpenSSL functionality: a thin wrapper around a subset of OpenSSL in which many object methods simply invoke the corresponding OpenSSL functions. Over roughly the past nine months, we have been doing cleanup and maintenance on pyOpenSSL as part of our contract with the Sovereign Tech Fund.
    • Homebrew-core is the central repository for the default Homebrew tap, containing the formulas behind “brew install” on macOS and Linux. Emilio Lopez, an application security engineer, contributed several pull requests that added new formulas or updated existing ones, focusing predominantly on Trail of Bits tools such as crytic-compile, solc-select, and Caracal. As a result, anyone can install these tools with a straightforward “brew install” command.
    • Ghidra, a National Security Agency Research Directorate creation, is a powerful software reverse engineering (SRE) framework. It offers advanced tools for code analysis on Windows, macOS, and Linux, including disassembly, decompilation, and scripting. Supporting various processor instruction sets, Ghidra serves as a customizable SRE research platform, aiding in the analysis of malicious code for cybersecurity purposes. We fixed numerous bugs to enhance its functionality, particularly in support of our work on DARPA’s AMP (Assured Micropatching) program.

    We would like to acknowledge that submitting a PR is only a tiny part of the open-source experience. Someone has to review the PR. Someone has to maintain the code after the PR is merged. And submitters of earlier PRs have to write tests to ensure the functionality of their code is preserved.

    We contribute to these projects in part because we love the craft, but also because we find these projects useful. For this, we offer the open-source community our most sincere thanks and wish everyone a happy, safe, and productive 2024!

    Some of Trail of Bits’ 2023 open-source contributions

    • AI/ML
    • Cryptography
    • Languages and compilers
    • Libraries
    • Tech infrastructure
    • Software analysis tools
    • Blockchain software
    • Reverse engineering tools
    • Software analysis/transformational tools
    • Packaging ecosystem/supply chain

    Our thoughts on AIxCC’s competition format

    18 January 2024 at 14:00

    By Michael Brown

    Late last month, DARPA officially opened registration for their AI Cyber Challenge (AIxCC). As part of the festivities, DARPA also released some highly anticipated information about the competition: a request for comments (RFC) that contained a sample challenge problem and the scoring methodology. Prior rules documents and FAQs released by DARPA painted the competition’s broad strokes, but with this release, some of the finer details are beginning to emerge.

    For those who don’t have time to pore over the 50+ pages of information made available to date, here’s a quick overview of the competition’s structure and our thoughts on it, including areas where we think improvements or clarifications are needed.

    The AIxCC is a grand challenge from DARPA in the tradition of the Cyber Grand Challenge and Driverless Grand Challenge.

    *** Disclaimer: AIxCC’s rules and scoring methods are subject to change. This summary is for our readership’s awareness and is NOT an authoritative document. Those interested in participating in AIxCC should refer to DARPA’s website and official documents for firsthand information. ***

    The competition at a high level

    Competing teams are tasked with building AI-driven, fully automated cyber reasoning systems (CRSs) that can identify and patch vulnerabilities in programs. The CRS cannot receive any human assistance while discovering and patching vulnerabilities in challenge projects. Challenge projects are modified versions of critical real-world software like the Linux kernel and the Jenkins automation server. CRSs must submit a proof of vulnerability (PoV) and a proof of understanding (PoU) and may submit a patch for each vulnerability they discover. These components are scored individually and collectively to determine the winning CRS.

    The competition has four stages:

    • Registration (January–April 2024): The Open and Small Business tracks are open for registration. Up to seven small businesses that submit concept white papers will be selected for a $1 million prize to fund their participation in AIxCC.
    • Practice Rounds (March–July 2024): Practice and familiarization rounds allow competitors to realistically test their systems.
    • Semifinals (August 2024 at DEF CON): In the first competition round, the top seven teams advance to the final round, each receiving a $2 million prize.
    • Finals (August 2025 at DEF CON): In the grand finale, the top three performing CRSs receive prizes of $4 million, $3 million, and $1.5 million, respectively.

    Figure 1: AIxCC events overview

    The challenge projects

    The challenge projects that each team’s CRS must handle are modeled after real-world software and are very diverse. Challenge problems may include source code written in Java, Rust, Go, JavaScript, TypeScript, Python, Ruby, or PHP, but at least half of them will be C/C++ programs that contain memory corruption vulnerabilities. Other types of vulnerabilities that competitors should expect to see will be drawn from MITRE’s Top 25 Most Dangerous Software Weaknesses.

    Challenge problems include source code, a modifiable build process and environment, test harnesses, and a public functionality test suite. Using APIs for these resources, competing CRSs must employ various types of AI/ML and conventional program analysis techniques to discover, locate, trigger, and patch vulnerabilities in the challenge problem. To score points, the CRS must submit a PoV and PoU and may submit a patch. The PoV is an input that will trigger the vulnerability via one of the provided test harnesses. The PoU must specify which sanitizers and harnesses (i.e., vulnerability type, perhaps a CWE number) the PoV will trigger and the lines of code that make up the vulnerability.

    The RFC contains a sample challenge problem that reintroduces a vulnerability that was disclosed in 2021 back into the Linux kernel. The challenge problem example provided is a single function written in C with a heap-based buffer overflow vulnerability and an accompanying sample patch. Unfortunately, this example does not come with example fuzzing harnesses, a test suite, or a build harness. DARPA is planning to release more examples with more details in the future, starting with a new example challenge problem from the Jenkins automation server.

    Scoring

    Each competing CRS will be given an overall score calculated as a function of four components:

    • Vulnerability Discovery Score: Points are awarded for each PoV that triggers the AIxCC sanitizer specified in the accompanying PoU.
    • Program Repair Score: Points are awarded if a patch accompanying the PoV/PoU prevents AIxCC sanitizers from triggering and does not break expected functionality. A small bonus is applied if the patch passes a code linter without error.
    • Accuracy Multiplier: This multiplies the overall score to award CRSs with high accuracy (i.e., minimizing invalid or rejected PoVs, PoUs, and patches).
    • Diversity Multiplier: This multiplies the overall score to award CRSs that handle diverse sets of CWEs and source code languages.

    There are a number of intricacies in how the scoring algorithm combines these components. For example, successfully patching a discovered vulnerability is weighted heavily to prevent competitors from focusing solely on vulnerability discovery and ignoring patching. If you're interested in the detailed math, check out the scoring section of the RFC.
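
    For intuition only, here is a toy sketch of that multiplicative structure. The function and the numbers are our own illustration of how the components described above interact, not DARPA's official scoring math.

    # Toy illustration: per-challenge points for discovery and repair, scaled by
    # the accuracy and diversity multipliers. Invented for intuition only; this
    # is NOT the official AIxCC scoring algorithm.
    def overall_score(discovery_points, repair_points,
                      accuracy_multiplier, diversity_multiplier):
        return (discovery_points + repair_points) * accuracy_multiplier * diversity_multiplier

    # A CRS that patches what it finds, keeps accuracy high, and covers diverse
    # CWEs and languages outscores one that only racks up discovery points.
    print(overall_score(10.0, 8.0, accuracy_multiplier=0.95, diversity_multiplier=1.2))  # 20.52
    print(overall_score(14.0, 0.0, accuracy_multiplier=0.70, diversity_multiplier=1.0))  # 9.8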

    General thoughts on AIxCC’s format RFC

    In general, we think AIxCC will help significantly advance the state of the art in automated vulnerability detection and remediation. This competition format is a major step beyond the Cyber Grand Challenge in terms of realism for several reasons—namely, the challenge problems 1) are made from real-world software and vulnerabilities, 2) include source code and are compiled to real-world binary formats, and 3) come in many different source languages for many different computing stacks.

    Additionally, we think the focus on AI/ML–driven CRSs for this competition will help create new research areas by encouraging new approaches to software analysis problems that conventional approaches have been unable to solve (due to fundamental limits like the halting problem).

    Concerns we’ve raised in our RFC response

    DARPA has solicited feedback on their scoring algorithm and exemplar challenges by releasing them as an RFC. We responded to their RFC earlier this month and highlighted several concerns that are front of mind for us as we start building our system. We hope that the coming months bring clarifications or changes to address these concerns.

    Construction of challenge problems

    We have two primary concerns related to the challenge problems. First, it appears that the challenges will be constructed by reinjecting previously disclosed vulnerabilities into recent versions of an open-source project. These vulnerabilities, especially ones that have been explained in detail in blog posts, are almost certainly contained in the training data of commercial large language models (LLMs) such as ChatGPT and Claude.

    Given their high bandwidth for memorization, CRSs based on these models will be unfairly advantaged when detecting and patching these vulnerabilities compared to other approaches. Combined with the fact that LLMs are known to perform significantly worse on novel instances of problems, this strongly suggests that LLM-based CRSs that score highly in AIxCC will likely struggle when used outside the competition. As a result, we recommend that DARPA not use historic vulnerabilities that were disclosed before the training epoch for partner-provided commercial models to create challenge problems for the competition.

    Second, it appears that all challenge problems will be created using open-source projects that will be known to competitors in advance of the competition. This will allow teams to conduct large-scale pre-analysis and specialize their LLMs, fuzzers, and static analyzers to the known source projects and their historical vulnerabilities. These CRSs would be too specific to the competition and may not be usable on different source projects without significant manual effort to retarget the CRSs. To address this potential problem, we recommend that at least 65% of challenge problems be made for source projects that are kept secret prior to each stage of the competition.

    PoU granularity

    We are concerned about the potential for the scoring algorithm to reject valid PoVs/PoUs if AIxCC sanitizers are overly granular. For example, CWE-787 (out-of-bounds write), CWE-125 (out-of-bounds read), and CWE-119 (out-of-bounds buffer operation) are all listed in the MITRE top 25 weaknesses report. All three could be valid to describe a single vulnerability in a challenge problem and are cross-listed in the CWE database. If multiple sanitizers are provided for each of these CWEs but only one is considered correct, it is possible for otherwise valid submissions to be rejected for failing to properly distinguish between three very closely related sanitizers. We recommend that AIxCC sanitizers be sufficiently coarse-grained to avoid unfair penalization of submitted PoUs.

    Scoring

    As currently designed, performance metrics (e.g., CPU runtime, memory overhead, etc.) are not directly addressed by the competition’s areas of excellence, nor are they factored into functionality scores for patches. Performance is a critical nonfunctional software requirement and an important aspect of patch effectiveness and patch acceptability. We think it’s important for patches generated by competing CRSs to maintain the program’s performance within an acceptable threshold. Without this consideration in scoring, it is possible for teams to submit patches that are valid and correct but ultimately so nonperforming that they would not be used in a real-world scenario. We recommend the competition’s functionality score be augmented with a performance component.

    What’s next?

    Although we’ve raised some concerns in our RFC response, we’re very excited for the official kickoff in March and the actual competition later this year in August. Look out for our next post in this series, where we will talk about how our prior work in this area has influenced our high-level approach and discuss the technical areas of this competition we find most fascinating.
