Normal view

There are new articles available, click to refresh the page.
Today — 14 June 2024Vulnerabily Research

Understanding Apple’s On-Device and Server Foundation Models release

14 June 2024 at 20:49

By Artem Dinaburg

Earlier this week, at Apple’s WWDC, we finally witnessed Apple’s AI strategy. The videos and live demos were accompanied by two long-form releases: Apple’s Private Cloud Compute and Apple’s On-Device and Server Foundations Models. This blog post is about the latter.

So, what is Apple releasing, and how does it compare to the current open-source ecosystem? We integrate the video and long-form releases and parse through the marketing speak to bring you the nuggets of information within.

The sound of silence

No NVIDIA/CUDA Tax. What’s unsaid is as important as what is, and those words are CUDA and NVIDIA. Apple goes out of its way to specify that it is not dependent on NVIDIA hardware or CUDA APIs for anything. The training uses Apple’s AXLearn (which runs on TPUs and Apple Silicon), Server model inference runs on Apple Silicon (!), and the on-device APIs are CoreML and Metal.

Why? Apple hates NVIDIA with the heat of a thousand suns. Tim Cook would rather sit in a data center and do matrix multiplication with an abacus than spend millions on NVIDIA hardware. Aside from personal enmity, it is a good business idea. Apple has its own ML stack from the hardware on up and is not hobbled by GPU supply shortages. Apple also gets to dogfood its hardware and software for ML tasks, ensuring that it’s something ML developers want.

What’s the downside? Apple’s hardware and software ML engineers must learn new frameworks and may accidentally repeat prior mistakes. For example, Apple devices were originally vulnerable to LeftoverLocals, but NVIDIA devices were not. If anyone from Apple is reading this, we’d love to audit AXLearn, MLX, and anything else you have cooking! Our interests are in the intersection of ML, program analysis, and application security, and your frameworks pique our interest.

The models

There are (at least) five models being released. Let’s count them:

  1. The ~3B parameter on-device model used for language tasks like summarization and Writing Tools.
  2. The large Server model is used for language tasks too complex to do on-device.
  3. The small on-device code model built into XCode used for Swift code completion.
  4. The large Server code model (“Swift Assist”) that is used for complex code generation and understanding tasks.
  5. The diffusion model powering Genmoji and Image Playground.

There may be more; these aren’t explicitly stated but plausible: a re-ranking model for working with Semantic Search and a model for instruction following that will use app intents (although this could just be the normal on-device model).

The ~3B parameter on-device model. Apple devices are getting an approximately 3B parameter on-device language model trained on web crawl and synthetic data and specially tuned for instruction following. The model is similar in size to Microsoft’s Phi-3-mini (3.8B parameters) and Google’s Gemini Nano-2 (3.25B parameters). The on-device model will be continually updated and pushed to devices as Apple trains it with new data.

What model is it? A reasonable guess is a derivative of Apple’s OpenELM. The parameter count fits (3B), the training data is similar, and there is extensive discussion of LoRA and DoRA support in the paper, which only makes sense if you’re planning a system like Apple has deployed. It is almost certainly not directly OpenELM since the vocabulary sizes do not match and OpenELM has not undergone safety tuning.

Apple’s on-device and server model architectures.

A large (we’re guessing 130B-180B) Mixture-of-Experts Server model. For tasks that can’t be completed on a device, there is a large model running on Apple Silicon Servers in their Private Compute Cloud. This model is similar in size and capability to GPT-3.5 and is likely implemented as a Mixture-of-Experts. Why are we so confident about the size and MoE architecture? The open-source comparison models in cited benchmarks (DBRX, Mixtral) are MoE and approximately of that size; it’s too much for a mere coincidence.

Apple’s Server model compared to open source alternatives and the GPT series from OpenAI.

The on-device code model is cited in the platform state of the union; several examples of Github Copilot-like behavior integrated into XCode are shown. There are no specifics about the model, but a reasonable guess would be a 2B-7B code model fine-tuned for a specific task: fill-in-middle for Swift. The model is trained on Swift code and Apple SDKs (likely both code and documentation). From the demo video, the integration into XCode looks well done; XCode gathers local symbols and proper context for the model to better predict the correct text.

Apple’s on-device code model doing FIM completions for Swift code via XCode.

The server code model is branded as “Swift Assist” and also appears in the platform state of the union. It looks to be Apple’s answer to GitHub Copilot Chat. Not much detail is given regarding the model, but looking at its demo output, we guess it’s a 70B+ parameter model specifically trained on Swift Code, SDKs, and documentation. It is probably fine-tuned for instruction following and code generation tasks using human-created and synthetically generated data. Again, there is tight integration with XCode regarding providing relevant context to the model; the video mentions automatically identifying and using image and audio assets present in the project.

Swift Assist completing a description to code generation task, integrated into XCode.

The Image Diffusion Model. This model is discussed in the Platforms State of the Union and implicitly shown via Genmoji and Image Playground features. Apple has considerable published work on image models, more so than language models (compare the amount of each model type on Apple’s HF page). Judging by their architecture slide, there is a base model with a selection of adapters to provide fine-grained control over the exact image style desired.

Image Playground showing the image diffusion model and styling via adapters.

Adapters: LoRAs (and DoRAs) galore

The on-device models will come with a set of LoRAs and/or DoRAs (Adapters, in Apple parlance) that specialize the on-device model to be very good at specific tasks. What’s an adapter? It’s effectively a diff against the original model weights that makes the model good at a specific task (and conversely, worse at general tasks). Since adapters do not have to modify every weight to be effective, they can be small (10s of megabytes) compared to a full model (multiple gigabytes). Adapters can also be dynamically added or removed from a base model, and multiple adapters can stack onto each other (e.g., imagine stacking Mail Replies + Friendly Tone).

For Apple, shipping a base model and adapters makes perfect sense: the extra cost of shipping adapters is low, and due to complete control of the OS and APIs, Apple has an extremely good idea of the actual task you want to accomplish at any given time. Apple promises continued updates of adapters as new training data is available and we imagine new adapters can fill specific action niches as needed.

Some technical details: Apple says their adapters modify multiple layers (likely equivalent to setting target_modules=”all-linear” in HF’s transformers). Adapter rank determines how strong an effect it has against the base model; conversely, higher-rank adapters take up more space since they modify more weights. At rank=16 (which from a vibes/feel standpoint is a reasonable compromise between effect and adapter size), the adapters take up 10s of megabytes each (as compared to gigabytes for a 3B base model) and are kept in some kind of warm cache to optimize for responsiveness.

Suppose you’d like to learn more about adapters (the fundamental technology, not Apple’s specific implementation) right now. In that case, you can try via Apple-native MLX examples or HF’s transformers and PEFT packages.

A selection of Apple’s language model adapters.

A vector database?

Apple doesn’t explicitly state this, but there’s a strong implication that Siri’s semantic search feature is a vector database; there’s an explicit comparison that shows Siri now searches based on meaning instead of keywords. Apple allows application data to be indexed, and the index is multimodal (images, text, video). A local application can provide signals (such as last accessed time) to the ranking model used to sort search results.

Siri now searches by semantic meaning, which may imply there is a vector database underneath.

Delving into technical details

Training and data

Let’s talk about some of the training techniques described. They are all ways to parallelize training very large language models. In essence, these techniques are different means to split & replicate the model to train it using an enormous amount of compute and data. Below is a quick explanation of the techniques used, all of which seem standard for training such large models:

  • Data Parallelism: Each GPU has a copy of the full model but is assigned a chunk of the training data. The gradients from all GPUs are aggregated and used to update weights, which are synchronized across models.
  • Tensor Parallelism: Specific parts of the model are split across multiple GPUs. PyTorch docs say you will need this once you have a big model or GPU communication overhead becomes an issue.
  • Sequence Parallelism was the hardest topic to find; I had to dig to page 6 of this paper. Parts of the transformer can be split to process multiple data items at once.
  • FSDP shards your model across multiple GPUs or even CPUs. Sharding reduces peak GPU memory usage since the whole model does not have to be kept in memory, at the expense of communication overhead to synchronize state. FDSP is supported by PyTorch and is regularly used for finetuning large models.

Surprise! Apple has also crawled the web for training with AppleBot. A raw crawl naturally contains a lot of garbage, sensitive data, and PII, which must be filtered before training. Ensuring data quality is hard work! HuggingFace has a great blog post about what was needed to improve the quality of their web crawl, FineWeb. Apple had to do something similar to filter out their crawl garbage.

Apple also has licensed training data. Who the data partners are is not mentioned. Paying for high-quality data seems to be the new normal, with large tech companies striking deals with big content providers (e.g., StackOverflow, Reddit, NewsCorp).

Apple also uses synthetic data generation, which is also fairly standard practice. However, it begs the question: How does Apple generate the synthetic data? Perhaps the partnership with OpenAI lets them legally launder GPT-4 output. While synthetic data can do wonders, it is not without its downside—there are forgetfulness issues with training on a large synthetic data corpus.

Optimization

This section describes how Apple optimizes its device and server models to be smaller and enable faster inference on devices with limited resources. Many of these optimizations are well known and already present in other software, but it’s great to see this level of detail about what optimizations are applied in production LLMs.

Let’s start with the basics. Apple’s models use GQA (another match with OpenELM). They share vocabulary embedding tables, which implies that some embedding layers are shared between the input and the output to save memory. The on-device model has a 49K token vocabulary (a key difference from OpenELM). The hosted model has a 100K token vocabulary, with special tokens for language and “technical tokens.” The model vocabulary means how many letters and short sequences of words (or tokens) the model recognizes as unique. Some tokens are also used for signaling special states to the model, for instance, the end of the prompt, a request to fill in the middle, a new file being processed, etc. A large vocabulary makes it easier for the model to understand certain concepts and specific tasks. As a comparison, Phi-3 has a vocabulary size of 32K, Llama3 has a vocabulary of 128K tokens, and Qwen2 has a vocabulary of 152K tokens. The downside of a large vocabulary is that it results in more training and inference time overhead.

Quantization & palletization

The models are compressed via palletization and quantization to 3.5 bits-per-weight (BPW) but “achieve the same accuracy as uncompressed models.” What does “achieve the same accuracy” mean? Likely, it refers to an acceptable quantization loss. Below is a graph from a PR to llama.cpp with state-of-the-art quantization losses for different techniques as of February 2024. We are not told what Apple’s acceptable loss is, but it’s doubtful a 3.5 BPW compression will have zero loss versus a 16-bit float base model. Using “same accuracy” seems misleading, but I’d love to be proven wrong. Compression also affects metrics beyond accuracy, so the model’s ability may be degraded in ways not easily captured by benchmarks.

Quantization error compared with bits per weight, from a PR to llama.cpp. The loss at 3.5 BPW is noticeably not zero.

What is Low Bit Palletization? It’s one of Apple’s compression strategies, described in their CoreML documentation. The easiest way to understand it is to use its namesake, image color pallets. An uncompressed image stores the color values of each pixel. A simple optimization is to select some number of colors (say, 16) that are most common to the image. The image can then be encoded as indexes into the color palette and 16 full-color values. Imagine the same technique applied to model weights instead of pixels, and you get palletization. How good is it? Apple publishes some results for the effectiveness of 2-bit and 4-bit palletization. The two-bit palletization looks to provide ~6-7x compression from float16, and 4-bit compression measures out at ~3-4x, with only a slight latency penalty. We can ballpark and assume the 3.5 BPW will compress ~5-6x from the original 16-bit-per-weight model.

Palletization graphic from Apple’s CoreML documentation. Note the similarity to images and color pallets.

Palletization only applies to model weights; when performing inference, a source of substantial memory usage is runtime state. Activations are the outputs of neurons after applying some kind of transformation function, storing these in deep models can take up a considerable amount of memory, and quantizing them is a way to fit a bigger model for inference. What is quantization? It’s a way to map intervals of a large range (like 16 bits) into a smaller range (like 4 or 8 bits). There is a great graphical demonstration in this WWDC 2024 video.

Quantization is also applied to embedding layers. Embeddings map inputs (such as words or images) into a vector that the ML model can utilize. The amount/size of embeddings depends on the vocabulary size, which we saw was 49K tokens for on-device models. Again, quantizing this lets us fit a bigger model into less memory at the cost of accuracy.

How does Apple do quantization? The CoreML docs reveal the algorithms are GPTQ and QAT.

Faster inference

The first optimization is caching previously computed values via the KV Cache. LLMs are next-token predictors; they always generate one token at a time. Repeated recomputation of all prior tokens through the model naturally involves much duplicate effort, which can be saved by caching previous results! That’s what the KV cache does. As a reminder, cache management is one of the two hard problems of computer science. KV caching is a standard technique implemented in HF’s transformers package, llama.cpp, and likely all other open-source inference solutions.

Apple promises a time-to-first-token of 0.6ms per prompt token and an inference speed of 30 tokens per second (before other optimizations like token speculation) on an iPhone 15. How does this compare to current open-source models? Let’s run some quick benchmarks!

On an M3 Max Macbook Pro, phi3-mini-4k quantized as Q4_K (about 4.5 BPW) has a time-to-first-token of about 1ms/prompt token and generates about 75 tokens/second (see below).

Apple’s 40% latency reduction on time-to-first-token on less powerful hardware is a big achievement. For token generation, llama.cpp does ~75 tokens/second, but again, this is on an M3 Max Macbook Pro and not an iPhone 15.

The speed of 30 tokens per second doesn’t provide much of an anchor to most readers; the important part is that it’s much faster than reading speed, so you aren’t sitting around waiting for the model to generate things. But this is just the starting speed. Apple also promises to deploy token speculation, a technique where a slower model guides how to get better output from a larger model. Judging by the comments in the PR that implemented this in llama.cpp, speculation provides 2-3x speedup over normal inference, so real speeds seen by consumers may be closer to 60 tokens per second.

Benchmarks and marketing

There’s a lot of good and bad in Apple’s reported benchmarks. The models are clearly well done, but some of the marketing seems to focus on higher numbers rather than fair comparisons. To start with a positive note, Apple evaluated its models on human preference. This takes a lot of work and money but provides the most useful results.

Now, the bad: a few benchmarks are not exactly apples-to-apples (pun intended). For example, the graph comparing human satisfaction summarization compares Apple’s on-device model + adapter against a base model Phi-3-mini. While the on-device + adapter performance is indeed what a user would see, a fair comparison would have been Apple’s on-device model + adapter vs. Phi-3-mini + a similar adapter. Apple could have easily done this, but they didn’t.

A benchmark comparing an Apple model + adapter to a base Phi-3-mini. A fairer comparison would be against Phi-3-mini + adapter.

The “Human Evaluation of Output Harmfulness” and “Human Preference Evaluation on Safety Prompts” show that Apple is very concerned about the kind of content its model generates. Again, the comparison is not exactly apples-to-apples: Mistral 7B was specifically released without a moderation mechanism (see the note at the bottom). However, the other models are fair game, as Phi-3-mini and Gemma claim extensive model safety procedures.

Mistral-7B does so poorly because it is explicitly not trained for harmfulness reduction, unlike the other competitors, which are fair game.

Another clip from one of the WWDC videos really stuck with us. In it, it is implied that macOS Sequoia delivers large ML performance gains over macOS Sonoma. However, the comparison is really a full-weight float16 model versus a quantized model, and the performance gains are due to quantization.

The small print shows full weights vs. 4-bit quantization, but the big print makes it seem like macOS Sonoma versus macOS Sequoia.

The rest of the benchmarks show impressive results in instruction following, composition, and summarization and are properly done by comparing base models to base models. These benchmarks correspond to high-level tasks like composing app actions to achieve a complex task (instruction following), drafting messages or emails (composition), and quickly identifying important parts of large documents (summarization).

A commitment to on-device processing and vertical integration

Overall, Apple delivered a very impressive keynote from a UI/UX perspective and in terms of features immediately useful to end-users. The technical data release is not complete, but it is quite good for a company as secretive as Apple. Apple also emphasizes that complete vertical integration allows them to use AI to create a better device experience, which helps the end user.

Finally, an important part of Apple’s presentation that we had not touched on until now is its overall commitment to maintaining as much AI on-device as possible and ensuring data privacy in the cloud. This speaks to Apple’s overall position that you are the customer, not the product.

If you enjoyed this synthesis of Apple’s machine learning release, consider what we can do for your machine learning environment! We specialize in difficult, multidisciplinary problems that combine application and ML security. Please contact us to know more.

PCC: Bold step forward, not without flaws

14 June 2024 at 19:46

By Adelin Travers

Earlier this week, Apple announced Private Cloud Compute (or PCC for short). Without deep context on the state of the art of Artificial Intelligence (AI) and Machine Learning (ML) security, some sensible design choices may seem surprising. Conversely, some of the risks linked to this design are hidden in the fine print. In this blog post, we’ll review Apple’s announcement, both good and bad, focusing on the context of AI/ML security. We recommend Matthew Green’s excellent thread on X for a more general security context on this announcement:

https://x.com/matthew_d_green/status/1800291897245835616

Disclaimer: This breakdown is based solely on Apple’s blog post and thus subject to potential misinterpretations of wording. We do not have access to the code yet, but we look forward to Apple’s public PCC Virtual Environment release to examine this further!

Review summary

This design is excellent on the conventional non-ML security side. Apple seems to be doing everything possible to make PCC a secure, privacy-oriented solution. However, the amount of review that security researchers can do will depend on what code is released, and Apple is notoriously secretive.

On the AI/ML side, the key challenges identified are on point. These challenges result from Apple’s desire to provide additional processing power for compute-heavy ML workloads today, which incidentally requires moving away from on-device data processing to the cloud. Homomorphic Encryption (HE) is a big hope in the confidential ML field but doesn’t currently scale. Thus, Apple’s choice to process data in its cloud at scale requires decryption. Moreover, the PCC guarantees vary depending on whether Apple will use a PCC environment for model training or inference. Lastly, because Apple is introducing its own custom AI/ML hardware, implementation flaws that lead to information leakage will likely occur in PCC when these flaws have already been patched in leading AI/ML vendor devices.

Running commentary

We’ll follow the release post’s text in order, section-by-section, as if we were reading and commenting, halting on specific passages.

Introduction


When I first read this post, I’ll admit that I misunderstood this passage as Apple starting an announcement that they had achieved end-to-end encryption in Machine Learning. This would have been even bigger news than the actual announcement.

That’s because Apple would need to use Homomorphic Encryption to achieve full end-to-end encryption in an ML context. HE allows computation of a function, typically an ML model, without decrypting the underlying data. HE has been making steady progress and is a future candidate for confidential ML (see for instance this 2018 paper). However, this would have been a major announcement and shift in the ML security landscape because HE is still considered too slow to be deployed at the cloud scale and in complex functions like ML. More on this later on.

Note that Multi-Party Computation (MPC)—which allows multiple agents, for instance the server and the edge device, to compute different parts of a function like an ML model and aggregate the result privately—would be a distributed scheme on both the server and edge device which is different from what is presented here.

The term “requires unencrypted access” is the key to the PCC design challenges. Apple could continue processing data on-device, but this means abiding by mobile hardware limitations. The complex ML workloads Apple wants to offload, like using Large Language Models (LLM), exceed what is practical for battery-powered mobile devices. Apple wants to move the compute to the cloud to provide these extended capabilities, but HE doesn’t currently scale to that level. Thus to provide these new capabilities of service presently, Apple requires access to unencrypted data.

This being said, Apple’s design for PCC is exceptional, and the effort required to develop this solution was extremely high, going beyond most other cloud AI applications to date.

Thus, the security and privacy of ML models in the cloud is an unsolved and active research domain when an auditor only has access to the model.

A good example of these difficulties can be found in Machine Unlearning—a privacy scheme that allows removing data from a model—that was shown to be impossible to formally prove by just querying a model. Unlearning must thus be proven at the algorithm implementation level.

When the underlying entirely custom and proprietary technical stack of Apple’s PCC is factored in, external audits become significantly more complex. Matthew Green notes that it’s unclear what part of the stack and ML code and binaries Apple will release to audit ML algorithm implementations.

This is also definitely true. Members of the ML Assurance team at Trail of Bits have been releasing attacks that modify the ML software stack at runtime since 2021. Our attacks have exploited the widely used pickle VM for traditional RCE backdoors and malicious custom ML graph operators on Microsoft’s ONNXRuntime. Sleepy Pickles, our most recent attack, uses a runtime attack to dynamically swap an ML model’s weights when the model is loaded.

This is also true; the design later introduced by Apple is far better than many other existing designs.

Designing Private Cloud compute

From an ML perspective, this claim depends on the intended use case for PCC, as it cannot hold true in general. This claim may be true if PCC is only used for model inference. The rest of the PCC post only mentions inference which suggests that PCC is not currently used for training.

However, if PCC is used for training, then data will be retained, and stateless computation that leaves no trace is likely impossible. This is because ML models retain data encoded in their weights as part of their training. This is why the research field of Machine Unlearning introduced above exists.

The big question that Apple needs to answer is thus whether it will use PCC for training models in the future. As others have noted, this is an easy slope to slip into.

Non-targetability is a really interesting design idea that hasn’t been applied to ML before. It also mitigates hardware leakage vulnerabilities, which we will see next.

Introducing Private Cloud Compute nodes

As others have noted, using Secure Enclaves and Secure Boot is excellent since it ensures only legitimate code is run. GPUs will likely continue to play a large role in AI acceleration. Apple has been building its own GPUs for some time, with its M series now in the third generation rather than using Nvidia’s, which are more pervasive in ML.

However, enclaves and attestation will provide only limited guarantees to end-users, as Apple effectively owns the attestation keys. Moreover, enclaves and GPUs have had vulnerabilities and side channels that resulted in exploitable leakage in ML. Apple GPUs have not yet been battle-tested in the AI domain as much as Nvidia’s; thus, these accelerators may have security issues that their Nvidia counterparts do not have. For instance, Apple’s custom hardware was and remains affected by the LeftoverLocals vulnerability when Nvidia’s hardware was not. LeftoverLocals is a GPU hardware vulnerability released by Trail of Bits earlier this year. It allows an attacker collocated with a victim on a vulnerable device to listen to the victim’s LLM output. Apple’s M2 processors are still currently impacted at the time of writing.

This being said, the PCC design’s non-targetability property may help mitigate LeftoverLocals for PCC since it prevents an attacker from identifying and achieving collocation to the victim’s device.

This is important as Swift is a compiled language. Swift is thus not prone to the dynamic runtime attacks that affect languages like Python which are more pervasive in ML. Note that Swift would likely only be used for CPU code. The GPU code would likely be written in Apple’s Metal GPU programming framework. More on dynamic runtime attacks and Metal in the next section.

Stateless computation and enforceable guarantees

Apple’s solution is not end-to-end encrypted but rather an enclave-based solution. Thus, it does not represent an advancement in HE for ML but rather a well-thought-out combination of established technologies. This is, again, impressive, but the data is decrypted on Apple’s server.

As presented in the introduction, using compiled Swift and signed code throughout the stack should prevent attacks on ML software stacks at runtime. Indeed, the ONNXRuntime attack defines a backdoored custom ML primitive operator by loading an adversary-built shared library object, while the Sleepy Pickle attack relies on dynamic features of Python.

Just-in-Time (JIT) compiled code has historically been a steady source of remote code execution vulnerabilities. JIT compilers are notoriously difficult to implement and create new executable code by design, making them a highly desirable attack vector. It may surprise most readers, but JIT is widely used in ML stacks to speed up otherwise slow Python code. JAX, an ML framework that is the basis for Apple’s own AXLearn ML framework, is a particularly prolific user of JIT. Apple avoids the security issues of JIT by not using it. Apple’s ML stack is instead built in Swift, a memory safe ahead-of-time compiled language that does not need JIT for runtime performance.

As we’ve said, the GPU code would likely be written in Metal. Metal does not enforce memory safety. Without memory safety, attacks like LeftoverLocals are possible (with limitations on the attacker, like machine collocation).

No privileged runtime access

This is an interesting approach because it shows Apple is willing to trade off infrastructure monitoring capabilities (and thus potentially reduce PCC’s reliability) for additional security and privacy guarantees. To fully understand the benefits and limits of this solution, ML security researchers would need to know what exact information is captured in the structured logs. A complete analysis thus depends on Apple’s willingness or unwillingness to release the schema and pre-determined fields for these logs.

Interestingly, limiting the type of logs could increase ML model risks by preventing ML teams from collecting adequate information to manage these risks. For instance, the choice of collected logs and metrics may be insufficient for the ML teams to detect distribution drift—when input data no longer matches training data and the model performance decreases. If our understanding is correct, most of the collected metrics will be metrics for SRE purposes, meaning that data drift detection would not be possible. If the collected logs include ML information, accidental data leakage is possible but unlikely.

Non-targetability

This is excellent as lower levels of the ML stack, including the physical layer, are sometimes overlooked in ML threat models.

The term “metadata” is important here. Only the metadata can be filtered away in the manner Apple describes. However, there are virtually no ways of filtering out all PII in the body content sent to the LLM. Any PII in the body content will be processed unencrypted by the LLM. If PCC is used for inference only, this risk is mitigated by structured logging. If PCC is also used for training, which Apple has yet to clarify, we recommend not sharing PII with systems like these when it can be avoided.

It might be possible for an attacker to obtain identifying information in the presence of side channel vulnerabilities, for instance, linked to implementation flaws, that leak some information. However, this is unlikely to happen in practice: the cost placed on the adversary to simultaneously exploit both the load balancer and side channels will be prohibitive for non-nation state threat actors.

An adversary with this level of control should be able to spoof the statistical distribution of nodes unless the auditing and statistical analysis are done at the network level.

Verifiable transparency


This is nice to see! Of course, we do not know if these will need to be analyzed through extensive reverse engineering, which will be difficult, if not impossible, for Apple’s custom ML hardware. It is still a commendable rare occurrence for projects of this scale.

PCC: Security wins, ML questions

Apple’s design is excellent from a security standpoint. Improvements on the ML side are always possible. However, it is important to remember that those improvements are tied to some open research questions, like the scalability of homomorphic encryption. Only future vulnerability research will shed light on whether implementation flaws in hardware and software will impact Apple. Lastly, only time will tell if Apple continuously commits to security and privacy by only using PCC for inference rather than training and implementing homomorphic encryption as soon as it is sufficiently scalable.

Announcing the Burp Suite Professional chapter in the Testing Handbook

14 June 2024 at 13:00

By Maciej Domanski

Based on our security auditing experience, we’ve found that Burp Suite Professional’s dynamic analysis can uncover vulnerabilities hidden amidst the maze of various target components. Unpredictable security issues like race conditions are often elusive when examining source code alone.

While Burp is a comprehensive tool for web application security testing, its extensive features may present a complex barrier. That’s where we, Trail of Bits, stand ready with our new Burp Suite guide in the Testing Handbook. This chapter aims to cut through this complexity, providing a clear and concise roadmap for running Burp Suite and achieving quick and tangible results.

The new chapter starts with an essential discussion on where Burp can support you. This section provides in-depth insights into how Burp can enhance your ability to conduct security testing, especially in the face of challenges like obfuscated front-end code, intricate infrastructural components, variations in deployment environments, or client-side data handling issues.

The chapter provides a step-by-step guide to setting up Burp for your specific application quickly and effectively. It guides you through minimizing setup errors and ensuring potential vulnerabilities are not overlooked—a game-changer in terms of your security auditing outcomes. We also explore using key Burp extensions to supercharge your application testing processes and discover more vulnerabilities.

Our Burp chapter concludes with numerous professional tips and tricks to empower you to perform advanced practices and to reveal hidden Burp characteristics that could revolutionize your security testing routine.

Real-world knowledge, real-world results

The Testing Handbook series encapsulates our extensive real-world knowledge and experience. Our insights go beyond mere documentation recitations, offering tried-and-tested strategies from the Trail of Bits team’s security auditing experience.

With this new chapter, we hope to impart the knowledge and confidence you need to dive into Burp Suite and truly harness its potential to secure your web applications.

Ready to supercharge your security testing with Burp Suite? Dive into the chapter now.

❌
❌