
Real World Cryptography Conference 2024

This year’s Real World Cryptography Conference recently took place in Toronto, Canada. As usual, this conference organized by the IACR showcased recent academic results and industry perspectives on current cryptography topics over three days of presentations. A number of co-located events also took place before and after the conference, including the FHE.org Conference, the Real World Post-Quantum Cryptography (RWPQC) Workshop and the High Assurance Crypto Software (HACS) Workshop.

A number of NCC Group’s Cryptography Services team members attended the conference and several of the workshops this year. Some of our favorite talks and takeaways are summarized in this post.

Post-Quantum Cryptography

At this year’s Real World Cryptography conference, post-quantum cryptography was strongly represented. With two PQC sessions during the main conference itself, as well as the co-located RWPQC event which took place on the Sunday before the main conference, it was exciting to see so much engagement on the PQC front during our trip to Toronto!

Following the blueprint from last year’s event, the RWPQC workshop opened with an update on the NIST PQC competition, reiterating its current status and NIST’s goal of producing the final standards for the FIPS 203 and 204 drafts within the next few months, followed by an initial draft of the Falcon specification under the name FN-DSA. This was followed by updates from other standardization bodies, including ETSI, BSI, NCSC, and the IETF, all of which are working towards providing PQC guidance in their respective areas of influence now that the final FIPS drafts are expected soon. The MITRE and Linux Foundation PQC migration consortiums also gave updates during the workshop. As part of these talks, many standards bodies discussed their approach to the migration and whether they plan to mandate the use of hybrid algorithms, with positions ranging from required hybridization to weaker recommendations. Additionally, a number of talks noted that while hybrid algorithms may be helpful in the short term, the community should start planning an eventual migration to a single set of algorithms post-hybridization, citing concerns about increased complexity and a combinatorial expansion of algorithm combinations as new algorithms are introduced in the future.

As a counterpart to the presentations by standardization bodies, the RWPQC program included real-world updates about the progress of the PQC migration at various companies, including Signal, Amazon, Google, Meta, and evolutionQ. All talks provided valuable insights as to the challenges, both already overcome and those that are yet to come, for migrating to PQC in their respective settings. Finally, a few more academic talks on lattice cryptanalysis and implementation footguns rounded off the program. We’ll do a slightly deeper dive for some of our favorite talks!

Lattice Cryptanalysis Talks

Martin Albrecht and John Schanck presented two complementary discussions of topics in lattice cryptanalysis. In the first presentation, Martin Albrecht did a deep dive into the analysis of the current best-known attack against lattice cryptosystems, known as the dual attack, starting with a brief history of the primal and dual attacks and noting some recent works that questioned the accuracy of common heuristics, resulting in improved analyses of these dual algorithms. Martin also noted that there doesn’t seem to be a clear reason why dual attacks should outperform primal attacks, remarking that “it seems morally wrong that the dual attack would beat the primal attack”, since it introduces additional transformations over the direct approach. Finally, the presentation concluded with a discussion of recent lattice attacks leveraging machine learning models, noting that in his opinion there is currently no reason to believe that ML can threaten lattice cryptosystems.

John Schanck’s following talk focused on the “real cost” of the best-known attacks. The NIST security levels I, III and V aim to guide protocol designers to select parameters which offer security guarantees matching the cost of the best-known attacks against AES-128, 192 and 256 respectively. However, unlike attacks on AES, the dual lattice attack has an incredibly expensive and memory-hungry sieving step. To make progress on an attack against Kyber and related schemes, one must perform a huge amount of computation before the key space shrinks at all (compare this to attacking AES, where one can immediately start guessing keys). The talk featured fun comparisons - a Moon’s weight of silicon would be needed to fabricate enough memory for the naive implementation of the dual attack - and really demonstrated how challenging it is to compare the real cost of attacking different cryptographic protocols when the attacks themselves are structured so differently at the algorithmic level. The take-home message from Schanck’s talk was that when memory cost is taken into account, Kyber 768 should be enough for everyone.

Implementation Footguns for Post-Quantum Cryptography

Nadia Heninger presented a very detailed discussion about potential pitfalls she foresees as issues for post-quantum implementations, primarily based on her experiences with implementations of classical cryptography. She noted that many common classes of implementation pitfalls in classical cryptography are still applicable in PQC settings, including RNG issues, issues with sampling or uniformity of distributions (which may be even trickier in the PQC settings, as many lattice schemes require sampling from multiple distributions), API misuse, and missing validation checks, which can be tricky to enforce via tests. This talk resonated with us, as we have already started seeing some of these issues in the post-quantum projects that we have reviewed so far. Finally, her discussion noted that the increased implementation complexity for PQC schemes may be a blessing in disguise, as the more complicated an algorithm seems, the less likely people are to try to implement it themselves, and instead rely on existing implementations, which may end up helping avoid many of these potential issues at scale!

Making Signal Messenger Post Quantum / Making Encrypted Messaging Post Quantum

Rolfe Schmidt gave a fantastic talk on the upgrade of Signal messenger to include post-quantum cryptography in the key-agreement stage of the protocol, now known as PQXDH. The talk motivated this change as a protection against “harvest now, decrypt later” attacks, with a design philosophy of changing only what strictly needs to change to achieve protection against a quantum adversary. Although the key agreement now includes a hybridized protocol using post-quantum algorithms, the ratcheting algorithm remains classical only, so the classical guarantees of the Signal protocol are not yet matched by its post-quantum guarantees. Making the ratchet post-quantum secure is a work in progress for the Signal team, who hope to ensure that messaging performance is not affected by the inclusion of Kyber in the ratcheting mechanism. The design documentation is now available as the PQXDH Specification.

In addition to the design and implementation of PQXDH, Signal collaborated with academia to produce a formally verified model of PQXDH using both ProVerif and CryptoVerif. Signal explained that through the process of formally verifying the protocol, they not only gained confidence in the changes, but the verification also highlighted parts of the specification which had been under-described and could have led to attacks if misinterpreted. The process thus not only supported the validity of the design but acted as a guide toward a robust description of PQXDH for developers in the future.

Conclusion

Overall, it’s very exciting to see so much movement in real-world post-quantum applications. We are looking forward to future PQC updates at RWC, RWPQC and elsewhere, and to reviewing PQC projects that come our way!

– Giacomo Pope and Elena Bakos Lang

Key and Certificate Transparency

Key and certificate transparency was a hot topic at this year’s conference. The Levchin Prize was awarded to the team at Google responsible for “creating and deploying Certificate Transparency at scale”. In addition to the public recognition of what that work has pioneered, three talks were scheduled about different aspects of modern transparency solutions.

Invited talk: Key transparency: introduction, recent results, and open problems

The first talk by Melissa Chase from Microsoft Research delved into recent results and open problems in Key Transparency. In modern encrypted messaging deployments, a service provider is generally responsible for distributing users’ public keys. However, what if a man-in-the-middle attacker were to intercept (and meddle with) the public key of the recipient that a sender is trying to establish a secure communication with? Or worse, what if the server were to get compromised? In an end-to-end encrypted messaging setting, key transparency aims to solve this problem of trusted public key distribution which is often glossed over in academic works.

Until recently, the industry solution to the key transparency question was some form of out-of-band verification, in which each user displays a fingerprint corresponding to the chat’s encryption key and the two compare them with one another. Subsequent deployments made comparing these traditionally long numerical codes easier by displaying a QR code that can be verified when both users are physically close to each other. These solutions can be slightly tedious for users, and the industry has started to deploy large-scale, automatic key transparency solutions based on relatively recent academic works such as CONIKS.

In some of these modern key transparency deployments, service providers provide a publicly accessible key directory which keeps track of users’ public keys. Users can then ensure that the key they hold for a given contact is consistent with the key tracked in the latest version of the online key directory. However, granting people access to public key repositories needs to be done while still maintaining user privacy. Indeed, the deployment of such systems should not make it easier for anyone to be able to track individual users’ actions, for example by figuring out when they refresh their keys (if they get a new device for instance) or by allowing attackers to find out which users are participating in the system by identifying personal information (such as phone numbers or email addresses) in the key directory.

In order to realize the goals outlined above, key transparency deployments make use of a few interesting cryptographic primitives. Service providers generally publish the key directory together with a commitment to that directory. In practice, this is usually achieved with a sparse Merkle tree, where the commitment is the root of that tree. In early academic proposals, the server would post a commitment to the current key directory at regular intervals. Newer designs (such as SEEMless) propose that the server publish commitments to the incremental changes to the key directory, lowering the computational effort needed to audit the key transparency tree (since the entire tree does not have to be recomputed and verified). To safeguard the privacy of users, modern key transparency deployments use Verifiable Random Functions (VRFs), which can be thought of as the public-key variant of a hash function: only the private key owner may compute the hash output and its associated proof, but anyone can use the corresponding public key to verify that the output was computed correctly. If the leaves of the Merkle tree were computed directly from the identifying information of users, for example by simply hashing some form of identifier, attackers could easily collect information about users. A VRF construction conceals that information by essentially randomizing the leaf positions in the Merkle tree. Melissa wrapped up the literature review portion of her talk by presenting OPTIKS, a performant new key transparency solution which focuses on scalability, and which she contributed to.
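To make the Merkle-tree side of these designs concrete, here is a toy sketch that builds a small tree over directory entries and verifies an inclusion proof against the published root. The entry encoding and the dense (non-sparse) tree are invented for the example; real deployments use sparse trees with VRF-derived leaf positions.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the root of a (perfect) Merkle tree over the given leaves."""
    level = [h(b"\x00" + leaf) for leaf in leaves]  # domain-separate leaf hashes
    while len(level) > 1:
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def inclusion_proof(leaves, index):
    """Collect the sibling hashes needed to recompute the root from one leaf."""
    level = [h(b"\x00" + leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        proof.append(level[index ^ 1])  # the paired node at this level
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_inclusion(root, leaf, index, proof):
    """Recompute the root from a leaf plus its proof and compare."""
    node = h(b"\x00" + leaf)
    for sibling in proof:
        if index % 2 == 0:
            node = h(b"\x01" + node + sibling)
        else:
            node = h(b"\x01" + sibling + node)
        index //= 2
    return node == root

leaves = [b"alice:pk1", b"bob:pk2", b"carol:pk3", b"dave:pk4"]
root = merkle_root(leaves)
proof = inclusion_proof(leaves, 1)
assert verify_inclusion(root, b"bob:pk2", 1, proof)
```

A user checking their own entry only needs the root commitment and a logarithmic number of sibling hashes, which is what makes these directories auditable at scale.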

While many of the technical aspects of key transparency seem well ironed out in theory, a number of open questions and practical aspects still require further engineering effort. To start, how should the bulletin board - the publicly accessible key directory - be instantiated so that users can access it efficiently and consistently? A second crucial and often overlooked point is that of auditors. One common goal of these key transparency deployments is to give auditors the ability to validate the consistency of the key directory. But who are these auditors in practice, and what incentives do they have to perform costly validation work? And if they were to identify any wrongdoing, who would they even report such issues to? A third open question Melissa raised concerned the security guarantees of such systems and whether stronger security notions could be obtained. For example, in current schemes, users will detect if a service provider maliciously replaces a user’s key, but users themselves can’t prevent it.

WhatsApp Key Transparency

Later that day, Kevin Lewi and Sean Lawlor presented WhatsApp’s key transparency solution. Recent updates to WhatsApp added a feature that automatically validates users’ public keys, based on a key transparency deployment following many of the concepts presented above. Previously, only manual out-of-band verification was available to chat users. Now, servers publish a commitment to the public key database and, supported by UI updates in the app, the validity of a contact’s key is automatically checked when users access the “Encryption” menu for that contact.

The presentation explored the different technical aspects this deployment necessitated, such as the infrastructure challenges of supporting these updates and the frequency at which they need to be published. The speakers then presented some of the underlying cryptographic constructions used by the deployment. The system uses sparse Merkle trees and VRFs in a fashion similar to SEEMless, and publishes incremental updates to the key transparency tree in the form of append-only proofs, which are roughly 200 MB each and are published at approximately 5-minute intervals.

Kevin and Sean concluded their presentation by advertising the release of their implementation of the auditable key directory (accessible at https://github.com/facebook/akd), which is what WhatsApp uses in production for their key transparency deployment and which can also be used to verify the consistency proofs by external auditors. Members of NCC Group’s Cryptography Services team reviewed the implementation a few months before the conference; the public report can be found on NCC’s research platform: Public Report – WhatsApp Auditable Key Directory (AKD) Implementation Review.

Modern transparency logs

Finally, on the last day of the conference, Filippo Valsorda gave a talk on Modern Transparency Logs. Drawing parallels with key transparency solutions, Filippo kicked off his talk by framing transparency logs as a reusable primitive: a magic, global, append-only list of entries essentially defined by three fundamental questions: what are the entries, who can add them, and who monitors them? Different transparency solutions (such as the Go checksum database, which Filippo used repeatedly as an example throughout his presentation) are ultimately defined by the answers to these questions.

When building transparency log solutions, a fundamental class of attack that must be prevented is presenting different views of the log to different users, known as a split-view attack. In a key transparency deployment, for example, one could imagine a compromised (or rogue) server advertising a different public key for a targeted victim. There are a few ways to prevent split-view attacks. The first is to ensure local consistency (for example, with an append-only log); the second is peer-to-peer gossip, where peers communicate amongst themselves to ensure they are being served the same view of the system; and the third is witness cosigning. Witnesses are lightweight third-party entities responsible for verifying consistency proofs between consecutive Merkle tree roots, and which cosign each new tree head. Given a network of witnesses, more complex policies can be developed, such as requiring a threshold of M-out-of-N signatures in order for the tree head to be considered validated.
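The threshold policy at the end is easy to sketch. In this toy model, HMAC stands in for the real public-key signatures (e.g. Ed25519) a witness network would use, and the witness names and key setup are invented for the example; a client accepts a tree head only if it carries at least M valid cosignatures from known witnesses:

```python
import hashlib
import hmac

# Hypothetical witness roster; real witnesses publish public keys, not shared secrets.
witness_keys = {f"witness-{i}": bytes([i + 1]) * 32 for i in range(5)}

def cosign(name: str, tree_head: bytes) -> bytes:
    """A witness 'signature' over a tree head (HMAC as a stand-in)."""
    return hmac.new(witness_keys[name], tree_head, hashlib.sha256).digest()

def accept(tree_head: bytes, cosignatures: dict, threshold: int) -> bool:
    """Accept a tree head only if at least `threshold` known witnesses cosigned it."""
    valid = sum(
        1
        for name, sig in cosignatures.items()
        if name in witness_keys
        and hmac.compare_digest(sig, cosign(name, tree_head))
    )
    return valid >= threshold

head = hashlib.sha256(b"tree head at size 1024").digest()
sigs = {name: cosign(name, head) for name in ("witness-0", "witness-2", "witness-4")}
assert accept(head, sigs, threshold=3)       # 3-of-5 policy satisfied
assert not accept(head, sigs, threshold=4)   # not enough cosignatures
```

The M-of-N structure is what lets a log survive a few offline or compromised witnesses without letting a single rogue witness vouch for a forked view.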

Filippo then proceeded to advertise a number of specifications and work-in-progress items supporting modern transparency log deployments. The first is the checkpoint format specification, which is used to interoperate with the witness ecosystem. Checkpoints are essentially signed notes precisely formatted for use in transparency log applications, containing the origin of the checkpoint, the tree size, the root hash, and a number of potential co-signatures on that root hash. Recognizing that a checkpoint coupled with an inclusion proof is everything a client needs to verify inclusion offline, Filippo then introduced the concept of “spicy signatures” (🌶️), which are offline-verifiable proofs of inclusion in a transparency log. He then concluded his talk by presenting a lightweight CLI tool and showing how spicy signatures can be used efficiently in existing deployments, for example by bringing transparency to the Debian package ecosystem in only a few hours.
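As an illustration of the shape of the format (not the exact specification), the body of a checkpoint note boils down to three lines - origin, decimal tree size, base64 root hash - which a client might parse as follows; note signatures and witness cosignatures are omitted here:

```python
import base64

def parse_checkpoint(note_body: str) -> dict:
    """Parse the (unsigned) body of a checkpoint note:
    line 1: origin string, line 2: tree size, line 3: base64-encoded root hash."""
    lines = note_body.splitlines()
    origin, size, root_b64 = lines[0], int(lines[1]), lines[2]
    return {"origin": origin, "size": size, "root": base64.b64decode(root_b64)}

# Example body in the style of the Go checksum database checkpoint.
body = "go.sum database tree\n16231231\n" + base64.b64encode(b"\x01" * 32).decode() + "\n"
cp = parse_checkpoint(body)
assert cp["origin"] == "go.sum database tree"
assert cp["size"] == 16231231
assert cp["root"] == b"\x01" * 32
```

Keeping the body this small is what makes checkpoints cheap for witnesses to verify and cosign.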

– Paul Bottinelli

Symmetric Encryption

This year’s symmetric encryption session reinforced the motivation to modernize our security requirements and design philosophy for symmetric primitives and modes of operation, based on lessons learned and changing requirements over the past 20 years.

Building the Next Generation of AEAD

The symmetric cryptography session was opened by Sanketh Menda, who had closed out last year’s event with a presentation on “context-committing” AEADs (authenticated encryption with associated data), making the case for standardized constructions that commit to the complete “context” of an AEAD (e.g., the key and nonce). In his update this year, “Building the Next Generation of AEAD”, a broader set of goals was presented:

  • We sometimes need a fast approach for lightweight devices;
  • We sometimes need a suitable approach for cloud-scale data;
  • We sometimes need nonce-misuse resistance;
  • We sometimes need a nonce-hiding scheme;
  • And as established last time, we sometimes need context commitment.

And is there one ideal scheme to rule them all? Of course not… However, there may be a new approach to designing a family of schemes that facilitates safer use. To this end, a “flexible AEAD” construction is proposed, which presents an implementer with a single set of binary choices corresponding to various security properties, thereby allowing a developer to express their intent rather than choose and compose various modes of operation. Sanketh then presents a series of primitives that can be composed in standard ways to achieve these various security goals.
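To make the commitment property from the list above concrete, here is a minimal sketch - not Sanketh’s construction - of the generic pattern behind context commitment: derive both a per-message encryption key and a commitment that binds (key, nonce), ship the commitment with the ciphertext, and have the receiver check it before decrypting. The label strings and key sizes are invented for the example.

```python
import hashlib
import hmac

def commit_context(key: bytes, nonce: bytes) -> tuple:
    """Derive a per-message encryption key plus a commitment binding (key, nonce).
    Distinct labels keep the two HMAC outputs independent."""
    enc_key = hmac.new(key, b"enc" + nonce, hashlib.sha256).digest()
    commitment = hmac.new(key, b"com" + nonce, hashlib.sha256).digest()
    return enc_key, commitment

def check_commitment(key: bytes, nonce: bytes, commitment: bytes) -> bool:
    """Receiver-side check: only the same (key, nonce) opens the commitment."""
    expected = hmac.new(key, b"com" + nonce, hashlib.sha256).digest()
    return hmac.compare_digest(commitment, expected)

k1, k2 = b"K" * 32, b"Q" * 32
nonce = b"N" * 12
_, com = commit_context(k1, nonce)
assert check_commitment(k1, nonce, com)
assert not check_commitment(k2, nonce, com)  # a different key cannot open it
```

A scheme with this property denies an attacker the ability to craft one ciphertext that is valid under two different keys, which is exactly the failure mode plain AES-GCM permits.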

With two excellent back-to-back presentations on the topic, I’m hoping we’ll get to hear a progress update from Sanketh again next year.

What’s wrong with Poly1305?

Jan Gilcher and Jérôme Govinden followed up with a presentation looking back on the development and deployment of Poly1305, asking a fundamental question: “Given today’s advancements and applications, would we still converge to this same design?”. This is motivated by the observation that Poly1305 sacrifices a degree of security in favor of speed on 32-bit platforms using optimizations in the floating-point unit, whereas most modern platforms are 64-bit and leverage the arithmetic logic unit for optimized Poly1305 computations. So how would we build and optimize a Poly1305-like construction on today’s hardware?

Much like the preceding talk, the authors consider a modular construction for a family of polynomial-based hashes, from which Poly1305 and other similar schemes can be instantiated from a set of input parameters. This allows for efficient testing and comparison of a broad family of implementations, which can be tuned to trade off security level against speed on a given platform. While such an approach does not outperform a hand-optimized implementation of a specific function, it appears to achieve impressive results given the flexibility it provides.
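The core of every member of this family is a polynomial evaluation hash over a prime field. Below is a minimal sketch using Poly1305’s prime and block size, but omitting its clamping of r and the final masked addition, so it is illustrative rather than interoperable:

```python
def poly_hash(msg: bytes, r: int, p: int, block_bytes: int) -> int:
    """Evaluate the message as a polynomial in r over GF(p):
    acc <- (acc + m_i) * r mod p for each block m_i."""
    acc = 0
    for i in range(0, len(msg), block_bytes):
        block = msg[i:i + block_bytes]
        # Append a 0x01 byte, as Poly1305 does, so blocks of different
        # lengths (e.g. a short final block) hash differently.
        m = int.from_bytes(block + b"\x01", "little")
        acc = (acc + m) * r % p
    return acc

# Poly1305's prime; a real r would be a clamped 128-bit secret.
p1305 = 2**130 - 5
tag = poly_hash(b"hello polynomial hashes", r=12345, p=p1305, block_bytes=16)
assert 0 <= tag < p1305
```

Varying the prime, block size, and clamping rules in this skeleton is essentially the parameter space the authors explore, with variants like Poly1163 and Poly1503 landing at different points on the security/speed curve.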

Leveraging their new construction, the authors present a variant, Poly1163, which is better optimized for current hardware at a similar security level to Poly1305. Impressively, despite not being hand-optimized at all, this variant outperforms OpenSSL’s Poly1305 implementation. On the other end of the design spectrum, the authors also present Poly1503, which focuses on providing higher bit-security by not clamping inputs in the same manner as Poly1305 without a substantial hit to performance.

I want to encrypt 2^64 bytes with AES-GCM using a single key

Shay Gueron closed out the session with his presentation “I want to encrypt 2^64 bytes with AES-GCM using a single key”, which proposes a new mode of operation for AES called Double Nonce Derive Key (DNDK), purpose-built to extend AES-GCM to support modern cloud-scale encryption tasks using a single key.

AES-GCM is the most widely used AEAD we encounter and is generally a safe choice for most applications when used correctly. However, GCM has a few well-known limitations: the 12-byte initialization value (IV) limits the number of invocations that can safely be made with a single key, and GCM out of the box does not provide key commitment, meaning that an attacker can produce a single authenticated ciphertext that decrypts to two different messages under two different nonce+key combinations. It is precisely these two problems that DNDK addresses, while striving to remain as close as possible to the GCM construction itself.

In practice, the concept is simple: if the short IV (nonce) is holding us back, then simply make it bigger - say, double its size. But a “double nonce” isn’t quite enough with GCM, since the first internal step is to hash it down to its original smaller size. Instead, we can use AES itself to build a key derivation function that takes as input the “double nonce” and the encryption key and derives an invocation-specific encryption key. In short, we use our double-nonce-derived key to encrypt our message, and we have DNDK. As a bonus, DNDK also supports key commitment out of the box, as an optional output parameter. This incurs little practical overhead and does not rely on any additional cryptographic primitives to achieve its security.
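A hedged sketch of the derivation step: the real DNDK derives the per-invocation key with AES itself, but HMAC-SHA256 stands in here purely to keep the example self-contained, and the label string is invented for the illustration.

```python
import hashlib
import hmac
import os

def derive_invocation_key(root_key: bytes, double_nonce: bytes) -> bytes:
    """Derive a fresh per-message key from the long-lived root key and a
    24-byte 'double nonce'. (DNDK does this with AES; HMAC is a stand-in.)"""
    assert len(double_nonce) == 24
    return hmac.new(root_key, b"derive-key" + double_nonce, hashlib.sha256).digest()

root = os.urandom(32)
n1, n2 = os.urandom(24), os.urandom(24)
k1 = derive_invocation_key(root, n1)
k2 = derive_invocation_key(root, n2)
assert k1 != k2       # each nonce yields an independent-looking key
assert len(k1) == 32  # usable as a fresh AES-256-GCM key for that one message
```

Because each message is encrypted under its own derived key, the per-key invocation limits of plain GCM apply per message rather than to the long-lived root key, which is what buys the 2^64-byte headroom.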

Shay and friends at Meta have provided an optimized open-source implementation of DNDK-GCM, alongside implementations of AES-GCM and AES-GCM-SIV for comparison. A draft RFC has also been published to guide those wishing to implement DNDK for themselves. The Crypto Services team is proud to have supported the development of the DNDK draft RFC, with team members Gérald Doussot, Thomas Pornin, and Eric Schorn formally acknowledged in the draft.

– Kevin Henry

Real World Cryptography 2025

We look forward to catching up with everyone next year in Sofia, Bulgaria!

Announcing AI/ML safety and security trainings

By Michael D. Brown

We are offering AI/ML safety and security training in summer and fall of this year!

Recent advances in AI/ML technologies have opened up a new world of possibilities for businesses to run more efficiently and offer better services and products. However, incorporating AI/ML into computing systems brings new and unique complexities, risks, and attack surfaces. In our experience helping clients safely and securely deploy these systems, we’ve discovered that their security teams have knowledge gaps at this intersection of AI/ML and systems security. We’ve developed our training to help organizations close this gap and equip their teams with the tools to secure their AI/ML operations pipelines and technology stacks.

What you will learn in our training

Our course is tailored for security engineers, ML engineers, and IT staff who need to understand the unique challenges of securing AI/ML systems deployed on conventional computing infrastructure. Over two days, we provide a comprehensive understanding of AI safety and security that goes beyond basic knowledge to practical and actionable insights into these technologies’ specific dangers and risks. Here’s what you will learn through a blend of instructional training and hands-on case studies:

  1. Fundamentals of AI/ML and cybersecurity: In this module, you will learn how AI/ML models/techniques work, what they can and cannot do, and their limitations. We also cover some essential information and software security topics that may be new for ML engineers.
  2. AI/ML tech stacks and operations pipelines: In our second module, you will learn how AI/ML models are selected, configured, trained, packaged, deployed, and decommissioned. We’ll also explore the everyday technologies in the AI/ML stack that professionals use for these tasks.
  3. Vulnerabilities and remediation: In this module, you will learn about the unique attack surfaces and vulnerabilities present in deployed AI/ML systems. You’ll also learn methods for preventing and/or remediating AI/ML vulnerabilities.
  4. Risk assessment and threat modeling: The fourth module covers practical techniques for conducting comprehensive risk assessments and threat models for AI/ML systems. Our holistic approaches will help you evaluate the safety and security risks AI/ML systems may pose to end users in deployed contexts.
  5. Mitigations, controls, and risk reduction: Finally, you will learn how to implement realistic risk mitigation strategies and practical security controls for AI/ML systems. Our comprehensive strategies address the entire AI/ML ops pipeline and lifecycle.

Equip your team to work at the intersection of security and AI/ML

Trail of Bits combines cutting-edge research with practical, real-world experience to advance the state of the art in AI/ML assurance. Our experts are here to help you confidently take your business to the next level with AI/ML technologies. Please contact us today to schedule an on-site (or virtual) training for your team. Individuals interested in this training can also use this form to be notified in the future when we offer public registration for this course!

No Way, PHP Strikes Again! (CVE-2024-4577)

Orange Tsai tweeted a few hours ago about “one of [his] PHP vulnerabilities, which affects XAMPP by default”, and we were curious, to say the least. XAMPP is a very popular way for administrators and developers to rapidly deploy Apache, PHP, and a bunch of other tools, and any bug that could give us RCE in its default installation sounds pretty tantalizing.

Fortunately for defenders, the bug is only known to be exploitable on Windows-based PHP installations (where PHP is specifically used in CGI mode), under certain locales:

  • Chinese (both simplified and traditional), and
  • Japanese.

However, Orange cautions that other locales could be affected too, and urges users to upgrade to the latest version of PHP, which fixes these bugs (for details, see their blog post).

We are keen to point out that we are unsure how common this configuration, or deployment type, is in reality. It is also not our job to find out, outside of our client base. But, regardless, it's an interesting vulnerability due to the root cause. Enjoy with us.

Orange's blog post, while informative, doesn’t tell us exactly what to do to get that sweet RCE. Unfortunately, the wide range of configuration options makes it difficult to conclusively prove an instance vulnerable (or not) at a passive glance - obviously, a Windows machine's locale is not typically externally fingerprintable. Because of this, we set about reproducing the bug - if we can exploit it, that’s the best way of proving exploitability, right?

Reading Orange's blog, it is clear that the bug only affects PHP's CGI mode. In this mode, the webserver parses HTTP requests and passes them to a PHP script, which then performs some processing on them. For example, querystrings are parsed and passed to the PHP interpreter on the command line - a request such as http://host/cgi.php?foo=bar might be executed as php.exe cgi.php foo=bar.

This does, of course, introduce an avenue for command injection, which is why input is carefully handled and sanitized before php.exe is invoked (cough CVE-2012-1823). However, it seems there is a corner case the developers did not account for, which allows an attacker to break out of the command line and supply arguments that are interpreted by PHP itself. This corner case relates to how Unicode characters are converted into ASCII. It is best explained with an example.

Here are two invocations of php.exe, one malicious and one benign. Can you spot the difference?

[Screenshot: two seemingly identical php.exe command lines]

No, neither can I. Let’s look at them in a hex editor and see if that gives us any clues.

[Screenshot: hex dump of the two command lines, with byte 0xAD highlighted]

Hmm, interesting - here we can see that the first invocation uses a normal dash (0x2D), while the second, it seems, uses something else entirely (a ‘soft hyphen’, apparently), with the code 0xAD (highlighted). While they both look the same to you and me, they have vastly different meanings to the OS.

An important detail here is that Apache will escape the actual hyphen, 0x2D, but not the ‘soft hyphen’, 0xAD. After all, it’s not a real hyphen, right? So there’s no need to escape it… right?

[Meme image - we don't care if it's the same joke as above, it's still funny.]

Well. It turns out that, as part of Unicode processing, PHP will apply what’s known as a ‘best fit’ mapping, and helpfully assume that when the user entered a soft hyphen, they actually intended to type a real hyphen, and interpret it as such. Herein lies our vulnerability - if we supply a CGI handler with a soft hyphen (0xAD), the CGI handler won’t feel the need to escape it, and will pass it to PHP. PHP, however, will interpret it as if it were a real hyphen, which allows an attacker to sneak extra command-line arguments, which begin with hyphens, into the PHP process.
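The mechanics can be modeled in a few lines. The mapping function below is a toy stand-in for the Windows best-fit conversion that PHP ends up triggering; the point is that the attacker’s argument contains no byte Apache considers a hyphen, yet after the mapping it becomes a perfectly ordinary -d flag:

```python
# The two bytes at the heart of the bug: a real hyphen and a soft hyphen.
REAL_HYPHEN, SOFT_HYPHEN = 0x2D, 0xAD

def best_fit_to_ascii(data: bytes) -> bytes:
    """Toy model of 'best fit' conversion: characters with no exact ASCII
    equivalent are mapped to the closest-looking one, so U+00AD becomes '-'."""
    return bytes(REAL_HYPHEN if b == SOFT_HYPHEN else b for b in data)

attacker_arg = bytes([SOFT_HYPHEN]) + b"d allow_url_include=1"
assert attacker_arg[0] != REAL_HYPHEN  # Apache sees no hyphen worth escaping...
assert best_fit_to_ascii(attacker_arg) == b"-d allow_url_include=1"  # ...PHP does
```

This is the whole escaping mismatch: the sanitizer and the consumer disagree about what the byte means, and the consumer’s interpretation is the dangerous one.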

This is remarkably similar to an older PHP bug (when in CGI mode), CVE-2012-1823, and so we can borrow some exploitation techniques developed for this older bug and adapt them to work with our new bug. A helpful writeup advises that, to translate our injection into RCE, we should aim to inject the following arguments:

-d allow_url_include=1 -d auto_prepend_file=php://input

This will accept input from our HTTP request body, and process it using PHP. Straightforward enough - let’s try a version of this equipped with our 0xAD β€˜soft hyphen’ instead of the usual hyphen. Maybe it’s enough to slip through the escaping?

POST /test.php?%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp://input HTTP/1.1
Host: {{host}}
User-Agent: curl/8.3.0
Accept: */*
Content-Length: 23
Content-Type: application/x-www-form-urlencoded
Connection: keep-alive

<?php
phpinfo();
?>
 

Oh joy - we're rewarded with a phpinfo page, showing us we have indeed achieved RCE.

Conclusions

A nasty bug with a very simple exploit - perfect for a Friday afternoon.

Fortunately, though, patches are available, so we echo Orange Tsai's advice to upgrade your PHP installation. As always, fantastic work and a salute to Orange Tsai.

Those running an affected configuration under one of the affected locales - Chinese (simplified or traditional) or Japanese - are urged to do this as fast as humanly possible, as the bug has a high chance of being exploited en masse due to its low exploit complexity. Other users are still strongly encouraged to update:

For Windows running in other locales such as English, Korean, and Western European, due to the wide range of PHP usage scenarios, it is currently not possible to completely enumerate and eliminate all potential exploitation scenarios. Therefore, it is recommended that users conduct a comprehensive asset assessment, verify their usage scenarios, and update PHP to the latest version to ensure security.

We won't duplicate the advisory here; instead, we advise those individuals seeking remediation advice to refer to the comprehensive advisory.

At watchTowr, we believe continuous security testing is the future, enabling the rapid identification of holistic high-impact vulnerabilities that affect your organisation.

It's our job to understand how emerging threats, vulnerabilities, and TTPs affect your organisation.

If you'd like to learn more about the watchTowr Platform, our Attack Surface Management and Continuous Automated Red Teaming solution, please get in touch.

The sliding doors of misinformation that come with AI-generated search results

As someone who used to think that his entire livelihood would come from writing, I’ve long wondered if any sort of computer or AI could replace my essential functions at work. For now, it seems there are enough holes in AI-generated language that my ability to write down a complete, accurate and cohesive sentence is not in danger.Β 

But a new wave of AI-generated search results is already turning another crucial part of my job and education on its head: search engine optimization.Β 

Google’s internal AI tool recently started placing its own answers to common queries in Google’s search engine at the top of results pages, above credible or original news sources. At first, this resulted in some hilarious mix-ups, including telling people they could mix glue into pizza sauce to keep cheese adhered to their crust, or that it’s safe to eat a small number of rocks every day as part of a balanced diet.Β 

While these mix-ups are hilarious, I'm worried about the potential implications these features may have for misinformation and fake news on topics that are more important, or easier to believe, than topping your pizza with glue.

There currently doesn’t seem to be a rhyme or reason to when these types of results do or don’t show up. Google recently announced several changes to its AI-generated search results that now aim to prevent misleading or downright false information on search queries that cover more β€œimportant” topics.Β Β 

β€œFor topics like news and health, we already have strong guardrails in place. For example, we aim to not show AI Overviews for hard news topics, where freshness and factuality are important. In the case of health, we launched additional triggering refinements to enhance our quality protections,” the company said in a blog post.Β Β 

When testing this out firsthand, I got mixed results. For "hard" news topics, Google doesn't display AI-generated results at all; no AI Overview appeared when I searched for questions like "Who should I vote for in the 2024 presidential election?" or "Does the flu vaccine really work?"

But I did get one of the AI-generated answers when I searched for β€œWhen is a fever too high for a toddler?” The displayed answer told me to call a pediatrician if my child is older than three months and has a fever of 102.2 degrees Fahrenheit or higher. Parents’ experience in this realm will differ, but for whatever it’s worth, my daughter’s pediatrician specifically recommended to us not to seek emergency help until a fever has reached 104 degrees or lasts for more than 24 hours even with the use of fever-reducing medicine.Β 

Google's AI also displayed information when I searched for "Talos cryptocurrency scams" to try to find one of our past blog posts. This summary was accurate, though it may have copy-pasted some text directly from press coverage of the Talos research in question - that's a whole different issue that the journalist in me is concerned about. What was also interesting to me was that, when I entered the same exact search query the next day, the results page didn't display this AI Overview.

Bing, Microsoft's direct competitor to Google's search engine, is also using its own form of AI-curated content to answer queries.

My concern here is when or if these types of answers are generated for news topics that are already rife with misinformation β€” think elections, politics, public health and violent crime. Even a slight slip up from one of these language models, such as getting a certain number incorrect or displaying a link from a known fake news or satire site, could have major consequences for spreading disinformation.Β 

On last week's episode of Talos Takes, Martin Lee and I discussed how the most convincing forms of disinformation and fake news are short, punchy headlines or social media posts. The average person is not as media literate as we'd like to think, and a quick and easy summary of a topic after they type a question into a search engine is likely going to be good enough for most users on the internet. Even asking someone to click through to the second page of Google's search results is usually asking too much.

AI's integration into search engines could change the way many of us interact with the internet - I've used Google's search engine as my homepage since I was in middle school. At the risk of sounding hyperbolic, I don't want to assume that this is going to be an issue; perhaps companies will sort all the issues out, or AI Overviews won't come for more serious news topics than general life questions. But so far, the results shouldn't inspire much confidence.

The one big thingΒ 

Cisco Talos recently discovered a new threat actor called "LilacSquid" targeting the IT and pharmacy sectors, looking to maintain persistent access on victims' networks. This campaign leverages vulnerabilities in public-facing application servers and compromised remote desktop protocol (RDP) credentials to orchestrate the deployment of a variety of open-source tools, such as MeshAgent and SSF, alongside customized malware, such as "PurpleInk," and two malware loaders we are calling "InkBox" and "InkLoader."

Why do I care?Β 

LilacSquid's victimology is diverse: information technology organizations building software for the research and industrial sectors in the United States, organizations in the energy sector in Europe, and the pharmaceutical sector in Asia, indicating that the threat actor (TA) may be agnostic of industry verticals and trying to steal data from a variety of sources. Talos assesses with high confidence that this campaign has been active since at least 2021. Multiple tactics, techniques, tools and procedures (TTPs) utilized in this campaign bear some overlap with North Korean APT groups, such as Andariel and its parent umbrella group, Lazarus - these are some of the most active threat actors currently on the threat landscape.

So now what?Β 

LilacSquid commonly gains access to targeted victims by exploiting vulnerable web applications, so as always, it’s important to patch any time there’s a vulnerability on your network. Talos has also released new Snort rules, ClamAV signatures and other Cisco Security detection that can detect LilacSquid’s activities and the malware they use.Β Β 

Top security headlines of the weekΒ 

Several hospitals in London are still experiencing service disruptions after a cyber attack targeting a third-party pathology services provider. Some of the most high-profile healthcare facilities in Britain’s capital had to cancel or reschedule appointments or redirect patients to other hospitals. Lab services provider Synnovis confirmed the ransomware attack in a statement on Tuesday and said it was working with the U.K.’s National Health Service to minimize the effects on patients. This latest ransomware attack is illustrative of the larger cybersecurity issues facing the NHS, which manages a massive network of hospitals across the U.K. and has more than 1.7 million employees. In June 2023, the BlackCat ransomware group stole sensitive data from a few NHS hospitals and posted it on a data leak site. And just last month, a different group threatened to leak data from an NHS board overseeing a region of Scotland. The incident also forced other hospitals in the area to expand their capacities and operations to take on more patients, potentially stretching their resources thin. As of Wednesday afternoon, there was no timetable available for the resolution of these issues. (The Record by Recorded Future, Bloomberg)Β 

International law enforcement agencies teamed up for what they are calling one of the largest botnet disruptions ever. U.S. prosecutors announced last week that they dismantled a botnet called "911 S5," arresting and charging its administrator as part of a global effort. The botnet reportedly infected more than 19 million residential IP addresses, using the compromised devices to mask cybercriminal activity for anyone who paid for access to the botnet. Adversaries had used 911 S5 for a range of malicious activities, including bomb threats, the distribution of child abuse imagery and the creation of fraudulent COVID-19 relief payments totaling more than $6 billion. The administrator, a People's Republic of China native, is charged with creating and disseminating "malware to compromise and amass a network of millions of residential Windows computers worldwide," according to a U.S. Department of Justice press release. The botnet was allegedly active between 2014 and July 2022. 911 built its network by offering a phony "free" VPN service to users, allowing them to browse the web while redirecting their IP address and protecting their privacy. However, the VPN service turned the target's device into a traffic relay for the malicious 911 S5 customers. (U.S. Department of Justice, Krebs on Security)

In a separate law enforcement campaign called β€œOperation Endgame,” law enforcement agencies from several countries disrupted droppers belonging to several malware families. Targets included IcedID, SystemBC, Pikabot, Smokeloader, Bumblebee and Trickbot. The coordinated effort between multiple European countries and the U.S. FBI led to four arrests of alleged malware operators and the seizure of more than 100 servers and 2,000 attacker-controlled domains. Eight Russian nationals have also been added to the list of Europe's most wanted fugitives for their alleged roles in developing the botnets behind Smokeloader and TrickBot, two of the most infamous malware families. Law enforcement agencies are also zeroing in on the person they believe to be behind the Emotet botnet, nicknamed β€œOdd.” "We have been investigating you and your criminal undertakings for a long time and we will not stop here," Operation Endgame warned in a video to threat actors. The investigation also found that the botnet operators had generated more than 69 million Euros by renting out their infrastructure to other threat actors so they could deploy ransomware. (Dark Reading, Europol)Β 

Can’t get enough Talos?Β 

Upcoming events where you can find TalosΒ 

AREA41 (June 6 – 7)Β 

Zurich, SwitzerlandΒ 

Gergana Karadzhova-Dangela from Cisco Talos Incident Response will highlight the paramount importance of actionable incident response documentation for the overall response readiness of an organization. During this talk, she will share commonly observed mistakes when writing IR documentation and ways to avoid them. She will draw on her experiences as a responder who works with customers during proactive activities and actual cybersecurity breaches.

Cisco Connect U.K. (June 25)

London, England

In a fireside chat, Cisco Talos experts Martin Lee and Hazel Burton discuss the most prominent cybersecurity threat trends of the near future, how these are likely to impact UK organizations in the coming years, and what steps we need to take to keep safe.

BlackHat USA (Aug. 3 – 8)Β 

Las Vegas, NevadaΒ 

Defcon (Aug. 8 – 11)Β 

Las Vegas, NevadaΒ 

BSides Krakow (Sept. 14)Β Β 

Krakow, PolandΒ 

Most prevalent malware files from Talos telemetry over the past weekΒ 

SHA 256: 9be2103d3418d266de57143c2164b31c27dfa73c22e42137f3fe63a21f793202Β 
MD5: e4acf0e303e9f1371f029e013f902262Β 
Typical Filename: FileZilla_3.67.0_win64_sponsored2-setup.exeΒ 
Claimed Product: FileZillaΒ 
Detection Name: W32.Application.27hg.1201Β 

SHA 256: 0e2263d4f239a5c39960ffa6b6b688faa7fc3075e130fe0d4599d5b95ef20647Β 
MD5: bbcf7a68f4164a9f5f5cb2d9f30d9790Β 
Typical Filename: bbcf7a68f4164a9f5f5cb2d9f30d9790.virΒ 
Claimed Product: N/AΒ 
Detection Name: Win.Dropper.Scar::1201Β 

SHA 256: 5616b94f1a40b49096e2f8f78d646891b45c649473a5b67b8beddac46ad398e1
MD5: 3e10a74a7613d1cae4b9749d7ec93515
Typical Filename: IMG001.exe
Claimed Product: N/A
Detection Name: Win.Dropper.Coinminer::1201

SHA 256: a024a18e27707738adcd7b5a740c5a93534b4b8c9d3b947f6d85740af19d17d0Β 
MD5: b4440eea7367c3fb04a89225df4022a6Β 
Typical Filename: Pdfixers.exeΒ 
Claimed Product: PdfixersΒ 
Detection Name: W32.Superfluss:PUPgenPUP.27gq.1201Β 

SHA 256: c67b03c0a91eaefffd2f2c79b5c26a2648b8d3c19a22cadf35453455ff08ead0Β Β 
MD5: 8c69830a50fb85d8a794fa46643493b2Β Β 
Typical Filename: AAct.exeΒ Β 
Claimed Product: N/AΒ Β Β 
Detection Name: PUA.Win.Dropper.Generic::1201Β 

How to Train Your Large Language Model

Large Language Models (LLMs) such as those provided by OpenAI (GPT-3/4), Google (Gemini), and Anthropic (Claude) can be a useful tool to include when conducting security audits or reverse engineering; however, one of the main downsides of using these tools is that the data you are reviewing is processed server side, meaning any data analyzed by the tool must be uploaded/sent to the server.

While these services provide privacy policies that may double pinky swear your data is safe, and that they will not use it for training if you opt out, as consultants we are often working with a client's data that is under NDA, preventing the usage of these services. Outside of cases where an NDA is in place, a policy won't protect you from platform bugs or provider monitoring that may leak your data or research. We have already seen an example of this with OpenAI publicly confirming they monitor the usage of its service to identify potentially 'evil' usage by bad actors - https://openai.com/index/disrupting-malicious-uses-of-ai-by-state-affiliated-threat-actors/

Besides privacy concerns, a few other disadvantages of using a hosted service are:

  • service may go away (outage/sale)
  • modified to prevent malicious use (RE/Exploitation often flagged)
    • potentially resulting monitoring/account ban
  • costs (usually per-token)

Given these hurdles, smaller models that run locally on your own hardware are a promising path to leveraging a LLM without compromising your privacy or an NDA.

Comparisons

To be fair, it is worth pointing out the differences between the hosted LLM offerings and the local versions. The big difference is going to be the size of the training dataset and the model parameter count - this can be thought of as the amount of 'knowledge' or data stored within the model; more parameters indicates more 'knowledge' it can reference based on your input. OpenAI does not provide the details of GPT-4; GPT-3 was 175 billion parameters, while GPT-3.5's size has not been disclosed - speculation/research/guessing indicates it is much smaller (~22b parameters), due to fine-tuning and/or other 'secret sauce'. GPT-4 is widely speculated to be in the trillion-parameter range. On the other hand, a local model that will run on consumer hardware is going to be in the 2b-70b range; this is obviously a clear disadvantage and is going to result in lower quality responses when compared to a hosted service.

Run Whatcha Brung

The actual size of the model you can run is going to be dependent on how much memory you have available - a decent rule is that the model will occupy 2x the memory of the parameter size: 2b/4gb, 7b/14gb, etc. The main exception to this rule is models that have been modified to use smaller values for stored parameters (quantization). Normally a model will use 16-bit floating point values for parameters; however, by clipping these values to smaller units (8/4-bit) the size can be reduced with minimal to no quality drop, resulting in lower memory usage and faster results.
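As a rough sketch of that rule of thumb (ignoring runtime overhead such as the context/KV cache), the footprint is just the parameter count times the bytes stored per parameter:

```python
def model_memory_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory needed to hold a model's weights.

    Standard fp16 weights use 2 bytes per parameter; 8-bit or 4-bit
    quantization shrinks the footprint proportionally.
    """
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(model_memory_gb(7))      # 14.0 -> a 7b model at fp16 needs ~14 GB
print(model_memory_gb(7, 4))   # 3.5  -> the same model 4-bit quantized
```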

When it comes to actual speed of results, it comes down to where you are running your inference. The best results are going to come from a recent GPU, ideally 24GB VRAM, meaning NVIDIA 3090 or 4090 - a used 3090 is best for the money for a turnkey solution. The next best setup is going to be an Apple Silicon (arm) Macbook/Studio/etc. - while this may be contentious, it is difficult to match the performance due to the shared memory architecture as you are able to use system ram for compute without a performance hit. While it is possible to run these models from system ram using the CPU on x86/64 machines, there is a performance hit compared to the previous options and results are most likely going to be slow - of course there are caveats here, as with anything you will find cases where highly tuned setups can perform well, in this case we are just considering ease of use and time to get started.

Execution

There are quite a few ways to run models locally; in this case I am using Ollama as it just works and is fairly batteries-included for most use cases. Ollama provides installers for OSX, Linux, and Windows. Downloading and running a local model is as easy as executing the command ollama run with a model from the registry; the required files will automatically download, and you will land in an interactive 'chat' shell:

% ollama run phi3
pulling manifest
pulling b26e6713dc74... 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 2.4 GB
pulling fa8235e5b48f... 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 1.1 KB
pulling 542b217f179c... 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  148 B
pulling 8dde1baf1db0... 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   78 B
pulling f91db7a2deb9... 100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> write me a sentence for my blog post about using a local running LLM to assist reversing binary applications.
 "Utilizing a local language model (LLM) tailored for our environment empowers developers to efficiently reverse binary applications, enhancing the debugging and
understanding of complex software systems."

>>> /bye

With the boring up-front details out of the way, we can try and do something useful.
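Before doing so, it is worth noting that Ollama also exposes a local REST API (on port 11434 by default), which is what we will ultimately want for scripted use. A minimal standard-library sketch; the endpoint and field names follow Ollama's /api/generate API, and the model name is assumed to be one you have already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    # stream=False returns a single JSON object instead of a token stream;
    # format="json" asks Ollama to constrain the model's output to valid JSON.
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}

def generate(model: str, prompt: str) -> str:
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama daemon and a pulled model):
# print(generate("mistral:7b-instruct", "Say hello."))
```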

Hammer Looking for Nails

As with any tool, it works best if you have a specific use case you are looking to apply it to, in this case we will be looking at using a model to help analyze Hex-Rays pseudocode. As pseudocode is close to C, most models that have been trained to include code should be able to aid with this task, in this case I have chosen to work with Mistral's 7b model. While there are larger models available that I could run on my hardware, this model has been found to perform quite well for its size and I really wanted to target something that would be relatively performant on lower end hardware.

Prompting Your Way Out

The first step in this process is to understand how well the model will execute a given task and identify the instructions needed to get a useful response. The following prompt was executed:

Analyze the following code and provide a description of what it does. Based on analysis provide updated variable names to replace the current ones.
unsigned int __fastcall sub_C385C(int a1)
{
  int v1;
  unsigned int v2;

  v1 = byte_11A2DC[a1];
  if ( sub_CE2BC(0) )
    v2 = v1 | 4;
  return v1 & 0xFFFFFFFD;
}

The prompt produced the following output:

The given code defines a function named `sub_C385C` that takes an integer argument `a1`. Inside the function, an integer variable `v1` is declared and initialized with the value of the memory location pointed to by `byte_11A2DC` with index `a1`.

Next, a check is made whether the result of calling `sub_CE2BC(0)` is true. If it is, then the value of `v1` is OR-ed with the binary number 4. The resulting value is assigned to a new variable `v2`, which remains undeclared in the given code snippet.

Finally, the original variable `v1` has its bits andered using the bitwise AND operator with the binary number -5 (0xFFFFFFFD) and this result is returned from the function.

To improve the readability of the code, you could consider renaming variables as follows:

```c
unsigned int __fastcall get_value_from_array(int arrayIndex)
{
  unsigned int value;

  value = memoryLocation[arrayIndex]; // Assuming memoryLocation is a properly defined pointer to the memory location pointed to by byte_11A2DC

  if ( checkCondition() )
    value |= 4; // OR with constant 4

  return value & 0xFFFFFFFD; // AND with binary number -5
}
```

Note: Replace `memoryLocation` and `checkCondition` with actual variable names.

This response looks promising, but trying to programmatically use it would be difficult. We can do better by adjusting our prompt to include instructions to return the values in JSON:

Analyze the following code and provide a description of what it does. Based on analysis provide a new function name, new variable names, and a comment describing what the code does. Only respond with valid JSON using the keys 'function_name', 'comment', and an array 'variables'. Values should use plain ascii with no special characters.

unsigned int __fastcall sub_C385C(int a1)
{
  int v1;
  unsigned int v2;

  v1 = byte_11A2DC[a1];
  if ( sub_CE2BC(0) )
    v2 = v1 | 4;
  return v1 & 0xFFFFFFFD;
}

The following output was generated:

{
  "function_name": "process_byte_value",
  "comment": "This function takes an integer input, retrieves the corresponding byte value from a global array, performs a condition check and returns the processed byte value.",
  "variables": [
    {
      "name": "input",
      "type": "int"
    },
    {
      "name": "byte_value",
      "type": "int"
    },
    {
      "name": "additional_flag",
      "type": "unsigned int"
    },
    {
      "name": "global_array",
      "type": "unsigned int[]",
      "description": "Global array of byte values."
    }
  ]
}

// The function named 'process_byte_value' takes an input integer, retrieves the corresponding byte value from a global array, performs a condition check and returns the processed byte value.

unsigned int process_byte_value(int input) {
  int byte_value;
  unsigned int additional_flag = 0;

  if (sub_CE2BC() == true) { // Perform some global flag check here
    additional_flag = 4;
  }

  byte_value = global_array[input]; // Retrieve the corresponding byte value from the global array.

  return (byte_value & 0xFFFFFFF8) | additional_flag; // Perform bitwise operations on the byte_value and additional_flag before returning

Running the same input another time, we get a slight variation on the output

{
  "function_name": "processInputByte",
  "comment": "This function processes an input byte by checking a specific condition and modifying the byte accordingly.",
  "variables": [
    {"name": "inputByte", "type": "int"},
    {"name": "byteMask", "type": "unsigned int"},
    {"name": "conditionMet", "type": "bool"},
    {"name": "processedByte", "type": "unsigned int"}
  ]
}

// This function processes an input byte by checking a specific condition and modifying the byte accordingly.
unsigned int processInputByte(int inputByte) {
  unsigned int processedByte = BYTE_11A2DC[inputByte]; // Load byte from memory location
  bool conditionMet = sub_CE2BC(0); // Check specific condition

  if (conditionMet) {
    processedByte |= 4; // Set specific bit in the byte if condition is met
  }

  return processedByte & 0xFFFFFFFD; // Return the processed byte with certain bits cleared.

Neither of these is ideal, as we only want the JSON structure so we can parse it programmatically. By adjusting the prompt, we can try to guide the output towards what we are expecting:

Analyze the following code and provide a description of what it does. Based on analysis provide a new function name, new variable names, and a comment describing what the code does. Only respond with valid JSON using the keys 'function_name','comment', and an array 'variables'. Values should use plain ascii with no special characters. Do not respond outside the specified JSON structure.

unsigned int __fastcall sub_C385C(int a1)
{
  int v1;
  unsigned int v2;

  v1 = byte_11A2DC[a1];
  if ( sub_CE2BC(0) )
    v2 = v1 | 4;
  return v1 & 0xFFFFFFFD;
}

The resulting output of this prompt:

{
  "function_name": "processInputByte",
  "comment": "This function takes an input integer and processes a byte from a lookup table based on the input. If a certain condition is met, a bit is set in the output.",
  "variables": [
    {
      "name": "input",
      "description": "The input integer"
    },
    {
      "name": "byteValue",
      "description": "The value of the byte from the lookup table corresponding to the input"
    },
    {
      "name": "outputBitSet",
      "description": "An output variable that has a bit set if the condition is met"
    },
    {
      "name": "lookupTable",
      "description": "A global lookup table used to retrieve the byte value based on the input"
    }
  ]
}

This output is much better: it follows our instructions and produces only a valid JSON structure. However, if we run the same input again, the output varies and returns a slightly different structure, along with details outside the JSON:

{
  "function_name": "process_byte_value",
  "comment": "Processes a byte value from the memory location specified by the input parameter and returns the result.",
  "variables": [
    {
      "name": "input_index",
      "type": "int",
      "description": "The index of the byte value to be processed."
    },
    {
      "name": "processed_value",
      "type": "unsigned int",
      "description": "The processed value of the input byte."
    },
    {
      "name": "is_special_condition_met",
      "type": "int",
      "description": "Flag indicating whether a special condition is met or not."
    }
  ]
}

unsigned int process_byte_value(int input_index)
{
  unsigned int processed_value;
  int is_special_condition_met = sub_CE2BC(0);

  processed_value = byte_11A2DC[input_index];

  if (is_special_condition_met)
    processed_value |= 4;

  return processed_value & 0xFFFFFFFD;
}

We could continue down this path, changing the instructions to try to describe what our expected output should be for the given input, but eventually the response will end up deviating in some way, breaking the ability to programmatically interface with it. Instead of trying to describe what we want, we can provide an example of what we want it to generate (single/multi-shot prompting). An important thing to remember when dealing with an LLM is that it is predicting the next token/word to output based on the current input; by providing an example exchange, you are creating a pattern for it to follow. The following revised prompt shows the original prompt modified to provide a single-shot example:

Analyze the following code and provide a description of what it does. Based on analysis provide a new function name, new variable names, and a comment describing what the code does.
Only respond with valid JSON using the keys 'function_name','comment', and an array 'variables'. Values should use plain ascii with no special characters. Do not respond outside the specified JSON structure.
For example, the user will submit code for analysis:

unsigned int __fastcall sub_C1337(int a1,int a2)
{
  int v1;
  unsigned int v2;

  v1 = byte_fff[a1 + a2];
  if ( v1 )
    v2 = v1 + 0xFF
  return v1 & 0xFFFFFFFD * v2;
}

Your response would be:
{
"function_name": "new_function_name",
"comment": "This function uses the arguments a1 and a2 as index into byte_fff, returning a bitwise value based on the result of the called function subCE2BC.",
"variables":[{"original_name":"v1","new_name":"AnalyzedVarName"},"original_name":"v2","new_name":"AnalyzedVarName"}]
}

Analyze the following code:
unsigned int __fastcall sub_C385C(int a1)
{
  int v1;
  unsigned int v2;

  v1 = byte_11A2DC[a1];
  if ( sub_CE2BC(0) )
    v2 = v1 | 4;
  return v1 & 0xFFFFFFFD;
}

In this prompt we have provided the instructions as we had previously as well as an example showing what a user would submit and the expected response, followed by the actual code we want analyzed. In this case, the responses match our desired output and are consistent across multiple executions.
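Since the instructions and example stay fixed and only the target function changes, assembling the prompt is easy to script. A sketch with my own (hypothetical) helper names and abbreviated example strings:

```python
INSTRUCTIONS = (
    "Analyze the following code and provide a description of what it does. "
    "Based on analysis provide a new function name, new variable names, and a "
    "comment describing what the code does. Only respond with valid JSON using "
    "the keys 'function_name','comment', and an array 'variables'. Values "
    "should use plain ascii with no special characters. Do not respond outside "
    "the specified JSON structure."
)

# Abbreviated stand-ins for the worked example shown above.
EXAMPLE_CODE = "unsigned int __fastcall sub_C1337(int a1,int a2) { /* ... */ }"
EXAMPLE_RESPONSE = (
    '{"function_name":"new_function_name","comment":"...","variables":'
    '[{"original_name":"v1","new_name":"AnalyzedVarName"},'
    '{"original_name":"v2","new_name":"AnalyzedVarName"}]}'
)

def build_prompt(target_code: str) -> str:
    # Instructions, one worked example (single-shot), then the real input.
    return (
        f"{INSTRUCTIONS}\n"
        f"For example, the user will submit code for analysis:\n\n{EXAMPLE_CODE}\n\n"
        f"Your response would be:\n{EXAMPLE_RESPONSE}\n\n"
        f"Analyze the following code:\n{target_code}"
    )

print(build_prompt("unsigned int __fastcall sub_C385C(int a1) { /* ... */ }"))
```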

Initial generated response:

{
  "function_name": "checkAndModifyByteValue",
  "comment": "This function uses the argument a1 as an index into byte_11A2DC. If the sub_CE2BC function returns true, the value at that index is modified by setting the 4th bit to 1. The function then returns the original value with the 3 highest bits cleared.",
  "variables": [
    {
      "original_name": "v1",
      "new_name": "ModifiedByteValue"
    },
    {
      "original_name": "v2",
      "new_name": "ModifiedByteWithBitSet"
    }
  ]
}

Followup executions:

{
  "function_name": "analyze_byte_from_array_and_apply_condition",
  "comment": "This function uses the argument a1 as index into byte_11A2DC, applies a condition using external function sub_CE2BC, and returns a bitwise value based on the result.",
  "variables": [
    {
      "original_name": "v1",
      "new_name": "AnalyzedByte"
    },
    {
      "original_name": "v2",
      "new_name": "ConditionalValue"
    }
  ]
}

{
  "function_name": "newFunctionName",
  "comment": "This function takes an integer argument a1, retrieves a value from the memory location byte\\_11A2DC with the index a1, and if the sub\\_CE2BC function returns true, sets a new variable value to the original value orpped with 4. The final return is a bitwise AND of the original value with 0xFFFFFFFD.",
  "variables": [
    {
      "original_name": "v1",
      "new_name": "ResultVar"
    },
    {
      "original_name": "v2",
      "new_name": "ModifiedVar"
    }
  ]
}

{
  "function_name": "checkBitAndModifyByte",
  "comment": "This function uses the argument a1 as an index into byte_11A2DC, checks if sub_CE2BC returns true and modifies the value if it does by setting the 4th bit. It then returns the original value with some bits cleared.",
  "variables": [
    {
      "original_name": "v1",
      "new_name": "ByteValue"
    },
    {
      "original_name": "v2",
      "new_name": "ModifiedByte"
    }
  ]
}

The results are decent and do not stray from the format/content we provided in our prompt; we could even include more examples with varying content (multi-shot) if we wanted to. At this point we have a basic prompt that does the specific task we want reasonably well, and the response is parsable (JSON) for automated use.
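As a sketch of that multi-shot idea, a small helper can assemble the prompt from a list of example (code, response) pairs. The instruction text and example content below are placeholders of my own, not the exact prompt used above:

```python
def build_prompt(instruction, examples, code):
    """Assemble a few-shot prompt: instruction, worked examples, then the target code."""
    parts = [instruction]
    for sample_code, sample_response in examples:
        parts.append("For example, the user will submit code for analysis:\n\n" + sample_code)
        parts.append("Your response would be:\n" + sample_response)
    parts.append("Analyze the following code:\n" + code)
    return "\n\n".join(parts)

# placeholder example pair; a real prompt would use actual Hex-Rays output
examples = [(
    "int __fastcall sub_1234(int a1) { return a1 + 1; }",
    '{"function_name": "increment_value", "comment": "Adds one to a1.", "variables": []}',
)]
prompt = build_prompt(
    "Analyze the following code and respond with valid JSON.",
    examples,
    "int sub_5678(void) { return 0; }",
)
```

Each additional pair appended to `examples` becomes another shot in the prompt, which keeps the examples and the final query in one place instead of a hand-edited string.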

Light Customization

If you have a specific use case (agent/assistant/task), you can configure a version of the underlying pre-trained weights through Ollama's Modelfile interface. A Modelfile provides a lightweight layer to control and configure precomputed weights, and it can be easily edited and shared with other users. The following shows an example Modelfile configured for our potential Hex-Rays assistant using the prompt we created:

# defines the base pre-computed weights we want to use
FROM mistral:7b-instruct

# template is the format of the interactions with the model
# this is using templating provided by ollama where .System
# and .Prompt  are replaced with the defined variables 
TEMPLATE "{{ .System }}
[INST]
{{ .Prompt }}
[/INST]
"

# SYSTEM is the prompt/text that the model is started with, there are some special values included within this prompt
# that are described below, for now this is where the prompt we developed earlier goes
SYSTEM """<s>[INST]Analyze the following code and provide a description of what it does. Based on analysis provide a new function name, new variable names, and a comment describing what the code does.
Only respond with valid JSON using the keys 'function_name','comment', and an array 'variables'. Values should use plain ascii with no special characters. Do not respond outside the specified JSON structure.
For example, the user will submit code for analysis:

unsigned int __fastcall sub_C1337(int a1,int a2)
{
  int v1;
  unsigned int v2;

  v1 = byte_fff[a1 + a2];
  if ( v1 )
    v2 = v1 + 0xFF;
  return v1 & 0xFFFFFFFD * v2;
}

Your response would be:
{
"function_name": "new_function_name",
"comment": "This function uses the arguments a1 and a2 as index into byte_fff, returning a bitwise value based on the result of the called function subCE2BC.",
"variables":[{"original_name":"v1","new_name":"AnalyzedVarName"},"original_name":"v2","new_name":"AnalyzedVarName"}]
}

Analyze the following code:[/INST]
</s>
"""
PARAMETER stop [INST]
PARAMETER stop [/INST]
# these control internal settings within the model to adjust how it behaves
PARAMETER temperature 1.2
PARAMETER top_k 100
PARAMETER top_p 0.09
PARAMETER num_ctx 4096
PARAMETER repeat_last_n 512
PARAMETER repeat_penalty 1.1
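If you prefer not to bake these settings into a Modelfile, Ollama's generate endpoint also accepts the same values per request through an options object. A minimal request body sketch, mirroring the PARAMETER values above:

```json
{
  "model": "mistral:7b-instruct",
  "prompt": "...",
  "stream": false,
  "options": {
    "temperature": 1.2,
    "top_k": 100,
    "top_p": 0.09,
    "num_ctx": 4096,
    "repeat_last_n": 512,
    "repeat_penalty": 1.1
  }
}
```

This is handy for experimenting with sampling settings without rebuilding the model configuration each time.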

To sidetrack for a second: each model has its own prompt format that must be used, along with specific tokens that mark instructions and the start/stop of a sequence. These values can be found in the tokenizer configuration file (tokenizer_config.json). For instance, Mistral 7b-Instruct (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/blob/main/tokenizer_config.json) defines the special values and format we used in our Modelfile:

{
  ...
  ...
  "bos_token": "<s>",
  "chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  ...
  ...
}

Not all models use the same chat_template structure or beginning-of-string (bos_token) or end-of-string (eos_token) values, so it is worth understanding where those formats and tokens come from.
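To make that template concrete, the following is a plain-Python rendering of the Mistral chat_template logic shown above; this is a sketch for illustration, as real tooling applies the jinja template via the tokenizer:

```python
BOS, EOS = "<s>", "</s>"

def mistral_chat_format(messages):
    """Mimic the Mistral instruct chat_template: alternating user/assistant turns,
    user content wrapped in [INST]...[/INST], assistant content followed by EOS."""
    out = BOS
    for i, msg in enumerate(messages):
        if (msg["role"] == "user") != (i % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/...")
        if msg["role"] == "user":
            out += f"[INST] {msg['content']} [/INST]"
        elif msg["role"] == "assistant":
            out += msg["content"] + EOS + " "
        else:
            raise ValueError("Only user and assistant roles are supported")
    return out

print(mistral_chat_format([{"role": "user", "content": "hi"}]))
# <s>[INST] hi [/INST]
```

Running the equivalent rendering for another model's tokenizer_config.json quickly shows how different the expected formats can be.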

After saving this file to disk (idapal_simple.modelfile), it can be loaded using Ollama:

% ollama create iDAPalSimple -f idapal_simple.modelfile
transferring model data
using existing layer sha256:e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730
using existing layer sha256:43070e2d4e532684de521b885f385d0841030efa2b1a20bafb76133a5e1379c1
using existing layer sha256:2af02daf0820d322d7e928bec1d7398090f4679c49c24c67575b5cffa7f08668
creating new layer sha256:cd2ecea3b680dc83a3d3981ec5efcca96c74f7b5794e7008d8f94522bf024f0d
using existing layer sha256:b4f4d07d73c548c4cd652806db6d567ceff891910db2bf5d5410dcc7e76eeb5a
creating new layer sha256:96a6cc4cffa5e48e458c33efa5301f519279a08eb51ae4a9b33e9ac7abce76d9
writing manifest
success

After creation, the new configuration can be tested by entering the interactive mode via the run command and pasting in the previous code:

% ollama run iDAPalSimple
>>> unsigned int __fastcall sub_C385C(int a1)
... {
...   int v1;
...   unsigned int v2;
...
...   v1 = byte_11A2DC[a1];
...   if ( sub_CE2BC(0) )
...     v2 = v1 | 4;
...   return v1 & 0xFFFFFFFD;
... }
{
"function_name": "new_function_name_C385C",
"comment": "This function uses the argument a1 as an index into byte\_11A2DC. It calls sub\_CE2BC with an argument of 0 and sets v2 to the result of
bitwise OR operation between v1 and 4 if sub\_CE2BC returns true. The final return value is obtained by performing a bitwise AND operation between
v1 and 0xFFFFFFFD.",
"variables":[{"original_name":"v1","new_name":"AnalyzedVarName_C385C"},{"original_name":"v2","new_name":"AnalyzedVarName_C385C_v2"}]
}

The observed response shows that the model works and is using the configured system prompt, returning the expected result after being provided only a code block as input. Ollama also provides an API that can be accessed locally (https://github.com/ollama/ollama/blob/main/docs/api.md); this can be used as seen in the following simple Python client:

import requests, json

def do_analysis(code):
    url = "http://localhost:11434/api/generate"
    headers = {"Content-Type": "application/json"}
    # inform the API we are using our configured model
    payload = {"model": "iDAPalSimple", "prompt": code, "stream": False, "format": "json"}
    res = requests.post(url, headers=headers, json=payload)
    try:
        # the generated text itself is a JSON document, so decode it as well
        return json.loads(res.json()['response'])
    except (KeyError, json.JSONDecodeError):
        print('error unpacking response')
        print(res.json().get('response'))


input_code = '''unsigned int __fastcall sub_C385C(int a1)
{
  int v1;
  unsigned int v2;

  v1 = byte_11A2DC[a1];
  if ( sub_CE2BC(0) )
    v2 = v1 | 4;
  return v1 & 0xFFFFFFFD;
}'''

result = do_analysis(input_code)
print(result)

% python simple_analysis.py
{'function_name': 'new_function_name', 'comment': 'This function uses the argument a1 as an index into byte_11A2DC. It calls sub_CE2BC with an argument of 0 and sets v2 to the result of bitwise OR operation between v1 and 4 if sub_CE2BC returns true. The final return value is obtained by performing a bitwise AND operation between v1 and 0xFFFFFFFD.', 'variables': [{'original_name': 'v1', 'new_name': 'AnalyzedVarName1'}, {'original_name': 'v2', 'new_name': 'AnalyzedVarName2'}]}

At this point, the current configuration and simple Python client could be integrated into an IDA Plugin that would work ok, but we can do better.

Fine-Tuning - step one: draw two circles

The initial training that creates the released model weights is a computationally expensive process, while follow-on fine-tuning is much less expensive to conduct. Fine-tuning provides a path to give a pre-trained model a "personality" by introducing new data and/or example interactions that represent "ideal" behavior when interacting with a user. The process is iterative and can be repeated until the model matches the expected behavior.

While our small local model is never going to compete with a large, hosted service in general, fine-tuning can boost its performance so that it competes on specific tasks or knowledge domains. To fine-tune a model, you need to complete the following steps:

  • Identify a target knowledge domain
  • Construct a dataset for your target domain
  • Train against your dataset
  • Evaluate trained model

For this task, the knowledge domain is already known: we want to fine-tune a model to aid with the analysis of Hex-Rays pseudocode. The next step, constructing a dataset, is the difficult part. At a high level, the dataset will be made up of "instruction-following" examples; the following shows what these look like:

{
  "instruction":"Assist the user with a helpful process for drawing an animal.",
  "input":"How do I draw an Owl?",
  "output":"Drawing an Owl is simple, first draw some circles, then you draw the rest of the Owl."
},
{
  "instruction":"Assist the user with an idea for an animal to draw and provide them instructions.",
  "input":"",
  "output":"There are many animals you could choose to draw, my favorite is the Owl. An Owl is a fairly basic animal to draw, first draw some circles, then you draw the rest of the Owl"
}

These examples show two types of instruction-following dataset entries. The first specifies the instruction to be followed, a matching input (user input), and the ideal output (generated result). The second provides only an instruction and an example output; this form is useful for purely generative responses that require no user input. That type is not especially useful for our current task but is included as an example. A dataset entry useful for our task would look like the following:

{
  "instruction":"Analyze the following IDA Hex Rays pseudocode and generate a valid JSON object containing the keys 'function_name','comment', and an array 'variables' explaining what the code does, suggest a function name based on the analysis of the code, and new variable names based on the analysis of the code.",
  "input":"unsigned int __fastcall sub_C385C(int a1)\n {\n int v1;\n unsigned int v2;\n\n v1 = byte_11A2DC[a1];\n if ( sub_CE2BC(0) )\n v2 = v1 | 4;\n return v1 & 0xFFFFFFFD;\n }",
  "output":"{'function_name': 'new_function_name', 'comment': 'This function uses the argument a1 as an index into byte_11A2DC. It calls sub_CE2BC with an argument of 0 and sets v2 to the result of bitwise OR operation between v1 and 4 if sub_CE2BC returns true. The final return value is obtained by performing a bitwise AND operation between v1 and 0xFFFFFFFD.', 'variables': [{'original_name': 'v1', 'new_name': 'AnalyzedVarName1'}, {'original_name': 'v2', 'new_name': 'AnalyzedVarName2'}]}"
}

As a side note, following this exact JSON formatting will allow us to use the datasets library from Hugging Face, and it is a common format.
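As a sketch, records in this shape can be validated and written out with only the standard library before handing the file to the datasets loader; the helper names here are mine, not part of any project:

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_entry(entry):
    """An instruction-following record must be a dict with exactly these keys."""
    return isinstance(entry, dict) and set(entry) == REQUIRED_KEYS

def write_dataset(entries, path):
    """Write validated records as a JSON array; the Hugging Face json loader
    accepts both JSON arrays and JSON Lines."""
    bad = sum(1 for e in entries if not validate_entry(e))
    if bad:
        raise ValueError(f"{bad} malformed entries")
    with open(path, "w") as fh:
        json.dump(list(entries), fh, indent=2)
```

Catching malformed records here is much cheaper than discovering them mid-training.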

With the exact format needed for training identified, the next problem is that we need thousands of these examples, ideally with high-quality responses. I had considered manually creating the required dataset using tree-sitter to rewrite valid code with generic variable names while sourcing the function descriptions from documentation; this sounded painful, and I wanted the machine to do the hard work for me. Looking at earlier work done by Stanford for the Alpaca project (https://crfm.stanford.edu/2023/03/13/alpaca.html), I decided to try the same style of approach. The basic idea of this workflow is to use an LLM to build your dataset based on a smaller, or in this case incomplete, dataset and train against that:

After some noodling around I came up with the following high-level process:

  • compile libc with full debug/symbol information
  • load the compiled libraries into IDA and export all functions Hex-Rays output into individual files by address
  • strip the compiled libraries and repeat the previous step, exporting all functions Hex-Rays output into a new set of files

This process creates two directories with matching files:

/symbol/0x2d7f4.c
/stripp/0x2d7f4.c

In this case the file /symbol/0x2d7f4.c contains:

void __fastcall setname(int category, const char *name)
{
  char *v3; // r0

  v3 = (char *)nl_global_locale.__names[category];
  if ( v3 != name )
  {
    if ( v3 != "C" )
      j___GI___libc_free(v3);
    nl_global_locale.__names[category] = name;
  }
}

And the file /stripp/0x2d7f4.c contains:

char *__fastcall sub_2D7F4(int a1, char **a2)
{
  char *result; // r0

  result = (char *)off_170C10[a1 + 16];
  if ( result != (char *)a2 )
  {
    if ( result != "C" )
      result = (char *)j_free();
    off_170C10[a1 + 16] = a2;
  }
  return result;
}

With the two sets of data, the next stage of processing is to generate the dataset records. At a high-level this process looks like the following:

  • using the previously created mistral-7b configuration, query using the symbol/debug Hex-Rays output to get a reasonable quality output
  • create a dataset entry by combining the matching STRIPPED Hex-Rays output with the generated output from the symbol/debug Hex-Rays
  • iterate over all the files until complete

After completing this step, we have a large instruction-following dataset we can use for fine-tuning.
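The pairing stage can be sketched as follows, assuming the matching-filename layout of the /symbol and /stripp exports shown above. Here `describe` stands in for the mistral-7b query; in the real pipeline it would call the Ollama API with the symbolized pseudocode:

```python
import os

def build_records(symbol_dir, stripped_dir, describe, instruction):
    """Pair matching files from the two export directories into dataset records:
    the stripped pseudocode becomes the input, and the description generated from
    the symbolized pseudocode becomes the ideal output."""
    records = []
    for name in sorted(os.listdir(symbol_dir)):
        stripped_path = os.path.join(stripped_dir, name)
        if not os.path.exists(stripped_path):
            continue  # only keep functions present in both exports
        with open(os.path.join(symbol_dir, name)) as fh:
            symbol_code = fh.read()
        with open(stripped_path) as fh:
            stripped_code = fh.read()
        records.append({
            "instruction": instruction,
            "input": stripped_code,
            "output": describe(symbol_code),
        })
    return records
```

Because `describe` is just a callable, the same loop works whether the descriptions come from a local model, a hosted API, or a cached run.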

Heavy Customization

There are quite a few options when it comes to fine-tuning an LLM; at the time of this research project I chose to use unsloth. Several other popular projects exist as well and are most likely more batteries-included.

I went with unsloth for a few reasons. The main one is that its underlying code has been tuned to provide a large performance increase (in speed and memory usage); it also seemed less likely to abstract or hide parts of the training process that may be useful to see or understand. The unsloth project also provides a Jupyter notebook that can be executed on the Google Colab free tier if you do not have hardware (it works perfectly!); I ended up conducting training on a local Linux host with an NVIDIA 3090. To give an idea of performance, the free Colab tier took 21 minutes, while my 3090 executed the same training in 7 minutes. Refer to the unsloth repository for install instructions; at the time of this project, installation using conda looked like the following:

conda create --name unsloth_env python=3.10
conda activate unsloth_env
conda install cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -c xformers -c conda-forge -y
pip install "unsloth[conda] @ git+https://github.com/unslothai/unsloth.git"

The script used for training was adapted from the examples provided by unsloth; it uses Hugging Face's Supervised Fine-tuning Trainer (SFT) from the Transformer Reinforcement Learning (TRL) library:

from unsloth import FastLanguageModel
import torch,sys

model = sys.argv[1]
steps = int(sys.argv[2])
training_data = sys.argv[3]

max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    #model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    model_name = model,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128 - r/rank is how strong you want your training to apply
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # alpha is a multiplier against r/rank 
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

#load and convert the dataset into the prompt format
from datasets import load_dataset
dataset = load_dataset("json", data_files=training_data, split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)


from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = steps,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        save_strategy= "steps",
        save_steps=50
    ),
)

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# execute the actual training
trainer_stats = trainer.train()

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

model.save_pretrained(f"lora_model_{steps}") # Local saving

# Just LoRA adapters
if True: model.save_pretrained_merged(f"model_{steps}", tokenizer, save_method = "lora",)

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf(f"model_{steps}", tokenizer, quantization_method = "q4_k_m")

The script also defines the following items:

output_dir = "outputs",
        save_strategy= "steps",
        save_steps=50

This configuration will save a copy of the fine-tuned weights every 50 steps to the outputs directory, which is helpful for a few reasons. First, if an error occurs at some point (crash, power loss, etc.), you have checkpoints you can restart training from. Second, it lets you evaluate how well your training is working by comparing each saved checkpoint. While it may seem at first that more steps are better, the right number depends on how large your dataset is and which settings you have configured; more is not always better.

Running this script to fine-tune mistral-7b-instruct for 100 steps using the dataset we created produces output like the following:

$ python training/train.py unsloth/mistral-7b-instruct-v0.2-bnb-4bit 100 ./dataset.json
==((====))==  Unsloth: Fast Mistral patching release 2024.2
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.691 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.0. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.24. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
/mnt/new/unsloth/lib/python3.10/site-packages/transformers/quantizers/auto.py:155: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
  warnings.warn(warning_msg)
Unsloth 2024.2 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
GPU = NVIDIA GeForce RTX 3090. Max memory = 23.691 GB.
4.676 GB of memory reserved.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,897 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 500
 "-____-"     Number of trainable parameters = 83,886,080
{'loss': 1.4802, 'grad_norm': 1.6030948162078857, 'learning_rate': 4e-05, 'epoch': 0.01}
{'loss': 1.4201, 'grad_norm': 1.4948327541351318, 'learning_rate': 8e-05, 'epoch': 0.01}
{'loss': 1.5114, 'grad_norm': 1.6689960956573486, 'learning_rate': 0.00012, 'epoch': 0.02}
{'loss': 1.1665, 'grad_norm': 0.9258238673210144, 'learning_rate': 0.00016, 'epoch': 0.02}
{'loss': 0.9282, 'grad_norm': 0.6133134961128235, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 0.9292, 'grad_norm': 0.6610234975814819, 'learning_rate': 0.0001995959595959596, 'epoch': 0.03}
{'loss': 0.7517, 'grad_norm': 0.4809339940547943, 'learning_rate': 0.0001991919191919192, 'epoch': 0.04}
{'loss': 0.7554, 'grad_norm': 0.6171303987503052, 'learning_rate': 0.00019878787878787878, 'epoch': 0.04}
{'loss': 0.606, 'grad_norm': 0.564286470413208, 'learning_rate': 0.00019838383838383837, 'epoch': 0.05}
{'loss': 0.6274, 'grad_norm': 0.414183109998703, 'learning_rate': 0.000197979797979798, 'epoch': 0.06}
{'loss': 0.6402, 'grad_norm': 0.3489008843898773, 'learning_rate': 0.0001975757575757576, 'epoch': 0.06}
{'loss': 0.596, 'grad_norm': 0.28150686621665955, 'learning_rate': 0.0001971717171717172, 'epoch': 0.07}
{'loss': 0.5056, 'grad_norm': 0.3132913410663605, 'learning_rate': 0.00019676767676767677, 'epoch': 0.07}
{'loss': 0.5384, 'grad_norm': 0.27469128370285034, 'learning_rate': 0.00019636363636363636, 'epoch': 0.08}
{'loss': 0.5744, 'grad_norm': 0.360963374376297, 'learning_rate': 0.00019595959595959596, 'epoch': 0.08}
{'loss': 0.5907, 'grad_norm': 0.3328467011451721, 'learning_rate': 0.00019555555555555556, 'epoch': 0.09}
{'loss': 0.5067, 'grad_norm': 0.2794954478740692, 'learning_rate': 0.00019515151515151516, 'epoch': 0.09}
{'loss': 0.5563, 'grad_norm': 0.2907596528530121, 'learning_rate': 0.00019474747474747476, 'epoch': 0.1}
{'loss': 0.5533, 'grad_norm': 0.34755516052246094, 'learning_rate': 0.00019434343434343435, 'epoch': 0.1}

After training is complete, I used a small script to evaluate how each checkpoint performs. To do this, I take the first 10 entries from the training dataset and use the instruction and input values to generate new outputs, and I also generate an output for an input that was not in the original dataset:

from unsloth import FastLanguageModel
import torch,sys

model_name_input = sys.argv[1]

max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    #model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    model_name = model_name_input,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

#load and convert the dataset into the prompt format
from datasets import load_dataset
dataset = load_dataset("json", data_files="data.json", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

FastLanguageModel.for_inference(model)
# do x evals of items from the dataset before training
samples = []
sample_size = 10
for x in range(0,sample_size):
    instruction = dataset[x]["instruction"]
    input       = dataset[x]["input"]
    output      = ''
    text = alpaca_prompt.format(instruction, input, output) #+ EOS_TOKEN
    sample = tokenizer([text],return_tensors = "pt").to("cuda")
    out = model.generate(**sample,max_new_tokens=4096,use_cache=True)
    out = tokenizer.batch_decode(out)
    samples.append(out[0])

# new one not in your dataset goes here
code = '''int __fastcall sub_75C80(int a1, int a2)
{
  int result; // r0
  _DWORD *i; // r3

  result = a2 - *(_DWORD *)(a1 + 12);
  for ( i = *(_DWORD **)(a1 + 48); i; i = (_DWORD *)*i )
  {
    if ( i[2] < result )
      result = i[2];
  }
  return result;
}'''

text = alpaca_prompt.format(instruction, code, output)
sample = tokenizer([text],return_tensors = "pt").to("cuda")
out = model.generate(**sample,max_new_tokens=4096,use_cache=True)
out = tokenizer.batch_decode(out)
samples.append(out[0])

print('Capturing pre training generation samples')
with open(f'results/eval_log_{model_name_input.replace("/","_")}','w') as log:
    for r in samples:
        log.write(r)

For running the script, it seemed easiest to just iterate over the checkpoints in outputs using bash:

for m in $(ls outputs); do python eval.py outputs/$m; done

Results?

So, with training out of the way, the question is: does it work? Initial testing was performed against the following input:

### Instruction:
Analyze the following IDA Hex Rays pseudocode and generate a valid JSON object containing the keys 'function_name','comment', and an array 'variables' explaining what the code does, suggest a function name based on the analysis of the code, and new variable names based on the analysis of the code.

### Input:
int __fastcall sub_B0D04(int a1, int a2)
{
  unsigned int v2; // r4
  int result; // r0

  v2 = a1 + a2;
  if ( __CFADD__(a1, a2) )
    return 0;
  result = _libc_alloca_cutoff();
  if ( v2 <= 0x1000 )
    return result | 1;
  return result;
}

As expected, the base model did not follow the requested format very well, and the function comment is low quality. At 50 training steps, the model 'understands' the expected output and matches it perfectly; the somewhat surprising result is that the function comment is better at 50 steps than at 100 steps.

Zooming out a bit and comparing further steps, the format is perfect, while the most common errors are confusion about what gets returned (a value vs. allocated memory) and inconsistent numeric formatting (1000 vs 0x1000):

The real check is, how does this compare to the big models...

It is interesting to see that GPT3.5 is no better than our results and in fact performs worse than our 50-step results, falling into the same error as the 100-step result.

Comparing against GPT3.5 feels slightly unfair as it is quite old; what about GPT4?

Well… that result definitely makes this whole exercise feel painful and pointless. The quality of the comment is much higher, and it also captured more variable renames. So, the end result is: just use GPT4, using a small local model is pointless.

Admitting Defeat and Using GPT4

So now that we have tried our best with our small model, we can move on and just use GPT4, though not in the way you would expect. Going back to the Alpaca project, they call out using an existing strong language model to automatically generate instruction data, whereas so far we have used our small 7b-parameter model to generate instruction data. This is where we step back slightly and redo some of our previous work, replacing our 'low quality' generated data with 'high quality' values from the current leading model.

Using the OpenAI playground, it is fairly simple to set up an 'assistant' with our instructions:

With the configuration working as expected, it is straightforward to use the API and execute the same instruction generation we had previously done:

I originally had no expectations regarding the cost of this process; to be safe, I added $50 to my account before executing the previous step and was surprised when it only cost ~$16 at the time:

Seeing that the initial run only cost $16 and the quality of the responses was good, I figured: why not use both sets of data and get twice the high-quality instruction data?

With the brand-new high-quality dataset complete, we can go back and start a new fine-tune of our mistral-7b model; in this case it was trained for 200 steps, taking snapshots every 50 steps. After training completed, an evaluation was done against a new input that appears in neither dataset, comparing our old 'low-quality' fine-tune with the new one.

At 50 steps, the new GPT4-trained version already performs much better at capturing variables to rename; interestingly, the LLM-trained model's description contains more direct references to the code, while the GPT4 description is slightly higher level:

At 100 steps, the variable names from the GPT4-trained model are slightly better and the description is slightly more technical, referring to specific items within the code. The LLM-trained model has picked up the extra variable renames, but they look to be in line with what the GPT4-trained model had at 50 steps. I also thought it was interesting that the LLM-trained model refers to [2] as the third field (mathematically correct):

At 150 steps, the GPT4-trained model has slightly improved the function description while maintaining the variable renames. The LLM-trained model has improved the function name to match the GPT4-trained model at 50 steps, while losing variable renames; interestingly, it now refers to [2] as the second element:

Finally, at 200 steps the GPT4-trained model has slightly tweaked its description. The LLM-trained model has rediscovered its variable renames from the 100-step version and also refined how it references [2] within the code:

Clearly the mistral-7b model fine-tuned against the high-quality GPT4 dataset performs much better than the previous version. The real test is to now compare it with GPT4 directly…

That response looks like something we have seen already; at this point I would say we have proven it is feasible to fine-tune a small local model to perform a specific task at the level of a much larger model.

Making Friends

So now that we have our fine-tuned local model, we need to hook it into IDA and feed it some Hex-Rays. There are a few other plugins that offer similar functionality:

I decided to write my own simple version; apologies in advance for any errors or poor design decisions, the underlying fine-tuned model is available to use with whatever you like best. Building off the simple Python script shown earlier, I again chose to use Ollama's REST service instead of loading the model directly. I like this design for a few reasons:

  • minimal Python requirements
  • the service can be running on a remote machine with more compute
  • reload/maintenance/update will not interrupt your weeks-long IDA session
  • avoids tying up that IDA session you have had running for weeks with a large memory footprint :)
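The core of that design is small; a minimal sketch of querying Ollama's non-streaming `/api/generate` endpoint with only the standard library (the plugin in the repository is the authoritative version, and error handling/response parsing are omitted here):

```python
import json
import urllib.request

# Point this at a remote machine with more compute if desired.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, code: str) -> dict:
    """Assemble a non-streaming Ollama /api/generate request body."""
    return {"model": model, "prompt": code, "stream": False}

def query_model(model: str, code: str) -> str:
    """POST Hex-Rays pseudocode to the Ollama REST service and return its reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(model, code)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]
```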

To set up Ollama to use the new model, download the weights and Modelfile in the same directory and configure Ollama:

% ollama create aidapal -f aidapal.modelfile
transferring model data
using existing layer sha256:d8ff55be57629cfb21d60d4977ffb6c09071104d08bce8b499e78b10481b0a3a
using existing layer sha256:2af02daf0820d322d7e928bec1d7398090f4679c49c24c67575b5cffa7f08668
using existing layer sha256:0c3d95e257e4029eb818625dbf1627a4ca182eefcdbc360d75c108afda3cf458
using existing layer sha256:3da0ba8b21dda1aba779a536319f87fbed8ee78e80b403ce2c393cec6d58e1a9
creating new layer sha256:5fe21ec0a43781478cefd5a2b4b047651c889e08f1d7e4bf7e8bc5a7413e425a
writing manifest
success
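For reference, a minimal Modelfile for a GGUF fine-tune generally looks like the following; the filename and parameter values here are placeholders, the actual Modelfile ships alongside the weights in the repository:

```
FROM ./aidapal.Q8_0.gguf
PARAMETER num_ctx 4096
TEMPLATE """{{ .Prompt }}"""
```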

Loading the plugin can be done through the IDA menu (File->Script File). After loading, the script provides a new context menu option when right-clicking within a Hex-Rays window:

In this example the plugin has been configured with a single model; if you have other models loaded within your Ollama service, they can be added and will appear within the context menu as well. After activating the menu item, the plugin will query the selected model with the Hex-Rays code and display a dialog when it is complete:

Within this dialog, all returned values can be accepted individually by selecting the checkbox (enabled by default) and clicking Accept; clicking Cancel will reject all and close the dialog.

In this example, the results are accepted and applied fully:

This example shows rejecting the function name and description, only applying the variable renames:

There is also nothing stopping you from accepting all changes multiple times:

Another consideration I had when creating aiDAPal was implementing some form of data lookup like Retrieval Augmented Generation (RAG), but in the spirit of keeping things simple I came up with the idea of treating the IDA database (IDB) as a lookup/knowledge base. The basic idea: whenever the plugin is activated, it identifies any references within the code being analyzed, retrieves any comments that exist at the target locations, and includes them as a multi-line comment before the function that is sent for analysis. An example of this workflow can be seen in the following image:
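The prompt-assembly half of that idea can be sketched as follows. Inside IDA, the address-to-comment mapping would be gathered with IDAPython calls such as `idautils.XrefsFrom()` and `idc.get_cmt()` (an assumption here; the plugin's exact calls may differ), so this sketch takes that mapping as input to stay self-contained:

```python
def build_context_comment(ref_comments: dict[int, str]) -> str:
    """Format IDB comments found at referenced addresses as a C block comment.

    ref_comments maps a referenced address to the comment stored there in
    the IDB.
    """
    if not ref_comments:
        return ""
    lines = ["/*"]
    for addr in sorted(ref_comments):
        lines.append(f" * 0x{addr:x}: {ref_comments[addr]}")
    lines.append(" */")
    return "\n".join(lines)

def prepare_prompt(code: str, ref_comments: dict[int, str]) -> str:
    """Prepend the gathered context comment to the Hex-Rays output."""
    ctx = build_context_comment(ref_comments)
    return f"{ctx}\n{code}" if ctx else code
```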

For this example, the WDT_ICR register location is queried for any comments; if one exists, it is extracted and included in our request. Something to consider is that in this case, the WDT_ICR register is common and part of the 'base knowledge' stored within the original trained weights, so it would have been identified fine without the extra comment. This can be confirmed by querying the underlying model for this information:

% ollama run mistral:7b
>>> give me a single sentence description of the WDT_ICR register
 The WDT_ICR (Watchdog Timer Independent Counter Register) is a control register in the watchdog timer unit that triggers a reset upon being written, allowing configuring the watchdog timer's independent counter.

By using the IDB as an extra source of knowledge as shown previously, we can use our own information/knowledge to better guide the response. In the following image, the comment associated with the WDT_ICR register has been changed, resulting in the model returning a different response that takes the additional knowledge provided by the IDB into account:

Currently, this functionality does not extract information from comments that may be defined at the start of a function. While that would be useful and would give context as to what a called function does, it would often result in the inclusion of a large number of extra tokens, potentially exhausting the underlying model's context window and returning low-quality results.

The End?

While I am sure I made mistakes along the way, I hope this information is helpful to anyone wanting to fine-tune an LLM for local usage, whether that is making a better version of the one we are sharing or something completely different. It is also worth noting that most of this project was executed earlier this year (February/March); since then, a handful of new models have been released that it would be interesting to explore and adapt this research to (phi3-med/llama3/Codestral). If you made it this far, thanks for reading.

All files related to this project can be found on our GitHub (https://github.com/atredispartners/aidapal).

Public Report – Keyfork Implementation Review

In April 2024, Distrust engaged NCC Group’s Cryptography Services team to perform a cryptographic security assessment of keyfork, described as “an opinionated and modular toolchain for generating and managing a wide range of cryptographic keys offline and on smartcards from a shared mnemonic phrase”. The tool is intended to be run on an air-gapped system and allows a user to split or recover a cryptographic key using Shamir Secret Sharing, with shares imported and exported using mechanisms such as mnemonics or QR codes. These shares can be managed by one or more users, with a defined threshold of shares required to recover the original secret. A retest was conducted in May 2024, which resulted in all findings and notes being marked Fixed.

The review targeted the tagged release keyfork-v0.1.0 of the keyfork repository. Distrust indicated that memory-related (e.g., zeroization) and timing-related attacks were not a concern due to the trusted nature of the hardware and its environment, and as such were not investigated in detail.

Several engagement notes and several low-impact findings were uncovered, each of which was promptly addressed by Distrust.
