7 minutes in, he shows the SQLi he found in Ghost (the first sev:hi in the history of the project). If I'd remembered better, I would have mentioned in the post:
* it's a blind SQL injection
* Claude Code wrote an exploit for it. Not a POC. An exploit.
POC generally means “you can demonstrate unintentional behavior”.
“Exploit” means you can gain access or do something malicious.
It’s a fine line. The author’s point is that the LLM was able to demonstrate actual malfeasance, not just an unintended consequence. That’s a big deal, considering that demonstrating malicious capability generally requires more know-how than producing a raw POC.
Specifically: the exploit extracted the admin's credentials from the database. A blind SQLi POC would simply demonstrate the existence of a timing channel based on a pathological input.
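The distinction is concrete enough to sketch. Here's a hedged, self-contained Python toy (all names are hypothetical, and the timing oracle is simulated locally rather than measured over HTTP): a POC needs only one positive answer from the oracle to prove the channel exists; an exploit loops the same oracle to pull the credential out.

```python
import string

# Hypothetical target: SECRET stands in for the admin credential the
# exploit would extract. In a real blind SQLi, the "oracle" below is a
# timed HTTP request, not a local comparison.
SECRET = "s3cr3t"

def timing_oracle(guess_is_true: bool) -> bool:
    """Stand-in for one timed request. A real exploit injects something
    like `' OR IF(SUBSTR(password,i,1)='x', SLEEP(2), 0)-- ` and treats
    a slow response as "true"."""
    return guess_is_true

def extract(length: int) -> str:
    # A POC stops at the first confirmed timing difference; an exploit
    # iterates the oracle to reconstruct the secret character by character.
    alphabet = string.ascii_letters + string.digits
    recovered = ""
    for i in range(length):
        for ch in alphabet:
            if timing_oracle(SECRET[i] == ch):  # one timed request per guess
                recovered += ch
                break
    return recovered
```

The point of the sketch is the loop: the extra step from "the oracle answers" to "iterate it until the credentials fall out" is exactly the POC-to-exploit gap.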
One other commenter asked a decent question: does going lighter (Zig) or harder (Rust) on memory safety confer any meaningful advantage against the phenomenon you describe?
Nicholas Carlini is the real deal. He was most recently on the front page for "How to win a best paper award", about his experience winning a series of awards at Big 4 academic security conferences, most recently for work he coauthored with Adi Shamir (I'm just namedropping the obvious name) on stealing the weights from deep neural networks. Before all that (and before he got his doctorate), he and Hans Nielsen wrote the back half of Microcorruption.
I've never been on a security-specific team, but it's always seemed to me that triggering a bug is, for the median issue, easier than fixing it, and I mentally extend that to security issues. This holds especially true if the "bug" is a question about "what is the correct behavior?", where the "current behavior of the system" is some emergent / underspecified consequence of how different features have evolved over time.
I know this is your career, so I'm wondering what I'm missing here.
It has generally been the case that (1) finding and (2) reliably exploiting vulnerabilities is much more difficult than patching them. In fact, patching them is often so straightforward that you can kill whole bug subspecies just by sweeping the codebase for the same pattern once you see a bug. You'd do that just sort of as a matter of course, without necessarily even qualifying that the bugs you're squashing are exploitable.
As bugs get more complicated, that asymmetry has become less pronounced, but the complexity of the bugs (and their patches) is offset by the increased difficulty of exploiting them, which has become an art all its own.
LLMs sharply tilt that balance back toward the attacker.
It's a good question. Fuzzers generated a surge of new vulnerabilities, especially after institutional fuzzing clusters got stood up, and after we converged on coverage-guided fuzzers like AFL. We then got to a stable equilibrium, a new floor, such that vulnerability research & discovery doesn't look that drastically different after fuzzing than it did before.
Two things to notice:
* First, fuzzers also generated and continue to generate large stacks of unverified crashers, such that you can go to archives of syzkaller crashes and find crashers that actually work. My contention is that models are not just going to produce hypothetical vulnerabilities, but also working exploits.
* Second, the mechanism 4.6 and Codex are using to find these vulnerabilities is nothing like that of a fuzzer. A fuzzer doesn't "know" it's found a vulnerability; it's a simple stimulus/response test (sequence goes in, crash does/doesn't come out). Most crashers aren't exploitable.
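For contrast, that stimulus/response loop is simple enough to sketch. This is a deliberately minimal, hypothetical Python stand-in for what a real fuzzer like AFL does (minus coverage guidance): mutate an input, run the target, and record nothing but whether it crashed.

```python
import random

def fuzz_once(target, corpus):
    """One stimulus/response round: mutate a corpus input, run the
    target, record only whether it crashed. Nothing here "knows" what a
    vulnerability is; a crash is the entire signal."""
    sample = bytearray(random.choice(corpus))
    if sample:
        pos = random.randrange(len(sample))
        sample[pos] ^= random.randrange(1, 256)  # flip some bits
    try:
        target(bytes(sample))       # stimulus
        return None                 # response: no crash, nothing learned
    except Exception:
        return bytes(sample)        # a "crasher" -- not necessarily exploitable
```

The return value is the whole output of the process: a pile of unqualified crashers, which is exactly why syzkaller-style archives accumulate untriaged crashes.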
Models can use fuzzers to find stuff, and I'm surprised that (at least for Anthropic's Red Team) that's not how they're doing it yet. But at least as I understand it, that's generally not what they're doing. It's something much closer to static analysis.
I suspect we'll see combinations of symbolic execution + fuzzing as contextual inputs to LLMs, with LLMs delegating highly directed tasks to these external tools that are radically faster at exploring a space with the LLM guiding based on its own semantic understanding of the code.
I'm with you; I expected this to be happening already. Funny enough, I guess even a hardened codebase isn't at the level of "we need to optimize this" currently, so you can just throw tokens at the problem.
Right, so that's exactly how I was thinking about it before I talked to Carlini. Then I talked to Carlini for the SCW podcast. Then I wrote this piece.
I don't know that I'm ready to say that the frontier of vulnerability research with agents is modeling, fuzzing, and analysis (orchestrated by an agent). It may very well be that the models themselves stay ahead of this for quite some time.
That would be a super interesting result, and it's the result I'm writing about here.
I have just seen too much infrastructure set up to 'find bugs' effectively sitting and doing nothing: either the wrong thing gets audited, or tons of compute gets thrown at a codebase and nobody ever checks in on or verifies the results.
This seems like a human/structural issue that an AI won't actually fix. Attackers and defenders alike will gain access to the same models, so it feels a little like we're back to square one.
If that's true, and if patches can effectively be pushed out quickly, then the results of this will be felt mostly by vulnerability researchers, which is the subject of the piece. But those are big "ifs".
The other thing to remember is that when it comes to complex targets, attacks are still found by using a different fuzzer and/or targeting a different entry point.
It stands to reason that the same will apply for LLMs.
I don't know, maybe it is? My point is just that frontier models start off with latent models of all the interconnectivity in all the important open-source codebases, to a degree that would be infeasible for the people who learned how all the CSS object lifecycles and image rendering and unicode shaping stuff worked well enough to use them in exploits.
That might be one outcome, especially for large, expertly-staffed vendors who are already on top of this stuff. My real interest is in what happens to the field for vulnerability researchers.
Perhaps a meta-evolution: they become experts at writing harnesses and prompts for discovering and patching vulnerabilities in existing code and software. My main interest is whether, now that we have LLMs, the software industry will move to adopting techniques like formal verification and other, perhaps more lax, approaches that massively increase the quality of software.
> Perhaps a meta evolution, they become experts at writing harnesses and prompts
Harnesses, maybe, but prompts?
There's still this belief amongst AI coders that they can command a premium for development because they can write a prompt better than Bob from HR, or Sally from Accounting.
When all you're writing are prompts, your value is less than it was before, because the number of people who can write the prompt is substantially larger than the number of people who could program.
Also, synthetic data and templates to help them discover new vulnerabilities or make agents work on things they're bad at. They differentiate with their prompts or specialist models.
Also, like ForAllSecure's Mayhem, I think they can differentiate on automatic patching that's reliable and secure. Maybe test generation that achieves full coverage, too. They become drive-by verification and validation specialists who also fix your stuff for you.
Outside of limited, specific circumstances, formal verification gives you nothing that tests don't give you, and it makes development slow and iteration a chore. People know about it, and it's not used, for a lot of reasons.
It sounds like what makes the pipeline in the article effective is the second stage, which takes in the vulnerability reports produced by the first stage and confirms or rejects them. The article doesn't say what the rejection rate is there.
I don't think the spammers would think to write the second layer, they would most likely pipe the first layer (a more naive version of it too, probably) directly to the issue feed.
* Carlini's team used new frontier models that have gotten materially better at finding vulnerabilities (talk to vulnerability researchers outside the frontier labs, they'll echo that). Stenberg was getting random slop from people using random models.
* Carlini's process is iterated exhaustively over the whole codebase; he's not starting with a repo and just saying "find me an awesome bug" and taking that and only that forward in the process.
* And then yes, Carlini is qualifying the first-pass findings with a second pass.
I guess the broader point I wanted to make is about the people responsible for the deluge of LLM-reported bugs and security vulnerabilities on countless open-source projects (not only on curl): they weren't considerate or thoughtful security researchers, they were spammers looking to raise their profile with fully automated, hands-off open source "contributions". I would expect that the spammers would continue to use whatever lowest common denominator tooling is available, and continue to cause these headaches for maintainers.
That doesn't mean frontier models and tooling built around them aren't genuinely useful to people doing serious security research: that does seem to be the case, and I'm glad for it.
These articles are no fun anymore, because it's almost impossible to find anybody to take the other end of the claim, that there's any perceptible difference in sound quality from high-end cables. Every audiophile forum I could find talking about this video all said the same thing: "no shit, of course, everyone knows this already".
It’s true, but somehow there still seems to be a market that keeps those things in existence. What's also interesting to me is that everyone knows there's no point, yet people still buy them.
Presumably the people who buy them aren't talking about audiophile stuff online. My bet would be that they're generally people who buy ludicrously expensive components, and are then told they need to get the matching cables to get the most out of it.