7 minutes in, he shows the SQLi he found in Ghost (the first sev:hi in the history of the project). If I'd remembered better, I would have mentioned in the post:
* it's a blind SQL injection
* Claude Code wrote an exploit for it. Not a POC. An exploit.
POC generally means “you can demonstrate unintentional behavior”.
“Exploit” means you can gain access or do something malicious.
It’s a fine line. The author’s point is that the LLM was able to demonstrate actual malfeasance, not just an unintended consequence. That’s a big deal, considering that demonstrating malicious capability generally requires more know-how than producing a raw POC.
Specifically: the exploit extracted the admin's credentials from the database. A blind SQLi POC would simply demonstrate the existence of a timing channel based on a pathological input.
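The distinction is concrete enough to sketch. Here's a hedged, self-contained Python toy (all names are hypothetical, and the timing oracle is simulated locally rather than measured over HTTP): a POC needs only one positive answer from the oracle to prove the channel exists; an exploit loops the same oracle to pull the credential out.

```python
import string

# Hypothetical target: SECRET stands in for the admin credential the
# exploit would extract. In a real blind SQLi, the "oracle" below is a
# timed HTTP request, not a local comparison.
SECRET = "s3cr3t"

def timing_oracle(guess_is_true: bool) -> bool:
    """Stand-in for one timed request. A real exploit injects something
    like `' OR IF(SUBSTR(password,i,1)='x', SLEEP(2), 0)-- ` and treats
    a slow response as "true"."""
    return guess_is_true

def extract(length: int) -> str:
    # A POC stops at the first confirmed timing difference; an exploit
    # iterates the oracle to reconstruct the secret character by character.
    alphabet = string.ascii_letters + string.digits
    recovered = ""
    for i in range(length):
        for ch in alphabet:
            if timing_oracle(SECRET[i] == ch):  # one timed request per guess
                recovered += ch
                break
    return recovered
```

The point of the sketch is the loop: the extra step from "the oracle answers" to "iterate it until the credentials fall out" is exactly the POC-to-exploit gap.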
One other commenter asked a decent question: does going lighter (Zig) or harder (Rust) on memory safety confer any meaningful advantage against the phenomenon you describe?
Nicholas Carlini is the real deal. He was most recently on the front page for "How to win a best paper award", about his experience winning a series of awards at Big 4 academic security conferences, most recently for work he coauthored with Adi Shamir (I'm just namedropping the obvious name) on stealing the weights from deep neural networks. Before all that (and before he got his doctorate), he and Hans Nielsen wrote the back half of Microcorruption.
I've never been on a security-specific team, but it's always seemed to me that triggering a bug is, for the median issue, easier than fixing it, and I mentally extend that to security issues. This holds especially true if the "bug" is a question about "what is the correct behavior?", where the "current behavior of the system" is some emergent / underspecified consequence of how different features have evolved over time.
I know this is your career, so I'm wondering what I'm missing here.
It has generally been the case that (1) finding and (2) reliably exploiting vulnerabilities is much more difficult than patching them. In fact, patching them is often so straightforward that you can kill whole bug subspecies just by sweeping the codebase for the same pattern once you see a bug. You'd do that just sort of as a matter of course, without necessarily even qualifying that the bugs you're squashing are exploitable.
As bugs get more complicated, that asymmetry has become less pronounced, but the complexity of the bugs (and their patches) is offset by the increased difficulty of exploiting them, which has become an art all its own.
LLMs sharply tilt that balance back toward the attacker.
It's a good question. Fuzzers generated a surge of new vulnerabilities, especially after institutional fuzzing clusters got stood up, and after we converged on coverage-guided fuzzers like AFL. We then got to a stable equilibrium, a new floor, such that vulnerability research & discovery doesn't look that drastically different after fuzzing than it did before.
Two things to notice:
* First, fuzzers also generated and continue to generate large stacks of unverified crashers, such that you can go to archives of syzkaller crashes and find crashers that actually work. My contention is that models are not just going to produce hypothetical vulnerabilities, but also working exploits.
* Second, the mechanism 4.6 and Codex are using to find these vulnerabilities is nothing like that of a fuzzer. A fuzzer doesn't "know" it's found a vulnerability; it's a simple stimulus/response test (sequence goes in, crash does/doesn't come out). Most crashers aren't exploitable.
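For contrast, that stimulus/response loop is simple enough to sketch. This is a deliberately minimal, hypothetical Python stand-in for what a real fuzzer like AFL does (minus coverage guidance): mutate an input, run the target, and record nothing but whether it crashed.

```python
import random

def fuzz_once(target, corpus):
    """One stimulus/response round: mutate a corpus input, run the
    target, record only whether it crashed. Nothing here "knows" what a
    vulnerability is; a crash is the entire signal."""
    sample = bytearray(random.choice(corpus))
    if sample:
        pos = random.randrange(len(sample))
        sample[pos] ^= random.randrange(1, 256)  # flip some bits
    try:
        target(bytes(sample))       # stimulus
        return None                 # response: no crash, nothing learned
    except Exception:
        return bytes(sample)        # a "crasher" -- not necessarily exploitable
```

The return value is the whole output of the process: a pile of unqualified crashers, which is exactly why syzkaller-style archives accumulate untriaged crashes.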
Models can use fuzzers to find stuff, and I'm surprised that (at least for Anthropic's Red Team) that's not how they're doing it yet. But at least as I understand it, that's generally not what they're doing. It's something much closer to static analysis.
I suspect we'll see combinations of symbolic execution + fuzzing as contextual inputs to LLMs, with LLMs delegating highly directed tasks to these external tools that are radically faster at exploring a space with the LLM guiding based on its own semantic understanding of the code.
I'm with you; I expected this to be happening already. Funny enough, I guess even a hardened codebase isn't at the level of "we need to optimize this" currently, so you can just throw tokens at the problem.
Right, so that's exactly how I was thinking about it before I talked to Carlini. Then I talked to Carlini for the SCW podcast. Then I wrote this piece.
I don't know that I'm ready to say that the frontier of vulnerability research with agents is modeling, fuzzing, and analysis (orchestrated by an agent). It may very well be that the models themselves stay ahead of this for quite some time.
That would be a super interesting result, and it's the result I'm writing about here.
I have just seen too much infrastructure set up to 'find bugs' effectively sitting and doing nothing: either the wrong thing gets audited, or tons of compute gets thrown at a codebase and nobody ever checks in on or verifies the results.
This seems like a human/structural issue that an AI won't actually fix. Attackers and defenders alike will gain access to the same models, so it feels a little like we're back to square one.
If that's true, and if patches can effectively be pushed out quickly, then the results of this will be felt mostly by vulnerability researchers, which is the subject of the piece. But those are big "ifs".
The other thing to remember is that when it comes to complex targets, attacks are still found by using a different fuzzer and/or targeting a different entry point.
It stands to reason that the same will apply for LLMs.
I don't know, maybe it is? My point is just that frontier models start off with latent models of all the interconnectivity in all the important open-source codebases, to a degree that would be infeasible for the people who learned how all the CSS object lifecycles and image rendering and unicode shaping stuff worked well enough to use them in exploits.
That might be one outcome, especially for large, expertly-staffed vendors who are already on top of this stuff. My real interest is in what happens to the field for vulnerability researchers.
Perhaps a meta-evolution: they become experts at writing harnesses and prompts for discovering and patching vulnerabilities in existing code and software. My main interest is whether, now that we have LLMs, the software industry will move to adopting techniques like formal verification and other, perhaps more lax, approaches that massively increase the quality of software.
> Perhaps a meta evolution, they become experts at writing harnesses and prompts
Harnesses, maybe, but prompts?
There's still this belief amongst AI coders that they can command a premium for development because they can write a prompt better than Bob from HR, or Sally from Accounting.
When all you're writing are prompts, your value is less than it was before, because the number of people who can write the prompt is substantially larger than the number of people who could program.
Also, synthetic data and templates to help them discover new vulnerabilities or make agents work on things they're bad at. They differentiate with their prompts or specialist models.
Also, like ForAllSecure's Mayhem, I think they can differentiate on automatic patching that's reliable and secure. Maybe test generation that achieves full coverage, too. They become drive-by verification and validation specialists who also fix your stuff for you.
Outside of limited, specific circumstances, formal verification gives you nothing that tests don't give you, and it makes development slow and iteration a chore. People know about it, and it's not used, for a lot of reasons.
It sounds like what makes the pipeline in the article effective is the second stage, which takes in the vulnerability reports produced by the first stage and confirms or rejects them. The article doesn't say what the rejection rate is there.
I don't think the spammers would think to write the second layer, they would most likely pipe the first layer (a more naive version of it too, probably) directly to the issue feed.
* Carlini's team used new frontier models that have gotten materially better at finding vulnerabilities (talk to vulnerability researchers outside the frontier labs, they'll echo that). Stenberg was getting random slop from people using random models.
* Carlini's process is iterated exhaustively over the whole codebase; he's not starting with a repo and just saying "find me an awesome bug" and taking that and only that forward in the process.
* And then yes, Carlini is qualifying the first-pass findings with a second pass.
I guess the broader point I wanted to make is about the people responsible for the deluge of LLM-reported bugs and security vulnerabilities on countless open-source projects (not only on curl): they weren't considerate or thoughtful security researchers, they were spammers looking to raise their profile with fully automated, hands-off open source "contributions". I would expect that the spammers would continue to use whatever lowest common denominator tooling is available, and continue to cause these headaches for maintainers.
That doesn't mean frontier models and tooling built around them aren't genuinely useful to people doing serious security research: that does seem to be the case, and I'm glad for it.
These articles are no fun anymore, because it's almost impossible to find anybody to take the other end of the claim, that there's any perceptible difference in sound quality from high-end cables. Every audiophile forum I could find talking about this video all said the same thing: "no shit, of course, everyone knows this already".
It’s true, but somehow there still seems to be a market that keeps those things in existence. What's also interesting to me is that everyone knows there's no point, yet people still buy them.
Presumably the people who buy them aren't talking about audiophile stuff online. My bet would be that they're generally people who buy ludicrously expensive components, and are then told they need to get the matching cables to get the most out of it.