It also shows why this approach is questionable. Opus 4.6 without tool use or web access can provide chardets source code in full from memory/training data (ironically, including the licensing header): https://gist.github.com/yannleretaille/1ce99e1872e5f3b7b133e...
This comes with the uncomfortable implication that its impossible to tell actually to what extent are LLMs pulling together snippets of GPLd code, and to what extent is that legally acceptable.
Parallel creation is a very minimal defense to copyright infringement claims. It is practically impossible to prove in humans, to much annoyance of musicians. "Go prove in a court that you have never heard this song, not even in the background somewhere".
LLMs having been trained on all software they could get their hands on will fail this test. There is no parallel creation claim to be had. AI firms love to trot out the "they learn just like humans" which is both false and irrelevant; It's copyright when humans do it to. If you view a GPL'd repo and later reproduce the code unintentionally? Still copyright infringement.
De-facto though, things are different. The technical details behind LLMs are irrelevant. AI companies lie and frustrate discovery, whilst begging politicians to pass laws legalizing their copyright infringement.
There won't be a copyright reckoning, not anymore. All the dumb politicians think AI is going to bail out their economies.
Indeed, and that's through the API. If you use Claude Chat/Code and even if you then turn off web search, it still has access to some of its tools (for doing calculations, running small code snippets etc.) and that environment contains chardets code 4 times:
It's not surprising that they were able to create a new, working version of chardet this quickly. It seems the author just told Claude Code to "do a clean room implementation" and to make sure the code looks different from the original chardet (named several times in the prompt) without considering the training set and the tendency for LLMs to "cheat".
Wow. The guy who’s been thanklessly maintaining the project for 10+ years, with very little help, went way out of his way to produce a zero-reuse, ground-up reimplementation so that it could be MIT licensed... and the very-online copyleft crowd is crucifying him for it and telling him to kick rocks.
Unbelievable. This is why we can’t have nice things.
Mark Pilgrim isn't even the original author, he just ported the C version to Python and contributed nothing to it for the last 10 years.
If you take 5 minutes to look at the code you'll see that v7 works in a completely different way, it mostly uses machine learning models instead of heuristics. Even if you compare the UTF8 or UTF16 detection code you'll see that they have absolutely nothing in common.
Its just API compatible and the API is basically 3 functions.
If he had published this under a different name nobody would have challenged it.
https://github.com/chardet/chardet/issues/327
https://github.com/chardet/chardet/issues/331