The funny thing about schemes like that is that, once you know the mechanisms, you can likely write tools to deobfuscate them automatically. Of course, this takes time and effort, and is beyond most spammers' capabilities.
I'm gonna write about this in pt. 2. Basically, you can use symbolic execution to recover the CFG[1] (using something like miasm), eliminate dead code, restore dynamic library calls with emulation, and so on. But the point is that it would take an incredible amount of work and co-operation between tools, and even then you wouldn't have begun to understand anything about the binary, which is a whole other story. There is a bit of a shortcut, though: combined with a couple of tools, it lets you make sense of this binary. I'll reveal it in my next post.
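To make the "eliminate dead code" step concrete, here is a minimal sketch in Python. It assumes a made-up toy IR of `(dest, op, sources)` tuples and ignores flags, memory writes and other side effects, so it is nowhere near what a real tool has to handle. The names `Instr` and `eliminate_dead_code` are hypothetical, not from miasm or any other framework.

```python
from typing import List, Set, Tuple

# Hypothetical toy IR: (destination register, operation, source operands).
# Assumption: instructions have no side effects besides writing `dest`.
Instr = Tuple[str, str, Tuple[str, ...]]


def eliminate_dead_code(block: List[Instr], live_out: Set[str]) -> List[Instr]:
    """Drop instructions whose result is never read (backward liveness pass)."""
    live = set(live_out)
    kept: List[Instr] = []
    for dest, op, srcs in reversed(block):
        if dest in live:
            # This write is observed later: keep it, and mark its register
            # inputs as live (immediates are plain digit strings here).
            live.discard(dest)
            live.update(s for s in srcs if not s.isdigit())
            kept.append((dest, op, srcs))
        # Otherwise nothing reads `dest` before it is overwritten: junk.
    kept.reverse()
    return kept


block = [
    ("rax", "mov", ("rdi",)),
    ("rcx", "mov", ("1234",)),        # junk inserted by the obfuscator
    ("rcx", "xor", ("rcx", "rdx")),   # junk inserted by the obfuscator
    ("rax", "add", ("rax", "rsi")),
]
cleaned = eliminate_dead_code(block, live_out={"rax"})
for instr in cleaned:
    print(instr)
```

With only `rax` live at the end of the block, the two `rcx` writes are removed and the `mov`/`add` on `rax` survive. The hard part in practice is knowing `live_out` at all, which is exactly where the CFG recovery comes in.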
Most obfuscation techniques are lossy. You lose information such as project structure, file names, data types, variable names and so on. Decompilation and deobfuscation might give you a shadow of the original source code, but the benefits are overstated, because the advantage over working directly with assembly isn't that big. Most of the time is spent finding the dozen relevant functions out of 10000. If you truly need access to the entire source code, your time is better spent on an open-source project.
> You lose information such as project structure, names of files, data types, variable names and so on.
You lose half of those by not having debugging symbols and the other half by stripping the binary. This is all lost during compilation already, not due to explicit obfuscation. If you've ever worked with a compiler that is mediocre at generating debug symbols, you'll know it's the compiler doing extra work that provides all these, not obfuscation that removes them.
That works if the obfuscating patterns are all straightforward, like a regular grammar. But if it's not possible to distinguish obfuscation from genuine code, the problem could quickly become intractable (NP-hard).
We cannot have _the_ source, but we can have a good enough approximation of it, especially if a human is in the loop (see: commercial decompilation software like the Hex-Rays decompiler, Binary Ninja, and even Ghidra).
The point is that we cannot automate reversing these obfuscation mechanisms the same way we cannot automate reversing a binary file to a higher level than assembly.
This is not quite true, especially with current state-of-the-art tools like Ghidra, IDA Pro (with Hex-Rays), etc.
In fact, Rolf Rolles wrote a wonderful guest post[1] for the Hex-Rays blog about automating the reversal of this exact obfuscator, though he wasn't aware of its origins at the time.
All these are great programs, but none of them can understand that level of obfuscation so far. As stated in the post, both Ghidra and IDA interpret the very first block in any of the obfuscated functions, which ends with an indirect branch, as a complete function in and of itself. That's because, in the usual case, an indirect branch at the end of a block is a tail call: it terminates one function and starts another, all with the same stack frame.
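For illustration, here is a rough Python sketch of what a tool would have to do instead of stopping at that first indirect branch: evaluate the block to get a concrete target, then keep following. It assumes a made-up toy instruction set and made-up addresses, and only handles targets that are constant within a single block. `resolve_branch` and `recover_function` are hypothetical names, not Ghidra or IDA APIs.

```python
from typing import Dict, List, Optional, Set, Tuple

MASK = 0xFFFFFFFFFFFFFFFF  # assume 64-bit registers

# Hypothetical toy program: block address -> list of instructions.
# ("mov", reg, imm), ("add", reg, imm), ("xor", reg, imm),
# ("jmp", reg) is an indirect branch, ("ret",) ends the function.
Program = Dict[int, List[Tuple]]


def resolve_branch(block: List[Tuple]) -> Optional[int]:
    """Constant-fold a block; return the target of its indirect jmp, if known."""
    regs: Dict[str, int] = {}
    for ins in block:
        op = ins[0]
        if op == "mov":
            regs[ins[1]] = ins[2] & MASK
        elif op == "add" and ins[1] in regs:
            regs[ins[1]] = (regs[ins[1]] + ins[2]) & MASK
        elif op == "xor" and ins[1] in regs:
            regs[ins[1]] = regs[ins[1]] ^ ins[2]
        elif op == "jmp":
            return regs.get(ins[1])  # None if the register is not a known constant
    return None


def recover_function(program: Program, entry: int) -> Tuple[Set[int], Set[Tuple[int, int]]]:
    """Follow resolved indirect branches from `entry`; return (blocks, edges)."""
    seen: Set[int] = set()
    edges: Set[Tuple[int, int]] = set()
    work = [entry]
    while work:
        addr = work.pop()
        if addr in seen or addr not in program:
            continue
        seen.add(addr)
        target = resolve_branch(program[addr])
        if target is not None:
            edges.add((addr, target))
            work.append(target)
    return seen, edges


program: Program = {
    0x1000: [("mov", "rax", 0x0F00), ("xor", "rax", 0x1F20), ("jmp", "rax")],
    0x1020: [("mov", "rcx", 0x1000), ("add", "rcx", 0x40), ("jmp", "rcx")],
    0x1040: [("ret",)],
}
blocks, edges = recover_function(program, 0x1000)
print(sorted(hex(b) for b in blocks))
```

A disassembler that treats the `jmp rax` at `0x1000` as a tail call ends the function there, while this sketch links all three blocks into one. Real obfuscators compute targets from state that spans blocks or depends on input, which is where proper symbolic execution is needed instead of this per-block folding.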
EDIT: also keep in mind the CFG isn't flattened here.
Exactly. Such tools are definitely possible, even if they rely on Ghidra's or IDA's plugin systems.
What I like is the economics of the idea that one company can build an obfuscator, and then another company can build an anti-obfuscator which completely nullifies the value proposition of the first company.