By "better" I mean grading based on whether the output contains any nonsense or internal contradictions, or similar criteria.
Sounds like you want a hard AI to determine whether a language model generates nonsense.