@awnlee jawking
A correlation between the presence of the AI tag with lots of em-dashes plus a correlation between the absence of an AI tag and an absence of em-dashes points out why em-dashes are a really iffy warning flag? I'm missing something there.
No. I said 'correlation between the presence of the AI tag with lots of em-dashes plus a correlation between the absence of an AI tag and lots of em-dashes points out why em-dashes are a really iffy warning flag.'
Many of the longest NON-AI stories have LOTS of em dashes. That's why it's a lousy flag.
AI detectors (including AIs themselves) use em-dashes as an indicator of AI-generated work.
Most AI detectors are awful at detecting AI, and most AIs are even worse. The most recent serious academic work I've seen on the subject suggests a hit rate no better than 90%, and often much lower, for the vast majority of them. Part of the problem is that, yes, there are very good AI detectors for two-year-old AI models. But they're usually lousy at detecting current models.
Also note that this is an enormous problem both for universities, which are increasingly hitting cases where they're flagging work for 'AI use' and being clobbered in court by students who can prove AI was not used, and for students, who are having to maintain an edit history, notes, etc. to prove (to the level required in legal proceedings) that AI was not used.
AIs are mostly trained on historic works
True, if one believes history starts in the past few months. One of the problems with state-of-the-art LLMs is that they're training on 'current data', more and more of which is, itself, AI-generated. And, since AI detectors are miserably bad, weeding AI-generated data out of the training data is hard.
GPT-4o (May 2024 release) is trained on books through late 2023. GPT-5 (Aug 2025 release) is trained on books through Aug 2024. But GPT-5.2 (Dec 2025 release) is trained on books through Aug 2025. However, it looks like GPT-5.3 and 5.4 haven't moved their dates forward. GPT-5.5 moves out to Dec 2025.
In all cases, that's 'historic,' as long as your definition is fairly broad.
The vast majority of professionally printed material, whether 'historic' from the 15th century or the 21st century, uses em dashes. Thus, the vast majority of professionally printed material used to train LLMs will contain em dashes, regardless of training cutoff.
The collected Harry Potter books contain 10,000 em dashes. Clearly, Ms. Rowling was a huge user of ChatGPT, no? Of course, that's sarcastic - those books were professionally published, so they contain em dashes. Are the Harry Potter books 'historic?' I would argue that they are - they're certainly part of history - but they're definitely modern history. But there is ample evidence that LLMs are trained on Harry Potter.
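For anyone who wants to check a count like that on their own text, here's a rough sketch in Python. The function name `em_dash_stats` and the per-1,000-words rate are my own choices for illustration, not any standard measure, and the word count is a crude whitespace split, so treat the numbers as ballpark only.

```python
import re

EM_DASH = "\u2014"  # the em dash character, written as an escape


def em_dash_stats(text: str) -> dict:
    """Count em dashes and the typed '--' stand-in, plus a rough rate.

    Hypothetical helper for illustration; the word count is a plain
    whitespace split, so words joined by an em dash count as one word.
    """
    words = len(text.split())
    em_dashes = text.count(EM_DASH)
    # Runs of exactly two hyphens, the typed stand-in for an em dash.
    double_hyphens = len(re.findall(r"(?<!-)--(?!-)", text))
    total = em_dashes + double_hyphens
    rate = 1000 * total / words if words else 0.0
    return {
        "words": words,
        "em_dashes": em_dashes,
        "double_hyphens": double_hyphens,
        "per_1000_words": rate,
    }


if __name__ == "__main__":
    sample = "He paused\u2014briefly\u2014and then went on -- slowly."
    print(em_dash_stats(sample))
```

Run it over a long pre-2020 story and a recent one and compare the rates; the point is only that a raw em dash count says little about who, or what, wrote the text.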
Em-dashes have declined in usage to such an extent that many (most?) keyboards don't even have a key for them.
Did you miss my comment about '--', which has been used by typesetters for generations to note a place to set an em dash, and by pretty much every major word processor since the 1980s to actually generate an em dash? Em dash use by non-professional writers has gone sharply up since the 1980s, mostly because TYPEWRITERS didn't have a way to easily generate them. Ever since word processing became a thing, it's been easier for authors to use them, and use went up sharply between 2010 and 2024.
EDIT: See my later comment above in response to @rustyken. MS Word, and likely others, automatically converts '-' (ASCII dash, on pretty much every keyboard) to '&mdash;' (an em dash) in cases where an em dash is appropriate (in its judgment, of course) unless you turn it off. So, most MS Word-authored text likely includes em dashes. You don't need a key for it if software is doing it for you.
In any case, em dashes were the standard for professional work a century ago and they're the standard for professional work today.
There is indeed evidence that some writers are trying to avoid em dashes now, for fear of being seen as AI. In my opinion, that's awful, and we should be pushing back against it.
I wouldn't be at all surprised if authors who do NOT want their AI-generated work to be seen as 'AI Generated' are mass-replacing em dashes with plain ASCII dashes. If that takes hold, use of the ASCII dash may become a much stronger 'red flag' for AI-authored writing than em dashes are.
I'm not switching, either way. Not six books into my series and not in other work (if and when).