In June 2026, NeurIPS, one of the most prestigious conferences in machine learning, did something the field has spent years telling everyone else not to do. It deployed an automated AI detector to make high-stakes decisions at scale, with no appeal, and human-written work got caught in the net.
The chairs of the conference's position paper track used Pangram, a commercial AI text detector, to enforce a policy against AI-generated submissions. They desk-rejected 178 papers, about 18.4% of the track, based on the detector's scores. Another 123 authors were told to prove they wrote their own papers by a deadline. There was no appeal for the rejected papers.
Then researchers started showing their work. People with drafts they had written by hand, with git histories and draft trails to prove it, had been flagged. The trigger in several cases was punctuation: em-dashes and clean, consistent formatting pushed the detector's score up. The irony was not lost on a community that publishes papers warning about exactly this failure mode.
If you run a small team and you are thinking about using an AI detector to screen job applicants, grade work, moderate content, or check whether an employee used AI, the NeurIPS episode is the clearest warning you will get this year.
TL;DR: NeurIPS desk-rejected 178 papers (18.4% of a track) on AI-detector scores with no appeal, and human-written work was flagged, in several cases because of em-dashes and clean formatting. AI content detectors produce false positives at rates that make them unsafe as the deciding factor in a high-stakes decision. If you use one, validate it on your own population, never make it the sole basis for an adverse decision, require human review, and offer an appeal. For most small teams, the right answer is to not use detectors for consequential decisions at all.
What actually happened at NeurIPS
The numbers are worth sitting with, because they show why detector-driven decisions are fragile.
The same detector flagged 42.7% of submissions on its default window settings, and 12.7% on a refined 100-word configuration. That is the single most important fact in the whole story. The share of papers labeled "AI" more than tripled depending on a setting the authors could not see and had no say in. When the outcome of a decision swings that much based on a configuration choice, the decision is not measuring the thing it claims to measure.
The 178 rejections came from three thresholds: scores above 0.9 (77 papers), scores above 0.8 combined with a solo-authorship flag (79 papers), and scores above 0.5 where the authors had explicitly denied using AI (22 papers). That last group is the most troubling. People said, on the record, that they did not use AI, and a score above a coin-flip was treated as enough to override them.
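To make the three-tier rule concrete, here is a minimal sketch of it as described above. The class, the field names, and the order in which the tiers are checked are assumptions made for illustration; this is not the chairs' actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    score: float         # detector score in [0, 1]
    solo_author: bool    # hypothetical field: paper has a single author
    denied_ai_use: bool  # hypothetical field: author stated no AI was used


def rejection_tier(sub: Submission) -> Optional[str]:
    """Return the tier that would desk-reject this paper, or None."""
    if sub.score > 0.9:
        return "score > 0.9"
    if sub.score > 0.8 and sub.solo_author:
        return "score > 0.8 + solo author"
    if sub.score > 0.5 and sub.denied_ai_use:
        return "score > 0.5 + denial"
    return None


# Per-tier counts as reported in the text above.
reported = {
    "score > 0.9": 77,
    "score > 0.8 + solo author": 79,
    "score > 0.5 + denial": 22,
}
total_rejected = sum(reported.values())  # 178, matching the headline figure
```

Every branch ends in a rejection and none ends in a human looking at the paper, which is the structural problem the rest of this piece is about.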
NeurIPS is not a careless organization. Its reviewers are domain experts. It had a written policy and a defensible goal: position paper tracks had seen a flood of AI-generated submissions, with roughly a third of the track estimated to be machine-written. The problem was not intent. The problem was treating a probabilistic score as a verdict.
The lesson the field taught itself
For years, machine learning researchers have made the same argument to hospitals, courts, schools, and HR departments: do not deploy a model in a high-stakes setting without validating it on your real population, calibrating its thresholds, being transparent about how it works, and giving people a way to contest the output. That argument is correct. It is also exactly what NeurIPS skipped.
This is the part that matters for your team. If a conference full of the people who build these models can fall into the trap, a five-person company evaluating job applicants will fall into it faster. The failure is not about expertise. It is about the seductive convenience of a number. A detector hands you a clean 0.94 and the temptation is to treat 0.94 as truth instead of as a noisy estimate with a real chance of being wrong about this specific person.
A high-stakes decision is any decision that materially affects someone: getting a job, passing a course, keeping an account on a content platform, surviving a performance review. The bar for evidence in those decisions should be high. An AI detector does not clear it.
Why AI detectors produce false positives at scale
AI detectors do not "know" whether text was written by a machine. They estimate the statistical likelihood that text matches patterns common in machine-generated writing. Human writing that happens to be clean, formal, structured, or templated can match those patterns closely.
The documented triggers are mundane and unfair. Em-dashes raise scores. Consistent punctuation raises scores. A formal register raises scores. Several studies have found that text from non-native English speakers is flagged at higher rates, because that writing is often more templated and careful. In other words, the writers most likely to be wrongly accused include some of the most disciplined and the most disadvantaged.
Even a detector with a low advertised false-positive rate becomes dangerous at volume. If a tool wrongly flags 5% of human-written text and you run 400 human applicants through it, you will wrongly flag around 20 real people. Each of those is a qualified candidate rejected, or a student accused, on the strength of a number. And the vendor's advertised rate is almost always measured on text that looks nothing like yours, with settings that may not match yours. At NeurIPS, a configuration change alone moved the flag rate between 42.7% and 12.7% on the same papers.
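The arithmetic behind the volume problem is short enough to write down. The sketch below assumes each human-written text is flagged independently at the same false-positive rate, which is a simplification. The base rate and detection rate in the last example are made-up numbers chosen only to show the shape of the calculation.

```python
def expected_false_flags(n_human: int, fpr: float) -> float:
    """Expected number of human-written texts wrongly flagged."""
    return n_human * fpr


def prob_at_least_one_false_flag(n_human: int, fpr: float) -> float:
    """Chance that at least one human writer is wrongly flagged."""
    return 1.0 - (1.0 - fpr) ** n_human


def flag_precision(base_rate: float, tpr: float, fpr: float) -> float:
    """Share of flags that are correct: P(AI-written | flagged), by Bayes' rule.

    base_rate: fraction of all texts that really are AI-written
    tpr: fraction of AI-written texts the detector flags
    fpr: fraction of human-written texts the detector flags
    """
    true_flags = base_rate * tpr
    false_flags = (1.0 - base_rate) * fpr
    return true_flags / (true_flags + false_flags)


# The example from the text: 400 human applicants, 5% false-positive rate.
wrongly_flagged = expected_false_flags(400, 0.05)   # about 20 people

# Hypothetical inputs: 10% of texts are AI-written, the detector catches 90%
# of them, and it wrongly flags 5% of human texts.
precision = flag_precision(base_rate=0.10, tpr=0.90, fpr=0.05)  # about 0.67
```

Under those hypothetical inputs, roughly one flag in three points at a person who wrote their own text, even though the headline error rate is only 5%.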
Where teams deploy detectors, and what a false positive costs
The risk depends on what the flag triggers. Here is how it plays out in the settings small teams actually use.
| Where detectors are used | What a false positive does | What governance requires |
|---|---|---|
| Hiring / resume and assessment screening | Rejects a qualified human applicant as "AI" | Never the sole basis; human review; adverse-action notice; bias audit |
| Education / training / grading | Accuses an honest student or trainee of misconduct | Human decision-maker; evidence beyond the score; real appeal |
| Content moderation / creator platforms | Removes or demonetizes legitimate work | Transparency on thresholds; appeal; human escalation |
| Employee monitoring | Disciplines an employee for "AI use" | Written disclosed policy; human review; documented basis |
In every row, the false positive lands on a person who did nothing wrong, and the cost is borne by them, not by the tool. That asymmetry is the heart of the governance problem. The detector vendor is not in the room when a real candidate gets auto-rejected.
Decision framework: should you deploy an AI detector?
Work through these in order before you turn a detector on for any consequential use.
- Is the decision high-stakes? If a wrong output materially harms someone (rejection, discipline, account loss), the detector is not safe to use as a deciding factor. Stop here for most cases.
- Can you validate it on your own population? If you cannot run a sample of known-human and known-AI text that resembles your real inputs and measure the false-positive rate yourself, you do not know the tool's error rate. Do not deploy on faith in the vendor's number.
- Will it ever be the sole or primary basis for an adverse action? If yes, do not deploy. A probabilistic score must never be the reason someone is rejected, failed, or disciplined.
- Is there a human reviewer who can override it? If there is no human who sees the score, understands its limits, and is empowered to disagree, the system is automated judgment, which is the thing to avoid.
- Is there a real appeal path? If the affected person cannot contest the result and have a human re-decide, you are repeating the NeurIPS mistake.
If you cannot answer those cleanly, the right call is the boring one: do not use an AI detector for the decision. For most small teams, that is the correct answer, and it is also the cheapest one.
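The checklist above can be written as a pre-deployment gate, so that the answers are recorded rather than assumed. This is a sketch only: the class and field names are hypothetical, and treating the checklist as applying only to high-stakes uses is an assumption taken from the framing of this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DetectorPlan:
    high_stakes: bool                  # a wrong output materially harms someone
    validated_on_own_population: bool  # error rates measured on your own inputs
    sole_or_primary_basis: bool        # the score alone can trigger an adverse action
    human_can_override: bool           # a reviewer sees the score and may disagree
    appeal_path: bool                  # the affected person can get a human re-decision


def deployment_blockers(plan: DetectorPlan) -> List[str]:
    """Return the reasons this plan should not go live (empty list = no blockers)."""
    if not plan.high_stakes:
        # Assumption: the checklist only gates consequential decisions.
        return []
    blockers = []
    if not plan.validated_on_own_population:
        blockers.append("error rate unknown on your own population")
    if plan.sole_or_primary_basis:
        blockers.append("score would be the sole or primary basis for an adverse action")
    if not plan.human_can_override:
        blockers.append("no human reviewer empowered to override the score")
    if not plan.appeal_path:
        blockers.append("no appeal path for the affected person")
    return blockers


# A plan resembling the NeurIPS setup as described in this article.
plan = DetectorPlan(
    high_stakes=True,
    validated_on_own_population=False,
    sole_or_primary_basis=True,
    human_can_override=True,
    appeal_path=False,
)
blockers = deployment_blockers(plan)
```

Any non-empty result means the answer is the boring one: do not deploy.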
If you decide to use one anyway: four guardrails
Some workflows genuinely benefit from a detector as a triage signal, for example flagging submissions for closer human review rather than auto-rejecting them. If that is your case, hold to four rules.
Validate on your population first. Assemble known-human and known-AI samples that look like your real inputs, run them, and measure your own false-positive and false-negative rates. Use that number, not the vendor's.
Make it a signal, never a verdict. A high score routes a case to a human. It never, by itself, produces an adverse outcome. Write that into the policy.
Require documented human review. A named person reviews the flagged case with full context, can override the score, and records the reasoning. Keep the record.
Disclose and provide appeal. Tell applicants, students, or employees that a detector is part of the process, what it does, and how to contest a result. For hiring, align this with adverse-action notices and state AI disclosure rules.
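The first two guardrails are the easiest to make mechanical. The sketch below measures false-positive and false-negative rates on labeled samples, puts a Wilson score interval around the false-positive rate so a small sample is not over-trusted, and routes high scores to a person instead of to an outcome. The function names, the threshold, and the sample scores are all made up for illustration; substitute scores from your own known-human and known-AI samples.

```python
import math
from typing import Sequence, Tuple


def error_rates(human_scores: Sequence[float], ai_scores: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) at this threshold."""
    false_pos = sum(1 for s in human_scores if s >= threshold)
    false_neg = sum(1 for s in ai_scores if s < threshold)
    return false_pos / len(human_scores), false_neg / len(ai_scores)


def wilson_interval(errors: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for an error rate of errors/n."""
    p = errors / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def route(score: float, threshold: float) -> str:
    """A high score sends the case to a person. It never produces a rejection."""
    return "human_review" if score >= threshold else "normal_process"


# Made-up detector scores on samples whose origin is known.
human_scores = [0.05, 0.12, 0.31, 0.08, 0.92, 0.22, 0.15, 0.40, 0.07, 0.18]
ai_scores = [0.97, 0.88, 0.45, 0.99, 0.93, 0.76, 0.91, 0.60, 0.95, 0.98]

fpr, fnr = error_rates(human_scores, ai_scores, threshold=0.8)

# With 3 false positives in 50 known-human samples, the plausible range for
# the true false-positive rate is wide: roughly 2% to 16%.
low, high = wilson_interval(errors=3, n=50)
```

The width of that interval is the practical reason to validate on more than a handful of samples before trusting any measured rate.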
The legal exposure is real
Even though no law regulates AI detectors by name, several frameworks reach them. In hiring, EEOC disparate-impact principles under Title VII apply if a detector screens out a protected group at a higher rate. Detectors have been found to over-flag non-native English writers, a group that overlaps with the protected category of national origin. State and local laws add disclosure and audit duties: Illinois requires notice for AI in hiring, and NYC Local Law 144 requires bias audits of automated employment decision tools. The EU AI Act classifies AI used in employment and education as high-risk, which brings accuracy, documentation, and human-oversight obligations. And a public accusation that someone used AI when they did not can carry defamation risk if it is communicated and causes harm.
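One common first check for disparate impact is the four-fifths rule of thumb from the EEOC's Uniform Guidelines: compare each group's pass rate to that of the group with the highest pass rate, and look closely at any ratio below 0.8. The sketch below applies it to detector flags. The group labels and counts are invented, and a ratio on its own is a screening signal, not a legal conclusion.

```python
from typing import Dict, Tuple


def pass_rate(n_total: int, n_flagged: int) -> float:
    """Share of a group that was NOT flagged by the detector."""
    return (n_total - n_flagged) / n_total


def impact_ratios(groups: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each group to its pass rate divided by the highest group's pass rate.

    groups: name -> (n_total, n_flagged)
    """
    rates = {name: pass_rate(total, flagged)
             for name, (total, flagged) in groups.items()}
    best = max(rates.values())
    return {name: rate / best for name, rate in rates.items()}


# Invented counts for illustration only.
ratios = impact_ratios({
    "native_english": (200, 10),      # 95% pass
    "non_native_english": (100, 30),  # 70% pass
})
needs_review = [name for name, ratio in ratios.items() if ratio < 0.8]
```

In this invented example the second group's ratio is about 0.74, below the four-fifths mark, which is the kind of gap a bias audit is meant to surface before anyone is screened out.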
The throughline is the same as the NeurIPS lesson. A detector score is a guess. Build your decision so that a wrong guess cannot, on its own, hurt a real person. If you cannot build it that way, do not deploy it.
