A friend who runs a research lab told me her grad students used to spend the first month of any project just finding out what was already known. Not running experiments. Not writing. Just reading their way out of ignorance, one PDF at a time, hoping they hadn't missed the one 2019 paper that already answered the question they were about to spend a year on.
Now that month is about three days. The work didn't disappear. It moved. Here's roughly how, and — more importantly — what stubbornly hasn't changed.
The bottleneck was never reading. It was triage
Everyone talks about literature review like the problem is reading speed. It isn't. A motivated grad student can read a paper in twenty minutes. The problem is deciding which hundred papers out of the thousand your search returns are worth those twenty minutes each.
That sorting is miserable work. You open an abstract, skim it, decide it's tangential, close it. Repeat four hundred times. By Thursday you've forgotten what you were looking for, and you still haven't read a single paper closely. The mechanical front half of research swallows the part where you actually think.
A research agent does that brutal first pass for you. It takes your question, pulls candidate papers, reads the abstracts, judges each one against what you're actually asking, and hands back a ranked pile. You still read the top twenty yourself. You just don't burn a week on the eighty that were never going to matter.
What a good research agent actually does
Worth being precise here, because "AI reads papers for you" is doing a lot of dishonest work in most marketing. The distinction that matters is between a skill and an agent. A skill does one bounded thing: summarize this one abstract, say. An agent is handed a goal and the room to take several steps toward it on its own:
- It takes your research question, the real one, not just a bag of keywords.
- It pulls candidate papers and reads the abstracts.
- It scores each one for relevance to your specific angle, not to the topic in general.
- It summarizes the keepers — method, core finding, stated limitation.
- It maps where papers agree, disagree, or quietly cite each other.
That last step is the one people underestimate. Noticing that paper A's method directly undercuts paper B's central assumption is exactly the kind of connection that used to take a sharp human two weeks and a wall of sticky notes to spot. The agent flags it on Monday afternoon.
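For readers who think in code, here is a minimal sketch of the score-and-rank part of that list. It is an illustration, not how any particular agent works: the `Paper` class, the `score_relevance` and `triage` functions, the term lists, and the three sample papers are all made up for the example, and the keyword counting is a crude stand-in for whatever model call a real agent would make when it judges an abstract against your angle.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    abstract: str


def score_relevance(paper, angle_terms, off_angle_terms):
    """Toy scorer: +1 per on-angle term found, -1 per off-angle term found.

    Assumption: a real agent would replace this with a model judgment.
    """
    text = f"{paper.title} {paper.abstract}".lower()
    hits = sum(term in text for term in angle_terms)
    misses = sum(term in text for term in off_angle_terms)
    return hits - misses


def triage(papers, angle_terms, off_angle_terms, keep=20):
    """Score every paper, sort best-first, return the top `keep` as (score, paper)."""
    scored = [(score_relevance(p, angle_terms, off_angle_terms), p) for p in papers]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]


# Placeholder papers, invented for this sketch.
papers = [
    Paper("Time-restricted feeding in mice",
          "Gut microbiome diversity in a mouse model of intermittent fasting."),
    Paper("Intermittent fasting in adults: a randomized trial",
          "Gut microbiome diversity in human participants after intermittent fasting."),
    Paper("Fasting and weight loss",
          "Weight outcomes in adults; microbiome not measured."),
]

# The "angle" is the question plus its constraint, not just the topic.
angle_terms = ["intermittent fasting", "microbiome", "human", "randomized"]
off_angle_terms = ["mouse", "mice", "rodent"]

ranked = triage(papers, angle_terms, off_angle_terms)
for score, paper in ranked:
    print(score, paper.title)
```

The point of the sketch is the shape, not the scorer: the constraint ("human trials, not rodent work") is part of the scoring input, so a paper that matches the topic but misses the angle sinks instead of floating.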
A real question, and how the triage played out
Let me make this concrete, because in the abstract it sounds like magic and it isn't.
Say a team is studying whether intermittent fasting changes gut microbiome diversity in adults — and specifically whether the effect holds up outside of rodent studies. That "outside of rodent studies" clause is the whole game. A keyword search for "intermittent fasting microbiome" returns somewhere north of eight hundred results, and most of them are mouse work, review articles citing the same six papers, or studies measuring weight loss with the microbiome as an afterthought.
The team handed the agent the question with that human-trials constraint spelled out. It came back with 94 ranked papers. The top of the list was eleven human randomized trials, each summarized down to sample size, fasting protocol, and what actually moved. Useful on its own. But the map underneath was where the day turned.
The agent flagged that two of the highest-ranked human trials reported increased diversity, while a third — larger, longer, better controlled — reported no significant change. Three papers, one of which the team's lead admitted she'd have skimmed past because its title was about metabolic markers, not the microbiome. The disagreement wasn't noise. The trials had used different sequencing methods, and the contradiction was the actual open question hiding under the topic.
The team didn't get an answer that Monday. They got something more useful: a sharp, specific argument worth having. The agent didn't resolve the contradiction. It just made sure they saw it before they'd committed to a design that assumed the effect was settled.
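The flagging step in that example can be sketched in a few lines, assuming the summaries have already been boiled down to structured fields. Everything named here is hypothetical: the `TrialSummary` fields, the `find_disagreements` function, and the placeholder labels and method names. The sketch only notices that two trials measured the same outcome and reported different directions, and whether their methods differ. It does not decide which trial is right.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class TrialSummary:
    label: str
    outcome: str     # what was measured, e.g. "microbiome_diversity"
    direction: str   # "increase", "decrease", or "no_change"
    method: str      # placeholder for the sequencing method used


def find_disagreements(summaries):
    """Return (label_a, label_b, methods_differ) for each conflicting pair."""
    flags = []
    for a, b in combinations(summaries, 2):
        same_outcome = a.outcome == b.outcome
        conflicting = a.direction != b.direction
        if same_outcome and conflicting:
            flags.append((a.label, b.label, a.method != b.method))
    return flags


# Placeholder entries mirroring the pattern described above:
# two trials report an increase, a third reports no change.
summaries = [
    TrialSummary("Trial A", "microbiome_diversity", "increase", "method_1"),
    TrialSummary("Trial B", "microbiome_diversity", "increase", "method_1"),
    TrialSummary("Trial C", "microbiome_diversity", "no_change", "method_2"),
]

flags = find_disagreements(summaries)
for first, second, methods_differ in flags:
    print(f"{first} vs {second}: methods differ = {methods_differ}")
```

Note what the output is: a list of pairs worth arguing about. Whether the method difference explains the conflict is still the team's call.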
What it does NOT do
The agent finds the papers and tells you what they say. It does not tell you what they mean for your work. That line is the whole ballgame, and it should stay exactly where it is.
In the fasting example, the agent surfaced the contradiction. It did not tell the team which trial to trust, whether the sequencing difference was the real culprit, or what any of it implied for their own protocol. It can't, and the labs getting genuine value here don't pretend otherwise.
The agent compresses the search. The researcher still does the synthesis — still reads the crucial five papers in full, still decides what the disagreement means, still forms the argument and stakes a claim on it. The summaries are a starting point, not a verdict. A summary that says a trial "found no significant change" can't tell you the trial was underpowered; you learn that by reading the methods section yourself, which is precisely why you still read it.
Treat the agent's output as a verdict and you'll write a confident literature review built on someone else's skim. Treat it as triage and it gives you back the time to do the part that was always yours.
Where the danger actually lives
The failure mode isn't the agent missing a paper. You'll catch most misses when you read the top twenty and notice a gap. The real danger is subtler: trusting the ranking so completely that you stop reading critically, because the machine already sorted it and the machine seemed confident.
Relevance scoring is genuinely good now. It is not infallible, and it has no idea which paper is going to crack your problem open versus which one merely matches your question on the surface. Sometimes the paper ranked thirty-first — the one with the odd framing the scorer didn't quite get — is the one that reframes your entire project. The teams who use this well read past the top of the list precisely because they don't fully trust it. The ones who get burned are the ones who outsourced their judgment along with their triage.
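One cheap way to build that habit in is to make the reading list itself reach past the cutoff. The sketch below is an assumption about how a team might do it, not a description of any tool: the `reading_list` function and its parameters are hypothetical, and picking at random is a deliberately blunt substitute for a researcher scanning the lower ranks and following a hunch.

```python
import random


def reading_list(ranked_titles, top_k=20, extra=2, seed=None):
    """Return the top_k titles plus `extra` titles sampled from further down."""
    rng = random.Random(seed)  # seed only makes the sketch repeatable
    head = ranked_titles[:top_k]
    tail = ranked_titles[top_k:]
    picks = rng.sample(tail, min(extra, len(tail)))
    return head + picks


# Placeholder titles standing in for a ranked list of 94 papers.
ranked_titles = [f"paper_{i}" for i in range(1, 95)]
to_read = reading_list(ranked_titles, seed=0)
print(len(to_read))
```

The random picks will often be duds. That is fine. The goal is to keep a little of your attention on the part of the list the scorer was least sure about.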
A realistic week
- Monday: hand the agent the question, get back roughly 100 ranked papers with summaries and a map of where they agree and disagree.
- Tuesday: read the top twenty in full, correct the agent's misses, pull two surprises from further down the list.
- Wednesday: use the disagreement map to find the open question nobody's settled.
- The rest of the week: do actual research instead of database archaeology.
The phrase "100 papers a week" sounds like a productivity flex, and I'd drop it if it weren't roughly accurate. The real story is duller and better. The boring, mechanical front half of research finally got cheap. So the interesting half — the reading that matters, the synthesis, the argument only a human can stake — gets the time it always deserved and rarely got.
Lena Ortiz
Editor
Writing for the Skillmint blog on how people build, price, and put Claude Skills & Agents to work.