What to Read — The Agentic AI Field Guide

5.1

Part V / Frontier · The closer — how to keep learning as the field moves

What to read, what to skip, and how to think about the frontier.

The rest of this guide taught what's stable enough to teach. This chapter is about the part that isn't: the areas still moving fast enough that anything written here will be partially obsolete within months. The frontier of agentic AI in 2026 looks different from the frontier in 2025 and will look different again in 2027. Rather than try to predict what comes next (a fool's errand), this chapter teaches the meta-skill: how to find signal in the firehose of AI content, who to read, what events to watch for, and how to triage new claims so you spend your reading time on things that actually matter. By the end you'll have a reading practice that keeps you current without consuming you.

STEP 1

The signal-to-noise problem.

The AI content firehose in 2026 is overwhelming. LinkedIn posts, Twitter threads, Substack newsletters, Medium articles, YouTube videos, paper preprints, podcasts, conferences, vendor blogs, and the model providers' own announcements all produce content at high volume. Most of it is either marketing, repetition of widely-available knowledge, or speculation dressed as analysis. Some of it — maybe 5% — is genuinely informative.

The skill that matters: triaging quickly. Not reading everything; not even reading "the important things." Reading the few sources that consistently produce signal, while developing pattern-recognition for what's worth opening from the rest.

The four content categories

Almost everything you'll encounter falls into one of four buckets, with very different signal characteristics:

Original research / engineering. Papers, lab blog posts, technical deep-dives from teams shipping the work. Signal: high when from credible sources; this is the canonical signal.

Synthesis / explainers. Articles that distill complex research for working engineers; some podcasts. Signal: medium; value depends on the synthesizer's quality and judgment.

Practitioner experience. "We shipped X and here's what we learned"; postmortems; case studies. Signal: high when honest about failures; low when marketing-shaped.

Speculation / commentary. LinkedIn takes, Twitter threads, opinion pieces about where AI is going. Signal: low; interesting for taking the temperature, not informative on its own.

The pragmatic position: read aggressively in the first three categories; skim the fourth only to know what conversations are happening. A reading practice weighted toward speculation produces a sense of being current without actually being informed.

The patterns of low-signal content

Quick-scan tells for content unlikely to inform you:

Confident predictions about timelines. "AGI by 2027." "Agents will replace 50% of knowledge work by 2028." Predictions of this kind have a poor track record across the history of the field. The honest current state is that even people working at the frontier don't know what comes next year, let alone five years out.

Headline-grabbing benchmark claims without methodology. "Our new model crushes GPT-5 by 30%." The methodology is where the truth lives (chapter 3.4 covers why); a piece that doesn't get into it isn't worth your time.

Lists of "AI tools you must know." Mostly affiliate marketing or tool-discovery SEO. The tools worth knowing about surface organically from sources you already trust.

Anthropomorphic language about model behavior. "The model gets confused / decides / refuses because it doesn't want to." Models do statistical next-token prediction; pieces that consistently anthropomorphize tend to be light on the technical substance.

"This changes everything" framings. Most things change something. Very few change everything. The framing usually means the author hasn't thought carefully about what specifically does and doesn't change.

None of these tells guarantees low quality, but they're correlated enough to use as filters. If a piece opens with one and never recovers, it's time to close the tab.
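The quick-scan tells above can be turned into a crude headline filter. The sketch below is a minimal illustration, not a recommended classifier: the pattern names and regular expressions are hypothetical, they cover only four of the five tells (anthropomorphic language is too varied for a simple pattern), and they will produce both false positives and false negatives.

```python
import re

# Hypothetical patterns, one per tell; tune them against your own feed.
LOW_SIGNAL_TELLS = {
    "timeline_prediction": re.compile(r"\b(AGI|replace)\b.*\bby 20\d{2}\b", re.IGNORECASE),
    "benchmark_brag": re.compile(r"\b(crushes|destroys|beats)\b.*\d+\s*%", re.IGNORECASE),
    "must_know_list": re.compile(r"\bAI tools you (must|need to) know\b", re.IGNORECASE),
    "changes_everything": re.compile(r"\bchanges everything\b", re.IGNORECASE),
}


def low_signal_tells(text: str) -> list[str]:
    """Return the names of the tells whose pattern appears in the text."""
    return [name for name, pattern in LOW_SIGNAL_TELLS.items() if pattern.search(text)]


def worth_opening(headline: str, max_tells: int = 0) -> bool:
    """Crude filter: open the piece only if it trips at most max_tells patterns."""
    return len(low_signal_tells(headline)) <= max_tells


print(low_signal_tells("Our new model crushes GPT-5 by 30%."))  # ['benchmark_brag']
print(worth_opening("10 AI tools you must know"))               # False
```

A filter like this only decides what to open. A piece that trips a pattern in the headline can still recover in the body, which is why the tells are filters and not verdicts.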

STEP 2

The sources worth following regularly.

The shorter your follow list, the more carefully each source has to earn its place. The list below is opinionated — it's the set of sources that have consistently produced signal across the 2024–2026 period rather than a comprehensive bibliography. Use it as a starting point, prune what doesn't work for you, add what does.

Primary sources from the labs

The labs building frontier models publish a meaningful fraction of the original research worth reading. Their blogs are biased — they tell their own story — but they also have privileged information about what's actually working.

Anthropic's engineering and research posts. anthropic.com/engineering and anthropic.com/research. The "How we built our multi-agent research system" post (referenced in chapter 4.3) is the kind of content these channels produce — concrete details from production-shipping work. The Building Effective Agents guide (which informed several patterns in this guide) is foundational reading.

OpenAI's research blog. Posts about new model capabilities, evaluation methodologies, and (occasionally) safety work. Less engineering-detail than Anthropic; more capability-announcement. Worth following for the announcements; less worth following for working-engineer content.

DeepMind / Google AI publications. Heavier on the research-paper side, lighter on shipping-engineering. Their work on Gemini, AlphaGeometry, AlphaFold, and ongoing reasoning research is consequential.

Other major labs. Meta AI Research (especially on open-source models), Mistral, Cohere, xAI. Less consistent volume; worth a quarterly check rather than weekly.

Research organizations doing evaluation and safety work

A specific category of organization worth following: groups that evaluate models rather than build them. They surface failure modes, contamination, reward-hacking, and other things model providers might prefer not to discuss.

METR (Model Evaluation and Threat Research). Cited multiple times in this guide. Their HCAST benchmark, time-horizon research, and reward-hacking findings are some of the most useful capability-evaluation work in the field. metr.org.

UC Berkeley RDI (Center for Responsible Decentralized Intelligence). Their April 2026 study on benchmark reward-hacking (chapter 3.4) is the kind of work that materially changes how the field interprets results. Worth watching for ongoing work on benchmark trustworthiness.

Apollo Research. Adversarial-evaluation work, especially around deceptive behaviors and capability elicitation. Lower output volume; high signal per post.

EleutherAI and OpenAssistant communities. The open-source side of LLM research. Less consistently signal-rich than the dedicated research orgs, but occasionally produces the most useful empirical work because they share methodology fully.

Engineering blogs from teams shipping at scale

Companies building production systems on top of LLMs occasionally publish honest engineering accounts of what they've learned. These are some of the most useful sources for working engineers because they describe real trade-offs in real production:

Notion's engineering blog. Their writeups on the multi-year iteration of agent infrastructure (referenced through the field as exemplary practice) describe genuine production engineering rather than launch-day announcements.

Cursor, Cognition (Devin), and similar code-agent teams. When they share what's actually shipping vs. what was demoed, the content is useful. When they share demo-shaped marketing, less so.

Sourcegraph, Augment, and other dev-tools companies. Long writeups on agent architectures, retrieval, and code understanding.

Vercel and similar platform companies. Their writeups on AI infrastructure, streaming, and production reliability are practical.

Individual researchers and engineers worth following

A small list of individuals who consistently produce signal-rich content. These aren't the only people worth following; they're stable starting points:

Simon Willison (simonwillison.net). Linkblog covering AI developments with consistent quality of curation. His llm Python tool and shipped agent work give him direct engineering perspective. The signal density on his blog is unusually high.
Andrej Karpathy. Lower posting frequency now that he's away from OpenAI, but his existing material on LLM foundations (especially the YouTube lecture series) is foundational.
Sebastian Raschka. Newsletter and blog covering ML/AI fundamentals; signal-dense, technically rigorous.
Jeremy Howard. fast.ai writeups; opinionated and informed.
Eugene Yan (eugeneyan.com). Applied-ML engineering posts; system design rather than research.

This list is small on purpose. Following too many individuals creates noise; following a few who consistently deliver signal lets you actually read what they publish.

Conferences and events worth marking on the calendar

The conferences worth following in 2026 for agent work:

NeurIPS, ICML, ICLR. The major academic ML conferences. Most of the original research that matters appears at one of these. Reading every paper is impossible; reading the accepted-paper lists and skimming abstracts of anything that looks agent-adjacent is realistic.
Anthropic Dev Day / OpenAI DevDay / Google I/O. Product announcements from the major model providers. Sometimes substantive (new APIs, new capabilities); often marketing. Worth watching for the substantive parts.
QCon, Strange Loop (recordings only; its final edition was in 2023), GOTO. Industry conferences with practitioner content. Talks from companies running agents in production often appear here.
AI Engineer Summit. Newer; focused specifically on the working-engineer side of LLM applications. Signal density is good for an industry event.

The honest assessment: most conference talks are 30 minutes long because the venue requires 30 minutes, not because the content earns 30 minutes. Watch recordings rather than attending live; skip the first 10 minutes of most talks; close anything that's introducing itself for too long.

STEP 3

How to read papers and announcements without drowning.

The volume of new papers in 2026 is impossible to keep up with comprehensively. The discipline is triage — quickly deciding whether a given piece is worth reading carefully, reading briefly, or skipping.

The five-question triage

Before reading any new paper, announcement, or substantial post, ask five questions. If most answers are unsatisfying, skip.

1. Who wrote this, and what's their incentive? A paper from a model provider claiming their model is best has a marketing incentive; treat the claims with appropriate skepticism. An independent evaluator showing a model failure has a credibility-building incentive; treat with appropriate skepticism in the opposite direction. Anonymous claims are usually low-signal.

2. What's the specific claim? "Our agent achieves new SOTA" is vague. "Our agent achieves 87% on SWE-bench Verified with this specific scaffolding, under these conditions" is specific. Vague claims are usually marketing; specific claims are usually engineering.

3. Is the methodology described in enough detail to replicate? Papers and posts that hide methodology are usually hiding something. The minimum: prompt, scaffolding, model version, evaluation procedure, sample size. Posts that include these and explain them are credible; posts that don't are not.

4. Does the work address contamination, multi-run variance, and scaffolding effects? Chapter 3.4 covered why these matter. Work that ignores them produces inflated numbers; work that acknowledges them gives you trustworthy signal.

5. Would this change what I do tomorrow? Some pieces are interesting but don't change your work in any actionable way. That's fine: those are background-knowledge reading, worth time but not priority. Others have direct implications for what you should build differently. Sort by this, and spend most of your time on the actionable pieces.
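The five questions can be recorded as a small checklist. The sketch below is one possible encoding, with hypothetical field names and labels: it skips a piece when most of the five answers are unsatisfying, then uses question 5 to separate priority reading from background reading. The cutoff of three satisfying answers is an assumption taken from "most answers", not a calibrated threshold.

```python
from dataclasses import dataclass, astuple


@dataclass
class Triage:
    """Answers to the five questions for one paper or post (True = satisfying)."""
    credible_author_and_incentive: bool  # Q1: who wrote it, and why
    specific_claim: bool                 # Q2: is the claim specific
    replicable_methodology: bool         # Q3: enough detail to replicate
    addresses_known_pitfalls: bool       # Q4: contamination, variance, scaffolding
    actionable: bool                     # Q5: would it change what I do tomorrow


def triage_decision(t: Triage) -> str:
    """Return 'skip', 'read now', or 'background' for one piece."""
    satisfied = sum(astuple(t))  # count of True answers across all five questions
    if satisfied < 3:            # most answers unsatisfying: skip
        return "skip"
    return "read now" if t.actionable else "background"


print(triage_decision(Triage(True, True, True, False, True)))    # read now
print(triage_decision(Triage(False, False, True, False, True)))  # skip
```

Writing the answers down, even this roughly, makes it harder to keep reading something only because it is already open.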

The pattern of useful papers

Across the papers from 2023-2026 that materially shaped this guide, the consistent pattern: they describe specific empirical findings, with full methodology, on well-defined tasks, with honest discussion of limitations. Most don't claim to be revolutionary; they claim to be specific.

Examples of the pattern:

"Building Effective Agents" from Anthropic (2024): described the orchestrator-worker pattern with specific examples and trade-offs. Not "agents will change everything"; "here is what we found works."
"How we built our multi-agent research system" (2025): documented architecture decisions and concrete numbers (15× token cost, 90.2% improvement on internal eval).
Various lost-in-the-middle papers (2023-2024): specific empirical findings that models use information in the middle of a long context less reliably than information near the start or end.
The METR HCAST work (2024-2026): specific measurements of agent capability-over-time-horizon, with full methodology.
The Berkeley RDI benchmark-hacking study (April 2026): specific demonstration of how each of 8 benchmarks could be exploited.

None of these papers said "this changes everything." All of them changed something specific. The signal-rich content has this shape.

The "read it twice" rule

For papers that pass the five-question triage and you decide to read: read once for the claim, then read again for the methodology. The first pass tells you what the authors say; the second tells you whether to believe them.

The second pass is where most of the value lives. You're looking for: did they control for the right things? What's their evaluation set? Is the comparison fair (same scaffolding, same conditions)? How big is the effect, in real terms, vs. how big does the headline suggest? Are there obvious confounds they didn't address?
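One second-pass check is easy to make concrete: compare the headline's relative improvement with the absolute gain and with the run-to-run spread. The sketch below is a minimal illustration under stated assumptions. The function name is hypothetical, the scores are made-up numbers and not results from any real benchmark, and comparing against the larger standard deviation is a rough sanity check, not a significance test.

```python
from statistics import mean, stdev


def effect_summary(baseline_runs: list[float], new_runs: list[float]) -> dict[str, float]:
    """Compare two sets of benchmark scores (percent correct, one value per run)."""
    base, new = mean(baseline_runs), mean(new_runs)
    absolute = new - base             # gain in percentage points
    relative = 100 * absolute / base  # gain as a percent of baseline; headlines often quote this
    noise = max(stdev(baseline_runs), stdev(new_runs))  # run-to-run spread in points
    return {"absolute_points": absolute, "relative_percent": relative, "noise_points": noise}


# Made-up scores for illustration only.
summary = effect_summary(baseline_runs=[10.0, 12.0, 14.0], new_runs=[13.0, 15.0, 17.0])
print(summary)
# {'absolute_points': 3.0, 'relative_percent': 25.0, 'noise_points': 2.0}
```

In this invented case a "25% better" headline corresponds to a 3-point gain with a 2-point spread between runs, which is the gap between the headline and the effect in real terms that the second pass is meant to surface.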

This is slower than skimming, but it's how you build the judgment to evaluate future claims. The investment pays back: after reading 20-30 papers this carefully, you'll recognize the patterns of credible vs non-credible work quickly enough to triage in the first pass.

STEP 4

A reading practice that actually works.

The final piece: turning all of the above into a sustainable habit. The teams that stay current don't read everything — they have a small, regular reading rhythm that they actually maintain. Three structural recommendations:

A weekly rhythm

The realistic shape that works for most working engineers: a 45-60 minute weekly reading block, scheduled on the calendar like any other meeting.

15-20 minutes: check the small list of sources you follow regularly. Skim everything new. Mark the few items worth a careful read.
20-30 minutes: actually read those marked items. Apply the five-question triage to anything from the broader firehose that ended up in your queue.
10 minutes: write down the one or two specific things you learned that might change what you do at work. This is the part most people skip; it's also the part that compounds.
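Adding up the three steps gives the real size of the block. The sketch below does that arithmetic using the minute ranges from the list above; the step labels are shorthand, and the 52-week year with no skipped weeks is an assumption.

```python
# (min, max) minutes for each step of the weekly block, taken from the list above.
WEEKLY_BLOCK = {
    "skim followed sources": (15, 20),
    "read marked items": (20, 30),
    "write down what changes": (10, 10),
}


def block_range(block: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the per-step minimums and maximums into one (min, max) range."""
    low = sum(lo for lo, _ in block.values())
    high = sum(hi for _, hi in block.values())
    return low, high


low, high = block_range(WEEKLY_BLOCK)
hours_per_year = (low * 52 / 60, high * 52 / 60)  # assumes no skipped weeks
print(low, high)        # 45 60
print(hours_per_year)   # (39.0, 52.0)
```

With all three steps kept, the block runs 45 to 60 minutes, or roughly 39 to 52 hours of focused reading across a year.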

An hour a week, every week, sustained for a year, produces deep familiarity with the field. The same hour-a-week spread inconsistently across many sources produces a vague sense of being current without the depth.

A monthly deeper dive

Once a month, pick one topic that's come up multiple times in your weekly reading and go deeper. Read the foundational paper (if there is one). Read two or three follow-up papers. Read the strongest critique. Form your own position on what's true and what's overclaimed.

The goal isn't to become an expert on everything; it's to build the habit of going from "I've heard about this" to "I understand this well enough to evaluate new claims about it." After a few months of this practice, your triage gets sharper because you have actual reference points for what work in each area looks like.

A quarterly recalibration

Every 3-4 months, audit your reading list. Which sources have you stopped finding signal-rich? Drop them. Which topics that seemed important six months ago no longer do? Drop those too. Which new sources have surfaced (often through recommendations from sources you already trust) that deserve a trial period?

The field moves fast enough that your reading list from a year ago is probably stale. The discipline of pruning is what prevents your reading time from becoming an obligation rather than an investment.
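The quarterly audit is easier if the weekly notes record, for each item opened, which source it came from and whether it was worth the time. The sketch below is one hypothetical way to turn such a log into keep, drop, or trial verdicts. The function name, the log format, and both thresholds are assumptions for illustration, not recommended values.

```python
from collections import defaultdict


def audit_sources(log: list[tuple[str, bool]], min_items: int = 5,
                  keep_threshold: float = 0.2) -> dict[str, str]:
    """Label each source 'keep', 'drop', or 'trial' from a quarter of reading notes.

    log holds one (source, was_worth_reading) pair per item you opened.
    """
    counts = defaultdict(lambda: [0, 0])  # source -> [useful, total]
    for source, useful in log:
        counts[source][0] += int(useful)
        counts[source][1] += 1

    verdicts = {}
    for source, (useful, total) in counts.items():
        if total < min_items:
            verdicts[source] = "trial"    # too few items to judge yet
        elif useful / total >= keep_threshold:
            verdicts[source] = "keep"
        else:
            verdicts[source] = "drop"
    return verdicts


# Invented log entries for illustration only.
log = ([("lab blog", True)] * 3 + [("lab blog", False)] * 3
       + [("hot takes", False)] * 6 + [("new newsletter", True)])
print(audit_sources(log))
# {'lab blog': 'keep', 'hot takes': 'drop', 'new newsletter': 'trial'}
```

The numbers matter less than the habit: a source stays on the list because the log says it earned its place, not because it has always been there.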

What to do when you encounter something genuinely new

Occasionally, maybe every few months, you'll encounter a paper or post that meaningfully changes how you think about agent work. Recent examples: the reasoning-model paradigm (when extended thinking became a first-class technique), the multi-agent architecture findings, and the benchmark contamination work.

When this happens, slow down. Read carefully. Read what others have said about it. Think about what specifically changes about the work you do. Sometimes the answer is "nothing yet, but the next quarter's work will reflect this." Sometimes it's "I should rethink an approach I've been using." Either is fine; the response should be deliberate.

Avoid the trap of immediately changing your work to chase every new finding. Most don't replicate; most don't generalize from their benchmark conditions to your production conditions; most are partial truths that need integration with other constraints. The teams that thrive in fast-moving fields are the ones that absorb new ideas without being whipsawed by them — patient, deliberate, and selective about which findings translate to action.

CLOSING

The end of the guide.

This is the last chapter. Everything before it taught the patterns and disciplines that are stable enough to teach as foundational: how agent loops work, how to build the tools they use, how to ship them to production, how to evaluate them, what shapes of agent exist and when to reach for each. The guide tried to be honest — about what works, what doesn't, where the hype outruns the engineering, where the field genuinely is and isn't yet.

The honest position about the future, in one sentence: the underlying agent patterns in this guide will probably remain useful for years; the specific models, prices, benchmarks, and product offerings won't. The discipline of building agents (observability, evaluation, cost discipline, careful scaffolding, honest acknowledgment of failure modes) is the part that compounds across whatever models exist when you read this. The frontier moves; the engineering practice that puts the frontier to work is more stable.

The thing worth carrying forward, more than any specific technique: the bar for what counts as "working" in agent systems is lower than the bar for what counts as "shipping reliably to real users." Demos work easily. Production is where the chapters in this guide earn their place. Most teams underestimate the gap; the teams that consistently ship agentic systems take the gap seriously, instrument for it, and engineer past it.

If you've worked through the chapters in order, you now have a working mental model for agentic AI as a category of engineering practice — not as marketing, not as research speculation, but as the specific set of techniques that produce reliable behavior from non-deterministic models. That mental model will be the most durable thing you take away. Models will improve; APIs will change; benchmarks will shift. The discipline of building well stays.

Thanks for reading. Build well.

End of chapter 5.1 — and the guide.

Deliverable

A reading practice that keeps you current without consuming you. A short list of sources that earn their place (Anthropic engineering, METR, RDI, a handful of individuals, specific conferences). The five-question triage for new papers and posts. A weekly rhythm of ~60 minutes, a monthly deeper dive on one topic, a quarterly audit of the reading list. The discipline of going from "I've heard about this" to "I understand this well enough to evaluate new claims about it" on the topics that recur. And — the closing skill of the whole guide — the judgment to distinguish what's stable enough to act on from what's still moving too fast to bet on.

Weekly reading block on the calendar, 45-60 minutes, defended like any meeting
Small follow list of 5-10 sources that consistently produce signal; pruned quarterly
Five-question triage applied to new papers and posts before deciding to read carefully
Monthly deeper dive on one topic that's recurring in your reading
Notes on what each week's reading would change about your work, if anything
Skepticism toward confident timeline predictions, headline benchmark claims, "changes everything" framings
Patience with new findings — most don't replicate, most don't generalize, most need integration
The mental model from the rest of the guide as the stable substrate that survives model and product churn