how might a modern lighthouse work with large language models? #036

AI Focus

Lighthouse is one of the most successful developer tools we’ve ever shipped, and I say that as someone who is genuinely proud of the work the team has done on it. But something has gnawed at me for a while. Our ability to help the developer ecosystem with it is ultimately capped by the number of tests we can physically create. Every audit is a hand-written piece of JavaScript. Someone has to sit down, understand a best practice, encode it as a deterministic check, ship it, and maintain it as the platform moves underneath it. A lot of rigor goes into that, and the rigor is a good thing, it is exactly what makes the scores trustworthy. But it is also a serious amount of time, which means you inevitably end up having to pick the highest-value tests to build, and whole categories of “is this any good” stay quietly out of reach because nobody could justify the cost of a check. The tool can only ever test what someone found the time to write a test for.

A few months ago I got obsessed with a simple question: what if the audit was just a sentence? What if you could describe, in plain English, what good looks like, and let a model figure out how to check it? I wrote that down as an idea I called “tests as prompts”. What brought it back to life was a conversation just before Google I/O. I met up with a friend and her partner, and they were deep in this whole area of auto-research, the idea of pointing a model at a goal and letting it run experiments until it gets there. Hearing them talk about it was what got me excited again. I had tests-as-prompts on one side and now this notion of goal-driven loops on the other, and the two of them together were the actual spark. There are so many ways to experiment with that combination right now, and that, honestly, is the fun part. I had the pieces, so I built a thing called web-uplift. The tool is fun, but the shape of it is the part I actually want to talk about.

The other strand came from a different direction. The Chrome DevTools team shipped a skill called memory-leak-debugging that runs over their Chrome DevTools MCP server. It uses Meta’s Memlab to diff heap snapshots, orchestrates Chrome to take them, works out where memory is leaking, and then, drawing on the DevTools team’s own knowledge, tells you how to fix it. That was the thing that clicked for me. If a model can orchestrate a browser, read a heap, and diagnose a leak, then it can identify a whole category of issues, not just memory. The intelligence is in the reading, not in a check someone hand-wrote.

I took that shape and built memory-tracer, a command-line tool that points at a list of sites, say the top thousand, and reports on their memory usage and leaks. Memory was one dimension. The obvious next question was why stop at memory. Why not point the same harness at everything we say makes a modern site good: the dimensions Lighthouse already checks, the hundreds of best practices in Modern Web Guidance, and Una Kravets’ modern-UX principles. That is what became web-uplift. And once you have a model that can both find the issues and reason about the fixes, the auto-research framing finishes the thought: set a goal, let the model be the judge of whether you have reached it, and loop until it is satisfied.

Let me start with what it does, because the shape only makes sense once you see the loop.

You point web-uplift at a URL. It explores the site, works out which pages and flows actually matter (a homepage, an article template, a form, a listing, not just the one page every audit happens to hit), and then it judges the site against a set of principles. Then, if you give it the source code, it fixes what it found and runs the whole thing again. Find, fix, test. Find, fix, test. It keeps going until there is nothing left to fix.
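The control flow of that loop can be sketched in a few lines. This is an illustration only, not web-uplift's actual code: the `Finding` type, the `audit` and `fix` callbacks, and the `maxRounds` safety cap are hypothetical names, and the steps are kept synchronous for brevity where a real agent run would be asynchronous.

```typescript
// Hypothetical shape of one audit finding (not web-uplift's real schema).
interface Finding {
  principle: string;
  detail: string;
}

interface LoopResult {
  rounds: number;        // how many audits were run
  remaining: Finding[];  // findings still open when the loop stopped
}

// Audit, fix what was found, audit again, until nothing is left
// or the (assumed) round cap is reached.
function upliftLoop(
  audit: () => Finding[],
  fix: (findings: Finding[]) => void,
  maxRounds = 5,
): LoopResult {
  let rounds = 1;
  let remaining = audit();
  while (remaining.length > 0 && rounds < maxRounds) {
    fix(remaining);
    remaining = audit();
    rounds += 1;
  }
  return { rounds, remaining };
}
```

The cap is there because a fix can fail or introduce a new finding, so "until there is nothing left" needs a stopping condition of its own.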

If you want to try it, it installs as a skill into whatever coding agent you already use:

npx -y web-uplift@latest install --agent claude   # or codex, gemini, opencode, pi, ...

Then, inside your agent session, the audit is a slash command (pi surfaces it as /skill:web-audit):

/web-audit https://example.com                         # report only
/web-audit http://localhost:8080 --source ./src --fix  # find, fix, re-audit

There is also a headless path for CI that spawns the agent for you, and a scorecard that can gate a build on the outcome scores:

npx -y web-uplift@latest audit https://example.com --agent claude
npx -y web-uplift@latest scorecard https://example.com --min-overall 80 --max-critical 0

That loop is the part I want you to hold onto. But to understand why it works, you have to understand the three things it is built from, because the interesting bet is in how they are wired together.

The first ingredient is a spec of what good looks like. I did not invent this. Una Kravets and Bramus gave a talk at Google I/O 2026 called “What’s new in Web UI” (there is a written version on the Chrome blog) where they laid out five principles for modern UX: respect user preferences, implement natural interactions, provide guided navigation, maximize content and reduce noise, and adapt to the form factor. I took those five and expanded them out, folding in the dimensions Lighthouse already measures (performance and Core Web Vitals, accessibility, best practices, discoverability) plus a handful of things we have come to expect from the modern web (privacy and security, resilience, internationalisation, being trustworthy rather than manipulative, sustainability, and whether your site is ready for agents to read it). Seventeen principles in the end, all of them in one JSON file you can read (the rationale for the set is in the repo too). Each one is written as an outcome, not a check. “The surface follows the user’s colour scheme preference.” “Animations respect reduced motion.” “The primary task succeeds.” Each has a hint about how you might detect it and a pointer into the guidance for how to fix it, but the hint mandates nothing.
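To make "an outcome, not a check" concrete, here is a sketch of what one entry might look like. The field names (`id`, `outcome`, `detectHint`, `guidance`) and the example values are assumptions for illustration; the real JSON file in the repo may be shaped differently.

```typescript
// Hypothetical shape for one principle; the repo's real schema may differ.
interface Principle {
  id: string;
  outcome: string;     // what good looks like, in plain English
  detectHint: string;  // a suggestion for gathering evidence, never a mandate
  guidance: string;    // pointer into the guidance for the fix (assumed format)
}

const colourScheme: Principle = {
  id: "respects-colour-scheme",
  outcome: "The surface follows the user's colour scheme preference.",
  detectHint:
    "Emulate prefers-color-scheme: dark, then compare computed colours and screenshots.",
  guidance: "search the guidance feed for: dark mode",
};

// The test is the sentence. Turning a principle into the judge's question
// is plain string assembly, with no pass/fail logic in code.
function toJudgePrompt(p: Principle): string {
  return [
    `Outcome: ${p.outcome}`,
    `Hint (optional): ${p.detectHint}`,
    "Gather whatever evidence you need, then answer pass or fail with reasons.",
  ].join("\n");
}
```

Nothing in the entry is executable, which is the point: adding a test means adding a sentence.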

The second ingredient is the guidance itself, the how. This is Modern Web Guidance, a machine-readable feed of about 137 use-case-based best practices that you can search and pull from over npm. Phil Walton has a good video introduction to it. When the model finds a problem, it does not have to guess at the fix, it looks up the recommended modern approach. This matters more than it sounds. It is the difference between an audit that says “this is bad” and one that says “this is bad, and here is exactly how to do it right, grounded in the current platform.”

The third ingredient is the one I argued with myself about the most, and it is the actual bet of the whole project. There are no hard-coded checks in web-uplift. None. The runtime is a generic set of evidence primitives over raw Chrome DevTools Protocol: take a screenshot, capture a video of a transition, grab a heap snapshot, read the layout metrics and CLS, capture a performance trace and the network as a HAR, fetch the page the way a no-JS crawler sees it, pull the DOM and the local source, run arbitrary JavaScript in the page, run Lighthouse, inject axe-core. The primitives give the model senses. They return data and artifacts, and every judgement, whether a principle passes or fails, is made by the model reasoning over that evidence.
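One way to picture "senses, not checks" is a registry of primitives that only return evidence. The sketch below uses stub functions in place of real Chrome DevTools Protocol calls, and every name in it (`Evidence`, `Primitive`, `primitives`, `gather`) is hypothetical rather than taken from web-uplift.

```typescript
// Evidence is data plus artifacts. There is deliberately no `passed` field:
// primitives observe, they never judge.
interface Evidence {
  kind: string;           // e.g. "screenshot", "heap-snapshot", "har"
  summary: string;        // distilled text the model can read
  artifactPath?: string;  // where the raw file lives, if any
}

type Primitive = (url: string) => Evidence;

// Stub primitives standing in for real CDP-backed ones (which would be async).
const primitives: Record<string, Primitive> = {
  screenshot: (url) => ({
    kind: "screenshot",
    summary: `Screenshot of ${url}`,
    artifactPath: "shot.png",
  }),
  source: (url) => ({ kind: "source", summary: `Local source for ${url}` }),
};

// The model names the senses it wants; the runtime only fetches them.
function gather(url: string, requested: string[]): Evidence[] {
  return requested
    .filter((wanted) => Object.prototype.hasOwnProperty.call(primitives, wanted))
    .map((wanted) => primitives[wanted](url));
}
```

The verdict never appears in this layer. It only exists in the model's reading of whatever `gather` returns.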

I made a deliberate decision early on, and it is really the whole experiment: there are no new encoded heuristics in this thing. The way we have always scaled quality tooling is by encoding what we know into JavaScript, Lighthouse audits, axe rules, hand-written checks, and I wanted to find out whether we still need to do that at all. To be clear, you still need tools. Chrome will hand you a multi-megabyte heap snapshot or a HAR file and something has to turn that into signals a model can actually read, so there is plenty of code in web-uplift for gathering and distilling evidence. What it does not have is code that decides whether something is right or wrong. The temptation to add a quick deterministic check for the easy cases is real, and I resisted it on purpose, because the moment the cheap check answers the question, the model never gets asked and you are back to only testing what someone encoded. The experiment is defining the test and the goal as text, and letting the model do the judging.

Here is what that feels like in practice. You describe the outcome you want. The model works out which of its “senses” it needs, a screenshot, a video under reduced motion, a heap snapshot, a Lighthouse run (yes - it is an option), and goes and gets the evidence and makes the call. You do not have to teach it how to check if a page honours dark mode. You tell it the page should honour dark mode and the user’s preference on the site. It emulates the preference, reads the computed colours, and tells you what it found, with a screenshot attached as proof.

And the senses the model can bring to bear are flipping bonkers in 2026. It can look at a screenshot and see that the focus ring is missing. It can watch a video of a transition and tell you it ignores reduced motion. It can read a heap snapshot and tell you memory is climbing on a long session, which is exactly the leak-hunting my other project, memory-tracer, does. It can read a network trace, a HAR, a performance trace, the actual authored source sitting next to the rendered DOM. Stitching these capabilities together is cheap now, and the part that used to be the hard work, encoding the test, has evaporated. Writing the test is just writing.

Before I point it at anything real, the demo. I built a tiny site with six deliberately broken scenarios: no dark mode, motion that ignores reduced motion, a fixed layout that overflows at 360px, poor focus, layout shift, and no container queries. It is the demo and the ground truth in one. A successful audit recalls all six. This run found all six, with zero false positives, and then three real ones I had not planted.
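Scoring a run against a fixture like this is simple set arithmetic. The sketch below is an assumption about how such an eval could be written, not the project's actual eval code; the scenario ids are made up from the six scenarios listed above, and the three `unplanted-*` ids are placeholders for the extra findings.

```typescript
interface EvalResult {
  recall: number;       // share of planted issues the audit reported
  missed: string[];     // planted but not reported
  unplanted: string[];  // reported but not planted: review these by hand
}

function scoreAgainstFixture(planted: string[], reported: string[]): EvalResult {
  const found = planted.filter((id) => reported.indexOf(id) !== -1);
  return {
    recall: planted.length === 0 ? 1 : found.length / planted.length,
    missed: planted.filter((id) => reported.indexOf(id) === -1),
    unplanted: reported.filter((id) => planted.indexOf(id) === -1),
  };
}

// Hypothetical ids for the six planted scenarios.
const plantedIds = [
  "no-dark-mode",
  "ignores-reduced-motion",
  "overflow-360",
  "poor-focus",
  "layout-shift",
  "no-container-queries",
];
const reportedIds = plantedIds.concat(["unplanted-1", "unplanted-2", "unplanted-3"]);
const evalResult = scoreAgainstFixture(plantedIds, reportedIds);
// evalResult.recall is 1, evalResult.missed is empty, evalResult.unplanted has 3 entries
```

Unplanted findings are returned rather than counted as false positives, because only a person can say whether each one is a hallucination or a real issue the fixture author did not intend.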

And rather than describe the output, here is the verbatim report that run wrote, the same one you would get as a site author: the evidence it chose to gather, every finding with its screenshot or recording attached, the eval against the ground truth, and the prioritised task list at the end.

So does it actually work on something real? I ran it against my own sites, because they were conveniently broken and conveniently mine.

aifoc.us, this blog, was the first. It came back with eight issues. The biggest was sustainability, not performance: I was loading all of Font Awesome 6, around 290KB of CSS and webfonts from a CDN, to render about twelve icons. Then a string of accessibility ones: the embedded demo iframes had no title attribute (including the model-gap widget further up this page), the newsletter input was labelled only by its placeholder, and there were duplicate form landmarks. It caught that I run two analytics stacks at once, that I have no reduced-motion handling, no HSTS header, and that unknown URLs soft-404 to the homepage. It also correctly credited the strengths: zero CLS, every image has alt text, a single h1, colour scheme declared, HTTPS everywhere. What I care about is not the list itself, it’s that it found real things, on a real site, by looking.

I should be honest here, because it is not perfect. That newsletter finding was correct, the field really was labelled only by its placeholder. But when I let the loop fix it, it managed to hide the email input entirely. The diagnosis was right and the tool’s fix was broken. That has been the pattern: the judgement is consistently good, the edits still need a human watching them. It is a keen collaborator that, depending on the model, gets things wrong, and I treat it the way I would treat any tool that I use. I don’t hand it the keys and walk away. I read the diff and direct the tool.

paul.kinlan.me, my other blog, was the more satisfying one. You would not expect someone on the Chrome DevRel team to ship a site with a performance score this bad, and the blunt answer is that I migrated the design of that site using a large language model and I didn’t take my advice from the previous paragraph. I was happy with the result, it looked right and performed well for me, and I did not actually see the performance hit until a little later. At which point it struck me that this was a pretty good test of whether the tools could dig me out of the hole the tools had helped me dig. They could. Almost everything came back to a single architectural choice: the Tailwind Play CDN, a render-blocking, not-for-prod runtime that was pushing lab LCP to 9.9 seconds, against a target of 2.5 seconds or less, and also showing up as a resilience problem and a sustainability problem. That, the loop fixed.

iwanttouse.com is a site I built. It is a small single-page developer tool: you pick web platform features and it shows you their Baseline status and the share of users who can use your site. Vanilla HTML, CSS, and JS, no framework, served by Vercel. It is exactly the kind of site I would have sworn was already fine, which made it the perfect thing to point the loop at. It is also the one that convinced me the loop actually climbs hills, not just finds them.

The audit came back with six findings, and the interesting one was not a bug so much as an architectural choice catching up with me. The entire feature dataset, all 1082 features plus the results, is rendered client-side from a features.json file. To a real browser that is fine. To anything that does not run JavaScript, which includes most crawlers and most AI ingest pipelines, it is a near-empty shell (I wrote about this recently: if your site relies on JavaScript to render, its content is very likely not in the model). Coverage was 6 percent, that is, only 6 percent of the words a browser eventually renders were in the HTML before any JavaScript ran. The title and description survived but the actual product, the feature list and the share numbers, did not. There was also a contrast failure on the headline share figure (bright green at roughly 1.6:1 on white, below the 3:1 large-text minimum), no web app manifest despite a registered service worker, a 21 pixel horizontal overflow in the awkward band between mobile and desktop, no reduced-motion handling, and no theme-color meta.
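The coverage figure is easy to reason about once it is written down. The function below is one plausible way to compute it, assuming the visible text has already been extracted from both the raw HTML and the rendered page; how web-uplift actually tokenises and counts is not stated here, so treat the names and the ASCII-only word splitting as assumptions.

```typescript
// Lowercase and split into words. ASCII-only for brevity; a real tool
// would need Unicode-aware segmentation.
function words(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

// Share of the rendered words that were already present in the raw HTML text.
function crawlCoverage(rawText: string, renderedText: string): number {
  const available: Record<string, number> = Object.create(null);
  words(rawText).forEach((w) => {
    available[w] = (available[w] ?? 0) + 1;
  });

  const rendered = words(renderedText);
  if (rendered.length === 0) return 1; // nothing rendered, nothing missing

  let covered = 0;
  rendered.forEach((w) => {
    if ((available[w] ?? 0) > 0) {
      covered += 1;
      available[w] -= 1; // each raw word can cover one rendered word
    }
  });
  return covered / rendered.length;
}
```

A result of 0.06 from a function like this means a no-JS crawler sees 6 in every 100 words a browser user sees, which matches the "title and description survived, the product did not" picture.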

I let it fix the source and re-audit. Six findings to zero. The build now pre-renders the whole grid into the HTML, so crawl coverage went from 6 percent to 100 percent and it is no longer a JS shell. The contrast fix is the one I like most, because it is the kind of thing a deterministic check gets wrong. The baseline audit attributed the single axe violation to the share figure. When the fixer actually went to verify, it turned out the real failing node was the active filter button (white on a dark-mode accent at 2.42:1), and it fixed both with per-scheme colours using light-dark(). The tool corrected its own earlier diagnosis while fixing it. axe went from one serious violation to zero, the overflow went to zero pixels, and performance did not move because no new blocking resources were added. The core task still worked end to end.
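For readers who want to check ratios like 1.6:1 and 2.42:1 themselves, the arithmetic behind them is the WCAG contrast formula. The sketch below is a standalone version of that formula for six-digit hex colours; the function names are illustrative, and it is not taken from axe-core or web-uplift.

```typescript
// Convert an 8-bit sRGB channel to linear light.
function channelToLinear(value8bit: number): number {
  const c = value8bit / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a "#rrggbb" colour.
function relativeLuminance(hex: string): number {
  const r = parseInt(hex.slice(1, 3), 16);
  const g = parseInt(hex.slice(3, 5), 16);
  const b = parseInt(hex.slice(5, 7), 16);
  return (
    0.2126 * channelToLinear(r) +
    0.7152 * channelToLinear(g) +
    0.0722 * channelToLinear(b)
  );
}

// Contrast ratio between two colours, from 1 (identical) to 21 (black on white).
function contrastRatio(foreground: string, background: string): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// AA minimums: 3:1 for large text, 4.5:1 for normal-size text.
function meetsAA(ratio: number, largeText: boolean): boolean {
  return ratio >= (largeText ? 3 : 4.5);
}
```

Both figures from the audit fail even the large-text minimum: `meetsAA(1.6, true)` and `meetsAA(2.42, true)` are false. The formula itself is deterministic; the model's contribution was working out which node the number belonged to.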

I love that this is a loop and not a tool. The thing that finds the problem, the thing that fixes it, and the thing that confirms the fix are not three separate systems that have to be kept in agreement. It is all the same model reading the same rendered output and source, and I think that is incredibly powerful because all of the context (frontend and backend) is right there.

The thing I keep coming back to is that none of this is a single skill. The web-uplift skill is the methodology and not an encoded set of tests (inspired by how Modern Web Guidance works), but the actual capability is the combination: a plain-text spec of outcomes that people can write, a machine-readable feed of how to achieve them, a generic set of senses over a real browser, and a model in the middle that reads, judges, and writes fixes.

Here is the bit I think is genuinely cool. For as long as I have been building for the web, writing a test for your site has meant writing code. Lighthouse audits are JavaScript. axe rules are JavaScript. That puts a ceiling on how many tests we can have, because the number of people who can write a good one is small and the number of tests each of them has time for is smaller.

Lifting that ceiling is an opportunity to lift up the web massively, because there should be almost no reason not to invest in improving your site when the tools can do the work. If the test is a sentence and the judge is a model, then anyone who can describe what good looks like can author a test. That is a much larger group of people, and they can cover a much larger surface area, including all the subjective, fuzzy, hard-to-encode things that deterministic checks have never been able to reach.

One point I do want to touch upon. While the models are good, they’re not magic, and feeding them a raw two-megabyte HAR or a giant XML trace is an absolute waste. The real engineering is in the small tools that sit between the browser and the model: things that distil a heap snapshot down to the signals that matter, or turn a HAR into a one-screen summary of who is loading what and how much it weighs. memory-tracer does this for memory, leaning on Meta’s Memlab for the snapshot analysis. web-uplift does it for the network and the trace. These little summarisers are how you keep the loop fast and cheap, and they are themselves reusable across any test you author as a prompt. I expect more of them to be built over time.
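As an illustration of how small such a summariser can be, here is a sketch that reduces a HAR to a few lines of "who is loading what and how much it weighs". It reads only standard HAR 1.2 fields (`log.entries`, `request.url`, `response.bodySize`, `response.content.size`); the function names and the output format are assumptions, not web-uplift's own summariser.

```typescript
// The minimal subset of HAR 1.2 this sketch reads.
interface HarEntry {
  request: { url: string };
  response: { bodySize: number; content: { size: number } };
}
interface Har {
  log: { entries: HarEntry[] };
}

interface HostWeight {
  host: string;
  requests: number;
  bytes: number;
}

// Group requests by host and keep the heaviest few.
function summariseHar(har: Har, top = 5): HostWeight[] {
  const byHost: Record<string, HostWeight> = Object.create(null);
  har.log.entries.forEach((entry) => {
    const host = new URL(entry.request.url).hostname;
    // bodySize is -1 when unknown; fall back to the decoded content size.
    const size =
      entry.response.bodySize >= 0
        ? entry.response.bodySize
        : Math.max(entry.response.content.size, 0);
    const row = byHost[host] ?? (byHost[host] = { host, requests: 0, bytes: 0 });
    row.requests += 1;
    row.bytes += size;
  });
  return Object.keys(byHost)
    .map((host) => byHost[host])
    .sort((a, b) => b.bytes - a.bytes)
    .slice(0, top);
}

// One line per host: short enough to paste into a prompt.
function toSummaryText(rows: HostWeight[]): string {
  return rows
    .map((r) => `${r.host}: ${r.requests} requests, ${(r.bytes / 1024).toFixed(1)} KB`)
    .join("\n");
}
```

A two-megabyte file becomes a handful of lines, and the model can still ask for the raw artifact if a line looks suspicious.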

I started this trying to learn about auto-research, and I think I really just landed on Loops and LLM as the Judge, but that loop… Damn. The loop is audit, fix, re-audit. And it is a tight loop, because the fix and the test are so close to each other now. That tightness is what makes me think modern web development is about to change shape.

I do not think this tool that I made replaces Lighthouse (after all, it’s just me making it and I’m not exactly well known enough to get any of my tools used). Lighthouse’s deterministic scores are valuable precisely because they are deterministic, and a CI pipeline wants a number it can trust run to run. What I think happens is that a second, parallel system grows up alongside it, one where the tests are authored as language and the judgements are made by reading, and that system can cover everything Lighthouse cannot afford to hand-code. The subjective stuff. The stuff that changes every Chrome release. The stuff nobody has written a check for yet.

What makes me pause is how cheap this was to build. Lighthouse is a real product. It has an engineering team, it has been staffed for years, it is integrated into Chrome and into DevTools, and a lot of teams run it in their CI. That is a serious investment by a serious group of people, and it has done a lot for the web. I built this harness in a couple of days, and I can maintain a battery of tests across it faster than any team can hand-engineer the equivalent checks, because authoring a test is now writing a paragraph and proving it works is running an eval against a fixture. When the cost of producing a test collapses like that, the question I have is whether a browser vendor needs to own the test product at all (hey - before you go further, the team don’t know I wrote this, there’s no discussions happening!).

Two things make me think this is bigger than one tool. The first is who gets to play. Any team with a build pipeline, which is nearly every team that ships a site, can wire tests like this into CI and get a hill-climbing quality loop for roughly the cost of a prompt. You do not need to be on the Lighthouse team to have a hundred modern checks. You need a sentence per check and a fixture to prove it works. The second is that the same loop points at something people have wanted for a long time: self-healing software. Find, fix, test, with no human stuck in the slow part of the loop. We are not all the way there. The fixes still need eyes on them, I said that earlier. But iwanttouse went from six findings to zero without me writing a line of the fix.

The question I started with was how a modern Lighthouse might work with large language models. Having built a version of it, I think the answer is: you stop writing the audits, and you start writing the bar you want to reach and the evals. The model does the rest, and it will keep doing the rest, in a loop, until your site clears it. That feels like a different kind of tool, and I suspect, a different kind of craft.
