Bot Arena

Failure report

Playwright vs Bot Arena

Plain Playwright tests, one per level. Each one tries to sign in, and every test fails, for different reasons in each of the two sections. Below: what each test does, the error Playwright surfaces, and either the detection signals that caught it or the structural mismatch that prevented it.

Tests run: 13
Failed: 13
Signals / mismatches: 28

Headless Chromium driven by @playwright/test running against bot-arena.jhero.app. Source: playwright/levels.spec.ts.

Section 1

Bot detection

The site detects automation through fingerprinting, behavioural signals, or third-party challenge. Five levels of increasing sophistication.

Level 1

The honest tell

· Passive webdriver flags
Playwright: failed · AIVA: also fails

What the test does

test('Level 1 sign in', async ({ page }) => {
  await page.goto('/bot-detection/level-1/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms
Plain-English explanation
The problem

Browsers volunteer a lot about themselves to every site they visit — what version they are, what extensions are loaded, whether they are being controlled by an automation program. When Playwright drives a browser, the browser honestly admits "I am being automated" through a flag called navigator.webdriver that any site can read in a single line of JavaScript. Stock Playwright also has no plugins installed, no notification permissions set, and identifies itself as "HeadlessChrome" in its version string. Each of these is a yes/no question a site can ask in milliseconds.
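The yes/no questions described above can be sketched as a pure function over a snapshot of browser values. This is a minimal illustration, not Bot Arena's actual source: the `BrowserSnapshot` shape, the function name, and the shortened User-Agent string are assumptions, and the pass conditions are inferred from the Detection Log for this level.

```typescript
// Hypothetical snapshot of values a page can read from `navigator`.
interface BrowserSnapshot {
  webdriver: boolean | undefined;
  pluginCount: number;
  languages: string[];
  userAgent: string;
}

interface CheckResult {
  id: string;
  pass: boolean;
  detail: string;
}

// Each entry is one yes/no question about the snapshot.
function runPassiveChecks(s: BrowserSnapshot): CheckResult[] {
  return [
    { id: 'webdriver', pass: s.webdriver !== true, detail: `navigator.webdriver = ${String(s.webdriver)}` },
    { id: 'plugins', pass: s.pluginCount > 0, detail: `navigator.plugins.length = ${s.pluginCount}` },
    { id: 'languages', pass: s.languages.length > 0, detail: `navigator.languages = [${s.languages.join(', ')}]` },
    { id: 'ua-headless', pass: !s.userAgent.includes('HeadlessChrome'), detail: 'User-Agent scanned for "HeadlessChrome"' },
  ];
}

// Values modelled on the Detection Log for stock headless Playwright.
// The User-Agent is shortened for readability.
const headless: BrowserSnapshot = {
  webdriver: true,
  pluginCount: 0,
  languages: ['en-US'],
  userAgent: 'Mozilla/5.0 HeadlessChrome/148.0.7778.96',
};

const failed = runPassiveChecks(headless).filter((r) => !r.pass).map((r) => r.id);
console.log(failed); // [ 'webdriver', 'plugins', 'ua-headless' ]
```

In a browser the snapshot would be filled from `navigator.webdriver`, `navigator.plugins.length`, `navigator.languages`, and `navigator.userAgent`; keeping the checks separate from the collection step makes them easy to exercise outside a page.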

Why a VNC-driven real browser passes

The browser inside a VNC session is a regular, fully-fledged Chrome that a regular user started. Nothing is automating it from the inside — the automation happens outside the browser, at the operating-system level, by moving a mouse and pressing keys on a remote desktop. The browser does not know it is being driven, so none of these flags get set, and it reports back the same values any real human visitor would.

Playwright context: could this test be fixed in Playwright? 2/5 · Stealth-plugin arms race

Verdict: technically patchable, but it's an arms race that the page always wins eventually.

Each of the five signals this level checks can in principle be spoofed from Playwright:

  • navigator.webdriver can be hidden via --disable-blink-features=AutomationControlled plus an addInitScript that redefines the property.
  • The User-Agent can be spoofed with --user-agent="..." to strip the HeadlessChrome token.
  • navigator.plugins, navigator.languages, and the Notification.permission / permissions.query pair can all be patched via Object.defineProperty in an init script.

Off-the-shelf stealth bundles (playwright-extra + puppeteer-extra-plugin-stealth) ship most of these patches already. The catch: every Chrome release introduces new tells, and commercial bot-detection vendors (Cloudflare, DataDome, PerimeterX, Imperva) maintain fingerprint databases of every known stealth-plugin signature. You spend more time updating your evasions than writing tests, and you only ever win temporarily.

AIVA context: what would need to change in AIVA to pass this

Path 1 · 1/5 · Practical: init-script patch
~30 minutes
5-line patch in browser.ts · No architectural change · Vibe-codable

Path 2 · 5/5 · Clean: replace Puppeteer / CDP
Multi-week refactor
Rewrite control plane · X11/uinput steering · No CDP attached

AIVA fails this level because of one signal: navigator.webdriver = true. AIVA launches Chrome via Puppeteer in aiva-node/src/control-server/src/browser.ts:204 (puppeteer.launch({...})), and any browser attached via CDP has this flag set automatically by Chrome itself.

The pragmatic fix is a single init script. Add this to AIVA's page-setup flow (e.g., next to the existing hideCursorScript wiring):

await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'webdriver', {
    get: () => undefined,
    configurable: true,
  });
});

Bot Arena's L1 check is literally navigator.webdriver === true → FAIL. Returning undefined makes the check pass. Stealth bundles such as puppeteer-extra-plugin-stealth take the same approach of patching the property before any page script runs.

The original "multi-week refactor" estimate was for the architecturally pure fix: replacing Puppeteer/CDP entirely with a non-CDP control plane. That's the right answer if you need to pass sophisticated bot-detection vendors that fingerprint the shape of navigator.webdriver (own vs prototype descriptor, getter behaviour, etc.). For Bot Arena and most "naive equality check" detection layers, the 5-line patch is sufficient.

Trade-off: the init-script patch is detectable by sites that audit property descriptors. If AIVA's target customers operate sites with enterprise-grade detection, the architectural path becomes the right long-term investment. For this demo and a wide class of real-world cases, the patch is the right answer today.
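To make the trade-off concrete, here is a small sketch of why a descriptor audit sees through the patch. `ModelNavigator` is a stand-in written for this example (it mimics how Chrome exposes `webdriver` as a getter on the prototype); the two check functions are illustrative and are not taken from any vendor's code.

```typescript
// Stand-in for the browser's navigator object: `webdriver` lives on the
// prototype as a getter. This class is a model, not the real object.
class ModelNavigator {
  get webdriver(): boolean {
    return true; // what a CDP-driven Chrome reports
  }
}

// Naive check in the style of Level 1: strict equality against true.
function naiveCheck(nav: object): boolean {
  return (nav as { webdriver?: unknown }).webdriver !== true;
}

// Descriptor audit: an untouched navigator has no OWN `webdriver` property.
function descriptorAudit(nav: object): boolean {
  return Object.getOwnPropertyDescriptor(nav, 'webdriver') === undefined;
}

const nav = new ModelNavigator();
const before = { naive: naiveCheck(nav), audit: descriptorAudit(nav) };

// The same patch as the init script, applied to the model instance.
Object.defineProperty(nav, 'webdriver', { get: () => undefined, configurable: true });
const after = { naive: naiveCheck(nav), audit: descriptorAudit(nav) };

console.log(before); // { naive: false, audit: true }
console.log(after);  // { naive: true, audit: false }
```

The patch flips the naive check from fail to pass, but it also creates an own property that was not there before, which is exactly what the audit looks for.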

Why it failed — Detection Log

  • fail webdriver — navigator.webdriver = true
  • fail plugins — navigator.plugins.length = 0 (expected > 0)
  • pass languages — navigator.languages = [en-US]
  • fail ua-headless — User-Agent contains "HeadlessChrome/148.0.7778.96"
  • pass notif-permission — Notification.permission and permissions.query agreed
Level 2

CDP attached

· Headless / CDP-only tells
Playwright: failed · AIVA: also fails

What the test does

test('Level 2 sign in', async ({ page }) => {
  await page.goto('/bot-detection/level-2/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms
Plain-English explanation
The problem

What "CDP" means: CDP stands for Chrome DevTools Protocol — the low-level remote-control interface Chrome exposes for tools like Chrome's own DevTools panel, Puppeteer, and Playwright. It is how those tools "drive" Chrome from outside the browser: clicking buttons, typing text, reading the DOM, taking screenshots — all without using a real keyboard and mouse. When CDP is attached, Chrome behaves slightly differently in measurable ways, and many of those differences also coincide with "I am running headless."

A real laptop has visible chrome around every browser window (toolbars at the top, tabs, a Windows taskbar at the bottom, a macOS menu bar), and this chrome takes up real pixels. A page can ask the browser how tall the window is compared with the viewport, and how much of the screen the OS has reserved, and the answers come back in pixels. A headless automated browser has no chrome and no visible window at all, so the honest answer is zero. Toolbars that do not exist can only be faked by spoofing the numbers, and spoofed numbers have to stay consistent with every other measurement.
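The two pixel measurements can be sketched as simple subtractions. The `WindowMetrics` shape and function names are assumptions for illustration; the headless values mirror the Detection Log for this level, and the desktop values are invented to show a typical non-zero result.

```typescript
// Hypothetical snapshot of window and screen measurements, in CSS pixels.
interface WindowMetrics {
  outerHeight: number;
  innerHeight: number;
  screenHeight: number;
  screenAvailHeight: number;
}

// Pixels occupied by toolbars and tabs; zero when the window has no browser chrome.
function browserChromeHeight(m: WindowMetrics): number {
  return m.outerHeight - m.innerHeight;
}

// Pixels reserved by the OS (taskbar, menu bar, dock); zero when nothing is reserved.
function reservedScreenHeight(m: WindowMetrics): number {
  return m.screenHeight - m.screenAvailHeight;
}

// Flags the window if either measurement comes back as zero.
function looksHeadless(m: WindowMetrics): boolean {
  return browserChromeHeight(m) === 0 || reservedScreenHeight(m) === 0;
}

const headlessMetrics: WindowMetrics = { outerHeight: 720, innerHeight: 720, screenHeight: 720, screenAvailHeight: 720 };
const desktopMetrics: WindowMetrics = { outerHeight: 1040, innerHeight: 940, screenHeight: 1080, screenAvailHeight: 1040 };

console.log(looksHeadless(headlessMetrics)); // true
console.log(looksHeadless(desktopMetrics));  // false
```

In a page these values would come from `window.outerHeight`, `window.innerHeight`, `screen.height`, and `screen.availHeight`. A zero is a hint rather than proof: a real browser in fullscreen mode also reports no chrome height.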

Why a VNC-driven real browser passes

A VNC session streams a real, fully visible Chrome window running on a real desktop. There are real toolbars, a real taskbar, real OS chrome. Every measurement the page makes returns the same numbers any human visitor on any laptop would produce. Crucially, the automation happens outside the browser (at the OS level, moving a real cursor) — no CDP is attached, so Chrome behaves like an ordinary Chrome being used by an ordinary person.

Playwright context: could this test be fixed in Playwright? 3/5 · Partial: script signals only

Verdict: partially patchable in script; the pixel measurements require effectively rebuilding what VNC-AIVA already is.

The JavaScript-level signals (chrome.app, chrome.csi, driver shims, toString integrity) can be polyfilled with an addInitScript at page load. Easy.

The window/screen pixel measurements are different. outerHeight - innerHeight = 0 is true because the headless browser literally has no toolbars. Two ways out, neither great:

  1. Run headed (headless: false) on a server with Xvfb/Xvnc. But then you need a real desktop environment with a window manager and a panel to populate screen.availHeight < screen.height, plus you need Chrome to actually display its chrome (not --kiosk). At that point, you have rebuilt the AIVA architecture from scratch.
  2. Spoof the values from JS — override window.outerHeight, screen.height, etc. via addInitScript. But the spoofs need to be internally consistent across signals: if you claim a 1080-pixel screen with a 40-pixel taskbar, the browser viewport's actual height needs to plausibly fit inside that. Cross-signal correlation catches these mismatches.

In practice: an automation team trying to fix L2 with Playwright ends up reinventing AIVA badly.
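The cross-signal correlation mentioned in option 2 can be sketched as a handful of inequalities that honest geometry always satisfies. The type, the function name, and the `slackPx` tolerance are assumptions for this example; real detectors are not public, and a production check would likely allow some slack for window borders.

```typescript
interface ClaimedGeometry {
  screenHeight: number;
  screenAvailHeight: number;
  outerHeight: number;
  innerHeight: number;
}

// Returns a list of contradictions between the claimed values (empty if none).
function findContradictions(g: ClaimedGeometry, slackPx = 0): string[] {
  const problems: string[] = [];
  if (g.screenAvailHeight > g.screenHeight) {
    problems.push('available height exceeds screen height');
  }
  if (g.outerHeight > g.screenAvailHeight + slackPx) {
    problems.push('window is taller than the available screen');
  }
  if (g.innerHeight > g.outerHeight) {
    problems.push('viewport is taller than the window');
  }
  return problems;
}

// A spoof that claims a 1080px screen with a 40px taskbar and adds 80px of
// fake toolbar on top of a viewport that is really 1080px tall.
const partialSpoof: ClaimedGeometry = { screenHeight: 1080, screenAvailHeight: 1040, outerHeight: 1160, innerHeight: 1080 };
const consistentDesktop: ClaimedGeometry = { screenHeight: 1080, screenAvailHeight: 1040, outerHeight: 1040, innerHeight: 940 };

console.log(findContradictions(partialSpoof));      // [ 'window is taller than the available screen' ]
console.log(findContradictions(consistentDesktop)); // []
```

The spoofed values each look plausible alone; the contradiction only appears when they are compared with one another.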

AIVA context: what would need to change in AIVA to pass this

Fix complexity: 2/5 · Easy: drop 2 flags + add desktop env
Half a day
Config: drop 2 flags · Image: add desktop env

AIVA currently fails this level for two reasons:

  1. No visible browser chrome — AIVA's browserArgs.ts passes both --start-fullscreen and --kiosk. Both flags hide the toolbars, tabs, and address bar that any real Chrome window displays. With them dropped, outerHeight - innerHeight jumps from 0 px to the usual 80–120 px. Drop: --start-fullscreen, --kiosk
  2. No taskbar — this one is outside Chrome's launch flags. AIVA's VNC session (Xvfb/Xvnc) has no window manager or desktop panel reserving screen pixels, so the X server reports screen.availHeight === screen.height. Adding a lightweight desktop environment to the AIVA image — XFCE, LXDE, or even just OpenBox + tint2 — with a panel/dock visible at the bottom of the screen would close this gap.

Why it failed — Detection Log

  • pass driver-shims — no cdc_* globals (Playwright is not Selenium)
  • pass tostring-integrity — Function.prototype.toString is native
  • fail chrome-surface — window.chrome.app and chrome.csi both missing (app=false, csi=false)
  • fail browser-chrome-height — outerHeight - innerHeight = 0px (no toolbars/tabs visible)
  • fail screen-taskbar — screen.availHeight = screen.height = 720 (no taskbar reserved)
Level 3

Mouse trajectory

· Behavioural — mouse path and keystroke cadence
Playwright: failed · AIVA: passes

What the test does

test('Level 3 sign in', async ({ page }) => {
  await page.goto('/bot-detection/level-3/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms
Plain-English explanation
The problem

When a human clicks a button on a web page, the mouse pointer travels there — left a bit, up a bit, curving naturally. That path leaves a trail of dozens of "I moved here" events along the way. Playwright does not do that. When you tell Playwright "click this button," the pointer instantly appears at the button's exact pixel and clicks. No travel, no curve. A page that records every mouse event notices that this click came out of nowhere — no human operates a computer like that.
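A page-side trajectory check of this kind can be as simple as counting recorded pointer samples before the click. The sketch below assumes the samples were collected by a `mousemove` listener armed at page load; the type and function names are hypothetical, and the threshold of 5 is taken from the Detection Log for this level.

```typescript
// One recorded mousemove event; `t` is milliseconds since page load.
interface PointerSample {
  x: number;
  y: number;
  t: number;
}

interface Verdict {
  pass: boolean;
  reason: string;
}

const MIN_POINTS = 5; // threshold reported in the Detection Log

// Fails when too few mousemove samples were seen between load and click.
function checkMouseTrajectory(moves: PointerSample[]): Verdict {
  if (moves.length < MIN_POINTS) {
    return { pass: false, reason: `only ${moves.length} mousemove point(s) before click (need ≥${MIN_POINTS})` };
  }
  return { pass: true, reason: `${moves.length} mousemove points before click` };
}

// A scripted click: a single sample at the target, nothing before it.
const scriptedClick: PointerSample[] = [{ x: 640, y: 360, t: 1200 }];
console.log(checkMouseTrajectory(scriptedClick).pass); // false
```

Counting points is the weakest form of the check. It says nothing about the shape of the path, which is where the more advanced behavioural models do their work.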

Why a VNC-driven real browser passes

A VNC operator moves a real mouse cursor on a real operating system, generating the same continuous stream of mouse events any human would. Because the path is a physical movement (the cursor is dragged across the screen by a person or by image-recognition automation steering it), it has the same natural variation and curvature as any other user's.

Playwright context: could this test be fixed in Playwright? 3/5 · Humanize every interaction

Verdict: bypassable for the basic checks Bot Arena does, but only by hand-rolling humanized interactions everywhere — and any sophisticated behavioural model still wins.

Playwright does expose lower-level mouse APIs that can generate intermediate moves:

  • page.mouse.move(x, y, { steps: 30 }) emits 30 intermediate mousemove events along a straight line.
  • Wrap that in a Bezier-curve helper with randomized jitter and you produce trajectories with the right shape and curvature.
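  • page.keyboard.type(text, { delay: 120 }) dispatches one key at a time, but delay is a single number, so every gap between keys is identical. For randomised inter-key delays, type one character per call and wait a fresh rand(80, 200) milliseconds between calls.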

The catch: every interaction in the test suite needs this treatment. A one-line page.click() becomes a thirty-line "humanize" helper. And advanced behavioural fingerprinting (used by serious bot-detection vendors) trains ML models on real human mouse telemetry — they pick up on acceleration curves, overshoot-and-correct patterns, pause-before-click latency, and dozens of other features that synthetic Bezier curves don't replicate. So: bypassable here, in this demo. Increasingly hard against production-grade defenders.
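The "humanize" helper described above might start from something like the following sketch. It only generates data (a curved list of points and a list of per-key delays); the function names, the bend factor, and the jitter size are all assumptions chosen for illustration, and `rand` is injectable so the output can be made repeatable.

```typescript
interface Point {
  x: number;
  y: number;
}

// Quadratic Bezier from `from` to `to`, bent through a control point that is
// offset sideways from the straight line, with small jitter on each point.
function humanPath(from: Point, to: Point, steps: number, rand: () => number = Math.random): Point[] {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  const bend = (rand() - 0.5) * 0.4; // sideways bulge as a fraction of the distance
  const control: Point = {
    x: from.x + dx / 2 - dy * bend,
    y: from.y + dy / 2 + dx * bend,
  };
  const points: Point[] = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    const noise = i === steps ? 0 : 1; // no jitter on the final point
    points.push({
      x: u * u * from.x + 2 * u * t * control.x + t * t * to.x + (rand() - 0.5) * noise,
      y: u * u * from.y + 2 * u * t * control.y + t * t * to.y + (rand() - 0.5) * noise,
    });
  }
  return points;
}

// One fresh delay per character, in milliseconds.
function keyDelays(text: string, minMs: number, maxMs: number, rand: () => number = Math.random): number[] {
  return Array.from(text, () => Math.round(minMs + rand() * (maxMs - minMs)));
}

const cursorPath = humanPath({ x: 0, y: 0 }, { x: 300, y: 200 }, 30);
const delays = keyDelays('hunter2', 80, 200);
console.log(cursorPath.length, delays.length); // 30 7
```

In a test, each point would be fed to `page.mouse.move(x, y)` in turn, and each character typed on its own followed by a wait of the matching delay. This produces the right shape for a point-count check, but not the acceleration and overshoot patterns that trained behavioural models look for.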

AIVA context: why this level already passes for AIVA
✓ No fix needed: passes by construction

AIVA passes this level natively. The mouse cursor in AIVA's VNC session moves continuously across the screen at the OS level — exactly like any human user dragging a real mouse. No code or config change is needed here; this is one of the levels where running on a real machine wins by construction.

Why it failed — Detection Log

  • info level3-armed — recorder armed at page load
  • fail mouse-trajectory — only 1 mousemove point recorded between load and click (need ≥5 for a human-shaped curve)
  • pass keystroke-cadence — 0 keystrokes — page.fill() bypasses key events, so this check abstains
Level 4

Fingerprint battery

· Canvas, audio, WebGL renderer, font set
Playwright: failed · AIVA: passes

What the test does

test('Level 4 sign in', async ({ page }) => {
  await page.goto('/bot-detection/level-4/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms
Plain-English explanation
The problem

Real computers have real graphics cards from real manufacturers (Intel, NVIDIA, AMD), and each draws images and text in subtly different but characteristic ways. Real computers also have real font files installed by the operating system. Headless automated browsers have neither — they use a software-only graphics stack called SwiftShader that produces an obviously-different visual fingerprint, and they ship with a stripped-down set of fonts. A page can render a tiny invisible test image and hash the pixels; that single hash is usually enough to tell whether the browser is running on real silicon or a CI runner.

Why a VNC-driven real browser passes

A VNC session runs on a real machine with a real graphics stack and a real set of fonts. The fingerprints it produces match those of millions of other real desktop Chrome installations.

Playwright context: could this test be fixed in Playwright? 4/5 · Needs real GPU for consistency

Verdict: effectively impossible to fix from inside Playwright without a real GPU and real fonts.

Every individual fingerprint can be spoofed with an addInitScript hook:

  • Override WebGLRenderingContext.prototype.getParameter to return a fake renderer string like "Intel Iris Xe Graphics".
  • Patch HTMLCanvasElement.prototype.toDataURL and getImageData to return a pre-computed "real GPU" hash.
  • Replace OfflineAudioContext.prototype.startRendering to return a pre-recorded waveform.
  • Spoof document.fonts.check and the font-width measurement trick to claim the right font set.

The trap: internal consistency. If you claim "Intel Iris Xe Graphics" for WebGL, your canvas pixel hash needs to match what an actual Intel iGPU produces — and that hash depends on subtle floating-point rounding, anti-aliasing kernels, and driver-specific quirks. Without the actual hardware you cannot reproduce it. Detection services maintain databases of valid combinations: GPU X must produce canvas hash within set Y for fonts Z. Spoofing one signal in isolation creates a contradiction with the others, which is itself a stronger signal than the original tell.

This is where Playwright fundamentally loses against any site doing real fingerprint-based bot detection.
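The consistency trap can be sketched as a lookup against known renderer and canvas-hash combinations. Everything here is illustrative: the database contents are placeholders, the hash strings are not real fingerprints, and the function names are assumptions. The software-rasteriser substrings (SwiftShader, and Mesa's llvmpipe) are real renderer names.

```typescript
interface Fingerprint {
  webglRenderer: string;
  canvasHash: string;
}

type Classification = 'software' | 'contradiction' | 'plausible' | 'unknown';

// Substrings that indicate a software rasteriser rather than a physical GPU.
const SOFTWARE_RENDERERS = ['SwiftShader', 'llvmpipe'];

function isSoftwareRenderer(renderer: string): boolean {
  return SOFTWARE_RENDERERS.some((marker) => renderer.includes(marker));
}

// Hypothetical database: canvas hashes previously observed for each renderer.
type KnownCombos = Map<string, Set<string>>;

function classify(fp: Fingerprint, known: KnownCombos): Classification {
  if (isSoftwareRenderer(fp.webglRenderer)) return 'software';
  const hashes = known.get(fp.webglRenderer);
  if (hashes === undefined) return 'unknown';
  return hashes.has(fp.canvasHash) ? 'plausible' : 'contradiction';
}

// Placeholder data, not real fingerprints.
const knownCombos: KnownCombos = new Map([
  ['Intel Iris Xe Graphics', new Set(['placeholder-hash-intel'])],
]);

// Renderer string spoofed, canvas output still produced by the software stack.
const spoofed: Fingerprint = { webglRenderer: 'Intel Iris Xe Graphics', canvasHash: 'placeholder-hash-software' };
console.log(classify(spoofed, knownCombos)); // contradiction
```

Spoofing only the renderer string moves the verdict from "software" to "contradiction", which is the stronger signal the paragraph above describes.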

AIVA context: why this level already passes for AIVA

Fix complexity: 1/5 · Trivial: hardening (not required)
A few hours, only if hardening is desired
Config: drop 3 flags · Operational: harvest denylist hashes

AIVA passes this level — but partially by accident. AIVA's browserArgs.ts includes --disable-gpu, --disable-webgl, and --disable-features=Vulkan,webgpu, which make the WebGL renderer query return nothing. Bot Arena reports an empty renderer as INFO rather than FAIL, so AIVA slips past. Canvas, audio, and font fingerprints come from a real Linux Chrome on a real machine and look like any other desktop user.

Latent risk: if Bot Arena's canvas/audio denylists in src/detections/level4.ts were populated with hashes harvested from AIVA's Chrome (which is the operational follow-up flagged in the implementation plan), this level would fail for AIVA too. Long-term, AIVA should consider whether --disable-gpu/--disable-webgl are still needed — they're a tell to fingerprint-aware sites because most real Chromes do have GPU.

Why it failed — Detection Log

  • fail webgl-renderer — WebGL renderer = "ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device …))" — software rasteriser, no GPU
  • pass canvas-fp — sha256 = f66453e0… (not on denylist — denylist is empty in v1)
  • pass audio-fp — sha256 = 543fb8e0… (not on denylist — denylist is empty in v1)
  • pass font-probe — Segoe UI Emoji, Arial Black, Comic Sans MS — UA-consistent for the Windows runner
Level 5

Cloudflare Turnstile

· Real third-party challenge
Playwright: failed · AIVA: also fails

What the test does

test('Level 5 sign in', async ({ page }) => {
  await page.goto('/bot-detection/level-5/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms
Plain-English explanation
The problem

Cloudflare Turnstile is the modern, invisible replacement for "click all the bicycle pictures" CAPTCHAs. When a page asks for it, Turnstile silently runs all the kinds of checks the previous four levels illustrate — plus an additional stack of private signals only Cloudflare knows about — and decides whether the visitor looks human enough to be issued a one-time "yes, this is a human" token. For automated browsers it simply refuses to issue the token. The server-side login check then sees no token and rejects the submission before it ever reaches the application code.
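The server-side gate described above can be sketched as follows. This is an assumption-laden illustration, not Bot Arena's handler: the request shape and function names are invented, and the verifier is injected so the sketch runs without network access. In a real deployment that verifier would call Cloudflare's siteverify endpoint with the site's secret key and the submitted token, and read the `success` field of the response.

```typescript
// Result shape modelled on the `success` field of a siteverify response.
interface VerifyOutcome {
  success: boolean;
}

// Injected verifier (hypothetical signature) standing in for the siteverify call.
type SiteVerify = (token: string) => Promise<VerifyOutcome>;

interface LoginRequest {
  email: string;
  password: string;
  turnstileToken?: string;
}

type GateResult = { allowed: true } | { allowed: false; reason: string };

// Rejects the submission before any application code runs if the token is
// missing or does not verify.
async function gateLogin(req: LoginRequest, siteVerify: SiteVerify): Promise<GateResult> {
  if (!req.turnstileToken) {
    return { allowed: false, reason: 'no token' }; // siteverify is never called
  }
  const outcome = await siteVerify(req.turnstileToken);
  return outcome.success ? { allowed: true } : { allowed: false, reason: 'token rejected' };
}

// Fake verifier for the example; counts how often it is called.
let verifyCalls = 0;
const fakeVerify: SiteVerify = async (token) => {
  verifyCalls += 1;
  return { success: token === 'valid-token' };
};

gateLogin({ email: 'user@example.com', password: 'hunter2' }, fakeVerify).then((result) => {
  console.log(result, verifyCalls);
});
```

The first branch is the one the automated browser hits: with no token issued by the widget, the request is turned away without the verifier being consulted.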

Why a VNC-driven real browser passes

A real Chrome session with a real fingerprint, real interaction history, and real mouse movement looks like any other paying customer to Turnstile. The token gets issued silently, exactly the same way it would for someone working from a coffee-shop laptop.

Playwright context: could this test be fixed in Playwright? 5/5 · Impossible without 3rd-party solver

Verdict: functionally impossible to bypass from inside Playwright. The only working "fix" is to outsource the problem.

Turnstile's logic is intentionally closed-source. It runs every kind of signal the previous four levels illustrate, plus a stack of private checks Cloudflare keeps to itself, plus IP reputation, plus behavioural analysis trained on the firehose of real human traffic across the Cloudflare network. Even a Playwright author who perfectly fixed levels 1-4, ran from a residential IP, and hand-rolled humanized interactions would still be classified as automated with high confidence — Cloudflare's behavioural model is too good.

The "solution" used in the wild is paid CAPTCHA-solver services (2Captcha, anti-captcha, CapMonster, etc.). They route the challenge through real-browser farms — either real humans or sophisticated stealth setups — and return a valid token in a few seconds, for a few cents each. Wire one of those into your test:

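// `solver` stands in for a third-party CAPTCHA-solving client; its API is illustrative.
const token = await solver.solveTurnstile({
  sitekey: '0x4AAAAAADOBZMoei4aG9CNO',
  url: 'https://bot-arena.jhero.app/bot-detection/level-5/',
});
// Write the solved token into the hidden field that the Turnstile widget normally fills in.
await page.evaluate((t) => {
  const field = document.querySelector<HTMLInputElement>('input[name="cf-turnstile-response"]');
  if (!field) throw new Error('Turnstile response field not found');
  field.value = t;
}, token);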

This works — but it has defeated the original purpose of using Playwright. You have paid a third-party service to act as the human in front of the human-detector. Your "automated" tests now have a per-run cost and a human-in-the-loop dependency. This is exactly the kind of corner that VNC-AIVA, by being a real browser session at the OS level, avoids without any third-party dependency.

AIVA context: what would need to change in AIVA to pass this

Fix complexity: 4/5 · Hard: partially externally-bound
Inherits L1 + infra work; Cloudflare ML remains uncertain
Blocked on L1 · Blocked on L2 · Residential IP infrastructure

AIVA currently fails this level as a cascading consequence of L1 and L2. Cloudflare Turnstile silently runs many of the same signals — navigator.webdriver, browser-chrome dimensions, fingerprint plausibility — plus its own private checks, plus IP reputation. Two contributing causes inside AIVA's control:

  1. Signal leakage from L1 and L2. Fixing the Puppeteer/CDP attachment, dropping --incognito/--disable-extensions, and dropping --kiosk/--start-fullscreen would all reduce Turnstile's confidence that the visitor is automated. Closing L1 + L2 likely moves Turnstile from "refuse / interactive challenge" to "silent pass" for many sites.
  2. IP reputation. If AIVA runs on a datacenter or cloud-region IP, Turnstile downgrades by default. Running through a residential proxy or from end-user infrastructure improves the score meaningfully — and is independent of any AIVA code change.

Turnstile's decision logic is closed-source, so even a perfectly configured AIVA may occasionally fail. This level is the only one where success isn't fully under AIVA's control.

Why it failed — Detection Log

  • fail turnstile — no token — widget did not solve. Cloudflare refused to issue a token for the automated browser; server-side siteverify never called.
Section 2

Selector resistance

The DOM that selector-based automation depends on is absent or randomised. Playwright fails at the selector step before any signal can fire.

Level 1

Canvas-rendered login

· No DOM to query — pixels only
Playwright: failed · AIVA: passes

What the test does

test('Level 1 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-1/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: locator.fill: Test timeout of 30000ms exceeded.

Call log:
  - waiting for getByLabel('Email')
  - locator resolved to <no element matched>
Plain-English explanation
The problem

A small but growing class of web apps draws most or all of its user interface inside a single <canvas> element: Figma, Google Sheets, Photoshop Web, web games. The browser sees one big rectangular pixel buffer; everything inside it (text, buttons, input boxes) is just paint. Inside the canvas there is no DOM tree to walk, no element to grab, no label to query. A selector-based automation tool has nothing to interact with.

Why a VNC-driven real browser passes

An image-aware automation tool — like the classic AIVA — does not look at the DOM at all. It looks at the rendered pixels, recognises the visible "Email" text and the box right under it, and clicks at those coordinates. It then types using OS-level keystrokes, which the canvas receives as ordinary keyboard events. The DOM's absence is irrelevant; the pixels are the contract.

Playwright context: could this test be fixed in Playwright? 5/5 · Impossible without external vision

Verdict: impossible with selector-based Playwright. The DOM is empty of anything to query.

Playwright's locator APIs all resolve to nothing here:

  • page.getByLabel('Email') — no <label> element exists.
  • page.getByRole('textbox') — no <input> element exists.
  • page.getByText('Sign in') — the text "Sign in" is painted pixels, not a text node.

The only theoretical path is to take a screenshot from Playwright, pass it to an external OCR / template-matching pipeline to find UI elements visually, then use page.mouse.click(x, y) at the resolved coordinates and page.keyboard.type(...) to fill them. At that point you have built a worse version of the classic AIVA — and you've moved the actual automation outside Playwright entirely.

There is no Playwright-native way to locate controls inside a canvas-rendered UI. This is a structural mismatch between the tool and the target, not an arms race over signals.
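The external template-matching step mentioned above can be illustrated with a toy exact-match search over a pixel grid. This is a deliberately naive sketch: the `Bitmap` type stands in for a decoded screenshot, the function names are hypothetical, and a real pipeline would use OCR or tolerant matching rather than exact equality.

```typescript
type Bitmap = number[][]; // rows of pixel values; a stand-in for a decoded screenshot

interface Coord {
  x: number;
  y: number;
}

// True if `template` matches `image` exactly with its top-left corner at (x, y).
function matchesAt(image: Bitmap, template: Bitmap, x: number, y: number): boolean {
  return template.every((row, dy) => row.every((value, dx) => image[y + dy]?.[x + dx] === value));
}

// Exhaustive search: top-left corner of the first occurrence, or null if absent.
function findTemplate(image: Bitmap, template: Bitmap): Coord | null {
  const th = template.length;
  const tw = template[0]?.length ?? 0;
  if (th === 0 || tw === 0) return null;
  for (let y = 0; y + th <= image.length; y++) {
    const rowWidth = image[y]?.length ?? 0;
    for (let x = 0; x + tw <= rowWidth; x++) {
      if (matchesAt(image, template, x, y)) return { x, y };
    }
  }
  return null;
}

// Centre of the matched region: the coordinate a click would target.
function clickPoint(corner: Coord, template: Bitmap): Coord {
  return {
    x: corner.x + Math.floor((template[0]?.length ?? 0) / 2),
    y: corner.y + Math.floor(template.length / 2),
  };
}

const screenshot: Bitmap = [
  [0, 0, 0, 0, 0],
  [0, 0, 1, 2, 0],
  [0, 0, 3, 4, 0],
  [0, 0, 0, 0, 0],
];
const buttonTemplate: Bitmap = [
  [1, 2],
  [3, 4],
];

const corner = findTemplate(screenshot, buttonTemplate);
console.log(corner); // { x: 2, y: 1 }
```

The resulting coordinate is what would be handed to `page.mouse.click(x, y)`. Note that all of the locating work happens in this external code, not in Playwright.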

AIVA context: why this level already passes for AIVA
✓ No fix needed: passes by construction

AIVA passes this level natively. AIVA's automation model is image-based from the ground up: it screenshots the visible browser surface, identifies UI elements by what they look like, and dispatches OS-level mouse and keyboard events at the right coordinates. The DOM is incidental — AIVA never touched it on the way in, so it does not matter that there is no DOM to touch here.

This level is the strongest single argument for pixel-based automation as a category. Selector-based tools are not just blocked here — they are structurally unable to attempt the task at all.

Why it failed — Detection Log

  • info no-dom — Only one DOM element exists in the form region: a <canvas>. No <input>, no <button>, no <label>.
  • fail getByLabel-email — page.getByLabel('Email') — locator resolved to <no element matched>
  • fail getByLabel-password — page.getByLabel('Password') — locator resolved to <no element matched>
  • fail getByRole-button — page.getByRole('button', { name: 'Sign in' }) — locator resolved to <no element matched>
Level 2

Dynamic selectors

· Real form, randomised identifiers
Playwright: failed · AIVA: passes

What the test does

test('Level 2 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-2/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: locator.fill: Test timeout of 30000ms exceeded.

Call log:
  - waiting for getByLabel('Email')
  - locator resolved to <no element matched>
Plain-English explanation
The problem

Modern web apps frequently ship with build-time CSS-in-JS, which produces machine-generated class names that change from build to build. Some apps go further and randomise every attribute (id, name, class, aria-label) on every page request, and omit <label> elements altogether. A human still reads "Email" off the screen and types in the box below. A test that locates the fields by label or attribute finds nothing to grab, because every identifier it hardcoded is now stale.
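Per-request randomisation of this kind can be sketched as a server-side render function. The markup, function names, and token format below are assumptions made for illustration and are not Bot Arena's actual template; `rand` is injectable so two "requests" can be compared deterministically.

```typescript
// Random attribute value. `rand` is injectable so the output can be made repeatable.
function randomToken(rand: () => number): string {
  return 'x' + Math.floor(rand() * 0xffffffff).toString(36);
}

// Renders a login form with fresh id/name/class values and no <label> elements.
// The visible text stays the same on every request.
function renderLoginForm(rand: () => number = Math.random): string {
  const field = (text: string, type: string): string =>
    `<div class="${randomToken(rand)}">${text}</div>` +
    `<input type="${type}" id="${randomToken(rand)}" name="${randomToken(rand)}">`;
  return `<form>${field('Email', 'email')}${field('Password', 'password')}<button>Sign in</button></form>`;
}

// Strips the randomised attributes, leaving roughly what a human perceives.
function visibleStructure(html: string): string {
  return html.replace(/ (id|name|class)="[^"]*"/g, '');
}

const requestA = renderLoginForm(() => 0.25);
const requestB = renderLoginForm(() => 0.75);

console.log(requestA === requestB);                                     // false
console.log(visibleStructure(requestA) === visibleStructure(requestB)); // true
```

The two renders differ in every identifier a selector could key on, yet are identical once those identifiers are removed, which is why a reader of the screen is unaffected.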

Why a VNC-driven real browser passes

AIVA reads "Email" from the screen pixels and clicks the input it visually identifies as a text box just below the label text. It does not look at attributes; it looks at the rendered shape of the page. Randomising the DOM has no effect on it — the visual layout is what matters.

Playwright context: could this test be fixed in Playwright? 4/5 · Brittle workarounds, no stable selector

Verdict: possible with brittle fallback selectors, but the entire promise of accessibility-based testing is gone.

Playwright's normal idioms break:

  • page.getByLabel('Email') — no <label> element exists, so this returns nothing.
  • page.getByRole('textbox', { name: 'Email' }) — no accessible name, so this returns nothing.
  • page.getByPlaceholder('Email') — no placeholder, so this returns nothing.

Possible fallbacks, in increasing brittleness:

  1. page.locator('input[type="email"]') — works this run; breaks if the input type is also randomised, or another email input is added.
  2. page.locator('input').nth(0) — works this run; breaks the moment the form reorders or grows.
  3. page.locator('div:has-text("Email") + input') — works for this layout; breaks if the DOM structure is rewritten.

For a truly hostile site, every fallback is one revision away from breaking. The maintenance burden grows linearly with the number of forms; the test suite becomes the single largest source of flakiness in the project.

AIVA context: why this level already passes for AIVA
✓ No fix needed: passes by construction

AIVA passes this level natively. AIVA does not look at attributes. It looks at the visible rendering: a label that says "Email", an input box beneath it, a similar pair for "Password", a dark button labelled "Sign in". Randomising the DOM attributes changes nothing about that visual layout — the OCR and template matching find the same targets in the same places.

As a category, "selector resistance" is invisible to AIVA by construction. Every visual automation tool — AIVA, image-based RPA platforms, agentic vision models — sits in this same advantage zone.

Why it failed — Detection Log

  • info form-rendered — A real <form> with real <input> elements — but every id/name/class is randomised per request, and there are no <label> elements.
  • fail getByLabel-email — page.getByLabel('Email') fails — no <label> element associates with the input.
  • pass getByRole-button — page.getByRole('button', { name: 'Sign in' }) works (button text is stable).
  • info fallback-fragility — Even structural fallbacks like input:nth-of-type(1) work this run, but break on the next form revision.
Level 3

Closed Shadow DOM

· Sealed web component
Playwright: failed · AIVA: passes

What the test does

test('Level 3 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-3/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: locator.fill: Test timeout of 30000ms exceeded.

Call log:
  - waiting for getByLabel('Email')
  - locator resolved to <no element matched>
Plain-English explanation
The problem

Many modern web apps build their UI as Web Components: small, self-contained widgets whose internal structure is intentionally hidden from outside code. A "closed shadow root" is the strongest form of this: the element's shadowRoot property returns null, so other JavaScript on the same page cannot read or write what is inside unless the component itself hands out a reference. This is a deliberate encapsulation boundary used by component libraries, design systems, and many enterprise SaaS frontends. To a test, the component is a black box: the element exists, but the inputs, the button, and even the success message are unreachable.
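The open versus closed rule can be shown with a small model that runs outside a browser. `ModelElement` and `ModelShadowRoot` are toy classes written for this example and are not the DOM API; they only mirror the one behaviour that matters here, namely that `attachShadow` returns the root to its caller while the public `shadowRoot` property is null in closed mode.

```typescript
type ShadowMode = 'open' | 'closed';

class ModelShadowRoot {
  readonly mode: ShadowMode;
  readonly content: string[] = [];

  constructor(mode: ShadowMode) {
    this.mode = mode;
  }
}

class ModelElement {
  private root: ModelShadowRoot | null = null;

  // Only the caller receives the reference to the new root.
  attachShadow(init: { mode: ShadowMode }): ModelShadowRoot {
    this.root = new ModelShadowRoot(init.mode);
    return this.root;
  }

  // Outside code sees the root only when the mode is 'open'.
  get shadowRoot(): ModelShadowRoot | null {
    return this.root !== null && this.root.mode === 'open' ? this.root : null;
  }
}

const openHost = new ModelElement();
openHost.attachShadow({ mode: 'open' }).content.push('<input type="email">');

const closedHost = new ModelElement();
closedHost.attachShadow({ mode: 'closed' }).content.push('<input type="email">');

console.log(openHost.shadowRoot?.content.length); // 1
console.log(closedHost.shadowRoot);               // null
```

Both hosts contain the same input, and both would render it identically. The only difference is whether outside code can obtain a handle to it.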

Why a VNC-driven real browser passes

A vision-based automation tool screenshots the rendered page and finds the input visually. The DOM privacy boundary is irrelevant — the rendered pixels are public. AIVA reads "Email", finds the box beneath it, clicks, types — exactly like a human looking at the screen. Closed shadow roots make automation harder only for tools that look at the DOM; they make it no harder for tools that look at the screen.

Playwright context — could this test be fixed in Playwright? 5/5 · Impossible — closed shadow is opaque

Verdict: impossible from inside Playwright. Closed shadow roots are an explicit privacy boundary that Playwright respects.

Playwright's locator engine can pierce open shadow roots automatically (and the >>> combinator works there too). For a closed shadow root there is no path at all — not from page.locator, not from page.evaluate (the component author may not expose any reference), not from any selector trick.

The only theoretical paths are:

  • Convince the component author to expose mode: 'open' — rarely possible for third-party code.
  • Use a Chrome DevTools Protocol command that exposes the closed shadow — Playwright does not surface this in its public API.
  • Take a screenshot and use external OCR + page.mouse.click(x, y) — at which point you have rebuilt vision-based automation outside Playwright.

For real apps built on closed Web Components, selector-based testing is structurally a dead end.

AIVA context — why this level already passes for AIVA ✓ passes natively
✓ No fix needed — passes by construction

AIVA passes this level natively. AIVA reads the rendered page through screenshots and OCR; it has no concept of DOM accessibility at all. The shadow boundary is invisible to it because the pixels on the screen do not know they are coming from a sealed component. The form is filled, submitted, and "Access granted" is visible — same as any other login page.

As a category, web-component-heavy frontends (Salesforce Lightning, ServiceNow, SAP UI5, and most enterprise design systems) put selector-based testing at a permanent disadvantage. Vision-based tools are unaffected.

Why it failed — Detection Log

  • info sealed-mounted — <sealed-login> custom element with attachShadow({ mode: "closed" })
  • fail getByLabel-email — page.getByLabel('Email') — locator cannot pierce a closed shadow root.
  • fail shadow-piercer — page.locator('sealed-login >>> input') — the >>> combinator works only on OPEN shadow roots.
  • fail getByText-granted — page.getByText("Access granted") fails too — the success message also lives inside the sealed shadow.
Level 4

Iframe-embedded form

· Form in a child browsing context
Playwright: failed AIVA: passes

What the test does

test('Level 4 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-4/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: locator.fill: Test timeout of 30000ms exceeded.

Call log:
  - waiting for getByLabel('Email')
  - locator resolved to <no element matched>
Plain-English explanation
The problem

A huge fraction of production websites embed third-party widgets via iframes — Stripe payment forms, Auth0 login dialogs, Cloudflare challenges, embedded support chats, social login buttons. From the page's perspective the iframe is a single rectangle; the form inside it is in a separate document with its own DOM. Playwright's standard locators (the ones every tutorial teaches) only search the main page, so they silently miss anything inside a frame. The test fails the same way as if the form did not exist.

Why a VNC-driven real browser passes

AIVA does not know or care whether a region of the screen comes from the main page or a child frame. The screenshot is one image; the form is one rectangle of pixels; the email box sits below the "Email" text. Vision-based automation traverses frame boundaries for free, because frames are a DOM concept that does not exist in the rendered image.

Playwright context — could this test be fixed in Playwright? 3/5 · Requires every test to be frame-aware

Verdict: solvable, but every test that touches frame content has to be rewritten with explicit frameLocator calls.

The frame-aware version of this test would look like:

const frame = page.frameLocator('iframe[title="login-frame"]');
await frame.getByLabel('Email').fill('[email protected]');
await frame.getByLabel('Password').fill('hunter2');
await frame.getByRole('button', { name: 'Sign in' }).click();
await expect(frame.getByText('Access granted')).toBeVisible();

For a single iframe this is annoying but tractable. In real applications the cost compounds: every Stripe payment field, every Auth0 step, every embedded widget is a separate frame. Some frames are cross-origin (Stripe Elements, for example), at which point Playwright cannot reach in at all — you would need a separate test running against the iframe URL directly, with no shared session.

Net effect: frame-heavy SaaS produces brittle, fragmented test suites where one frame change cascades into many test rewrites.
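
One way to contain that rewrite cost is to write the sign-in steps once against whatever scope holds the form. The following is a minimal sketch, not code from the Bot Arena suite: signIn and the LoginScope interface are hypothetical names, and LoginScope lists only the two locator methods used here. Playwright's Page and FrameLocator both offer getByLabel and getByRole, so either should fit that shape.

```typescript
// Hypothetical structural types: only the members this sketch needs.
interface Fillable { fill(value: string): Promise<void>; }
interface Clickable { click(): Promise<void>; }

interface LoginScope {
  getByLabel(text: string): Fillable;
  getByRole(role: 'button', options: { name: string }): Clickable;
}

// Runs the same three steps against the main page or a frame locator.
async function signIn(scope: LoginScope, email: string, password: string): Promise<void> {
  await scope.getByLabel('Email').fill(email);
  await scope.getByLabel('Password').fill(password);
  await scope.getByRole('button', { name: 'Sign in' }).click();
}
```

A frame-hosted form would then be driven with signIn(page.frameLocator('iframe[title="login-frame"]'), email, password), and a main-page form with signIn(page, email, password). The frame still has to be named somewhere, so this reduces duplication rather than removing the need for frame awareness.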

AIVA context — why this level already passes for AIVA ✓ passes natively
✓ No fix needed — passes by construction

AIVA passes this level natively. AIVA's screenshot includes the iframe contents because the browser composites them into the page exactly like any other element. The image-recognition pipeline sees one form, finds the inputs visually, clicks and types. Frame boundaries do not exist at the pixel level.

This is a major real-world advantage. Payment flows (Stripe, Adyen, Braintree), embedded auth (Auth0, Okta, WorkOS), and most "embedded SDK" patterns ship as iframes — usually cross-origin. Vision-based automation handles them by construction; selector-based testing handles them only after significant per-frame rewrites, and not at all when the frames are cross-origin.

Why it failed — Detection Log

  • info iframe-mounted — The form is in an <iframe srcdoc="..."> with its own document.
  • fail getByLabel-email — page.getByLabel('Email') — runs on the main frame only; the form is in a child frame so the locator never resolves.
  • fail getByText-granted — page.getByText("Access granted") — same problem; the message lives in the child frame.
  • info requires-frame-locator — The test would have to be rewritten to use page.frameLocator("iframe").getByLabel(...), with every assertion and interaction explicitly frame-scoped.
Level 5

Slider verification

· Drag-to-align CAPTCHA
Playwright: failed AIVA: also fails

What the test does

test('Level 5 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-5/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: expect(locator).toBeVisible() failed

Locator:  getByText('Access granted')
Expected: visible
Received: hidden
Timeout:  5000ms
Plain-English explanation
The problem

Slider CAPTCHAs are the dominant anti-bot pattern across the Chinese internet (GeeTest, NetEase, Tencent, Alibaba) and increasingly common in Western anti-bot stacks (Cloudflare interactive, AWS WAF, ticketing platforms). The user is shown a randomised image with a notched gap, and a draggable puzzle piece somewhere else. To pass, drag the piece into the gap. The position of the gap is part of an image — there is no DOM hint for where it is. A test can drag, but only a tool that can SEE the gap knows where to drag to.

Why a VNC-driven real browser passes

A vision-based automation tool screenshots the slider, finds the highlighted target zone in the image, computes its X coordinate, and dispatches an OS-level mouse drag to that exact position. The drag is real — the browser receives a real sequence of mousemove events from a real cursor. The slider sees a human-shaped gesture and unlocks the form.

Playwright context — could this test be fixed in Playwright? 5/5 · Impossible without external vision

Verdict: impossible from inside Playwright. Playwright can drag (mouse.down/move/up at coordinates) but it cannot SEE the target zone.

The only paths a Playwright author has:

  • Take a screenshot via Playwright, pass the image to an external OCR / template-matching service, extract the target X coordinate, dispatch page.mouse.down/move/up. At that point you have built half of AIVA inside your test runner.
  • Use a paid CAPTCHA-solving service (similar to the Turnstile case). The service routes the challenge through real browsers and returns the solved token, at a per-run cost and with a third-party dependency.
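
The first path hinges on turning a screenshot into a target X coordinate. The sketch below is deliberately naive and is not how any real solver works: it assumes the screenshot has already been decoded into a grayscale brightness grid (decoding is not shown) and that the gap shows up as the darkest vertical band. The names Gray and findGapX are made up for illustration.

```typescript
// Brightness grid: rows of pixels, each value 0 (black) to 255 (white).
type Gray = number[][];

// Returns the left edge (x) of the darkest vertical band of width gapWidth.
function findGapX(image: Gray, gapWidth: number): number {
  const width = image[0]?.length ?? 0;
  if (gapWidth <= 0 || gapWidth > width) {
    throw new Error('gapWidth must be between 1 and the image width');
  }
  // Sum brightness per column so each band can be scored cheaply.
  const columnSums = new Array<number>(width).fill(0);
  for (const row of image) {
    for (let x = 0; x < width; x++) columnSums[x] += row[x];
  }
  // Slide a window across the columns and keep the darkest position.
  let windowSum = 0;
  for (let x = 0; x < gapWidth; x++) windowSum += columnSums[x];
  let bestX = 0;
  let bestSum = windowSum;
  for (let x = 1; x + gapWidth <= width; x++) {
    windowSum += columnSums[x + gapWidth - 1] - columnSums[x - 1];
    if (windowSum < bestSum) {
      bestSum = windowSum;
      bestX = x;
    }
  }
  return bestX;
}
```

The returned X would still need to be translated from image coordinates to page coordinates before being handed to page.mouse. A heuristic this simple only suits a clean, axis-aligned gap.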

For real production slider CAPTCHAs (GeeTest, Alibaba, AWS WAF), the gap is also rotated, scaled, and obfuscated with noise, so generic OCR and template matching fail. Vendor-specific solver services are the only working option, and they cost roughly $1-3 per 1,000 solves.

AIVA context — what would need to change in AIVA to pass this
Fix complexity: 3/5 · Moderate — add drag-and-drop primitive to AIVA
Effort: days of work · New interaction primitive · Code change

AIVA currently fails this level because it does not yet have a drag-and-drop interaction primitive. Vision recognition of the target zone is already covered by the existing screenshot pipeline; what is missing is the ability to dispatch a sustained mouse-down → mousemove sequence → mouse-up gesture as a single action.

Adding the primitive is a moderate-sized piece of work — it touches the input-dispatch layer of the VNC control plane and needs a small UX vocabulary for "drag from X to Y at speed Z" in the recorder. Once it lands, every slider CAPTCHA (GeeTest, Alibaba, AWS WAF, Turnstile interactive) and every other drag-shaped interaction (sortable lists, signature pads, file pickers with drag-in) becomes accessible at the same time.
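
The gesture itself could be sketched roughly as follows. This is not AIVA code: MouseLike, dragPath, dragTo, the smoothstep easing, and the default of 20 steps are all assumptions made for illustration, with the step count standing in for the "speed Z" parameter.

```typescript
interface Point { x: number; y: number; }

// Hypothetical minimal pointer interface; not AIVA's actual input layer.
interface MouseLike {
  move(x: number, y: number): Promise<void>;
  down(): Promise<void>;
  up(): Promise<void>;
}

// Intermediate points from `from` to `to`, eased so the gesture starts and ends slowly.
function dragPath(from: Point, to: Point, steps: number): Point[] {
  if (!Number.isInteger(steps) || steps < 1) {
    throw new Error('steps must be a positive integer');
  }
  const path: Point[] = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    const eased = t * t * (3 - 2 * t); // smoothstep easing
    path.push({
      x: from.x + (to.x - from.x) * eased,
      y: from.y + (to.y - from.y) * eased,
    });
  }
  return path;
}

// Press at `from`, move through the eased path, release at `to`, as one action.
async function dragTo(mouse: MouseLike, from: Point, to: Point, steps = 20): Promise<void> {
  await mouse.move(from.x, from.y);
  await mouse.down();
  for (const p of dragPath(from, to, steps)) {
    await mouse.move(p.x, p.y);
  }
  await mouse.up();
}
```

Keeping the path generation separate from the dispatch means the same gesture description could feed a recorder, a VNC input layer, or a test double. Whether a given slider accepts an eased path like this one is not something the sketch can guarantee.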

Why it failed — Detection Log

  • pass inputs-fillable — Email and Password are reachable via getByLabel — those parts work.
  • fail slider-not-solved — The slider knob was never dragged into the target zone — verified flag stays false on submit.
  • fail access-granted-not-shown — expect(getByText("Access granted")).toBeVisible() times out because the form refused submission with "Blocked — verification required".
Level 6

Image-only labels

· No DOM text — labels are pixels
Playwright: failed AIVA: passes

What the test does

test('Level 6 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-6/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: locator.fill: Test timeout of 30000ms exceeded.

Call log:
  - waiting for getByLabel('Email')
  - locator resolved to <no element matched>
Plain-English explanation
The problem

Some sites — historically many banks, brokerages, anti-scrape news sites, and some CAPTCHA prompts — render every visible label as an image, deliberately to defeat scrapers and automated tools. From a human standpoint the form looks perfectly normal: "Email" written above an empty field, "Password" written above another, "Sign in" on the button. From a test's standpoint there is no text anywhere — every "label" is a graphic with empty alt text. Accessibility-based selectors find nothing.

Why a VNC-driven real browser passes

Vision-based automation reads the rendered image with OCR exactly the way a human reads it. It sees "Email" written above a text-input-shaped rectangle and clicks. The fact that the text is an image rather than a DOM text node is invisible to OCR — they're both pixels.

Playwright context — could this test be fixed in Playwright? 4/5 · No semantic anchors; only brittle structural selectors

Verdict: only brittle structural fallbacks remain. The promise of accessibility-driven testing is gone here.

Every label is an SVG/PNG image with empty alt. Playwright's accessibility-based locators return empty:

  • page.getByLabel('Email') — no <label> element exists.
  • page.getByRole('textbox', { name: 'Email' }) — no accessible name on the input.
  • page.getByPlaceholder('Email') — no placeholder.
  • page.getByText('Email') — text is in image pixels, not a DOM text node.

Possible fallbacks:

  1. page.locator('input').nth(0) — works for this layout; breaks on the slightest reorder.
  2. Click at hard-coded pixel coordinates via page.mouse.click(x, y) — exactly the kind of brittle, screen-resolution-dependent code that motivates moving away from selector tests in the first place.
  3. Integrate an OCR library, OCR the screenshot, find the label position, derive coordinates — at which point your test suite has reimplemented visual automation badly.
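
Once an OCR step has produced word boxes, fallback 3 reduces to geometry. The sketch below assumes a hypothetical OcrWord shape (recognised text plus a bounding box in page pixels) and a fixed vertical gap between a label and its input. Neither comes from the Bot Arena source, and clickPointBelowLabel is an illustrative name.

```typescript
// Hypothetical OCR result: recognised text and its bounding box in page pixels.
interface OcrWord { text: string; x: number; y: number; width: number; height: number; }

interface Point { x: number; y: number; }

// Finds the label (case-insensitive) and returns a point `gap` pixels below it,
// horizontally centred on the label, where the input is assumed to sit.
function clickPointBelowLabel(words: OcrWord[], label: string, gap = 20): Point | null {
  const wanted = label.trim().toLowerCase();
  const hit = words.find((w) => w.text.trim().toLowerCase() === wanted);
  if (!hit) return null;
  return { x: hit.x + hit.width / 2, y: hit.y + hit.height + gap };
}
```

The resulting point could be passed to page.mouse.click(point.x, point.y). The fixed gap is the weak spot: it encodes one layout, which is the brittleness this fallback list describes.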

In real production deployments (bank login keypads with shuffled-position digit images), each session also changes the layout — so even nth-child fallbacks decay across runs.

AIVA context — why this level already passes for AIVA ✓ passes natively
✓ No fix needed — passes by construction

AIVA passes this level natively. AIVA's primary input is the rendered screenshot, processed through OCR for text recognition. "Email" is the same to it whether it came from a DOM text node, an inline SVG, a PNG, or pixel-by-pixel canvas painting. The label-and-input-below visual pattern is recognised the same way regardless of how the page was built.

As a category, image-rendered text is everywhere in legacy financial and government software (and increasingly in anti-scraping CAPTCHAs that render even their prompt text as images). DOM-based testing cannot handle it reliably. For vision-based automation it is no different from any other login page.

Why it failed — Detection Log

  • info inputs-present — Real <input> elements exist in the DOM, but they have no <label>, no aria-label, no placeholder, no title.
  • fail getByLabel-email — page.getByLabel('Email') — no <label> element associates with anything.
  • fail getByText-email — page.getByText("Email") — the text "Email" is inside an <img> as SVG, not as a text node.
  • fail getByRole-textbox — page.getByRole("textbox", { name: "Email" }) — no accessible name on the input.
Level 7

Cross-origin iframe

· Form on a different origin
Playwright: failed AIVA: also fails

What the test does

test('Level 7 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-7/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: locator.fill: Test timeout of 30000ms exceeded.

Call log:
  - waiting for getByLabel('Email')
  - locator resolved to <no element matched>
Plain-English explanation
The problem

When you embed a third-party widget like a Stripe card form, an Auth0 hosted login, or a Cloudflare Turnstile challenge, that widget runs in a separate browser-level sandbox loaded from a different origin (a different domain). The browser's same-origin security policy prohibits the host page — and any test framework running in the host page's context — from reading or interacting with the widget's internals. This is by design: it is what stops malicious sites from stealing credit-card details from embedded payment forms. The unavoidable side effect is that automated tests cannot reach in either.

Why a VNC-driven real browser passes

AIVA does not look at the DOM. Its input is the composited screenshot, which the browser draws cross-origin content into exactly like everything else. The Stripe card field, the Auth0 dialog, the Turnstile box — all visible as pixels in one image. AIVA sees one rectangle of pixels with an "Email" label and clicks the input.

Playwright context — could this test be fixed in Playwright? 5/5 · Browser security; literally impossible

Verdict: this is a hard ceiling enforced by the browser security model, not by Playwright.

Cross-origin iframes are sandboxed by the same-origin policy. The browser does not let any JavaScript in the parent page (including Playwright's injected scripts) read or write content in a cross-origin frame. page.frameLocator can list cross-origin iframes but cannot script into them.

For real third-party widgets (Stripe Elements, Auth0, Turnstile) the only supported automation paths are:

  • Use the vendor's test mode + their own SDK. Stripe ships a cardElement.update() helper in test mode; Auth0 has Lock mock APIs.
  • Run a parallel test against the iframe URL directly, with no shared session state with the host page. This breaks any flow that needs the host's auth state.
  • Pay a third-party solver to operate the widget in a real browser and inject the resulting token.

None of these is "your test calling page.getByLabel('Email').fill(...)". The Playwright-native solution does not exist by design.

AIVA context — what would need to change in AIVA to pass this
Fix complexity: 1/5 · Trivial — enable cross-origin iframes in AIVA launch config
Effort: minutes · Browser config · No code change

AIVA currently fails this level — but only because its embedded Chrome blocks cross-origin iframes via its launch configuration. Once cross-origin iframes are allowed in the browser config, the iframe renders normally and AIVA reads the form pixels just like any other page region. The architectural advantage is intact; only a launch-time flag stands in the way.

After the flag flips, this is a major real-world advantage. Payment forms, hosted auth dialogs, CAPTCHA challenges, embedded SDKs — all of which use cross-origin iframes by industry convention — render normally for AIVA. Selector-based testing handles them only via fragile vendor-specific workarounds, or not at all.

Why it failed — Detection Log

  • info iframe-cross-origin — iframe src is a data: URI with an opaque origin — cross-origin to the parent page.
  • fail getByLabel-email — page.getByLabel('Email') — main-frame-scoped, finds nothing.
  • fail frameLocator-blocked — page.frameLocator("iframe").getByLabel("Email") — Playwright refuses to script into a cross-origin frame; browser same-origin policy.
  • fail getByText-granted — expect(getByText("Access granted")).toBeVisible() — the message lives inside the cross-origin frame and is invisible to the parent.
Level 8

Virtual scrolling

· Windowed list — off-screen items are absent from DOM
Playwright: failed AIVA: passes

What the test does

test('Level 8 sign in', async ({ page }) => {
  await page.goto('/selector-resistance/level-8/');
  await page.getByLabel('Email').fill('[email protected]');
  await page.getByLabel('Password').fill('hunter2');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(page.getByText('Access granted')).toBeVisible();
});

What Playwright sees

Error: locator.fill: Test timeout of 30000ms exceeded.

Call log:
  - waiting for getByLabel('Email')
  - locator resolved to <no element matched>
Plain-English explanation
The problem

Performant lists in modern web apps render only the rows currently inside the visible viewport, a technique called virtual scrolling or windowing. It is used by Slack's message history, Notion's database views, Gmail's thread list, and grid and list libraries such as AG Grid, MUI X, and TanStack Virtual. When a test wants to click an item that is 500 rows down, the row simply does not exist in the DOM until the list has been scrolled close to it. Standard test idioms ("find the row, click it") return nothing.

Why a VNC-driven real browser passes

A vision-based automation tool already has a scroll-and-recognise loop built into its pipeline. It scrolls the visible viewport, takes a new screenshot, looks for the target visually, scrolls again, and repeats until the target appears. This is exactly what a human does. AIVA does not need to know that the list is virtualised — it does what humans do.

Playwright context — could this test be fixed in Playwright? 4/5 · Requires bespoke per-list scroll logic

Verdict: solvable with significant per-list bespoke code; the standard test idioms do not work.

For each virtualised list a test interacts with, the test author has to:

  1. Know that the list is virtualised (it might not be obvious from the rendered HTML).
  2. Know the total number of rows and the row height to compute where to scroll.
  3. Dispatch a programmatic scroll on the list container, wait for the new rows to mount, query, repeat.
  4. Alternatively, scroll incrementally and probe for the target row after each step.
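
Step 4 can be sketched as a small probe loop. The VirtualList interface below is a hypothetical adapter that each test would have to implement for its own list (for example with a locator visibility check and a scroll call); the names, step size, and step limit are assumptions, not part of Playwright or of any virtualisation library.

```typescript
// Hypothetical adapter over one virtualised list; a test supplies the implementation.
interface VirtualList {
  // True when the target row is currently mounted in the DOM.
  isTargetMounted(): Promise<boolean>;
  // Scrolls down by `pixels`; resolves to false once the list cannot scroll further.
  scrollBy(pixels: number): Promise<boolean>;
}

// Probe, scroll one step, probe again, until the row is found or the list is exhausted.
async function scrollUntilMounted(
  list: VirtualList,
  stepPx: number,
  maxSteps = 500,
): Promise<boolean> {
  for (let i = 0; i < maxSteps; i++) {
    if (await list.isTargetMounted()) return true;
    const moved = await list.scrollBy(stepPx);
    if (!moved) break;
  }
  // One last probe covers the rows mounted by the final scroll.
  return list.isTargetMounted();
}
```

The step should be no larger than the viewport height, otherwise a row can be skipped between probes. The adapter is where the per-library differences end up.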

All of this is custom code that lives inside the test suite. Different virtualisation libraries (react-window, TanStack Virtual, AG Grid) expose different APIs — there is no portable solution. And many sites virtualise BOTH rows AND columns, multiplying the complexity.

For real apps with very long lists (Gmail, Slack, AG Grid dashboards), the per-test cost compounds: every test that needs to click a non-visible item has its own scroll helper, its own retry logic, its own flake mode.

AIVA context — why this level already passes for AIVA ✓ passes natively
✓ No fix needed — passes by construction

AIVA passes this level natively. Visual automation systems are built around a perception loop: screenshot, look, decide what to do, act, screenshot again. Scrolling a virtualised list is exactly the same as scrolling any other long list — AIVA scrolls a screen, looks for the target, scrolls again if needed. Whether the rows are virtualised, all-DOM, or paginated does not matter — they are all just visible rows on screen at the moment of the screenshot.

As a category, every data-heavy SaaS app (Slack, Notion, Linear, Salesforce, ServiceNow, every CRM and ERP) uses virtualisation. Selector-based testing builds an ever-growing pile of per-list scrollers; vision-based automation does not.

Why it failed — Detection Log

  • info list-virtualised — 1,000 logical accounts; only ~10 visible rows are mounted in the DOM at any moment.
  • fail getByLabel-email — page.getByLabel('Email') — no Email field exists on this page; the email is selected by clicking a row.
  • fail getByText-target — page.getByText("[email protected]") — the row for user-371 is not currently mounted; locator returns empty.
  • info scroll-required — A working test would have to detect virtualisation, compute the row's scroll position, set scrollTop to bring the row into view, then click. This requires bespoke per-list logic.
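
For a list with a fixed row height, computing the row's scroll position is plain arithmetic. The sketch below assumes fixed-height rows and a known row index. The 40 px row height and 400 px viewport in the usage note are invented figures chosen to match roughly ten visible rows out of 1,000, and scrollTopForRow is an illustrative name.

```typescript
// Scroll offset that places row `index` at the top of the viewport,
// clamped so the list never scrolls past its last page of rows.
function scrollTopForRow(
  index: number,
  rowHeight: number,
  totalRows: number,
  viewportHeight: number,
): number {
  if (index < 0 || index >= totalRows) {
    throw new RangeError('row index out of range');
  }
  const maxScrollTop = Math.max(0, totalRows * rowHeight - viewportHeight);
  return Math.min(index * rowHeight, maxScrollTop);
}
```

Under those assumed dimensions, scrollTopForRow(371, 40, 1000, 400) gives 14840. In Playwright the value could be applied to the scroll container with locator.evaluate((el, top) => { el.scrollTop = top; }, value), followed by a wait for the row to mount. Lists with variable row heights need the library's own measurement API instead.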

What changes with VNC AIVA

Point the same tests at the classic AIVA: a real headed Chrome on a Linux host, clicked through VNC at the OS level, with image-based recognition instead of selectors. The five detection levels close down to one or two trivial fixes. Most of the selector-resistance levels are unblocked by construction, because pixel-aware automation does not care whether the DOM is empty or randomised. Levels 5 and 7 still fail, pending a drag-and-drop primitive and a launch-config change respectively.

The difference is not patches, plugins, or stealth tricks. It is that VNC-AIVA is a real browser session driven by real OS-level input, with image recognition that reads the screen instead of the DOM.