kimi-webbridge
name: kimi-webbridge
description: |
Kimi WebBridge lets AI control the user's real browser — navigate, click, type, read, screenshot, and interact with any website using the user's actual login sessions. Use this skill whenever the user wants to interact with websites, automate browser tasks, scrape web content, or perform any action requiring a real browser. Also use when the user mentions "browser", "webpage", "open URL", "screenshot", or asks to read/interact with any website. Use even for simple-sounding browser requests — the daemon handles all complexity.
Kimi WebBridge
Control the user's real browser (with their login sessions) via a local daemon at http://127.0.0.1:10086.
Health check (always do this first)
~/.kimi-webbridge/bin/kimi-webbridge status
Then act on the result:
- `running: true` and `extension_connected: true`: healthy. Proceed with the tool calls below.
- Anything else (command not found, `running: false`, `extension_connected: false`, errors): read `references/operations.md` in this skill directory. It has the install / start / diagnose routing table.
Don't guess fixes here — every non-healthy state is handled in references/operations.md.
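The healthy/unhealthy decision above can be sketched in Python. This is a minimal sketch, assuming `status` prints a single JSON object with boolean `running` and `extension_connected` fields; the exact output shape is not specified here, and `is_healthy` is a hypothetical helper name.

```python
import json


def is_healthy(status_output: str) -> bool:
    """Return True only when both flags are literally true.

    Assumption: the status command prints one JSON object. Anything that
    fails to parse (for example a "command not found" message) counts as
    unhealthy.
    """
    try:
        status = json.loads(status_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(status, dict):
        return False
    return status.get("running") is True and status.get("extension_connected") is True
```

When this returns False, the next step is still to read references/operations.md rather than to attempt a fix.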
Tools
| Tool | Args | Returns | Note |
|---|---|---|---|
| `navigate` | `url`, `newTab` (bool), `group_title` | `{success, url, tabId}` | First call opens a tab (see Tabs). `group_title` sets the group's visible label |
| `find_tab` | `url`, `active` (bool) | `{success, url, tabId}` | Select an already-open tab as the current one (see Tabs) |
| `snapshot` | (none) | `{url, title, tree}` with `@e` refs | Accessibility tree (text); use this to read page content and locate elements |
| `click` | `selector` (`@e` ref or CSS) | `{success, tag, text}` | Synthetic `el.click()` |
| `fill` | `selector`, `value` | `{success, tag, mode}` | Works on `<input>`/`<textarea>` and `[contenteditable]` (ProseMirror/Lexical/Slate). `mode` is `"value"` or `"contenteditable"` |
| `evaluate` | `code` (supports async/await) | `{type, value}` | |
| `screenshot` | `format` (`png`\|`jpeg`), `quality` (0-100), optional `selector` (`@e`/CSS), optional `path` | `{format, path, sizeBytes, mimeType}` | Returns a file path, not base64 (see Screenshots) |
| `network` | `cmd` (`start`\|`stop`\|`list`\|`detail`), `filter`, `requestId` | request/response data | |
| `upload` | `selector`, `files` (string[]) | `{success, fileCount}` | |
| `save_as_pdf` | `paper_format`, `landscape`, `scale`, `print_background`, optional `path` | `{path, sizeBytes, mimeType, pageTitle}` | Renders the current page to PDF and returns a file path (see Save as PDF) |
| `list_tabs` | (none) | `{success, tabs:[{tabId, url, title, active, groupTitle}]}` | Inspect tabs in the current session |
| `close_tab` | (none) | `{success, closed: bool}` | Close the current tab in the session |
| `close_session` | (none) | `{success, closed: int}` | Close all tabs in the session; `closed` is the count. See Sessions for when to call |
Tabs and the current tab
Single-tab tools (snapshot, click, fill, screenshot, save_as_pdf) act on the current tab — the one you most recently opened with navigate or selected with find_tab.
- Opening pages: use `newTab:true` when pages should coexist (comparing, cross-referencing); omit it to send the current tab to a new URL.
- Going back to an earlier tab: call `find_tab` to make a tab you already opened the current one again. Pass the tab's full URL, taken from `list_tabs` or the earlier `navigate` result. A bare root domain (`zhihu.com`) may miss a `www.zhihu.com` tab, so prefer the exact URL. `active:true` picks the tab the user is currently viewing; use it when they say something like "use the X I have open" or "on my current X page" (Chinese: "用我打开的 X" / "在我当前的 X 页面上"). Otherwise the leftmost match wins.
- If `find_tab` returns "no open tab found", the page isn't open; `navigate` with `newTab:true` instead.
curl -s -X POST http://127.0.0.1:10086/command \
-d '{"action":"find_tab","args":{"url":"https://www.zhihu.com","active":true},"session":"camping-research"}'
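The find-then-open fallback described above could look like the following Python sketch. The `send` callable is a hypothetical transport that posts one command (with the task's session already attached) and returns the decoded reply. The sketch also assumes that a failed `find_tab` reply carries the "no open tab found" text in an `error` field, which may not match the real error shape.

```python
from typing import Callable

# Hypothetical transport: (action, args) -> decoded reply dict.
Send = Callable[[str, dict], dict]


def focus_or_open(send: Send, url: str) -> dict:
    """Select an open tab for url, or open it in a new tab if none exists."""
    reply = send("find_tab", {"url": url})
    if reply.get("success"):
        return reply
    # Assumption: the failure text is exposed under an `error` key.
    if "no open tab found" in str(reply.get("error", "")):
        return send("navigate", {"url": url, "newTab": True})
    return reply  # some other failure: surface it unchanged
```

Injecting `send` keeps the fallback logic separate from how commands actually reach the daemon.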
Call Format
Every command carries a top-level session naming the current task — see Sessions below. The examples in later sections omit it only for brevity; in real calls always include it.
curl -s -X POST http://127.0.0.1:10086/command \
-H 'Content-Type: application/json' \
-d '{"action":"navigate","args":{"url":"https://example.com","newTab":true,"group_title":"My task"},"session":"my-task"}'
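For callers that prefer a script over curl, the same request can be sketched with Python's standard library. The endpoint, header, and body layout come from the curl example above; the function names, the timeout value, and the assumption that the daemon replies with a JSON object are illustrative only.

```python
import json
import urllib.request

DAEMON_URL = "http://127.0.0.1:10086/command"  # endpoint from the curl example


def build_body(action: str, args: dict, session: str) -> bytes:
    """Serialize one command; session is a top-level field, not part of args."""
    payload = {"action": action, "args": args, "session": session}
    # ensure_ascii=False keeps non-English group titles readable on the wire.
    return json.dumps(payload, ensure_ascii=False, separators=(",", ":")).encode("utf-8")


def send_command(action: str, args: dict, session: str, timeout: float = 30.0) -> dict:
    """POST the command to the local daemon and decode the reply.

    Assumption: the daemon answers with a JSON object. The 30 s timeout
    is an arbitrary choice for this sketch.
    """
    request = urllib.request.Request(
        DAEMON_URL,
        data=build_body(action, args, session),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `send_command("navigate", {"url": "https://example.com", "newTab": True, "group_title": "My task"}, "my-task")` would mirror the curl command.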
Sessions
One task = one session = one tab group. A session collects every tab this task opens into a single tab group, so the user sees one group representing "what the agent is doing right now". Pass it as a top-level field of the request body (not inside args).
Rules:
- Pick one session name when the task starts, put it on every command, and never change it mid-task.
- One task uses one session — even across multiple sites. Searching on Google and then opening results on three different domains all share the same session and land in the same group. Do not switch session names per site — that is the #1 cause of fragmented tab groups.
- Name the session after the task, not the site or domain, e.g. `camping-research`, `phone-compare`.
- `group_title` is the human-readable label shown on the group in the browser. Write it in the user's language (match the conversation, Chinese or English). Pass it on the first `navigate`; later calls in the same session don't need it.
- Use multiple sessions only when the user asks for several unrelated tasks at once: one session per task.
# First tab of the task: set session + human-readable label (in the user's language)
curl -s -X POST http://127.0.0.1:10086/command \
-d '{"action":"navigate","args":{"url":"https://www.google.com/search?q=tents","newTab":true,"group_title":"Camping gear research"},"session":"camping-research"}'
# Another SITE, SAME task → same session → joins the same group automatically
curl -s -X POST http://127.0.0.1:10086/command \
-d '{"action":"navigate","args":{"url":"https://www.zhihu.com/search?q=tents","newTab":true},"session":"camping-research"}'
# Every later command carries the same session
curl -s -X POST http://127.0.0.1:10086/command \
-d '{"action":"snapshot","args":{},"session":"camping-research"}'
When the task is finished and the user no longer needs these pages, close_session clears the whole group. If they might want to look further (a follow-up question, inspecting a result), deliver your answer first and leave the tabs open — closing too eagerly throws away the work the user can still see.
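One way to make the session rules hard to break is to build every request through a small object that owns the session name. The sketch below is illustrative: `TaskSession` is a hypothetical helper, and it only builds request bodies; sending them is left to the caller.

```python
from typing import Optional


class TaskSession:
    """Builds request bodies that all share one session name."""

    def __init__(self, session: str, group_title: str) -> None:
        self.session = session          # fixed for the whole task
        self.group_title = group_title  # label in the user's language
        self._label_sent = False

    def command(self, action: str, args: Optional[dict] = None) -> dict:
        body_args = dict(args or {})
        # Attach the label once, on the first navigate of the task.
        if action == "navigate" and not self._label_sent:
            body_args["group_title"] = self.group_title
            self._label_sent = True
        return {"action": action, "args": body_args, "session": self.session}
```

Because the session name is set once in the constructor, visiting a second site cannot accidentally start a second tab group.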
Screenshots
The daemon writes the image to disk and returns {format, path, sizeBytes, mimeType} — never base64, since the model can't read raw image bytes. Take the .path and open it with the Read tool to actually see it.
# Default: PNG of the visible viewport, daemon picks a temp path
curl ... -d '{"action":"screenshot","args":{}}'
# Options (each independent): JPEG quality, element-only via @e/CSS selector, custom output path
curl ... -d '{"action":"screenshot","args":{"format":"jpeg","quality":60}}'
curl ... -d '{"action":"screenshot","args":{"selector":"@e123"}}'
A caller-supplied path is honored verbatim (parent dirs created, existing file overwritten) — use a unique name to avoid clobbering. save_as_pdf follows the same rule.
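Since an existing file at the supplied path is overwritten, a caller may want to generate a fresh name for each capture. The following is a minimal sketch; the `webbridge-shots` directory name and both helper names are hypothetical.

```python
import tempfile
import time
import uuid
from pathlib import Path


def unique_output_path(stem: str, suffix: str = ".png") -> str:
    """Return a path under the OS temp dir that is very unlikely to exist already."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    token = uuid.uuid4().hex[:8]  # random part guards against same-second collisions
    folder = Path(tempfile.gettempdir()) / "webbridge-shots"  # hypothetical folder name
    return str(folder / f"{stem}-{stamp}-{token}{suffix}")


def screenshot_command(session: str, stem: str) -> dict:
    """Build a screenshot request that writes to a fresh, caller-chosen path."""
    return {
        "action": "screenshot",
        "args": {"format": "png", "path": unique_output_path(stem)},
        "session": session,
    }
```

The folder does not need to exist beforehand, because parent directories are created when the file is written.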
Prefer snapshot over CSS/JS selectors
snapshot returns interactive elements with @e refs based on semantic role/name. Use them directly with click/fill — they survive CSS class hash changes that break manually-written selectors.
Fall back to evaluate (JS) only when:
- The target has no `@e` ref in the snapshot
- You need attributes not in the snapshot (e.g., `href`)
- You need to dispatch complex event sequences, or scroll
Evaluate Tips
- Always use compact `JSON.stringify(data)`; never add `null, 2` formatting. Indentation and newlines can inflate the response several times over, causing truncation during transmission.
- `evaluate` calls share the page's JS realm, so re-declaring the same `const`/`let` across two calls throws `SyntaxError`. Wrap in an IIFE for a fresh scope: `(() => { const x = ...; return x; })()`.
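Both tips can be applied mechanically when the `code` string is assembled by a script. The sketch below uses hypothetical helper names and assumes the reply's `value` field carries the string returned by the page code.

```python
import json


def wrap_iife(body: str) -> str:
    """Give the JS statements their own function scope for this one call."""
    return "(() => { " + body + " })()"


def decode_value(reply: dict):
    """Parse the compact JSON string carried in the reply's `value` field."""
    return json.loads(reply["value"])


# Example body: collect link targets and return them as compact JSON
# (JSON.stringify is called with a single argument, so no indentation).
LINKS_JS = wrap_iife(
    "const links = [...document.querySelectorAll('a')].map(a => a.href); "
    "return JSON.stringify(links);"
)
```

Sending `LINKS_JS` twice in a row does not re-declare `links` in the page's realm, because each call runs inside its own function scope.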
Text input — use fill
fill handles all three text input shapes. Pass selector (CSS or @e ref) + value:
| Target | What `fill` does | Returned `mode` |
|---|---|---|
| `<input>` / `<textarea>` | Sets `.value` via native setter, fires `input`/`change`. | `"value"` |
| `[contenteditable]` (ProseMirror / TipTap / Lexical / Slate / Quill etc.) | Focuses, selects all existing content, calls `document.execCommand('insertText', ...)` which fires `beforeinput`/`input` with `inputType:'insertText'` and `data:value`. | `"contenteditable"` |
| Other element | Best-effort `.value` + events. | `"value"` |
fill is clear-and-insert: existing content is replaced. For "append to existing text", read the current value via evaluate, concatenate, then fill with the result.
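The read, concatenate, fill sequence for appending could be sketched as follows. `send` is a hypothetical transport that posts one command and returns the decoded reply. The sketch assumes a CSS selector, since `document.querySelector` cannot resolve `@e` refs, and assumes `evaluate` returns the expression's result under `value`.

```python
import json
from typing import Callable

# Hypothetical transport: (action, args) -> decoded reply dict.
Send = Callable[[str, dict], dict]


def append_text(send: Send, css_selector: str, extra: str) -> dict:
    """Read the field's current text with evaluate, then fill old + extra."""
    # json.dumps yields a safely quoted JS string literal for the selector.
    code = (
        "(() => { const el = document.querySelector("
        + json.dumps(css_selector)
        + "); return el ? (el.value ?? el.textContent ?? '') : ''; })()"
    )
    current = send("evaluate", {"code": code}).get("value") or ""
    return send("fill", {"selector": css_selector, "value": current + extra})
```

For a `[contenteditable]` target `el.value` is undefined, so the snippet falls back to `textContent`, which drops any rich formatting.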
Form submit / special keys
There's no separate "press Enter" tool. To submit a form, click the submit button directly (click on the @e ref or selector). To dispatch a key event programmatically (e.g. Escape to close a modal):
{"action":"evaluate","args":{"code":"document.activeElement.dispatchEvent(new KeyboardEvent('keydown',{key:'Escape',bubbles:true}))"}}
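A script that sends several different keys can generate the same payload from a helper. This is a sketch with a hypothetical function name; unlike the shortened example above, it includes the top-level `session` that real calls need.

```python
import json


def key_event_command(key: str, session: str, event_type: str = "keydown") -> dict:
    """Build an evaluate command that dispatches one key event on the focused element."""
    # json.dumps quotes the event type and key as JS string literals.
    code = (
        "document.activeElement.dispatchEvent(new KeyboardEvent("
        + json.dumps(event_type)
        + ",{key:"
        + json.dumps(key)
        + ",bubbles:true}))"
    )
    return {"action": "evaluate", "args": {"code": code}, "session": session}
```

Events dispatched this way are synthetic, so pages that check `event.isTrusted` may ignore them.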
Save the current page as PDF
save_as_pdf renders the current page to PDF and returns the file path. All args optional:
- `paper_format`: `letter` (default) | `a4` | `legal` | `a3` | `tabloid`
- `landscape`: `false` (default)
- `scale`: `1.0` (default), range `[0.1, 2.0]`
- `print_background`: `true` (default), keeps background colors
- `path`: caller-supplied output path; if absent, the daemon picks a default under the OS temp dir using the page title as the filename
path semantics match screenshot: written verbatim, parent dirs auto-created, existing files overwritten.
Decoded PDF cap is 100 MB. Above that the daemon refuses; reduce scale or split the page.
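Checking the documented ranges before sending avoids a round trip for obviously invalid arguments. The sketch below is a hypothetical client-side check; how the daemon itself reacts to out-of-range values is not described here.

```python
from typing import Optional

PAPER_FORMATS = ("letter", "a4", "legal", "a3", "tabloid")


def pdf_args(
    paper_format: str = "letter",
    landscape: bool = False,
    scale: float = 1.0,
    print_background: bool = True,
    path: Optional[str] = None,
) -> dict:
    """Validate save_as_pdf arguments against the documented values and build args."""
    if paper_format not in PAPER_FORMATS:
        raise ValueError(f"unsupported paper_format: {paper_format!r}")
    if not 0.1 <= scale <= 2.0:
        raise ValueError(f"scale must be within [0.1, 2.0], got {scale}")
    args = {
        "paper_format": paper_format,
        "landscape": landscape,
        "scale": scale,
        "print_background": print_background,
    }
    if path is not None:
        args["path"] = path  # omitted otherwise so the daemon picks its default
    return args
```

The result can be used as the `args` object of a `save_as_pdf` command.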
Known limitations
- Sites that strictly check `event.isTrusted` (some banking portals, captcha challenges) reject `fill` and `click` because both go through DOM-level synthetic events (`isTrusted=false`). This is a product boundary, not a bug: no automation primitive that runs on the user's machine without stealing OS focus can produce trusted events on these sites.
- Cross-origin iframes: `fill`, `click`, `evaluate`, and `snapshot` operate on the top frame. If a target element lives in a same-page iframe from a different origin (e.g. embedded sandbox demos), navigate to the iframe's URL directly instead.
Versions
Daemon, extension, and this skill share a 1:1 version string. Read the daemon and extension versions via:
~/.kimi-webbridge/bin/kimi-webbridge status
# {"version":"<daemon>", "extension_version":"<extension>"}
If a tool returns an error containing "Please update the Kimi WebBridge extension", the user's extension is older than this skill. Tell the user:
Please update the Kimi WebBridge browser extension and try again: https://kimi.com/features/webbridge
Don't retry the failed tool. Don't auto-switch skill versions based on extension_version — the pairing protocol isn't finalized.
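The no-retry rule can be enforced in a client by checking each reply for the update marker before any retry logic runs. This sketch assumes the error text is exposed under an `error` field, which is a guess about the reply shape. The function name is hypothetical and the notice is an English rendering of the message for the user.

```python
from typing import Optional

UPDATE_MARKER = "Please update the Kimi WebBridge extension"
UPDATE_URL = "https://kimi.com/features/webbridge"


def update_notice(reply: dict) -> Optional[str]:
    """Return a message for the user when the reply signals an outdated extension.

    A non-None result means: show the message and stop; do not retry the tool.
    """
    # Assumption: the error text lives under an `error` key.
    if UPDATE_MARKER in str(reply.get("error", "")):
        return f"Please update the Kimi WebBridge browser extension and try again: {UPDATE_URL}"
    return None
```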