Preprint 2026 · Instance Goal Navigation · Cost-Aware Interaction · Zero-shot MLLM Agent

Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

We recast interactive Instance Goal Navigation as cost-sensitive uncertainty reduction: an agent should ask the question whose answer reduces the most navigation uncertainty relative to its cost. We contribute a data-derived question taxonomy, a Weighted Success Rate benchmark, and TANDEM, a modular zero-shot MLLM navigator.

Xunyi Zhao¹,²,*, Sihao Lin¹,²,*, Gengze Zhou¹, Zerui Li¹, Shijie Li³, Wei Tao⁴, Jiajun Liu²,⁵, Qi Wu¹,²,†

¹Adelaide University · ²Responsible AI Research Centre, Australian Institute for Machine Learning · ³Institute for Infocomm Research (I²R), A*STAR · ⁴iMotion · ⁵CSIRO Data61

* Equal contribution. † Corresponding author.


35.3 SR@1.5 · best zero-shot agent

+9.3 SR@1.5 over the next-best prior agent

22,905 eligible episodes · 262 scenes

4 cost-weighted question types

Overview

When asking is allowed, the cost of asking should count

A household robot asked to "bring me the cup" faces uncertain room boundaries, several same-category instances, and routes to different valid-looking candidates. Talking to an oracle disambiguates — but only if the agent asks the right question at the right price.

Reframing

Cost-sensitive

Interactive IGN becomes uncertainty reduction per unit cost, not raw success at any number of queries.

Taxonomy

Four typed questions

Appearance, region, route, and confirmation — with penalties mined from human navigation corpora.

Benchmark

Weighted Success Rate

An Isaac Sim benchmark with grounded typed oracles and a metric that discounts success by query cost.

Agent

TANDEM

A modular zero-shot navigator that separates planning, grounding, metric execution, and cost-aware asking.

The Problem

When to ask, what to ask, and how to use the answer

Prior interactive navigation lets agents query an oracle, but the space of questions is hand-designed and the cost of asking is absent or uniform. Under cost-agnostic evaluation, an agent can lift Success Rate just by repeatedly asking high-information route questions — raw SR cannot separate efficient disambiguation from heavy oracle use.

We model the hidden target state S and value a candidate question q by its expected information gain — how much it shrinks the agent's uncertainty about the target. The agent should prefer questions that reduce uncertainty substantially relative to their cost.

The posterior over real targets in real scenes is unobservable, so we use heterogeneous human-authored navigation corpora (R2R, REVERIE, RxR, CVDN, SOON) as an empirical proxy for which linguistic cues typically reduce target and spatial ambiguity — yielding a reproducible, data-derived prior for pricing each question type.

Route-level questions stay available because they are useful — but their higher information value is reflected in their cost.

A fixed-schema scorer labels each corpus record with cues, evidence spans, importance, and rank.

EIG(q) = H[p_t(S)] − 𝔼_{a ∼ p(a | q)} [ H[p_t(S | q, a)] ]

Expected information gain: prior entropy over the target minus the entropy expected after receiving answer a. Ask when this gain outweighs the question's price.
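The definition above can be made concrete for a small discrete case. The following is a minimal sketch, assuming the target state S ranges over a handful of candidate instances and each question has a finite set of answers. The function names and the toy numbers are illustrative assumptions, not part of the released system.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def expected_information_gain(prior, likelihood):
    """EIG(q) = H[p(S)] - E_a H[p(S | q, a)] for a discrete target state.

    prior[s]         : current belief p_t(S = s)
    likelihood[a][s] : p(a | q, S = s), one row per possible answer a
    """
    eig = entropy(prior)
    for row in likelihood:
        joint = [l * p for l, p in zip(row, prior)]
        p_a = sum(joint)                      # marginal p(a | q)
        if p_a == 0:
            continue                          # this answer can never occur
        posterior = [j / p_a for j in joint]  # Bayes update p_t(S | q, a)
        eig -= p_a * entropy(posterior)
    return eig

# Hypothetical toy case: four equally likely candidate cups and a
# yes/no question whose answer splits them two against two.
prior = [0.25, 0.25, 0.25, 0.25]
likelihood = [
    [1.0, 1.0, 0.0, 0.0],  # answer "yes"
    [0.0, 0.0, 1.0, 1.0],  # answer "no"
]
gain = expected_information_gain(prior, likelihood)  # 2 bits prior, 1 bit left
```

In this toy case the question removes one of the two bits of uncertainty. A cost-aware agent would then weigh that gain against the price of the question type before asking.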

Question Taxonomy

Four question types, four data-derived prices

Mined cues are grouped by answer semantics into four interaction-level question types. Each cue's utility combines its average importance, its high-gain fraction, and how often it ranks first; these are rescaled into locked per-type penalties w_t. More informative types cost more and are discounted harder when measuring autonomous navigation.

Type 1

Appearance

Reduces uncertainty over target identity among same-category instances.

Penalty w = 0.182

Type 2

Region

Reduces uncertainty over the room or region to search.

Penalty w = 0.162

Type 3

Direction / Route

Reduces uncertainty over the branch, corridor, or layout direction to follow. The most informative — and the priciest.

Penalty w = 0.240

Type 4

Confirmation

Verifies whether a visible candidate is the target. Cheapest, since it only checks one hypothesis.

Penalty w = 0.103

Method · TANDEM

Two-stage navigation with disentangled planning and metric grounding

TANDEM decomposes interactive IGN into cost-aware semantic planning, visual grounding, constrained oracle interaction, and deterministic metric execution — keeping the MLLM's role qualitative and the metric quantities auditable.

Figure: TANDEM architecture. **TANDEM** decomposes interactive instance-goal navigation into two coupled stages. **(Left)** Question Utility Modelling derives the four question types and locked per-type costs from human navigation corpora. **(Right)** The Agent consumes panoramic RGB, maintains a Fact Base and Spatial Memory, and a cost-aware policy decides when and which type of question to ask the Oracle; the chosen direction is grounded to one cell on an 8×8 grid and projected to a world coordinate by a deterministic executor.

Plan (semantic)

A Navigator MLLM reasons over the 8-view panorama, goal, history, question budget, Fact Base, and Spatial Memory, then emits one decision: ASK, MOVE, or STOP — never raw metric state.

Ground (visual)

A Grounder maps the chosen direction and place-based subgoal to a single cell on an 8×8 floor-plane grid within the selected 90° view.

Execute (metric)

A deterministic executor back-projects the cell to a world-frame waypoint, casts forward rays for collisions, and writes blocked directions into Spatial Memory.
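To illustrate the Ground and Execute steps, here is a minimal sketch of turning one cell of the 8×8 grid into a world-frame waypoint. The page does not give the grid geometry, so the layout used here is an assumption: columns sweep the 90° view from left to right, rows run from far (row 0) to near (row 7), and `max_range` is a hypothetical reach in metres. Collision ray casting and Spatial Memory updates are not covered.

```python
import math

def cell_to_waypoint(row, col, agent_xy, view_yaw,
                     grid=8, fov_deg=90.0, max_range=4.0):
    """Map a grid cell to a world-frame (x, y) waypoint.

    Assumed conventions (not from the paper): yaw is measured
    counter-clockwise from the +x axis, col 0 is the left edge of the
    view, and row 0 is the farthest band of the floor plane.
    """
    # Bearing of the cell centre relative to the view centre (left is positive).
    frac = (col + 0.5) / grid
    bearing = math.radians(fov_deg) * (0.5 - frac)
    # Range of the cell centre; the last row is closest to the agent.
    dist = max_range * (grid - row - 0.5) / grid
    heading = view_yaw + bearing
    return (agent_xy[0] + dist * math.cos(heading),
            agent_xy[1] + dist * math.sin(heading))

# The two centre columns of the nearest row mirror each other about the view axis.
left = cell_to_waypoint(7, 3, (0.0, 0.0), 0.0)
right = cell_to_waypoint(7, 4, (0.0, 0.0), 0.0)
```

Because the mapping is a pure function of the cell index and the agent pose, the same MLLM output always yields the same waypoint, which is what keeps the metric side auditable.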

Cost-aware interaction & a typed oracle

At every Planner call the agent sees the remaining budget, the relative penalties of the four types, the Fact Base, and Spatial Memory. If it asks, no movement happens that step; the controller charges the type's penalty and stores the answer for later calls, so duplicate or low-value questions compete against information already resolved.
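The bookkeeping described above can be sketched as a small controller. This is an illustrative sketch only: the penalties are the four values listed in the taxonomy, but `InteractionState`, `handle_ask`, the count-based budget, and the oracle callable are hypothetical names and assumptions.

```python
from dataclasses import dataclass, field

# Per-type penalties from the question taxonomy.
PENALTY = {"appearance": 0.182, "region": 0.162,
           "route": 0.240, "confirmation": 0.103}

@dataclass
class InteractionState:
    budget: int                 # remaining questions (assumed to be a count)
    cost: float = 0.0           # accumulated penalty for this episode
    fact_base: dict = field(default_factory=dict)

def handle_ask(state, qtype, question, oracle):
    """Process an ASK decision: no movement, charge the penalty, store the answer."""
    if state.budget <= 0:
        return None             # budget exhausted, the question is not asked
    answer = oracle(qtype, question)
    state.budget -= 1
    state.cost += PENALTY[qtype]
    # Stored answers are shown to later Planner calls, so a duplicate
    # question has to compete with information already resolved.
    state.fact_base.setdefault(qtype, []).append((question, answer))
    return answer
```

A Planner loop would call `handle_ask` only when the decision is ASK and skip the executor for that step.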

Answers grounded, never leaked

Type 1 uses a sanitized appearance description; Type 2 uses the room-object graph; Type 3 summarizes a front-view route clip and occupancy metadata as a natural-language hint — but never coordinates, headings, or step counts; Type 4 verifies a visible candidate within 3 m. Interaction informs, it does not hand off navigation.

Benchmark

Controllable ambiguity, grounded oracles, a cost-aware metric

Built in Isaac Sim from USD scenes with a consistent object–room–pose–relation graph. Each episode gives a coarse goal such as "Find a laptop" and asks the agent to identify the intended instance among same-category distractors — without any human route trajectory.

Figure: **Benchmark statistics.** Episode distribution by difficulty, distractor room and instance counts, target object categories, goal-distance distribution, and goal-room distribution.

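WSR_e(τ) = 1{d_e ≤ τ} · exp(− Σ_i w_{type(q_{e,i})})

Weighted Success Rate: a successful episode is discounted by the accumulated per-type query costs. Here d_e is the agent's final distance to the target in episode e, τ is the success threshold, and q_{e,i} is the i-th question asked in that episode. Repeated questions are charged repeatedly; a successful no-question episode reduces to plain success; failures score zero.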
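As a worked illustration of the metric, the sketch below scores individual episodes and averages them. The penalties are the four per-type values from the taxonomy; the function names and the example episodes are hypothetical.

```python
import math

# Per-type penalties w_t, keyed by question type 1..4.
PENALTY = {1: 0.182, 2: 0.162, 3: 0.240, 4: 0.103}

def weighted_success(final_distance, question_types, tau=1.5):
    """Per-episode WSR: success indicator times exp(-total query cost)."""
    if final_distance > tau:
        return 0.0                                # failures score zero
    total_cost = sum(PENALTY[t] for t in question_types)
    return math.exp(-total_cost)

def weighted_success_rate(episodes, tau=1.5):
    """Mean per-episode WSR over (final_distance, question_types) pairs."""
    return sum(weighted_success(d, q, tau) for d, q in episodes) / len(episodes)

# Hypothetical episodes: (final distance in metres, question types asked).
episodes = [
    (0.8, []),         # success without asking, scores 1.0
    (1.2, [2, 3, 4]),  # success after three questions, scores exp(-0.505)
    (2.4, [3, 3]),     # failure, scores 0.0 whatever was asked
]
wsr = weighted_success_rate(episodes)
```

The second episode shows the discount at work: one region, one route, and one confirmation question cost 0.505 in total, so the success counts for roughly 0.60 instead of 1.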

22,905 eligible episodes across 262 scenes, 70 target categories, and 11 normalized goal-room labels.

Deterministic difficulty from distractor count, same-room distractors, contextual ambiguity, and initial path distance — binned into a balanced 30 : 40 : 30 easy/medium/hard split.

Headline metric SR@1.5 m (instance-level), with SR@0.5 m as a strict check and SR@3 m for comparability.

Typed oracle answers are grounded to the target instance and hidden when they would leak the answer.
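The 30 : 40 : 30 difficulty split mentioned above can be sketched as rank-based binning. The page does not say how the four factors are combined into a difficulty score, so the sketch takes a precomputed score per episode as input. `split_by_difficulty` and the example scores are hypothetical.

```python
def split_by_difficulty(scores):
    """Label episodes easy/medium/hard in a 30:40:30 ratio by score rank.

    Ties are broken by episode index so the assignment is deterministic.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: (scores[i], i))
    easy_end = (3 * n) // 10      # first 30% of the ranking
    medium_end = (7 * n) // 10    # next 40%
    labels = [None] * n
    for rank, i in enumerate(order):
        if rank < easy_end:
            labels[i] = "easy"
        elif rank < medium_end:
            labels[i] = "medium"
        else:
            labels[i] = "hard"
    return labels

# Hypothetical difficulty scores for ten episodes.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
labels = split_by_difficulty(scores)
```

Binning by rank instead of fixed score thresholds keeps the split balanced regardless of how the scores are distributed.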

Results

Cost-aware interaction wins — most on hard episodes

All methods share the same simulator, observation interface, oracle protocol, and scoring rule, with Qwen3.5-8B as the default backbone. TANDEM reaches 35.3 SR@1.5 and 21.4 Weighted SR, beating the next-best prior agent by 9.3 SR points.

Agent / Variant	SR@1.5	OSR@1.5	SR@0.5	SR@3	NE ↓	|Q|	WSR@1.5	WSR@3
Naive QA baselines (unconstrained interaction)
GTA	26.7	39.3	8.8	50.6	3.49	8.25	6.4	12.2
COIN	18.6	29.2	5.9	36.8	4.63	9.06	3.7	7.3
MapGPT	15.9	28.3	6.7	32.6	4.89	8.87	3.6	7.1
NavGPT	14.8	24.6	5.1	28.9	5.23	9.02	3.1	6.1
Cost-aware QA protocol (ours)
GTA	26.0	38.6	10.5	49.3	3.55	3.44	15.9	30.3
MapGPT	16.5	27.6	6.6	31.5	4.97	3.28	10.2	19.4
NavGPT	14.3	24.1	5.8	27.8	5.31	3.21	8.9	16.9
COIN	13.0	22.0	2.0	27.4	5.20	2.61	9.2	19.3
TANDEM	35.3	50.0	14.2	66.5	2.68	3.57	21.4	40.8
— w/o Spatial Memory	28.0	43.5	8.8	57.8	3.15	3.01	18.5	36.0
— w/o QA	20.0	34.0	3.2	49.2	3.35	—	14.2	34.2
— w/o Spatial Memory & QA	14.0	25.0	1.5	34.2	5.15	—	10.0	23.8

Overall metrics on the 500-episode evaluation subset, Qwen3.5-8B Navigator. Naive QA lets agents ask freely (large |Q|, low WSR); the cost-aware protocol cuts |Q| by more than half and raises WSR for every baseline.

Navigator Backbone	SR@1.5	OSR@1.5	SR@3	NE ↓	TL	WSR@1.5	WSR@3
Closed-source MLLMs
GPT-5.4	41.6	56.4	74.8	2.21	23.21	25.0	45.3
Gemini3-Flash	41.1	59.4	73.7	2.18	19.35	23.8	44.5
Open-source MLLMs
Qwen3.5-8B (default)	35.3	50.0	66.5	2.68	23.87	21.4	40.7
Qwen3-VL-4B	23.9	39.8	51.8	2.97	24.32	17.7	37.4
Qwen3.5-4B	22.0	37.9	47.2	3.21	24.72	13.4	29.5
Gemma-e4B	19.3	34.2	41.4	3.59	25.36	12.0	26.2
InternVL3.5-4B	17.8	31.8	38.2	3.80	25.71	11.2	24.4

Performance scales with backbone strength on both SR and Weighted SR — stronger Navigators improve through better navigation, not through extra querying.

Method	CVDN GP ↑ (Val)	CVDN GP ↑ (Test)	SOON Val unseen SR ↑	SOON Val unseen SPL ↑	REVERIE Val unseen SR ↑	REVERIE Val unseen SPL ↑
HAMT	5.13	5.58	—	—	33.0	30.2
DUET	—	—	36.3	22.6	47.0	33.7
AutoVLN	—	—	41.0	30.7	55.9	40.9
GOAT	—	—	40.4	28.1	53.4	36.7
ScaleVLN	6.12	6.97	—	—	57.0	41.8
NaviLLM	6.16	7.90	38.3	29.2	42.2	35.7
SAME	6.94	7.07	36.1	25.4	46.4	36.1
TANDEM	8.15	8.50	45.2	33.5	62.4	45.6

A transfer check on discrete-environment benchmarks. Despite substantial differences in instructions, splits, and protocols, TANDEM achieves the highest score in every reported column.

The Aha Moment

Quantifying when and where TANDEM asks

Beyond whether asking helps, the benchmark lets us see how interaction reduces uncertainty: the temporal distribution of question types, the physical contexts of Type-3 spatial questions, and how much search space each answer removes.

Figure: **Temporal and spatial patterns of interaction** for the full TANDEM agent. When the agent asks, where Type-3 questions arise, and how question counts and uncertainty reduction scale with episode difficulty.

When it gets curious

Identity first

Appearance (Type 1) and region (Type 2) cluster in the first 30–40% of an episode, grounding target identity and goal region before committing to a path. Confirmation (Type 4) is bimodal, peaking near 80%.

When it seeks help

At junctions

Type-3 questions concentrate at long corridors, multi-way junctions, and dead-ends — only 7% occur in closed rooms. 65% ask about the goal room's position, 25% about room relations, just 10% are raw route requests.

Interaction scales with uncertainty

ΔU: 21.9% → 47.5%

Average questions rise from 2.32 (easy) to 5.33 (hard), driven by Type 4 (0.05→2.60). A Type-3 answer compresses the explored search area most when layout ambiguity is highest.

Case Studies

One well-placed question flips the trajectory

Paired episodes where full TANDEM succeeds while the no-QA ablation fails under the same start and target. The oracle returns a natural-language route hint — never coordinates — so the gain refines the agent's coarse spatial prior rather than outsourcing navigation.

Figure: **Effect of spatial interaction.** A single Type-3 question at the marked junction redirects the agent toward the correct wing of the layout; it then quickly reaches the target room, instead of exhausting uncertain exploratory paths.

Figure: **Staircase-junction case.** The same paired structure on a multi-floor layout: the route hint redirects the agent to the correct wing of the second floor.

Cite

BibTeX

arXiv identifier to be added on release.

@article{zhao2026askwhenitpays,
  title   = {Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation},
  author  = {Zhao, Xunyi and Lin, Sihao and Zhou, Gengze and Li, Zerui and
             Li, Shijie and Tao, Wei and Liu, Jiajun and Wu, Qi},
  journal = {arXiv preprint},
  year    = {2026}
}