VLM-enriched object detection based navigation and fixed semantic navigation#967
Conversation
Greptile Overview
Greptile Summary
This PR adds VLM-enriched object detection for precise navigation and fixes performance issues in semantic navigation.
Key Changes
New Navigation Skills:
- navigate_to_detected_object(): Navigate to objects detected by the vision system using an ObjectDB lookup
- list_detected_objects(): Query and list all detected objects for situational awareness
VLM Enrichment:
- ObjectDB now uses VLM to enrich YOLO labels (e.g., "person" → "person in white shirt")
- VLM enrichment occurs on object creation and re-enriches every 10 detections
- Fallback to YOLO labels when VLM fails or is disabled
Performance Fixes:
- RPC methods now return lightweight dicts instead of full Object3D objects (fixes 120s timeouts)
- SpatialMemory no longer fetches images from visual memory during queries
- Commented out verbose logging
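The RPC fix amounts to serializing only the scalar fields callers need before crossing the process boundary. A minimal sketch, where the field names and the `SimpleNamespace` stand-in are assumptions rather than the real `Object3D` API:

```python
from types import SimpleNamespace

def to_rpc_dict(obj) -> dict:
    """Serialize only the scalar fields callers need; drop images/points."""
    return {
        "name": obj.name,
        "pos_x": obj.pos_x,
        "pos_y": obj.pos_y,
        "frame_id": obj.frame_id,
    }

# Example: a heavy object shrinks to a few floats and strings on the wire.
heavy = SimpleNamespace(
    name="person in white shirt", pos_x=1.2, pos_y=3.4,
    frame_id="world", image=b"\x00" * 1_000_000,  # payload we must not send
)
light = to_rpc_dict(heavy)
```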
Frame ID Changes:
- Changed TF frame from "map" to "world" across navigation and spatial perception modules
New Blueprint:
- Added agentic_detection blueprint combining detection, spatial memory, and agent capabilities
Critical Issue
The Object3D initialization never sets yolo_label from detection.name, causing the entire VLM enrichment and object tracking system to fail with None values.
Confidence Score: 1/5
- Critical bug will cause object detection and navigation to fail completely
- Object3D.__init__ never initializes yolo_label from detection.name, leaving it as None. This breaks: (1) object tracking matching logic that compares yolo_label, (2) VLM enrichment that uses yolo_label in prompts and fallbacks, (3) object name assignment that falls back to yolo_label. The entire object detection navigation feature will fail.
- dimos/perception/detection/moduleDB.py - Missing yolo_label initialization in Object3D.__init__
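A one-line sketch of the fix the review describes, with hypothetical minimal class shapes standing in for the real Detection3DPC and Object3D:

```python
class Detection:  # hypothetical stand-in for Detection3DPC
    def __init__(self, name: str):
        self.name = name

class Object3D:
    def __init__(self, detection: Detection):
        self.yolo_label = detection.name  # the line the review says is missing
        self.name = detection.name        # display name, later VLM-enriched
```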
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| dimos/perception/detection/moduleDB.py | 1/5 | Critical bug: Object3D.__init__ never initializes yolo_label from detection.name, causing None values throughout VLM enrichment and object tracking logic |
| dimos/agents/skills/navigation.py | 4/5 | Added navigate_to_detected_object and list_detected_objects skills. Changed frame_id from "map" to "world" throughout |
| dimos/perception/spatial_perception.py | 4/5 | Changed SpatialMemory base class to SkillModule, updated frame_id from "map" to "world", added 1-second sleep in start(), improved error handling |
| dimos/agents_deprecated/memory/spatial_vector_db.py | 5/5 | Commented out image fetching and logging to fix RPC timeouts (performance optimization) |
| dimos/robot/all_blueprints.py | 5/5 | Added new blueprint "unitree-go2-agentic-detection" to registry |
| dimos/robot/unitree_webrtc/unitree_go2_blueprints.py | 5/5 | Added agentic_detection blueprint combining detection, spatial_memory, utilization, llm_agent, and common agentic components |
Sequence Diagram
sequenceDiagram
participant User
participant Agent
participant NavigationSkill
participant ObjectDB
participant VLM
participant Navigator
User->>Agent: "Navigate to person in white shirt"
Agent->>NavigationSkill: navigate_to_detected_object("person in white shirt")
NavigationSkill->>ObjectDB: lookup("person in white shirt")
Note over ObjectDB: Search objects by name<br/>(matches VLM-enriched labels)
alt Object found
ObjectDB-->>NavigationSkill: [{"name": "person in white shirt", "pos_x": 1.2, "pos_y": 3.4, ...}]
NavigationSkill->>Navigator: set_goal(PoseStamped)
Navigator-->>NavigationSkill: Navigation complete
NavigationSkill-->>Agent: "Successfully navigated to 'person in white shirt'"
else Object not found
ObjectDB-->>NavigationSkill: []
NavigationSkill-->>Agent: "No detected object matching..."
end
Note over User,Navigator: Parallel: Detection & Enrichment Flow
participant Camera
participant Detection3D
Camera->>Detection3D: Image + PointCloud
Detection3D->>ObjectDB: Detection3DPC (name="person")
alt New object
ObjectDB->>ObjectDB: create_new_object()
Note over ObjectDB: BUG: yolo_label not set!<br/>Should be: yolo_label = detection.name
ObjectDB->>VLM: _enrich_with_vlm(obj)
Note over VLM: Uses obj.yolo_label (None!)<br/>in prompt and fallbacks
VLM-->>ObjectDB: rich_label
ObjectDB->>ObjectDB: Store as Object3D<br/>(name=rich_label, yolo_label=None)
else Existing object
ObjectDB->>ObjectDB: add_to_object()
alt Every 10 detections
ObjectDB->>VLM: Re-enrich
VLM-->>ObjectDB: Updated rich_label
end
end
dimos/agents/skills/navigation.py
Outdated
[P3] Bare except Exception: swallows the actual error. Consider logging the exception for debugging: except Exception as e: logger.error(f"Error querying ObjectDB: {e}") before returning the error message.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
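The suggested pattern, sketched with a generic callable standing in for the real ObjectDB RPC (not the actual dimos signature): log the exception, then return the error message.

```python
import logging

logger = logging.getLogger(__name__)

def query_object_db(lookup, query: str) -> str:
    """Call an ObjectDB lookup, logging failures instead of swallowing them."""
    try:
        return str(lookup(query))
    except Exception as e:
        logger.error(f"Error querying ObjectDB: {e}")
        return f"Error querying ObjectDB: {e}"
```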
Additional Comments (1)
Greptile Overview
Greptile Summary
This PR introduces VLM-enriched object detection for navigation, enabling the robot to navigate to specific objects with rich descriptions (e.g., "person in white shirt" instead of just "person"). Key improvements include:
New Features:
- VLM enrichment pipeline that enhances YOLO labels with detailed descriptions using Qwen VLM
- Two new navigation skills: navigate_to_detected_object() for object-based navigation and list_detected_objects() for environment awareness
- New agentic_detection blueprint combining detection, navigation, spatial memory, and agent capabilities
- Lightweight RPC methods (lookup, get_all_detected_objects) that return dicts instead of full Object3D instances
Performance Fixes:
- RPC timeout fix: returning lightweight dicts reduces response time from 120s to 0.2s
- SpatialMemory improvements: TF timing fixes and image fetching removal reduce latency from 120s to 0.15s
- Reduced logging verbosity in spatial memory components
Architecture Changes:
- Frame reference migration: Changes from "map" frame to "world" frame across navigation and spatial perception
- SpatialMemory now extends SkillModule instead of Module
- Object tracking now separates YOLO labels (for matching) from VLM labels (for display)
The changes enable more precise object-based navigation by allowing agents to distinguish between similar objects based on visual characteristics.
Confidence Score: 1/5
- Critical runtime error will occur when agent_encode() is called
- The PR contains a P0 blocking bug in moduleDB.py line 311 where len(obj.detections) is called on an integer field. This will cause a TypeError at runtime whenever the agent_encode() method is invoked. Additionally, there are P1 type safety issues with potential None returns that violate the function contract. While the VLM enrichment feature and performance improvements are valuable, the critical bug must be fixed before merging.
- dimos/perception/detection/moduleDB.py requires immediate attention due to the TypeError bug in agent_encode()
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| dimos/agents/skills/navigation.py | 3/5 | Adds two new navigation skills (navigate_to_detected_object, list_detected_objects) and changes frame references from "map" to "world". Contains typos in error messages. |
| dimos/perception/detection/moduleDB.py | 1/5 | Major changes: VLM enrichment for object labels, new RPC methods (lookup, get_all_detected_objects), and lightweight dict returns. Critical bug: len() called on int in agent_encode(). |
| dimos/perception/spatial_perception.py | 4/5 | Changes SpatialMemory to extend SkillModule, adds 1s startup delay, changes frame references from "map" to "world", reduces logging verbosity. Minor warning message inconsistency. |
| dimos/agents_deprecated/memory/spatial_vector_db.py | 4/5 | Comments out logging and image retrieval from visual memory to improve performance. |
| dimos/robot/unitree_webrtc/unitree_go2_blueprints.py | 5/5 | Adds new agentic_detection blueprint combining detection with navigation, spatial memory, and agent capabilities. |
Sequence Diagram
sequenceDiagram
participant Agent
participant NavigationSkill
participant ObjectDB
participant VLM as Qwen VLM
participant Detection as YOLO Detection
participant Nav as Navigation
Detection->>ObjectDB: Detection3DPC (YOLO label)
ObjectDB->>ObjectDB: Check closest_object()
alt New Object
ObjectDB->>VLM: _enrich_with_vlm(obj)
VLM-->>ObjectDB: Rich description
ObjectDB->>ObjectDB: Create Object3D<br/>(yolo_label + vlm_label)
else Existing Object (every 10 detections)
ObjectDB->>ObjectDB: Add to existing
ObjectDB->>VLM: _enrich_with_vlm(obj)
VLM-->>ObjectDB: Updated description
end
Agent->>NavigationSkill: "Navigate to person in white"
NavigationSkill->>ObjectDB: lookup("person in white")
ObjectDB-->>NavigationSkill: [dict with pos_x, pos_y, frame_id]
NavigationSkill->>NavigationSkill: Create PoseStamped
NavigationSkill->>Nav: _navigate_to(goal_pose)
Nav-->>NavigationSkill: Success/Failure
NavigationSkill-->>Agent: "Successfully navigated to..."
Agent->>NavigationSkill: "What do you see?"
NavigationSkill->>ObjectDB: get_all_detected_objects()
ObjectDB-->>NavigationSkill: [list of dicts]
NavigationSkill-->>Agent: "I've detected N objects..."
[P0] obj.detections is an int, not a list. Calling len() on it will cause TypeError: object of type 'int' has no len() at runtime. Should be if obj.detections < 4: instead.
- if len(obj.detections) < 4:
+ if obj.detections < 4:
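A minimal reproduction of why the original line fails, assuming `detections` is an int counter as the review states:

```python
class Obj:
    def __init__(self):
        self.detections = 3  # a count, not a list of detections

obj = Obj()
try:
    len(obj.detections)  # the buggy check from the review
    raised = False
except TypeError:  # "object of type 'int' has no len()"
    raised = True

fresh_object = obj.detections < 4  # the corrected comparison
```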
[P1] Function signature promises str return, but returns obj.yolo_label which is typed as str | None. If yolo_label is None (possible per field definition), this violates the type contract and could cause issues in callers expecting a string.
[P2] If obj.yolo_label is None, the prompt becomes "Describe this None in detail...", which is semantically incorrect. Should handle None case or assert yolo_label is not None before this point.
    return

if tf is None:
    logger.warning("No TF transform (map -> base_link) available")
[P3] Warning message says "(map -> base_link)" but code at line 223 actually requests "world" to "base_link" transform. Message should match the actual frame names.
- logger.warning("No TF transform (map -> base_link) available")
+ logger.warning("No TF transform (world -> base_link) available")
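A small way to prevent this class of drift: derive both the TF query and the log message from shared frame constants. Purely illustrative; dimos may organize frame names differently.

```python
WORLD_FRAME = "world"      # target frame after the map -> world migration
BASE_FRAME = "base_link"

def missing_tf_warning() -> str:
    """Build the warning from the same constants used for the TF query."""
    return f"No TF transform ({WORLD_FRAME} -> {BASE_FRAME}) available"
```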
Additional Comments (2)
dimos/agents/skills/navigation.py
Outdated
This should be in the objectDBModule, otherwise its possible that if you have a blueprint without that module it will fail
dimos/agents/skills/navigation.py
Outdated
Yeah see here this is a hardcoded ObjectDBModule check - not good practice easier to just put this skill in there directly
    agentic,
    MCPModule.blueprint(),
)
agentic_detection = autoconnect(
return "\n".join(lines)

def navigate_with_text(self, query: str) -> str:
- def navigate_with_text(self, query: str) -> str:
+ @skill()
+ def navigate_with_text(self, query: str) -> str:
…itance
- Add DetectedObjectInfo dataclass for passing object data to navigation (can move to dimos.msgs if needed)
- Improve lookup with hybrid substring/token matching
- Clean up skill logging
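The "hybrid substring/token matching" mentioned in the commit message might look like this sketch (an assumption about the approach, not the merged implementation): exact substring matches win outright, otherwise candidates are ranked by token overlap with the query.

```python
def lookup(query: str, names: list[str]) -> list[str]:
    """Substring match first; otherwise rank names by shared tokens."""
    q = query.lower()
    hits = [n for n in names if q in n.lower()]
    if hits:
        return hits
    q_tokens = set(q.split())
    scored = [(len(q_tokens & set(n.lower().split())), n) for n in names]
    return [n for score, n in sorted(scored, reverse=True) if score > 0]
```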
@greptile |
Greptile Overview
Greptile Summary
Overview
This PR adds VLM-enriched object detection for precise robot navigation. Objects detected by YOLO are enhanced with rich descriptions from a Vision Language Model (e.g., "person" → "person in white shirt"), enabling navigation to specific objects like "go to the red coffee mug" or "navigate to person in white."
Key Features:
Architecture:
Critical Issues Found
🔴 Object Tracking Completely Broken (moduleDB.py)
🔴 Navigation Always Reports Failure (navigation.py)
🔴 TypeError in Object Listing (moduleDB.py)
Performance Improvements
The RPC optimizations are well-executed:
Confidence Score: 1/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant Agent as LLM Agent
participant NavSkill as NavigationSkillContainer
participant ObjectDB as ObjectDBModule
participant VLM as QwenVlModel
participant Detection as Detection3DModule
participant Nav as NavigationInterface
Note over Agent,Nav: VLM-Enriched Object Detection Navigation
Detection->>ObjectDB: ImageDetections3DPC
ObjectDB->>ObjectDB: add_detection(detection)
ObjectDB->>ObjectDB: create_new_object(detection)
ObjectDB->>VLM: query(image, prompt)
VLM-->>ObjectDB: rich_label (e.g., "person in white shirt")
ObjectDB->>ObjectDB: Store object with yolo_label + vlm_label
Agent->>NavSkill: navigate_to_detected_object("person in white")
NavSkill->>ObjectDB: lookup("person in white")
ObjectDB->>ObjectDB: Match objects by name/label
ObjectDB-->>NavSkill: [DetectedObjectInfo]
NavSkill->>NavSkill: Create goal_pose from obj.pose
NavSkill->>Nav: set_goal(goal_pose)
loop Poll navigation state
NavSkill->>Nav: get_state()
Nav-->>NavSkill: NavigationState
end
NavSkill->>Nav: is_goal_reached()
Nav-->>NavSkill: bool (success/failure)
NavSkill-->>Agent: "Successfully navigated to 'person in white shirt'"
Note over Agent,NavSkill: Alternative: List Detected Objects
Agent->>ObjectDB: list_detected_objects()
ObjectDB-->>Agent: "I've detected 5 object(s):\n- person in white shirt (12 detections)\n- red coffee mug (8 detections)"
dimos/agents/skills/navigation.py
Outdated
Type mismatch: _navigate_to() returns bool (True/False), but this code compares it to the string "SUCCESS". This condition will always be False, causing navigation to always report failure even when successful.
The method should check if result: instead of if result == "SUCCESS":.
- if result == "SUCCESS":
+ if result:
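A sketch of the corrected success check, assuming `_navigate_to()` returns a bool as the review states (the message strings are illustrative):

```python
def report_navigation(result: bool, name: str) -> str:
    if result:  # not: result == "SUCCESS" (always False against a bool)
        return f"Successfully navigated to '{name}'"
    return f"Navigation to '{name}' failed"
```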
obj.detections is an int, not a collection, so calling len() on it will raise TypeError: object of type 'int' has no len().
- if len(obj.detections) < 4:
+ if obj.detections < 4:
"""Convert Object3D to DetectedObjectInfo."""
try:
    return DetectedObjectInfo(
        track_id=obj.track_id,
DetectedObjectInfo expects track_id: str (non-optional), but obj.track_id can be None (type is str | None from line 56). If track_id is None, this will cause a type error when creating the dataclass instance.
While the exception handler at line 366 will catch this and return None, objects with None track_ids should be filtered out before calling _to_detection_info() to avoid unnecessary exceptions.
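The suggested pre-filtering could look like this sketch, with a trimmed two-field `DetectedObjectInfo` for illustration:

```python
from dataclasses import dataclass

@dataclass
class DetectedObjectInfo:  # trimmed to two fields for illustration
    track_id: str
    name: str

def to_infos(objects) -> list[DetectedObjectInfo]:
    # Drop objects whose track_id is still None instead of relying on
    # the exception handler to discard them.
    return [
        DetectedObjectInfo(track_id=o.track_id, name=o.name)
        for o in objects
        if o.track_id is not None
    ]
```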
    return

if tf is None:
    logger.warning("No TF transform (map -> base_link) available")
Error message refers to "map -> base_link" but the code at line 223 actually queries "world -> base_link". The error message should be updated to match the actual frame IDs being used.
- logger.warning("No TF transform (map -> base_link) available")
+ logger.warning("No TF transform (world -> base_link) available")
Additional Comments (1)
This will cause The
if not self._skill_started:
    raise ValueError(f"{self} has not been started.")
- tf = self.tf.get("map", "base_link", time_tolerance=2.0)
+ tf = self.tf.get("world", "base_link", time_tolerance=2.0)
This was a breaking change that I think was never tested or caught.
semantic map for a location matching the description.
CALL THIS SKILL FOR ONE SUBJECT AT A TIME. For example: "Go to the person wearing a blue shirt in the living room",
you should call this skill twice, once for the person wearing a blue shirt and once for the living room.
def navigate_to_detected_object(self, object_name: str) -> str:
object name can't be a plain string, it needs a type
if not self._skill_started:
    raise ValueError(f"{self} has not been started.")

lookup_rpc = self.get_rpc_calls("ObjectDBModule.lookup")
@paul-nechifor we need to fix this rpc shit with modules super weird and hard coded
goal_pose = PoseStamped(
    position=obj.pose.position,
    orientation=Quaternion(),
Let's calculate goal_pose via a separate method. This is the goal point, which is just a position. goal_pose requires /global_map, /odom, and the /goal_point; it picks the closest point within some buffer (due to the size of the robot) that is not in collision, and then points the right way using trig.
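The proposed helper, minus the collision check against /global_map, could be sketched as: stop a robot-sized buffer short of the object and face it via atan2. The 0.5 m buffer and function name are assumed placeholders.

```python
import math

def goal_toward(robot_xy, object_xy, buffer: float = 0.5):
    """Stop `buffer` metres short of the object and face it."""
    dx = object_xy[0] - robot_xy[0]
    dy = object_xy[1] - robot_xy[1]
    yaw = math.atan2(dy, dx)                      # heading that faces the object
    stop = max(math.hypot(dx, dy) - buffer, 0.0)  # never overshoot backwards
    return (robot_xy[0] + stop * math.cos(yaw),
            robot_xy[1] + stop * math.sin(yaw),
            yaw)
```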
if success_msg:
    return success_msg

logger.info(f"No tagged location found for {query}")
goal_pose = PoseStamped(
    position=make_vector3(*robot_location.position),
    orientation=Quaternion.from_euler(Vector3(*robot_location.rotation)),
    frame_id="map",
bug fix from frame thing
)
- logger.info(f"Added image vector {vector_id} with metadata: {metadata}")
+ # logger.info(f"Added image vector {vector_id} with metadata: {metadata}")
logger = setup_logger()

@dataclass
class DetectedObjectInfo:
use detection2d type or extend (maybe but prob not) with a super minimal thing
self._vlm_model = QwenVlModel()
return self._vlm_model

def _enrich_with_vlm(self, obj: Object3D) -> str:
This can't be here. It needs to be in the VLM stuff in models/vl.