
feat: Add remote MCP server connectivity verification#249

Merged
prernakakkar-google merged 6 commits into main from gemini-cli-evals
Feb 25, 2026
Conversation

Collaborator

@prernakakkar-google prernakakkar-google commented Feb 24, 2026

Success:

(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$ ./evalbench/run.sh 
I0225 11:41:16.648499 140069881029440 evalbench.py:36] EvalBench v1.0.0
I0225 11:41:16.652137 140069881029440 evalbench.py:50] Loaded Configurations in datasets/gemini-cli-tools/example_run_config.yaml
I0225 11:41:16.652468 140069881029440 __init__.py:11] Orchestrator Type: geminicli
I0225 11:41:16.652542 140069881029440 agentorchestrator.py:30] Starting Gemini CLI evaluation
I0225 11:41:16.654238 140069881029440 gemini_cli.py:167] Verifying MCP server: cloud-sql
I0225 11:41:16.654409 140069881029440 gemini_cli.py:190] Running gemini cli to verify loaded tools for MCP server: cloud-sql
I0225 11:41:30.395591 140069881029440 gemini_cli.py:236] MCP server 'cloud-sql' successfully loaded 10 tools: ['list_instances', 'get_instance', 'create_instance', 'execute_sql', 'get_operation', 'create_user', 'update_user', 'clone_instance', 'update_instance', 'import_data']
I0225 11:41:30.395727 140069881029440 gemini_cli.py:167] Verifying MCP server: firestore
I0225 11:41:30.395882 140069881029440 gemini_cli.py:190] Running gemini cli to verify loaded tools for MCP server: firestore
I0225 11:41:44.688469 140069881029440 gemini_cli.py:236] MCP server 'firestore' successfully loaded 14 tools: ['get_document', 'add_document', 'update_document', 'delete_document', 'list_documents', 'list_collections', 'create_database', 'get_database', 'list_databases', 'update_database', 'delete_database', 'get_index', 'list_indexes', 'delete_index']
I0225 11:41:44.688609 140069881029440 gemini_cli.py:76] Fetching new access token via gcloud auth command
I0225 11:41:45.247136 140069881029440 gemini_cli.py:96] Updating /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.npmrc with new token...
I0225 11:41:45.247669 140069881029440 gemini_cli.py:138] NPM authentication updated successfully at /usr/local/google/home/prernakakkar/senseai/evalbench/.venv/fake_home/.npmrc
I0225 11:41:48.505155 140069881029440 agentevaluator.py:67] Running Gemini CLI evaluation
I0225 11:41:48.551893 140065708697280 agentevaluator.py:117] Turn 1/3 - Prompt: list all instances in project astana-evaluation
I0225 11:42:07.202802 140065708697280 agentevaluator.py:180] Turn 1/3 - Gemini CLI exit code: 0
I0225 11:42:07.202910 140065708697280 agentevaluator.py:182] Turn 1/3 - Gemini CLI stdout: {
  "session_id": "3c7116f9-86b4-43b8-b64c-8bd1d3556296",
  "response": "Here are the Cloud SQL instances in project `astana-evaluation`:\n\n*   **clone-agd**: PostgreSQL 16, REGIONAL, 250 GB, us-central1, RUNNABLE\n*   **test56**: PostgreSQL 17, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **test110**: SQL Server 2022 Standard, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **staging-nl2code-5**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **trte**: MySQL 8.4, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **staging-nl2code-2**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **test-instance**: PostgreSQL 17, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **nl2code-staging-clone**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **test-postgres-instance**: PostgreSQL 17, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **tesy**: SQL Server 2022 Standard, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **my-pg-app**: PostgreSQL 17, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **magic**: PostgreSQL 17, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **nl2code**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **staging-nl2code**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **testing-instance**: PostgreSQL 17, ZONAL, 10 GB, us-central1, RUNNABLE\n*   **agd**: PostgreSQL 16, REGIONAL, 250 GB, us-central1, RUNNABLE\n*   **test-cloudsql-mysql-instance**: MySQL 8.0.40, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **test300**: SQL Server 2022 Standard, ZONAL, 100 GB, us-central1, RUNNABLE\n*   **test400**: PostgreSQL 17, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **staging-nl2code-3**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **my-pg-app-instance**: PostgreSQL 18, ZONAL, 100 GB, us-central1, RUNNABLE, tags: {'environment': 'dev'}\n*   **staging-nl2code-6**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **test19**: SQL Server 2022 Standard, REGIONAL, 250 GB, us-central1, RUNNABLE\n*   **staging-nl2code-4**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **nl2code-clone**: PostgreSQL 
16, ZONAL, 250 GB, us-central1, RUNNABLE\n*   **test-cloudsql-sql-server-instance**: SQL Server 2022 Standard, ZONAL, 250 GB, us-central1, RUNNABLE",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 3221
        },
        "tokens": {
          "input": 6905,
          "prompt": 6905,
          "candidates": 63,
          "total": 7163,
          "cached": 0,
          "thoughts": 195,
          "tool": 0
        }
      },
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 2,
          "totalErrors": 0,
          "totalLatencyMs": 8323
        },
        "tokens": {
          "input": 33206,
          "prompt": 33206,
          "candidates": 902,
          "total": 34224,
          "cached": 0,
          "thoughts": 116,
          "tool": 0
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 755,
      "totalDecisions": {
        "accept": 0,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "list_instances": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 755,
          "decisions": {
            "accept": 0,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    },
    "files": {
      "totalLinesAdded": 0,
      "totalLinesRemoved": 0
    }
  }
}
I0225 11:42:07.202973 140065708697280 agentevaluator.py:184] Turn 1/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
YOLO mode is enabled. All tool calls will be automatically approved.
The --prompt (-p) flag has been deprecated and will be removed in a future version. Please use a positional argument for your prompt. See gemini --help for more information.
I0225 11:42:23.375614 140065708697280 agentevaluator.py:117] Turn 2/3 - Prompt: what is the state of nl2code
I0225 11:42:44.260240 140065708697280 agentevaluator.py:180] Turn 2/3 - Gemini CLI exit code: 0
I0225 11:42:44.260359 140065708697280 agentevaluator.py:182] Turn 2/3 - Gemini CLI stdout: {
  "session_id": "3c7116f9-86b4-43b8-b64c-8bd1d3556296",
  "response": "I am sorry, but I was unable to find an instance with the name `nl2code` in the project `astana-evaluation`. I did find other instances with similar names, such as `nl2code-clone`, `staging-nl2code`, etc.",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 3066
        },
        "tokens": {
          "input": 7807,
          "prompt": 7807,
          "candidates": 107,
          "total": 8272,
          "cached": 0,
          "thoughts": 358,
          "tool": 0
        }
      },
      "gemini-2.5-pro": {
        "api": {
          "totalRequests": 2,
          "totalErrors": 0,
          "totalLatencyMs": 8965
        },
        "tokens": {
          "input": 38429,
          "prompt": 38429,
          "candidates": 81,
          "total": 38732,
          "cached": 0,
          "thoughts": 222,
          "tool": 0
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 2537,
      "totalDecisions": {
        "accept": 0,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "run_shell_command": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 2537,
          "decisions": {
            "accept": 0,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    },
    "files": {
      "totalLinesAdded": 0,
      "totalLinesRemoved": 0
    }
  }
}
I0225 11:42:44.260413 140065708697280 agentevaluator.py:184] Turn 2/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
YOLO mode is enabled. All tool calls will be automatically approved.
The --prompt (-p) flag has been deprecated and will be removed in a future version. Please use a positional argument for your prompt. See gemini --help for more information.
I0225 11:43:04.135909 140065708697280 agentevaluator.py:117] Turn 3/3 - Prompt: It is in the list you just gave me. What is its state?
I0225 11:43:16.694281 140065708697280 agentevaluator.py:180] Turn 3/3 - Gemini CLI exit code: 0
I0225 11:43:16.694397 140065708697280 agentevaluator.py:182] Turn 3/3 - Gemini CLI stdout: {
  "session_id": "3c7116f9-86b4-43b8-b64c-8bd1d3556296",
  "response": "The Cloud SQL instance named `nl2code` in project `astana-evaluation` is in the `RUNNABLE` state.",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 2529
        },
        "tokens": {
          "input": 1922,
          "prompt": 1922,
          "candidates": 61,
          "total": 2321,
          "cached": 0,
          "thoughts": 338,
          "tool": 0
        }
      },
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 3614
        },
        "tokens": {
          "input": 19244,
          "prompt": 19244,
          "candidates": 28,
          "total": 19373,
          "cached": 0,
          "thoughts": 101,
          "tool": 0
        }
      }
    },
    "tools": {
      "totalCalls": 0,
      "totalSuccess": 0,
      "totalFail": 0,
      "totalDurationMs": 0,
      "totalDecisions": {
        "accept": 0,
        "reject": 0,
        "modify": 0,
        "auto_accept": 0
      },
      "byName": {}
    },
    "files": {
      "totalLinesAdded": 0,
      "totalLinesRemoved": 0
    }
  }
}
I0225 11:43:16.694449 140065708697280 agentevaluator.py:184] Turn 3/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
YOLO mode is enabled. All tool calls will be automatically approved.
The --prompt (-p) flag has been deprecated and will be removed in a future version. Please use a positional argument for your prompt. See gemini --help for more information.
I0225 11:44:41.504982 140069881029440 report.py:25] Total Prompts: 1.
I0225 11:44:41.506503 140069881029440 report.py:43] Prompt Errors: 0.
I0225 11:44:41.506960 140069881029440 report.py:44] SQLGen Errors: 0.
I0225 11:44:41.507419 140069881029440 report.py:45] SQLExec Gen Errors: 0.
I0225 11:44:41.507858 140069881029440 report.py:46] Golden Errors: 0.
I0225 11:44:41.509171 140069881029440 analyzer.py:72] trajectory_matcher:       0/1 = 0.0%
I0225 11:44:41.509608 140069881029440 analyzer.py:105] \n--- goal_completion Analysis ---
I0225 11:44:41.509700 140069881029440 analyzer.py:107] FAIL
Reasoning: The agent initially succeeded in listing all the instances in the project, which included the nl2code instance with a state of RUNNABLE. However, when asked for the state of nl2code in the next turn, the agent incorrectly stated that it was unable to find an instance with that name. It only provided the correct state after the user corrected it, pointing out that it was in the list the agent had just provided. This demonstrates a failure to retain context from its own previous response, thus failing to achieve the conversational goal without significant user intervention.
I0225 11:44:41.510081 140069881029440 analyzer.py:105] \n--- behavioral_metrics Analysis ---
I0225 11:44:41.510152 140069881029440 analyzer.py:107] Hallucination Count: 1
Clarification Count: 0
Reasoning: In the first turn, the agent correctly lists all instances in the project, including one named nl2code with a state of RUNNABLE. However, in the second turn, when the user asks for the state of that specific instance, the agent hallucinates by claiming it "was unable to find an instance with the name nl2code". This is a direct contradiction of the information it had just provided, indicating a failure to maintain context from its previous response. The agent did not ask any clarification questions.
I0225 11:44:41.510490 140069881029440 analyzer.py:105] \n--- parameter_analysis Analysis ---
I0225 11:44:41.510556 140069881029440 analyzer.py:107] Analysis:
The agent's performance in this interaction reveals a significant issue with either its reasoning capabilities or the available toolset.

1. In the first turn, the list_instances tool was called correctly with the project_id from the user's prompt, and it successfully returned a list of instances.
2. In the second turn, the user asked for the state of a specific instance, nl2code. This information was present in the context from the first turn's tool output: * **nl2code**: PostgreSQL 16, ZONAL, 250 GB, us-central1, RUNNABLE. Instead of using this information, the agent made an unnecessary tool call to run_shell_command.
3. The choice of run_shell_command is problematic. It is a highly generic tool that forces the agent to construct a command string (e.g., gcloud sql instances describe nl2code ...), which is error-prone. The unstructured text output of shell commands is also difficult to parse reliably. The agent's subsequent failure, claiming it "was unable to find an instance with the name nl2code," demonstrates the brittleness of this approach. The agent either constructed the command incorrectly or misinterpreted the output.
4. The final turn confirms this failure. When the user pointed out the error, the agent was able to correctly identify the instance's state as RUNNABLE without making any further tool calls, proving the information was in its context all along.

The core problem is the agent's decision to use a generic, high-risk tool (run_shell_command) for a task that should have been handled by either parsing existing context or using a more specific, structured tool. This suggests the lack of a suitable, specific tool for retrieving details of a single resource.

Suggestions:
To prevent this type of failure and guide the agent towards more reliable actions, the following tool improvements are recommended:

1. **Introduce a Specific get_instance Tool:** The agent resorted to run_shell_command because a dedicated tool for fetching a single instance's details was likely missing. A focused tool would be far more robust.
* **Tool Name:** get_instance
* **Parameters:**
* instance_id: string (required) - The unique identifier for the Cloud SQL instance.
* project_id: string (optional) - The project ID. If not provided, it can be inferred from the user's session.
* **Description:** "Retrieves detailed information, including the current state, for a single Cloud SQL instance identified by its ID. This is preferred over list_instances when looking for details about one specific instance."
* **Rationale:** This provides a structured and reliable way to get data for a specific resource. It eliminates the need for the agent to construct shell commands and parse unstructured text, directly addressing the failure point in the conversation.

2. **Enhance list_instances with Filtering:** The existing list_instances tool could be made more powerful, allowing it to handle "get" requests for a single, known item.
* **Tool:** list_instances
* **Add Parameter:**
* filter: string (optional) - A filter expression to apply when listing instances. The filter can be used to match on properties like name, state, or region. Example: name=my-instance or state=RUNNABLE.
* **Updated Description:** "Lists Cloud SQL instances in a project. Can be filtered to find specific instances by name or other attributes using the filter parameter. If you already know the instance name, using a filter like name=<instance_name> is the most efficient way to get its details."
* **Rationale:** This would encourage the agent to reuse the list_instances tool with a specific filter (e.g., filter='name=nl2code') in the second turn. This is a much better pattern than calling a generic shell command, as the tool's output would be a structured object for only the relevant instance.
I0225 11:44:41.511193 140069881029440 analyzer.py:58] turn_count:       Average = 3.00 turns
I0225 11:44:41.511828 140069881029440 analyzer.py:58] end_to_end_latency:       Average = 33010.00 ms
I0225 11:44:41.512405 140069881029440 analyzer.py:58] tool_call_latency:        Average = 3292.00 ms
I0225 11:44:41.512931 140069881029440 analyzer.py:58] token_consumption:        Average = 110085.00 tokens
I0225 11:44:41.513453 140069881029440 analyzer.py:72] executable:       1/1 = 100.0%
I0225 11:44:41.517840 140069881029440 csv.py:31] Created csv configs.csv for StoreType.CONFIGS in directory results/b080f9b4-daac-485c-8087-8cba00a24f39
I0225 11:44:41.519302 140069881029440 csv.py:31] Created csv evals.csv for StoreType.EVALS in directory results/b080f9b4-daac-485c-8087-8cba00a24f39
I0225 11:44:41.520246 140069881029440 csv.py:31] Created csv scores.csv for StoreType.SCORES in directory results/b080f9b4-daac-485c-8087-8cba00a24f39
I0225 11:44:41.520956 140069881029440 csv.py:31] Created csv summary.csv for StoreType.SUMMARY in directory results/b080f9b4-daac-485c-8087-8cba00a24f39
Finished Job ID b080f9b4-daac-485c-8087-8cba00a24f39
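The analyzer's two tool-shape suggestions above could be sketched as MCP-style tool declarations. This is a hypothetical sketch: the field names follow the common JSON Schema convention for MCP tool definitions, and none of it is taken from the actual cloud-sql MCP server.

```python
# Hypothetical MCP-style declaration of the suggested get_instance tool.
# Schema fields are illustrative, not the real cloud-sql server definitions.
get_instance_tool = {
    "name": "get_instance",
    "description": (
        "Retrieves detailed information, including the current state, for a "
        "single Cloud SQL instance identified by its ID. Preferred over "
        "list_instances when looking for details about one specific instance."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "instance_id": {
                "type": "string",
                "description": "The unique identifier for the Cloud SQL instance.",
            },
            "project_id": {
                "type": "string",
                "description": "Optional; inferred from the user's session if omitted.",
            },
        },
        "required": ["instance_id"],
    },
}

# Hypothetical extra parameter for the existing list_instances tool,
# implementing the analyzer's second suggestion.
list_instances_filter_param = {
    "filter": {
        "type": "string",
        "description": (
            "Optional filter expression applied when listing instances, "
            "e.g. name=my-instance or state=RUNNABLE."
        ),
    }
}
```

Either shape would let the agent answer the second turn with one structured call (e.g. `get_instance(instance_id="nl2code")`) instead of constructing a shell command.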

Fail:

(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$ ./evalbench/run.sh 
I0225 11:38:44.543451 140042615059264 evalbench.py:36] EvalBench v1.0.0
I0225 11:38:44.547159 140042615059264 evalbench.py:50] Loaded Configurations in datasets/gemini-cli-tools/example_run_config.yaml
I0225 11:38:44.547537 140042615059264 __init__.py:11] Orchestrator Type: geminicli
I0225 11:38:44.547620 140042615059264 agentorchestrator.py:30] Starting Gemini CLI evaluation
I0225 11:38:44.549359 140042615059264 gemini_cli.py:167] Verifying MCP server: cloud-sql
I0225 11:38:44.549515 140042615059264 gemini_cli.py:190] Running gemini cli to verify loaded tools for MCP server: cloud-sql
I0225 11:38:58.856009 140042615059264 gemini_cli.py:236] MCP server 'cloud-sql' successfully loaded 11 tools: ['list_instances', 'get_instance', 'create_instance', 'execute_sql', 'get_operation', 'create_user', 'update_user', 'clone_instance', 'update_instance', 'list_users', 'import_data']
I0225 11:38:58.856148 140042615059264 gemini_cli.py:167] Verifying MCP server: firestore
I0225 11:38:58.856302 140042615059264 gemini_cli.py:190] Running gemini cli to verify loaded tools for MCP server: firestore
E0225 11:39:09.090337 140042615059264 gemini_cli.py:239] MCP server 'firestore' returned 0 non-builtin tools. The server might be unreachable or lacks tools.
E0225 11:39:09.090473 140042615059264 evalbench.py:94] MCP Server 'firestore' failed verification. Please check the configuration and ensure the server is running correctly.
Traceback (most recent call last):
  File "/usr/local/google/home/prernakakkar/senseai/evalbench/evalbench/evalbench.py", line 60, in eval
    evaluator.evaluate(flatten_dataset(dataset))
  File "/usr/local/google/home/prernakakkar/senseai/evalbench/evalbench/evaluator/agentorchestrator.py", line 31, in evaluate
    evaluator = AgentEvaluator(self.config)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/prernakakkar/senseai/evalbench/evalbench/evaluator/agentevaluator.py", line 38, in __init__
    self.generator = GeminiCliGenerator(model_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/prernakakkar/senseai/evalbench/evalbench/generators/models/gemini_cli.py", line 54, in __init__
    self._setup()
  File "/usr/local/google/home/prernakakkar/senseai/evalbench/evalbench/generators/models/gemini_cli.py", line 65, in _setup
    self._setup_mcp_servers(
  File "/usr/local/google/home/prernakakkar/senseai/evalbench/evalbench/generators/models/gemini_cli.py", line 169, in _setup_mcp_servers
    raise RuntimeError(
RuntimeError: MCP Server 'firestore' failed verification. Please check the configuration and ensure the server is running correctly.
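The behaviour shown by both runs — count the non-builtin tools each configured MCP server reports, log the result, and fail fast if there are none — can be sketched roughly as below. This is a minimal sketch under assumptions: the function name and signature are illustrative, not the actual `gemini_cli.py` implementation, and it assumes the caller has already filtered out built-in CLI tools.

```python
import logging

def verify_mcp_server(name: str, loaded_tools: list[str]) -> None:
    """Fail fast when a configured MCP server reports no usable tools.

    Hypothetical helper mirroring the log messages above; `loaded_tools`
    is assumed to already exclude the CLI's built-in tools.
    """
    logging.info("Verifying MCP server: %s", name)
    if not loaded_tools:
        logging.error(
            "MCP server '%s' returned 0 non-builtin tools. "
            "The server might be unreachable or lacks tools.", name)
        raise RuntimeError(
            f"MCP Server '{name}' failed verification. Please check the "
            "configuration and ensure the server is running correctly.")
    logging.info("MCP server '%s' successfully loaded %d tools: %s",
                 name, len(loaded_tools), loaded_tools)
```

Raising during setup, before any evaluation turns run, is what makes the failure mode above cheap: the job aborts immediately with a pointed message instead of burning model calls against a half-configured toolset.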

@prernakakkar-google prernakakkar-google merged commit 7bf5716 into main Feb 25, 2026
3 of 4 checks passed
