Hello authors,
Thank you for this excellent work. I have a question regarding the current evaluation difficulty of FeatureBench.
I recently ran experiments on the Lite subset using the latest DeepSeek-V4-Flash model (non-thinking mode) within the OpenHands framework and observed a 100% Resolved rate (all tests passed). I carefully reviewed the execution traces and found no obvious signs of cheating, such as accessing the internet or directly reading the reference source code.
This makes me wonder: does this suggest that FeatureBench may gradually lose its effectiveness for evaluating frontier LLMs on complex feature generation, especially as stronger models such as GPT-5.4 and Claude 4.7 continue to emerge? More broadly, how do you view the long-term challenge of maintaining the benchmark's difficulty and discriminative power?
I’m also curious whether you plan to further extend or evolve this benchmark in the future (e.g., harder tasks, dynamic environments, multi-turn settings, repository-level dependencies, etc.).
I would greatly appreciate your thoughts. Thank you again for your insightful work!