Hello authors,
Thank you for this excellent work. I have a question regarding the current evaluation difficulty of FeatureBench.
I recently ran experiments on the Lite subset using the latest DeepSeek-V4-Flash model (non-thinking mode) within the OpenHands framework and observed a 100% Resolved rate (all tests passed). I carefully reviewed the execution traces and found no obvious signs of cheating, such as accessing the internet or directly reading the reference source code.
This makes me wonder: does this suggest that FeatureBench may gradually lose its effectiveness for evaluating frontier LLMs on complex feature generation, especially as stronger models such as GPT-5.4 and Claude 4.7 continue to emerge? More broadly, how do you view the long-term challenge of maintaining the benchmark's difficulty and discriminative power?
I’m also curious whether you plan to further extend or evolve this benchmark in the future (e.g., harder tasks, dynamic environments, multi-turn settings, repository-level dependencies, etc.).
I would greatly appreciate your thoughts. Thank you again for your insightful work!