Right now, the models are scored solely based on the number of tests they pass...a more nuanced score that also reflects *how* those tests are passed would be wonderful. This could involve:
- Whether the model is using the MCP server or not
- Whether the model is using the Test tool or not
- The number of steps it took to complete
- The number of tokens it took to complete
- Possibly cost (?)
Other ideas?
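As a starting point for discussion, here's a rough sketch of what a composite score combining these signals could look like. All field names, weights, and budget caps here are hypothetical placeholders, not the harness's actual API:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    # Hypothetical fields; adjust to whatever the harness actually records.
    tests_passed: int
    tests_total: int
    used_mcp_server: bool
    used_test_tool: bool
    steps: int
    tokens: int

def composite_score(m: RunMetrics,
                    max_steps: int = 50,
                    max_tokens: int = 200_000) -> float:
    """Weighted score in [0, 1]: pass rate dominates, while efficiency
    and tool usage act as smaller modifiers. Weights are illustrative."""
    pass_rate = m.tests_passed / m.tests_total if m.tests_total else 0.0
    # Efficiency terms: 1.0 when cheap, dropping to 0.0 at the budget cap.
    step_eff = max(0.0, 1.0 - m.steps / max_steps)
    token_eff = max(0.0, 1.0 - m.tokens / max_tokens)
    # Reward using the MCP server and the Test tool.
    tool_bonus = 0.5 * m.used_mcp_server + 0.5 * m.used_test_tool
    return (0.70 * pass_rate
            + 0.10 * step_eff
            + 0.10 * token_eff
            + 0.10 * tool_bonus)
```

Cost could slot in as another efficiency term once we decide how to normalize it across providers.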