[ENH] Simplified Unified Get/List API#1552
[ENH] Simplified Unified Get/List API#1552Omswastik-11 wants to merge 19 commits intoopenml:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1552 +/- ##
==========================================
- Coverage 52.73% 52.72% -0.02%
==========================================
Files 37 38 +1
Lines 4399 4427 +28
==========================================
+ Hits 2320 2334 +14
- Misses 2079 2093 +14 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR introduces a simplified unified API for getting and listing OpenML objects through two new dispatcher functions: openml.get() and openml.list_all(). These functions provide a more concise alternative to the existing submodule-specific APIs (e.g., openml.datasets.get_dataset()) while maintaining full backward compatibility. The implementation uses dispatch dictionaries to route calls to the appropriate underlying functions.
Changes:
- Added unified
get()andlist_all()dispatcher functions for datasets, tasks, flows, and runs - Updated example tutorials to demonstrate the new API with legacy alternatives shown in comments
- Added unit tests for the dispatcher functions using mocking
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| openml/dispatchers.py | New module containing the unified get() and list_all() dispatcher functions with type checking and error handling |
| openml/init.py | Exports the new dispatcher functions and module; contains duplicate entries in __all__ list |
| tests/test_openml/test_openml.py | Unit tests for dispatchers covering datasets and tasks; missing coverage for flows, runs, and error conditions |
| examples/Basics/simple_tasks_tutorial.py | Updated to demonstrate new openml.get() API with legacy alternative in comments |
| examples/Basics/simple_flows_and_runs_tutorial.py | Updated to demonstrate both openml.list_all() and openml.get() APIs |
| examples/Basics/simple_datasets_tutorial.py | Updated to demonstrate both openml.list_all() and openml.get() APIs |
| examples/Advanced/tasks_tutorial.py | Updated multiple examples to demonstrate new openml.list_all() API |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
geetu040
left a comment
There was a problem hiding this comment.
Copilot left a few solid review comments above, please take a look and address them.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@PGijsbers, what do you think of this suggestion? |
geetu040
left a comment
There was a problem hiding this comment.
LGTM.
@PGijsbers could you please take a look and approve if you agree with the design?
PGijsbers
left a comment
There was a problem hiding this comment.
I'm sorry to only see this so late, but I don't really see the added value for this. Please do push back if you think I am wrong in this.
I much prefer the design of #1551. I know they are technically not mutually exclusive, but I feel that PR pretty much provides the same convenience as this one in terms of removing duplication from the calls. However, unlike this PR, the #1551 method also provides:
- (more) accurate type annotation, e.g.,
get_taskreturns a task whereasget("task", ..returnsAny. This is useful for editors that give hints/autocomplete suggestions based on type annotations (common LSP functionality). - call-specific docstrings and parameters (e.g.,
task_typeis forlist_tasksand not other list calls), again useful for the user that they do not have to pick and choose what is the relevant documentation for their entity type. - less prone to typos: mistyping
list_task(instead of plural) is caught by most LSPs because the function doens't exist, whereaslist_all("task", ...)is likely not because its validity is delegated to runtime checks.
Granted, these could be remedied by:
- providing different annotations and signatures with
typing.overloadto provide input-specific signatures and return types - providing an enum for allowed values
but at that point the #1551 seems easier to understand for users, and easier to maintain for us.
Unless I am mistaken, I suppose the main argument for this design would be "extensibility" in that adding a new type of object to retrieve doesn't change the function to call but only an argument. However, this requires either forgoing more detailed annotation (as outlined above). If you do want to provide the developer with good annotations then you would still need to document the new overload of the signature. That seems just as "much" work as providing it as a dedicated function.
We also do not expect to be adding or changing the entities on OpenML at such velocity that this makes a significant difference.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def get(identifier: int | str, *, object_type: str = "dataset", **kwargs: Any) -> Any: | ||
| """Get an OpenML object by identifier. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| identifier : int | str | ||
| The ID or name of the object to retrieve. String identifiers are | ||
| supported for datasets; tasks, flows, and runs require integer IDs. | ||
| object_type : str, default="dataset" |
| } | ||
|
|
||
|
|
||
| def list_all(object_type: str, /, **kwargs: Any) -> Any: |
| def get(identifier: int | str, *, object_type: str = "dataset", **kwargs: Any) -> Any: | ||
| """Get an OpenML object by identifier. |
Redundant syntax for getters
In getters, syntax is repeated and redundant, mainly through
the submodule having to be imported or addressed.
API implementation
Implementation Details
Added
openml.list_all(object_type: str, **kwargs) -> Any, a dispatcher that forwards to:list_datasetslist_taskslist_flowslist_runsAdded
openml.get(object_type_or_name, identifier=None, **kwargs) -> Any, a unified getter with support for:Type-based lookup
Name-only shortcut for datasets
Exported both functions via
__all__and documented them with docstrings.Preserved full backward compatibility:
openml.datasets.get_dataset) remain unchanged.Added unit tests to validate dispatcher behavior without requiring network access.