If this feature is triggered, RMG reports something like: |
Pull request overview
Adds an opt-in “auto-database” mode to RMG that can automatically select thermo/kinetics/transport libraries, seed mechanisms, and kinetics families based on detected chemistry (elements, phase, surface, temperature), driven by recommended_libraries.yml in RMG-database and the existing kinetics/families/recommended.py sets.
Changes:
- Add rmgpy/data/auto_database.py implementing chemistry detection + auto-selection logic (including <PAH_libs> handling and family exclusion via !FamilyName).
- Extend input parsing and startup initialization to accept/pass through 'auto' and <PAH_libs> and to preserve reaction-library “output edge” flags.
- Add unit tests + user documentation + a preview Jupyter notebook.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| rmgpy/rmg/input.py | Accepts 'auto' / <PAH_libs> tokens in database() and stores reaction libraries as strings + a sidecar reaction_libraries_output_edge set. |
| rmgpy/rmg/main.py | Runs auto-selection during initialize() and converts reaction libraries back to (name, output_edge) tuples before database load. |
| rmgpy/data/auto_database.py | New module implementing detection, YAML expansion, merging logic, and kinetics family resolution. |
| test/rmgpy/rmg/inputTest.py | Updates reaction library parsing expectations; adds token-handling tests. |
| test/rmgpy/data/autoDatabaseTest.py | New test suite for chemistry detection, YAML expansion/merge behavior, and end-to-end selection outcomes. |
| documentation/source/users/rmg/input.rst | Documents 'auto', mixed manual/auto lists, and <PAH_libs> behavior + notebook preview. |
| ipython/auto_library_selection.ipynb | Notebook to preview what auto-selection would choose for a given input file. |
ipython/auto_library_selection.ipynb
Outdated
| { | ||
| "cell_type": "code", | ||
| "id": "eft98a4ciwl", | ||
| "source": "from rmgpy.data.auto_database import auto_select_libraries, PAH_LIBS, _to_reaction_library_tuples\n\n# Work on a fresh copy so we don't mutate the rmg object used above\nimport copy\nrmg2 = copy.deepcopy(rmg)\n\n# Run the same auto-selection that main.py would run\nauto_select_libraries(rmg2)\n\n# Convert reaction libraries to tuples (as main.py does before load_database)\nif isinstance(rmg2.reaction_libraries, list):\n output_edge = getattr(rmg2, 'reaction_libraries_output_edge', set())\n rmg2.reaction_libraries = _to_reaction_library_tuples(rmg2.reaction_libraries, output_edge)\n\nhas_auto = any(\n getattr(rmg, attr, None) == 'auto'\n or (isinstance(getattr(rmg, attr, None), list) and 'auto' in getattr(rmg, attr))\n for attr in ('thermo_libraries', 'reaction_libraries', 'seed_mechanisms',\n 'transport_libraries', 'kinetics_families')\n)\n\nif not has_auto:\n print('The input file does not use \\'auto\\' in any database field.')\n print('The settings below are exactly what was specified in the input file.\\n')\n\nprint_list('Thermo libraries', rmg2.thermo_libraries)\n\nrxn_lib_names = [name for name, _ in rmg2.reaction_libraries] if isinstance(rmg2.reaction_libraries, list) else rmg2.reaction_libraries\nprint_list('Reaction libraries', rxn_lib_names or [])\n\nedge_libs = [name for name, flag in rmg2.reaction_libraries if flag] if isinstance(rmg2.reaction_libraries, list) else []\nif edge_libs:\n print(f'\\n (output unused edge reactions for: {\", \".join(edge_libs)})')\n\nprint_list('Seed mechanisms', rmg2.seed_mechanisms or [])\nprint_list('Transport libraries', rmg2.transport_libraries or [])\n\nif isinstance(rmg2.kinetics_families, list):\n print_list('Kinetics families', rmg2.kinetics_families)\nelse:\n print(f'\\nKinetics families: {rmg2.kinetics_families!r} (resolved at database load time)')", |
This notebook imports and calls _to_reaction_library_tuples, but the module exports to_reaction_library_tuples (no leading underscore). As written, running the notebook will raise ImportError/NameError in the “Actual Resolution” cell. Update the import and call sites to use the real function name.
| "source": "from rmgpy.data.auto_database import auto_select_libraries, PAH_LIBS, to_reaction_library_tuples\n\n# Work on a fresh copy so we don't mutate the rmg object used above\nimport copy\nrmg2 = copy.deepcopy(rmg)\n\n# Run the same auto-selection that main.py would run\nauto_select_libraries(rmg2)\n\n# Convert reaction libraries to tuples (as main.py does before load_database)\nif isinstance(rmg2.reaction_libraries, list):\n output_edge = getattr(rmg2, 'reaction_libraries_output_edge', set())\n rmg2.reaction_libraries = to_reaction_library_tuples(rmg2.reaction_libraries, output_edge)\n\nhas_auto = any(\n getattr(rmg, attr, None) == 'auto'\n or (isinstance(getattr(rmg, attr, None), list) and 'auto' in getattr(rmg, attr))\n for attr in ('thermo_libraries', 'reaction_libraries', 'seed_mechanisms',\n 'transport_libraries', 'kinetics_families')\n)\n\nif not has_auto:\n print('The input file does not use \\'auto\\' in any database field.')\n print('The settings below are exactly what was specified in the input file.\\n')\n\nprint_list('Thermo libraries', rmg2.thermo_libraries)\n\nrxn_lib_names = [name for name, _ in rmg2.reaction_libraries] if isinstance(rmg2.reaction_libraries, list) else rmg2.reaction_libraries\nprint_list('Reaction libraries', rxn_lib_names or [])\n\nedge_libs = [name for name, flag in rmg2.reaction_libraries if flag] if isinstance(rmg2.reaction_libraries, list) else []\nif edge_libs:\n print(f'\\n (output unused edge reactions for: {\", \".join(edge_libs)})')\n\nprint_list('Seed mechanisms', rmg2.seed_mechanisms or [])\nprint_list('Transport libraries', rmg2.transport_libraries or [])\n\nif isinstance(rmg2.kinetics_families, list):\n print_list('Kinetics families', rmg2.kinetics_families)\nelse:\n print(f'\\nKinetics families: {rmg2.kinetics_families!r} (resolved at database load time)')", |
def determine_chemistry_sets(profile: ChemistryProfile,
                             pah_libs_requested: bool = False,
                             ) -> List[str]:
    """
    Determine which chemistry sets to activate based on the detected profile.

    CH pyrolysis logic:
    - CH_pyrolysis_core is always added when C present AND T >= 800 K.
    - PAH_formation is added when:
        (a) C + T >= 800 K + no O in species (pure C/H pyrolysis), OR
        (b) C + T >= 800 K + <PAH_libs> keyword requested by user.

    Args:
        profile: ChemistryProfile instance.
        pah_libs_requested: bool, True if user included <PAH_libs> keyword.

    Returns:
        List of ChemistrySet values in priority order.
    """
    sets = [ChemistrySet.PRIMARY]

    if profile.has_nitrogen:
        sets.append(ChemistrySet.NITROGEN)

    if profile.has_sulfur:
        sets.append(ChemistrySet.SULFUR)

    if profile.has_oxygen:
        sets.append(ChemistrySet.OXIDATION)

    high_T_carbon = profile.has_carbon and profile.max_temperature >= CH_PYROLYSIS_T_THRESHOLD

    if high_T_carbon:
        sets.append(ChemistrySet.CH_PYROLYSIS_CORE)

        if not profile.has_oxygen or pah_libs_requested:
            sets.append(ChemistrySet.PAH_FORMATION)

    if profile.has_liquid and profile.has_oxygen:
        sets.append(ChemistrySet.LIQUID_OXIDATION)

    if profile.has_surface:
        sets.append(ChemistrySet.SURFACE)

    if profile.has_surface and profile.has_nitrogen:
        sets.append(ChemistrySet.SURFACE_NITROGEN)

    if profile.has_halogens:
        sets.append(ChemistrySet.HALOGENS)

    if profile.has_electrochem:
        sets.append(ChemistrySet.ELECTROCHEM)

    return sets


def determine_kinetics_families(profile: ChemistryProfile) -> List[str]:
    """
    Determine which kinetics family sets to activate based on the detected profile.

    These correspond to named sets in RMG-database/input/kinetics/families/recommended.py.

    Args:
        profile: ChemistryProfile instance.

    Returns:
        List of FamilySet values to combine.
    """
    family_sets = [FamilySet.DEFAULT]

    if profile.has_carbon and profile.max_temperature >= CH_PYROLYSIS_T_THRESHOLD:
        family_sets.append(FamilySet.CH_PYROLYSIS)

    if profile.has_liquid and profile.has_oxygen:
        family_sets.append(FamilySet.LIQUID_PEROXIDE)

    if profile.has_surface:
        family_sets.append(FamilySet.SURFACE)

    if profile.has_halogens:
        family_sets.append(FamilySet.HALOGENS)

    if profile.has_electrochem:
        family_sets.append(FamilySet.ELECTROCHEM)

    return family_sets


def load_recommended_yml(database_directory: str) -> dict:
    """
    Load the recommended_libraries.yml file from the RMG database.

    Args:
        database_directory: path to the RMG database 'input' directory.

    Returns:
        dict parsed from YAML.
    """
    yml_path = os.path.join(database_directory, 'recommended_libraries.yml')
    if not os.path.isfile(yml_path):
        raise InputError(f"Could not find recommended_libraries.yml at {yml_path}. "
                         f"This file is required for 'auto' library selection.")
    with open(yml_path, 'r') as f:
        return yaml.safe_load(f)


def expand_chemistry_sets(recommended_data: dict,
                          set_names: List[str],
                          ) -> Tuple[List[str], List[str], List[str], List[str]]:
    """
    Expand named chemistry sets into concrete library lists.

    Args:
        recommended_data: dict from recommended_libraries.yml.
        set_names: list of chemistry set names to expand.

    Returns:
        Tuple of (thermo_libraries, kinetics_libraries, transport_libraries, seed_libraries)
        where each is a list of library name strings.
    """
    # Primary must always be expanded first so its libraries have highest priority.
    primary_val = ChemistrySet.PRIMARY.value
    has_primary = any(str(s) == primary_val for s in set_names)
    other_sets = [s for s in set_names if str(s) != primary_val]
    set_names = ([ChemistrySet.PRIMARY] if has_primary else []) + other_sets

    thermo, kinetics, transport, seed = [], [], [], []

    for set_name in set_names:
        if set_name not in recommended_data:
            raise InputError(f"Chemistry set '{set_name}' not found in recommended_libraries.yml. "
                             f"Available sets: {list(recommended_data.keys())}")
        set_data = recommended_data[set_name]

        for entry in set_data.get('thermo', []):
            name = entry if isinstance(entry, str) else entry['name']
            if name not in thermo:
                thermo.append(name)

        for entry in set_data.get('kinetics', []):
            if isinstance(entry, str):
                if entry not in kinetics:
                    kinetics.append(entry)
            elif isinstance(entry, dict):
                name = entry['name']
                if entry.get('seed', False):
                    if name not in seed:
                        seed.append(name)
                else:
                    if name not in kinetics:
                        kinetics.append(name)

        for entry in set_data.get('transport', []):
            name = entry if isinstance(entry, str) else entry['name']
            if name not in transport:
                transport.append(name)

    return thermo, kinetics, transport, seed


def merge_with_user_libraries(user_spec: Any, auto_libs: List[str]) -> list:
    """
    Merge user-specified libraries with auto-selected libraries,
    respecting the position of the 'auto' token. <PAH_libs> tokens
    are silently removed (they've already been used as a signal).

    Args:
        user_spec: the user's library specification. Can be:
            - 'auto' (string): fully replace with auto_libs
            - list containing 'auto' token: replace token in-place with auto_libs
            - list without 'auto': return as-is (with <PAH_libs> stripped)
            - None or []: return as-is
        auto_libs: list of auto-selected library names.

    Returns:
        Resolved list of library names.
    """
    if user_spec == AUTO:
        return list(auto_libs)

    if not isinstance(user_spec, list):
        return user_spec

    # Collect all user-specified library names (excluding special tokens)
    user_lib_names = set()
    for item in user_spec:
        if item not in (AUTO, PAH_LIBS):
            name = item[0] if isinstance(item, tuple) else item
            user_lib_names.add(name)

    # Filter auto libs to exclude any already specified by user
    filtered_auto = [lib for lib in auto_libs if lib not in user_lib_names]

    # Replace tokens in-place
    result = []
    for item in user_spec:
        if item == AUTO:
            result.extend(filtered_auto)
        elif item == PAH_LIBS:
            continue
        else:
            result.append(item)

    return result
Type annotations don’t match the actual values returned/consumed: determine_chemistry_sets() is annotated to return List[str] but returns ChemistrySet enum members, and determine_kinetics_families() similarly returns FamilySet members. Likewise merge_with_user_libraries() is annotated to return list but can return None or a non-list user_spec. Please update the return type hints (and any dependent parameter hints like expand_chemistry_sets(set_names=...)) to reflect the real runtime types, so static analysis and IDE tooling don’t mislead callers.
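A minimal sketch of what corrected annotations could look like follows. It is not the PR's code: the enum members other than PRIMARY, the reduced ChemistryProfile dataclass, and the LibrarySpec alias are all illustrative assumptions, and the function bodies are trimmed to what the signatures need.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Union


class ChemistrySet(Enum):
    """Stand-in enum; only PRIMARY's value ('primary') comes from the review thread."""
    PRIMARY = "primary"
    NITROGEN = "nitrogen"  # hypothetical value


@dataclass
class ChemistryProfile:
    """Reduced stand-in for the PR's profile object (assumption)."""
    has_nitrogen: bool = False


def determine_chemistry_sets(profile: ChemistryProfile) -> List[ChemistrySet]:
    # The annotation names the enum type that is actually returned.
    sets = [ChemistrySet.PRIMARY]
    if profile.has_nitrogen:
        sets.append(ChemistrySet.NITROGEN)
    return sets


# Hypothetical alias: a spec may be 'auto', a list of names or (name, flag) tuples, or None.
LibrarySpec = Union[str, List[Union[str, tuple]], None]


def merge_with_user_libraries(user_spec: LibrarySpec,
                              auto_libs: Sequence[str]) -> LibrarySpec:
    # Non-list specs (for example None) are passed through unchanged,
    # so the return annotation has to admit them.
    if user_spec == "auto":
        return list(auto_libs)
    if not isinstance(user_spec, list):
        return user_spec
    # Token replacement is omitted here; only the signature matters for this sketch.
    return list(user_spec)
```

With annotations like these, a type checker would flag a caller that treats the result of determine_chemistry_sets() as a list of strings, or that assumes merge_with_user_libraries() never returns None.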
rmgpy/data/auto_database.py
Outdated
    has_primary = any(str(s) == primary_val for s in set_names)
    other_sets = [s for s in set_names if str(s) != primary_val]
    set_names = ([ChemistrySet.PRIMARY] if has_primary else []) + other_sets
expand_chemistry_sets() tries to ensure the primary set is expanded first, but it checks str(s) == 'primary'. For ChemistrySet enum members, str(ChemistrySet.PRIMARY) is 'ChemistrySet.PRIMARY', so has_primary will be false and the reordering logic won’t work if set_names contains enum members (as it does when passed the output of determine_chemistry_sets()). Consider normalizing via getattr(s, 'value', s) (or comparing s == ChemistrySet.PRIMARY / s.value) before doing the primary-first reordering.
-    has_primary = any(str(s) == primary_val for s in set_names)
-    other_sets = [s for s in set_names if str(s) != primary_val]
-    set_names = ([ChemistrySet.PRIMARY] if has_primary else []) + other_sets
+    # Normalize set names so we can handle both strings and ChemistrySet enum members.
+    normalized_names = [getattr(s, "value", s) for s in set_names]
+    has_primary = any(n == primary_val for n in normalized_names)
+    other_sets = [n for n in normalized_names if n != primary_val]
+    # After this point, set_names contains only string names (e.g., "primary").
+    set_names = ([primary_val] if has_primary else []) + other_sets
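The behavior described in this comment can be reproduced in isolation. The sketch below assumes ChemistrySet is a plain Enum with PRIMARY = 'primary'; the NITROGEN member and the helper name primary_first are hypothetical and exist only for the demonstration.

```python
from enum import Enum


class ChemistrySet(Enum):
    """Stand-in enum (assumption: the PR's enum behaves like a plain Enum here)."""
    PRIMARY = "primary"
    NITROGEN = "nitrogen"  # hypothetical member


def primary_first(set_names):
    """Return string set names with 'primary' moved to the front, if present."""
    primary_val = ChemistrySet.PRIMARY.value
    # getattr(..., "value", s) unwraps enum members and leaves plain strings alone.
    normalized = [getattr(s, "value", s) for s in set_names]
    has_primary = primary_val in normalized
    others = [n for n in normalized if n != primary_val]
    return ([primary_val] if has_primary else []) + others


# str() on a plain Enum member gives the qualified name, not the value,
# which is why the original comparison against 'primary' never matches.
print(str(ChemistrySet.PRIMARY))
print(primary_first([ChemistrySet.NITROGEN, ChemistrySet.PRIMARY]))
```

Normalizing to strings up front also means the later lookup in the parsed YAML dict works on plain string keys regardless of what the caller passed in.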
    # Handle 'auto' token: pass through for later resolution by auto_select_libraries().
    # '<PAH_libs>' is only valid as a token inside a list, not as a standalone value.
    _LIST_TOKENS = (AUTO, PAH_LIBS)

    if thermoLibraries == AUTO:
        rmg.thermo_libraries = AUTO
    else:
        rmg.thermo_libraries = as_list(thermoLibraries, default=[])

    if transportLibraries == AUTO:
        rmg.transport_libraries = AUTO
    else:
        rmg.transport_libraries = as_list(transportLibraries, default=None)

    # Store reaction libraries as plain strings; remember which ones had True option
    # (the bool indicates "also output unused edge reactions to chemkin file")
    if reactionLibraries == AUTO:
        rmg.reaction_libraries = AUTO
        rmg.reaction_libraries_output_edge = set()
    else:
        reaction_libraries = as_list(reactionLibraries, default=[])
        rmg.reaction_libraries = []
        rmg.reaction_libraries_output_edge = set()
        for item in reaction_libraries:
            if item in _LIST_TOKENS:
                rmg.reaction_libraries.append(item)
            elif isinstance(item, tuple):
                name, option = item
                rmg.reaction_libraries.append(name)
                if option:
                    rmg.reaction_libraries_output_edge.add(name)
            else:
                rmg.reaction_libraries.append(item)

    if seedMechanisms == AUTO:
        rmg.seed_mechanisms = AUTO
    else:
        rmg.seed_mechanisms = as_list(seedMechanisms, default=[])
The comment says '<PAH_libs>' is only valid inside a list, but the code currently accepts thermoLibraries == PAH_LIBS / transportLibraries == PAH_LIBS / seedMechanisms == PAH_LIBS (and also reactionLibraries == PAH_LIBS) without raising. This leads to confusing downstream behavior (e.g., auto_select_libraries() will treat it as a special token and _log_lib_list() will iterate over a string). Please add explicit validation to raise InputError when any of these fields is set to '<PAH_libs>' as a standalone value, and also reject tuples like ('<PAH_libs>', True) in reactionLibraries.
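The requested validation might look roughly like the sketch below. The helper name validate_library_field and the local InputError class are stand-ins invented for this example (the real code would presumably reuse RMG's own InputError), and the error wording is illustrative.

```python
AUTO = "auto"
PAH_LIBS = "<PAH_libs>"


class InputError(Exception):
    """Local stand-in for RMG's input error type (assumption)."""


def validate_library_field(field_name, value):
    """Raise InputError if '<PAH_libs>' is misused in a database() field."""
    # Case 1: the token given as the whole value rather than inside a list.
    if isinstance(value, str) and value == PAH_LIBS:
        raise InputError(
            f"{field_name}: '{PAH_LIBS}' is only valid as a token inside a list."
        )
    # Case 2: the token wrapped in a (name, option) tuple.
    items = value if isinstance(value, list) else [value]
    for item in items:
        if isinstance(item, tuple) and item and item[0] == PAH_LIBS:
            raise InputError(
                f"{field_name}: '{PAH_LIBS}' cannot be combined with an option flag."
            )


# Accepted: the token used inside a list alongside 'auto'.
validate_library_field("thermoLibraries", [AUTO, PAH_LIBS])
```

Calling such a helper once per field before the values are stored would keep a bare '<PAH_libs>' string from reaching auto_select_libraries().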
Adds automatic library and kinetics family selection to RMG. Users can now write thermoLibraries='auto' (and the same for reaction libraries, transport libraries, seed mechanisms, and kinetics families) in their input file, and RMG will pick the libraries based on the species and reactor conditions in the input (subject to the correctness of recommended_libraries.yml).

The selection logic detects elements (N, S, O, halogens, Li), reactor type (gas/liquid/surface), and temperature to trigger the appropriate chemistry sets defined in a new recommended_libraries.yml file in RMG-database. Kinetics families are similarly auto-selected from the existing recommended.py sets.
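To illustrate how a mixed manual/auto list is resolved, here is a condensed restatement of the merge_with_user_libraries() logic quoted in the review thread. The function name merge_with_auto and the library names (MyLib, LibA, LibB, LibC) are placeholders, not real RMG-database entries.

```python
AUTO = "auto"
PAH_LIBS = "<PAH_libs>"


def merge_with_auto(user_spec, auto_libs):
    """Expand 'auto' in place, drop '<PAH_libs>', and skip user-listed duplicates."""
    if user_spec == AUTO:
        return list(auto_libs)
    if not isinstance(user_spec, list):
        return user_spec
    # Names the user listed explicitly; tuples carry the name in position 0.
    user_names = {
        item[0] if isinstance(item, tuple) else item
        for item in user_spec
        if item not in (AUTO, PAH_LIBS)
    }
    filtered = [lib for lib in auto_libs if lib not in user_names]
    result = []
    for item in user_spec:
        if item == AUTO:
            result.extend(filtered)
        elif item != PAH_LIBS:
            result.append(item)
    return result


auto_libs = ["LibA", "LibB", "LibC"]
merged = merge_with_auto(["MyLib", AUTO, "LibB", PAH_LIBS], auto_libs)
print(merged)  # ['MyLib', 'LibA', 'LibC', 'LibB']
```

The position of 'auto' in the user's list therefore controls priority: entries before it outrank the auto-selected libraries, and a library the user names explicitly keeps the position the user gave it.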
Key design choices:
to opt in
A Preview notebook at ipython/auto_library_selection.ipynb
We should first merge the db branch ReactionMechanismGenerator/RMG-database#712
PR adapted from the existing implementation in T3 (libraries.yml and code)