feat(pt/dpmodel): add lmdb dataloader #5283

Draft
iProzd wants to merge 3 commits into deepmodeling:master from iProzd:0225-lmdb-dataloader

Conversation

Collaborator

@iProzd iProzd commented Mar 3, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for LMDB data format in training workflows, enabling efficient data handling for large-scale datasets.
    • Introduced batching strategies that group frames by atom count for optimized training performance.
    • Enhanced distributed training capabilities with improved batch sampling across multiple GPUs.
  • Documentation

    • Added example configuration for LMDB-based training setup.

Copilot AI review requested due to automatic review settings March 3, 2026 07:56
@iProzd iProzd marked this pull request as draft March 3, 2026 07:56
@dosubot dosubot bot added the new feature label Mar 3, 2026

    def print_summary(self, name: str, prob: Any) -> None:
        """Print basic dataset info."""
        unique_nlocs = sorted(self._nloc_groups.keys())

Check notice (Code scanning / CodeQL): Unused local variable
Variable unique_nlocs is not used.
DataRequirementItem,
)

log = logging.getLogger(__name__)

Check notice (Code scanning / CodeQL): Unused global variable
The global variable 'log' is not used.
    def test_getitem_out_of_range(self, lmdb_dir):
        ds = LmdbDataset(lmdb_dir, type_map=["O", "H"], batch_size=2)
        with pytest.raises(IndexError):
            ds[999]

Check notice (Code scanning / CodeQL, test): Statement has no effect
This statement has no effect.
self._world_size - 1 - self._rank :: self._world_size
]

s_default = DistributedSameNlocBatchSampler(

Check notice (Code scanning / CodeQL, test): Unused local variable
Variable s_default is not used.
Contributor

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

This pull request introduces comprehensive LMDB (Lightning Memory-Mapped Database) support to the DeePMD framework, adding a framework-agnostic data reader with batch sampling utilities, PyTorch integration, training pipeline updates, test support enhancements, and extensive test coverage for uniform and mixed-nloc scenarios.

Changes

Cohort / File(s) — Summary

  • LMDB Core Framework (deepmd/dpmodel/utils/__init__.py, deepmd/dpmodel/utils/lmdb_data.py)
    Introduces framework-agnostic LMDB data utilities, including LmdbDataReader for metadata/frame decoding, SameNlocBatchSampler and DistributedSameNlocBatchSampler for per-nloc batching, LmdbTestData for test workflows, and a merge_lmdb utility. Exports the new symbols through the module init.
  • PyTorch LMDB Integration (deepmd/pt/utils/lmdb_dataset.py, deepmd/pt/entrypoints/main.py, deepmd/pt/train/training.py)
    Introduces a PyTorch-specific LMDB wrapper (LmdbDataset) with collation and sampler adapters. Extends prepare_trainer_input_single to support LmdbDataset return types. Modifies training.py to construct LMDB-backed dataloaders, handle the mixed_batch flag, and compute batch statistics from LMDB readers.
  • Test and Evaluation (deepmd/entrypoints/test.py)
    Adds LMDB-aware test-path branching via LmdbTestData with per-nloc group handling. Introduces a _LmdbTestDataNlocView wrapper to filter LMDB data by nloc groups and iterates tests per group when multiple nloc values exist.
  • Example Configuration (examples/lmdb_data/input_lmdb.json)
    Provides a complete LMDB training configuration demonstrating model, learning rate, loss, and training setup with water_training.lmdb and water_validation.lmdb data sources.
  • Test Coverage (source/tests/consistent/test_lmdb_data.py, source/tests/pt/test_lmdb_dataloader.py)
    Comprehensive test suites validating consistency between LmdbDataReader and LmdbDataset, uniform and mixed-nloc scenarios, batch-sampling determinism, collation behavior, distributed batching, and test-data generation.
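The per-nloc batching described above can be sketched in a few lines. This is a simplified, hypothetical stand-in for the PR's SameNlocBatchSampler; the function name and logic here are illustrative, not the actual implementation:

```python
from collections import defaultdict


def same_nloc_batches(frame_nlocs, batch_size):
    """Group frame indices by atom count (nloc), then split each
    group into fixed-size batches, so every batch is rectangular."""
    groups = defaultdict(list)
    for idx, nloc in enumerate(frame_nlocs):
        groups[nloc].append(idx)
    batches = []
    for nloc in sorted(groups):
        idxs = groups[nloc]
        for start in range(0, len(idxs), batch_size):
            batches.append(idxs[start:start + batch_size])
    return batches
```

Because every batch draws from a single nloc group, frames can be stacked into dense tensors without padding.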

Sequence Diagrams

sequenceDiagram
    participant App as Application<br/>(Trainer)
    participant DS as LmdbDataset
    participant Reader as LmdbDataReader
    participant LMDB as LMDB Database
    participant DL as PyTorch DataLoader
    
    App->>DS: __init__(lmdb_path, type_map,<br/>batch_size)
    DS->>Reader: __init__(lmdb_path,<br/>type_map, batch_size)
    Reader->>LMDB: Open read-only txn
    Reader->>LMDB: Read __metadata__
    Reader->>Reader: Precompute frame_nlocs<br/>& nloc_groups
    
    App->>DS: get dataloader
    DS->>DS: Create SameNlocBatchSampler
    DS->>DL: DataLoader(sampler,<br/>collate_fn)
    
    App->>DL: iter()
    loop Per Batch
        DL->>DS: __getitem__(frame_id)
        DS->>Reader: __getitem__(frame_id)
        Reader->>LMDB: Read frame data
        Reader->>Reader: Decode arrays<br/>& remap keys
        Reader-->>DS: dict[str,ndarray]
        DS->>DS: _collate_lmdb_batch
        DL-->>App: Batched tensor dict
    end
sequenceDiagram
    participant Trainer as Trainer
    participant Main as prepare_trainer<br/>_input_single
    participant Check as is_lmdb
    participant DS as LmdbDataset/Dataset
    participant TL as Training Loop
    
    Trainer->>Main: training_systems,<br/>validation_systems
    Main->>Check: is_lmdb(training_systems)
    Check-->>Main: True/False
    
    alt LMDB Path
        Main->>DS: LmdbDataset(train_path,<br/>type_map, batch_size)
        alt Validation is LMDB
            Main->>DS: LmdbDataset(val_path)
        else Validation is npy
            Main->>DS: Process patterns &<br/>create DpLoaderSet
        end
    else Non-LMDB Path
        Main->>DS: Process patterns
        Main->>DS: DpLoaderSet(train_data)
        Main->>DS: DpLoaderSet(val_data)
    end
    
    Main-->>Trainer: (train_data,<br/>val_data, DPPath)
    Trainer->>TL: get_data_loader(train_data)
    TL->>TL: Instantiate appropriate<br/>loader & sampler
    TL-->>Trainer: (dataloader, iterator)
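The collation step in the first diagram (stacking per-frame dicts into one batched dict) can be illustrated with a self-contained sketch; `collate_lmdb_batch` below is an assumed simplification, not the PR's `_collate_lmdb_batch`:

```python
import numpy as np


def collate_lmdb_batch(frames):
    """Stack a list of per-frame dicts (str -> ndarray) into a single
    batched dict. This only works when all frames share the same array
    shapes, which the same-nloc batch sampler guarantees."""
    return {k: np.stack([f[k] for f in frames], axis=0) for k in frames[0]}
```

A batch of two 4-atom frames with (4, 3) coordinate arrays thus collates to a single (2, 4, 3) array.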

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

new feature, Python, enhancement, Examples

Suggested reviewers

  • wanghan-iapcm
  • njzjz
  • anyangml
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 47.65%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'feat(pt/dpmodel): add lmdb dataloader' directly and clearly summarizes the main change: adding LMDB dataloader functionality to the PyTorch (pt) and dpmodel modules.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deepmd/pt/train/training.py (1)

529-542: ⚠️ Potential issue | 🔴 Critical

Multi-task LMDB path still assumes .sampler.weights and can crash.

In LMDB mode, dataloaders are batch-sampler based and do not provide sampler weights. The current multi-task code still dereferences .sampler.weights in both summary printing and num_epoch_dict total-batch computation.

Proposed fix
                 training_data[model_key].print_summary(
                     f"training in {model_key}",
-                    to_numpy_array(self.training_dataloader[model_key].sampler.weights),
+                    to_numpy_array(self.training_dataloader[model_key].sampler.weights)
+                    if not isinstance(training_data[model_key], LmdbDataset)
+                    else [1.0],
                 )
@@
                     validation_data[model_key].print_summary(
                         f"validation in {model_key}",
-                        to_numpy_array(
-                            self.validation_dataloader[model_key].sampler.weights
-                        ),
+                        to_numpy_array(
+                            self.validation_dataloader[model_key].sampler.weights
+                        )
+                        if not isinstance(validation_data[model_key], LmdbDataset)
+                        else [1.0],
                     )
@@
                 for model_key in self.model_keys:
-                    sampler_weights = to_numpy_array(
-                        self.training_dataloader[model_key].sampler.weights
-                    )
-                    per_task_total.append(
-                        compute_total_numb_batch(
-                            training_data[model_key].index,
-                            sampler_weights,
-                        )
-                    )
+                    if isinstance(training_data[model_key], LmdbDataset):
+                        per_task_total.append(training_data[model_key].total_batch)
+                    else:
+                        sampler_weights = to_numpy_array(
+                            self.training_dataloader[model_key].sampler.weights
+                        )
+                        per_task_total.append(
+                            compute_total_numb_batch(
+                                training_data[model_key].index,
+                                sampler_weights,
+                            )
+                        )

Also applies to: 583-590

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 529 - 542, The code assumes
sampler.weights exists for training/validation dataloaders (see
training_data[model_key].print_summary calls and the num_epoch_dict total-batch
computation) which breaks for LMDB batch-sampler dataloaders; fix by guarding
access to sampler.weights: when printing call
to_numpy_array(self.training_dataloader[model_key].sampler.weights) only if
hasattr(self.training_dataloader[model_key].sampler, "weights") (or use getattr
with a default None) and pass None or an empty array otherwise, and for
num_epoch_dict total-batch computation replace uses of sum(sampler.weights) with
a safe fallback that uses len(self.training_dataloader[model_key]) (number of
batches) when weights are absent; update the same logic for
validation_dataloader and any other places that reference .sampler.weights
(e.g., the block around num_epoch_dict).
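The guard suggested above can be sketched as a small standalone helper (hypothetical, not code from the PR): use `sampler.weights` when present, otherwise fall back to the dataloader length, which is what a batch-sampler-based LMDB dataloader provides.

```python
def total_numb_batch(dataloader):
    """Total-batch count that works for both weighted-sampler and
    batch-sampler-based (e.g. LMDB) dataloaders: sum sampler.weights
    when the attribute exists, otherwise use len(dataloader)."""
    sampler = getattr(dataloader, "sampler", None)
    weights = getattr(sampler, "weights", None)
    if weights is not None:
        return int(sum(weights))
    return len(dataloader)
```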

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3db25a and 5a10d4d.

📒 Files selected for processing (9)
  • deepmd/dpmodel/utils/__init__.py
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/entrypoints/test.py
  • deepmd/pt/entrypoints/main.py
  • deepmd/pt/train/training.py
  • deepmd/pt/utils/lmdb_dataset.py
  • examples/lmdb_data/input_lmdb.json
  • source/tests/consistent/test_lmdb_data.py
  • source/tests/pt/test_lmdb_dataloader.py

Comment on lines +242 to +253
        if isinstance(batch_size, str):
            if batch_size == "auto":
                self._auto_rule = 32
            elif batch_size.startswith("auto:"):
                self._auto_rule = int(batch_size.split(":")[1])
            else:
                self._auto_rule = 32
            # Default batch_size uses first frame's nloc (for total_batch estimate)
            self.batch_size = _compute_batch_size(self._natoms, self._auto_rule)
        else:
            self.batch_size = int(batch_size)


⚠️ Potential issue | 🟠 Major

Validate batch_size strictly instead of silently coercing invalid input.

At Line 248, unknown string specs are silently treated as "auto:32", and at Line 252 non-positive integers are accepted. This can hide config mistakes and trigger runtime failures later (e.g., division by zero at Line 393).

Suggested fix
         self._auto_rule: int | None = None
         if isinstance(batch_size, str):
             if batch_size == "auto":
                 self._auto_rule = 32
             elif batch_size.startswith("auto:"):
-                self._auto_rule = int(batch_size.split(":")[1])
+                try:
+                    self._auto_rule = int(batch_size.split(":", 1)[1])
+                except ValueError as exc:
+                    raise ValueError(
+                        "batch_size must be 'auto', 'auto:N', or a positive integer"
+                    ) from exc
+                if self._auto_rule <= 0:
+                    raise ValueError("auto batch_size rule N must be > 0")
             else:
-                self._auto_rule = 32
+                raise ValueError(
+                    "batch_size must be 'auto', 'auto:N', or a positive integer"
+                )
             # Default batch_size uses first frame's nloc (for total_batch estimate)
             self.batch_size = _compute_batch_size(self._natoms, self._auto_rule)
         else:
             self.batch_size = int(batch_size)
+            if self.batch_size <= 0:
+                raise ValueError("batch_size must be > 0")

Also applies to: 393-393

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 242 - 253, The batch_size
handling silently coerces unknown strings and accepts non-positive integers;
update the branch in the batch_size setter so that only "auto" or
"auto:<positive int>" strings are allowed (parse "auto:<n>" into an int and
validate n>0), otherwise raise a ValueError with a clear message; likewise, when
batch_size is numeric (the else branch that sets self.batch_size =
int(batch_size)) validate the parsed integer is >0 and raise a ValueError if
not; keep using _compute_batch_size(self._natoms, self._auto_rule) for the
"auto" case but ensure _auto_rule is a positive int after parsing.

Comment on lines +263 to +267
        counts = np.bincount(atype, minlength=self._ntypes)[: self._ntypes]
        vec = np.empty(self._ntypes + 2, dtype=np.int64)
        vec[0] = nloc
        vec[1] = nloc
        vec[2:] = counts

⚠️ Potential issue | 🟠 Major

Guard against out-of-range atom type ids before building natoms_vec.

np.bincount(... )[: self._ntypes] truncates counts for type ids >= self._ntypes, so natoms_vec can become inconsistent with the frame contents without any error.

Suggested fix
         nloc = len(atype)
+        if atype.size and (np.min(atype) < 0 or np.max(atype) >= self._ntypes):
+            raise ValueError(
+                f"atype contains out-of-range ids; expected [0, {self._ntypes - 1}]"
+            )
         counts = np.bincount(atype, minlength=self._ntypes)[: self._ntypes]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 263 - 267, Before building
natoms_vec/vec, validate that all atom type ids in atype are within [0,
self._ntypes-1]; detect any ids <0 or >= self._ntypes and raise a descriptive
ValueError (including the offending ids or their min/max and, if available, the
frame index) instead of silently truncating via np.bincount(... )[:
self._ntypes]. Place this check immediately before the np.bincount call that
computes counts (the code using atype and self._ntypes and filling
vec/natoms_vec) so you fail fast and keep natoms_vec consistent with the frame
contents.
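To see why the truncation is dangerous, here is a small self-contained demonstration in plain NumPy (not the PR's code; `checked_counts` is a hypothetical fail-fast variant):

```python
import numpy as np

ntypes = 2
atype = np.array([0, 1, 3])  # type id 3 is out of range for ntypes = 2

# np.bincount followed by [:ntypes] silently drops the out-of-range atom:
counts = np.bincount(atype, minlength=ntypes)[:ntypes]
assert int(counts.sum()) == 2  # 3 atoms went in, only 2 were counted


def checked_counts(atype, ntypes):
    """Fail-fast variant: reject out-of-range type ids before counting."""
    if atype.size and (atype.min() < 0 or atype.max() >= ntypes):
        raise ValueError(f"atype contains ids outside [0, {ntypes - 1}]")
    return np.bincount(atype, minlength=ntypes)[:ntypes]
```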


    def print_summary(self, name: str, prob: Any) -> None:
        """Print basic dataset info."""
        unique_nlocs = sorted(self._nloc_groups.keys())

⚠️ Potential issue | 🟠 Major

Remove unused local to keep Ruff clean (F841).

unique_nlocs is assigned but never used. This is already flagged by static analysis and should be removed to avoid CI lint failures.

Suggested fix
-        unique_nlocs = sorted(self._nloc_groups.keys())
         nloc_info = ", ".join(
             f"{nloc}({len(idxs)})" for nloc, idxs in sorted(self._nloc_groups.items())
         )

As per coding guidelines, **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 374-374: Local variable unique_nlocs is assigned to but never used

Remove assignment to unused variable unique_nlocs

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` at line 374, The local variable
unique_nlocs assigned from sorted(self._nloc_groups.keys()) is unused and
triggers a Ruff F841 lint error; remove this assignment statement (the unused
local unique_nlocs) from the method where it appears so the code no longer
creates unique_nlocs but retains any needed logic that uses self._nloc_groups
directly.

Comment on lines +705 to +719
        if nloc is not None:
            if nloc not in self._nloc_groups:
                raise ValueError(
                    f"No frames with nloc={nloc}. Available: {sorted(self._nloc_groups.keys())}"
                )
            frame_indices = self._nloc_groups[nloc]
            natoms = nloc
        elif len(self._nloc_groups) == 1:
            # Uniform nloc — use all frames
            natoms = next(iter(self._nloc_groups))
            frame_indices = list(range(len(self._frames)))
        else:
            # Mixed nloc — use the largest group
            natoms = max(self._nloc_groups, key=lambda k: len(self._nloc_groups[k]))
            frame_indices = self._nloc_groups[natoms]

⚠️ Potential issue | 🟡 Minor

Handle empty datasets explicitly in get_test().

If the LMDB has zero frames, self._nloc_groups is empty and max(...) at Line 718 raises a cryptic error path. Add an explicit guard for the empty case.

Suggested fix
         if nloc is not None:
             if nloc not in self._nloc_groups:
                 raise ValueError(
                     f"No frames with nloc={nloc}. Available: {sorted(self._nloc_groups.keys())}"
                 )
             frame_indices = self._nloc_groups[nloc]
             natoms = nloc
+        elif not self._frames:
+            raise ValueError("LMDB dataset contains no frames")
         elif len(self._nloc_groups) == 1:
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 707-709: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 705 - 719, In get_test(),
handle the empty-dataset case before using max on self._nloc_groups: if
self._nloc_groups is empty (or len(self._frames) == 0) raise a clear ValueError
(e.g. "No frames in LMDB / no nloc groups available") instead of letting
max(...) raise a cryptic error; otherwise proceed with the existing logic that
sets natoms and frame_indices from the largest group in self._nloc_groups.

Comment on lines +880 to +882
                if src_nlocs is not None:
                    frame_nlocs.append(int(src_nlocs[i]))
                else:

⚠️ Potential issue | 🟠 Major

Guard frame_nlocs metadata indexing during merge.

src_nlocs[i] assumes metadata length matches nframes. On malformed/truncated metadata this raises IndexError and aborts merge. Add a bounds check and fall back to frame parsing/fallback natoms.

Suggested fix
-                if src_nlocs is not None:
-                    frame_nlocs.append(int(src_nlocs[i]))
-                else:
+                if src_nlocs is not None and i < len(src_nlocs):
+                    frame_nlocs.append(int(src_nlocs[i]))
+                else:
                     frame_raw = msgpack.unpackb(raw, raw=False)
                     atype_raw = frame_raw.get("atom_types")
                     if isinstance(atype_raw, dict):
                         shape = atype_raw.get("shape") or atype_raw.get(b"shape")
                         if shape:
                             frame_nlocs.append(int(shape[0]))
                         else:
                             frame_nlocs.append(fallback_natoms)
                     else:
                         frame_nlocs.append(fallback_natoms)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 880 - 882, The merge logic
assumes src_nlocs has at least nframes entries and does
frame_nlocs.append(int(src_nlocs[i])) without bounds checking; change this to
guard the index: if src_nlocs is not None and i < len(src_nlocs) then append
int(src_nlocs[i]) else compute/derive the per-frame natoms (e.g., by parsing the
frame from the source data or using the fallback natoms value already available)
so that malformed/truncated metadata does not raise IndexError during the merge;
update the block around the frame_nlocs population (references: frame_nlocs,
src_nlocs, nframes) to perform this fallback behavior.

Comment on lines +233 to +266
        for data, sys_label in data_items:
            if sys_label != system:
                log.info(f"# testing sub-group : {sys_label}")

            if isinstance(dp, DeepPot):
                err = test_ener(
                    dp,
                    data,
                    sys_label,
                    numb_test,
                    detail_file,
                    atomic,
                    append_detail=(cc != 0),
                )
            elif isinstance(dp, DeepDOS):
                err = test_dos(
                    dp,
                    data,
                    sys_label,
                    numb_test,
                    detail_file,
                    atomic,
                    append_detail=(cc != 0),
                )
            elif isinstance(dp, DeepProperty):
                err = test_property(
                    dp,
                    data,
                    sys_label,
                    numb_test,
                    detail_file,
                    atomic,
                    append_detail=(cc != 0),
                )

⚠️ Potential issue | 🟠 Major

append_detail is wrong for LMDB sub-groups and causes detail-file overwrite.

When one LMDB system is split into multiple nloc groups, all sub-groups of the first system still pass append_detail=False, so .e/.f/.v detail files get overwritten.

Proposed fix
-        for data, sys_label in data_items:
+        for sub_idx, (data, sys_label) in enumerate(data_items):
+            append_detail = (cc != 0) or (sub_idx != 0)
             if sys_label != system:
                 log.info(f"# testing sub-group : {sys_label}")
@@
                 err = test_ener(
@@
-                    append_detail=(cc != 0),
+                    append_detail=append_detail,
                 )
@@
                 err = test_dos(
@@
-                    append_detail=(cc != 0),
+                    append_detail=append_detail,
                 )
@@
                 err = test_property(
@@
-                    append_detail=(cc != 0),
+                    append_detail=append_detail,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/entrypoints/test.py` around lines 233 - 266, The append_detail flag is
incorrectly derived from cc so LMDB sub-groups for the first system overwrite
detail files; instead track whether a given sys_label has already been written
and pass append_detail=True for subsequent sub-groups. Modify the loop over
data_items: introduce a small seen map (e.g., seen_systems dict keyed by
sys_label) and compute append_detail = seen_systems.get(sys_label, False) when
calling test_ener/test_dos/test_property (replace the current append_detail=(cc
!= 0)); after the call set seen_systems[sys_label] = True so later sub-groups
append rather than overwrite.
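The append-flag logic can be sketched independently of the test harness; this hypothetical helper mirrors the enumerate-based fix proposed above, where only the very first write of the whole run creates the detail file:

```python
def detail_append_flags(systems):
    """systems: one list of sub-group labels per tested system.
    Returns (label, append) pairs: the first write of the run gets
    append=False (file created); every later sub-group or system
    gets append=True so detail files are never overwritten."""
    flags = []
    for cc, groups in enumerate(systems):
        for sub_idx, label in enumerate(groups):
            flags.append((label, (cc != 0) or (sub_idx != 0)))
    return flags
```

The sub-group labels below are made up for illustration; with the buggy `(cc != 0)` rule alone, the second sub-group of the first system would get append=False and clobber the file.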

        txn.put(b"__metadata__", msgpack.packb(meta, use_bin_type=True))
        idx = 0
        for natoms, count in frames_spec:
            for j in range(count):

⚠️ Potential issue | 🟡 Minor

Rename unused loop variable to satisfy Ruff (B007).

Line 128 uses a loop variable that is never read.

Proposed cleanup
-            for j in range(count):
+            for _j in range(count):

As per coding guidelines **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 128-128: Loop control variable j not used within loop body

Rename unused j to _j

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/consistent/test_lmdb_data.py` at line 128, The loop variable j
in the loop "for j in range(count):" is unused; rename it to "_" (or
"_i"/"_count" per project convention) to satisfy Ruff B007 and avoid lint
failures, update any related references in that loop (none expected), and re-run
ruff check . and ruff format . to ensure formatting and linting pass.

Comment on lines +405 to +407
                    # All frames in batch have same nloc
                    self.assertEqual(atype.shape[1], atype.shape[1])
                break  # just check first batch

⚠️ Potential issue | 🟡 Minor

Tautological assertion makes this test ineffective.

Line 406 compares the same value to itself, so it never validates batch consistency.

Proposed assertion
                 if atype is not None:
-                    # All frames in batch have same nloc
-                    self.assertEqual(atype.shape[1], atype.shape[1])
+                    nloc = atype.shape[1]
+                    for i in range(atype.shape[0]):
+                        self.assertEqual(int(batch["natoms"][i, 0]), nloc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `source/tests/consistent/test_lmdb_data.py` around lines 405-407, the assertion `self.assertEqual(atype.shape[1], atype.shape[1])` is tautological and validates nothing; replace it with a check that compares each frame's recorded atom count against a reference nloc taken from the batch (e.g. `nloc = atype.shape[1]`), so the test actually verifies that all frames in a batch have the same nloc.
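The suggested check can be sketched independently of the test harness. Here `atype` is assumed padded to shape (nframes, nloc) and `natoms[:, 0]` to hold each frame's true atom count; this batch layout is inferred from the diff above, not confirmed by the PR:

```python
import numpy as np

def assert_same_nloc(atype: np.ndarray, natoms: np.ndarray) -> int:
    """Verify every frame in a batch reports the same atom count.

    Unlike comparing atype.shape[1] with itself, this cross-checks the
    per-frame counts in `natoms` against the shared width of `atype`.
    """
    nloc = atype.shape[1]
    for i in range(atype.shape[0]):
        if int(natoms[i, 0]) != nloc:
            raise AssertionError(
                f"frame {i}: natoms {int(natoms[i, 0])} != nloc {nloc}"
            )
    return nloc

# A well-formed batch of 3 frames, 6 atoms each.
batch_atype = np.zeros((3, 6), dtype=np.int64)
batch_natoms = np.full((3, 1), 6, dtype=np.int64)
nloc = assert_same_nloc(batch_atype, batch_natoms)
```

A mismatched frame now fails loudly instead of passing silently, which is the behaviour the original tautological assertion was meant to provide.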

with env.begin(write=True) as txn:
idx = 0
for natoms in [4, 6, 8]:
for i in range(10):
Contributor
⚠️ Potential issue | 🟠 Major

Fix Ruff violations before merge (unused loop variable + dead assignment).

Line 617 (s_default) is assigned but never used (F841), and Line 480 uses an unused loop variable (B007).

Proposed cleanup
-            for i in range(10):
+            for _i in range(10):
                 key = format(idx, fmt).encode()
                 frame = _make_frame(natoms=natoms, seed=idx * 100)
                 txn.put(key, msgpack.packb(frame, use_bin_type=True))
                 frame_nlocs.append(natoms)
                 idx += 1
@@
-        s_default = DistributedSameNlocBatchSampler(
-            reader, rank=0, world_size=2, shuffle=True, seed=42
-        )
         s_custom = ReversePartition(reader, rank=0, world_size=2, shuffle=True, seed=42)

As per coding guidelines **/*.py: Always run ruff check . and ruff format . before committing changes or CI will fail.

Also applies to: 617-617

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 480-480: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `source/tests/pt/test_lmdb_dataloader.py` at line 480, the loop variable `i` is unused; rename it to `_i` (or `_`). At line 617, `s_default` is assigned but never used; remove the dead assignment (or reference it where intended) so no unused variables remain, then re-run `ruff check .` and `ruff format .`.
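Separately from the lint issues, the reverse-slice expression quoted earlier in the test (`self._world_size - 1 - self._rank :: self._world_size`) hints at how the custom partition differs from a plain strided one. A minimal sketch with hypothetical helper functions, assuming each rank takes every `world_size`-th batch index:

```python
def default_partition(n_batches: int, rank: int, world_size: int) -> list[int]:
    """Plain strided split: rank r takes indices r, r+ws, r+2*ws, ..."""
    return list(range(n_batches))[rank::world_size]

def reverse_partition(n_batches: int, rank: int, world_size: int) -> list[int]:
    """Reversed assignment: rank r starts from the far end of each
    stride window, mirroring the slice in the test's custom sampler."""
    return list(range(n_batches))[world_size - 1 - rank :: world_size]

# Both schemes cover every batch exactly once across ranks.
parts = [reverse_partition(10, r, 2) for r in range(2)]
```

Either partition is a valid distributed split; what the test presumably verifies is that a user-supplied sampler like `ReversePartition` is honored in place of the default.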

@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 78.87097% with 131 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.10%. Comparing base (65eea4b) to head (599d221).
⚠️ Report is 18 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/dpmodel/utils/lmdb_data.py | 81.88% | 77 Missing ⚠️ |
| deepmd/entrypoints/test.py | 58.82% | 21 Missing ⚠️ |
| deepmd/pt/train/training.py | 26.08% | 17 Missing ⚠️ |
| deepmd/pt/entrypoints/main.py | 52.63% | 9 Missing ⚠️ |
| deepmd/pt/utils/lmdb_dataset.py | 93.06% | 7 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5283      +/-   ##
==========================================
+ Coverage   82.00%   82.10%   +0.09%     
==========================================
  Files         750      755       +5     
  Lines       75082    76411    +1329     
  Branches     3615     3648      +33     
==========================================
+ Hits        61571    62735    +1164     
- Misses      12347    12507     +160     
- Partials     1164     1169       +5     

☔ View full report in Codecov by Sentry.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

2 participants