deepmodeling · dyzheng · May 1, 2026 · May 1, 2026 · May 1, 2026 · May 1, 2026
diff --git a/DEBUG_NOTES.md b/DEBUG_NOTES.md
@@ -0,0 +1,223 @@
+# Debug Notes for ABACUS Developers
+
+## Issue 1: SDFT GPU ComplexMatrix Crash
+
+During CUDA build testing of the `feat/dftu-pw-port-v2` branch, `16_SDFT_GPU/001_PW_KG_100_GPU` test crashed with ComplexMatrix assertion failure:
+
+```
+abacus_basic_gpu: source/source_base/complexmatrix.h:38: 
+std::complex<double>& ModuleBase::ComplexMatrix::operator()(int, int): 
+Assertion `ir<nr' failed.
+```
+
+## Root Cause Analysis
+
+### Timeline of the Bug
+
+| Commit | Branch | Description |
+|--------|--------|-------------|
+| `4b847b642` | develop | Added lazy-allocation guard for GPU Velocity path |
+| `8a07ea65a` | feat/dftu-pw-port-v2 | **Erroneously removed the guard** |
+| Current | feat/dftu-pw-port-v2 | Fixed by restoring the guard |
+
+### Detailed Explanation
+
+1. **Commit `4b847b642` (develop branch)** introduced GPU memory optimization:
+   - GPU path skips `vkb.create()` in `init()` to save CPU memory (~3.2 GB for large systems)
+   - Added `vkbnc` member to store column dimension independently
+   - Added lazy-allocation guard in `getgradq_vnl()`:
+
+   ```cpp
+   // GPU path skips vkb allocation in init(); allocate now if needed
+   if (this->vkb.nc == 0 && this->nkb > 0 && this->vkbnc > 0) {
+       this->vkb.create(this->nkb, this->vkbnc);
+   }
+   ```
+
+2. **Commit `8a07ea65a` (PR branch)** removed this guard:
+   - Commit message claimed: "Remove vkbnc member from pseudopot_cell_vnl; always allocate vkb ComplexMatrix (simplifies GPU path)"
+   - But the actual code in `vnl_pw.cpp:init()` still had `!use_gpu_` condition preventing allocation on GPU path
+   - This inconsistency caused `getgradq_vnl()` to access unallocated `vkb` when called by `Velocity::init()` for conductivity calculation
+
+### Code Flow
+
+```
+Velocity::init(ik)
+    → ppcell->getgradq_vnl(ucell, ik)
+        → this->vkb(jkb, 0)  // CRASH: vkb.nc == 0 on GPU path
+```
+
+## Lessons Learned
+
+### 1. Understand Code Purpose Before Removing
+
+When deleting code (especially guards or conditional checks):
+
+- **Always understand why it exists** - Check commit history, commit message, and related discussions
+- **Verify assumptions** - The commit message of `8a07ea65a` claimed "always allocate vkb", but actual code contradicted this
+- **Test affected paths** - GPU Velocity path (`op_pw_vel.cpp`) uses `getgradq_vnl()` which needs `vkb`
+
+### 2. GPU/CPU Path Differences Require Special Care
+
+Key patterns in ABACUS GPU code:
+
+| Pattern | CPU Path | GPU Path |
+|---------|----------|----------|
+| vkb allocation | `vkb.create(nkb, npwx)` | Skipped (uses `c_vkb`/`z_vkb` buffers) |
+| Leading dimension | `vkb.nc` | `vkbnc` member |
+| Data access | `vkb.c` or `vkb(i,j)` | `get_vkb_data<FPTYPE>()` |
+
+**Recommendation**: Before modifying GPU-related code:
+1. Check `use_gpu_` flag usage
+2. Verify memory allocation paths for both CPU and GPU
+3. Ensure lazy-allocation guards are preserved if they exist
+
+### 3. PR Branch Sync with Develop
+
+When merging develop into a PR branch:
+
+- **Check for conflicts carefully** - A "remove" in PR may conflict with an "add" in develop
+- **Review commit messages** - If develop commit claims to fix something, ensure the fix survives the merge
+- **Use `git diff` to verify**:
+
+  ```bash
+  # Check if a specific develop commit's changes are present
+  git diff <develop_commit>..HEAD -- <file>
+  ```
+
+### 4. Commit Message Accuracy
+
+**Problem**: `8a07ea65a` commit message was misleading:
+- Claimed: "always allocate vkb ComplexMatrix"
+- Reality: Removed guard, but allocation condition still existed
+
+**Recommendation**: Commit message should accurately describe:
+- What is being removed AND why
+- Impact on different code paths (CPU vs GPU)
+- Any assumptions being changed
+
+### 5. Testing Coverage
+
+**Gap**: The PR branch's testing didn cover GPU Velocity path (`getgradq_vnl`) because:
+- SDFT conductivity tests require specific INPUT settings
+- CUDA build has single-process limitation
+
+**Recommendation**:
+- Ensure GPU-specific code paths are tested
+- Document which tests cover GPU paths in CI
+
+## Recommendations for Future PRs
+
+### 1. Before Removing Code Added by develop
+
+```bash
+# Find who added the code
+git log -p --reverse -- <file> | grep -B10 "code pattern"
+
+# Check if later commits modified it
+git log --oneline <adding_commit>..HEAD -- <file>
+```
+
+### 2. GPU Path Verification Checklist
+
+When modifying `source_pw/module_pwdft/vnl_pw*` files:
+
+- [ ] Check `use_gpu_` flag conditions
+- [ ] Verify `vkb` allocation/deallocation consistency
+- [ ] Ensure `vkbnc` is used for leading dimension on GPU
+- [ ] Check `get_vkb_data()` usage instead of `vkb.c`
+- [ ] Test SDFT conductivity path (Velocity operator)
+
+### 3. Merge Strategy
+
+When merging develop into feature branch:
+
+```bash
+# After merge, check key files
+git diff <pre_merge_commit>..<post_merge_commit> -- \
+    source/source_pw/module_pwdft/vnl_pw.cpp \
+    source/source_pw/module_pwdft/vnl_pw_grad.cpp \
+    source/source_pw/module_pwdft/op_pw_nl.cpp \
+    source/source_pw/module_pwdft/hamilt_pw.cpp
+
+# Verify GPU-specific guards exist
+grep -n "GPU path\|use_gpu_\|vkbnc\|lazy" <file>
+```
+
+## Related Commits
+
+- `4b847b642` - [Fix] Eliminate unnecessary vkb CPU allocation on GPU path (#7296)
+- `8a07ea65a` - Refactor: vnl_pw simplification, remove TD_MovingGauge, minor module updates
+- `c56b8bb70` - Fix(SDFT-GPU): restore lazy-allocation guard for vkb in getgradq_vnl (this fix)
+
+## Files Involved
+
+| File | Purpose |
+|------|---------|
+| `vnl_pw.cpp` | `init()` - vkb allocation control |
+| `vnl_pw_grad.cpp` | `getgradq_vnl()` - velocity gradient calculation |
+| `vnl_pw.h` | `vkbnc` member definition |
+| `op_pw_vel.cpp` | Velocity operator - calls `getgradq_vnl()` |
+| `op_pw_nl.cpp` | Nonlocal operator - uses `vkbnc` for leading dimension |
+| `hamilt_pw.cpp` | Hamiltonian - uses `vkbnc` for leading dimension |
+
+---
+
+## Issue 2: Force/Stress Formula Order Error
+
+### Problem
+
+Tests `035_PW_15_SO` and `099_PW_DJ_SO` failed with force/stress deviations:
+```
+Warning: totalforceref cal=17.146449 ref=17.965510 deviation=0.819061
+Warning: totalstressref cal=104846.48 ref=100582.60 deviation=4263.88
+```
+
+### Root Cause
+
+Commit `1f49ed516` introduced wrong formula order when attempting to "restore npol parameters":
+
+**Wrong formula** (introduced by commit `1f49ed516`):
+```cpp
+local_force[ipol] -= fac * (ps0 * dbb0 + ps1 * dbb2 + ps2 * dbb1 + ps3 * dbb3).real();
+```
+
+**Correct formula** (from develop branch):
+```cpp
+local_force[ipol] -= fac * (ps0 * dbb0 + ps1 * dbb1 + ps2 * dbb2 + ps3 * dbb3).real();
+```
+
+The commit message claimed: "Fix formula order: dbb1+dbb2 instead of dbb2+dbb1" but actually **changed the correct formula to wrong**!
+
+### Affected Locations
+
+| File | Line | Purpose |
+|------|------|---------|
+| `force_op.cpp` | 257 | Nonlocal SOC force (deeq_nc) |
+| `force_op.cpp` | 355 | DFT+U/DeltaSpin force (vu) |
+| `stress_op.cpp` | 235 | Nonlocal SOC stress |
+| `stress_op.cpp` | 321 | DFT+U/DeltaSpin stress |
+| `cuda/force_op.cu` | 273, 378 | GPU versions |
+| `cuda/stress_op.cu` | 364, 980 | GPU versions |
+| `rocm/force_op.hip.cu` | 259, 364 | ROCm versions |
+| `rocm/stress_op.hip.cu` | 325, 969 | ROCm versions |
+
+### Timeline
+
+| Commit | Description |
+|--------|-------------|
+| `eaad9504a` | Fixed formula to ps1*dbb1 + ps2*dbb2 (correct) |
+| `2d90a6eec` | Synced CPU kernels with develop (formula still correct) |
+| `1f49ed516` | **Changed formula to wrong order** (ps1*dbb2 + ps2*dbb1) |
+| `d0193ebbb` | Fixed formula back to correct order |
+
+### Lesson Learned
+
+When a commit message says "fix formula", verify the actual code change matches the description:
+- `1f49ed516` claimed to "fix formula order" but actually **introduced** the bug
+- The correct formula was already present in previous commits
+
+### Important Note
+
+Test `099_PW_DJ_SO` is in `CASES_CPU.txt` only, not `CASES_GPU.txt`. 
+GPU version results may differ from CPU reference values due to numerical precision.
diff --git a/docs/advanced/input_files/input-main.md b/docs/advanced/input_files/input-main.md
@@ -394,6 +394,8 @@
     - [sccut](#sccut)
     - [sc\_drop\_thr](#sc_drop_thr)
     - [sc\_scf\_thr](#sc_scf_thr)
+    - [sc\_direction\_only](#sc_direction_only)
+    - [sc\_lambda\_strategy](#sc_lambda_strategy)
   - [vdW correction](#vdw-correction)
     - [vdw\_method](#vdw_method)
     - [vdw\_s6](#vdw_s6)
@@ -3481,8 +3483,8 @@
 
 - **Type**: Integer
 - **Description**: Determines whether to calculate the plus U correction, which is especially important for correlated electrons.
-  - 1: Calculate plus U correction with radius-adjustable localized projections (with parameter onsite_radius).
-  - 2: Calculate plus U correction using first zeta of NAOs as projections (this is old method for testing).
+  - 1: Calculate plus U correction with radius-adjustable localized projections (with parameter onsite_radius). Supported for both PW and LCAO basis sets.
+  - 2: Calculate plus U correction using first zeta of NAOs as projections (this is old method for testing). Only available for LCAO basis.
   - 0: Do not calculate plus U correction.
 - **Default**: 0
 
@@ -3629,6 +3631,24 @@
 - **Description**: Density error threshold for inner loop of spin-constrained SCF
 - **Default**: 1.0e-4
 
+### sc_direction_only
+
+- **Type**: Boolean
+- **Availability**: *sc_mag_switch is true*
+- **Description**: When true, only the direction of the magnetic moment is constrained to the target direction, while the magnitude is allowed to vary freely. This is useful for studying magnetic anisotropy or when the magnitude of the moment is determined by the electronic structure rather than an external constraint. When false (default), both the direction and magnitude of the magnetic moment are constrained to the target values.
+- **Default**: False
+
+### sc_lambda_strategy
+
+- **Type**: String
+- **Availability**: *sc_mag_switch is true*
+- **Description**: Lambda update strategy for spin-constrained DFT. Available options are:
+  - bfgs: BFGS quasi-Newton method (default, robust and well-tested)
+  - linear_response: linear response method (Scheme B)
+  - augmented_lagrangian: augmented Lagrangian method (Scheme C)
+  - hybrid_delayed: hybrid delayed update (Scheme D)
+- **Default**: bfgs
+
 [back to top](#full-list-of-input-keywords)
 
 ## vdW correction

diff --git a/docs/advanced/scf/construct_H.md b/docs/advanced/scf/construct_H.md
@@ -77,6 +77,6 @@ Here, we use a simple [example calculation](https://github.com/deepmodeling/abac
 
 Conventional functionals, e.g., L(S)DA and GGAs, encounter failures in strongly correlated systems, usually characterized by partially filled *d*/*f* shells. These include transition metals (TM) and their oxides, rare-earth compounds, and actinides, to name a few, where L(S)DA/GGAs typically yield quantitatively or even qualitatively wrong results. To address this failure, an efficient and successful method named DFT+*U*, which inherits the efficiency of L(S)DA/GGA but gains the strength of the Hubbard model in describing the physics of strongly correlatedsystems, has been developed.
 
-Now the DFT+*U* method is accessible in ABACUS. The details of the DFT+*U* method could be found in this [paper](https://doi.org/10.1063/5.0090122). It should be noted that the DFT+*U* works only within the NAO scheme, which means that the value of the keyword `basis_type` must be lcao when DFT+*U* is called. To turn on DFT+*U*, users need to set the value of the `dft_plus_u` keyword in the `INPUT` file to be 1. All relevant parmeters used in DFT+*U* calculations are listed in the [DFT+*U* correction](../input_files/input-main.md#dftu-correction) part of the [list of keywords](../input_files/input-main.md).
+Now the DFT+*U* method is accessible in ABACUS. The details of the DFT+*U* method could be found in this [paper](https://doi.org/10.1063/5.0090122). DFT+*U* is supported for both LCAO (`basis_type = lcao`) and plane-wave (`basis_type = pw`) basis sets. For the PW basis, `dft_plus_u = 1` (radius-adjustable localized projections) is supported with `nspin = 1`, `2`, or `4`. For the LCAO basis, both `dft_plus_u = 1` and `dft_plus_u = 2` are available. To turn on DFT+*U*, users need to set the value of the `dft_plus_u` keyword in the `INPUT` file to be 1. All relevant parameters used in DFT+*U* calculations are listed in the [DFT+*U* correction](../input_files/input-main.md#dftu-correction) part of the [list of keywords](../input_files/input-main.md).
 
 Examples of DFT+*U* calculations are provided in this [directory](https://github.com/deepmodeling/abacus-develop/tree/develop/examples/dft_plus_u).