Unstable LCAO calculation of 004_Li128C75H100O75 #7082
Replies: 17 comments
-
Testing at 20230907:
Detailed results can be found at: https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.23-32a3fd
As we can see, only runs parallelized with MPI give stable results, and this does not depend on the machine type.
-
Testing at 9-11: The 10 ELPA runs are stable, while the ScaLAPACK runs are unstable.
-
9-13: Checked with Bohrium: the 10 ELPA jobs used the same machine type, ecs.u1-c1m4.8xlarge (Ali), but we are not sure the CPUs were exactly the same. I also ran 10 jobs on one machine, and both ELPA and ScaLAPACK each gave a single type of result. In collaboration with Bohrium, we ran 10 jobs on 10 CPUs of type "Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz", and again both ELPA and ScaLAPACK each gave a single type of result. Conclusion:
-
9-13 testing with the GNU compiler:
-
@jinzx10 @LiuXiaohui123321 The testing on 004_Li128C75H100O75 is updated here. If you have other testing results for this example, please also post them here.
-
The changes I made to the ScaLAPACK diagonalization interface (pdsygvx_ & pzhegvx_ in module_hsolver/diago_blas.cpp) involve orfac and a few workspace-related parameters:

diff --git a/source/module_hsolver/diago_blas.cpp b/source/module_hsolver/diago_blas.cpp
index 489be9f09..8c725857c 100644
--- a/source/module_hsolver/diago_blas.cpp
+++ b/source/module_hsolver/diago_blas.cpp
@@ -62,7 +62,7 @@ std::pair<int, std::vector<int>> DiagoBlas::pdsygvx_once(const int *const desc,
const int itype = 1, il = 1, iu = GlobalV::NBANDS, one = 1;
int M = 0, NZ = 0, lwork = -1, liwork = -1, info = 0;
double vl = 0, vu = 0;
- const double abstol = 0, orfac = -1;
+ const double abstol = 0, orfac = 0.01;
std::vector<double> work(3, 0);
std::vector<int> iwork(1, 0);
std::vector<int> ifail(GlobalV::NLOCAL, 0);
@@ -109,6 +109,10 @@ std::pair<int, std::vector<int>> DiagoBlas::pdsygvx_once(const int *const desc,
+ ModuleBase::GlobalFunc::TO_STRING(__LINE__));
// GlobalV::ofs_running<<"lwork="<<work[0]<<"\t"<<"liwork="<<iwork[0]<<std::endl;
+
+ work[0] *= 10;
+ iwork[0] *= 10;
+
lwork = work[0];
work.resize(std::max(lwork,3), 0);
liwork = iwork[0];
@@ -184,7 +188,7 @@ std::pair<int, std::vector<int>> DiagoBlas::pzhegvx_once(const int *const desc,
const char jobz = 'V', range = 'I', uplo = 'U';
const int itype = 1, il = 1, iu = GlobalV::NBANDS, one = 1;
int M = 0, NZ = 0, lwork = -1, lrwork = -1, liwork = -1, info = 0;
- const double abstol = 0, orfac = -1;
+ const double abstol = 0, orfac = 0.01;
//Note: pzhegvx_ has a bug
// We must give vl,vu a value, although we do not use range 'V'
// We must give rwork at least a memory of sizeof(double) * 3
@@ -238,6 +242,12 @@ std::pair<int, std::vector<int>> DiagoBlas::pzhegvx_once(const int *const desc,
+ ModuleBase::GlobalFunc::TO_STRING(__LINE__));
// GlobalV::ofs_running<<"lwork="<<work[0]<<"\t"<<"lrwork="<<rwork[0]<<"\t"<<"liwork="<<iwork[0]<<std::endl;
+
+ work[0] *= 10.0;
+ iwork[0] *= 10;
+ rwork[0] *= 10;
+
+
lwork = work[0].real();
work.resize(lwork, 0);
lrwork = rwork[0] + this->degeneracy_max * GlobalV::NLOCAL;
@@ -402,4 +412,4 @@ void DiagoBlas::post_processing(const int info, const std::vector<int> &vec)
}
}

According to the source file (https://netlib.org/scalapack/explore-html/d7/dff/pzhegvx_8f_source.html), eigenvector orthogonality can be an issue when many eigenvectors have close eigenvalues. pzhegvx does provide a way to guarantee orthogonality, but it is tricky and depends on a few parameters. orfac is the threshold used to decide which eigenvalues are close enough that their eigenvectors need reorthogonalization. The default is 1e-3, which I changed to 1e-2 in the test above with the modified ScaLAPACK call. The workspace arrays must also grow accordingly; in the test I simply scaled them by a factor of 10, which is surely not optimal and could be refined.
-
Was the Intel build of ABACUS compiled by
-
As a side note, I notice that ABACUS always solves the eigenvalue equation in the basis of all orbitals. I know that many quantum chemistry packages using Gaussian basis sets perform an extra canonical orthogonalization to "project out" basis orbitals that are almost linearly dependent. The resulting eigenvalue equations are usually more stable. Some explanation can be found in the Q-Chem manual (https://manual.q-chem.com/latest/sec_Basis_Customization.html) or Szabo & Ostlund's book (Sec. 3.4.5). I'm not sure whether numerical atomic orbitals should use a similar strategy (and it would inevitably complicate the code for MPI parallelization, where matrices are stored in a block-cyclic format).
-
I think it depends on some environment variables like CXX or I_MPI_CXX. The environment set up by the current Dockerfile.intel would use icpx.
-
In my recollection, ABACUS LCAO solves the eigenvalue equation by directly doing
-
This is related to the input parameters and the device.
-
Hi all,
-
@pxlxingliang, is this case still unstable now? Can we close this issue?
-
I used the latest Intel/GNU images with ks_solver genelpa and scalapack_gvx to run this example 10 times.
-
I have tested the ScaLAPACK method on @jinzx10's commit with Intel- and GNU-compiled ABACUS. The results of 10 runs with Intel are stable, while the GNU results are unstable. gnu:
-
@WHUweiqingzhou Is this problem completely solved?
-
I will transfer this to a discussion. Hopefully we will come up with a better solution in the near future.
-
Describe the bug
The LCAO calculation of daily test 004_Li128C75H100O75 is unstable.
004_Li128C75H100O75.zip
At the current version (20230921, develop branch), the Intel-compiled calculation is stable, but the GNU-compiled one is unstable.
Details of some testing will be updated below.