
Do Cholesky Decomposition With the GPU #529

Closed

SteveBronder wants to merge 54 commits into develop from gpu_cholesky

Conversation

@SteveBronder (Collaborator) commented Apr 22, 2017

Submission Checklist

  • Run unit tests: ./runTests.py test/unit with g++
  • Run unit tests: ./runTests.py test/unit with clang++
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary:

Offloads Cholesky decompositions to the GPU using ViennaCL

Intended Effect:

Gives users the ability to offload Cholesky decomposition computations to the GPU

How to Verify:

TODO: Make and run tests

Side Effects:

TODO

Documentation:

TODO

Reviewer Suggestions:

Copyright and Licensing

Steve Bronder

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Yes

@SteveBronder (Collaborator, Author)

Q: Should issues I have with this go under a separate issue or be listed here?

Attempting to run the tests on the current branch gives the following errors:

$ ./runTests.py test/unit
------------------------------------------------------------
make -j1 test/unit/math_include_test test/unit/multiple_translation_units_test
clang++ -shared -fPIC -o test/unit/libmultiple.so test/unit/multiple_translation_units1.o test/unit/multiple_translation_units2.o
/usr/bin/ld: test/unit/multiple_translation_units1.o: relocation R_X86_64_32S against `_ZTVN4stan4math4variE' can not be used when making a shared object; recompile with -fPIC
test/unit/multiple_translation_units1.o: error adding symbols: Bad value
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: Warning: Archive 'test/libgtest.a' seems to have been created in deterministic mode. 'test/gtest.o' will always be updated. Please consider passing the U flag to ar to avoid the problem.
ar rv test/libgtest.a test/gtest.o
r - test/gtest.o
make: Warning: Archive 'test/libgtest.a' seems to have been created in deterministic mode. 'test/gtest.o' will always be updated. Please consider passing the U flag to ar to avoid the problem.
clang++ -I . -isystem lib/eigen_3.2.9 -isystem lib/boost_1.62.0 -isystem lib/viennacl_1.7.1 -isystemlib/cvodes_2.9.0/include -Wall -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DNO_FPRINTF_OUTPUT -pipe -lOpenCL -Wno-unused-function -Wno-uninitialized -Wno-c++11-long-long   -c -O3 -DGTEST_USE_OWN_TR1_TUPLE  -DGTEST_HAS_PTHREAD=0 -Wno-c++11-long-long -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 test/unit/math_include_test.cpp -o test/unit/math_include_test.o
clang: warning: -lOpenCL: 'linker' input unused
clang++ -I . -isystem lib/eigen_3.2.9 -isystem lib/boost_1.62.0 -isystem lib/viennacl_1.7.1 -isystemlib/cvodes_2.9.0/include -Wall -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DNO_FPRINTF_OUTPUT -pipe -lOpenCL -Wno-unused-function -Wno-uninitialized -Wno-c++11-long-long    -O3 lib/gtest_1.7.0/src/gtest_main.cc test/unit/math_include_test.o -DGTEST_USE_OWN_TR1_TUPLE  -DGTEST_HAS_PTHREAD=0 -Wno-c++11-long-long -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -o test/unit/math_include_test test/libgtest.a  lib/cvodes_2.9.0/lib/libsundials_nvecserial.a lib/cvodes_2.9.0/lib/libsundials_cvodes.a
make: *** No rule to make target 'test/unit/libmultiple.so', needed by 'test/unit/multiple_translation_units_test.cpp'.  Stop.
rm test/unit/math_include_test.o
make -j1 test/unit/math_include_test test/unit/multiple_translation_units_test failed
exit now (04/22/17 17:56:19 EDT)

Running clang++ -v returns:

$ clang++ -v
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/6.0.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/5.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/6.0.0
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda

I don't have much experience with clang; I mostly use gcc or nvcc. What causes this part of the clang error?

clang: error: linker command failed with exit code 1 (use -v to see invocation)

@bob-carpenter (Member) commented Apr 22, 2017 via email

@SteveBronder (Collaborator, Author)

The part about 'test/libgtest.a' may be related to #528

@bbbales2 (Member)

For the -fPIC thing, add "-fPIC" to the end of the CFLAGS line in math/makefile.

I think it's unresolved (http://discourse.mc-stan.org/t/stan-math-ubuntu-clang-needs-fpic/339/5). You'll need to do a make clean and then build again.
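For anyone else hitting this, the change is tiny (a sketch; the exact variable name and its location in math/makefile may differ between versions):

```make
# Hypothetical excerpt of math/makefile: make objects position-independent
# so they can be linked into the shared test library. Rebuild from clean
# afterwards so every existing object gets recompiled with the flag.
CFLAGS += -fPIC
```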

I'm not sure what the test/libgtest.a warnings are. I see them on Ubuntu 16.04, but since nothing blows up I haven't wanted to investigate, haha.

@syclik (Member) commented Apr 24, 2017

Does the test pass on the develop branch? No need to test everything. Just run:
./runTests.py test/unit/multiple_translation_units_test.cpp

We may have an incompatibility with that test and gcc 5 that we didn't know about until recently.

@SteveBronder (Collaborator, Author)

Compilation Notes:

For the -fPIC thing, add "-fPIC" to the end of the CFLAGS line in math/makefile.

Yes, this does the trick!

Does the test pass on the develop branch? No need to test everything. Just run:
./runTests.py test/unit/multiple_translation_units_test.cpp

Running ./runTests.py test/unit/multiple_translation_units_test.cpp yields

Running main() from gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from multiple_translation_units
[ RUN      ] multiple_translation_units.compile
[       OK ] multiple_translation_units.compile (0 ms)
[----------] 1 test from multiple_translation_units (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.

Running ./runTests.py test/unit compiles, though it is taking a while. The following warnings are output:

...
clang: warning: -lOpenCL: 'linker' input unused
...
make: Warning: Archive 'test/libgtest.a' seems to have been created in deterministic mode. 'test/gtest.o' will always be updated. Please consider passing the U flag to ar to avoid the problem.
...

The -lOpenCL warning comes from tests that do not use ViennaCL. Can we add the flag to CFLAGS specifically for the Cholesky test?
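One possible way to scope the flag, sketched with GNU make's target-specific variables (the target name below is illustrative, not an actual rule from the makefile):

```make
# Hypothetical sketch: attach the OpenCL linker flag only to the Cholesky
# test binary instead of adding it to the global CFLAGS, so unrelated
# tests stop emitting the "linker input unused" warning.
test/unit/math/rev/mat/fun/cholesky_decompose_test: LDLIBS += -lOpenCL
```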

The second warning, as noted above, comes from make's archive handling on Linux and does not seem to be a huge issue.


Implementation Notes:

On line 351 of cholesky_decompose.hpp there is an if statement that checks whether L_A has more than 500 rows and, if so, moves the computation over to the GPU with ViennaCL. What is a better way for the user to specify that they want to use the GPU? A flag when the user compiles a program in CmdStan?

I think what we want is, if the user specifies they want to do computation on the GPU:

  1. Add -DVIENNACL_WITH_OPENCL -lOpenCL to CFLAGS.
  2. Include something so the if statement in cholesky_decompose sends the work over to ViennaCL.

@SteveBronder self-assigned this Apr 24, 2017
@SteveBronder (Collaborator, Author)

Update: Running ./runTests.py test/unit, the cholesky tests yield

------------------------------------------------------------
test/unit/math/rev/mat/fun/cholesky_decompose_test --gtest_output="xml:test/unit/math/rev/mat/fun/cholesky_decompose_test.xml"
Running main() from gtest_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from AgradRevMatrix
[ RUN      ] AgradRevMatrix.mat_cholesky
[       OK ] AgradRevMatrix.mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.exception_mat_cholesky
[       OK ] AgradRevMatrix.exception_mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_small
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_small (176 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_small
[       OK ] AgradRevMatrix.check_varis_on_stack_small (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 980.43059810641955, which exceeds prec, where
grad_fd(i) evaluates to -1.8259633576235501,
grad_ad(i) evaluates to -982.2565614640431, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 0.014410414688841633, which exceeds prec, where
grad_fd(i) evaluates to 0.55275936344211618,
grad_ad(i) evaluates to 0.53834894875327455, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 486.57150578139544, which exceeds prec, where
grad_fd(i) evaluates to -494.24178632483142,
grad_ad(i) evaluates to -980.81329210622687, and
prec evaluates to 1e-08.
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients (3558 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_large
[       OK ] AgradRevMatrix.check_varis_on_stack_large (0 ms)
[----------] 6 tests from AgradRevMatrix (3734 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (3734 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients

 1 FAILED TEST
test/unit/math/rev/mat/fun/cholesky_decompose_test --gtest_output="xml:test/unit/math/rev/mat/fun/cholesky_decompose_test.xml" failed
exit now (04/24/17 23:34:55 EDT)

Which is good at this point because at least it compiles. I just started my new job so am going to be a little slow in writing this (hopefully this weekend). At this point I just need to go through the unit tests with a debugger and see what is causing the failure.

@rtrangucci would you have time to take a look at this? I think it is like 90% of the way there, but I'm doing something simple that is giving it the wrong output.

@SteveBronder (Collaborator, Author)

Some bad news.

Here is a link to a dropbox folder that contains the Cholesky test that I originally ran. However, in this version we do not assume that the matrix is diagonally dominant. Below are the results of the output.
(Spoiler: it fails)

./eigen_vienna_LU_compare.o

----------------------------------------------
----------------------------------------------
## Benchmark ::Eigen Vs. ViennaCL Performance 
----------------------------------------------


 -- Times for 16x16 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.00027
----------------------------------------------
Time for LLT (Eigen): 5.5e-05
----------------------------------------------
Time for LU (ViennaCL): 0.000161
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 100x100 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.004182
----------------------------------------------
Time for LLT (Eigen): 0.000451
----------------------------------------------
Time for LU (ViennaCL): 0.005883
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 400x400 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.173094
----------------------------------------------
Time for LLT (Eigen): 0.00629
----------------------------------------------
Time for LU (ViennaCL): 0.037217
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 900x900 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 1.85584
----------------------------------------------
Time for LLT (Eigen): 0.033258
----------------------------------------------
Time for LU (ViennaCL): 0.182988
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 1600x1600 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 9.94796
----------------------------------------------
Time for LLT (Eigen): 0.120078
----------------------------------------------
Time for LU (ViennaCL): 0.584404
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 3600x3600 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 110.272
----------------------------------------------
Time for LLT (Eigen): 0.55694
----------------------------------------------
Time for LU (ViennaCL): 3.04208
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

Also, oddly, the ViennaCL LU is much slower than Eigen's LLT even at the 3600x3600 size.

Why would the diagonally dominant assumption have such a large effect? How fair is it to assume that the matrix coming into the Cholesky will be diagonally dominant?

@bob-carpenter (Member) commented May 1, 2017 via email

@SteveBronder (Collaborator, Author)

What does diagonal dominant mean?

Diagonally dominant means that, in each column, the magnitude of the diagonal entry dominates the off-diagonal entries (strictly: it is at least the sum of their magnitudes)

Is that precision the precision of the matching?

Yes, I need to test how far off we are at lower precision. I also want to look at the Slovenian team's OpenCL code. If their Cholesky works with better precision on non-diagonally-dominant matrices, then it should not be too hard to integrate.

Sorry if this sounds stupid, but are you sure that's just a difference of input matrices and that nothing else changed?

I think so, but I will double check.

Also, what's with LU vs. LLT?

To do a Cholesky in ViennaCL we have to take the L of the LU and divide each column by the diagonal element to get the L of the LLT decomposition. We do this for both the Eigen LU and the ViennaCL LU to make sure it works with both. My linear algebra is rusty, but maybe if the matrix is not diagonally dominant we don't need to do the division? I'll play around with this, though any suggestions would be appreciated.

@bob-carpenter (Member)

Thanks. I didn't see a comment from @betanalpha, but we talked about this today and he thought the problem would arise when LU did a lot of pivoting. He thought LLT would be a lot faster.

@rtrangucci mentioned that our use case for these will often have non-diagonally-dominant matrices. But if it's really just that the diagonal is the largest, then that'll certainly be satisfied by correlation matrices, which have 1 on the diagonal and values strictly between -1 and 1 elsewhere.

@bbbales2 (Member) commented May 2, 2017

I ran across this the other day: https://en.wikipedia.org/wiki/Cholesky_decomposition#Stability_of_the_computation (which I assume is what @betanalpha was saying). Apparently Cholesky doesn't require pivoting to be numerically stable, which is why it's nice if you've got a positive-definite matrix you want to factor.

@SteveBronder (Collaborator, Author) commented May 2, 2017

I think @betanalpha and @bbbales2 are on the money. We are using the LU in the GPU code to get to the LLT, so pivots matter. Testing the Slovenian team's code with the non-diagonally-dominant matrices also fails (even at 2.22045e-10 precision).

So I'll do a deep dive to see if other external libraries contain a stable Cholesky for the GPU. Perhaps MAGMA.

EDIT: Maybe we can use the ViennaCL preconditioners? Not totally sure if that's what we want, but maybe it will help

New job is keeping me busy, but I can poke around this week and weekend

@rok-cesnovar (Member)

I somehow missed this topic on GitHub, so I am a bit late to the party, sorry.

I looked into this, and I am not really sure about the input matrix you are using.
It seems that it is not positive definite. I wrote a few of the matrices that the program creates to a file, fed them back into R, and tried chol(). All of them report the error "the leading minor of order X is not positive definite" (X varies).

If the input matrix is not positive definite, then the Cholesky should output NaN, hence the precision error.

@SteveBronder (Collaborator, Author)

@rok-cesnovar you are correct! Apologies, I did the new tests after work one day and my brain must have been shut off. Fixing that allows both GPU versions to be equal to the Eigen version at a precision of 2.22045e-12, though they do not match at tighter precision.

FTR: here is how I am computing the test matrix now

  MatrixXd A = MatrixXd::Random(m,m);
  // Make matrices symmetric positive definite
  for (int i = 0; i < m; i++){
    for (int j = i + 1; j < m; j++){
        A(i,j) = A(j,i);
    }
  }
  A = A * A.transpose();

Here is a dropbox containing the new tests. Below are the results on my machine

## Benchmark ::Eigen Vs. ViennaCL Performance 
----------------------------------------------


 -- Times for 16x16 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 9.9e-05
----------------------------------------------
Time for LU (ViennaCL): 0.000306
----------------------------------------------
Time for BayesCL 0.001507
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 100x100 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.003874
----------------------------------------------
Time for LU (ViennaCL): 0.017542
----------------------------------------------
Time for BayesCL 0.0042
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 400x400 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.150595
----------------------------------------------
Time for LU (ViennaCL): 0.064673
----------------------------------------------
Time for BayesCL 0.020384
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 900x900 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 1.41396
----------------------------------------------
Time for LU (ViennaCL): 0.283503
----------------------------------------------
Time for BayesCL 0.069564
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 1600x1600 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 8.09797
----------------------------------------------
Time for LU (ViennaCL): 0.661283
----------------------------------------------
Time for BayesCL 0.171118
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 2500x2500 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 28.8914
----------------------------------------------
Time for LU (ViennaCL): 2.17089
----------------------------------------------
Time for BayesCL 0.482391
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

And here is a table showing the speedup relative to Eigen for the larger matrices:

| Size | Eigen | ViennaCL | BayesCL |
| ---- | ----- | -------- | ------- |
| 400  | 1     | 2.5      | 7.5     |
| 900  | 1     | 5        | 23      |
| 1600 | 1     | 12       | 47      |
| 2500 | 1     | 13       | 60      |

So BayesCL gives the best results! Though with the recent update to Eigen 3.3, I should probably rewrite the tests to use the in-place LLT.

So now the question is, supposing these tests are correct, why are the results not matching up in the unit tests?

With the obvious speedups of BayesCL, I do think it would be worth using it. Since it's rather easy to use custom kernels in ViennaCL, I think we should first get the plain ViennaCL version working and then swap in the BayesCL kernels.

I also want to point out that the ViennaCL team would most likely be very interested in having this LLT decomposition in their library. Perhaps we can start talking with them about implementing it there? @rok-cesnovar, would you want to reach out to them?

@syclik (Member) commented May 6, 2017 via email

@rtrangucci (Contributor) commented May 6, 2017 via email

@bgoodri (Contributor) commented May 6, 2017 via email

@SteveBronder (Collaborator, Author) commented May 6, 2017

I apologize for the long post. I had time today to go through and really think about everything.

TL;DR: All the tests for stan are passing. I rewrote the benchmark so that it uses the create_mat() function to make the matrix A. Both the GPU results match to 2.22045e-14 precision.

Do you have a handle on the math?

... ehhh, sort of. I may need a little hand-holding from time to time. It's been a while since I have done any serious linear algebra. For instance, I was doing A = A * A' to make the test matrix because I googled "how do I make a positive definite matrix" and that was the first result (which, looking further, I see only makes a positive semi-definite matrix).
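A one-line justification of that correction, as an aside in LaTeX (not from the thread):

```latex
% For every x: x^T (A A^T) x = \|A^T x\|^2 >= 0, so A A^T is always
% positive semi-definite; it is positive definite iff A^T x = 0 forces
% x = 0, i.e. iff A has full row rank. A random square A is almost
% surely full rank, which is why these test matrices usually factor.
x^\top (A A^\top)\, x = \lVert A^\top x \rVert^2 \ge 0
```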

Benchmark Results

I remade the benchmark so that it uses the create_mat() function from Stan's Cholesky tests to create the matrix A. Since that function is used to create a cov_mat object, I assume the result is positive definite. Below are the run times and a dropbox containing the code.

----------------------------------------------
## Benchmark ::Eigen Vs. ViennaCL Performance 
----------------------------------------------


 -- Times for 16x16 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.000119
----------------------------------------------
Time for LU (ViennaCL): 0.000208
----------------------------------------------
Time for BayesCL 0.000987
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 100x100 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.005677
----------------------------------------------
Time for LU (ViennaCL): 0.011431
----------------------------------------------
Time for BayesCL 0.004338
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 400x400 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.230822
----------------------------------------------
Time for LU (ViennaCL): 0.050966
----------------------------------------------
Time for BayesCL 0.029414
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 900x900 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 2.33466
----------------------------------------------
Time for LU (ViennaCL): 0.239549
----------------------------------------------
Time for BayesCL 0.098063
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 1600x1600 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 10.3096
----------------------------------------------
Time for LU (ViennaCL): 0.625864
----------------------------------------------
Time for BayesCL 0.334106
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 2500x2500 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 48.7875
----------------------------------------------
Time for LU (ViennaCL): 1.84669
----------------------------------------------
Time for BayesCL 0.817809
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

Below are the speedups relative to Eigen. The custom BayesCL kernel is roughly twice as fast as ViennaCL.

| Size | Eigen | ViennaCL | BayesCL |
| ---- | ----- | -------- | ------- |
| 400  | 1     | 4.6      | 11.5    |
| 900  | 1     | 10.1     | 25.8    |
| 1600 | 1     | 16.6     | 31.2    |
| 2500 | 1     | 26.5     | 60.1    |

Stan Tests

@syclik with the most recent commit the tests do pass!

$ ./runTests.py  'test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp' --debug
Running main() from gtest_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from AgradRevMatrix
[ RUN      ] AgradRevMatrix.mat_cholesky
[       OK ] AgradRevMatrix.mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.exception_mat_cholesky
[       OK ] AgradRevMatrix.exception_mat_cholesky (1 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_small
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_small (165 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_small
[       OK ] AgradRevMatrix.check_varis_on_stack_small (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients (2621 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_large
[       OK ] AgradRevMatrix.check_varis_on_stack_large (0 ms)
[----------] 6 tests from AgradRevMatrix (2787 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (2787 ms total)
[  PASSED  ] 6 tests.

With the current development version we are really only testing AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients, since the check to use the GPU is simply whether the matrix has more than 500 rows. I'm using that for now until we figure out a nice way to let the user specify whether they want to use a GPU.

One thing to note: as you can see here in cholesky_decompose.hpp, we only move to the GPU for the computation of L_A. It would be preferable and much faster if we replaced cholesky_block() with cholesky_gpu() for the solver portion as well. I should have something for this by tomorrow. If anyone else wants to help, at this point it's about converting the blocked Cholesky's chain() function so that it uses ViennaCL (and is not blocked, of course). This mostly involves finding the translation from Eigen to ViennaCL.

Some other things

  1. Should I come to the Stan meeting this Thursday? I would have to join via Google Hangouts since I will be at work.
  2. What is the best way to make this visible to users?
  3. At what point should we move forward with this? If the majority of the bottleneck is the computation of L_A, then we have that covered and should look at making this user-visible. Otherwise we can wait and also fix cholesky_gpu().

@SteveBronder (Collaborator, Author)

@bgoodri would it be better to use create_mat() or

MatrixXd L = stan::math::lkj_corr_cholesky_rng(1.0);
MatrixXd A = stan::math::multiply_lower_tri_self_transpose(L);

And perhaps we could also do some tests with a diagonal matrix of very different standard deviations that pre-multiplies lkj_corr_cholesky_rng(1.0).

Tomorrow I will rewrite the benchmark so that it runs a few different matrices and collects the times and whether they all matched.

@SteveBronder (Collaborator, Author) commented May 6, 2017

@rtrangucci just saw your post; I'll go look at your comment now!

@SteveBronder (Collaborator, Author) commented May 6, 2017

@rtrangucci for some reason I can see part of your comment in the email GitHub sent me, but it won't take me to the full comment. Could you post the comment here?

but how would I go about testing this on my own? Is it pretty easy to run ViennaCL on a laptop GPU?

It's easy to run, but it can take a little time to set up. If you have an NVIDIA GPU, I used these instructions to get mine set up.

@SteveBronder (Collaborator, Author)

Another question: once we have the GPU version, how do we check it with Travis?

@rtrangucci (Contributor)

@Stevo15025 Here's the comment below:

I think the gradient code should be something like this, starting on line 306:

vcl_L = trans(vcl_L);
vcl_Lbar = prod(vcl_L, vcl_Lbar);
for (int i = 0; i < L.rows() - 1; ++i)
  for (int j = i + 1; j < L.rows(); ++j)
    vcl_L(j, i) = vcl_L(i, j);
viennacl::linalg::inplace_solve(vcl_L, vcl_Lbar, viennacl::linalg::upper_tag());
viennacl::linalg::inplace_solve(vcl_L, viennacl::linalg::trans(vcl_Lbar), viennacl::linalg::upper_tag());

@SteveBronder (Collaborator, Author)

@rtrangucci I updated the code with your code, though putting cholesky_gpu() back into things makes the tests fail again.

Running main() from gtest_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from AgradRevMatrix
[ RUN      ] AgradRevMatrix.mat_cholesky
[       OK ] AgradRevMatrix.mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.exception_mat_cholesky
[       OK ] AgradRevMatrix.exception_mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_small
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_small (163 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_small
[       OK ] AgradRevMatrix.check_varis_on_stack_small (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 14.279985728486359, which exceeds prec, where
grad_fd(i) evaluates to -1.8259633576235501,
grad_ad(i) evaluates to -16.105949086109909, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 1.563984294565459, which exceeds prec, where
grad_fd(i) evaluates to 0.55275936344211618,
grad_ad(i) evaluates to 2.1167436580075751, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 488.01477513919122, which exceeds prec, where
grad_fd(i) evaluates to -494.24178632483142,
grad_ad(i) evaluates to -982.25656146402264, and
prec evaluates to 1e-08.
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients (6691 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_large
[       OK ] AgradRevMatrix.check_varis_on_stack_large (0 ms)
[----------] 6 tests from AgradRevMatrix (6854 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (6854 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients

 1 FAILED TEST
test/unit/math/rev/mat/fun/cholesky_decompose_test --gtest_output="xml:test/unit/math/rev/mat/fun/cholesky_decompose_test.xml" failed
exit now (05/06/17 19:52:48 EDT)

You seem to be very close though. I'll look more at this tmrw. If you have any problems installing nvcc from the link above feel free to contact me tmrw as I will be around all day.

@rtrangucci
Contributor

rtrangucci commented May 7, 2017 via email

@SteveBronder
Collaborator Author

We need to divide diagonal of the result, vcl_Lbar, by 2. That should work. My bad!

No worries!

Added this to the latest commit here. I'm doing the division on the diagonal after we move back to Eigen.

The tests are still off, though:

The difference between grad_fd(i) and grad_ad(i) is 967.97657573553715, which exceeds prec, where
grad_fd(i) evaluates to -1.8259633576235501,
grad_ad(i) evaluates to 966.1506123779136, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 1.563984294565459, which exceeds prec, where
grad_fd(i) evaluates to 0.55275936344211618,
grad_ad(i) evaluates to 2.1167436580075751, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 3.1135055928201041, which exceeds prec, where
grad_fd(i) evaluates to -494.24178632483142,
grad_ad(i) evaluates to -491.12828073201132, and
prec evaluates to 1e-08.

I'm about to run out of the house, but tomorrow I will look over the paper

(for my own reference I'm posting the main bit)
[screenshot of the gradient algorithm from the paper]
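For reference, the level-3 gradient being discussed can be checked on the CPU with a small NumPy sketch. This assumes the usual formulation Abar = 0.5 * L^-T (P + P^T) L^-1 with P = Phi(L^T Lbar), where Phi keeps the lower triangle and halves the diagonal (which is where the divide-by-2 belongs); the function names here are mine, not Stan's.

```python
import numpy as np

def phi(X):
    # Lower triangle of X with the diagonal halved (the Phi operator).
    P = np.tril(X)
    P[np.diag_indices_from(P)] *= 0.5
    return P

def cholesky_grad(L, Lbar):
    # Reverse-mode gradient of A -> chol(A):
    #   Abar = 0.5 * L^-T (P + P^T) L^-1,  P = Phi(L^T Lbar)
    P = phi(L.T @ Lbar)
    S = P + P.T
    tmp = np.linalg.solve(L.T, S)             # L^-T * S
    return np.linalg.solve(L.T, tmp.T).T / 2  # ... * L^-1, then halve

# Finite-difference check on a random SPD matrix.
rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # well-conditioned SPD
W = np.tril(rng.standard_normal((n, n)))     # arbitrary cotangent for L
f = lambda B: np.sum(np.linalg.cholesky(B) * W)

Abar = cholesky_grad(np.linalg.cholesky(A), W)
E = rng.standard_normal((n, n))
E = E + E.T                                  # symmetric perturbation direction
eps = 1e-6
fd = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)
analytic = np.sum(Abar * E)                  # should match fd closely
```

A triangular-solve routine (e.g. scipy.linalg.solve_triangular) would be the idiomatic choice over np.linalg.solve here; the general solver is used only to keep the sketch dependency-free.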

@SteveBronder
Collaborator Author

SteveBronder commented May 7, 2017

Oh, reading the paper I see the division I am doing is in the wrong place

@SteveBronder
Collaborator Author

So all of the GPU tests are passing on my local. I've added the disable_if for the GPU version of cholesky_decomp.

Oddly, corr_matrix_transform_test fails, and looking at the comparison it does not seem like I changed anything between this branch and develop.

On a side note, should we think about having a separate PR for adding ViennaCL to develop? Then this pull will be more like 500 lines and not 100K.

@SteveBronder
Collaborator Author

Here's a link to the error. Seems to be an operator overload error

@SteveBronder
Collaborator Author

Nvm, corr_matrix_transform_test is passing now. It had something to do with the gtest.o file.

Going to run all the tests overnight and if everything passes we should be good to go!

@SteveBronder SteveBronder mentioned this pull request Sep 18, 2017
@bob-carpenter
Member

bob-carpenter commented Sep 18, 2017 via email

@SteveBronder
Collaborator Author

Closing this for #637

@SteveBronder SteveBronder deleted the gpu_cholesky branch July 31, 2019 10:07