
Do Cholesky Decomposition With the GPU #529

Closed

SteveBronder wants to merge 54 commits into develop from gpu_cholesky

Conversation

@SteveBronder (Collaborator) commented Apr 22, 2017

Submission Checklist

  • Run unit tests: ./runTests.py test/unit with g++
  • Run unit tests: ./runTests.py test/unit with clang++
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary:

Offloads Cholesky decompositions to the GPU using ViennaCL

Intended Effect:

Gives users the ability to offload Cholesky decomposition computations to the GPU

How to Verify:

TODO: Make and run tests

Side Effects:

TODO

Documentation:

TODO

Reviewer Suggestions:

Copyright and Licensing

Steve Bronder

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Yes

@SteveBronder (Collaborator, Author)

Q: Should issues I have with this go under a separate issue or be listed here?

Attempting to run the tests on the current branch gives the following errors:

$ ./runTests.py test/unit
------------------------------------------------------------
make -j1 test/unit/math_include_test test/unit/multiple_translation_units_test
clang++ -shared -fPIC -o test/unit/libmultiple.so test/unit/multiple_translation_units1.o test/unit/multiple_translation_units2.o
/usr/bin/ld: test/unit/multiple_translation_units1.o: relocation R_X86_64_32S against `_ZTVN4stan4math4variE' can not be used when making a shared object; recompile with -fPIC
test/unit/multiple_translation_units1.o: error adding symbols: Bad value
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: Warning: Archive 'test/libgtest.a' seems to have been created in deterministic mode. 'test/gtest.o' will always be updated. Please consider passing the U flag to ar to avoid the problem.
ar rv test/libgtest.a test/gtest.o
r - test/gtest.o
make: Warning: Archive 'test/libgtest.a' seems to have been created in deterministic mode. 'test/gtest.o' will always be updated. Please consider passing the U flag to ar to avoid the problem.
clang++ -I . -isystem lib/eigen_3.2.9 -isystem lib/boost_1.62.0 -isystem lib/viennacl_1.7.1 -isystemlib/cvodes_2.9.0/include -Wall -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DNO_FPRINTF_OUTPUT -pipe -lOpenCL -Wno-unused-function -Wno-uninitialized -Wno-c++11-long-long   -c -O3 -DGTEST_USE_OWN_TR1_TUPLE  -DGTEST_HAS_PTHREAD=0 -Wno-c++11-long-long -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 test/unit/math_include_test.cpp -o test/unit/math_include_test.o
clang: warning: -lOpenCL: 'linker' input unused
clang++ -I . -isystem lib/eigen_3.2.9 -isystem lib/boost_1.62.0 -isystem lib/viennacl_1.7.1 -isystemlib/cvodes_2.9.0/include -Wall -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -DNO_FPRINTF_OUTPUT -pipe -lOpenCL -Wno-unused-function -Wno-uninitialized -Wno-c++11-long-long    -O3 lib/gtest_1.7.0/src/gtest_main.cc test/unit/math_include_test.o -DGTEST_USE_OWN_TR1_TUPLE  -DGTEST_HAS_PTHREAD=0 -Wno-c++11-long-long -isystem lib/gtest_1.7.0/include -isystem lib/gtest_1.7.0 -o test/unit/math_include_test test/libgtest.a  lib/cvodes_2.9.0/lib/libsundials_nvecserial.a lib/cvodes_2.9.0/lib/libsundials_cvodes.a
make: *** No rule to make target 'test/unit/libmultiple.so', needed by 'test/unit/multiple_translation_units_test.cpp'.  Stop.
rm test/unit/math_include_test.o
make -j1 test/unit/math_include_test test/unit/multiple_translation_units_test failed
exit now (04/22/17 17:56:19 EDT)

Running clang++ -v returns:

$ clang++ -v
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/6.0.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/5.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/6.0.0
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda

I don't have much experience with clang; I mostly use gcc or nvcc. What causes this part of the clang error?

clang: error: linker command failed with exit code 1 (use -v to see invocation)

@bob-carpenter (Member) commented Apr 22, 2017 via email

@SteveBronder (Collaborator, Author)

The part about 'test/libgtest.a' may be related to #528

@bbbales2 (Member)

For the -fPIC thing, add "-fPIC" to the end of the CFLAGS line in math/makefile.

I think it's unresolved (http://discourse.mc-stan.org/t/stan-math-ubuntu-clang-needs-fpic/339/5). You'll need to do a make clean and then build again.
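For anyone else hitting this, the change is tiny (a sketch; the exact variable name and its location in math/makefile may differ between versions):

```make
# Hypothetical excerpt of math/makefile: make objects position-independent
# so they can be linked into the shared test library. Rebuild from clean
# afterwards so every existing object gets recompiled with the flag.
CFLAGS += -fPIC
```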

I'm not sure what the test/libgtest.a warnings are. I see them on Ubuntu 16.04, but since nothing blows up I haven't wanted to investigate, haha.

@syclik (Member) commented Apr 24, 2017

Does the test pass on the develop branch? No need to test everything. Just run:
./runTests.py test/unit/multiple_translation_units_test.cpp

We may have an incompatibility with that test and gcc 5 that we didn't know about until recently.

@SteveBronder (Collaborator, Author)

Compilation Notes:

For the -fPIC thing, add "-fPIC" to the end of the CFLAGS line in math/makefile.

Yes, this does the trick!

Does the test pass on the develop branch? No need to test everything. Just run:
./runTests.py test/unit/multiple_translation_units_test.cpp

Running ./runTests.py test/unit/multiple_translation_units_test.cpp yields

Running main() from gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from multiple_translation_units
[ RUN      ] multiple_translation_units.compile
[       OK ] multiple_translation_units.compile (0 ms)
[----------] 1 test from multiple_translation_units (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.

Running ./runTests.py test/unit compiles, though it is taking a while. The following warnings are output:

...
clang: warning: -lOpenCL: 'linker' input unused
...
make: Warning: Archive 'test/libgtest.a' seems to have been created in deterministic mode. 'test/gtest.o' will always be updated. Please consider passing the U flag to ar to avoid the problem.
...

The -lOpenCL warning comes from tests that do not use ViennaCL. Can we add the flag to CFLAGS specifically for the Cholesky test?
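One possible way to scope the flag, sketched with GNU make's target-specific variables (the target name below is illustrative, not an actual rule from the makefile):

```make
# Hypothetical sketch: attach the OpenCL linker flag only to the Cholesky
# test binary instead of adding it to the global CFLAGS, so unrelated
# tests stop emitting the "linker input unused" warning.
test/unit/math/rev/mat/fun/cholesky_decompose_test: LDLIBS += -lOpenCL
```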

The second warning, as noted above, comes from make's archive handling on Linux and does not seem to be a huge issue.


Implementation Notes:

On line 351 of cholesky_decompose.hpp there is an if statement that checks whether L_A has more than 500 rows and, if so, moves the computation over to the GPU with ViennaCL. What is a better way for the user to specify that they want to use the GPU? A flag when the user compiles a program in CmdStan?

I think what we want is, if the user specifies they want to do computation on the GPU:

  1. Add -DVIENNACL_WITH_OPENCL -lOpenCL to CFLAGS.
  2. Include something so the if statement in cholesky_decompose sends the work over to ViennaCL.

@SteveBronder self-assigned this Apr 24, 2017
@SteveBronder (Collaborator, Author)

Update: Running ./runTests.py test/unit, the cholesky tests yield

------------------------------------------------------------
test/unit/math/rev/mat/fun/cholesky_decompose_test --gtest_output="xml:test/unit/math/rev/mat/fun/cholesky_decompose_test.xml"
Running main() from gtest_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from AgradRevMatrix
[ RUN      ] AgradRevMatrix.mat_cholesky
[       OK ] AgradRevMatrix.mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.exception_mat_cholesky
[       OK ] AgradRevMatrix.exception_mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_small
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_small (176 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_small
[       OK ] AgradRevMatrix.check_varis_on_stack_small (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 980.43059810641955, which exceeds prec, where
grad_fd(i) evaluates to -1.8259633576235501,
grad_ad(i) evaluates to -982.2565614640431, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 0.014410414688841633, which exceeds prec, where
grad_fd(i) evaluates to 0.55275936344211618,
grad_ad(i) evaluates to 0.53834894875327455, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 486.57150578139544, which exceeds prec, where
grad_fd(i) evaluates to -494.24178632483142,
grad_ad(i) evaluates to -980.81329210622687, and
prec evaluates to 1e-08.
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients (3558 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_large
[       OK ] AgradRevMatrix.check_varis_on_stack_large (0 ms)
[----------] 6 tests from AgradRevMatrix (3734 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (3734 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients

 1 FAILED TEST
test/unit/math/rev/mat/fun/cholesky_decompose_test --gtest_output="xml:test/unit/math/rev/mat/fun/cholesky_decompose_test.xml" failed
exit now (04/24/17 23:34:55 EDT)

Which is good at this point because at least it compiles. I just started my new job so am going to be a little slow in writing this (hopefully this weekend). At this point I just need to go through the unit tests with a debugger and see what is causing the failure.

@rtrangucci would you have time to take a look at this? I think it is like 90% of the way there, but I'm doing something simple that is giving it the wrong output.

@SteveBronder (Collaborator, Author)

Some bad news.

Here is a link to a dropbox folder that contains the Cholesky test that I originally ran. However, in this version we do not assume that the matrix is diagonally dominant. Below are the results of the output.
(Spoiler: it fails)

./eigen_vienna_LU_compare.o

----------------------------------------------
----------------------------------------------
## Benchmark ::Eigen Vs. ViennaCL Performance 
----------------------------------------------


 -- Times for 16x16 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.00027
----------------------------------------------
Time for LLT (Eigen): 5.5e-05
----------------------------------------------
Time for LU (ViennaCL): 0.000161
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 100x100 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.004182
----------------------------------------------
Time for LLT (Eigen): 0.000451
----------------------------------------------
Time for LU (ViennaCL): 0.005883
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 400x400 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 0.173094
----------------------------------------------
Time for LLT (Eigen): 0.00629
----------------------------------------------
Time for LU (ViennaCL): 0.037217
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 900x900 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 1.85584
----------------------------------------------
Time for LLT (Eigen): 0.033258
----------------------------------------------
Time for LU (ViennaCL): 0.182988
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 1600x1600 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 9.94796
----------------------------------------------
Time for LLT (Eigen): 0.120078
----------------------------------------------
Time for LU (ViennaCL): 0.584404
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

 -- Times for 3600x3600 Matrix --
Precision: 2.22045e-16
Time for Partial Pivot LU (Eigen): 110.272
----------------------------------------------
Time for LLT (Eigen): 0.55694
----------------------------------------------
Time for LU (ViennaCL): 3.04208
----------------------------------------------
Eigen LU Partial Pivot and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Do Not Match!
----------------------------------------------

Also, oddly, the ViennaCL LU is much slower than Eigen's LLT even at the 3600x3600 size.

Why would the diagonally dominant assumption have such a large effect? How fair is it to assume that the matrix coming into the Cholesky will be diagonally dominant?

@bob-carpenter (Member) commented May 1, 2017 via email

@SteveBronder (Collaborator, Author)

What does diagonal dominant mean?

Diagonally dominant means that, in each column, the magnitude of the diagonal entry dominates the off-diagonal entries (strictly: it is at least the sum of their magnitudes)

Is that precision the precision of the matching?

Yes, I need to test how far off we are at lower precision. I also want to look at the Slovenian team's OpenCL code. If their Cholesky works with better precision on non-diagonally-dominant matrices, then it should not be too hard to integrate.

Sorry if this sounds stupid, but are you sure that's just a difference of input matrices and that nothing else changed?

I think so, but I will double check.

Also, what's with LU vs. LLT?

To do a Cholesky in ViennaCL we have to take the L of the LU and divide each column by the diagonal element to get the L of the LLT decomposition. We do this for both the Eigen LU and the ViennaCL LU to make sure it works with both. My linear algebra is rusty, but maybe if the matrix is not diagonally dominant we don't need to do the division? I'll play around with this, though any suggestions would be appreciated.

@bob-carpenter (Member)

Thanks. I didn't see a comment from @betanalpha, but we talked about this today and he thought the problem would arise when LU did a lot of pivoting. He thought LLT would be a lot faster.

@rtrangucci mentioned that our use case for these will often have non-diagonally-dominant matrices. But if it's really just that the diagonal is the largest, then that'll certainly be satisfied by correlation matrices, which have 1 on the diagonal and values strictly between -1 and 1 elsewhere.

@bbbales2 (Member) commented May 2, 2017

I ran across this the other day: https://en.wikipedia.org/wiki/Cholesky_decomposition#Stability_of_the_computation (which I assume is what @betanalpha was saying). Apparently Cholesky doesn't require pivoting to be numerically stable, which is why it's nice if you've got a positive-definite matrix you want to factor.

@SteveBronder (Collaborator, Author) commented May 2, 2017

I think @betanalpha and @bbbales2 are on the money. We are using the LU in the GPU code to get to the LLT, so pivots matter. Testing the Slovenian team's code with the non-diagonally-dominant matrices also fails (even at 2.22045e-10 precision).

So I'll do a deep dive to see if other external libraries contain a stable Cholesky for the GPU. Perhaps MAGMA.

EDIT: Maybe we can use the ViennaCL preconditioners? Not totally sure if that's what we want, but maybe it will help

New job is keeping me busy, but I can poke around this week and weekend

@rok-cesnovar (Member)

I somehow missed this topic on GitHub, so I am a bit late to the party, sorry.

I looked into this, and I am not really sure about the input matrix you are using.
It seems that it is not positive definite. I wrote a few of the matrices that the program creates to a file, fed them back into R, and tried chol(). All of them report the error "the leading minor of order X is not positive definite" (X varies).

If the input matrix is not positive definite, then the Cholesky should output NaN, hence the precision error.

@SteveBronder (Collaborator, Author)

@rok-cesnovar you are correct! Apologies, I did the new tests after work one day and my brain must have been shut off. Fixing that allows both GPU versions to be equal to the Eigen version at a precision of 2.22045e-12, though they do not match at tighter precision.

FTR: here is how I am computing the test matrix now

  MatrixXd A = MatrixXd::Random(m,m);
  // Make matrices symmetric positive definite
  for (int i = 0; i < m; i++){
    for (int j = i + 1; j < m; j++){
        A(i,j) = A(j,i);
    }
  }
  A = A * A.transpose();

Here is a dropbox containing the new tests. Below are the results on my machine

## Benchmark ::Eigen Vs. ViennaCL Performance 
----------------------------------------------


 -- Times for 16x16 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 9.9e-05
----------------------------------------------
Time for LU (ViennaCL): 0.000306
----------------------------------------------
Time for BayesCL 0.001507
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 100x100 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.003874
----------------------------------------------
Time for LU (ViennaCL): 0.017542
----------------------------------------------
Time for BayesCL 0.0042
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 400x400 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.150595
----------------------------------------------
Time for LU (ViennaCL): 0.064673
----------------------------------------------
Time for BayesCL 0.020384
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 900x900 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 1.41396
----------------------------------------------
Time for LU (ViennaCL): 0.283503
----------------------------------------------
Time for BayesCL 0.069564
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 1600x1600 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 8.09797
----------------------------------------------
Time for LU (ViennaCL): 0.661283
----------------------------------------------
Time for BayesCL 0.171118
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 2500x2500 Matrix --
Precision: 2.22045e-12
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 28.8914
----------------------------------------------
Time for LU (ViennaCL): 2.17089
----------------------------------------------
Time for BayesCL 0.482391
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

And here is a table showing the speedup relative to Eigen for the larger matrices:

| Size | Eigen | ViennaCL | BayesCL |
| ---- | ----- | -------- | ------- |
| 400  | 1     | 2.5      | 7.5     |
| 900  | 1     | 5        | 23      |
| 1600 | 1     | 12       | 47      |
| 2500 | 1     | 13       | 60      |

So BayesCL gives the best results! Though with the recent update to Eigen 3.3, I should probably rewrite the tests to use the in-place LLT.

So now the question is, supposing these tests are correct, why are the results not matching up in the unit tests?

With the obvious speedups of BayesCL, I do think it would be worth using it. Since it's rather easy to use custom kernels in ViennaCL, I think we should first get the plain ViennaCL version working and then swap in the BayesCL kernels.

I also want to point out that the ViennaCL team would most likely be very interested in having this LLT decomposition in their library. Perhaps we can start talking with them about implementing it there? @rok-cesnovar, would you want to reach out to them?

@syclik (Member) commented May 6, 2017 via email

@rtrangucci (Contributor) commented May 6, 2017 via email

@bgoodri (Contributor) commented May 6, 2017 via email

@SteveBronder (Collaborator, Author) commented May 6, 2017

I apologize for the long post. I had time today to go through and really think about everything.

TL;DR: All the tests for stan are passing. I rewrote the benchmark so that it uses the create_mat() function to make the matrix A. Both the GPU results match to 2.22045e-14 precision.

Do you have a handle on the math?

... ehhh, sort of. I may need a little hand-holding from time to time. It's been a while since I have done any serious linear algebra. For instance, I was doing A = A * A' to make the test matrix because I googled "how do I make a positive definite matrix" and that was the first result (which, looking further, I see only makes a positive semi-definite matrix).
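A one-line justification of that correction, as an aside in LaTeX (not from the thread):

```latex
% For every x: x^T (A A^T) x = \|A^T x\|^2 >= 0, so A A^T is always
% positive semi-definite; it is positive definite iff A^T x = 0 forces
% x = 0, i.e. iff A has full row rank. A random square A is almost
% surely full rank, which is why these test matrices usually factor.
x^\top (A A^\top)\, x = \lVert A^\top x \rVert^2 \ge 0
```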

Benchmark Results

I remade the benchmark so that it uses the create_mat() function from Stan's Cholesky tests to create the matrix A. Since that function is used to create a cov_mat object, I assume the result is positive definite. Below are the run times and a dropbox containing the code.

----------------------------------------------
## Benchmark ::Eigen Vs. ViennaCL Performance 
----------------------------------------------


 -- Times for 16x16 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.000119
----------------------------------------------
Time for LU (ViennaCL): 0.000208
----------------------------------------------
Time for BayesCL 0.000987
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 100x100 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.005677
----------------------------------------------
Time for LU (ViennaCL): 0.011431
----------------------------------------------
Time for BayesCL 0.004338
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 400x400 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 0.230822
----------------------------------------------
Time for LU (ViennaCL): 0.050966
----------------------------------------------
Time for BayesCL 0.029414
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 900x900 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 2.33466
----------------------------------------------
Time for LU (ViennaCL): 0.239549
----------------------------------------------
Time for BayesCL 0.098063
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 1600x1600 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 10.3096
----------------------------------------------
Time for LU (ViennaCL): 0.625864
----------------------------------------------
Time for BayesCL 0.334106
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

 -- Times for 2500x2500 Matrix --
Precision: 2.22045e-14
Using the device GeForce GTX 780 on the platform NVIDIA CUDA
----------------------------------------------
Time for LLT (Eigen): 48.7875
----------------------------------------------
Time for LU (ViennaCL): 1.84669
----------------------------------------------
Time for BayesCL 0.817809
----------------------------------------------
Eigen LLT and ViennaCL Lower Diagonal Match!
----------------------------------------------
Eigen LLT and BayesCL Lower Diagonal Match!
----------------------------------------------

Below are the speedups relative to Eigen. The custom BayesCL kernel is roughly twice as fast as ViennaCL.

| Size | Eigen | ViennaCL | BayesCL |
| ---- | ----- | -------- | ------- |
| 400  | 1     | 4.6      | 11.5    |
| 900  | 1     | 10.1     | 25.8    |
| 1600 | 1     | 16.6     | 31.2    |
| 2500 | 1     | 26.5     | 60.1    |

Stan Tests

@syclik with the most recent commit the tests do pass!

$ ./runTests.py  'test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp' --debug
Running main() from gtest_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from AgradRevMatrix
[ RUN      ] AgradRevMatrix.mat_cholesky
[       OK ] AgradRevMatrix.mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.exception_mat_cholesky
[       OK ] AgradRevMatrix.exception_mat_cholesky (1 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_small
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_small (165 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_small
[       OK ] AgradRevMatrix.check_varis_on_stack_small (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients (2621 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_large
[       OK ] AgradRevMatrix.check_varis_on_stack_large (0 ms)
[----------] 6 tests from AgradRevMatrix (2787 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (2787 ms total)
[  PASSED  ] 6 tests.

With the current development version we are really only testing AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients, since the check to use the GPU is simply whether the matrix has more than 500 rows. I'm using that for now until we figure out a nice way to let the user specify whether they want to use a GPU.

One thing to note: as you can see here in cholesky_decompose.hpp, we only move to the GPU for the computation of L_A. It would be preferable and much faster if we replaced cholesky_block() with cholesky_gpu() for the solver portion as well. I should have something for this by tomorrow. If anyone else wants to help, at this point it's about converting the blocked Cholesky's chain() function so that it uses ViennaCL (and is not blocked, of course). This mostly involves finding the translation from Eigen to ViennaCL.

Some other things

  1. Should I come to the Stan meeting this Thursday? I would have to join via Google Hangouts since I will be at work.
  2. What is the best way to make this visible to users?
  3. At what point should we move forward with this? If the majority of the bottleneck is the computation of L_A, then we have that covered and should look at making this user-visible. Otherwise we can wait and also fix cholesky_gpu().

@SteveBronder (Collaborator, Author)

@bgoodri would it be better to use create_mat() or

MatrixXd L = stan::math::lkj_corr_cholesky_rng(1.0);
MatrixXd A = stan::math::multiply_lower_tri_self_transpose(L);

And perhaps we could also do some tests with a diagonal matrix of very different standard deviations that pre-multiplies lkj_corr_cholesky_rng(1.0).

Tomorrow I will rewrite the benchmark so that it runs a few different matrices and collects the times and whether they all matched.

@SteveBronder (Collaborator, Author) commented May 6, 2017

@rtrangucci just saw your post; I'll go look at your comment now!

@SteveBronder (Collaborator, Author) commented May 6, 2017

@rtrangucci for some reason I can see part of your comment in the email GitHub sent me, but it won't take me to the full comment. Could you post the comment here?

but how would I go about testing this on my own? Is it pretty easy to run ViennaCL on a laptop GPU?

It's easy to run, but it can take a little time to set up. If you have an NVIDIA GPU, I used these instructions to get mine set up.

@SteveBronder (Collaborator, Author)

Another question: once we have the GPU version, how do we check it with Travis?

@rtrangucci (Contributor)

@Stevo15025 Here's the comment below:

I think the gradient code should be something like this, starting on line 306:

vcl_L = trans(vcl_L);
vcl_Lbar = prod(vcl_L, vcl_Lbar);
for (int i = 0; i < L.rows() - 1; ++i)
  for (int j = i + 1; j < L.rows(); ++j)
    vcl_L(j, i) = vcl_L(i, j);
viennacl::linalg::inplace_solve(vcl_L, vcl_Lbar, viennacl::linalg::upper_tag());
viennacl::linalg::inplace_solve(vcl_L, viennacl::linalg::trans(vcl_Lbar), viennacl::linalg::upper_tag());

@SteveBronder (Collaborator, Author)

@rtrangucci I updated the code with your code, though putting cholesky_gpu() back into things makes the tests fail again.

Running main() from gtest_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from AgradRevMatrix
[ RUN      ] AgradRevMatrix.mat_cholesky
[       OK ] AgradRevMatrix.mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.exception_mat_cholesky
[       OK ] AgradRevMatrix.exception_mat_cholesky (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_small
[       OK ] AgradRevMatrix.mat_cholesky_1st_deriv_small (163 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_small
[       OK ] AgradRevMatrix.check_varis_on_stack_small (0 ms)
[ RUN      ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 14.279985728486359, which exceeds prec, where
grad_fd(i) evaluates to -1.8259633576235501,
grad_ad(i) evaluates to -16.105949086109909, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 1.563984294565459, which exceeds prec, where
grad_fd(i) evaluates to 0.55275936344211618,
grad_ad(i) evaluates to 2.1167436580075751, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 488.01477513919122, which exceeds prec, where
grad_fd(i) evaluates to -494.24178632483142,
grad_ad(i) evaluates to -982.25656146402264, and
prec evaluates to 1e-08.
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients (6691 ms)
[ RUN      ] AgradRevMatrix.check_varis_on_stack_large
[       OK ] AgradRevMatrix.check_varis_on_stack_large (0 ms)
[----------] 6 tests from AgradRevMatrix (6854 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test case ran. (6854 ms total)
[  PASSED  ] 5 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] AgradRevMatrix.mat_cholesky_1st_deriv_large_gradients

 1 FAILED TEST
test/unit/math/rev/mat/fun/cholesky_decompose_test --gtest_output="xml:test/unit/math/rev/mat/fun/cholesky_decompose_test.xml" failed
exit now (05/06/17 19:52:48 EDT)

You seem to be very close though. I'll look more at this tmrw. If you have any problems installing nvcc from the link above feel free to contact me tmrw as I will be around all day.

@rtrangucci
Contributor

rtrangucci commented May 7, 2017 via email

@SteveBronder
Collaborator Author

We need to divide diagonal of the result, vcl_Lbar, by 2. That should work. My bad!

No worries!

Added this to the latest commit here. I'm doing the division on the diagonal after we move back to Eigen.

The tests are still off, though:

The difference between grad_fd(i) and grad_ad(i) is 967.97657573553715, which exceeds prec, where
grad_fd(i) evaluates to -1.8259633576235501,
grad_ad(i) evaluates to 966.1506123779136, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 1.563984294565459, which exceeds prec, where
grad_fd(i) evaluates to 0.55275936344211618,
grad_ad(i) evaluates to 2.1167436580075751, and
prec evaluates to 1e-08.
test/unit/math/rev/mat/fun/cholesky_decompose_test.cpp:260: Failure
The difference between grad_fd(i) and grad_ad(i) is 3.1135055928201041, which exceeds prec, where
grad_fd(i) evaluates to -494.24178632483142,
grad_ad(i) evaluates to -491.12828073201132, and
prec evaluates to 1e-08.

I'm about to run out of the house, but tomorrow I will look over the paper

(for my own reference I'm posting the main bit)
[screenshot of the gradient algorithm from the paper]
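For reference, the level-3 gradient being discussed can be checked on the CPU with a small NumPy sketch. This assumes the usual formulation Abar = 0.5 * L^-T (P + P^T) L^-1 with P = Phi(L^T Lbar), where Phi keeps the lower triangle and halves the diagonal (which is where the divide-by-2 belongs); the function names here are mine, not Stan's.

```python
import numpy as np

def phi(X):
    # Lower triangle of X with the diagonal halved (the Phi operator).
    P = np.tril(X)
    P[np.diag_indices_from(P)] *= 0.5
    return P

def cholesky_grad(L, Lbar):
    # Reverse-mode gradient of A -> chol(A):
    #   Abar = 0.5 * L^-T (P + P^T) L^-1,  P = Phi(L^T Lbar)
    P = phi(L.T @ Lbar)
    S = P + P.T
    tmp = np.linalg.solve(L.T, S)             # L^-T * S
    return np.linalg.solve(L.T, tmp.T).T / 2  # ... * L^-1, then halve

# Finite-difference check on a random SPD matrix.
rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # well-conditioned SPD
W = np.tril(rng.standard_normal((n, n)))     # arbitrary cotangent for L
f = lambda B: np.sum(np.linalg.cholesky(B) * W)

Abar = cholesky_grad(np.linalg.cholesky(A), W)
E = rng.standard_normal((n, n))
E = E + E.T                                  # symmetric perturbation direction
eps = 1e-6
fd = (f(A + eps * E) - f(A - eps * E)) / (2 * eps)
analytic = np.sum(Abar * E)                  # should match fd closely
```

A triangular-solve routine (e.g. scipy.linalg.solve_triangular) would be the idiomatic choice over np.linalg.solve here; the general solver is used only to keep the sketch dependency-free.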

@SteveBronder
Collaborator Author

SteveBronder commented May 7, 2017

Oh, reading the paper I see the division I am doing is in the wrong place

@SteveBronder
Collaborator Author

So all of the GPU tests are passing on my local. I've added the disable_if for the GPU version of cholesky_decomp.

Oddly, corr_matrix_transform_test fails, and looking at the comparison it does not seem like I changed anything between this branch and develop.

On a side note, should we think about having a separate PR for adding ViennaCL to develop? Then this pull will be more like 500 lines and not 100K.

@SteveBronder
Collaborator Author

Here's a link to the error. Seems to be an operator overload error

@SteveBronder
Collaborator Author

Nvm, corr_matrix_transform_test is passing now. It had something to do with the gtest.o file.

Going to run all the tests overnight and if everything passes we should be good to go!

@SteveBronder SteveBronder mentioned this pull request Sep 18, 2017
@bob-carpenter
Member

bob-carpenter commented Sep 18, 2017 via email

@SteveBronder
Collaborator Author

Closing this for #637

@SteveBronder SteveBronder deleted the gpu_cholesky branch July 31, 2019 10:07