Use 32-bit addressing in take1 when possible #166
Conversation
… to do it in 64 bits. Also use the new GpuKernel_setarg option to avoid allocating a buffer for the arguments.
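A minimal sketch of the two ideas in that message, not the PR's actual code: pick 32-bit index arithmetic when every offset fits (32-bit integer math is cheaper on most GPUs), and set kernel arguments in place with GpuKernel_setarg instead of allocating a buffer of argument pointers. The helper names are hypothetical, and the GpuKernel_setarg signature is assumed from the libgpuarray headers of that period:

```c
#include <stdint.h>
#include <gpuarray/error.h>
#include <gpuarray/kernel.h>

/* Choose the index type substituted into the generated kernel
 * source: 32-bit arithmetic whenever every offset fits, 64-bit
 * otherwise. ga_uint/ga_ulong are the cluda typedefs for 32- and
 * 64-bit unsigned integers. */
static const char *take1_index_type(size_t largest_offset) {
  return (largest_offset <= UINT32_MAX) ? "ga_uint" : "ga_ulong";
}

/* Set kernel arguments one at a time on the kernel object rather
 * than packing pointers into a separately allocated buffer.
 * (Hypothetical helper; GpuKernel_setarg is assumed to take
 * (GpuKernel *, unsigned int, void *).) */
static int set_take1_args(GpuKernel *k, unsigned int *argp,
                          unsigned int *n0, unsigned int *n1) {
  int err = GpuKernel_setarg(k, (*argp)++, n0);
  if (err != GA_NO_ERROR)
    return err;
  return GpuKernel_setarg(k, (*argp)++, n1);
}
```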
src/gpuarray_array.c
Outdated
size_t argp;
GpuKernel k;
unsigned int j;
unsigned int _n[2], _o;
Where is this used? I don't see it being used.
Leftover from previous code.
I finished my review.
I should have fixed all the problems.
Not sure how to get the full information. I'm running the benchmarks to check the performance.
This is probably just that the GPU is used; those tests don't handle that.
Yes, I did that, I actually ran …
It seems like a bug; @abergeron will check.
Strange error when running the benchmark.
Problem should be fixed.
The code seems to run. I still have to relaunch with profiling and compare.
Updated timings with that change below (total time spent in that Op, in seconds).
Should we merge this? I think what @abergeron said about making GpuJoin reuse elemwise to get the full speed benefit would be useful here too. If so, that would make this PR redundant. But if we merge now, we get some good speedups right away, so people who absolutely need float16 won't see too much slowdown. I vote to merge now.
I did not review the code and I'm not sure I understand how the other solution would interact with this one. I'm OK with merging if you think it is a good idea.
Btw, the overall performance was similar to the old back-end, due to Gemm being faster.
I reviewed it completely and I think it is good to merge. This could use the GpuElemwise kernel to make a copy for each element, marking them as able to run in parallel. But it would probably be better to implement the dimension collapsing here: there is a function that can be reused, and this would take only one kernel call.
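For reference, dimension collapsing merges adjacent dimensions whose elements are contiguous with respect to each other, so the copy runs over a flatter shape and fewer index computations. A standalone sketch of such a pass, with illustrative names rather than the reusable libgpuarray function mentioned above:

```c
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/* Collapse adjacent dimensions when stepping once along the outer
 * dimension equals stepping dims[i] times along the inner one,
 * i.e. the pair behaves like a single larger dimension. Rewrites
 * dims/strides (in bytes) in place and returns the new rank. */
static unsigned int collapse_dims(unsigned int nd, size_t *dims,
                                  ssize_t *strides) {
  unsigned int i, j = 0;
  if (nd == 0)
    return 0;
  for (i = 1; i < nd; i++) {
    if (strides[j] == (ssize_t)dims[i] * strides[i]) {
      /* Contiguous pair: fold dimension i into slot j. */
      dims[j] *= dims[i];
      strides[j] = strides[i];
    } else {
      /* Not collapsible: start a new output dimension. */
      j++;
      dims[j] = dims[i];
      strides[j] = strides[i];
    }
  }
  return j + 1;
}
```

On a fully contiguous input this folds everything down to one dimension, so the take1 copy of each selected row becomes a single flat loop in one kernel call.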
I need to leave. When I can, I'll run the gpuarray and Theano tests to make sure it works well, and then I'll merge.
I can run the tests. |
I'm running them.
Me too. The gpuarray tests passed, I'm running the Theano ones now.
The Theano ones passed; I'm finishing the gpuarray ones, so merging.
This should be faster and doesn't change the public interface.