Conversation

abergeron
Member

This is faster and doesn't change the public interface.

… to do it in 64 bits

Also use the new GpuKernel_setarg option to avoid allocating a buffer
for the arguments.
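
To make those two points concrete, here is a rough host-side sketch (not the code from this PR): pick the kernel variant compiled with 32-bit indexing when every size fits in 32 bits, and set the kernel arguments one by one with GpuKernel_setarg instead of allocating an argument buffer. The prototypes stated in the comments, the meaning of a NULL args array, and the names (take1_call, k32, k64, ...) are assumptions made for illustration; check gpuarray/kernel.h for the real API.

/* Hedged sketch only.  Assumed prototypes:
 *   int GpuKernel_setarg(GpuKernel *k, unsigned int index, void *arg);
 *   int GpuKernel_call(GpuKernel *k, unsigned int n, const size_t *gs,
 *                      const size_t *ls, size_t shared, void **args);
 * Buffer arguments are assumed to be passed as their gpudata handle, and
 * args == NULL is assumed to mean "use the values given to setarg". */
#include <stdint.h>
#include <stddef.h>
#include <gpuarray/error.h>
#include <gpuarray/kernel.h>

int take1_call(GpuKernel *k32, GpuKernel *k64,
               gpudata *out, gpudata *inp, gpudata *ind,
               size_t n0, size_t n1) {
  size_t gs = 256, ls = 64;             /* illustrative launch geometry */
  unsigned int m0 = (unsigned int)n0, m1 = (unsigned int)n1;
  int use32 = (n0 <= UINT32_MAX && n1 <= UINT32_MAX);
  GpuKernel *k = use32 ? k32 : k64;     /* 32-bit indexing whenever it fits */
  int err;

  /* Set arguments individually instead of building a temporary argument buffer. */
  if ((err = GpuKernel_setarg(k, 0, out)) != GA_NO_ERROR) return err;
  if ((err = GpuKernel_setarg(k, 1, inp)) != GA_NO_ERROR) return err;
  if ((err = GpuKernel_setarg(k, 2, ind)) != GA_NO_ERROR) return err;
  /* The two kernel variants expect size arguments of matching width. */
  if ((err = GpuKernel_setarg(k, 3, use32 ? (void *)&m0 : (void *)&n0)) != GA_NO_ERROR)
    return err;
  if ((err = GpuKernel_setarg(k, 4, use32 ? (void *)&m1 : (void *)&n1)) != GA_NO_ERROR)
    return err;

  return GpuKernel_call(k, 1, &gs, &ls, 0, NULL);
}

The public interface stays untouched: the 32/64-bit choice and the per-argument setup all happen inside the op's call code.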
size_t argp;
GpuKernel k;
unsigned int j;
unsigned int _n[2], _o;
Member

Where is this used? I don't see it being used.

Member Author

Leftover from previous code.

@nouiz
Member

nouiz commented May 10, 2016

I finished my review

@abergeron
Member Author

I should have fixed all the problems.

@lamblin
Member

lamblin commented May 10, 2016

Running make test, I have:

3/6 Test #3: test_array .......................***Failed    0.52 sec

Not sure how to get the full information.

I'm running the benchmarks to check the performance.

@nouiz
Member

nouiz commented May 10, 2016

This is probably just that the GPU is already in use; those tests don't handle that
well. Use the env var DEVICE=cudaN with the right number.


@lamblin
Member

lamblin commented May 10, 2016

Yes, I did that: I actually ran DEVICE=cuda3 make test and checked with nvidia-smi that it was running on the right GPU.

@nouiz
Member

nouiz commented May 10, 2016

It seems to be a bug; @abergeron will check.

@lamblin
Member

lamblin commented May 10, 2016

Strange error when running the benchmark:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/benchmark/code/rnn_exps/theano/ptb/train_lm.py in <module>()
     23     config = getattr(config_lm, args.proto)()
     24     logger.info("Model options:\n{}".format(pprint.pformat(config)))
---> 25     train(**config)

/home/benchmark/code/rnn_exps/theano/ptb/lm.pyc in train(dim_word, dim, encoder, max_epochs, finish_after, dispFreq, decay_c, lrate, n_words, maxlen, batch_size, valid_batch_size, max_grad_norm, nlayers, data_path, use_dropout)
    596 
    597             # compute cost, grads and copy grads to shared variables
--> 598             cost = f_grad_shared(x)
    599 
    600             # do the update on parameters

/home/benchmark/repos/Theano/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    906                     node=self.fn.nodes[self.fn.position_of_error],
    907                     thunk=thunk,
--> 908                     storage_map=getattr(self.fn, 'storage_map', None))
    909             else:
    910                 # old-style linkers raise their own exceptions

/home/benchmark/repos/Theano/theano/gof/link.pyc in raise_with_op(node, thunk, exc_info, storage_map)
    312         # extra long error message in that case.
    313         pass
--> 314     reraise(exc_type, exc_value, exc_trace)
    315 
    316 

/home/benchmark/repos/Theano/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    893         try:
    894             outputs =\
--> 895                 self.fn() if output_subset is None else\
    896                 self.fn(output_subset=output_subset)
    897         except Exception:

RuntimeError: Success
Apply node that caused the error: GpuAdvancedSubtensor1(Wemb, GpuReshape{1}.0)
Toposort index: 39
Inputs types: [GpuArrayType<None>(float32, (False, False)), GpuArrayType<None>(int64, (False,))]
Inputs shapes: [(10000, 200), (400,)]
Inputs strides: [(800, 4), (8,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[GpuReshape{3}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)]]

@abergeron
Member Author

Problem should be fixed.

@lamblin
Member

lamblin commented May 11, 2016

The code seems to run. I still have to relaunch with profiling and compare.

@lamblin
Member

lamblin commented May 12, 2016

Updated timings with that change below (total time spent in that Op, in seconds).
TL;DR: it helps, but it does not close the whole gap.

model   op                                old backend   new backend   this PR
small   GpuAdvancedSubtensor1             2.2           16            9.8
large   GpuAdvancedSubtensor1             1.1           10.5          8.2
small   GpuAdvancedIncSubtensor1          77            209           131
large   GpuAdvancedIncSubtensor1          75            204           129
small   GpuAdvancedIncSubtensor1_dev20    0.861         0.68          0.475
large   GpuAdvancedIncSubtensor1_dev20    0.518         0.438         0.370

@nouiz
Member

nouiz commented May 12, 2016

Should we merge this? I think what @abergeron said about making GpuJoin reuse elemwise to get the full speed benefit would apply here too. If so, that would make this PR obsolete.

But if we merge now, we get some good speedups right away, so people who absolutely need float16 won't see too much of a slowdown.

I vote to merge now.

@lamblin
Member

lamblin commented May 12, 2016

I did not review the code and I'm not sure I understand how the other solution would interact with this one. I'm OK with merging if you think it is a good idea.

@lamblin
Member

lamblin commented May 12, 2016

Btw, the overall performance was similar to the old back-end, due to Gemm being faster.

@nouiz
Member

nouiz commented May 12, 2016

I reviewed it completely and I think it is good to merge.

This could use the GpuElemwise kernel to make a copy for each element, but mark the copies as able to run in parallel.

But it would probably be better to implement the dimension collapsing here. There is a function that can be reused, and this would result in only one kernel call.
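
For reference, here is a minimal sketch of what dimension collapsing means in this context (an illustration, not the existing helper referred to above): adjacent dimensions that are contiguous with respect to each other get merged, so a fully contiguous N-d copy degenerates into a single 1-d launch, hence one kernel call. The function name and the in-place convention are made up for the example.

#include <stddef.h>

/* Collapse adjacent dimensions in place (C order, strides in bytes) and
 * return the new number of dimensions.  Two neighbouring dims can be merged
 * when the outer stride spans exactly the inner extent. */
static unsigned int collapse_dims(unsigned int nd, size_t *dims, ptrdiff_t *strides) {
  unsigned int i, out = 0;
  if (nd == 0) return 0;
  for (i = 1; i < nd; i++) {
    if (dims[i] == 1)
      continue;                         /* size-1 dims add nothing, drop them */
    if (strides[out] == strides[i] * (ptrdiff_t)dims[i]) {
      dims[out] *= dims[i];             /* fuse into one flat block */
      strides[out] = strides[i];
    } else {
      out++;                            /* keep as a separate dimension */
      dims[out] = dims[i];
      strides[out] = strides[i];
    }
  }
  return out + 1;
}

For a contiguous (4, 5, 6) float32 array with strides (120, 24, 4), this returns a single dimension of size 120 with stride 4, which is why one kernel call can cover the whole copy.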

@nouiz
Member

nouiz commented May 12, 2016

I need to leave. When I can, I'll run the gpuarray and Theano tests to make sure it works well, and then I'll merge.

@lamblin
Member

lamblin commented May 12, 2016

I can run the tests.

@nouiz
Member

nouiz commented May 12, 2016

I'm running them.


@lamblin
Member

lamblin commented May 12, 2016

Me too. The gpuarray tests passed, I'm running the Theano ones now.

@nouiz
Member

nouiz commented May 12, 2016

The Theano ones passed; I'm finishing the gpuarray ones, so merging.


@nouiz nouiz merged commit b38727e into Theano:master May 12, 2016
@abergeron abergeron deleted the take1_call32 branch May 12, 2016 21:26