Conversation

abergeron
Member

This is faster and doesn't change the public interface.

… to do it in 64 bits

Also use the new GpuKernel_setarg option to avoid allocating a buffer
for the arguments.
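
To make those two points concrete, here is a rough host-side sketch (not the code from this PR): pick the kernel variant compiled with 32-bit indexing when every size fits in 32 bits, and set the kernel arguments one by one with GpuKernel_setarg instead of allocating an argument buffer. The prototypes stated in the comments, the meaning of a NULL args array, and the names (take1_call, k32, k64, ...) are assumptions made for illustration; check gpuarray/kernel.h for the real API.

/* Hedged sketch only.  Assumed prototypes:
 *   int GpuKernel_setarg(GpuKernel *k, unsigned int index, void *arg);
 *   int GpuKernel_call(GpuKernel *k, unsigned int n, const size_t *gs,
 *                      const size_t *ls, size_t shared, void **args);
 * Buffer arguments are assumed to be passed as their gpudata handle, and
 * args == NULL is assumed to mean "use the values given to setarg". */
#include <stdint.h>
#include <stddef.h>
#include <gpuarray/error.h>
#include <gpuarray/kernel.h>

int take1_call(GpuKernel *k32, GpuKernel *k64,
               gpudata *out, gpudata *inp, gpudata *ind,
               size_t n0, size_t n1) {
  size_t gs = 256, ls = 64;             /* illustrative launch geometry */
  unsigned int m0 = (unsigned int)n0, m1 = (unsigned int)n1;
  int use32 = (n0 <= UINT32_MAX && n1 <= UINT32_MAX);
  GpuKernel *k = use32 ? k32 : k64;     /* 32-bit indexing whenever it fits */
  int err;

  /* Set arguments individually instead of building a temporary argument buffer. */
  if ((err = GpuKernel_setarg(k, 0, out)) != GA_NO_ERROR) return err;
  if ((err = GpuKernel_setarg(k, 1, inp)) != GA_NO_ERROR) return err;
  if ((err = GpuKernel_setarg(k, 2, ind)) != GA_NO_ERROR) return err;
  /* The two kernel variants expect size arguments of matching width. */
  if ((err = GpuKernel_setarg(k, 3, use32 ? (void *)&m0 : (void *)&n0)) != GA_NO_ERROR)
    return err;
  if ((err = GpuKernel_setarg(k, 4, use32 ? (void *)&m1 : (void *)&n1)) != GA_NO_ERROR)
    return err;

  return GpuKernel_call(k, 1, &gs, &ls, 0, NULL);
}

The public interface stays untouched: the 32/64-bit choice and the per-argument setup all happen inside the op's call code.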
size_t argp;
GpuKernel k;
unsigned int j;
unsigned int _n[2], _o;
Member

Where is this used? I don't see it being used.

Member Author

Leftover from previous code.

@nouiz
Member

nouiz commented May 10, 2016

I finished my review

@abergeron
Member Author

I should have fixed all the problems.

@lamblin
Member

lamblin commented May 10, 2016

Running make test, I have:

3/6 Test #3: test_array .......................***Failed    0.52 sec

Not sure how to get the full information.

I'm running the benchmarks to check the performance.

@nouiz
Member

nouiz commented May 10, 2016

This is probably just that the GPU is already in use; those tests don't handle that
well. Use the env var DEVICE=cudaN with the right number.


@lamblin
Member

lamblin commented May 10, 2016

Yes, I did that: I actually ran DEVICE=cuda3 make test and checked with nvidia-smi that it was running on the right GPU.

@nouiz
Member

nouiz commented May 10, 2016

It seems to be a bug; @abergeron will check.

@lamblin
Member

lamblin commented May 10, 2016

Strange error when running the benchmark:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/benchmark/code/rnn_exps/theano/ptb/train_lm.py in <module>()
     23     config = getattr(config_lm, args.proto)()
     24     logger.info("Model options:\n{}".format(pprint.pformat(config)))
---> 25     train(**config)

/home/benchmark/code/rnn_exps/theano/ptb/lm.pyc in train(dim_word, dim, encoder, max_epochs, finish_after, dispFreq, decay_c, lrate, n_words, maxlen, batch_size, valid_batch_size, max_grad_norm, nlayers, data_path, use_dropout)
    596 
    597             # compute cost, grads and copy grads to shared variables
--> 598             cost = f_grad_shared(x)
    599 
    600             # do the update on parameters

/home/benchmark/repos/Theano/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    906                     node=self.fn.nodes[self.fn.position_of_error],
    907                     thunk=thunk,
--> 908                     storage_map=getattr(self.fn, 'storage_map', None))
    909             else:
    910                 # old-style linkers raise their own exceptions

/home/benchmark/repos/Theano/theano/gof/link.pyc in raise_with_op(node, thunk, exc_info, storage_map)
    312         # extra long error message in that case.
    313         pass
--> 314     reraise(exc_type, exc_value, exc_trace)
    315 
    316 

/home/benchmark/repos/Theano/theano/compile/function_module.pyc in __call__(self, *args, **kwargs)
    893         try:
    894             outputs =\
--> 895                 self.fn() if output_subset is None else\
    896                 self.fn(output_subset=output_subset)
    897         except Exception:

RuntimeError: Success
Apply node that caused the error: GpuAdvancedSubtensor1(Wemb, GpuReshape{1}.0)
Toposort index: 39
Inputs types: [GpuArrayType<None>(float32, (False, False)), GpuArrayType<None>(int64, (False,))]
Inputs shapes: [(10000, 200), (400,)]
Inputs strides: [(800, 4), (8,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[GpuReshape{3}(GpuAdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)]]

@abergeron
Member Author

Problem should be fixed.

@lamblin
Member

lamblin commented May 11, 2016

The code seems to run. I still have to relaunch with profiling and compare.

@lamblin
Member

lamblin commented May 12, 2016

Updated timings with that change below (total time spent in that Op, in seconds).
TL;DR: it helps, but it does not close the whole gap.

model   op                                old backend   new backend   this PR
small   GpuAdvancedSubtensor1             2.2           16            9.8
large   GpuAdvancedSubtensor1             1.1           10.5          8.2
small   GpuAdvancedIncSubtensor1          77            209           131
large   GpuAdvancedIncSubtensor1          75            204           129
small   GpuAdvancedIncSubtensor1_dev20    0.861         0.68          0.475
large   GpuAdvancedIncSubtensor1_dev20    0.518         0.438         0.370

@nouiz
Member

nouiz commented May 12, 2016

Should we merge this? I think what @abergeron said about making GpuJoin reuse elemwise to get the full speed benefit would apply here too. If so, that would make this PR obsolete.

But if we merge now, we get some good speedups right away, so people who absolutely need float16 won't see too much of a slowdown.

I vote to merge now.

@lamblin
Member

lamblin commented May 12, 2016

I did not review the code and I'm not sure I understand how the other solution would interact with this one. I'm OK with merging if you think it is a good idea.

@lamblin
Member

lamblin commented May 12, 2016

Btw, the overall performance was similar to the old back-end, due to Gemm being faster.

@nouiz
Member

nouiz commented May 12, 2016

I reviewed it completely and I think it is good to merge.

This could use the GpuElemwise kernel to make a copy for each element, but mark the copies as able to run in parallel.

But it would probably be better to implement the dimension collapsing here. There is a function that can be reused, and this would result in only one kernel call.
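
For reference, here is a minimal sketch of what dimension collapsing means in this context (an illustration, not the existing helper referred to above): adjacent dimensions that are contiguous with respect to each other get merged, so a fully contiguous N-d copy degenerates into a single 1-d launch, hence one kernel call. The function name and the in-place convention are made up for the example.

#include <stddef.h>

/* Collapse adjacent dimensions in place (C order, strides in bytes) and
 * return the new number of dimensions.  Two neighbouring dims can be merged
 * when the outer stride spans exactly the inner extent. */
static unsigned int collapse_dims(unsigned int nd, size_t *dims, ptrdiff_t *strides) {
  unsigned int i, out = 0;
  if (nd == 0) return 0;
  for (i = 1; i < nd; i++) {
    if (dims[i] == 1)
      continue;                         /* size-1 dims add nothing, drop them */
    if (strides[out] == strides[i] * (ptrdiff_t)dims[i]) {
      dims[out] *= dims[i];             /* fuse into one flat block */
      strides[out] = strides[i];
    } else {
      out++;                            /* keep as a separate dimension */
      dims[out] = dims[i];
      strides[out] = strides[i];
    }
  }
  return out + 1;
}

For a contiguous (4, 5, 6) float32 array with strides (120, 24, 4), this returns a single dimension of size 120 with stride 4, which is why one kernel call can cover the whole copy.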

@nouiz
Member

nouiz commented May 12, 2016

I need to leave. When I can, I'll run the gpuarray and Theano tests to make sure it works well, and then I'll merge.

@lamblin
Member

lamblin commented May 12, 2016

I can run the tests.

@nouiz
Member

nouiz commented May 12, 2016

I'm running them.


@lamblin
Member

lamblin commented May 12, 2016

Me too. The gpuarray tests passed, I'm running the Theano ones now.

@nouiz
Member

nouiz commented May 12, 2016

The Theano ones passed; I'm finishing the gpuarray ones, so merging.


@nouiz nouiz merged commit b38727e into Theano:master May 12, 2016
@abergeron abergeron deleted the take1_call32 branch May 12, 2016 21:26