Skip to content
This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Conversation

@cmichi
Copy link
Contributor

@cmichi cmichi commented Feb 21, 2019

This addresses #1519 and this PR is not supposed to be merged, it's just here for discussion and can be closed afterwards.

The idea is to gain insight into how much optimization room we have for calls into wasm. To get an idea of the "ground truth" I use wasm2c to convert our node_runtime.wasm to C, which I then compile as a shared library, which I link to the substrate executable.

I then timed invocations of call_in_wasm_module in a very simple manner and ran cargo run --release -- --dev for 30 blocks. I did the same for wasm calls (here is the timed master branch with forced wasm method invocations: https://github.com/paritytech/substrate/tree/cmichi-time-wasm-calls) and for Rust native method invocations (https://github.com/paritytech/substrate/tree/cmichi-time-native-calls). No transactions took place on the chain, only empty blocks were produced (though there already happens a lot of stuff, even for empty blocks).

This is only a very rough benchmark! I'm totally aware of that. For example, system interruptions by the OS scheduler are not accounted for and could skew the results. But: a better benchmark ‒ e.g. benchmarking --import-blocks of a chain dump and monitoring via perf-stat(1) ‒ would require a lot more work. This is because substrate handles multiple wasmi ModuleInstances and this branch uses shared memory with the so, which basically means only one "instance" can be run at a time.

Either way, I ran the measurements on an otherwise idling system and I think the results are sufficient to give a rough idea how much headroom there is for the different methods.

wasm2c_wo_bc is the shared library compiled with explicit bounds checking disabled, and it can be considered as more realistic, since a real wasm VM implementation most likely will have signal based bounds checking mechanism which is zero cost for the non-exceptional case.

# Core_execute_block
  rust_native:        457 µs (100.0%)
 wasm2c_wo_bc:       1958 µs (23.3%)
       wasm2c:       1894 µs (24.1%)
        wasmi:       7168 µs (6.4%)

# Core_version
  rust_native:          5 µs (100.0%)
 wasm2c_wo_bc:         10 µs (52.0%)
       wasm2c:         10 µs (49.9%)
        wasmi:        155 µs (3.4%)

# Core_authorities
  rust_native:         14 µs (100.0%)
 wasm2c_wo_bc:         32 µs (43.4%)
       wasm2c:         32 µs (42.7%)
        wasmi:        210 µs (6.6%)

# Core_initialise_block
  rust_native:        161 µs (100.0%)
 wasm2c_wo_bc:       1190 µs (13.5%)
       wasm2c:       1092 µs (14.7%)
        wasmi:       4896 µs (3.3%)

# BlockBuilder_apply_extrinsic
  rust_native:         44 µs (100.0%)
 wasm2c_wo_bc:        314 µs (14.0%)
       wasm2c:        310 µs (14.2%)
        wasmi:       1979 µs (2.2%)

# BlockBuilder_finalise_block
  rust_native:        467 µs (100.0%)
 wasm2c_wo_bc:        832 µs (56.2%)
       wasm2c:        801 µs (58.4%)
        wasmi:       1012 µs (46.2%)

# BlockBuilder_inherent_extrinsics
  rust_native:         18 µs (100.0%)
 wasm2c_wo_bc:         93 µs (19.6%)
       wasm2c:         88 µs (20.9%)
        wasmi:        615 µs (3.0%)

# AuraApi_slot_duration
  rust_native:         24 µs (100.0%)
 wasm2c_wo_bc:         29 µs (82.8%)
       wasm2c:         31 µs (77.4%)
        wasmi:         70 µs (34.3%)

# GrandpaApi_grandpa_authorities
  rust_native:         20 µs (100.0%)
 wasm2c_wo_bc:         34 µs (58.8%)
       wasm2c:         33 µs (60.6%)
        wasmi:        118 µs (16.9%)

# GrandpaApi_grandpa_pending_change
  rust_native:          2 µs (100.0%)
 wasm2c_wo_bc:          8 µs (21.7%)
       wasm2c:          9 µs (20.6%)
        wasmi:         73 µs (2.4%)

@cmichi cmichi force-pushed the cmichi-profile-native-compiled-wasm-2nd-approach branch from 37e7c84 to 497080b Compare February 21, 2019 14:08
@pepyakin pepyakin added the A3-in_progress Pull request is in progress. No review needed at this stage. label Feb 21, 2019
@pepyakin pepyakin added A0-please_review Pull request needs code review. A1-onice and removed A3-in_progress Pull request is in progress. No review needed at this stage. A4-gotissues labels Feb 21, 2019
@cmichi cmichi force-pushed the cmichi-profile-native-compiled-wasm-2nd-approach branch from 4bad507 to f735373 Compare February 21, 2019 14:26
@cmichi cmichi force-pushed the cmichi-profile-native-compiled-wasm-2nd-approach branch from 5a03290 to ead0efa Compare February 21, 2019 16:47
@eira-fransham
Copy link

eira-fransham commented Feb 21, 2019

This is really fantastic - it's a great ideal case to aim towards with our Wasm if nothing else. One way to mitigate the cost of functions that will be called many hundreds of thousands of times like the runtime functions is to patch the calls directly instead of using the relocation table, but I don't know if that's necessary since for the runtime we expect users wanting the most speed to use native code directly anyway. A couple questions:

  • What's going on with AuraApi_slot_duration?
  • I don't think I quite follow the testing methodology - you're trying to have it exit immediately by processing no transactions so you only test the calling cost? That doesn't take into account inefficiencies like stack overflow checking or even just poor compilation of the Wasm itself even in the small amount of code that gets executed.
  • As @pepyakin pointed out: why did you not do --release?

@gavofyork gavofyork removed A0-please_review Pull request needs code review. A4-gotissues labels Feb 21, 2019
@cmichi
Copy link
Contributor Author

cmichi commented Feb 21, 2019

The initial version of the numbers in the first post was gathered without --release, I have updated them with everything re-executed in release mode.

@Vurich Thanks!

What's going on with AuraApi_slot_duration?

No idea, for AuraApi_slot_duration and GrandpaApi_grandpa_pending_change wasmi strangely performs better than the wasm2c version. I'll need to look at the implementation to see if something comes up.

I don't think I quite follow the testing methodology - you're trying to have it exit immediately by processing no transactions so you only test the calling cost? That doesn't take into account inefficiencies like stack overflow checking or even just poor compilation of the Wasm itself even in the small amount of code that gets executed.

Yeah, I totally agree, processing transactions would of course provide better insights. This limitation was not done intentionally though, it's just the lowest hanging fruit to test. Getting it to work for a more complex chain would require a more extensive implementation.

Regarding "so you only test the calling cost?": the implementation behind these methods is actually executed (e.g. for Core_execute_block).

@rphmeier
Copy link
Contributor

What's going on with AuraApi_slot_duration

It's only called at genesis, FWIW

Run `cargo run --release -- purge-chain --dev && cargo run --release -- --dev`.

Besides this nothing works (no tests, no import, etc.).
`make default` has been executed in this commit.

wasm2c is available at https://github.com/WebAssembly/wabt/tree/master/wasm2c.
Run it this way:

	git checkout cmichi-time-native-calls
	cargo run --release -- purge-chain --dev
	cargo run --release -- --dev > /tmp/rust_native 2>&1

	git checkout cmichi-time-wasm-calls
	cargo run --release -- purge-chain --dev
	cargo run --release -- --dev > /tmp/wasmi 2>&1

	git checkout cmichi-profile-native-compiled-wasm-2nd-approach
	cargo run --release -- purge-chain --dev
	cargo run --release -- --dev > /tmp/wasm2 2>&1

	git checkout cmichi-profile-native-compiled-wasm-2nd-approach
	sed -i '/MEMCHECK(mem, addr, t1);/d' node_runtime.c
	gcc -fPIC -rdynamic -shared -o libnode_runtime.so node_runtime.c wasm-rt-impl.c
	cp -f libnode_runtime.so target/release/deps/
	cargo run --release -- purge-chain --dev
	cargo run --release -- --dev > /tmp/wasm2c_wo_bc 2>&1

	./compile-stats.sh
@cmichi cmichi force-pushed the cmichi-profile-native-compiled-wasm-2nd-approach branch from 6d04269 to 726d95a Compare February 22, 2019 07:59
@cmichi
Copy link
Contributor Author

cmichi commented Feb 22, 2019

Found the issue with AuraApi_slot_duration and GrandpaApi_grandpa_pending_change ‒ I moved the timer in front of the immediate invocation of the method in wasmi/wasm2c. In the prior version I had timed the call_in_wasm_module method, which did a bit more then just invoke the method externally.

I updated the first post with the re-run metrics.

@pepyakin
Copy link
Contributor

pepyakin commented Feb 27, 2019

I'm going to close this to clear up the PR list.

@pepyakin pepyakin closed this Feb 27, 2019
@arkpar arkpar deleted the cmichi-profile-native-compiled-wasm-2nd-approach branch April 23, 2020 18:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants