
Conversation

@cmichi
Contributor

@cmichi cmichi commented Jun 25, 2019

Addresses #2051.

The runtime cache which Substrate currently uses reuses one runtime instance for every call. To enable this, the instance is cleaned up for every call, which leads to the problems detailed in #2051.

In the process of creating #2931 to address the issue, it turned out that cloning a wasmi::ModuleInstance synchronously is not that expensive. This PR implements a basic proof of concept (no tests, still a global variable, no delayed cache eviction) just to see how it performs.

This PR creates a template instance on the first fetch from the cache and clones it synchronously for succeeding fetches. Since prepare_module is already called before every call into wasm, this works.
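Roughly, the proof of concept amounts to something like the following sketch (the names, e.g. TEMPLATE and fetch_instance, are illustrative and not the actual PR code; the wasmi calls are how I understand its API):

// Hypothetical sketch of the proof of concept described above.
use std::cell::RefCell;
use wasmi::{ImportsBuilder, Module, ModuleInstance, ModuleRef};

thread_local! {
    // Template instance, created lazily on the first fetch (still a "global",
    // as noted above; thread-local here because wasmi's types are not Send).
    static TEMPLATE: RefCell<Option<ModuleRef>> = RefCell::new(None);
}

fn fetch_instance(module: &Module) -> Result<ModuleRef, wasmi::Error> {
    TEMPLATE.with(|cell| {
        let mut template = cell.borrow_mut();
        if template.is_none() {
            let instance = ModuleInstance::new(module, &ImportsBuilder::default())?
                .assert_no_start();
            *template = Some(instance);
        }
        // Hand out a clone of the template for this call. Caveat (see the
        // discussion below): ModuleRef is reference counted internally, so
        // this clone shares linear memory with the template instead of
        // deep-copying it.
        Ok(template.as_ref().expect("filled above; qed").clone())
    })
}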

I benchmarked it against master using perf stat and our transaction factory (lmk if you have a better idea for creating a benchmark):

cargo run --release -- purge-chain -y --dev &&
perf stat --repeat=5 -o /tmp/perf-clone cargo run --release -- factory --dev --mode MasterToNToM --num 1000 --rounds 100 1>/dev/null 2>/dev/null

master:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 10000 --rounds 100' (5 runs):

         21,584.53 msec task-clock:u              #    0.998 CPUs utilized            ( +- 93.64% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            41,107      page-faults:u             #    0.002 M/sec                    ( +-  7.56% )
    64,584,552,076      cycles:u                  #    2.992 GHz                      ( +- 94.74% )
   142,408,240,803      instructions:u            #    2.20  insn per cycle           ( +- 95.74% )
     7,747,580,746      branches:u                #  358.941 M/sec                    ( +- 88.41% )
       117,765,225      branch-misses:u           #    1.52% of all branches          ( +- 85.88% )

             21.62 +- 20.18 seconds time elapsed  ( +- 93.34% )

This PR:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 10000 --rounds 100' (5 runs):

          8,889.98 msec task-clock:u              #    0.993 CPUs utilized            ( +- 87.14% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            37,445      page-faults:u             #    0.004 M/sec                    ( +- 13.80% )
    25,898,434,573      cycles:u                  #    2.913 GHz                      ( +- 89.00% )
    50,380,155,387      instructions:u            #    1.95  insn per cycle           ( +- 89.64% )
     4,465,108,088      branches:u                #  502.263 M/sec                    ( +- 81.44% )
        69,295,237      branch-misses:u           #    1.55% of all branches          ( +- 78.28% )

              8.95 +- 7.74 seconds time elapsed  ( +- 86.48% )

The speedup might be because the implementation is currently very simple, whereas the runtime cache on master does a bit more (operations on a HashMap, …). So there will probably be some slow-down once we implement e.g. delayed cache eviction. Still, I think it looks very promising, and we wouldn't have to use thread-safe types for wasmi.

If there are no complaints, I suggest implementing this approach. @pepyakin @bkchr @arkpar, wdyt?

UPD(pepyakin):

Closes #2967

@cmichi cmichi added the A3-in_progress Pull request is in progress. No review needed at this stage. label Jun 25, 2019
@cmichi cmichi requested a review from pepyakin June 25, 2019 09:07
@arkpar
Member

arkpar commented Jun 26, 2019

IIRC, deep cloning a module instance is not that simple. Internally it uses reference-counted objects, such as memory, so a simple clone creates another instance with shared memory.
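A toy illustration in plain Rust (not wasmi's actual types): cloning a struct that keeps its state behind an Rc only copies the pointer, so both "instances" observe the same mutations.

use std::cell::RefCell;
use std::rc::Rc;

// Stand-in for a module instance whose linear memory is reference counted,
// similar in spirit to how wasmi wires up its internals.
#[derive(Clone)]
struct Instance {
    memory: Rc<RefCell<Vec<u8>>>,
}

fn main() {
    let template = Instance { memory: Rc::new(RefCell::new(vec![0u8; 4])) };
    let clone = template.clone();

    // Mutating through the clone is visible through the template: the clone
    // only copied the Rc pointer, not the underlying memory.
    clone.memory.borrow_mut()[0] = 42;
    assert_eq!(template.memory.borrow()[0], 42);
}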

@bkchr
Member

bkchr commented Jun 26, 2019

I think you are right, @arkpar: @pepyakin highlighted this test that still returns the modified global variable, so the memory is not reset correctly, or at least it seems not to be.

@cmichi
Contributor Author

cmichi commented Jun 26, 2019

@arkpar @bkchr Yes, you're right. I tried to implement deep clone functions for ModuleInstance and its inner types on a wasmi branch, but it gets quite hairy at some point because of Weak<ModuleInstance>.

So the approach I just pushed emerged after talking to @pepyakin: simply preserve the initial memory by reading it from the exports and restore it for each runtime call (all synchronously).
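Roughly, the snapshot/restore step amounts to the following sketch (the helper names and the "memory" export name are assumptions, not the PR code; it leans on wasmi's export_by_name and the MemoryRef get/set API):

use wasmi::{ExternVal, MemoryRef, ModuleRef};

// Look up the linear memory exported by the instance. The "memory" export
// name is an assumption here.
fn exported_memory(instance: &ModuleRef) -> Option<MemoryRef> {
    match instance.export_by_name("memory") {
        Some(ExternVal::Memory(memory)) => Some(memory),
        _ => None,
    }
}

// Snapshot the first `len` bytes of linear memory right after instantiation,
// before any call has run.
fn snapshot(memory: &MemoryRef, len: usize) -> Result<Vec<u8>, wasmi::Error> {
    memory.get(0, len)
}

// Restore the snapshot before the next call, so the instance starts from a
// clean state again.
fn restore(memory: &MemoryRef, initial: &[u8]) -> Result<(), wasmi::Error> {
    memory.set(0, initial)
}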

Initial benchmarks look good (lmk if you want other benchmarks):

master:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 1000 --rounds 100' (5 runs):

          6,019.73 msec task-clock:u              #    0.989 CPUs utilized            ( +- 81.48% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            33,222      page-faults:u             #    0.006 M/sec                    ( +-  2.61% )
    16,804,226,940      cycles:u                  #    2.792 GHz                      ( +- 85.18% )
    36,395,262,835      instructions:u            #    2.17  insn per cycle           ( +- 86.91% )
     2,357,296,545      branches:u                #  391.595 M/sec                    ( +- 66.29% )
        38,134,897      branch-misses:u           #    1.62% of all branches          ( +- 61.69% )

              6.09 +- 4.91 seconds time elapsed  ( +- 80.70% )

This PR:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 1000 --rounds 100' (5 runs):

          4,511.51 msec task-clock:u              #    0.965 CPUs utilized            ( +- 79.33% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            34,572      page-faults:u             #    0.008 M/sec                    ( +-  7.06% )
    12,663,707,807      cycles:u                  #    2.807 GHz                      ( +- 81.83% )
    23,694,510,249      instructions:u            #    1.87  insn per cycle           ( +- 80.48% )
     2,320,776,356      branches:u                #  514.412 M/sec                    ( +- 65.84% )
        36,606,709      branch-misses:u           #    1.58% of all branches          ( +- 62.18% )

              4.68 +- 3.67 seconds time elapsed  ( +- 78.59% )

As I mentioned before, the benchmark might get worse if we decide on more complicated logic (like delayed cache eviction).

@pepyakin pepyakin self-requested a review June 27, 2019 11:36
Contributor

@pepyakin pepyakin left a comment


Gave a brief review

match runtime_preproc {
    RuntimePreproc::InvalidCode => {
        let code = ext.original_storage(well_known_keys::CODE).unwrap_or(vec![]);
        Err(Error::InvalidCode(code))
Contributor


Maybe a little bit off-topic, but I wonder: what is the point of returning the code here?

Contributor Author

@cmichi cmichi Jun 28, 2019


This error enum is defined here and requires the code; I just extracted this part from native_executor. I'm not sure about the reasoning, but since the original code (and mine as well) later outputs the code in a trace!, this could have been the reason.

Contributor


I am not sure about the usability of this, and I am going to remove it.

@pepyakin pepyakin self-assigned this Jun 28, 2019
@pepyakin
Contributor

Looking at the impl, it doesn't account for possible updates coming from :code.
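One hypothetical way to account for that (an illustrative sketch, not what was eventually pushed): remember which :code blob the cached template was instantiated from and invalidate the cache entry whenever that value changes.

// Hypothetical sketch; the names are illustrative.
struct CachedRuntime {
    // The exact :code blob the template below was instantiated from.
    code: Vec<u8>,
    template: wasmi::ModuleRef,
}

impl CachedRuntime {
    // The cached template may only be reused while :code is unchanged;
    // otherwise the caller has to rebuild the entry from the new code.
    fn valid_for(&self, current_code: &[u8]) -> bool {
        self.code.as_slice() == current_code
    }
}

In practice one would probably compare a hash of the code rather than the full blob, but the idea is the same.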

@pepyakin
Contributor

pepyakin commented Jul 1, 2019

I just pushed the fixed version (although it is still not ready for review).

I benchmarked it like this:

cargo run --release -- purge-chain -y --dev && \
  time cargo run --release -- factory --dev --mode MasterTo1 --num 1500 1>/dev/null 2>/dev/null

(It would be cool to generate a chain of blocks, export them, and then import them again, but this is blocked by #2977.)

Note that I chose different params because apparently transaction-factory doesn't handle the case when the master account runs out of funds. I also used time instead of perf, simply because I don't have the latter. I did 3 runs and took their average.

As the baseline, I took the latest master commit e63598b.

master: 29.2467s
this PR: 30.85s

So there is a gap, but it is not terribly big: 1ms per block.

There are still areas for improvement. We could decrease the amount of work by not copying the entire linear memory space: instead, we could ask runtimes for a special global value which specifies where the heap starts, copy only everything before that point, and memset the rest (which should be much faster than copying). Such a global is already published by the rustc compiler.
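As a sketch, assuming a heap_base value read from such a global (the helper below is hypothetical; a real implementation would zero the upper region in place rather than writing a buffer of zeroes through set):

use wasmi::MemoryRef;

// Restore only the statically initialized region below `heap_base` from the
// snapshot and overwrite everything above it with zeroes, instead of copying
// the whole linear memory.
fn restore_below_heap(
    memory: &MemoryRef,
    initial_below_heap: &[u8],
    heap_base: u32,
    total_len: u32,
) -> Result<(), wasmi::Error> {
    memory.set(0, initial_below_heap)?;
    let zeroes = vec![0u8; (total_len - heap_base) as usize];
    memory.set(heap_base, &zeroes)
}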

FWIW, I also still have hopes that pooling of wasm instances can give a noticeable improvement in import times.

@pepyakin
Contributor

pepyakin commented Jul 3, 2019

This approach has some deficiencies which, if solved, cause massive slow-downs. That made me research another approach. I will create a PR shortly.

@pepyakin pepyakin closed this Jul 3, 2019
@bkchr bkchr deleted the cmichi-ensure-clean-wasm-instances-v2 branch July 3, 2019 17:20

Labels

A3-in_progress Pull request is in progress. No review needed at this stage.


Development

Successfully merging this pull request may close these issues.

Wasm stack pointer is not restored to its initial value
