
Conversation

@cmichi
Contributor

@cmichi cmichi commented Jun 25, 2019

Addresses #2051.

The runtime cache which Substrate currently uses reuses one runtime instance for every call. To enable this, the instance is cleaned up for every call, which leads to the problems detailed in #2051.

In the process of creating #2931 to address the issue, it turned out that cloning a wasmi::ModuleInstance synchronously is not that expensive. This PR implements a basic proof of concept (no tests, still a global variable, no delayed cache eviction) just to see how it performs.

This PR creates a template instance on the first fetch from the cache and clones it synchronously for succeeding fetches. Since prepare_module is already called before every call into wasm, this works.
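Roughly, the proof of concept amounts to something like the following sketch (the names, e.g. TEMPLATE and fetch_instance, are illustrative and not the actual PR code; the wasmi calls are how I understand its API):

// Hypothetical sketch of the proof of concept described above.
use std::cell::RefCell;
use wasmi::{ImportsBuilder, Module, ModuleInstance, ModuleRef};

thread_local! {
    // Template instance, created lazily on the first fetch (still a "global",
    // as noted above; thread-local here because wasmi's types are not Send).
    static TEMPLATE: RefCell<Option<ModuleRef>> = RefCell::new(None);
}

fn fetch_instance(module: &Module) -> Result<ModuleRef, wasmi::Error> {
    TEMPLATE.with(|cell| {
        let mut template = cell.borrow_mut();
        if template.is_none() {
            let instance = ModuleInstance::new(module, &ImportsBuilder::default())?
                .assert_no_start();
            *template = Some(instance);
        }
        // Hand out a clone of the template for this call. Caveat (see the
        // discussion below): ModuleRef is reference counted internally, so
        // this clone shares linear memory with the template instead of
        // deep-copying it.
        Ok(template.as_ref().expect("filled above; qed").clone())
    })
}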

I benchmarked it against master using perf stat and our transaction factory (lmk if you have a better idea for creating a benchmark):

cargo run --release -- purge-chain -y --dev &&
perf stat --repeat=5 -o /tmp/perf-clone cargo run --release -- factory --dev --mode MasterToNToM --num 1000 --rounds 100 1>/dev/null 2>/dev/null

master:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 10000 --rounds 100' (5 runs):

         21,584.53 msec task-clock:u              #    0.998 CPUs utilized            ( +- 93.64% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            41,107      page-faults:u             #    0.002 M/sec                    ( +-  7.56% )
    64,584,552,076      cycles:u                  #    2.992 GHz                      ( +- 94.74% )
   142,408,240,803      instructions:u            #    2.20  insn per cycle           ( +- 95.74% )
     7,747,580,746      branches:u                #  358.941 M/sec                    ( +- 88.41% )
       117,765,225      branch-misses:u           #    1.52% of all branches          ( +- 85.88% )

             21.62 +- 20.18 seconds time elapsed  ( +- 93.34% )

This PR:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 10000 --rounds 100' (5 runs):

          8,889.98 msec task-clock:u              #    0.993 CPUs utilized            ( +- 87.14% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            37,445      page-faults:u             #    0.004 M/sec                    ( +- 13.80% )
    25,898,434,573      cycles:u                  #    2.913 GHz                      ( +- 89.00% )
    50,380,155,387      instructions:u            #    1.95  insn per cycle           ( +- 89.64% )
     4,465,108,088      branches:u                #  502.263 M/sec                    ( +- 81.44% )
        69,295,237      branch-misses:u           #    1.55% of all branches          ( +- 78.28% )

              8.95 +- 7.74 seconds time elapsed  ( +- 86.48% )

The speedup might be because the implementation is currently very simple, whereas the runtime cache on master does a bit more (operations on a HashMap, …). So there will probably be some slow-down once we implement e.g. delayed cache eviction. Still, I think it looks very promising, and we wouldn't have to use thread-safe types for wasmi.

If there are no complaints, I suggest implementing this approach. @pepyakin @bkchr @arkpar, wdyt?

UPD(pepyakin):

Closes #2967

@cmichi cmichi added the A3-in_progress Pull request is in progress. No review needed at this stage. label Jun 25, 2019
@cmichi cmichi requested a review from pepyakin June 25, 2019 09:07
@arkpar
Member

arkpar commented Jun 26, 2019

IIRC, deep cloning a module instance is not that simple. Internally it uses reference-counted objects, such as memory, so a simple clone creates another instance with shared memory.
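A toy illustration in plain Rust (not wasmi's actual types): cloning a struct that keeps its state behind an Rc only copies the pointer, so both "instances" observe the same mutations.

use std::cell::RefCell;
use std::rc::Rc;

// Stand-in for a module instance whose linear memory is reference counted,
// similar in spirit to how wasmi wires up its internals.
#[derive(Clone)]
struct Instance {
    memory: Rc<RefCell<Vec<u8>>>,
}

fn main() {
    let template = Instance { memory: Rc::new(RefCell::new(vec![0u8; 4])) };
    let clone = template.clone();

    // Mutating through the clone is visible through the template: the clone
    // only copied the Rc pointer, not the underlying memory.
    clone.memory.borrow_mut()[0] = 42;
    assert_eq!(template.memory.borrow()[0], 42);
}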

@bkchr
Member

bkchr commented Jun 26, 2019

I think you are right, @arkpar: @pepyakin highlighted this test that still returns the modified global variable, so the memory is not reset correctly, or at least it seems not to be.

@cmichi
Contributor Author

cmichi commented Jun 26, 2019

@arkpar @bkchr Yes, you're right. I tried to implement deep clone functions for ModuleInstance and its inner types on a wasmi branch, but it gets quite hairy at some point because of Weak<ModuleInstance>.

So the approach I just pushed emerged after talking to @pepyakin: simply preserve the initial memory by reading it from the exports and restore it for each runtime call (all synchronously).
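Roughly, the snapshot/restore step amounts to the following sketch (the helper names and the "memory" export name are assumptions, not the PR code; it leans on wasmi's export_by_name and the MemoryRef get/set API):

use wasmi::{ExternVal, MemoryRef, ModuleRef};

// Look up the linear memory exported by the instance. The "memory" export
// name is an assumption here.
fn exported_memory(instance: &ModuleRef) -> Option<MemoryRef> {
    match instance.export_by_name("memory") {
        Some(ExternVal::Memory(memory)) => Some(memory),
        _ => None,
    }
}

// Snapshot the first `len` bytes of linear memory right after instantiation,
// before any call has run.
fn snapshot(memory: &MemoryRef, len: usize) -> Result<Vec<u8>, wasmi::Error> {
    memory.get(0, len)
}

// Restore the snapshot before the next call, so the instance starts from a
// clean state again.
fn restore(memory: &MemoryRef, initial: &[u8]) -> Result<(), wasmi::Error> {
    memory.set(0, initial)
}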

Initial benchmarks look good (lmk if you want other benchmarks):

master:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 1000 --rounds 100' (5 runs):

          6,019.73 msec task-clock:u              #    0.989 CPUs utilized            ( +- 81.48% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            33,222      page-faults:u             #    0.006 M/sec                    ( +-  2.61% )
    16,804,226,940      cycles:u                  #    2.792 GHz                      ( +- 85.18% )
    36,395,262,835      instructions:u            #    2.17  insn per cycle           ( +- 86.91% )
     2,357,296,545      branches:u                #  391.595 M/sec                    ( +- 66.29% )
        38,134,897      branch-misses:u           #    1.62% of all branches          ( +- 61.69% )

              6.09 +- 4.91 seconds time elapsed  ( +- 80.70% )

This PR:

 Performance counter stats for 'cargo run --release -- factory --dev --mode MasterToNToM --num 1000 --rounds 100' (5 runs):

          4,511.51 msec task-clock:u              #    0.965 CPUs utilized            ( +- 79.33% )
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
            34,572      page-faults:u             #    0.008 M/sec                    ( +-  7.06% )
    12,663,707,807      cycles:u                  #    2.807 GHz                      ( +- 81.83% )
    23,694,510,249      instructions:u            #    1.87  insn per cycle           ( +- 80.48% )
     2,320,776,356      branches:u                #  514.412 M/sec                    ( +- 65.84% )
        36,606,709      branch-misses:u           #    1.58% of all branches          ( +- 62.18% )

              4.68 +- 3.67 seconds time elapsed  ( +- 78.59% )

As I mentioned before, the benchmark might get worse if we decide on more complicated logic (like delayed cache eviction).

@pepyakin pepyakin self-requested a review June 27, 2019 11:36
Contributor

@pepyakin pepyakin left a comment


Gave a brief review

match runtime_preproc {
    RuntimePreproc::InvalidCode => {
        let code = ext.original_storage(well_known_keys::CODE).unwrap_or(vec![]);
        Err(Error::InvalidCode(code))
Contributor


Maybe a little bit off-topic, but I wonder: what is the point of returning the code here?

Contributor Author

@cmichi cmichi Jun 28, 2019


This error enum is defined here and requires the code; I just extracted this part from native_executor. I'm not sure about the reasoning, but since the original code (and mine as well) later outputs the code in a trace!, this could have been the reason.

Contributor


I am not sure about the usability of this, and I am going to remove it.

@pepyakin pepyakin self-assigned this Jun 28, 2019
@pepyakin
Contributor

Looking at the impl, it doesn't account for possible updates coming from :code.
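One hypothetical way to account for that (an illustrative sketch, not what was eventually pushed): remember which :code blob the cached template was instantiated from and invalidate the cache entry whenever that value changes.

// Hypothetical sketch; the names are illustrative.
struct CachedRuntime {
    // The exact :code blob the template below was instantiated from.
    code: Vec<u8>,
    template: wasmi::ModuleRef,
}

impl CachedRuntime {
    // The cached template may only be reused while :code is unchanged;
    // otherwise the caller has to rebuild the entry from the new code.
    fn valid_for(&self, current_code: &[u8]) -> bool {
        self.code.as_slice() == current_code
    }
}

In practice one would probably compare a hash of the code rather than the full blob, but the idea is the same.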

@pepyakin
Contributor

pepyakin commented Jul 1, 2019

I just pushed the fixed version (although it is still not ready for review).

I benchmarked it like this:

cargo run --release -- purge-chain -y --dev && \
  time cargo run --release -- factory --dev --mode MasterTo1 --num 1500 1>/dev/null 2>/dev/null

(It would be cool to generate a chain of blocks, export them, and then import them again, but this is blocked by #2977.)

Note that I chose different params because apparently transaction-factory doesn't handle the case when the master account runs out of funds. I also used time instead of perf, simply because I don't have the latter. I did 3 runs and took their average.

As the baseline, I took the latest master commit e63598b.

master: 29.2467s
this PR: 30.85s

So there is a gap, but it is not terribly big: 1ms per block.

There are still areas for improvement. We could decrease the amount of work by not copying the entire linear memory space: instead, we could ask runtimes for a special global value which specifies where the heap starts, copy only everything before that point, and memset the rest (which should be much faster than copying). Such a global is already published by the rustc compiler.
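As a sketch, assuming a heap_base value read from such a global (the helper below is hypothetical; a real implementation would zero the upper region in place rather than writing a buffer of zeroes through set):

use wasmi::MemoryRef;

// Restore only the statically initialized region below `heap_base` from the
// snapshot and overwrite everything above it with zeroes, instead of copying
// the whole linear memory.
fn restore_below_heap(
    memory: &MemoryRef,
    initial_below_heap: &[u8],
    heap_base: u32,
    total_len: u32,
) -> Result<(), wasmi::Error> {
    memory.set(0, initial_below_heap)?;
    let zeroes = vec![0u8; (total_len - heap_base) as usize];
    memory.set(heap_base, &zeroes)
}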

FWIW, I also still have hopes that pooling of wasm instances can give a noticeable improvement in import times.

@pepyakin
Contributor

pepyakin commented Jul 3, 2019

This approach has some deficiencies which, if solved, cause massive slow-downs. That made me research another approach. I will create a PR shortly.

@pepyakin pepyakin closed this Jul 3, 2019
@bkchr bkchr deleted the cmichi-ensure-clean-wasm-instances-v2 branch July 3, 2019 17:20

Labels

A3-in_progress Pull request is in progress. No review needed at this stage.


Development

Successfully merging this pull request may close these issues.

Wasm stack pointer is not restored to its initial value
