
Conversation

Contributor

@bbockelm bbockelm commented Oct 6, 2016

This is based on @dpiparo 's work to parallelize GetEntry. The basic idea is, when we are flushing all active branches, we do each branch in parallel. We have to maintain mutual exclusion when interacting with the TTree or TFile, but we can parallelize the compression of the baskets (which is a significant amount of CPU time).

Note that the least satisfactory part of this work is having to use a mutex to access the byte-counters in TTree; this is because these fields are serialized and std::atomic<> is not serializable. Any hints as to how to get around this?

Setting MainEvent.cxx in the test sub-directory to use this (with LZMA as the compression algorithm), I get:

RealTime=76.340815 seconds, CpuTime=131.770000 seconds

@pcanal @Dr15Jones - this spun off from our discussion about CMSSW efficiency. It's really easy to parallelize FlushBaskets using a tbb::task_group that I later wait for. However, continuation-style programming is difficult here because FlushBaskets is called from deep callstacks. Further, there's a lot of state in the basket itself we'd need to unravel.

Looking at stack traces for the sample Event program, the next most advantageous place to parallelize compression is here:

#11 0x00007f00743e80fe in R__zipMultipleAlgorithm 
#12 0x00007f00729aec25 in TBasket::WriteBuffer 
#13 0x00007f00729b53f3 in TBranch::WriteBasket 
#14 0x00007f00729b5c95 in TBranch::Fill 
#15 0x00007f00729cb630 in TBranchElement::Fill
#16 0x00007f00729cb418 in TBranchElement::Fill 
#17 0x00007f00729cb418 in TBranchElement::Fill 
#18 0x00007f0072a063f3 in TTree::Fill

The idea would be to make WriteBuffer kick off a separate task, but block TBranch::Fill (and a handful of other functions, such as anything that can change the branch's TFile) from being called until the WriteBuffer task was completed. Harder than this approach, but not impossible.
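For illustration, the flush-in-parallel idea can be sketched outside ROOT. In this sketch `std::async` stands in for the `tbb::task_group`, and `CompressBuffer` / `FlushAllBranches` are hypothetical stand-ins for `TBasket::WriteBuffer` and the flush loop; it is a sketch of the idea, not ROOT's actual code:

```cpp
#include <future>
#include <mutex>
#include <numeric>
#include <vector>

// The CPU-heavy compression step touches no shared state, so it can run
// concurrently; only the "write to file" step needs mutual exclusion.
static std::mutex gFileMutex;        // models the TTree/TFile lock
static std::vector<int> gWritten;    // models baskets written to the file

int CompressBuffer(std::vector<int> buf) {
    return std::accumulate(buf.begin(), buf.end(), 0); // placeholder for the zip step
}

int FlushAllBranches(const std::vector<std::vector<int>>& branches) {
    std::vector<std::future<int>> tasks;
    for (const auto& b : branches)               // compress every branch in parallel
        tasks.push_back(std::async(std::launch::async, CompressBuffer, b));
    int total = 0;
    for (auto& t : tasks) {                      // serialize the file interaction
        const int zipped = t.get();
        std::lock_guard<std::mutex> sentry(gFileMutex);
        gWritten.push_back(zipped);
        total += zipped;
    }
    return total;
}
```

The shape matches the stack trace above: the expensive leaf (compression) fans out, while everything that touches the TTree/TFile stays under one lock.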

Member

dpiparo commented Oct 6, 2016

Hi @bbockelm ,

Nice job!
Serialising std::atomic might not be hard to achieve, given its layout. From the second part of your message I cannot tell whether the work is finished or not: can you elaborate?
Do you have runtime-reduction figures for CMS Reco, stemming from an actual Fall16 reco job or from back-of-the-envelope calculations?
A clarification: the main developer behind the IMT capabilities is @etejedor .

Cheers,
D

Collaborator

etejedor commented Oct 6, 2016

Hi @bbockelm ,

I see that you are using ROOT::Internal::TParBranchProcessingRAII here:
https://github.com/root-mirror/root/pull/277/files#diff-de4250b19a609451f620d99107b0d4a4R4813

This RAII activates the locks needed for reading branches in parallel. Are these locks exactly the same in the case of writing? If not, you should use a different RAII and put in place only those locks that you actually need.

Cheers,
Enric

TTree& operator=(const TTree& tt); // not implemented

#ifdef R__USE_IMT
mutable std::mutex fCounterMutex; ///<!Lock to protect counters
Member

You should probably use the TSpinMutex.

Contributor Author

:) Yes - but the intent is to throw this away and avoid all locking whatsoever.

Contributor Author

Done in latest version.

virtual Int_t GetTimerInterval() const { return fTimerInterval; }
TBuffer* GetTransientBuffer(Int_t size);
virtual Long64_t GetTotBytes() const { return fTotBytes; }
virtual Long64_t GetTotBytes() const { std::lock_guard<std::mutex> sentry(fCounterMutex); return fTotBytes; }
Member

For correct operation, shouldn't you also force the assignment to a local variable:

std::lock_guard<std::mutex> sentry(fCounterMutex);
auto retValue = fTotBytes;
return retValue;

Contributor Author

I believe this is correct (@Dr15Jones, can you confirm?); however, I would prefer to remove this lock entirely if possible.

What's the best advice for using std::atomic? Maybe use a RAII type like with GetEntry? Would this work:

  • When the parallel flush basket starts, set a (transient) flag with a scoped RAII-style object.
  • If the transient flag is set, use transient std::atomic-based counters. (This should have no overhead when IMT is disabled).
  • When parallel flush ends (no parallel work is going on), add the std::atomic-based counters to the existing, non-atomic fTotBytes and fZipBytes and zero them out.
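A minimal sketch of that scheme, with `MockTree` and `FlushGuard` as illustrative names (not ROOT's actual classes):

```cpp
#include <atomic>

// While a parallel flush is active, workers bump a transient atomic
// side-counter; afterwards the total is folded back into the plain,
// serializable counter. The serial path stays lock- and atomic-free.
struct MockTree {
    long long fTotBytes = 0;                 // persisted, non-atomic counter
    bool fIMTFlush = false;                  // transient flag, toggled outside the parallel region
    std::atomic<long long> fIMTTotBytes{0};  // transient atomic side-counter

    void AddTotBytes(long long tot) {
        if (fIMTFlush) fIMTTotBytes += tot;  // thread-safe path during the flush
        else           fTotBytes    += tot;  // zero-overhead serial path
    }
};

// RAII toggle: set the flag on entry, merge and clear on exit
// (by which point all flush workers have been joined).
struct FlushGuard {
    MockTree& fTree;
    explicit FlushGuard(MockTree& t) : fTree(t) { fTree.fIMTFlush = true; }
    ~FlushGuard() {
        fTree.fIMTFlush = false;
        fTree.fTotBytes += fTree.fIMTTotBytes.exchange(0);
    }
};
```

Since the flag and the merge both happen outside the parallel region, neither needs to be atomic itself.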

Collaborator

The use of a local variable depends on when compilers run destructors, before or after copying the value out to memory/register for returning from the function.

Contributor Author

Ah, ok: I thought C++ provided a guarantee.

(Good to know, but I think I can just remove the lock entirely)

Contributor Author

Done in latest version - lock removed.

if (!fDirectory || fDirectory == gROOT || !fDirectory->IsWritable()) return 0;
if (gDebug > 0) {
printf("AutoSave Tree:%s after %lld bytes written\n",GetName(),fTotBytes);
printf("AutoSave Tree:%s after %lld bytes written\n",GetName(),GetTotBytes());
Member

Please use the ROOT printing routine (Info is likely to be the one you are looking for).

Contributor Author

I believe there are a few other printf statements - should I also try to grab those?

Member

Possibly, but then in a separate commit. Thanks.

Contributor Author

Done in latest version - all debugging is now a call to Info.

virtual TFriendElement *AddFriend(TTree* tree, const char* alias = "", Bool_t warn = kFALSE);
virtual void AddTotBytes(Int_t tot) { fTotBytes += tot; }
virtual void AddZipBytes(Int_t zip) { fZipBytes += zip; }
virtual void AddTotBytes(Int_t tot) { std::lock_guard<std::mutex> sentry(fCounterMutex); fTotBytes += tot; }
Member

For production code, we would need to make those optional as we need to not penalize the serial code or even multi-thread code not using the IMT.

Contributor Author

Done - I believe I have gotten everything down to a single boolean flag.

Contributor Author

bbockelm commented Oct 6, 2016

@etejedor - Apologies: I had only looked at the latest commit and saw Danilo's name and overlooked yours. Nice work! Indeed, TParBranchProcessingRAII does not need to be present there. I will remove it.

@dpiparo - From the other ticket: when we switch from the existing compression approach (HC-LZMA) to zlib level 1, the CPU efficiency goes up from 83% to 95%. Basically, the majority of long thread stalls comes from waiting on the flush to finish. Here's a nice graph Chris put together: https://dl.dropboxusercontent.com/u/11356841/rereco_1000_stall.pdf

@Dr15Jones
Collaborator

Since TTree uses manual Streamer code, why not just make the data members protected by the mutex atomic?

Member

pcanal commented Oct 6, 2016

@Dr15Jones Currently StreamerInfo does not yet support std::atomic (we could add it straightforwardly if we had a guarantee that std::atomic<T> and T have the same memory layout).
Even though there is a custom Streamer for TTree, it still uses the StreamerInfo for most of its I/O.

Contributor Author

bbockelm commented Oct 6, 2016

I believe there is explicitly no guarantee that std::atomic<T> and T have the same memory layout - in fact, it's legal for the compiler to implement this with a mutex. In all likelihood, it is the same layout on X86 but maybe not on obscure-platform-XYZ.

However, since we only need this to be thread-safe for the scope of the FlushBaskets call, we can do the thread safety work in transient members and add the intermediate results back to the non-thread-safe persisted members.

I think.
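Indeed, the standard gives no layout guarantee; since C++11 one can at least query at runtime whether a given std::atomic<T> is lock-free on the current platform. A small illustration (helper names here are made up for the example):

```cpp
#include <atomic>
#include <cstddef>

// True if std::atomic<long long> is implemented without a hidden lock on
// this platform; the standard permits either answer.
bool CounterIsLockFree() {
    std::atomic<long long> counter{0};
    return counter.is_lock_free();
}

// The atomic may be larger than the plain type (e.g. to hold a lock or
// satisfy alignment requirements), which is exactly why byte-for-byte
// serialization of the wrapped value cannot be assumed portable.
std::size_t AtomicCounterSize() { return sizeof(std::atomic<long long>); }
std::size_t PlainCounterSize()  { return sizeof(long long); }
```

On mainstream x86-64 compilers the sizes happen to match and the atomic is lock-free, but nothing in the standard requires either.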

Contributor Author

bbockelm commented Oct 7, 2016

Latest commit switches to transient atomics while the parallel flush is ongoing; after completion, the values in the atomics are added back into the main non-thread-safe counter.

Member

@Axel-Naumann Axel-Naumann left a comment


Tiny #include inconsistencies...

#include "TVirtualTreePlayer.h"
#endif

#include <mutex>
Member

I'd expect <atomic> instead?

Contributor Author

Good catch. That was overlooked in changing the mutex to atomics. Fixed.


#ifdef R__USE_IMT
static ROOT::TRWSpinLock fgRwLock; ///<!Read-write lock to protect global PID list
std::mutex fWriteMutex; ///<!Lock for writing baskets / keys into the file.
Member

I don't see #include <mutex>?

Contributor Author

Fixed.

Member

dpiparo commented Oct 7, 2016

Hi @bbockelm,
thanks for the plot. My question was slightly different: did you try a rereco with a CMSSW version built on top of a ROOT build containing this patch? If yes, what was the improvement?

std::atomic<Int_t> nerrpar(0);
std::atomic<Int_t> nbpar(0);
std::atomic<Int_t> pos(0);
tbb::task_group g;
Collaborator

We must be extremely careful when embedding calls to TBB into legacy code. Holding any shared resource (and a C++ object may itself be viewed as such a resource) can lead to deadlocks. See

https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/401006

for details. This is one reason why the CMS threaded-framework is built using explicit tasks rather than implicit ones. For the case of writing, I can't think of anyone being dependent on the TFile, although we must be certain that no ROOT locks are being held by the thread calling g.wait().
The use of a tbb::task_arena would avoid the deadlock but does not appear to give us a way to limit the number of available threads.

Contributor Author

Hi Chris,

Indeed.

My logic here is:

  • This function is high up enough in the call stack to verify it is not called with any global ROOT mutexes held.
  • The locking within the function was made per-TFile.
  • TTree and TFile are both thread-unsafe objects (cannot have two threads doing const or non-const calls into the same object), meaning there should not be multiple callers to this function regardless.

So, the remaining danger is if the caller (CMSSW) is holding a mutex of its own that other CMSSW tasks might need.

I will write these concerns up into a comment, provide an example of dangerous behavior, and add them to the function's documentation.

Brian

Contributor Author

Here's an example of a deadlock that could have happened in older versions of CMSSW:

  • CMSSW takes a mutex on the output module.
  • CMSSW performs a ROOT call which causes a FlushBaskets operation.
  • ROOT creates many basket-flushing tasks and then waits on the task group.
  • TBB's runtime notices that the thread is blocked in wait and schedules another CMSSW task on the same thread.
  • The other CMSSW task also needs to access the output module, hence tries to take the same mutex.
  • Since the thread already holds the output module's mutex, trying to acquire it again causes a deadlock.

Now, Chris tells me that CMSSW has switched away from the mutex for output module access - but this might be a good example.

I pushed a long comment for FlushBaskets outlining the above.

Member

The other CMSSW also needs to access the output module, hence takes a mutex.
Since the thread already holds the output module's mutex, trying to acquire it again causes a deadlock.

This of course depends on whether the mutex used is recursive or not (TMutex is, std::mutex is not).
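The difference is easy to demonstrate: a recursive mutex tolerates re-entry from the owning thread, while a plain std::mutex stays held against everyone else. Since re-locking a plain std::mutex from the owning thread is undefined behaviour, this sketch probes it from a second thread instead:

```cpp
#include <mutex>
#include <thread>

// The same thread can re-acquire a recursive mutex without deadlocking.
bool RecursiveReentryWorks() {
    std::recursive_mutex m;
    std::lock_guard<std::recursive_mutex> outer(m);
    std::lock_guard<std::recursive_mutex> inner(m); // fine: recursive re-entry
    return true;
}

// A plain std::mutex held by one thread cannot be acquired by another.
bool PlainMutexStaysHeld() {
    std::mutex m;
    std::lock_guard<std::mutex> held(m);
    bool acquired = false;
    std::thread([&m, &acquired] { acquired = m.try_lock(); }).join();
    return !acquired; // the second thread's try_lock failed while we held the lock
}
```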

Contributor Author

Certainly!

The point is that code which works with IMT disabled may stop working when IMT is enabled. One would hope that such issues are fixable once the user understands the underlying problem; hence the extensive comment in the code.

Contributor Author

bbockelm commented Oct 7, 2016

@dpiparo - doing a build with CMSSW will be extremely hard until CMSSW has at least a test-release based on ROOT 6.08 (and, even then, no guarantees the reco application would work).

The best I can do currently is to note that (a) this improves straightforward examples, (b) CMSSW blocking on FlushBaskets remains the top contributor to thread stalls, and (c) the stalls go away if we switch the tree to a faster compression algorithm (such as zlib).

Member

pcanal commented Oct 7, 2016

@bbockelm Didn't Dan show that LZMA with the smaller window was on par, performance/stall-wise, with zlib? [Nonetheless, this patch should still help even further :) ]

Contributor Author

bbockelm commented Oct 7, 2016

@pcanal - The LZMA tweaks caused stalls to decrease by an order of magnitude; unfortunately, they started out two orders of magnitude larger than the next source of stalling.

Contributor Author

bbockelm commented Nov 4, 2016

@dpiparo - PR9 in roottest has some minimal coverage of this.

Contributor Author

bbockelm commented Nov 4, 2016

@karies - if I'm reading the ticket history correctly, you requested that I fix up the #includes; should be done. Can you +1 the PR - or do you see something else?

TTree& operator=(const TTree& tt); // not implemented

#ifdef R__USE_IMT
mutable Bool_t fIMTFlush{false}; ///<! True if we are doing a multithreaded flush.
Member

Odd to see a mutable non-atomic. Is that really the intention? (i.e. is it really never possible to modify/read it in parallel?)

Contributor Author

Yes, that's the intention. The real problem here is that FlushBaskets is declared const: the mutable is working around that.

fIMTFlush is only touched from within FlushBaskets, which is not allowed to be called from multiple threads. It is mutated outside the parallel portion, so there is a synchronization point after each mutation.

I wanted to avoid making it atomic to fulfill the request that there be no performance penalty when ROOT is used in non-IMT mode.

Member

Fair enough; a comment here, or at the point of use, along the lines of this answer ought to be added.

virtual void AddZipBytes(Int_t zip) { fZipBytes += zip; }
// As the TBasket invokes Add{Tot,Zip}Bytes on its parent tree, we must do these updates in a thread-safe
// manner only when we are flushing multiple baskets in parallel.
virtual void AddTotBytes(Int_t tot) { if (fIMTFlush) { fIMTTotBytes += tot; } else { fTotBytes += tot; } }
Member

This will fail to compile if R__USE_IMT is off.

return -1;
}
fMotherDir = file; // fBranch->GetDirectory();
#ifdef R__USE_IMT
Member

Given the slightly unusual use pattern (take the lock, then explicitly unlock, then lock again), can you add a comment on what this is protecting (and why it will not provoke a deadlock)?

return -1;
}

struct BoolRAIIToggle {
Member

Consider putting this class in the anonymous namespace.

Int_t branchStyle = 1; //new style by default
if (split < 0) {branchStyle = 0; split = -1-split;}

ROOT::EnableImplicitMT(4);
Member

This should be optional (so we need to add a command-line argument) and default to disabled (as is, the test now complains if ROOT was not built with IMT enabled).

Contributor Author

Ugh. Those were test modifications to MainEvent.cxx that I did not intend to commit. Will remove them all.

Member

I disagree ;). I think it is a useful new testing switch... it just needs to be made conditional. Thanks.

Contributor Author

Would you prefer that it be automatically enabled, or should I add it as a command-line flag?

Member

As a command-line flag, so that the default stays as is and one can test with and without it using the same build.

Member

pcanal commented Jan 10, 2017

Merged.

Thanks,
Philippe.

@pcanal pcanal closed this Jan 10, 2017
linev added a commit to linev/root that referenced this pull request Feb 23, 2024
1. Fix - abort tree draw operation faster

1. Fix - catch exception when parsing TF1 formula
2. Fix - properly check THStack histograms axes when doing sum
3. Fix - correctly handle negative offset on time axis
4. Fix - do not use `inset` because of old Chrome browsers
5. Fix - properly provide object hints

1. Fix - draw histograms with negative bins root-project#276
2. Fix - correctly read TLeaf with fixed-size array
3. Fix - bug in options handling in startGUI
4. Fix - greyscale support in TLegend drawing
5. Fix - correctly use text font for TGaxis title
6. Fix - preserve auto colors in THStack root-project#277
7. Fix - correctly set pave name root-project#278
linev added a commit that referenced this pull request Feb 26, 2024