
Conversation

Contributor

@bbockelm bbockelm commented Oct 6, 2016

This is based on @dpiparo 's work to parallelize GetEntry. The basic idea is, when we are flushing all active branches, we do each branch in parallel. We have to maintain mutual exclusion when interacting with the TTree or TFile, but we can parallelize the compression of the baskets (which is a significant amount of CPU time).

Note that the least satisfactory part of this work is having to use a mutex to access the byte-counters in TTree; this is because these fields are serialized and std::atomic<> is not serializable. Any hints as to how to get around this?

Setting MainEvent.cxx in the test sub-directory to use this (with LZMA as the compression algorithm), I get:

RealTime=76.340815 seconds, CpuTime=131.770000 seconds

@pcanal @Dr15Jones - this spun off from our discussion about CMSSW efficiency. It's really easy to parallelize FlushBaskets using a tbb::task_group that I later wait for. However, continuation-style programming is difficult here because FlushBaskets is called from deep callstacks. Further, there's a lot of state in the basket itself we'd need to unravel.

Looking at stack traces for the sample Event program, the next most advantageous place to parallelize compression is here:

#11 0x00007f00743e80fe in R__zipMultipleAlgorithm 
#12 0x00007f00729aec25 in TBasket::WriteBuffer 
#13 0x00007f00729b53f3 in TBranch::WriteBasket 
#14 0x00007f00729b5c95 in TBranch::Fill 
#15 0x00007f00729cb630 in TBranchElement::Fill
#16 0x00007f00729cb418 in TBranchElement::Fill 
#17 0x00007f00729cb418 in TBranchElement::Fill 
#18 0x00007f0072a063f3 in TTree::Fill

The idea would be to make WriteBuffer kick off a separate task, but block TBranch::Fill (and a handful of other functions, such as anything that can change the branch's TFile) from being called until the WriteBuffer task was completed. Harder than this approach, but not impossible.
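For illustration, the flush-in-parallel idea can be sketched outside ROOT. In this sketch `std::async` stands in for the `tbb::task_group`, and `CompressBuffer` / `FlushAllBranches` are hypothetical stand-ins for `TBasket::WriteBuffer` and the flush loop; it is a sketch of the idea, not ROOT's actual code:

```cpp
#include <future>
#include <mutex>
#include <numeric>
#include <vector>

// The CPU-heavy compression step touches no shared state, so it can run
// concurrently; only the "write to file" step needs mutual exclusion.
static std::mutex gFileMutex;        // models the TTree/TFile lock
static std::vector<int> gWritten;    // models baskets written to the file

int CompressBuffer(std::vector<int> buf) {
    return std::accumulate(buf.begin(), buf.end(), 0); // placeholder for the zip step
}

int FlushAllBranches(const std::vector<std::vector<int>>& branches) {
    std::vector<std::future<int>> tasks;
    for (const auto& b : branches)               // compress every branch in parallel
        tasks.push_back(std::async(std::launch::async, CompressBuffer, b));
    int total = 0;
    for (auto& t : tasks) {                      // serialize the file interaction
        const int zipped = t.get();
        std::lock_guard<std::mutex> sentry(gFileMutex);
        gWritten.push_back(zipped);
        total += zipped;
    }
    return total;
}
```

The shape matches the stack trace above: the expensive leaf (compression) fans out, while everything that touches the TTree/TFile stays under one lock.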

Member

dpiparo commented Oct 6, 2016

Hi @bbockelm ,

Nice job!
Serialising std::atomic might not be hard to achieve, given its layout. From the second part of your message I cannot tell whether the work is finished or not: can you elaborate?
Do you have runtime-reduction figures for CMS Reco, stemming from an actual Fall16 reco job or from back-of-the-envelope calculations?
A clarification: the main developer behind the IMT capabilities is @etejedor .

Cheers,
D

Collaborator

etejedor commented Oct 6, 2016

Hi @bbockelm ,

I see that you are using ROOT::Internal::TParBranchProcessingRAII here:
https://github.com/root-mirror/root/pull/277/files#diff-de4250b19a609451f620d99107b0d4a4R4813

This RAII activates the locks needed for reading branches in parallel. Are these locks exactly the same in the case of writing? If not, you should use a different RAII and put in place only those locks that you actually need.

Cheers,
Enric

TTree& operator=(const TTree& tt); // not implemented

#ifdef R__USE_IMT
mutable std::mutex fCounterMutex; ///<!Lock to protect counters
Member

You should probably use the TSpinMutex.

Contributor Author

:) Yes - but the intent is to throw this away and avoid all locking whatsoever.

Contributor Author

Done in latest version.

virtual Int_t GetTimerInterval() const { return fTimerInterval; }
TBuffer* GetTransientBuffer(Int_t size);
virtual Long64_t GetTotBytes() const { return fTotBytes; }
virtual Long64_t GetTotBytes() const { std::lock_guard<std::mutex> sentry(fCounterMutex); return fTotBytes; }
Member

For correct operation, shouldn't you also force the assignment to a local variable:

std::lock_guard<std::mutex> sentry(fCounterMutex);
auto retValue = fTotBytes;
return retValue;

Contributor Author

I believe this is correct (@Dr15Jones, can you confirm?); however, I would prefer to remove this lock entirely if possible.

What's the best advice for using std::atomic? Maybe use a RAII type like with GetEntry? Would this work:

  • When the parallel flush basket starts, set a (transient) flag with a scoped RAII-style object.
  • If the transient flag is set, use transient std::atomic-based counters. (This should have no overhead when IMT is disabled).
  • When parallel flush ends (no parallel work is going on), add the std::atomic-based counters to the existing, non-atomic fTotBytes and fZipBytes and zero them out.
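A minimal sketch of that scheme, with `MockTree` and `FlushGuard` as illustrative names (not ROOT's actual classes):

```cpp
#include <atomic>

// While a parallel flush is active, workers bump a transient atomic
// side-counter; afterwards the total is folded back into the plain,
// serializable counter. The serial path stays lock- and atomic-free.
struct MockTree {
    long long fTotBytes = 0;                 // persisted, non-atomic counter
    bool fIMTFlush = false;                  // transient flag, toggled outside the parallel region
    std::atomic<long long> fIMTTotBytes{0};  // transient atomic side-counter

    void AddTotBytes(long long tot) {
        if (fIMTFlush) fIMTTotBytes += tot;  // thread-safe path during the flush
        else           fTotBytes    += tot;  // zero-overhead serial path
    }
};

// RAII toggle: set the flag on entry, merge and clear on exit
// (by which point all flush workers have been joined).
struct FlushGuard {
    MockTree& fTree;
    explicit FlushGuard(MockTree& t) : fTree(t) { fTree.fIMTFlush = true; }
    ~FlushGuard() {
        fTree.fIMTFlush = false;
        fTree.fTotBytes += fTree.fIMTTotBytes.exchange(0);
    }
};
```

Since the flag and the merge both happen outside the parallel region, neither needs to be atomic itself.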

Collaborator

The use of a local variable depends on when compilers run destructors, before or after copying the value out to memory/register for returning from the function.

Contributor Author

Ah, ok: I thought C++ provided a guarantee.

(Good to know, but I think I can just remove the lock entirely)

Contributor Author

Done in latest version - lock removed.

if (!fDirectory || fDirectory == gROOT || !fDirectory->IsWritable()) return 0;
if (gDebug > 0) {
printf("AutoSave Tree:%s after %lld bytes written\n",GetName(),fTotBytes);
printf("AutoSave Tree:%s after %lld bytes written\n",GetName(),GetTotBytes());
Member

Please use the ROOT printing routine (Info is likely to be the one you are looking for).

Contributor Author

I believe there are a few other printf statements - should I also try to grab those?

Member

Possibly, but then in a separate commit. Thanks.

Contributor Author

Done in latest version - all debugging is now a call to Info.

virtual TFriendElement *AddFriend(TTree* tree, const char* alias = "", Bool_t warn = kFALSE);
virtual void AddTotBytes(Int_t tot) { fTotBytes += tot; }
virtual void AddZipBytes(Int_t zip) { fZipBytes += zip; }
virtual void AddTotBytes(Int_t tot) { std::lock_guard<std::mutex> sentry(fCounterMutex); fTotBytes += tot; }
Member

For production code, we would need to make those optional as we need to not penalize the serial code or even multi-thread code not using the IMT.

Contributor Author

Done - I believe I have gotten everything down to a single boolean flag.

Contributor Author

bbockelm commented Oct 6, 2016

@etejedor - Apologies: I had only looked at the latest commit and saw Danilo's name and overlooked yours. Nice work! Indeed, TParBranchProcessingRAII does not need to be present there. I will remove it.

@dpiparo - From the other ticket: when we switch from the existing compression approach (HC-LZMA) to zlib level 1, the CPU efficiency goes up from 83% to 95%. Basically, the majority of long thread stalls comes from waiting on the flush to finish. Here's a nice graph Chris put together: https://dl.dropboxusercontent.com/u/11356841/rereco_1000_stall.pdf

@Dr15Jones
Collaborator

Since TTree uses manual Streamer code, why not just make the data members protected by the mutex atomic?

Member

pcanal commented Oct 6, 2016

@Dr15Jones Currently StreamerInfo does not yet support std::atomic (we could add it straightforwardly if we had a guarantee that std::atomic<T> and T have the same memory layout).
Even though there is a custom Streamer for TTree, it still uses the StreamerInfo for most of its I/O.

Contributor Author

bbockelm commented Oct 6, 2016

I believe there is explicitly no guarantee that std::atomic<T> and T have the same memory layout - in fact, it's legal for the compiler to implement this with a mutex. In all likelihood, it is the same layout on X86 but maybe not on obscure-platform-XYZ.

However, since we only need this to be thread-safe for the scope of the FlushBaskets call, we can do the thread safety work in transient members and add the intermediate results back to the non-thread-safe persisted members.

I think.
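Indeed, the standard gives no layout guarantee; since C++11 one can at least query at runtime whether a given std::atomic<T> is lock-free on the current platform. A small illustration (helper names here are made up for the example):

```cpp
#include <atomic>
#include <cstddef>

// True if std::atomic<long long> is implemented without a hidden lock on
// this platform; the standard permits either answer.
bool CounterIsLockFree() {
    std::atomic<long long> counter{0};
    return counter.is_lock_free();
}

// The atomic may be larger than the plain type (e.g. to hold a lock or
// satisfy alignment requirements), which is exactly why byte-for-byte
// serialization of the wrapped value cannot be assumed portable.
std::size_t AtomicCounterSize() { return sizeof(std::atomic<long long>); }
std::size_t PlainCounterSize()  { return sizeof(long long); }
```

On mainstream x86-64 compilers the sizes happen to match and the atomic is lock-free, but nothing in the standard requires either.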

Contributor Author

bbockelm commented Oct 7, 2016

Latest commit switches to transient atomics while the parallel flush is ongoing; after completion, the values in the atomics are added back into the main non-thread-safe counter.

Member

@Axel-Naumann Axel-Naumann left a comment


Tiny #include inconsistencies...

#include "TVirtualTreePlayer.h"
#endif

#include <mutex>
Member

I'd expect <atomic> instead?

Contributor Author

Good catch. That was overlooked in changing the mutex to atomics. Fixed.


#ifdef R__USE_IMT
static ROOT::TRWSpinLock fgRwLock; ///<!Read-write lock to protect global PID list
std::mutex fWriteMutex; ///<!Lock for writing baskets / keys into the file.
Member

I don't see #include <mutex>?

Contributor Author

Fixed.

Member

dpiparo commented Oct 7, 2016

Hi @bbockelm,
thanks for the plot. My question was slightly different: did you try a rereco with a CMSSW version built on top of a ROOT build containing this patch? If yes, what was the improvement?

std::atomic<Int_t> nerrpar(0);
std::atomic<Int_t> nbpar(0);
std::atomic<Int_t> pos(0);
tbb::task_group g;
Collaborator

We must be extremely careful when embedding calls to TBB into legacy code. Holding any shared resource (and a C++ object may itself be viewed as such a resource) can lead to deadlocks. See

https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/401006

for details. This is one reason why the CMS threaded-framework is built using explicit tasks rather than implicit ones. For the case of writing, I can't think of anyone being dependent on the TFile, although we must be certain that no ROOT locks are being held by the thread calling g.wait().
The use of a tbb::task_arena would avoid the deadlock but does not appear to give us a way to limit the number of available threads.

Contributor Author

Hi Chris,

Indeed.

My logic here is:

  • This function is high up enough in the call stack to verify it is not called with any global ROOT mutexes held.
  • The locking within the function was made per-TFile.
  • TTree and TFile are both thread-unsafe objects (cannot have two threads doing const or non-const calls into the same object), meaning there should not be multiple callers to this function regardless.

So, the remaining danger is if the caller (CMSSW) is holding a mutex of its own that other CMSSW tasks might need.

I will write these concerns up into a comment, provide an example of dangerous behavior, and add them to the function's documentation.

Brian

Contributor Author

Here's an example of a deadlock that could have happened in older versions of CMSSW:

  • CMSSW takes a mutex on the output module.
  • CMSSW performs a ROOT call which causes a FlushBaskets operation.
  • ROOT creates many basket-flushing tasks and then waits on the task group.
  • TBB's runtime notices that the thread is blocked in wait and schedules another CMSSW task on the same thread.
  • The other CMSSW task also needs to access the output module, hence tries to take the same mutex.
  • Since the thread already holds the output module's mutex, trying to acquire it again causes a deadlock.

Now, Chris tells me that CMSSW has switched away from the mutex for output module access - but this might be a good example.

I pushed a long comment for FlushBaskets outlining the above.

Member

The other CMSSW also needs to access the output module, hence takes a mutex.
Since the thread already holds the output module's mutex, trying to acquire it again causes a deadlock.

This of course depends on whether the mutex used is recursive or not (TMutex is, std::mutex is not).
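The difference is easy to demonstrate: a recursive mutex tolerates re-entry from the owning thread, while a plain std::mutex stays held against everyone else. Since re-locking a plain std::mutex from the owning thread is undefined behaviour, this sketch probes it from a second thread instead:

```cpp
#include <mutex>
#include <thread>

// The same thread can re-acquire a recursive mutex without deadlocking.
bool RecursiveReentryWorks() {
    std::recursive_mutex m;
    std::lock_guard<std::recursive_mutex> outer(m);
    std::lock_guard<std::recursive_mutex> inner(m); // fine: recursive re-entry
    return true;
}

// A plain std::mutex held by one thread cannot be acquired by another.
bool PlainMutexStaysHeld() {
    std::mutex m;
    std::lock_guard<std::mutex> held(m);
    bool acquired = false;
    std::thread([&m, &acquired] { acquired = m.try_lock(); }).join();
    return !acquired; // the second thread's try_lock failed while we held the lock
}
```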

Contributor Author

Certainly!

The point is that code which works with IMT disabled may stop working when IMT is enabled. One would hope that such issues are fixable once the user understands the underlying problem; hence the extensive comment in the code.

Contributor Author

bbockelm commented Oct 7, 2016

@dpiparo - doing a build with CMSSW will be extremely hard until CMSSW has at least a test-release based on ROOT 6.08 (and, even then, no guarantees the reco application would work).

The best I can do currently is to note that (a) this improves straightforward examples, (b) CMSSW blocking on FlushBaskets remains the top contributor to thread stalls, and (c) the stalls go away if we switch the tree to a faster compression algorithm (such as zlib).

Member

pcanal commented Oct 7, 2016

@bbockelm Didn't Dan show that LZMA with the smaller window was on par, performance/stall-wise, with zlib? [Nonetheless, this patch should still help even further :) ]

Contributor Author

bbockelm commented Oct 7, 2016

@pcanal - The LZMA tweaks caused stalls to decrease by an order of magnitude; unfortunately, they started out two orders of magnitude larger than the next source of stalling.

Contributor Author

bbockelm commented Nov 4, 2016

@dpiparo - PR9 in roottest has some minimal coverage of this.

Contributor Author

bbockelm commented Nov 4, 2016

@karies - if I'm reading the ticket history correctly, you requested that I fix up the #includes; should be done. Can you +1 the PR - or do you see something else?

TTree& operator=(const TTree& tt); // not implemented

#ifdef R__USE_IMT
mutable Bool_t fIMTFlush{false}; ///<! True if we are doing a multithreaded flush.
Member

Odd to see a mutable non-atomic. Is that really the intention? (i.e. is it really never possible to modify/read it in parallel?)

Contributor Author

Yes, that's the intention. The real problem here is that FlushBaskets is declared const: the mutable is working around that.

fIMTFlush is only touched from within FlushBaskets, which is not allowed to be called from multiple threads. It is mutated outside the parallel portion, so there is a synchronization point after each mutation.

I wanted to avoid making it atomic to fulfill the request that there be no performance penalty when ROOT is used in non-IMT mode.

Member

Fair enough; a comment here, or at the point of use, along the lines of this answer ought to be added.

virtual void AddZipBytes(Int_t zip) { fZipBytes += zip; }
// As the TBasket invokes Add{Tot,Zip}Bytes on its parent tree, we must do these updates in a thread-safe
// manner only when we are flushing multiple baskets in parallel.
virtual void AddTotBytes(Int_t tot) { if (fIMTFlush) { fIMTTotBytes += tot; } else { fTotBytes += tot; } }
Member

This will fail to compile if R__USE_IMT is off.

return -1;
}
fMotherDir = file; // fBranch->GetDirectory();
#ifdef R__USE_IMT
Member

Given the slightly unusual use pattern (take the lock, then explicitly unlock, then lock again), can you add a comment on what this is protecting (and why it will not provoke a deadlock)?

return -1;
}

struct BoolRAIIToggle {
Member

Consider putting this class in the anonymous namespace.

Int_t branchStyle = 1; //new style by default
if (split < 0) {branchStyle = 0; split = -1-split;}

ROOT::EnableImplicitMT(4);
Member

This should be optional (so we need to add a command-line argument) and default to disabled (as is, the test now complains if ROOT was not built with IMT enabled).

Contributor Author

Ugh. Those were test modifications to MainEvent.cxx that I did not intend to commit. Will remove them all.

Member

I disagree ;). I think it is a useful new testing switch... it just needs to be made conditional. Thanks.

Contributor Author

Would you prefer that it be automatically enabled, or should I add it as a command-line flag?

Member

As a command-line flag, so that the default stays as is and one can test with and without it using the same build.

Member

pcanal commented Jan 10, 2017

Merged.

Thanks,
Philippe.

@pcanal pcanal closed this Jan 10, 2017
linev added a commit to linev/root that referenced this pull request Feb 23, 2024
1. Fix - abort tree draw operation faster

1. Fix - catch exception when parsing TF1 formula
2. Fix - properly check THStack histograms axes when doing sum
3. Fix - correctly handle negative offset on time axis
4. Fix - do not use `inset` because of old Chrome browsers
5. Fix - properly provide object hints

1. Fix - draw histograms with negative bins root-project#276
2. Fix - correctly read TLeaf with fixed-size array
3. Fix - bug in options handling in startGUI
4. Fix - greyscale support in TLegend drawing
5. Fix - correctly use text font for TGaxis title
6. Fix - preserve auto colors in THStack root-project#277
7. Fix - correctly set pave name root-project#278
linev added a commit that referenced this pull request Feb 26, 2024