-
Notifications
You must be signed in to change notification settings - Fork 1.4k
First-pass IMT implementation of FlushBaskets. #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @bbockelm , nice job. Cheers, |
|
Hi @bbockelm , I see that you are using ROOT::Internal::TParBranchProcessingRAII here: This RAII activates the locks needed for reading branches in parallel. Are these locks exactly the same in the case of writing? If not, you should use a different RAII and put in place only those locks that you actually need. Cheers, |
tree/tree/inc/TTree.h
Outdated
| TTree& operator=(const TTree& tt); // not implemented | ||
|
|
||
| #ifdef R__USE_IMT | ||
| mutable std::mutex fCounterMutex; ///<!Lock to protect counters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably using the TSpinMutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) Yes - but the intent is to throw this away and avoid all locking whatsoever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in latest version.
tree/tree/inc/TTree.h
Outdated
| virtual Int_t GetTimerInterval() const { return fTimerInterval; } | ||
| TBuffer* GetTransientBuffer(Int_t size); | ||
| virtual Long64_t GetTotBytes() const { return fTotBytes; } | ||
| virtual Long64_t GetTotBytes() const { std::lock_guard<std::mutex> sentry(fCounterMutex); return fTotBytes; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For correct operation, shouldn't you also force the assignment to a local variable:
std::lock_guardstd::mutex sentry(fCounterMutex);
auto retValue = fTotBytes;
return retValue;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is correct ( @Dr15Jones - can you confirm); however, I prefer to remove this lock if possible.
What's the best advice for using std::atomic? Maybe use a RAII type like with GetEntry? Would this work:
- When the parallel flush basket starts, set a (transient) flag with a scoped RAII-style object.
- If the transient flag is set, use transient
std::atomic-based counters. (This should have no overhead when IMT is disabled). - When parallel flush ends (no parallel work is going on), add the
std::atomic-based counters to the existing, non-atomicfTotBytesandfZipBytesand zero them out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of a local variable depends on when compilers run destructors, before or after copying the value out to memory/register for returning from the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok: I thought C++ provided a guarantee.
(Good to know, but I think I can just remove the lock entirely)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in latest version - lock removed.
tree/tree/src/TTree.cxx
Outdated
| if (!fDirectory || fDirectory == gROOT || !fDirectory->IsWritable()) return 0; | ||
| if (gDebug > 0) { | ||
| printf("AutoSave Tree:%s after %lld bytes written\n",GetName(),fTotBytes); | ||
| printf("AutoSave Tree:%s after %lld bytes written\n",GetName(),GetTotBytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the ROOT printing routine (Info is likely to be the one you are looking for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there are a few other printf statements - should I also try to grab those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly but then in a separate commit. thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in latest version - all debugging is now a call to Info.
tree/tree/inc/TTree.h
Outdated
| virtual TFriendElement *AddFriend(TTree* tree, const char* alias = "", Bool_t warn = kFALSE); | ||
| virtual void AddTotBytes(Int_t tot) { fTotBytes += tot; } | ||
| virtual void AddZipBytes(Int_t zip) { fZipBytes += zip; } | ||
| virtual void AddTotBytes(Int_t tot) { std::lock_guard<std::mutex> sentry(fCounterMutex); fTotBytes += tot; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For production code, we would need to make those optional as we need to not penalize the serial code or even multi-thread code not using the IMT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done - I believe I have gotten everything down to a single boolean flag.
|
@etejedor - Apologies: I had only looked at the latest commit and saw Danilo's name and overlooked yours. Nice work! Indeed, @dpiparo - From the other ticket, when we switch from the existing compression approach (HC-LZMA) to zlib level 1, the CPU efficiency goes up from 83% to 95%. Basically, the majority of long threads stalls comes from waiting on the flush to finish. Here's a nice graph Chris put together: https://dl.dropboxusercontent.com/u/11356841/rereco_1000_stall.pdf |
|
Since TTree uses manual Streamer code, why not just make the data members being protected by the mutex atomic? |
|
@Dr15Jones Currently StreamerInfo does not yet support std::atomic (we could add it straightforwardly if we have a guarantee that std::atomic and T have the same memory layout). |
|
I believe there is explicitly no guarantee that However, since we only need this to be thread-safe for the scope of the I think. |
|
Latest commit switches to transient atomics while the parallel flush is ongoing; after completion, the values in the atomics are added back into the main non-thread-safe counter. |
Axel-Naumann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny #include inconsistencies...
tree/tree/inc/TTree.h
Outdated
| #include "TVirtualTreePlayer.h" | ||
| #endif | ||
|
|
||
| #include <mutex> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect <atomic> instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. That was overlooked in changing the mutex to atomics. Fixed.
|
|
||
| #ifdef R__USE_IMT | ||
| static ROOT::TRWSpinLock fgRwLock; ///<!Read-write lock to protect global PID list | ||
| std::mutex fWriteMutex; ///<!Lock for writing baskets / keys into the file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see #include <mutex>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
Hi @bbockelm, |
| std::atomic<Int_t> nerrpar(0); | ||
| std::atomic<Int_t> nbpar(0); | ||
| std::atomic<Int_t> pos(0); | ||
| tbb::task_group g; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We must be extremely careful when embedding calls to TBB into legacy code. Holding any shared resource (which a C++ object itself may be viewed as such a resource) can lead to deadlocks. See
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/401006
for details. This is one reason why the CMS threaded-framework is built using explicit tasks rather than implicit ones. For the case of writing, I can't think of anyone being dependent on the TFile, although we must be certain that no ROOT locks are being held by the thread calling g.wait().
The use of a tbb::task_arena would avoid the deadlock but does not appear to give us a way to limit the number of available threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Chris,
Indeed.
My logic here is:
- This function is high up enough in the call stack to verify it is not called with any global ROOT mutexes held.
- The locking within the function was made per-
TFile. TTreeandTFileare both thread-unsafe objects (cannot have two threads doing const or non-const calls into the same object), meaning there should not be multiple callers to this function regardless.
So, the remaining danger is if the caller (CMSSW) is holding a mutex of its own that other CMSSW tasks might need.
I will write these concerns up into a comment, provide an example of dangerous behavior, and add them to the function's documentation.
Brian
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of a deadlock that could have happened in older versions of CMSSW:
- CMSSW takes a mutex on the output module.
- CMSSW performs a ROOT call which causes a
FlushBasketsoperation. - ROOT creates many basket flushing tasks and then
waiton the task group. - TBB's runtime notice the thread blocks in
waitand schedules another CMSSW task on the same thread. - The other CMSSW also needs to access the output module, hence takes a mutex.
- Since the thread already holds the output module's mutex, trying to acquire it again causes a deadlock.
Now, Chris tells me that CMSSW has switched away from the mutex for output module access - but this might be a good example.
I pushed a long comment for FlushBaskets outlining the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other CMSSW also needs to access the output module, hence takes a mutex.
Since the thread already holds the output module's mutex, trying to acquire it again causes a deadlock.
This of course depend on whether the mutex that is used is recursive or not (TMutex is, std::mutex is not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly!
The point is that working code with IMT-disabled may stop working when IMT is enabled. One would hope that issues would be fixable once the user understands the underlying issue. Hence the extensive comment in the code.
|
@dpiparo - doing a build with CMSSW will be extremely hard until CMSSW has at least a test-release based on ROOT 6.08 (and, even then, no guarantees the reco application would work). Best I can do currently is (a) this improves straightforward examples, (b) CMSSW blocking on |
|
@bbockelm Didn't Dan show that LZMA with the smaller windows was on par performance/stall wise with zlib? [Nonetheless, this patch should still help even further :) ] |
|
@pcanal - The LZMA tweaks caused stalls to decrease by an order-of-magnitude; unfortunately, they started as two-orders-magnitude larger than the next source of stalling. |
|
@dpiparo - PR9 in roottest has some minimal coverage of this. |
|
@karies - if I'm reading the ticket history correctly, you requested that I fix up the |
| TTree& operator=(const TTree& tt); // not implemented | ||
|
|
||
| #ifdef R__USE_IMT | ||
| mutable Bool_t fIMTFlush{false}; ///<! True if we are doing a multithreaded flush. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Odd to see a mutable non-atomic. Is that really the intention? (i.e. is it really never possible to modify/read it in parallel?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the intention. Well, the real problem here is that FlushBaskets is declared const: the mutable here is working around that.
The fIMTFlush is only called from within FlushBaskets, which is not allowed to be called from multiple threads. It's done outside the parallel portion, meaning there must be a synchronization after it is mutated.
I wanted to avoid making it atomic to fulfill the request that there is no performance penalty when ROOT is used in non-IMT mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, a comment here or at the use place along the line of this answer ought to be added.
| virtual void AddZipBytes(Int_t zip) { fZipBytes += zip; } | ||
| // As the TBasket invokes Add{Tot,Zip}Bytes on its parent tree, we must do these updates in a thread-safe | ||
| // manner only when we are flushing multiple baskets in parallel. | ||
| virtual void AddTotBytes(Int_t tot) { if (fIMTFlush) { fIMTTotBytes += tot; } else { fTotBytes += tot; } } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail to compile if R__USE_IMT is off.
| return -1; | ||
| } | ||
| fMotherDir = file; // fBranch->GetDirectory(); | ||
| #ifdef R__USE_IMT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the slightly unusual use pattern (take the lock then explicit unlock then lock). Can you add comment on what this is propecting (and why it will not provoke a dead lock).
| return -1; | ||
| } | ||
|
|
||
| struct BoolRAIIToggle { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider putting this class in the anonymous namespace.
test/MainEvent.cxx
Outdated
| Int_t branchStyle = 1; //new style by default | ||
| if (split < 0) {branchStyle = 0; split = -1-split;} | ||
|
|
||
| ROOT::EnableImplicitMT(4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be optional (so we need to add a command line argument) and default to not be enabled (as is the test now complain if ROOT was not built with IMT enabled).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. Those were test modifications to MainEvent.cxx that I did not intend to commit. Will remove them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree ;). I think it is a useful new testing swtich ... it just need to be made conditional. thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer that it is automatically enabled or add it as a command line flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a command line flag so that the default stay as is and one can test with and without it with the same build.
|
Merged. Thanks, |
1. Fix - abort tree draw operation faster 1. Fix - catch exception when parsing TF1 formula 2. Fix - properly check THStack histograms axes when doing sum 3. Fix - correctly handle negative offset on time axis 4. Fix - do not use `inset` because of old Chrome browsers 5. Fix - properly provide object hints 1. Fix - draw histograms with negative bins root-project#276 2. Fix - correctly read TLeaf with fixed-size array 3. Fix - bug in options handling in startGUI 4. Fix - greyscale support in TLegend drawing 5. Fix - correctly use text font for TGaxis title 6. Fix - preserve auto colors in THStack root-project#277 7. Fix - correctly set pave name root-project#278
1. Fix - abort tree draw operation faster 1. Fix - catch exception when parsing TF1 formula 2. Fix - properly check THStack histograms axes when doing sum 3. Fix - correctly handle negative offset on time axis 4. Fix - do not use `inset` because of old Chrome browsers 5. Fix - properly provide object hints 1. Fix - draw histograms with negative bins #276 2. Fix - correctly read TLeaf with fixed-size array 3. Fix - bug in options handling in startGUI 4. Fix - greyscale support in TLegend drawing 5. Fix - correctly use text font for TGaxis title 6. Fix - preserve auto colors in THStack #277 7. Fix - correctly set pave name #278
This is based on @dpiparo 's work to parallelize
GetEntry. The basic idea is, when we are flushing all active branches, we do each branch in parallel. We have to maintain mutual exclusion when interacting with theTTreeorTFile, but we can parallelize the compression of the baskets (which is a significant amount of CPU time).Note the least satisfactory part of this work is having to use a mutex to access the byte-counters in
TTree; this is because these fields are serialized andstd::atomic<>is not serializable. Any hints as to how to get around this?Setting
MainEvent.cxxin thetestsub-directory to use this (with LZMA as the compression algorithm), I get:@pcanal @Dr15Jones - this spun off from our discussion about CMSSW efficiency. It's really easy to parallelize
FlushBasketsusing atbb::task_groupthat I later wait for. However, continuation-style programming is difficult here becauseFlushBasketsis called from deep callstacks. Further, there's a lot of state in the basket itself we'd need to unravel.Looking at stack traces for the sample
Eventprogram, the next most advantageous place to parallelize compression is here:The idea would be to make
WriteBufferkick off a separate task, but blockTBranch::Fill(and a handful of other functions, such as anything that can change the branch'sTFile) from being called until theWriteBuffertask was completed. Harder than this approach, but not impossible.