Optimizations for TBufferMerger and parallel I/O in general #666

amadio · 2017-06-20T15:08:27Z

These changes are based on an analysis with VTune that showed that most waiting happened around TClass::LoadClassInfo(). Returning early without taking the lock when the information is already loaded yields very significant performance improvements. Images with each thread activity before and after the optimizations are attached. The x-axis represents time, light green means a thread is active, and dark green and brown mean running.

Before changes:

After changes:

phsft-bot · 2017-06-20T15:08:43Z

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with flags -Dvc=OFF -Dimt=ON -Dccache=ON
How to customize builds

pcanal · 2017-06-20T15:15:12Z

core/meta/src/TClass.cxx

 {
-   R__LOCKGUARD(gInterpreterMutex);
+   // Load class info if not already loaded and auto-parsing is not suspended
+   if (!fClassInfo && !gInterpreter->IsAutoParsingSuspended()) {


For this to work properly fClassInfo needs to become an atomic (i.e. the test here can happen in the middle of fClassInfo being updated and lead to returning a partial set pointer.

Mixing the two tests is semantic incorrect. You can have a situation where fClassInfo is not set and when AutoParsing is suspended but fClassInfo can still be set because the information is already loaded/present in the AST.

I cannot see how a pointer can be only partially set.

Your second comment does make sense, though. What do I need to avoid when auto parsing is suspended? It's not clear to me. I thought AutoParse(GetName()) should only be called if it's not suspended...

BTW, if fClassInfo is being set during the if statement, then the lock is acquired. By the time we can continue, fClassInfo should be either not set or fully set, so the check for early return will take care that we don't redo the work or return an incomplete result.

I cannot see how a pointer can be only partially set.

Re-read the 'raison d'etre' of atomics. Without fClassInfo being an atomic, there are race condition where a thread is writing to a memory area at the same time another thread is reading this same are and only a partial copy of the writing (let's the first 4 out of 8 bytes) is seen by the reader.

What do I need to avoid when auto parsing is suspended?
if (!fClassInfo && !gInterpreter->IsAutoParsingSuspended()) {

The point I am trying to make is that if fClassInfo is a nullptr and fCanLoadClassInfo is true and
IsAutoParsingSuspend is true, then there is still case where fClassInfo can be set (because the AST already contains the information.

What do I need to avoid when auto parsing is suspended?

Just the setting of fCanLoadClassInfo and the warning about the information missing since in this case the lookup failure (if any) was not fully authoritative ...

BTW, if fClassInfo is being set during the if statement, then the lock is acquired. By the time we can continue, fClassInfo should be either not set or fully set, so the check for early return will take care that we don't redo the work or return an incomplete result.

Unfortunately not accurate. In the (rare, very rare but real and seen by CMSSW) cases where fClassInfo is partially filled when the if statement happens, then the function returns immediately and then there is (still) a chance that the reading thread will then proceed to try to use this partially filled value of fClassInfo.

In the end, if fClassInfo is made atomic, the start of the function can be

if (fClassInfo == nullptr && fCanLoadClassInfo) { ....

or

if (fClassInfo != nullptr || !fCanLoadClassInfo) return;

I am not sure which one is more efficient ...
You may even want to use R__likely/R__unlikely

if (R__unlikely(fClassInfo == nullptr && fCanLoadClassInfo)) { .... // or if (R__likely(fClassInfo != nullptr || !fCanLoadClassInfo)) return;

Cheers,
Philippe.

Since fCanLoadClassInfo is atomic, we should probably short-circuit on it rather than fClassInfo. That is, we should use if (!fCanLoadClassInfo || fClassInfo) return; then set fCanLoadClassInfo to false once it's already loaded.

I think you have a good point in that case we can simplify it to just

if (R__likely(!fCanLoadClassInfo)) return; // fClassInfo already loaded or not loadable. R___LOCKGUARD(gROOTMutext); if (!fCanLoadClassInfo || fClassInfo) return; // in case some other thread beat us to the punch and we are already done. ... // Previous version of the code

pcanal · 2017-06-20T15:16:03Z

core/meta/src/TClass.cxx

-   if (!gInterpreter->IsAutoParsingSuspended()) {
-      if (!fClassInfo) {
+      if (!fClassInfo)
+         gInterpreter->SetClassInfo(const_cast<TClass*>(this));


Why remove the comment?

The function is called SetClassInfo. The comment is redundant in my opinion.

The ambiguity/unusual thing is that it sets a data member of the 'parameter' rather then object it is called upon.

Are you saying that every call to this function needs a comment to clarify that it sets the fClassInfo pointer? That doesn't make sense to me.

because of the odd pattern and the static_cast being used, yes I felt compelled to do so (well in 3 out of the 4 actual calls ... but if somehow I am wrong that the pattern is odd/confusing, then indeed it is fine to omit the comment ...

pcanal · 2017-06-20T15:17:39Z

core/meta/src/TClass.cxx

 {
-   R__LOCKGUARD(gInterpreterMutex);
+   // Load class info if not already loaded and auto-parsing is not suspended
+   if (!fClassInfo && !gInterpreter->IsAutoParsingSuspended()) {


Mixing the two tests is semantic incorrect. You can have a situation where fClassInfo is not set and when AutoParsing is suspended but fClassInfo can still be set because the information is already loaded/present in the AST.

pcanal · 2017-06-20T15:18:49Z

core/meta/src/TClass.cxx


-   // If another thread executed LoadClassInfo at about the same time
-   // as this thread return early since the work was done.
-   if (!fCanLoadClassInfo) return;


If I read correct this test is removed, why?

If the comment above this test is accurate, this should be covered by the test if (fClassInfo) return; right after acquiring the lock.

The comment is not fully accurate. There is more semantic to fCanLoadClassInfo, it also cover the case where fClassInfo is null but the TClass knows it will not be able to load it (and thus there is not point in wasting CPU re-finding out that information).

And also fCanLoadClassInfo is an atomic.

pcanal · 2017-06-20T15:49:37Z

On a general note, very nice find! Thanks!

Does the second graph means that in that scenario the 'merging' is now the bottle neck (i.e the one long thread)?

amadio · 2017-06-20T15:51:05Z

Yes, the second graph basically shows that the bottleneck is now the disk speed. :-)

amadio · 2017-06-20T15:51:48Z

I am asking for access to an NVMe SSD that is installed on this server. With that I will collect more scaling data, which should be much better than before.

pcanal · 2017-06-20T15:53:04Z

The graph 'seems' to also indicates that there is only 'one' cluster per thread .. is that plausible?
I am deriving this guess from the fact that the merger thread seems to only start doing something after the worker thread has mostly done all their work.

phsft-bot · 2017-06-20T15:53:54Z

Build failed on centos7/gcc49.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

projectroot.roottest.root.multicore.roottest_root_multicore_tclass_methods

phsft-bot · 2017-06-20T15:54:44Z

Build failed on slc6/gcc49.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

projectroot.roottest.root.multicore.roottest_root_multicore_tclass_methods

pcanal · 2017-06-20T15:59:45Z

I am asking for access to an NVMe SSD that is installed on this server.

To emulate an infinite speed disk, I suspect you can also write the file named "/dev/null" :)

More realistically, this is actually good news and we ought to be able to compare the run-time of this process to the run-time of the equivalent copy or dd of full resulting file ... humm this would involve/include the reading so a completely fair comparison ....

amadio · 2017-06-20T16:01:02Z

Well, the failing test shows that there is indeed a problem with my changes. I will push a new version shortly.

phsft-bot · 2017-06-20T16:09:46Z

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with flags -Dvc=OFF -Dimt=ON -Dccache=ON
How to customize builds

phsft-bot · 2017-06-20T16:11:59Z

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with flags -Dvc=OFF -Dimt=ON -Dccache=ON
How to customize builds

amadio · 2017-06-20T16:14:05Z

I removed the other two commits because I still see a race condition on TList::RecursiveRemove that causes a crash of the TBufferMerger test.

phsft-bot · 2017-06-20T16:14:57Z

Build failed on centos7/gcc49.
See console output.

Errors:

/mnt/build/jenkins/workspace/root-pullrequests-build/root/core/meta/src/TClass.cxx:5468:29: error: ‘kFalse’ was not declared in this scope
/mnt/build/jenkins/workspace/root-pullrequests-build/root/core/meta/src/TClass.cxx:5486:27: error: ‘kFalse’ was not declared in this scope

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

pcanal · 2017-06-20T16:15:04Z

I removed the other two commits because I still see a race condition on
What were the stack trace?

phsft-bot · 2017-06-20T16:15:31Z

Build failed on mac1012/native.
See console output.

Errors:

/Volumes/HDD2/ec/build/workspace/root-pullrequests-build/root/core/meta/src/TClass.cxx:5468:29: error: use of undeclared identifier 'kFalse'
/Volumes/HDD2/ec/build/workspace/root-pullrequests-build/root/core/meta/src/TClass.cxx:5486:27: error: use of undeclared identifier 'kFalse'; did you mean 'false'?

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

phsft-bot · 2017-06-20T16:16:14Z

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with flags -Dvc=OFF -Dimt=ON -Dccache=ON
How to customize builds

amadio · 2017-06-20T16:17:02Z

...
[ RUN      ] TBufferMerger.ParallelTreeFill

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================

Thread 2 (Thread 0x7f75ebd7a700 (LWP 834)):
#0  0x00007f75eee0eb7b in __GI___waitpid (pid=841, stat_loc=stat_loc
entry=0x7f75ebd72bc0, options=options
entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:29
#1  0x00007f75eed961eb in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:148
#2  0x00007f75eff097d8 in TUnixSystem::Exec (shellcmd=<optimized out>, this=0x1e074f0) at /home/amadio/src/root/core/unix/src/TUnixSystem.cxx:2118
#3  TUnixSystem::StackTrace (this=0x1e074f0) at /home/amadio/src/root/core/unix/src/TUnixSystem.cxx:2412
#4  0x00007f75eff0be1c in TUnixSystem::DispatchSignals (this=0x1e074f0, sig=kSigSegmentationViolation) at /home/amadio/src/root/core/unix/src/TUnixSystem.cxx:3643
#5  <signal handler called>
#6  TList::RecursiveRemove (this=0x1e0d6c0, obj=0x7f75e4d04b40) at /home/amadio/src/root/core/cont/src/TList.cxx:717
#7  0x00007f75efe885b3 in THashList::RecursiveRemove (this=<optimized out>, obj=0x7f75e4d04b40) at /home/amadio/src/root/core/cont/src/THashList.cxx:286
#8  0x00007f75efe25047 in TObject::~TObject (this=0x7f75e4d04b40, __in_chrg=<optimized out>) at /home/amadio/src/root/core/base/src/TObject.cxx:88
#9  0x00007f75f0eef6a9 in TTree::~TTree (this=0x7f75e4d04b40, __in_chrg=<optimized out>) at /home/amadio/src/root/tree/tree/src/TTree.cxx:917
#10 0x00007f75efeba961 in TClass::Destructor (this=this
entry=0x1e81470, obj=obj
entry=0x7f75e4d04b40, dtorOnly=dtorOnly
entry=false) at /home/amadio/src/root/core/meta/src/TClass.cxx:5074
#11 0x00007f75f0758a38 in TFileMerger::MergeRecursive (this=0x7f75ebd75a90, target=0x7f75e4dd2780, sourcelist=0x7f75e4844ec0, type=14) at /home/amadio/src/root/io/io/src/TFileMerger.cxx:756
#12 0x00007f75f0756e37 in TFileMerger::PartialMerge (this=this
entry=0x7f75ebd75a90, in_type=in_type
entry=14) at /home/amadio/src/root/io/io/src/TFileMerger.cxx:842
#13 0x00007f75f073097e in ROOT::Experimental::TBufferMerger::WriteOutputFile (this=0x7fff3e662d00) at /home/amadio/src/root/io/io/src/TBufferMerger.cxx:90
#14 0x00007f75ef904d9e in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/7.1.0/libstdc++.so.6
#15 0x00007f75ef0f43f4 in start_thread (arg=0x7f75ebd7a700) at pthread_create.c:333
#16 0x00007f75eee3f65f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

Thread 1 (Thread 0x7f75f1325a80 (LWP 771)):
#0  0x00007f75ef0f56cd in pthread_join (threadid=140144444679936, thread_return=0x0) at pthread_join.c:90
#1  0x00007f75ef905063 in std::thread::join() () from /usr/lib/gcc/x86_64-pc-linux-gnu/7.1.0/libstdc++.so.6
#2  0x00007f75f072fd76 in ROOT::Experimental::TBufferMerger::~TBufferMerger (this=0x7fff3e662d00, __in_chrg=<optimized out>) at /home/amadio/src/root/io/io/src/TBufferMerger.cxx:35
#3  0x00000000004182f0 in TBufferMerger_ParallelTreeFill_Test::TestBody (this=<optimized out>) at /home/amadio/src/root/io/io/test/TBufferMerger.cxx:86
#4  0x000000000043fa74 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x442444 "the test body", method=<optimized out>, object=<optimized out>) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:2402
#5  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object
entry=0x1e5e370, method=<optimized out>, location=location
entry=0x442444 "the test body") at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:2438
#6  0x00000000004365aa in testing::Test::Run (this=0x1e5e370) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:2475
#7  0x00000000004366f8 in testing::TestInfo::Run (this=0x1e5e520) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:2656
#8  0x00000000004367d5 in testing::TestCase::Run (this=0x1e5e0a0) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:2774
#9  0x0000000000436ab7 in testing::internal::UnitTestImpl::RunAllTests (this=0x1e5de00) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:4649
#10 0x000000000043ff24 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0x4419b8 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=<optimized out>) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:2402
#11 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x1e5de00, method=<optimized out>, location=location
entry=0x4419b8 "auxiliary test code (environments or event listeners)") at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:2438
#12 0x0000000000436dd0 in testing::UnitTest::Run (this=0x650860 <testing::UnitTest::GetInstance()::instance>) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest.cc:4257
#13 0x0000000000417c59 in RUN_ALL_TESTS () at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/include/gtest/gtest.h:2233
#14 main (argc=<optimized out>, argv=0x7fff3e663168) at /home/amadio/build/root/googletest-prefix/src/googletest/googletest/src/gtest_main.cc:37
===========================================================


The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum.
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#6  TList::RecursiveRemove (this=0x1e0d6c0, obj=0x7f75e4d04b40) at /home/amadio/src/root/core/cont/src/TList.cxx:717
#7  0x00007f75efe885b3 in THashList::RecursiveRemove (this=<optimized out>, obj=0x7f75e4d04b40) at /home/amadio/src/root/core/cont/src/THashList.cxx:286
#8  0x00007f75efe25047 in TObject::~TObject (this=0x7f75e4d04b40, __in_chrg=<optimized out>) at /home/amadio/src/root/core/base/src/TObject.cxx:88
#9  0x00007f75f0eef6a9 in TTree::~TTree (this=0x7f75e4d04b40, __in_chrg=<optimized out>) at /home/amadio/src/root/tree/tree/src/TTree.cxx:917
#10 0x00007f75efeba961 in TClass::Destructor (this=this
entry=0x1e81470, obj=obj
entry=0x7f75e4d04b40, dtorOnly=dtorOnly
entry=false) at /home/amadio/src/root/core/meta/src/TClass.cxx:5074
#11 0x00007f75f0758a38 in TFileMerger::MergeRecursive (this=0x7f75ebd75a90, target=0x7f75e4dd2780, sourcelist=0x7f75e4844ec0, type=14) at /home/amadio/src/root/io/io/src/TFileMerger.cxx:756
#12 0x00007f75f0756e37 in TFileMerger::PartialMerge (this=this
entry=0x7f75ebd75a90, in_type=in_type
entry=14) at /home/amadio/src/root/io/io/src/TFileMerger.cxx:842
#13 0x00007f75f073097e in ROOT::Experimental::TBufferMerger::WriteOutputFile (this=0x7fff3e662d00) at /home/amadio/src/root/io/io/src/TBufferMerger.cxx:90
#14 0x00007f75ef904d9e in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/7.1.0/libstdc++.so.6
#15 0x00007f75ef0f43f4 in start_thread (arg=0x7f75ebd7a700) at pthread_create.c:333
#16 0x00007f75eee3f65f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

===========================================================


Fatal in <TFileMerger::RecursiveRemove>: Output file of the TFile Merger (targeting tbuffermerger_parallel.root) has been deleted (likely due to a TTree larger than 100Gb)
aborting

phsft-bot · 2017-06-20T16:18:45Z

Build failed on slc6/gcc62.
See console output.

Errors:

ninja: error: loading 'build.ninja': No such file or directory

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

pcanal · 2017-06-20T16:19:38Z

annoying :( ... the 'other' thread is already gone ....

phsft-bot · 2017-06-20T16:19:58Z

Build failed on centos7/gcc49.
See console output.

Errors:

/mnt/build/jenkins/workspace/root-pullrequests-build/root/core/meta/src/TClass.cxx:5486:27: error: ‘kFalse’ was not declared in this scope

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

amadio · 2017-06-20T16:20:14Z

Fixed typo kFalse -> kFALSE.

phsft-bot · 2017-06-20T16:20:15Z

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with flags -Dvc=OFF -Dimt=ON -Dccache=ON
How to customize builds

pcanal · 2017-06-20T16:24:23Z

core/meta/src/TClass.cxx

-                 "no interpreter information for class %s is available even though it has a TClass initialization routine.",
-                 fName.Data());
-      }
      fCanLoadClassInfo = kFALSE;


To be self consistent, we need

if (autoParse) fCanLoadClassInfo = kFALSE;

When auto parsing is disabled, the lookup answer is not authoritative ....

Can it not be loaded by other means as you said before? I'm confused.

After SetClassInfo is done we can be in many state:

fClassInfo is now a value, we are all good, no new to try reloading (independently of the auto-parsing flag)

fClassInfo is still a nullptr but auto-parsing is enabled, there is no more information to be have anywhere, the AST decl for the class does not exist

fClassInfo is still anullptr but auto-parsing is disabled, we now know the AST decl for the class is not in the AST but it might be brought into the AST if we were to retry with auto-parsing on (because the header declaring it might be parsed).

pcanal · 2017-06-20T16:24:34Z

core/meta/src/TClass.cxx

-   // If another thread executed LoadClassInfo at about the same time
-   // as this thread return early since the work was done.
-   if (!fCanLoadClassInfo) return;
+   bool AutoParse = !gInterpreter->IsAutoParsingSuspended();


naming rule is that variable name start with a lower case.

Ok, I can change that.

Axel-Naumann · 2017-06-21T08:52:49Z

core/meta/src/TClass.cxx

-   // class was registered by a dictionary, but the info came from reading
-   // the pch.
-   // Note: This check avoids using AutoParse for classes in the pch!
-   if (fClassInfo) {


Why do we not need this check anymore? Can we return early if fClassInfo (after making it a unique_ptr)?

When fClassInfo is set, fCanLoadClassInfo is set to false, so this test is redundant with the one right above it. I could check for fClassInfo != nullptr before taking the lock, but fClassInfo is not atomic, while fCanLoadClassInfo is. Therefore, I use the latter exclusively for checking and returning early.

Thanks for explaining @amadio And @pcanal bad naming :-) "can load" is the negation of "have loaded"?

Should we rename the variable to fClassInfoLoaded or similar and use accordingly?

Hi Axel,

Well you may not have read the doc fully :). Orginally 'fCanLoadClassInfo' was really indicating whether there was a chance to be able to load the ClassInfo or not ... and it still does in a way.

At the moment when fCanLoadClassInfo is false, fClassInfo can be either a valid pointer or a nullptr, so "have loaded" would be inaccurate.

In other word, if fCanLoadClassInfo is true running the body of LoadClassInfo is likely to yield a change in the value of fClassInfo, when it is false (and whether or not we already ran it) there is no point in running the body of LoadClassInfo as it is guaranteed to not yield any change in the value of fClassInfo.

Cheers,
Philippe.

Thanks, Philippe!

phsft-bot · 2017-06-21T09:23:40Z

Build failed on mac1012/native.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

projectroot.test.test_stressIOPlugins_xroot

phsft-bot · 2017-06-21T09:25:21Z

Build failed on slc6/gcc62.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

projectroot.test.test_stressIOPlugins_xroot

phsft-bot · 2017-06-21T09:34:17Z

Build failed on centos7/gcc49.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

phsft-bot · 2017-06-21T09:38:52Z

Build failed on ubuntu14/native.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

projectroot.test.test_stressIOPlugins_xroot

phsft-bot · 2017-06-22T08:40:29Z

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with flags -Dvc=OFF -Dimt=ON -Dccache=ON
How to customize builds

phsft-bot · 2017-06-22T09:34:40Z

Build failed on mac1012/native.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

phsft-bot · 2017-06-22T09:34:43Z

Build failed on slc6/gcc49.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

projectroot.roottest.root.dataframe.roottest_root_dataframe_test_snapshotNFiles

phsft-bot · 2017-06-22T09:40:19Z

Build failed on slc6/gcc62.
See console output.

Warnings:

JENKINS-19022: warning: possible memory leak due to Git plugin usage; see: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData

Failing tests:

projectroot.roottest.python.JupyROOT.roottest_python_JupyROOT_importROOT

amadio · 2017-06-22T10:13:42Z

@pcanal Could you please take a look at this test? http://cdash.cern.ch/testDetails.php?test=26224986&build=367429 It could be failing because of my commit, right?

Since fCanLoadClassInfo is atomic, the lock is not needed when checking for it. Returning early without waiting when the class info is already loaded yields significant performance improvements when writing in parallel.

phsft-bot · 2017-06-22T10:23:59Z

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with flags -Dvc=OFF -Dimt=ON -Dccache=ON
How to customize builds

pcanal · 2017-06-22T13:21:45Z

i guess it could .... but I am not sure how .... [Besides, today's version is still sub-optimal, i.e. a WIP]

amadio · 2017-06-22T13:28:51Z

Yes, this is work in progress, I will make several more changes to TClass based on my analysis with VTune. The current situation is shown in the figure below.

The wait time is still somewhat high, but the situation with only this PR's commits is still much better than it currently is on master.

amadio · 2017-06-22T13:41:46Z

@pcanal Jenkins passed. Should I merge this, or should we gather all changes first?

pcanal · 2017-06-22T14:03:52Z

I am confused ... the PR contains VC change and only 2 changes in TClass ... and the one change is still introduced a decrease of the single-thread performance (missing if (fClassInfo) and/or removal of the fClassInfo!=nullptr && fCanLoadClassInfo.)

amadio · 2017-06-22T14:34:46Z

The commit for Vc/VecCore sneaked in from another branch. I added it just to shut up warnings because of how things are on master at the moment, but it's not part of this PR (i.e. I will remove it prior to merging). I don't see a decrease of single thread performance, did you measure it?

pcanal · 2017-06-22T14:41:30Z

I don't see a decrease of single thread performance, did you measure it?

I know and have tried to explain it earlier.

There is a (quite) common case where a TClass is created (running TClass::Init) and ends up with fClassInfo != nullptr but fCanLoadClassInfo == true. In this commong case the removal of the
if (fClassInfo) return;
means that we do a useless (expensive) lookup (or at least series of function calls) for something already set and know. [On the other hand because that early return is NOT setting fCanLoadClassInfo it leads to LoadClassInfo being call often and the lock being taken]

We either need to re-add this if statement OR better yet make sure that all cases of fClassInfo!=nullptr also have fCanLoadClassInfo == false.

If you are still not convince, you can do the following test:
a) use the master
b) have the if (fStremearInfo) also set fCanLoadClassInfo to false.
and you will notice that the performance improvement is equal and/or similar to the one with your patch.

amadio · 2017-06-23T06:42:11Z

I am not sure that setting fCanLoadClassInfo to false whenever fClassInfo != nullptr will have no other side effects. I see at least a few places where fCanLoadClassInfo is used that could be affected by this change. TClass is too important to risk breaking, so at this point I think the best thing to do is to close this PR and to delegate making the right change to you, since you know the I/O system a lot better than me.

Axel-Naumann · 2017-06-23T09:15:35Z

Re-adding if (fClassInfo) return; (as Philippe suggested) sounded like the more effective path forward :-( No chance for you to have your patch not touch that if, and then ask roottest? The world would be a better place with this PR in!

amadio · 2017-06-23T10:31:06Z

@Axel-Naumann Well, @pcanal disagrees. Philippe wants to just add fCanLoadClassInfo = kFALSE; right after line 1413 of TClass.cxx, so that the calls to LoadClassInfo() which are guarded by if (fCanLoadClassInfo) are never made. However, I am not sure this change is safe yet.

That said, we surely need to fix this problem (given the difference seen in VTune), and I will keep working on a solution and will probably open a new pull request with either this change or similar changes after enough testing to ensure that everything is ok.

…dition Start 672: tutorial-io-loopdir 690/1156 Test root-project#666: tutorial-io-double32 ................................................ Passed 2.42 sec Start 673: tutorial-io-loopdir11 691/1156 Test root-project#673: tutorial-io-loopdir11 ............................................... Passed 0.76 sec Start 674: tutorial-io-mergeSelective 692/1156 Test root-project#672: tutorial-io-loopdir .................................................***Failed Error regular expression found in output. Regex=[Error in <] 1.23 sec Processing /builddir/build/BUILD/root-6.26.10/tutorials/io/loopdir.C... Info in <TCanvas::Print>: ps file hsimple.ps has been created Info in <TCanvas::Print>: Current canvas added to ps file hsimple.ps Info in <TCanvas::Print>: Current canvas added to ps file hsimple.ps Info in <TCanvas::Print>: Current canvas added to ps file hsimple.ps Info in <TCanvas::Print>: ps file hsimple.ps has been closed Error in <TPostScript::Text>: Cannot open temporary file: hsimple.ps_tmp_2089748

…dition (#11725) Start 672: tutorial-io-loopdir 690/1156 Test #666: tutorial-io-double32 ................................................ Passed 2.42 sec Start 673: tutorial-io-loopdir11 691/1156 Test #673: tutorial-io-loopdir11 ............................................... Passed 0.76 sec Start 674: tutorial-io-mergeSelective 692/1156 Test #672: tutorial-io-loopdir .................................................***Failed Error regular expression found in output. Regex=[Error in <] 1.23 sec Processing /builddir/build/BUILD/root-6.26.10/tutorials/io/loopdir.C... Info in <TCanvas::Print>: ps file hsimple.ps has been created Info in <TCanvas::Print>: Current canvas added to ps file hsimple.ps Info in <TCanvas::Print>: Current canvas added to ps file hsimple.ps Info in <TCanvas::Print>: Current canvas added to ps file hsimple.ps Info in <TCanvas::Print>: ps file hsimple.ps has been closed Error in <TPostScript::Text>: Cannot open temporary file: hsimple.ps_tmp_2089748

pcanal reviewed Jun 20, 2017

View reviewed changes

pcanal requested changes Jun 20, 2017

View reviewed changes

amadio force-pushed the parallel-io-optimizations branch from 4e0d5b3 to ce7a646 Compare June 20, 2017 16:09

amadio force-pushed the parallel-io-optimizations branch from ce7a646 to c6fe0e8 Compare June 20, 2017 16:11

amadio force-pushed the parallel-io-optimizations branch from c6fe0e8 to a8281a6 Compare June 20, 2017 16:15

amadio force-pushed the parallel-io-optimizations branch from a8281a6 to 0fa8ca8 Compare June 20, 2017 16:20

pcanal reviewed Jun 20, 2017

View reviewed changes

Axel-Naumann reviewed Jun 21, 2017

View reviewed changes

amadio force-pushed the parallel-io-optimizations branch from eb98b89 to 3450731 Compare June 22, 2017 08:40

amadio added 2 commits June 22, 2017 12:23

[Math] Improve VecCore backend configuration

a5441f4

amadio force-pushed the parallel-io-optimizations branch from 3450731 to a5441f4 Compare June 22, 2017 10:23

amadio closed this Jun 23, 2017

amadio deleted the parallel-io-optimizations branch June 23, 2017 06:42

phsft-bot mentioned this pull request Feb 16, 2018

[modules][cxxmodules] Improve the layering of our special modules. #1412

Merged

phsft-bot mentioned this pull request Oct 22, 2018

[cxxmodules] Fix failing execNestedClasses test for modules #2785

Merged

Optimizations for TBufferMerger and parallel I/O in general #666

Optimizations for TBufferMerger and parallel I/O in general #666

Uh oh!

Conversation

amadio commented Jun 20, 2017

Uh oh!

phsft-bot commented Jun 20, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pcanal commented Jun 20, 2017

Uh oh!

amadio commented Jun 20, 2017

Uh oh!

amadio commented Jun 20, 2017

Uh oh!

pcanal commented Jun 20, 2017

Uh oh!

phsft-bot commented Jun 20, 2017

Warnings:

Failing tests:

Uh oh!

phsft-bot commented Jun 20, 2017

Warnings:

Failing tests:

Uh oh!

pcanal commented Jun 20, 2017

Uh oh!

amadio commented Jun 20, 2017

Uh oh!

phsft-bot commented Jun 20, 2017

Uh oh!

phsft-bot commented Jun 20, 2017

Uh oh!

amadio commented Jun 20, 2017

Uh oh!

phsft-bot commented Jun 20, 2017

Errors:

Warnings:

Uh oh!

pcanal commented Jun 20, 2017

Uh oh!

phsft-bot commented Jun 20, 2017

Errors:

pcanal Jun 20, 2017 •

edited

Loading

Axel-Naumann Jun 21, 2017 •

edited

Loading