Skip to content

Conversation

@amadio
Copy link
Member

@amadio amadio commented Apr 27, 2017

This PR is a work in progress for a parallel version of the snapshot action introduced recently to TDataFrame. This version compiles and passes the test_snaphot.C test from roottest.git, but still needs quite a bit of work.

I imported the files we use with attributed authorship for each part, but we now need to move them to the right place if needed and refactor the interfaces according to feedback from various sources.

Please feel free to make comments directly on the code, and I will try to address everything by the deadline for branching out 6.10.

@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1011, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@amadio amadio changed the title WIP: Parallel snapshot action in TDataFrame [WIP] Parallel snapshot action in TDataFrame Apr 27, 2017
fMergingThread->join();

fIsMerged = true;
return nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reminder, we need to return the actual tree here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a FIXME would be good.

@dpiparo
Copy link
Member

dpiparo commented Apr 27, 2017

Hi,
I plan to add more punctual comments but here I think we have the first comments emerged from today's meeting:

  • A better naming is needed
  • In-tree gtests should be added to make sure the needs of TDF are satisfied
  • The queue should move into ROOT::Internal in core/cont
  • A factory pattern shall be in place not to expose the queue to the users for the moment

@dpiparo
Copy link
Member

dpiparo commented Apr 28, 2017

Another comment: the headers are quite fat. Perhaps we can have in the headers the interfaces and have cxx files with the implementation and helper functions hidden from the interface (and the pch)?

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from ef68901 to e1e4692 Compare April 28, 2017 06:26
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1011, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

Copy link
Member

@vgvassilev vgvassilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this.

Here is my first set of comments.

On general level:

  • It seems the headers do not include the ROOT-style heading.
  • I find the TThread prefix in the naming exposing an implementation detail. In fact, it seems that we do not rely on threading in the file implementation, we just provide discriminate file blocks to the writers and then we squash them together.
  • It seems misleading to the casual reader/developer/user to use the client-server naming (we have https server and networking libaries in ROOT). I'd use naming around discriminate writing blocks or smth like this.
  • I'd like to see in-tree unit tests of the new file merging.

// //
//////////////////////////////////////////////////////////////////////////

#include "dcqueue.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pcanal, I am wondering if there is not already a ROOT container which we could rely on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems ROOT has no equivalent at this time, so we are thinking about using tbb::concurrent_queue instead of this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that leak a header-wise dependency on tbb, which we are trying to avoid in general?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header dependency can probably be avoided if we move implementation to source files, as was suggested by Danilo (headers are "fat" now...).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that sounds reasonable to me.

TTimeStamp fLastContact;
Double_t fTimeSincePrevContact;

ClientInfo() : fFile(0), fLocalName(), fContactsCount(0), fTimeSincePrevContact(0) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do in-place initialization instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean here, can you explain? Do you want to initialize the variables in the place they are declared in the class, or initialize ClientInfo where it's created?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, it'd be better to initialize the variables in place, i.e. as part of their definitions.

ClientInfo() : fFile(0), fLocalName(), fContactsCount(0), fTimeSincePrevContact(0) {}
ClientInfo(const char *filename, UInt_t clientId);

void Set(TFile *file);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect (reading from the name) that this is a regular setter. It seems, that we are doing more work in this method eg. some non-trivial file registration. Could you pick up a better name (eg. RegisterFile) and add doxygen style documentation?

return fFilename;
}

Bool_t InitialMerge(TFile *input);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to have documentation on each method.

@@ -0,0 +1,212 @@
//===----------------------------------------------------------------------===//
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be different from the ROOT-style headers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is the code as imported. I wanted to give correct authorship to it, and then make my modifications on top. I will put everything in ROOT style as part of the work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, as long as this doesn't end up in master like that I am happy.

Int_t TThreadMergingFile::Write(const char *n, Int_t opt, Int_t bufsize) const
{
Error("Write const", "A const TFile object should not be saved. We try to proceed anyway.");
return const_cast<TThreadMergingFile *>(this)->Write(n, opt, bufsize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to const_cast that way here seems a bug (or a missing interface) to me.

fMergingThread->join();

fIsMerged = true;
return nullptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a FIXME would be good.

}

TIter next(&mergers);
ThreadFileMerger *info;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not moving the var definition along with the init?

return BranchNames_t(bnBegin, bnBegin + nExpectedBranches);
}

#if 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's surprising.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is WIP, as mentioned in the header of the PR. I want to be able to switch between the old and new implementations for testing. It will, of course, go away before this gets merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for explaining.

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from e1e4692 to 8048307 Compare May 4, 2017 15:28
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from 8048307 to bb590be Compare May 5, 2017 08:19
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from bb590be to 676f1bc Compare May 5, 2017 08:20
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@dpiparo
Copy link
Member

dpiparo commented May 5, 2017

Hi @amadio,
very nice. Some high level comments before diving into the details:

  • We perhaps need different names for the classes for a better programming model (ideas for the keywords: Buffer, Merge, Parallel, Server, Client, Async ...)
  • We need to check if the inheritance from TObject is always needed
  • We need to put the public interface in the ROOT::Experimental namespace
  • We need to return shtd_ptr to the clients bookkept by the server as weak_ptr by the server
  • We need to check in the dtor of the server if all clients are alive, if not throw
  • We need to put the helper components (e.g. structs not intended to be used by the user) in ROOT::Internal if they do not appear in the interface, in ROOT::Detail otherwise
  • We need to hide all the methods which are not necessary for the user, e.g. the ctor of the client
  • In general, new headers are put in the inc/ROOT directory and have the hxx extension (see TThreadedObject, TThreadedExecutor, TDataFrame....)

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from 676f1bc to 2b38bab Compare May 5, 2017 09:54
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@amadio
Copy link
Member Author

amadio commented May 5, 2017

Hi @dpiparo,

We perhaps need different names for the classes for a better programming model (ideas for the keywords: Buffer, Merge, Parallel, Server, Client, Async ...)

I would prefer to not discuss names. I think it's just easier if you and Philippe propose the names, and then I can rename the classes accordingly. I have no strong opinion on what to name them.

We need to check if the inheritance from TObject is always needed

You are right, it's probably not needed. I left them as they were in the imported code. If not inheriting from TObject is desirable, I would prefer to avoid it too.

We need to put the public interface in the ROOT::Experimental namespace

Done in my last push. I still need to move the appropriate things into Internal and Detail, though.

We need to return shtd_ptr to the clients bookkept by the server as weak_ptr by the server

In order to use either unique_ptr or shared_ptr, the constructor/destructor for the client needs to be public. What I did was to return a raw pointer, and only the server is able to create/destroy clients.
That way we have full control. If others have a different preference, I can change the details of the implementation.

We need to check in the dtor of the server if all clients are alive, if not throw

This is done now. Since only the server can create/destroy clients, there is no need to throw. The server just destroys the clients before itself gets destroyed.

We need to put the helper components (e.g. structs not intended to be used by the user) in ROOT::Internal if they do not appear in the interface, in ROOT::Detail otherwise

Yes, I will do that and push again soon. Only the ClientInfo and ThreadFileMerger classes have to be moved. I may fuse ThreadFileMerger into the server class if I can instead, so that we will have no need for either the ThreadFileMerger or ClientInfo anymore.

We need to hide all the methods which are not necessary for the user, e.g. the ctor of the client
In general, new headers are put in the inc/ROOT directory and have the hxx extension (see TThreadedObject, TThreadedExecutor, TDataFrame....)

After I either fuse the classes as mentioned above, or move them into the Internal/Detail namespaces, this will be solved as well.

@pcanal
Copy link
Member

pcanal commented May 5, 2017

We need to return shtd_ptr to the clients bookkept by the server as weak_ptr by the server

I thought we had settled on the reverse? I.e. return a unique_ptr and the client has some sort of weak_ptr to some server own memory and then we throw if the client try to access the server after it dies?

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from 2b38bab to ddfb310 Compare May 5, 2017 12:03
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@pcanal
Copy link
Member

pcanal commented May 5, 2017

If not inheriting from TObject is desirable,

Yes, do not inherit from TObject unless you use one of its facility:

  • Name/Title
  • UniqueID and/or TRef
  • storing in a ROOT collection
  • Implicit shared ownership.

In addition any class inheriting directly or indirectly from TObject must have a ClassDef.
Other classes should have a ClassDef (either NV or Override or Inline)

@amadio
Copy link
Member Author

amadio commented May 5, 2017

We need to return shtd_ptr to the clients bookkept by the server as weak_ptr by the server

I thought we had settled on the reverse? I.e. return a unique_ptr and the client has some sort of weak_ptr to some server own memory and then we throw if the client try to access the server after it dies?

There is a conflict between using unique_ptr, which requires public constructor/destructor, and having only the server being able to create them, by making the constructor/destructor private, which was suggested by Danilo. I think the solution with private constructor/destructor is nice, since the lifetime of the clients is forced to be the same as the server. The user needs to understand, however, that once the server goes down, all clients go down with it too.

@amadio
Copy link
Member Author

amadio commented May 5, 2017

Apparently, inheriting from TObject is needed, unfortunately:

root/io/io/src/TFileMergeServer.cxx:
 In member function ‘void ROOT::Experimental::TFileMergeServer::Listen()’:
root/io/io/src/TFileMergeServer.cxx:271:29:
 error: no matching function for call to 
     ‘THashTable::Add(ROOT::Experimental::ThreadFileMerger*&)’
             mergers.Add(info);
                             ^

I will try to get rid of ThreadFileMerger by just fusing its functionality into TFileMergeServer.

@pcanal
Copy link
Member

pcanal commented May 5, 2017

I will try to get rid of ThreadFileMerger by just fusing its functionality into TFileMergeServer.

At this point, I don't think that is necessary. That class is in one the use case (i.e. using a ROOT collection). So unless it is clear that replacing THashTable with an equivalent stl collection is much more performant, there is no need to waste time changing the code.

* This function must be called before the TBufferMergerFile gets destroyed,
* or no data is appended to the TBufferMerger.
*/
virtual Int_t Write(const char *name = nullptr, Int_t opt = 0, Int_t bufsize = 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the override keyword to mark what is oveloaded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, will come with next push.


/** Write StreamerInfo for objects that have not already been written. */
void WriteStreamerInfo();
};
Copy link
Member

@pcanal pcanal May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the class inherits from (indirectly) TObject, it must have a ClassDef which could be here

   ClassDefOverride(TBufferMergerFile,0);

The 0 indicate that the class's data member default to transient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for noticing that. I tried to get rid of the TObject inheritance, but it's actually needed, unfortunately. What is difference between ClassDef and ClassDefOverride?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, should I add a ClassDef to TBufferMerger? I can't get ROOT to know of its existence when running the tutorial as an interpreted script... Maybe I need a ClassImp in the source file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's actually needed, unfortunately.

Why unfortunately :) ?

What is difference between ClassDef and ClassDefOverride?

ClassDef introduce overload of some of the interface in TObject. Once you add the override keyword to any overload, this signal the compiler that you want warning if you are missing they keyword on any other overload. ClassDefOverride includes the keyword in the right places :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't get ROOT to know of its existence when running the tutorial as an interpreted script...

This likely because the auto-parsing of the header is not enable ... i.e. you need to generate a dictionary for both TBufferMergerFile (for both auto-parsing and fulling the ClassDef) and TBufferMerger (for the auto-parsing) ...

and as a general rule we need a dictionary for all user visible classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am getting errors after adding the ClassDefs you suggested.

Yes, you need to generate the dictionary for those. (And thus update the LinkDef.h file).

Copy link
Member Author

@amadio amadio May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TObject inheritance is needed for ThreadFileMerger, TBufferMerger does not inherit from TObject.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vgvassilev ThreadFileMerger, where the THashTable is needed, will go away later, so I wouldn't worry about fixing this right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do I need to do to have the dictionaries generated? I added ClassDef and ClassImp already, and I now get the errors above...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClassImp is no longer necessary/used. It was to record the name of the source file for THtml to be able to generate documentation.

To generate the dictionary, you need to update the file io/io/inc/LinkDef.h to list the new classes. You may also need to update the CMakeList.txt to make sure the new header files are passed to the dictionary generation step.


using namespace ROOT::Experimental;

static void fill(std::shared_ptr<TTree> tree, int init, int count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even for test, we ought to follow the function naming convention :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

const size_t nEventsPerWorker = nEvents / nWorkers;

// Make ROOT thread-safe
ROOT::EnableThreadSafety();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we move this must early (first statement) to make it more obvious (that it is required if you do multithreading).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, will come with next push.

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from 6d03994 to 5c73ea6 Compare May 9, 2017 20:19
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@phsft-bot
Copy link

Build failed on mac1012/native.
See console output.

Errors:

  • clang: error: linker command failed with exit code 1 (use -v to see invocation)

Warnings:

  • include/ROOT/TBufferMerger.hxx:126:9: warning: 'WriteStreamerInfo' overrides a member function but is not marked 'override' [-Winconsistent-missing-override]
  • include/ROOT/TBufferMerger.hxx:126:9: warning: 'WriteStreamerInfo' overrides a member function but is not marked 'override' [-Winconsistent-missing-override]
  • include/ROOT/TBufferMerger.hxx:126:9: warning: 'WriteStreamerInfo' overrides a member function but is not marked 'override' [-Winconsistent-missing-override]

@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from 5c73ea6 to 45c9fec Compare May 9, 2017 21:43
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@amadio
Copy link
Member Author

amadio commented May 9, 2017

Alright, I added dictionary generation and fixed the tutorial. It now runs fine, as well as the tests.

@phsft-bot
Copy link

Build failed on centos7/gcc49.
See console output.

Failing tests:

@amadio
Copy link
Member Author

amadio commented May 10, 2017

It looks like the test failed because it got stuck waiting to get data, but it never came. It doesn't get stuck on my machine, so I didn't see the problem before. I will log into the build node and fix it. That's probably the only remaining thing before we can merge this in.

@amadio
Copy link
Member Author

amadio commented May 10, 2017

I logged into the node and it seems the node may just have been overloaded when the test was run, so it timed out. I ran the test several times, and the test succeeded everytime, even though it took 3 minutes to run the first time. Running 'top' I saw that there were some heavy compilations running at the same time, taking all CPU available... Maybe we should limit the number of simultaneous jobs on each node, and/or the maximum resources a job can use, so that there is always at least one full core available to each job.

For now, I am just going to rebase and push again. I think the tests should succeed if the build node is not too busy.

amadio added 3 commits May 10, 2017 08:49
TBufferMerger is a class to facilitate writing data in
parallel from multiple threads, while writing to a single
output file. Its purpose is similar to TParallelMergingFile,
but instead of using processes that connect to a network
socket, TBufferMerger uses threads that each write to a
TBufferMergerFile, which in turn push data into a queue
managed by the TBufferMerger.

These classes were originally written by Witold Pokorski and
Philippe Canal for the GeantV project. There, they are named
TThreadedMergingServer and TThreadedMergingFile, respectively.
The imported code was then modified into the current version to
be used initially by TDataFrame classes, and later by general
ROOT users.
@amadio amadio force-pushed the jira/root-8756-write-tree-parallel branch from 45c9fec to ed8dc20 Compare May 10, 2017 06:49
@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@amadio amadio changed the title Parallel snapshot action in TDataFrame Add TBufferMerger and TBufferMergerFile to allow parallel data creating and writing to single file May 10, 2017
@amadio
Copy link
Member Author

amadio commented May 10, 2017

The failure is not related to the code in this PR, apparently:

08:50:00 Caused by: hudson.plugins.git.GitException: Command "git config remote.origin.url https://github.com/root-project/root.git" returned status code 4:
08:50:00 stdout: 
08:50:00 stderr: error: failed to write new configuration file .git/config.lock

@amadio
Copy link
Member Author

amadio commented May 10, 2017

@phsft-bot build

@phsft-bot
Copy link

Starting build on gcc49/centos7, native/mac1012, gcc49/slc6, gcc62/slc6, native/ubuntu14 with CMake flags -Dvc=OFF -Dimt=OFF -Dccache=ON

@dpiparo dpiparo merged commit db62b94 into root-project:master May 10, 2017
@dpiparo
Copy link
Member

dpiparo commented May 10, 2017

Merged. Nice job and thanks for following up all the comments which have been made.

@amadio
Copy link
Member Author

amadio commented May 10, 2017

Thanks, Danilo. I will follow up with more fixes once we get the snapshot done as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants