[Don't Merge][Need discussion] First-step toward a working mp on Travis #11521
Conversation
If I understand correctly (I'm by no means an expert in multithreading/multiprocessing), in the implementation proposed in this PR, the batch is dropped and a warning is raised. Is that right? If it's not too much trouble to implement, I believe the best option is to recompute the sample and raise a warning, for those reasons: [...]

I might be wrong on certain points, as I don't know much about all this.
    
        
          
keras/utils/data_utils.py (Outdated)

```diff
  future.idx = i
  self.queue.put(
-     executor.apply_async(get_index, (self.uid, i)), block=True)
+     future, block=True)
```
I don't think the line needs to be broken.
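For context, a minimal, self-contained sketch of the producer pattern these lines implement (the worker function body, pool size, and queue size are illustrative assumptions, not the PR's actual values): an `AsyncResult` returned by `apply_async` is a plain Python object, so the batch index can be attached to it before queuing, which is what later lets the consumer report which batch was lost.

```python
import multiprocessing as mp
import queue

def get_index(uid, i):
    # Stand-in for the real worker function that computes batch i.
    return ('batch', uid, i)

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    futures = queue.Queue(maxsize=10)
    for i in range(5):
        future = pool.apply_async(get_index, ('uid-0', i))
        # AsyncResult is a regular object, so we can tag it with the
        # index it corresponds to; the consumer reads future.idx later.
        future.idx = i
        futures.put(future, block=True)
    pool.close()
    pool.join()
```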
        
          
keras/utils/data_utils.py (Outdated)

```python
"An input could not be retrieved."
" It could be because a worker has died."
"We do not have any information on the lost sample."
.format(),
```
I don't believe the `.format()` is useful.
+1, maybe you intended to display the index in the warning?
Yeah, but we cannot know the index in a generator.
I am curious about the interaction with `fit_generator` when batches are dropped. Does `while steps_done < steps_per_epoch:` just complete fine? Are other batches drawn at random instead of the dropped ones?
        
          
keras/utils/data_utils.py (Outdated)

```python
except mp.TimeoutError:
    idx = future.idx
    warnings.warn(
        "The input {} could not be retrieved."
```
Nit: use ' as quote character for consistency.
Same below.
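To make the timeout mechanism concrete, here is a hedged sketch of the consumer side (the 30-second timeout and the helper name `iter_batches` are assumptions, not the PR's actual code; it also follows the single-quote nit above). A future belonging to a dead worker never resolves, so `get` must be bounded by a timeout:

```python
import multiprocessing as mp
import warnings

def iter_batches(futures, timeout=30):
    # futures: AsyncResult objects tagged with .idx by the producer.
    # A dead worker never completes its future, so .get() would block
    # forever without a timeout.
    for future in futures:
        try:
            yield future.get(timeout=timeout)
        except mp.TimeoutError:
            warnings.warn('The input {} could not be retrieved.'
                          ' It could be because a worker has died.'
                          .format(future.idx), UserWarning)
```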
        
          
As of now, there is no way to inform [...]

Quick note that the "real" fix for Travis is to use `spawn` instead of `fork`. The issue is that it's only doable with Python 3.
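For reference, a sketch of what the `spawn` fix could look like (the helper name `make_pool` is hypothetical; `mp.get_context` exists only on Python 3.4+, which is why this is Python 3 only):

```python
import multiprocessing as mp
import sys

def make_pool(workers=2):
    # 'spawn' starts each worker as a fresh interpreter process instead
    # of fork()ing the parent, so workers do not inherit the parent's
    # locks and threads (a common source of CI deadlocks).
    if sys.version_info >= (3, 4):
        return mp.get_context('spawn').Pool(processes=workers)
    # Python 2 has no start-method control; fall back to the default fork.
    return mp.Pool(processes=workers)
```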
    
          
I think it would be acceptable to only test this functionality with Python 3. Our #1 priority should be to make CI reliable and fast.
Currently, most of the timeouts happen with CNTK and Python 3.6, so fixing the timeouts for 3.6 would be a very good first step.
Multiprocessing tests on `Sequence`s are now running on all backends. We do see some GitHub issues where training stops (when using HDF5 and/or OOM).
A few improvements to the tests are possible.
Since I don't know much about multiprocessing, a second review (maybe from @fchollet) would be beneficial.
        
          
tests/keras/utils/data_utils_test.py (Outdated)

```python
for _ in range(11):
    next(gen_output)
assert "The input {} could not be retrieved.".format(
    missing_idx) in str(w[-1].message)
```
The test will fail if other warnings are being emitted. I would recommend using pytest's utility to check that warnings are being triggered correctly: https://docs.pytest.org/en/latest/warnings.html#warns
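For illustration, a sketch of the suggested `pytest.warns` pattern, reusing `gen_output` and `missing_idx` from the snippet above as if they were available in the test (the exact message regex is an assumption):

```python
import pytest

def test_lost_input_warns(gen_output, missing_idx):
    # pytest.warns fails the test unless a matching warning is raised,
    # and unrelated warnings emitted along the way do not break it.
    with pytest.warns(UserWarning,
                      match='The input {} could not be retrieved.'
                            .format(missing_idx)):
        for _ in range(11):
            next(gen_output)
```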
        
          
tests/keras/utils/data_utils_test.py (Outdated)

```python
warnings.simplefilter("always")
for _ in range(4 * missing_idx):
    next(gen_output)
assert 'An input could not be retrieved.' in str(w[-1].message)
```
Same here.
Ping @fchollet, could you review when you have the time? Thank you.
LGTM as far as I can tell. Shall we merge it?
I think this is LGTM; this would give the user "some" information, and we can move forward with using "spawn" when possible on Travis.
LGTM.
Thanks @Dref360 for working on this!
Summary
This PR addresses issues where a worker dies and the Pool is not told (OOM, pkill, etc.). It puts a timeout on the future and notifies the user of the issue.
Discussion
What should we do with those samples in the case of a `Sequence`? Should we re-queue the task, compute the sample directly, or just drop it?
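To illustrate the trade-off: a `Sequence` supports `__getitem__`, so the index of a lost batch is known and the sample could be recomputed in the main process, whereas a generator has no index, so dropping is the only option. A hedged sketch of the recompute variant (`future.idx` is the tag set by the producer; the helper name and timeout are assumptions):

```python
import multiprocessing as mp
import warnings

def get_or_recompute(future, sequence, timeout=30):
    # If the worker died, fall back to computing the batch synchronously
    # in the main process via the Sequence's __getitem__.
    try:
        return future.get(timeout=timeout)
    except mp.TimeoutError:
        warnings.warn('Recomputing input {} in the main process.'
                      .format(future.idx), UserWarning)
        return sequence[future.idx]
```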