
Conversation

@sschmidt23 (Collaborator)

Addressing #41, I looked for instances where we were not using numpy.random.default_rng and/or were not setting the seed from a config parameter for the stage. Having every stage that uses a random number generator construct it with default_rng and a configurable seed should let us isolate the effects of the rng to that particular stage, and easily change the seed via the config parameter to test how the randoms affect stage performance.
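For concreteness, the pattern being adopted looks roughly like this (a minimal sketch, not actual RAIL code; the stage class and the config object's seed attribute are illustrative):

```python
import numpy as np

class ExampleStage:  # hypothetical stage, for illustration only
    def __init__(self, config):
        self.config = config  # assumed to carry a `seed` parameter

    def run(self, n_draws):
        # Seeding default_rng from the stage's own config isolates this
        # stage's randomness and makes the seed easy to vary per run.
        rng = np.random.default_rng(seed=self.config.seed)
        return rng.uniform(0.0, 1.0, size=n_draws)
```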

@sschmidt23 linked an issue Apr 28, 2023 that may be closed by this pull request

codecov bot commented Apr 28, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (82da866) 100.00% compared to head (6e4ccc4) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #354   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           38        38           
  Lines         2502      2582   +80     
=========================================
+ Hits          2502      2582   +80     
Flag Coverage Δ
unittests 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                                       Coverage Δ
src/rail/creation/degradation/grid_selection.py      100.00% <100.00%> (ø)
src/rail/estimation/algos/knnpz.py                   100.00% <100.00%> (ø)
src/rail/estimation/algos/randomPZ.py                100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes


@sschmidt23 requested review from aimalz and drewoldag April 28, 2023 20:23
@aimalz (Collaborator) left a comment:


These all look good, thanks for catching and fixing them! EDIT: Actually, does it matter if random_seed isn't explicitly in the init unpacking of the config parameters in two of them?

@drewoldag (Collaborator) left a comment:


This looks OK to me. Thanks for taking care of it.

```python
# allow for either format for now
numzs = len(data[self.config.column_name])
zmode = np.round(np.random.uniform(0.0, self.config.rand_zmax, numzs), 3)
rng = np.random.default_rng(seed=self.config.seed)
```
Review comment (Collaborator):

This is probably fine, but it stood out a little. The goal here is that every time _process_chunk is called, it would use the same random value, not a new random value for each call to _process_chunk, correct?

```python
nobs = colordata.shape[0]
rng = np.random.default_rng
perm = rng().permutation(nobs)
rng = np.random.default_rng(seed=self.config.seed)
```
Review comment (Collaborator):

No need to update, but this file is using 0 as the default seed; it just seems like we could do better 🤷

@eacharles (Collaborator) commented Apr 28, 2023 via email

@sschmidt23 (Collaborator, Author) commented Apr 28, 2023

I hadn't even thought of the chunk number; that's a good point, Eric. I think you're right: in our current setup, with numpy.random.default_rng(seed=self.config.seed) inside the _process_chunk function, the generator resets to the same seed at the start of each chunk. Setting the seed to seed + chunknum fixes that. The other option would be to create the rng in the init as self.rng, so that it is set up only once during init rather than reset at each call of _process_chunk.
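A minimal sketch of the two behaviors (the class and chunk interface here are illustrative, not the exact RAIL API):

```python
import numpy as np

class ChunkedStage:  # hypothetical, for illustration only
    def __init__(self, config):
        self.config = config
        # Alternative: create the generator once here; successive calls
        # to _process_chunk then advance this one generator rather than
        # resetting it.
        self.rng = np.random.default_rng(seed=config.seed)

    def _process_chunk(self, start, end):
        # The problem: re-seeding with a fixed seed on every call makes
        # each chunk draw the identical "random" sequence.
        rng = np.random.default_rng(seed=self.config.seed)
        return rng.uniform(size=end - start)
```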

@sschmidt23 (Collaborator, Author)

Checking through things in RAIL, the only parallelized estimator that uses an rng in _process_chunk is randomPZ; I changed its seed initialization to seed = self.config.seed + start, as sketched below. All other uses of a random number generator are in non-parallelized functions and thus do not need this addition. However, any future parallelized stage that draws randoms in a chunked function will have to initialize its seed similarly (e.g. the open PR on somocluSOM; I'll push a change to that branch now as well).
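Applied to the randomPZ snippet quoted above, the adopted fix looks roughly like this (a sketch; the class wrapper and method signature are illustrative, and only the seed-offset line mirrors the PR's change):

```python
import numpy as np

class RandomPZLike:  # hypothetical wrapper, for illustration only
    def __init__(self, config):
        self.config = config

    def _process_chunk(self, start, end, data):
        # The fix: offset the configured seed by the chunk's start index,
        # so each chunk draws distinct but reproducible randoms.
        rng = np.random.default_rng(seed=self.config.seed + start)
        numzs = len(data[self.config.column_name])
        return np.round(rng.uniform(0.0, self.config.rand_zmax, numzs), 3)
```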

I'll look through GPz_v1, FlexZBoost, Delight, and BPZ_Lite now to see if I need to make any changes on those repos.

@sschmidt23 merged commit a015dce into main May 3, 2023
@sschmidt23 deleted the issue/41/rando branch May 3, 2023 20:15


Development

Successfully merging this pull request may close these issues.

managing random seeds uniformly
