Pex 552/on demand detection triggers #416
Changes from 11 commits
```diff
@@ -69,14 +69,30 @@ class TimeoutConfig(IntEnum):
     """

     # base amount to sleep for before beginning exponential backoff during testing
-    BASE_SLEEP = 60
+    BASE_SLEEP = 2

     # NOTE: Some detections take longer to generate their risk/notables than others; testing has
-    # shown 270s to likely be sufficient for all detections in 99% of runs; however we have
-    # encountered a handful of transient failures in the last few months. Since our success rate
-    # is at 100% now, we will round this to a flat 300s to accomodate these outliers.
+    # shown 30s to likely be sufficient for all detections in 99% of runs, and less than 1% of detections
+    # would need 60s to 90s to wait for risk/notables; therefore 30s is a reasonable interval for max
+    # wait time.
     # Max amount to wait before timing out during exponential backoff
-    MAX_SLEEP = 300
+    MAX_SLEEP = 30

+    # NOTE: Based on testing, 99% of detections will generate risk/notables within 30s, and the remaining 1% of
+    # detections may need up to 150s to finish; so this is a reasonable total maximum wait time
+    # Total wait time before giving up on waiting for risk/notables to be generated
+    TOTAL_MAX_WAIT = 180

+    # NOTE: Based on testing, about 1% of detections could not generate risk/notables within a single dispatch
+    # and needed to be retried; 90s is a reasonable wait time before retrying dispatching the SavedSearch
+    # Wait time before retrying dispatching the SavedSearch
+    RETRY_DISPATCH = 90

+    # NOTE: Based on testing, 99% of detections will generate risk/notables within 30s, and the validation of risks
+    # and notables takes around 5 to 10 seconds; so before adding additional wait time, we let the validation
+    # process act as the default wait until we reach ADD_WAIT_TIME, then add additional wait time
+    # Time elapsed before adding additional wait time
+    ADD_WAIT_TIME = 30

     # TODO (#226): evaluate sane defaults for timeframe for integration testing (e.g. 5y is good
```
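As a reading aid only (not code from this PR): a minimal sketch of how `BASE_SLEEP`, `MAX_SLEEP`, and `TOTAL_MAX_WAIT` could fit together in a backoff loop, assuming a hypothetical `events_present()` callable; `RETRY_DISPATCH` and `ADD_WAIT_TIME` are used by separate logic not shown here.

```python
import time

# Hypothetical standalone illustration of the constants above; not the PR's actual loop.
BASE_SLEEP = 2        # first sleep interval (seconds)
MAX_SLEEP = 30        # cap on any single sleep interval
TOTAL_MAX_WAIT = 180  # give up on risk/notables entirely after this long


def wait_for_events(events_present) -> bool:
    """Exponential backoff: sleep 2s, 4s, 8s, 16s, then 30s at a time, up to TOTAL_MAX_WAIT."""
    elapsed = 0
    sleep_for = BASE_SLEEP
    while elapsed < TOTAL_MAX_WAIT:
        if events_present():
            return True
        time.sleep(sleep_for)
        elapsed += sleep_for
        sleep_for = min(sleep_for * 2, MAX_SLEEP)
    return False
```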
```diff
@@ -88,7 +104,7 @@ class ScheduleConfig(StrEnum):

     EARLIEST_TIME = "-5y@y"
     LATEST_TIME = "-1m@m"
-    CRON_SCHEDULE = "*/1 * * * *"
+    CRON_SCHEDULE = "0 0 1 1 *"


 class ResultIterator:
```
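For reference, the old schedule `*/1 * * * *` ran the saved search every minute, while `0 0 1 1 *` only fires at 00:00 on January 1st, so the scheduler presumably never triggers the detection on its own once searches are dispatched on demand. A quick way to sanity-check a cron expression, assuming the third-party croniter package is available:

```python
from datetime import datetime, timezone
from croniter import croniter  # third-party package; assumed available for illustration

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(croniter("*/1 * * * *", now).get_next(datetime))  # one minute later
print(croniter("0 0 1 1 *", now).get_next(datetime))    # 2025-01-01 00:00 UTC
```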
```diff
@@ -202,6 +218,9 @@ class CorrelationSearch(BaseModel):
     # cleanup of this index
     test_index: str | None = Field(default=None, min_length=1)

+    # The search ID of the last dispatched search; this is used to query for risk/notable events
+    sid: str | None = Field(default=None)
+
     # The logger to use (logs all go to a null pipe unless ENABLE_LOGGING is set to True, so as not
     # to conflict w/ tqdm)
     logger: logging.Logger = Field(
```
```diff
@@ -437,6 +456,34 @@ def enable(self, refresh: bool = True) -> None:
         if refresh:
             self.refresh()

+    def dispatch(self) -> splunklib.Job:
+        """Dispatches the SavedSearch
+
+        Dispatches the SavedSearch entity, returning a Job object representing the search job.
+        :return: a splunklib.Job object representing the search job when the SavedSearch is finished running
+        """
+        self.logger.debug(f"Dispatching {self.name}...")
+        try:
+            job = self.saved_search.dispatch(trigger_actions=True)
+
+            time_to_execute = 0
+            # Check if the job is finished
+            while not job.is_done():
+                self.logger.info(f"Job {job.sid} is still running...")
+                time.sleep(1)
+                time_to_execute += 1
+            self.logger.info(
+                f"Job {job.sid} has finished running in {time_to_execute} seconds."
+            )
+
+            self.sid = job.sid
+
+            return job  # type: ignore
+        except HTTPError as e:
+            raise ServerError(
+                f"HTTP error encountered while dispatching detection: {e}"
+            )
+
     def disable(self, refresh: bool = True) -> None:
         """Disables the SavedSearch
```
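The retry behavior discussed in the review thread (up to 3 attempts spaced by `RETRY_DISPATCH`) is not part of this hunk; a minimal sketch of how a wrapper around `dispatch()` might look, with `MAX_RETRIES` and `wait_for_risk_and_notables()` as hypothetical names not taken from this diff:

```python
import time

MAX_RETRIES = 3      # hypothetical cap; the review thread mentions a maximum of 3 retries
RETRY_DISPATCH = 90  # mirrors TimeoutConfig.RETRY_DISPATCH from the hunk above


def dispatch_with_retries(search) -> None:
    """Re-dispatch the SavedSearch if risk/notables never appear for the last sid."""
    for attempt in range(1, MAX_RETRIES + 1):
        job = search.dispatch()  # blocks until the search job is done, records search.sid
        if search.wait_for_risk_and_notables():  # hypothetical helper; not in this diff
            return
        search.logger.warning(
            f"Attempt {attempt}: no risk/notable events for sid {job.sid}; "
            f"retrying in {RETRY_DISPATCH}s"
        )
        time.sleep(RETRY_DISPATCH)
    raise TimeoutError("risk/notable events never appeared after retries")
```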
```diff
@@ -496,6 +543,10 @@ def force_run(self, refresh: bool = True) -> None:
         self.update_timeframe(refresh=False)
         if not self.enabled:
             self.enable(refresh=False)
+            job = self.dispatch()
+            self.logger.info(
+                f"Finished running detection '{self.name}' with job ID: {job.sid}"
+            )
         else:
             self.logger.warning(f"Detection '{self.name}' was already enabled")
```
```diff
@@ -535,10 +586,15 @@ def get_risk_events(self, force_update: bool = False) -> list[RiskEvent]:
         # TODO (#248): Refactor risk/notable querying to pin to a single savedsearch ID
-        # Search for all risk events from a single scheduled search (indicated by orig_sid)
-        query = (
-            f'search index=risk search_name="{self.name}" [search index=risk search '
-            f'search_name="{self.name}" | tail 1 | fields orig_sid] | tojson'
-        )
+        if self.sid is None:
+            self.cleanup()
```
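For context on the thread below: the removed query located the most recent scheduled run via an `orig_sid` subsearch; with `self.sid` now recorded by `dispatch()`, the query can presumably be pinned directly to that sid. A hedged sketch of what such a query string might look like (the exact SPL in the final revision may differ):

```python
# Illustrative only; detection_name and sid stand in for self.name and self.sid,
# and orig_sid follows the field already used by the removed subsearch above.
detection_name = "Example Detection Rule"
sid = "example_dispatched_search_sid"
query = (
    f'search index=risk search_name="{detection_name}" '
    f'orig_sid="{sid}" | tojson'
)
print(query)
```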
Given that we are presently relying on the caller to do Attack Data Cleanup, I think this change is okay and is a small speed optimization of a second or two per test, since it runs less SPL (and if we passed test_index as a value other than None, having this cleanup function remove attack data would cause issues here anyway).
tl;dr - I would remove the cleanup check at the start of every test.
For the second point, I wonder what the use case is for for_sid set to False? Since we allow a maximum of 3 retries, it feels a bit risky to simply search for any events matching the detection; that might produce duplicate risk/notable objects. Suppose we hit an extreme case where the risk/notable objects aren't generated on the first attempt but are somehow generated during the second attempt.
For option 2, I was only suggesting that bool field for the case where we still wanted to do cleanup pre-test. But I agree with Eric for all the reasons mentioned. I would remove this code path (or throw when self.sid is None) and remove the pre-test cleanup.
Removed the cleanup check and removed the if self.sid is None: code path.
Same question as above
Same reason as above.
See my comment above
Removed the if self.sid is None: code path.
As mentioned earlier, max_total_wait and max_wait are slightly confusing as variable names
After refactoring, the max_total_wait = TimeoutConfig.TOTAL_MAX_WAIT is no longer needed.