Skip to content

fix: log swallowed exceptions in core server components#1023

Merged
GonzoSpire merged 3 commits intomasterfrom
fix/log-swallowed-exceptions
Mar 26, 2026
Merged

fix: log swallowed exceptions in core server components#1023
GonzoSpire merged 3 commits intomasterfrom
fix/log-swallowed-exceptions

Conversation

@SoftFx
Copy link
Copy Markdown
Owner

@SoftFx SoftFx commented Mar 23, 2026

Summary

Three catch blocks in the server core were silently swallowing exceptions with no logging, making production issues impossible to diagnose.

  • TreeValuesCacheMigrateDatabaseV2: logs sensorId + exception when value deserialization fails, instead of silently continue-ing
  • AlertScheduleProviderLoadSchedulesFromDb: logs parse error with schedule Id/Name when _parser.Parse() throws, before falling back to raw entity
  • TreeStateSnapshot — constructor: added NLog logger, logs deserialization error when state snapshot fails to load instead of silently setting HasData = false

Impact

Previously these failures were completely invisible in logs. Now operators can see exactly which sensors/schedules are corrupted and investigate.

🤖 Generated with Claude Code

Three silent catch blocks were hiding errors with no diagnostic info:

- TreeValuesCache.MigrateDatabaseV2: log sensorId when value deserialization
  fails during V2 migration instead of silently skipping the record
- AlertScheduleProvider.LoadSchedulesFromDb: log parse error and entity id/name
  when schedule parsing fails and fallback to raw entity is used
- TreeStateSnapshot ctor: add NLog logger and log deserialization error when
  state snapshot fails to load, instead of silently setting HasData=false

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

PR Review: fix/log-swallowed-exceptions

Overall assessment: Clean, focused, and correct fix. The three catch-all blocks were genuine observability gaps and the logging additions are well-targeted with useful context (IDs, names). No bugs or security issues introduced.


What's good

  • Each log message includes actionable identifiers (entity.Id, entity.Name, sensorId) so operators can correlate log entries with specific data.
  • Existing behavior (fallback / continue / HasData = false) is preserved unchanged — safe, low-risk fix.
  • TreeStateSnapshot correctly uses a static readonly Logger on a sealed class.

Issues

Minor — Logger style inconsistency

TreeStateSnapshot adds:

private static readonly Logger _logger = LogManager.GetCurrentClassLogger();

But TreeValuesCache uses:

private readonly Logger _logger = LogManager.GetLogger(nameof(TreeValuesCache));

And AlertScheduleProvider uses an instance (not static) field with GetCurrentClassLogger(). All three approaches work correctly with NLog, but using a consistent pattern across the codebase reduces confusion. static readonly is the idiomatic NLog recommendation.

Minor — using directive ordering in TreeStateSnapshot.cs

The new using NLog; and using System; are inserted after the HSMDatabase.* project usings:

using HSMDatabase.AccessManager;
using HSMDatabase.AccessManager.DatabaseEntities.SnapshotEntity;
using HSMServer.Core.DataLayer;
using NLog;        // ← added here
using System;      // ← added here
using System.Threading.Tasks;

Convention (and StyleCop/Roslyn defaults) place System.* and framework namespaces before third-party and project namespaces. Suggest:

using System;
using System.Threading.Tasks;
using HSMDatabase.AccessManager;
using HSMDatabase.AccessManager.DatabaseEntities.SnapshotEntity;
using NLog;
using HSMServer.Core.DataLayer;
using HSMServer.Core.TreeStateSnapshot.States;

Worth noting — pre-existing fragile fallback in AlertScheduleProvider

This PR didn't introduce it, but the catch block in LoadSchedulesFromDatabase is now more visible:

catch (Exception ex)
{
    _logger.Error(ex, $"Failed to parse alert schedule. Id = {entity.Id}, Name = {entity.Name}. Using raw entity as fallback.");
    schedule = new AlertSchedule()
    {
        Id = new Guid(entity.Id),   // throws if entity.Id is null or malformed
        ...
    };
}

If entity.Id is null or not a valid GUID, new Guid(entity.Id) will throw a second unhandled exception that propagates out of the foreach, aborting loading of all remaining schedules. Since this PR adds logging just above it, it's now a good moment to also wrap the fallback in a nested try/catch or validate entity.Id first. Not a blocker for this PR, but worth a follow-up issue.


Summary

Category Finding Severity
Logger style Inconsistent static vs instance field across three files Nit
using order System and NLog usings out of conventional order Nit
Pre-existing Fallback in AlertScheduleProvider can throw again on bad entity.Id Follow-up

No blockers. Approve with the above nits noted.

…atabaseV2

AlertScheduleProvider: skip unparseable schedules instead of adding them
to cache as broken objects. A partially-constructed AlertSchedule with no
parsed working hours has unpredictable IsWorkingTime behavior - skipping
is safer than a fallback that silently misfires alerts.

TreeValuesCache.MigrateDatabaseV2: add success/failed counters and log
migration summary so data loss during migration is visible in logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

PR Review: fix/log-swallowed-exceptions

Overall direction is correct — silently swallowed exceptions are genuinely harmful for observability. However there are a few issues worth addressing before merging.


1. Breaking behavioral change in AlertScheduleProvider.cs (Bug)

File: src/server/HSMServer.Core/AlertSchedule/AlertScheduleProvider.cs

The old code caught parse failures and fell back to constructing a minimal AlertSchedule from raw entity data (Id, Name, Timezone, Schedule), so the schedule was still usable even if parsing failed. The new code logs and continues — the schedule is skipped entirely.

This is a significant semantic change: any schedule that previously survived a parse error via the fallback path will now simply not exist in _cache. Downstream code that looks up a schedule by ID will get a miss instead of a degraded-but-functional entry.

The PR description says "before falling back to raw entity" but the fallback has actually been removed, which makes the description inaccurate.

Recommendation: Either keep the fallback (add the log and keep the fallback object creation), or explicitly justify why dropping the schedule is safer than the fallback. If the fallback object was itself causing problems (e.g. invalid state that crashes later), document that reasoning.

// Consider: log AND fall back, rather than log AND skip
catch (Exception ex)
{
    _logger.Error(ex, $"Failed to parse alert schedule. Id = {entity.Id}, Name = {entity.Name}. Using raw entity as fallback.");
    schedule = new AlertSchedule()
    {
        Id = new Guid(entity.Id),
        Name = entity.Name,
        Timezone = entity.Timezone,
        Schedule = entity.Schedule,
    };
}

2. Inaccurate success counter in MigrateDatabseV2 (Minor Bug)

File: src/server/HSMServer.Core/Cache/TreeValuesCache.cs

success++ is placed after the if (!policy.TimeIsUp(val.Time)) block, so it counts every record that deserialized successfully — including records not actually written to the database because TimeIsUp was true (value expired).

The log message says "{success} records migrated" which implies they were stored, but expired records are counted too.

Additionally, records where GetSensor(sensorId) returns null are silently skipped with no logging and no counter. These could represent a meaningful data loss during migration.

Recommendation:

if (!policy.TimeIsUp(val.Time))
{
    _database.AddSensorValue(sensorId, val);
    success++;   // only count what was actually stored
}
// else: optionally log skipped-as-expired records

And consider a counter or log for null-sensor records:

var sensor = GetSensor(sensorId);
if (sensor == null)
{
    _logger.Warn($"Sensor not found during V2 migration, skipping. SensorId = {sensorId}");
    failed++;
    continue;
}

3. Misleading log message in TreeStateSnapshot.cs (Nit)

File: src/server/HSMServer.Core/TreeStateSnapshot/TreeStateSnapshot.cs

"Starting with empty state" could be misleading if part of the snapshot was processed before the exception occurred (e.g., some sensors or keys already populated). A more precise message like "Snapshot deserialization failed, discarding partial state" would better reflect what actually happens.


Summary

Severity File Issue
Medium AlertScheduleProvider.cs Fallback silently removed — schedules now dropped on parse error instead of degraded-but-loaded
Low TreeValuesCache.cs success counter over-counts (includes expired/not-stored records); null-sensor records unaccounted
Nit TreeStateSnapshot.cs Log message may be slightly misleading

The logging additions themselves are the right call. The TreeStateSnapshot change is clean. The main concern is the AlertScheduleProvider behavioral change which could manifest as missing alert schedules in production.

@github-actions
Copy link
Copy Markdown

PR Review: fix/log-swallowed-exceptions

Overall direction is correct — silent catch blocks are a real observability problem. However, there are a few bugs and a behavioral regression introduced by the fix.


Bug: entity.Id logs as "System.Byte[]"AlertScheduleProvider.cs:138

AlertScheduleEntity.Id is byte[] (see HSMDatabase.AccessManager/DatabaseEntities/AlertScheduleEntity.cs). String-interpolating it directly produces "System.Byte[]" in the log message, which is useless for diagnosis.

The original code correctly converted it: new Guid(entity.Id).

Current (broken):

_logger.Error(ex, $"Failed to parse alert schedule. Id = {entity.Id}, Name = {entity.Name}. ...");

Fix:

_logger.Error(ex, $"Failed to parse alert schedule. Id = {new Guid(entity.Id)}, Name = {entity.Name}. ...");

Behavioral regression: corrupt schedule is now silently dropped — AlertScheduleProvider.cs:139

The old code fell back to a stub AlertSchedule populated from the raw entity fields, keeping the schedule ID present in _cache. The new code does continue, removing the entry entirely.

This is a breaking behavior change: any downstream code that looks up a schedule by ID (e.g., when evaluating alert policies) will now get a cache miss for a schedule that previously existed. Depending on how callers handle null or missing cache entries, this could surface as NullReferenceExceptions or silently broken alerting for sensors using that schedule.

The PR description still says "before falling back to raw entity," but the implementation no longer does that.

If skipping is truly the intended behavior, callers that reference this schedule ID need to be audited for how they handle missing entries.


success counter is misleading — TreeValuesCache.cs:1869

success++ increments for every record where the sensor was found and deserialization succeeded, including records filtered out by policy.TimeIsUp(val.Time) (i.e., records that were intentionally not written to the database). The summary log says "records migrated" which implies they were written.

Additionally, records where GetSensor(sensorId) returns null are silently skipped — they're counted in neither success nor failed. The migration summary gives no indication of how many records were skipped this way.

Minor, but the counter name and log message should either reflect "successfully processed" instead of "migrated," or the counter should only increment after AddSensorValue is called.


AddSensorValue exceptions are unhandled — TreeValuesCache.cs:1866

The PR adds a catch around ConvertFromJson, but _database.AddSensorValue(sensorId, val) at line 1866 remains unguarded. If it throws, the entire migration aborts mid-loop with no summary log, and any partial progress is lost with no indication of where it stopped. This was pre-existing, but in the context of a PR that's hardening this method, it's worth addressing.


TreeStateSnapshot.cs — looks good

Adding the logger and logging the deserialization error before setting HasData = false is correct and clean. No issues here.

@GonzoSpire GonzoSpire merged commit b86fef0 into master Mar 26, 2026
1 check passed
@GonzoSpire GonzoSpire deleted the fix/log-swallowed-exceptions branch March 26, 2026 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants