CT 1998 use google protobuf to enable more flexible dictionaries#7190
Conversation
jtcohen6
left a comment
Very cool! Really appreciated the walkthrough & recording :)
I tried this out locally, and I didn't see any obvious changes in JSON-formatted logs. No blocking comments from me.
    LOG_VERSION = 3
    metadata_vars: Optional[Dict[str, str]] = None

    nofile_codes = ["Z012", "Z013", "Z014", "Z015"]
Since we can no longer do this via class inheritance (NoFile), we need to explicitly state here which types are "no file."
This is fine IMO - we don't have so many of these things - but it does feel trickier to test, now that these are no longer attributes.
(After watching the recorded walkthrough) Could we leave an inline code comment here? These events are specific to dbt clean, and we can't write file logs during clean because the log file might be one of the things that dbt is cleaning (= race condition)
I added a comment. In theory we could still use something like classes or method flags, but we'd have to get the class from msg.info.name, do getattr(types, class_name), and then check for the existence of another class or execute a class method. So that's still an option if some other flag comes along that makes it worth it.
cool! agree that, in this case, it doesn't feel worth it — hard-coding the codes is a right-sized solution
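The hard-coded approach discussed above can be sketched like this. This is an illustration, not dbt's actual implementation: the `Info`/`Msg` classes are stand-ins, and only `nofile_codes` and the `msg.info.code` lookup come from the discussion.

```python
from dataclasses import dataclass


@dataclass
class Info:
    code: str


@dataclass
class Msg:
    info: Info


# These events are specific to `dbt clean`; writing them to the log file
# would race with clean possibly deleting that same file.
nofile_codes = ["Z012", "Z013", "Z014", "Z015"]


def should_write_to_file(msg: Msg) -> bool:
    # Explicit membership check replaces the old NoFile class-inheritance test.
    return msg.info.code not in nofile_codes
```
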
| "node_started_at": self._event_status.get("started_at"), | ||
| "node_finished_at": self._event_status.get("finished_at"), | ||
| "meta": meta_stringified, | ||
| "meta": getattr(self, "meta", {}), |
core/dbt/contracts/results.py
| msg_dict["started_at"] = self.started_at.strftime("%Y-%m-%dT%H:%M:%SZ") | ||
| if self.completed_at: | ||
| msg_dict["completed_at"] = self.completed_at.strftime("%Y-%m-%dT%H:%M:%SZ") | ||
| return msg_dict |
Is this worth a utility function? It also looks like this is different from timestamp_to_datetime_string - just based on the precision we want ("%H:%M:%S.%f" vs "%Y-%m-%dT%H:%M:%SZ") - for parity with status quo?
I created a datetime_to_json_format_string utility method and used it here and in base_types.py. The other one (timestamp_to_datetime_string) goes in a different direction, from a timestamp to a string, and the resulting string is shorter.
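The utility described above amounts to a one-line formatter; a minimal sketch (the function name matches the comment, the implementation simply applies the "%Y-%m-%dT%H:%M:%SZ" precision discussed above):

```python
from datetime import datetime


def datetime_to_json_format_string(dt: datetime) -> str:
    # Second-level precision with a literal "Z" suffix, matching the
    # started_at/completed_at formatting in the diff above.
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
```
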
    fire_event(
        TimingInfoCollected(
            timing_info=self.timing_info.to_msg(), node_info=get_node_info()
            timing_info=self.timing_info.to_msg_dict(), node_info=get_node_info()
In practice, do we expect this change (nested message → nested dict) to be breaking in any way? My sense is that it looks the same, if the event is being serialized to JSON, which is our only official current interface for consuming structured logs
I think it should have the same output. In betterproto we had to provide the pre-constructed sub-messages (which is the way Python normally works), and with google protobuf you can't.
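A plain-dict sketch of why this shouldn't break JSON log consumers: once serialized, a nested dict with the same fields produces the same JSON a nested message would. The field names here are illustrative, not the actual event schema.

```python
import json

# timing_info is now already a plain dict rather than a message object,
# but the serialized JSON a consumer sees is unchanged.
timing_as_dict = {"name": "execute", "started_at": "2023-03-01T12:00:05Z"}
event = {"timing_info": timing_as_dict}
serialized = json.dumps(event, sort_keys=True)
```
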
core/dbt/events/base_types.py
    except Exception as exc:
        raise Exception(f"[{class_name}]: Unable to parse dict {kwargs},\n exc: {exc}")
As discussed during the live walkthrough, we want to stop throwing an exception (and stopping the run), and instead just log a warning that we failed to serialize the event.
@gshank made the point that it's a bit weird to fire an event within the code for firing events. I don't have strong opinions here on the implementation — just that, as an end user, I would want dbt to more gracefully handle serialization errors.
The nested fire_event seems to be working okay. Added a test for it.
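The agreed-on behavior can be sketched as follows. This is not dbt's actual API: `build_event_msg` and the stdlib-logging fallback are illustrative stand-ins for the warn-and-continue pattern (dbt itself fires a nested warning event, as noted above).

```python
import logging

logger = logging.getLogger("dbt.events")


def build_event_msg(event_cls, **kwargs):
    # If the event can't be constructed from the given dict, warn and keep
    # running instead of raising and stopping the run.
    try:
        return event_cls(**kwargs)
    except Exception as exc:
        logger.warning(
            "[%s]: Unable to parse dict %s, exc: %s",
            event_cls.__name__, kwargs, exc,
        )
        return None
```
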
Steps taken to mitigate the drawbacks of Google protobuf from above:
* We are using a wrapping class around the logging events to enable a constructor that looks more like a Python constructor, as long as only keyword arguments are used.
* The generated file is skipped in the pre-commit config.
* We can live with the awkward interfaces. It's just code.
We can live with the awkward interfaces. It's just code.
:)
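The wrapping-class idea described above can be sketched like this. `GeneratedNodeInfo` is a stand-in for a google-protobuf-generated class, and the field names are illustrative; only the keyword-only-constructor pattern comes from the discussion.

```python
class GeneratedNodeInfo:
    """Stand-in for a google-protobuf-generated message class."""

    def __init__(self):
        self.node_name = ""
        self.node_path = ""


class NodeInfo:
    """Thin wrapper giving the generated message a Python-like constructor."""

    def __init__(self, *, node_name: str = "", node_path: str = ""):
        # Keyword-only arguments keep call sites readable and unambiguous.
        self.pb_msg = GeneratedNodeInfo()
        self.pb_msg.node_name = node_name
        self.pb_msg.node_path = node_path
```
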
resolves #6832
Description
In order to have more flexible dictionaries in our logging events, we need the protobuf Struct data type, which isn't supported by betterproto. In addition, betterproto only supports Optional in a beta version. To get Structs, we are switching to google protobuf instead.
There are a number of implementation differences between betterproto and google protobuf. In google protobuf the generated "Python" classes are not very Python-like, and the interfaces are odd. Betterproto supported a datetime-like timestamp field, which google protobuf handles differently. Google protobuf can instantiate messages from nested dictionaries, and it catches type mismatches that betterproto didn't. With betterproto it was necessary to pre-construct nested messages; this can't be done with google protobuf. The interfaces for serializing also differ.
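For context, the Struct type mentioned above is one of protobuf's well-known types; a schema sketch of how a message field can hold an arbitrary JSON-like dictionary (the message and field names here are illustrative, not dbt's actual schema):

```proto
syntax = "proto3";

import "google/protobuf/struct.proto";

message NodeInfo {
  string node_name = 1;
  // Arbitrary JSON-like dictionary, the flexibility betterproto lacked
  google.protobuf.Struct meta = 2;
}
```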
Checklist
changie new to create a changelog entry