-
Notifications
You must be signed in to change notification settings - Fork 0
Add CIF value reader #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cd7b3f7 to
a404d19
Compare
klywang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! There was just a typo in the docstring.
I think I had a few questions here or there, but none of them are important or require changes (unless you feel like it).
Also I didn't look too closely at the tests.
Co-authored-by: Kelly Wang <[email protected]>
joaander
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks very good. I have a few suggestions for you to consider.
Key-Value reader
Code has been added to read key-value pairs from a CIF file. By default, all valid keys are read but a smaller list of keys can be provided to narrow down the information. Users have an option to search for only numeric data, or all valid keys and when possible data is converted to the expected numeric datatype. A special-purpose reader for cell lengths and angles is included, as this is the most commonly required functionality within the Glotzer group.
Design choices
Default return values as strings
By default, users are returned as much data as possible - all values are returned as strings, including precision notes (e.g. 1.234(5)) but excluding comments. Using the
only_read_numericskey lets us copy only data that can be safely cast into a numeric datatype, and performs that conversion. Integers are preferred where possible (e.g.180), and floats are used in all other cases (e.g.6.789).Multiline values
CIF files allow for the creation of multiline values, enclosed in semicolon blocks. The current implementation DOES NOT handle these, and will instead skip those keys. I will add this functionality in a later PR, but doing so requires a totally different approach and is both slower and more complicated. Very few pieces of data use the multiline notation, and those that do are typically for administrative/non-scientific keys: for example,
_audit_update_record(Upload and download dates for the file) or_publ_section_title(the full title of the journal article where the file was published).Why not np.fromregex?
np.fromregexcan recreate our current code - however, it is slower than line-by-line iteration and is far less flexible. With the current approach, we can use a single generator function for all of the key-value readers, and use the most general reader function for special cases like reading box values for Freud/EBT. The code is somewhat simpler when using Numpy, but realistically most of the difficulty comes from understanding the regex rather than the actual data parsing logic.TODOs