Skip to content

Conversation

@janbridley
Copy link
Collaborator

@janbridley janbridley commented Apr 5, 2024

Key-Value reader

Code has been added to read key-value pairs from a CIF file. By default, all valid keys are read but a smaller list of keys can be provided to narrow down the information. Users have an option to search for only numeric data, or all valid keys and when possible data is converted to the expected numeric datatype. A special-purpose reader for cell lengths and angles is included, as this is the most commonly required functionality within the Glotzer group.

Design choices

  • Default return values as strings
    By default, users are returned as much data as possible - all values are returned as strings, including precision notes (e.g. 1.234(5)) but excluding comments. Using the only_read_numerics key lets us copy only data that can be safely cast into a numeric datatype, and performs that conversion. Integers are preferred where possible (e.g. 180), and floats are used in all other cases (e.g. 6.789).

  • Multiline values
    CIF files allow for the creation of multiline values, enclosed in semicolon blocks. The current implementation DOES NOT handle these, and will instead skip those keys. I will add this functionality in a later PR, but doing so requires a totally different approach and is both slower and more complicated. Very few pieces of data use the multiline notation, and those that do are typically for administrative/non-scientific keys: for example, _audit_update_record (Upload and download dates for the file) or _publ_section_title (the full title of the journal article where the file was published).

  • Why not np.fromregex?
    np.fromregex can recreate our current code - however, it is slower than line-by-line iteration and is far less flexible. With the current approach, we can use a single generator function for all of the key-value readers, and use the most general reader function for special cases like reading box values for Freud/EBT. The code is somewhat simpler when using Numpy, but realistically most of the difficulty comes from understanding the regex rather than the actual data parsing logic.

TODOs

@janbridley janbridley self-assigned this Apr 5, 2024
@janbridley janbridley marked this pull request as ready for review April 5, 2024 16:44
@janbridley janbridley requested a review from klywang April 5, 2024 17:04
@janbridley janbridley force-pushed the feature/read-values branch from cd7b3f7 to a404d19 Compare April 10, 2024 13:45
Copy link

@klywang klywang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! There was just a typo in the docstring.

I think I had a few questions here or there, but none of them are important or require changes (unless you feel like it).

Also I didn't look too closely at the tests.

Copy link
Member

@joaander joaander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks very good. I have a few suggestions for you to consider.

@janbridley janbridley merged commit dfd640c into main May 22, 2024
@janbridley janbridley deleted the feature/read-values branch May 22, 2024 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants