Add CIF value reader #4

janbridley · 2024-04-05T16:30:07Z

Key-Value reader

This PR adds code to read key-value pairs from a CIF file. By default, all valid keys are read, but a shorter list of keys can be provided to narrow the results. Users can choose to read only numeric data or all valid keys, and data is converted to the expected numeric datatype when possible. A special-purpose reader for cell lengths and angles is included, as this is the most commonly required functionality within the Glotzer group.

Design choices

Default return values as strings
By default, users get as much data as possible: all values are returned as strings, including precision notes (e.g. 1.234(5)) but excluding comments. Setting only_read_numerics keeps only data that can be safely cast to a numeric datatype and performs that conversion. Integers are preferred where possible (e.g. 180), and floats are used in all other cases (e.g. 6.789).
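A minimal sketch of the integer-before-float preference described above. The helper name `str_to_number` is hypothetical (it is not the PR's `_str2num`), and returning `None` for anything that `int()` and `float()` reject, including values with precision notes, is an assumption of this sketch rather than confirmed behavior of the PR.

```python
def str_to_number(text):
    """Return an int if possible, else a float, else None (hypothetical helper)."""
    text = text.strip()
    # Try the stricter cast first so that "180" stays an integer.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    # Assumption: anything that cannot be cast is treated as non-numeric.
    return None


print(str_to_number("180"))       # parsed with int()
print(str_to_number("6.789"))     # falls back to float()
print(str_to_number("1.234(5)"))  # rejected by both casts in this sketch
```

Trying `int` first is what keeps angle-like values such as 180 from being widened to floats.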
Multiline values
CIF files allow multiline values, enclosed in semicolon-delimited blocks. The current implementation DOES NOT handle these and instead skips those keys. I will add this functionality in a later PR, but doing so requires a totally different approach that is both slower and more complicated. Very few pieces of data use the multiline notation, and those that do typically belong to administrative/non-scientific keys: for example, _audit_update_record (upload and download dates for the file) or _publ_section_title (the full title of the journal article where the file was published).
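To illustrate what skipping these keys can look like, here is a rough sketch of a line-by-line reader that ignores semicolon-delimited blocks. The function name `iter_single_line_pairs` and the whitespace-splitting logic are assumptions made for illustration; the PR itself uses regex patterns whose details are not shown here.

```python
def iter_single_line_pairs(lines):
    """Yield (key, value) for single-line entries, skipping multiline blocks.

    Hypothetical sketch: a line starting with ";" opens or closes a text block.
    """
    in_text_block = False
    for line in lines:
        if line.startswith(";"):
            # Toggle on both the opening and the closing semicolon line.
            in_text_block = not in_text_block
            continue
        if in_text_block:
            continue  # Ignore everything inside the multiline value.
        stripped = line.strip()
        if not stripped.startswith("_"):
            continue
        parts = stripped.split(None, 1)
        # A key with no value on its own line (e.g. one that introduces a
        # multiline block) is skipped rather than returned.
        if len(parts) == 2:
            yield parts[0], parts[1]


sample = [
    "_cell_length_a 3.5",
    "_publ_section_title",
    ";",
    "A long article title",
    "_not_a_real_key inside the text block",
    ";",
    "_cell_angle_alpha 90",
]
print(list(iter_single_line_pairs(sample)))
```

Only the two `_cell_*` entries are yielded; `_publ_section_title` and the contents of its text block are dropped.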
Why not np.fromregex?
np.fromregex can reproduce our current behavior; however, it is slower than line-by-line iteration and far less flexible. With the current approach, we can use a single generator function for all of the key-value readers, and build on the most general reader function for special cases like reading box values for Freud/EBT. The code is somewhat simpler when using NumPy, but realistically most of the difficulty comes from understanding the regex rather than the actual data parsing logic.
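The structure described above (one generator shared by every reader, with special cases layered on the general reader) might look roughly like the sketch below. All names here (`KEY_VALUE`, `matched_lines`, `read_pairs`, `read_cell`) and the regex are hypothetical stand-ins, not the functions or patterns in parsnip; only the `_cell_*` data names come from the CIF core dictionary.

```python
import re

# Hypothetical pattern: a data name starting with "_", whitespace, then the value.
KEY_VALUE = re.compile(r"^(_[\w.\-\[\]]+)\s+(\S.*?)\s*$")

CELL_KEYS = (
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
)


def matched_lines(lines, pattern):
    """Yield the captured groups of every line that matches ``pattern``."""
    for line in lines:
        match = pattern.match(line)
        if match is not None:
            yield match.groups()


def read_pairs(lines, keys=None):
    """General reader: {key: value string}, optionally limited to ``keys``."""
    wanted = None if keys is None else set(keys)
    return {
        key: value
        for key, value in matched_lines(lines, KEY_VALUE)
        if wanted is None or key in wanted
    }


def read_cell(lines):
    """Special case built on the general reader: six cell parameters as floats.

    Assumes every cell value is a plain number without a precision note.
    """
    pairs = read_pairs(lines, keys=CELL_KEYS)
    return tuple(float(pairs[key]) for key in CELL_KEYS)
```

Because `read_cell` only filters and converts what `read_pairs` returns, any change to the line-matching logic stays in one place.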

TODOs

Clean up code
Rebase this onto the branch for Add CIF table reader #2, once merged.

klywang

Looks good! There was just a typo in the docstring.

I think I had a few questions here or there, but none of them are important or require changes (unless you feel like it).

Also I didn't look too closely at the tests.

parsnip/parse.py

parsnip/patterns.py

Co-authored-by: Kelly Wang <[email protected]>

joaander

Overall this looks very good. I have a few suggestions for you to consider.

doc/source/quickstart.rst

tests/sample_data/CCDC_1446529_Pm-3m.cif

parsnip/parse.py

Stale

janbridley self-assigned this Apr 5, 2024

janbridley marked this pull request as ready for review April 5, 2024 16:44

janbridley requested a review from klywang April 5, 2024 17:04

janbridley added 17 commits April 10, 2024 09:33

Add _str2num and _deg2rad _utils

656e9e1

Add cif file keys list to sample data

1e74eb7

Add key_value_pairs reader and cell_params reader to parse

c369fd1

Add tests for key reader

672c4e3

Add tests for new utils

e0b693f

Reorder test_key_reader

79350fc

Improve documentation for regex

04b3344

Add warnings and tests to read_key_value_pairs

b59eab1

Restore trailing spaces to downloaded CIF files

87303b9

Properly track keys containing "-"

90120c7

Improved tests for key value pair reader

d4203da

Add key-value tests for INTENTIONALLY_BAD_CIF.cif

8c3c014

Fix docs

9c91bde

Enable top of page button

9aaba90

Update brand primary colors

6ea7882

Improve docs for parse.py

0169783

Add __future__.annotations imports to relevant files

a404d19

janbridley force-pushed the feature/read-values branch from cd7b3f7 to a404d19 Compare April 10, 2024 13:45

janbridley added 9 commits April 10, 2024 09:54

Fix typo

4903f80

Seperate _errors from _templates

a333c5c

Clean up docstring return types

b0f386b

Add PDB cif to test suite

96acd85

Fix test in test_key_reader

a6ebf33

Clean up patterns.py and add remove_nondelimiting_whitespace

f8dbaa3

Update table_reader to use remove_nondelimiting_whitespace

b1e0bdd

Allow value reader to read mmCIF files

51328be

Update test_table_reader.py

06abb57

janbridley added 8 commits April 10, 2024 12:58

Add source for PDB cif

56d80de

Add mmCIF flag to read_cell_params

5d47d10

Add quickstart.rst

dfbf5ed

Fix comment in quickstart

28a7025

Remove unnecessary line in quickstart

e60cd1b

Fix image path in README.rst

6e82566

Update regex documentation

a772261

Fix CI

7d03311

klywang previously requested changes Apr 15, 2024

janbridley and others added 5 commits April 25, 2024 14:46

Documentation fix

bbe6426

Documentation fix for regex filter

836f465

Comment fixes

81a2f98

Fix #8

d506df3

Fix typo in _parsed_line_generator docs

975f9b0

Co-authored-by: Kelly Wang <[email protected]>

janbridley mentioned this pull request Apr 25, 2024

Comments on loop_ keyword lines #8

Closed

janbridley added 2 commits April 25, 2024 16:01

Typo fix

0c48708

Move tip block comment

8eb5f2e

janbridley requested a review from klywang April 25, 2024 20:04

janbridley mentioned this pull request Apr 26, 2024

Add ability to build out unit cells from CIF files #11

Merged

7 tasks

Merge branch 'main' into feature/read-values

343dc90

joaander reviewed May 16, 2024

doc/source/quickstart.rst

tests/sample_data/CCDC_1446529_Pm-3m.cif

parsnip/parse.py

parsnip/parse.py

janbridley mentioned this pull request May 16, 2024

Use sybil to validate that code examples run without errors. #14

Closed

janbridley added 5 commits May 16, 2024 13:33

Untrack cif files from end-of-file-fixer

d71119a

Merge remote-tracking branch 'origin/main' into feature/read-value

cc23120

Add missing key to CifData namedtuple

75f8c3e

Remove __future__ annotations

855f2c5

Remove type | type

60fcbe7

janbridley merged commit dfd640c into main May 22, 2024

janbridley deleted the feature/read-values branch May 22, 2024 18:17

4 participants
