Set up package structure #3
Merged
Overview
parsnip aims to be a simple interface for reading data from CIF files. It will be written in pure Python for portability, and is expected to depend only on NumPy. While this dependency could be avoided, it simplifies the code and speeds up a few functions.
The testing of this package uses some real CIF files, drawn from four free databases. Attributions for the files are included at the top of the CIF, and a small README links the sources for that data. If this is insufficient, I can adjust the citations appropriately.
Note that developers must install gemmi to run the tests. It makes sense to have a widely used package as a benchmark, and it provides strong feedback on the functionality of parsnip.
Why parsnip
Simplicity. CIF parsing packages are almost always large, clunky, and sparsely documented: gemmi works well and is fast, but the documentation is surface level and the examples are not particularly good. pymmcif has little public documentation, and pdbecif's examples do not all work out of the box.
While all of these projects are good tools, their scope is far broader than what many researchers need.
Why not C/C++/Rust/...
Portability. parsnip is intended to be general enough for an experimental materials undergrad to use without much effort, and that includes working out of the box on Windows and simple extensibility in Python.
Could this be done with pure regex?
Possibly? But doing so would be far more confusing (and likely slower) than the current approach. I want this package to be simple and easy to use and maintain, so regex is documented well and used sparingly.
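As an illustration of what "documented well" can look like, here is a minimal sketch of a commented pattern written with `re.VERBOSE`. The pattern and the `parse_key_value` helper are hypothetical and are not taken from parsnip; they also assume a simplified view of CIF in which a quoted value never contains its own quote character.

```python
import re

# Hypothetical pattern, not parsnip's own: matches a simple CIF key-value
# line such as "_cell_length_a 5.4310(2)".
KEY_VALUE = re.compile(
    r"""
    ^\s*                # optional leading whitespace
    (?P<key>_\S+)       # data name: an underscore, then non-space characters
    \s+                 # whitespace separating key and value
    (?P<value>
        '[^']*'         # a single-quoted string
      | "[^"]*"         # or a double-quoted string
      | \S+             # or a bare token
    )
    \s*$                # optional trailing whitespace
    """,
    re.VERBOSE,
)


def parse_key_value(line):
    """Return (key, value) for a key-value line, or None if it is not one."""
    match = KEY_VALUE.match(line)
    if match is None:
        return None
    return match.group("key"), match.group("value")
```

Keeping each piece of the pattern on its own commented line makes it easier to maintain than a single dense expression, at no runtime cost.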
Performance
parsnip's minimal interface means the code is quite fast: extracting a table from a small (~100 line) CIF file takes about 70μs, and extracting from a much larger (~600 line) file takes about 150μs. 5000-line mmCIF files can be parsed in less than 500μs. Performing the same operations with gemmi takes about 1.5x-2x as long, although obviously that package does far more checking/parsing under the hood.
parsnip achieves this performance with logic designed around the CIF format. When reading tables, only the header section must be scanned for keys: if none are found, every data line in that table can be skipped. When reading individual values, lines are parsed in order and functions can return once each key is identified. Keeping the parsed file in the program's state would be faster for scripts performing many operations, but for reading one or two tables, plus a few keys, simplicity wins out here.
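The two shortcuts described above can be sketched roughly as follows. This is not parsnip's implementation: `read_keys`, `read_table`, and `SAMPLE` are hypothetical names, and the sketch assumes one table row per line with whitespace-separated values and no quoted strings containing spaces.

```python
def read_keys(lines, wanted):
    """Return {key: value} for the wanted keys, stopping once all are found."""
    remaining = set(wanted)
    found = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in remaining:
            found[parts[0]] = parts[1].strip()
            remaining.discard(parts[0])
            if not remaining:
                break  # every key identified: the rest of the file is never read
    return found


def read_table(lines, wanted):
    """Return {column: [values]} from the first loop_ table with a wanted column."""
    wanted = set(wanted)
    i, n = 0, len(lines)
    while i < n:
        if lines[i].strip() != "loop_":
            i += 1
            continue
        # Scan only the header: the run of "_name" lines after "loop_".
        i += 1
        header = []
        while i < n and lines[i].strip().startswith("_"):
            header.append(lines[i].strip())
            i += 1
        columns = [j for j, name in enumerate(header) if name in wanted]
        table = {header[j]: [] for j in columns}
        # Data rows run until a blank line, a new loop_, or a new data name.
        while i < n:
            row = lines[i].strip()
            if not row or row == "loop_" or row.startswith("_"):
                break
            if columns:  # no wanted key in the header: rows are skipped unsplit
                fields = row.split()
                for j in columns:
                    table[header[j]].append(fields[j])
            i += 1
        if columns:
            return table
    return {}


# Hypothetical sample data for illustration only.
SAMPLE = [
    "data_test",
    "_cell_length_a 5.43",
    "_cell_length_b 5.43",
    "loop_",
    "_symmetry_equiv_pos_as_xyz",
    "x,y,z",
    "-x,-y,z",
    "",
    "loop_",
    "_atom_site_label",
    "_atom_site_fract_x",
    "_atom_site_fract_y",
    "Si1 0.0 0.0",
    "Si2 0.25 0.25",
]
```

Calling `read_table(SAMPLE, ["_atom_site_label"])` passes over the symmetry table without splitting any of its rows, because none of its header names match. Passing an iterator to `read_keys` shows the early exit: lines after the last requested key are left unconsumed.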