@janbridley janbridley commented Apr 3, 2024

Overview

parsnip aims to be a simple interface for reading data from CIF files. It will be written in pure Python for portability and is expected to depend only on NumPy. While this dependency could be avoided, it simplifies the code and speeds up a few functions.

The tests for this package use some real CIF files, drawn from four free databases. Attributions are included at the top of each CIF file, and a small README links the sources for that data. If this is insufficient, I can adjust the citations appropriately.

Note that developers must install gemmi to run the tests. It makes sense to have a widely used package as a benchmark, and it provides strong feedback on the functionality of parsnip.

Why parsnip

Simplicity. CIF parsing packages are almost always large, clunky, and sparsely documented: gemmi works well and is fast, but its documentation is surface-level and its examples are not particularly good. pymmcif has little public documentation, and pdbecif's examples do not all work out of the box.

While all of these projects are good tools, their scope is far broader than what many researchers need.

Why not C/C++/Rust/...

Portability. parsnip is intended to be general enough for an experimental materials undergrad to use without much effort, and that includes working out of the box on Windows and being simple to extend in Python.

Could this be done with pure regex?

Possibly? But doing so would be far more confusing (and likely slower) than the current approach. I want this package to be simple and easy to use and maintain, so regex is documented well and used sparingly.

Performance

parsnip's minimal interface means the code is quite fast: extracting a table from a small (~100 line) CIF file takes about 70 μs, and extracting from a much larger (~600 line) file takes about 150 μs. 5000-line mmCIF files can be parsed in less than 500 μs. Performing the same operations with gemmi takes about 1.5x-2x as long, although that package does far more checking and parsing under the hood.

parsnip achieves this performance with logic designed around the CIF format. When reading tables, only the header section must be scanned for keys: if none are found, every data line in that table can be skipped. When reading individual values, lines are parsed in order and functions can return once each key is identified. Keeping the parsed file in the program's state will be faster for scripts performing many operations, but for reading one or two tables, plus a few keys, simplicity wins out here.
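As a rough illustration of the two strategies described above, here is a minimal sketch in plain Python. It is not parsnip's actual code or API: the function names `read_key_value_pairs` and `read_table` and the sample data are hypothetical, and the sketch assumes whitespace-separated values with no quoted strings, comments, or multi-line fields.

```python
def read_key_value_pairs(lines, keys):
    """Collect `_key value` pairs in file order (hypothetical helper)."""
    remaining = set(keys)
    found = {}
    for line in lines:
        if not remaining:
            break  # every requested key is identified, so stop scanning
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0] in remaining:
            found[parts[0]] = parts[1].strip()
            remaining.discard(parts[0])
    return found


def read_table(lines, keys):
    """Return the requested columns of the first loop_ table that has any of them."""
    lines = [ln.strip() for ln in lines]
    i, n = 0, len(lines)
    while i < n:
        if lines[i] != "loop_":
            i += 1
            continue
        i += 1
        # Header section: consecutive lines that start with "_".
        header = []
        while i < n and lines[i].startswith("_"):
            header.append(lines[i])
            i += 1
        # Walk to the end of the data block without splitting any line.
        start = i
        while (i < n and lines[i] and lines[i] != "loop_"
               and not lines[i].startswith("_")):
            i += 1
        cols = [header.index(k) for k in keys if k in header]
        if not cols:
            continue  # no requested key in the header: data lines are skipped
        return [[row.split()[c] for c in cols] for row in lines[start:i]]
    return []


# Made-up sample data for illustration only.
example = """data_test
_cell_length_a 3.6
_cell_length_b 3.6
loop_
_symmetry_equiv_pos_as_xyz
x,y,z
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
Cu1 0.0 0.0
Cu2 0.5 0.5
""".splitlines()

cell = read_key_value_pairs(example, ["_cell_length_a"])
atoms = read_table(example, ["_atom_site_label", "_atom_site_fract_x"])
```

In this sketch the symmetry table's data line is never tokenized because its header holds none of the requested keys, and the key lookup stops at the second line of the file once `_cell_length_a` has been found.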

@janbridley janbridley marked this pull request as ready for review April 3, 2024 15:11
@janbridley janbridley merged commit 28df06c into main Apr 3, 2024
@janbridley janbridley deleted the admin/setup branch April 3, 2024 15:18