Set up package structure #3

janbridley · 2024-04-03T14:57:34Z

Overview

parsnip aims to be a simple interface for reading data from CIF files. It will be written in pure Python for portability, and is expected to depend only on NumPy. While this dependency could be avoided, it simplifies the code and speeds up a few functions.

The testing of this package uses some real CIF files, drawn from four free databases. Attributions for the files are included at the top of the CIF, and a small README links the sources for that data. If this is insufficient, I can adjust the citations appropriately.

Note that developers must install gemmi to run the tests. It makes sense to have a widely used package as a benchmark, and it provides strong feedback on the functionality of parsnip.

Why `parsnip`

Simplicity. CIF parsing packages are almost always large, clunky, and sparsely documented: gemmi works well and is fast, but the documentation is surface-level and the examples are not particularly good. pymmcif has little public documentation, and pdbecif's examples do not all work out of the box.

While all of these projects are good tools, their scope is far broader than what many researchers need.

Why not C/C++/Rust/...

Portability. parsnip is intended to be general enough for an experimental materials undergrad to use without much effort, and that includes out-of-the-box function on Windows and simple extensibility in Python.

Could this be done with pure regex?

Possibly? But doing so would be far more confusing (and likely slower) than the current approach. I want this package to be simple and easy to use and maintain, so regex is documented well and used sparingly.

Performance

parsnip's minimal interface means the code is quite fast: extracting a table from a small (~100 line) CIF file takes about 70μs, and extracting from a much larger (~600 line) file takes about 150μs. 5000-line mmCIF files can be parsed in less than 500μs. Performing the same operations with gemmi takes about 1.5x-2x as long, although obviously that package does far more checking/parsing under the hood.
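For anyone who wants to take per-call timings like these on their own files, here is a minimal sketch using the standard library's `timeit`. It is not the benchmark script behind the numbers above: `time_call`, `count_loops`, and `sample` are hypothetical names, and `count_loops` is a trivial stand-in for a real parsing call.

```python
import timeit


def time_call(func, *args, repeat=5, number=200):
    """Return the best observed per-call time of func(*args), in microseconds."""
    timer = timeit.Timer(lambda: func(*args))
    # Each entry of repeat() is the total time for `number` calls;
    # the minimum is the least noisy estimate.
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1e6


def count_loops(text):
    """Toy stand-in for a parser call: count the loop_ markers in CIF-like text."""
    return sum(1 for line in text.splitlines() if line.strip() == "loop_")


# Small synthetic input (an assumption, not one of the test CIF files).
sample = "data_x\nloop_\n_a\n1\n2\n" * 20

elapsed_us = time_call(count_loops, sample)
print(f"{elapsed_us:.1f} µs per call")
```

Swapping `count_loops` for the function under test (and `sample` for a file path or its contents) gives comparable per-call figures, though results will vary by machine.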

parsnip achieves this performance with logic designed around the CIF format. When reading tables, only the header section must be scanned for keys: if none are found, every data line in that table can be skipped. When reading individual values, lines are parsed in order and functions can return once each key is identified. Keeping the parsed file in the program's state will be faster for scripts performing many operations, but for reading one or two tables, plus a few keys, simplicity wins out here.
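The two ideas above can be sketched roughly as follows. This is an illustration only, not the code in this package: `read_key_value_pairs`, `read_table`, and `SAMPLE` are hypothetical names, and the sketch assumes whitespace-separated values with no quoted strings or multi-line fields.

```python
SAMPLE = """\
data_example
_cell_length_a 3.6
_cell_length_b 3.6
loop_
_symmetry_equiv_pos_as_xyz
x,y,z
-x,-y,-z

loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
Cu1 0.0 0.0
Cu2 0.5 0.5
"""


def read_key_value_pairs(lines, keys):
    """Return {key: value} for the requested keys, stopping once all are found."""
    remaining = set(keys)
    found = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in remaining:
            found[parts[0]] = parts[1].strip()
            remaining.discard(parts[0])
            if not remaining:
                break  # every key is identified, so stop reading
    return found


def read_table(lines, keys):
    """Return the requested columns of the first loop_ table that contains any key."""
    lines = [line.strip() for line in lines]
    stops = ("_", "loop_", "data_", "#")
    i = 0
    while i < len(lines):
        if lines[i] != "loop_":
            i += 1
            continue
        i += 1
        # Only the header (the tag lines after loop_) is scanned for keys.
        header = []
        while i < len(lines) and lines[i].startswith("_"):
            header.append(lines[i])
            i += 1
        columns = [header.index(k) for k in keys if k in header]
        if not columns:
            # No requested key here: the data lines are stepped over unsplit.
            continue
        rows = []
        while i < len(lines) and lines[i] and not lines[i].startswith(stops):
            fields = lines[i].split()
            rows.append([fields[c] for c in columns])
            i += 1
        return rows
    return []


print(read_key_value_pairs(SAMPLE.splitlines(), ["_cell_length_a"]))
print(read_table(SAMPLE.splitlines(), ["_atom_site_label", "_atom_site_fract_x"]))
```

In the sample, the symmetry table is skipped after its single header line is checked, and the key lookup stops at the second line of the file.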

janbridley added 8 commits April 3, 2024 10:53

Add .pre-commit-config.yaml

d3a4246

Add .gitignore

6ff948c

Add pyproject.toml

8fcd7ad

Add dependabot.yml

a51e0c8

Add requirements.txt

7a5fb8e

Add top-level __init__.py

c977c00

Update pyproject.toml

8151ac0

Update pyproject.toml

6d2417f

janbridley marked this pull request as ready for review April 3, 2024 15:11

janbridley merged commit 28df06c into main Apr 3, 2024

janbridley deleted the admin/setup branch April 3, 2024 15:18
