Merged
Changes from 1 commit
117 commits
1bd92e8
Nearly-working impl
janbridley Dec 20, 2024
da87f8c
Full working example
janbridley Dec 20, 2024
b28cc05
Clean up layout
janbridley Dec 20, 2024
2ae2d31
Further cleanup
janbridley Dec 20, 2024
a05d8b8
Lint OO
janbridley Dec 20, 2024
5bc5299
Run pre-commit on _errors.py
janbridley Dec 21, 2024
06fabc7
Add oo.py temp implementation
janbridley Dec 21, 2024
727f604
Undo changes to sample data
janbridley Dec 21, 2024
2752632
Lint oo.py
janbridley Dec 21, 2024
b643d5f
Remove change to sample data
janbridley Dec 21, 2024
c949095
Add oo to init.py
janbridley Dec 21, 2024
04280cc
Handle edge cases
janbridley Dec 21, 2024
0d53a30
Test parsing real files
janbridley Dec 21, 2024
e7855f1
Improve robustness of table reader
janbridley Dec 21, 2024
f7a9d75
Lint oo and conftest
janbridley Dec 21, 2024
9c63222
Clean up text and remove comments
janbridley Dec 21, 2024
aa78a68
Port initial test to new style
janbridley Dec 21, 2024
7d07b59
Port remaining key tests
janbridley Dec 21, 2024
56a4d50
Minor fixes
janbridley Dec 21, 2024
16f1655
Clean up test_key_reader.py
janbridley Dec 21, 2024
5e92141
Progress toward transition to recarray
janbridley Dec 21, 2024
a44a925
Increase tests and fix memory layout bug
janbridley Dec 22, 2024
b020262
Fixes to memory layout
janbridley Dec 22, 2024
ccb2aea
Convert table_reader tests
janbridley Dec 22, 2024
1e81654
Linting and doc fixes
janbridley Dec 22, 2024
97e0cbd
Clean up docs
janbridley Dec 22, 2024
33134e8
Finish porting tests
janbridley Dec 23, 2024
7bc9f86
Lints
janbridley Dec 23, 2024
c80a86a
Fix for scalar array inputs
janbridley Dec 24, 2024
e23762e
Remove unnecessary filterwarning
janbridley Dec 24, 2024
917685f
Expand on tests
janbridley Dec 24, 2024
933d4b9
Clean up unitcells
janbridley Dec 24, 2024
42151fa
Lint tests
janbridley Dec 24, 2024
8a5e73f
Finalize lints
janbridley Dec 24, 2024
12b73c5
Restructure patterns
janbridley Dec 24, 2024
8f945fd
Update test_patterns
janbridley Dec 24, 2024
0d50bc8
Lint and clean up
janbridley Dec 24, 2024
d39ce71
Final lint
janbridley Dec 24, 2024
4bfe1b3
Improve a few tests
janbridley Dec 24, 2024
cded448
Add symops to example cif
janbridley Dec 24, 2024
faf9816
Remove package-unitcells deprecated docs
janbridley Dec 24, 2024
efb80e8
Fix link in package-parse
janbridley Dec 24, 2024
86faca3
Update quickstart tutorial
janbridley Dec 24, 2024
867a37d
Move oo.py to parsnip.py
janbridley Dec 24, 2024
31a5b6c
Update README
janbridley Dec 24, 2024
9ceb2f9
Update Unitcells test imports
janbridley Dec 24, 2024
8ef7150
Lint
janbridley Dec 24, 2024
2dba4be
Lazily load file
janbridley Dec 24, 2024
65d6890
Remove unused files
janbridley Dec 24, 2024
8b6f99f
Skip bad_cif test
janbridley Dec 25, 2024
05895ea
Clean up tests
janbridley Dec 25, 2024
305a37a
Lint
janbridley Dec 25, 2024
3592f1e
Add tests for table_labels and cast_numerics
janbridley Dec 26, 2024
0cc212e
Clean up tests
janbridley Dec 26, 2024
79a1c85
Lint
janbridley Dec 26, 2024
a16c7e8
Clean up docstrings
janbridley Dec 26, 2024
716cc4c
Lint and update docstrings
janbridley Dec 27, 2024
f211007
Further docs
janbridley Dec 27, 2024
def08df
Codespell
janbridley Dec 27, 2024
a001b48
Tests for cell
janbridley Dec 27, 2024
a40fa7e
Lint
janbridley Dec 27, 2024
cf83de1
Update errors for read_unit_cell
janbridley Dec 27, 2024
ab12074
Clean up tests and todos
janbridley Dec 28, 2024
46d5577
More TODOs
janbridley Dec 28, 2024
b1e4388
Lint
janbridley Dec 28, 2024
e27c00b
Add test for cell property
janbridley Dec 28, 2024
4ced6ed
Remove modindex from sidebar
janbridley Dec 28, 2024
b8eb804
Consolidate logic for nonsimple data
janbridley Dec 28, 2024
7369f84
Lint
janbridley Dec 28, 2024
8b8a0b5
Fix type annotation in cast_array function
janbridley Dec 28, 2024
d6b4da4
Add more-itertools as official dependency
janbridley Dec 28, 2024
343150d
Clean up dependency documentation
janbridley Dec 28, 2024
30d7776
Add index for ase backward compatibility
janbridley Dec 28, 2024
e46fdf4
Change wording in development.rst
janbridley Dec 28, 2024
1871e50
Replace index specification
janbridley Dec 28, 2024
66701ff
Disable ASE test on python3.7
janbridley Dec 28, 2024
67e6b22
Fix version check
janbridley Dec 28, 2024
31eb83d
Add additional lints
janbridley Dec 28, 2024
31a9907
Document additional rules in pyproject.toml
janbridley Dec 28, 2024
2828a3b
Move PATTERNS dict to end of docs
janbridley Dec 28, 2024
54f316c
Clean up development.rst
janbridley Dec 28, 2024
dadb0b4
Expand with tests from additional databases
janbridley Dec 28, 2024
8672046
Disable lint that causes warning
janbridley Dec 28, 2024
08b35c1
Fix for multiline data entries
janbridley Dec 29, 2024
10a60fa
Progress toward multiline string parsing
janbridley Dec 29, 2024
f7402d4
Working impl that fails for blocks containing a semicolon
janbridley Dec 29, 2024
58339c5
Clean up
janbridley Dec 29, 2024
993f698
Messy working impl
janbridley Dec 29, 2024
88c3a1a
Clean up
janbridley Dec 29, 2024
602edf9
Retain newlines
janbridley Dec 29, 2024
0230b52
Lint
janbridley Dec 29, 2024
82f31c6
Add TODO
janbridley Dec 29, 2024
f029663
Add missing multiline keys
janbridley Dec 29, 2024
e648ce4
Wrap accumulator into a function
janbridley Dec 29, 2024
d643fad
Clean up _accumulate_nonsimple_data
janbridley Dec 29, 2024
85c39be
Clean up unused comments
janbridley Dec 29, 2024
a7a2468
Update changelog.rst
janbridley Dec 30, 2024
67b59d4
Fix version headings in changelog
janbridley Dec 30, 2024
0c14506
Update README to reflect correct CIF2.0 status
janbridley Dec 31, 2024
af0325d
Add CIFTEST data to gitignore
janbridley Dec 31, 2024
19a8173
Escape dash in regex and allow forward slash in data name
janbridley Dec 31, 2024
2fec769
Swap namedtuple to dataclass and clean up provided keys
janbridley Dec 31, 2024
5eb4bbf
Auto detect cif keys
janbridley Dec 31, 2024
6f30930
Allow pdb matrix keys
janbridley Dec 31, 2024
0ff9e9a
Generalize nonsimple data delimiters
janbridley Dec 31, 2024
bb143d3
Add architecture.md
janbridley Dec 31, 2024
c383131
Update table tests and fix regex for nonsimple data in tabs
janbridley Dec 31, 2024
e84e94a
Add pycifrw to test reqs
janbridley Dec 31, 2024
327d3f1
Lint tests
janbridley Dec 31, 2024
65a0015
Verify all table content
janbridley Jan 1, 2025
fdf1e21
Lint
janbridley Jan 1, 2025
f6051c9
import annotations
janbridley Jan 1, 2025
8b8df48
Remove unused pattern
janbridley Jan 1, 2025
d031882
Rename tables to loops
janbridley Jan 1, 2025
55e8817
Remove extra character from regex
janbridley Jan 1, 2025
f6517b0
Clean up table reader
janbridley Jan 1, 2025
357615a
Clean up
janbridley Jan 6, 2025
Restructure patterns
janbridley committed Dec 24, 2024
commit 12b73c5d61ed7a6e6276495f0d64b0c6ec40ba91
106 changes: 52 additions & 54 deletions parsnip/oo.py
@@ -1,5 +1,49 @@
# Copyright (c) 2024, Glotzer Group
# This file is from the parsnip project, released under the BSD 3-Clause License.
r"""An interface for reading CIF files in Python.

.. include:: ../../README.rst
:start-after: .. _parse:
:end-before: .. _installing:

.. admonition:: The CIF Format

This is an example of a simple CIF file. A `key`_ (data name or tag) must start with
an underscore, and is separated from the data value with whitespace characters.
A `table`_ begins with the ``loop_`` keyword, and contains a header block and a data
block. The vertical position of a tag in the table headings corresponds with the
horizontal position of the associated column in the table values.

.. code-block:: text

# Key-value pairs describing the unit cell:
_cell_length_a 5.40
_cell_length_b 3.43
_cell_length_c 5.08
_cell_angle_alpha 90.0
_cell_angle_beta 132.3
_cell_angle_gamma 90.0

# A table with two columns and eight rows:
loop_
_symmetry_equiv_pos_site_id
_symmetry_equiv_pos_as_xyz
1 x,y,z
2 -x,y,-z
3 -x,-y,-z
4 x,-y,z
5 x+1/2,y+1/2,z
6 -x+1/2,y+1/2,-z
7 -x+1/2,-y+1/2,-z
8 x+1/2,-y+1/2,z

_symmetry_space_group_name_H-M 'C2 / m' # One more key-value pair


.. _key: https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#definitions
.. _table: https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#onelevel

"""

from __future__ import annotations

@@ -12,69 +56,23 @@
from numpy.lib.recfunctions import structured_to_unstructured

from parsnip._errors import ParseWarning
from parsnip.parse import (
_parsed_line_generator,
from parsnip.patterns import (
_dtype_from_int,
_is_data,
_is_key,
_line_is_continued,
_safe_eval,
_semicolon_to_string,
_strip_comments,
_write_debug_output,
cast_array_to_float,
)
from parsnip.unitcells import _matrix_from_lengths_and_angles
# from parsnip.patterns import

NONTABLE_LINE_PREFIXES = ("_", "#")


def _is_key(line: str | None):
    return line is not None and line.strip()[:1] == "_"


def _is_data(line: str | None):
    return line is not None and line.strip()[:1] != "_" and line.strip()[:5] != "loop_"


def _strip_comments(s: str):
    return s.split("#")[0].strip()


def _strip_quotes(s: str):
    return s.replace("'", "").replace('"', "")


def _dtype_from_int(i: int):
    return f"<U{i}"


def _semicolon_to_string(line: str):
    if "'" in line and '"' in line:
        warnings.warn(
            (
                "String contains single and double quotes - "
                "line may be parsed incorrectly"
            ),
            stacklevel=2,
        )
    # WARNING: because we split our string, we strip "\n" implicitly
    # This is technically against spec, but is almost never meaningful
    return line.replace(";", "'" if "'" not in line else '"')


def _line_is_continued(line: str | None):
    return line is not None and line.strip()[:1] == ";"


def _try_cast_to_numeric(s: str):
    """Attempt to cast a string to a number, returning the original string if invalid.

    This method attempts to convert to a float first, followed by an int. Precision
    measurements and indicators of significant digits are stripped.
    """
    parsed = re.match(r"(\d+\.?\d*)", s.strip())
    if parsed is None or re.search(r"[^0-9\.\(\)]", s):
        return s
    elif "." in parsed.group(0):
        return float(parsed.group(0))
    else:
        return int(parsed.group(0))


class CifFile:
    """Parser for CIF files."""
119 changes: 119 additions & 0 deletions parsnip/patterns.py
@@ -8,10 +8,76 @@
of string data extracted from CIF files by methods in ``parsnip.parse``.

"""
from __future__ import annotations

import re
import warnings

import numpy as np

from parsnip._errors import ParseWarning

def _safe_eval(str_input: str, x: int | float, y: int | float, z: int | float):
    """Attempt to safely evaluate a string of symmetry equivalent positions.

    Python's ``eval`` is notoriously unsafe. While we could evaluate the entire list at
    once, doing so carries some risk. The typical alternative, ``ast.literal_eval``,
    does not work because we need to evaluate mathematical operations.

    We first replace the x,y,z values with ordered fstring inputs, to simplify the input
    of fractional coordinate data. This is done for convenience more than security.

    Once we substitute in the x,y,z values, we should have a string version of a list
    containing only numerics and math operators. We apply a substitution to ensure this
    is the case, then perform one final check. If it passes, we evaluate the list. Note
    that __builtins__ is set to {}, meaning importing functions is not possible. The
    __locals__ dict is also set to {}, so no variables are accessible in the evaluation.

    I cannot guarantee this is fully safe, but it at the very least makes it extremely
    difficult to do any funny business.

    Args:
        str_input (str): String to be evaluated.
        x (int|float): Fractional coordinate in :math:`x`.
        y (int|float): Fractional coordinate in :math:`y`.
        z (int|float): Fractional coordinate in :math:`z`.

    Returns:
        list[list[int|float,int|float,int|float]]:
            :math:`(N,3)` list of fractional coordinates.

    """
    ordered_inputs = {"x": "{0:.20f}", "y": "{1:.20f}", "z": "{2:.20f}"}
    # Replace any x, y, or z with the same character surrounded by curly braces. Then,
    # perform substitutions to insert the actual values.
    substituted_string = (
        re.sub(r"([xyz])", r"{\1}", str_input).format(**ordered_inputs).format(x, y, z)
    )

    # Remove any unexpected characters from the string.
    safe_string = re.sub(r"[^\d\[\]\,\+\-\/\*\.]", "", substituted_string)
    # Double check to be sure:
    assert all(char in ",.0123456789+-/*[]" for char in safe_string), (
        "Evaluation aborted. Check that symmetry operation string only contains "
        "numerics or characters in { [],.+-/ } and adjust `regex_filter` param "
        "accordingly."
    )
    return eval(safe_string, {"__builtins__": {}}, {})  # noqa: S307
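Reviewer note: the substitute, filter, evaluate sequence described in the `_safe_eval` docstring can be tried in isolation. The sketch below is a standalone illustration, not the project's function: the name `safe_eval_sketch` and the sample input string are assumptions made for this example, and the regular expressions are copied from the diff above.

```python
import re


def safe_eval_sketch(str_input: str, x: float, y: float, z: float):
    """Hypothetical stand-in for _safe_eval, for illustration only."""
    # Step 1: wrap each x, y, z in braces, then map them to positional fields
    # so the numeric values can be inserted with a second str.format call.
    fields = {"x": "{0:.20f}", "y": "{1:.20f}", "z": "{2:.20f}"}
    substituted = (
        re.sub(r"([xyz])", r"{\1}", str_input).format(**fields).format(x, y, z)
    )
    # Step 2: drop every character that is not a digit, bracket, comma,
    # decimal point, or arithmetic operator.
    safe_string = re.sub(r"[^\d\[\]\,\+\-\/\*\.]", "", substituted)
    # Step 3: evaluate with no builtins and no local names available.
    return eval(safe_string, {"__builtins__": {}}, {})  # noqa: S307


# Two symmetry operations applied to the fractional point (0.25, 0.5, 0.0).
positions = safe_eval_sketch("[[x,y,z],[-x,y+1/2,-z]]", 0.25, 0.5, 0.0)
print(positions)
```

Because letters are removed in step 2, an input such as a function call is reduced to punctuation before it reaches `eval`, which typically ends in a `SyntaxError` rather than execution. That matches the docstring's own caveat that this is a mitigation, not a guarantee.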


def _write_debug_output(unique_indices, unique_counts, pos, check="Initial"):
    print(f"{check} uniqueness check:")
    if len(unique_indices) == len(pos):
        print("... all points are unique (within tolerance).")
    else:
        print("(duplicate point, number of occurrences)")
        [
            print(pt, count)
            for pt, count in zip(pos[unique_indices], unique_counts)
            if count > 1
        ]

    print()

def cast_array_to_float(arr: np.ndarray, dtype: type = np.float32):
    """Cast a Numpy array to a dtype, pruning significant digits from numerical values.
@@ -100,3 +166,56 @@ def __call__(self, line: str):
        for pattern, replacement in zip(self.patterns, self.replacements):
            line = pattern.sub(replacement, line)
        return line

def _is_key(line: str | None):
    return line is not None and line.strip()[:1] == "_"


def _is_data(line: str | None):
    return line is not None and line.strip()[:1] != "_" and line.strip()[:5] != "loop_"


def _strip_comments(s: str):
    return s.split("#")[0].strip()


def _strip_quotes(s: str):
    return s.replace("'", "").replace('"', "")


def _dtype_from_int(i: int):
    return f"<U{i}"
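Reviewer note: a quick way to see how the relocated predicates classify lines is to run them over a few lines from the example CIF in the `oo.py` docstring. The sketch below re-declares the helpers under hypothetical public names (`is_key`, `is_data`, `strip_comments`) so it runs without importing parsnip. The classification loop is an assumption made for illustration, not code from this commit.

```python
def is_key(line):
    # Mirrors _is_key: a tag line starts with an underscore.
    return line is not None and line.strip()[:1] == "_"


def is_data(line):
    # Mirrors _is_data: neither a tag nor the loop_ keyword.
    stripped = line.strip() if line is not None else ""
    return line is not None and stripped[:1] != "_" and stripped[:5] != "loop_"


def strip_comments(s):
    # Mirrors _strip_comments: drop everything after the first "#".
    return s.split("#")[0].strip()


sample = [
    "_cell_length_a 5.40",
    "loop_",
    "_symmetry_equiv_pos_as_xyz",
    "1 x,y,z",
    "_symmetry_space_group_name_H-M 'C2 / m' # One more key-value pair",
]

labels = []
for raw in sample:
    line = strip_comments(raw)
    if is_key(line):
        labels.append("key")
    elif is_data(line):
        labels.append("data")
    else:
        labels.append("loop")

print(list(zip(labels, sample)))
```

Note that the trailing comment on the last sample line is removed before classification, so the line is still recognized as a key.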


def _semicolon_to_string(line: str):
    if "'" in line and '"' in line:
        warnings.warn(
            (
                "String contains single and double quotes - "
                "line may be parsed incorrectly"
            ),
            ParseWarning,
            stacklevel=2,
        )
    # WARNING: because we split our string, we strip "\n" implicitly
    # This is technically against spec, but is almost never meaningful
    return line.replace(";", "'" if "'" not in line else '"')
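Reviewer note: the quote-selection rule in `_semicolon_to_string` is easy to misread, so here is a minimal standalone sketch of it. The `ParseWarning` class below is a local stand-in (the real one lives in `parsnip._errors`), and the function name and sample strings are assumptions for this example.

```python
import warnings


class ParseWarning(Warning):
    """Local stand-in for parsnip._errors.ParseWarning (assumption)."""


def semicolon_to_string_sketch(line: str) -> str:
    if "'" in line and '"' in line:
        # Neither quote character is free, so the result may be ambiguous.
        warnings.warn(
            "String contains single and double quotes - "
            "line may be parsed incorrectly",
            ParseWarning,
            stacklevel=2,
        )
    # Use single quotes unless the text already contains one.
    return line.replace(";", "'" if "'" not in line else '"')


plain = semicolon_to_string_sketch(";C2 / m;")
with_apostrophe = semicolon_to_string_sketch(";it's;")
print(plain, with_apostrophe)
```

Every semicolon in the line is replaced, not only the delimiters, so a semicolon inside the text body would also become a quote character.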


def _line_is_continued(line: str | None):
    return line is not None and line.strip()[:1] == ";"


def _try_cast_to_numeric(s: str):
    """Attempt to cast a string to a number, returning the original string if invalid.

    This method attempts to convert to a float first, followed by an int. Precision
    measurements and indicators of significant digits are stripped.
    """
    parsed = re.match(r"(\d+\.?\d*)", s.strip())
    if parsed is None or re.search(r"[^0-9\.\(\)]", s):
        return s
    elif "." in parsed.group(0):
        return float(parsed.group(0))
    else:
        return int(parsed.group(0))
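Reviewer note: the casting rules in `_try_cast_to_numeric` are clearer with concrete inputs. The sketch below copies the logic under a hypothetical name, `try_cast_to_numeric_sketch`, so it can be run on its own. The sample values are taken from, or modeled on, the example CIF in the `oo.py` docstring.

```python
import re


def try_cast_to_numeric_sketch(s: str):
    """Hypothetical copy of _try_cast_to_numeric, for illustration only."""
    # Capture the leading digits and optional decimal part.
    parsed = re.match(r"(\d+\.?\d*)", s.strip())
    # Anything other than digits, dots, and parentheses means "not a number".
    if parsed is None or re.search(r"[^0-9\.\(\)]", s):
        return s
    elif "." in parsed.group(0):
        return float(parsed.group(0))
    else:
        return int(parsed.group(0))


results = {
    "5.40": try_cast_to_numeric_sketch("5.40"),          # plain float
    "132.3(2)": try_cast_to_numeric_sketch("132.3(2)"),  # uncertainty stripped
    "8": try_cast_to_numeric_sketch("8"),                # integer
    "x,y,z": try_cast_to_numeric_sketch("x,y,z"),        # left as a string
    "-1.5": try_cast_to_numeric_sketch("-1.5"),          # sign is not matched
}
print(results)
```

As written, a leading minus sign fails the digit-only pattern, so negative values come back unchanged as strings. Whether that is intended is a question for the author.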