feat: add extra_fields support for custom CSV columns beyond OHLCV (#25)

wallneradam · wallneradam · commit 48335faf2318 · 2026-03-15T17:46:42.000+01:00
Allow scripts to access additional CSV columns beyond standard OHLCV data
via lib.extra_fields dict, which is updated each bar by ScriptRunner.
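
As a rough illustration of the per-bar update described above, the sketch below
replaces a module-level dict on every bar. It does not use the real PyneCore
API: `lib` here is a `SimpleNamespace` stand-in, and `bars` and
`set_lib_properties` are hypothetical names chosen for this example.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the real `pynecore.lib` module (assumption).
lib = SimpleNamespace(extra_fields={})

# Hypothetical bar records; the last one carries no extra columns.
bars = [
    {"close": 102.0, "extra_fields": {"rsi": 45.2, "signal": "buy"}},
    {"close": 106.0, "extra_fields": {"rsi": 52.1, "signal": ""}},
    {"close": 108.0, "extra_fields": None},
]


def set_lib_properties(bar: dict, lib: SimpleNamespace) -> None:
    """Replace the dict each bar, falling back to {} when nothing is attached."""
    lib.extra_fields = bar["extra_fields"] if bar["extra_fields"] else {}


seen_rsi = []
for bar in bars:
    set_lib_properties(bar, lib)
    # A script reading lib.extra_fields only sees the current bar's values.
    seen_rsi.append(lib.extra_fields.get("rsi"))
```

Because the dict is swapped out rather than accumulated, history has to come
from the Series annotation in the script, not from `extra_fields` itself.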

- DataConverter generates .extra.csv sidecar file during CSV-to-OHLCV
  conversion, position-aligned with the binary file (including gaps)
- OHLCVReader loads sidecar automatically and attaches extra_fields to
  each OHLCV record
- LibrarySeriesTransformer excludes extra_fields from Series conversion
  (NON_SERIES_LIB_ATTRS) so dict subscript access works correctly
- Add documentation and 4 tests covering end-to-end usage, Series
  history indexing, sidecar generation, and backward compatibility
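
The sidecar alignment and the typed loading can be sketched roughly as follows.
This is a simplified, in-memory approximation rather than the code in this
commit: `align_extra_rows` and `load_typed` are hypothetical helper names, and
the list of volumes stands in for reading the binary file.

```python
import math


def align_extra_rows(volumes, source_rows, n_cols):
    """Return one extra row per OHLCV position; gap bars (volume < 0) get blanks."""
    remaining = iter(source_rows)
    aligned = []
    for volume in volumes:
        if volume < 0:
            aligned.append([""] * n_cols)
        else:
            # Real bar: consume the next source row, or blanks if exhausted.
            aligned.append(next(remaining, [""] * n_cols))
    return aligned


def load_typed(headers, rows):
    """Type each column as numeric or string from its first non-empty value."""
    missing = ("", "nan", "na")
    is_numeric = [None] * len(headers)
    for row in rows:
        for i, val in enumerate(row[:len(headers)]):
            if is_numeric[i] is None and val.lower() not in missing:
                try:
                    float(val)
                    is_numeric[i] = True
                except ValueError:
                    is_numeric[i] = False

    records = []
    for row in rows:
        record = {}
        for i, header in enumerate(headers):
            val = row[i] if i < len(row) else ""
            if is_numeric[i]:
                record[header] = float("nan") if val.lower() in missing else float(val)
            else:
                # Undetected columns (is_numeric is None) stay as strings.
                record[header] = val
        records.append(record)
    return records


headers = ["rsi", "signal"]
volumes = [1000, 1200, -1, -1, 800]  # -1 marks gap-filled bars
source_rows = [["45.2", "buy"], ["52.1", ""], ["38.7", "sell"]]

aligned = align_extra_rows(volumes, source_rows, len(headers))
records = load_typed(headers, aligned)
gap_rsi_is_nan = math.isnan(records[2]["rsi"])
```

In this sketch a gap bar yields `NaN` for the numeric column and an empty
string for the string column, so position `i` in `records` always matches
position `i` in the binary file.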
diff --git a/docs/advanced/README.md b/docs/advanced/README.md
@@ -23,3 +23,4 @@ Advanced topics and features of PyneCore
 - [Function Isolation](./function-isolation.md) - Function isolation implementation
 - [OHLCV Reader/Writer](./ohlcv-reader-writer.md) - OHLCV data handling
 - [CSV Reader/Writer](./csv-reader-writer.md) - Fast CSV processing
+- [Extra Fields](./extra-fields.md) - Custom CSV columns beyond OHLCV in scripts
diff --git a/docs/advanced/csv-reader-writer.md b/docs/advanced/csv-reader-writer.md
@@ -252,7 +252,7 @@ When working with financial data in PyneCore, you have two main options:
   - Data interchange with other systems
   - Human inspection and editing
   - Flexible schema requirements
-  - When additional fields beyond OHLCV are needed
+  - When additional fields beyond OHLCV are needed (see [Extra Fields](./extra-fields.md))
 
 Both systems are designed for performance while maintaining pure Python implementation, aligning with the PyneCore project vision.
 
diff --git a/docs/advanced/extra-fields.md b/docs/advanced/extra-fields.md
@@ -0,0 +1,127 @@
+<!--
+---
+weight: 1005
+title: "Extra Fields"
+description: "Accessing custom CSV columns beyond OHLCV data in Pyne scripts"
+icon: "playlist_add"
+date: "2025-03-31"
+lastmod: "2026-03-15"
+draft: false
+toc: true
+categories: ["Advanced", "Data Handling"]
+tags: ["extra-fields", "csv", "custom-data", "series", "data"]
+---
+-->
+
+# Extra Fields
+
+PyneCore allows you to access additional columns beyond standard OHLCV data from your CSV files inside Pyne scripts. This is useful when your data includes pre-computed indicators, signals, or any other custom data that you want to use alongside price data.
+
+## How It Works
+
+When a CSV file contains columns beyond the standard OHLCV fields (`timestamp`, `open`, `high`, `low`, `close`, `volume`), PyneCore automatically makes them available through `extra_fields` — a dictionary that is updated on each bar with the current row's extra column values.
+
+The data flow depends on how you run your script:
+
+### Binary OHLCV Path (workdir)
+
+When running from a workdir with `pyne run`, PyneCore converts CSV to binary `.ohlcv` format. Since the binary format only stores OHLCV data, extra columns are saved to a **sidecar file** (`.extra.csv`) that is position-aligned with the binary data:
+
+```
+workdir/data/my_data/
+    EURUSD_1h.csv          # Source: OHLCV + extra columns
+    EURUSD_1h.ohlcv        # Binary OHLCV (auto-generated)
+    EURUSD_1h.toml         # Symbol metadata (auto-generated)
+    EURUSD_1h.extra.csv    # Extra columns only (auto-generated)
+```
+
+The sidecar file is generated and regenerated automatically whenever the source CSV is converted. You never need to create or edit it manually.
+
+### Direct CSV Path (CSVReader / standalone)
+
+When reading CSV directly (e.g., via `CSVReader` or standalone execution), extra columns are parsed inline — no sidecar file is needed.
+
+## Usage in Scripts
+
+Access extra fields through `lib.extra_fields`, which is a `dict[str, Any]` updated each bar:
+
+```python
+"""
+@pyne
+"""
+from pynecore import Series
+from pynecore.lib import script, ta, close, extra_fields, plot
+
+
+@script.indicator(title="Extra Fields Example", overlay=True)
+def main():
+    # Access extra columns as Series by annotating with Series[T]
+    rsi: Series[float] = extra_fields["rsi"]
+    signal: Series[str] = extra_fields["signal"]
+
+    # Series indexing works — access previous bars
+    prev_rsi = rsi[1]       # Previous bar's RSI value
+    rsi_2_ago = rsi[2]      # RSI from 2 bars ago
+
+    # Use with built-in functions like any other Series
+    rsi_sma = ta.sma(rsi, 14)
+
+    # Use string fields for conditional logic
+    if signal[0] == "buy":
+        plot(close, "Buy Signal", linewidth=2)
+```
+
+### Key Points
+
+- **Type annotation creates the Series**: Writing `rsi: Series[float] = extra_fields["rsi"]` makes `rsi` a proper Series with history. The `extra_fields["rsi"]` part just returns the current bar's value (a plain `float`).
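+- **Supported types**: `float`, `int`, `str`, and `bool`. The type is detected automatically from the CSV data. On the binary OHLCV + sidecar path, detection only distinguishes numeric columns (loaded as `float`) from string columns.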
+- **Missing values**: Empty cells in the CSV appear as an empty string (`''`) when read via CSVReader. On the binary OHLCV + sidecar path, they appear as `NaN` in numeric columns and as an empty string in string columns.
+- **No AST magic needed**: The standard Series annotation mechanism handles everything — there is no special treatment for `extra_fields` in the AST transformers.
+
+## CSV Format
+
+Your source CSV simply includes extra columns alongside the standard OHLCV columns:
+
+```csv
+timestamp,open,high,low,close,volume,rsi,signal,custom_price
+2024-01-01T00:00:00,100.0,105.0,95.0,102.0,1000,45.2,buy,99.5
+2024-01-01T01:00:00,102.0,108.0,100.0,106.0,1200,52.1,,101.3
+2024-01-01T02:00:00,106.0,110.0,104.0,108.0,800,38.7,sell,
+```
+
+The following column names are recognized as standard OHLCV and will **not** appear in `extra_fields`:
+
+| Recognized OHLCV columns                              |
+|--------------------------------------------------------|
+| `timestamp`, `time`, `date`, `datetime`                |
+| `open`, `high`, `low`, `close`, `volume`               |
+
+Any other column name is treated as an extra field.
+
+## Sidecar File Format
+
+The auto-generated `.extra.csv` file contains only the extra columns, with rows aligned 1:1 to the binary `.ohlcv` file (including gap-filled rows):
+
+```csv
+rsi,signal,custom_price
+45.2,buy,99.5
+52.1,,101.3
+38.7,sell,
+,,
+,,
+42.0,hold,100.0
+```
+
+Empty rows correspond to gap-filled bars in the OHLCV binary (bars with `volume = -1`).
+
+## Limitations
+
+- **Binary format unchanged**: The `.ohlcv` binary format remains fixed at 24 bytes per record. Extra fields are stored separately in the sidecar CSV.
+- **JSON source files**: Extra field extraction is currently supported for CSV and TXT source formats, not JSON.
+- **Memory**: The sidecar is loaded entirely into memory when opening the OHLCV file. For typical datasets (up to a few hundred thousand bars with a handful of extra columns), this is negligible.
+- **Not available from providers**: Data download providers (e.g., CCXT, TradingView) produce standard OHLCV data only. Extra fields are for user-provided CSV data.
+
+## See Also
+
+- [OHLCV Reader/Writer](./ohlcv-reader-writer.md) — Binary OHLCV format details
+- [CSV Reader/Writer](./csv-reader-writer.md) — CSV processing internals
diff --git a/src/pynecore/core/data_converter.py b/src/pynecore/core/data_converter.py
@@ -6,6 +6,7 @@
 """
 from __future__ import annotations
 
+import csv
 import json
 from enum import Enum
 from datetime import time
@@ -172,6 +173,14 @@ def convert_to_ohlcv(
             # Copy modification time from source to maintain freshness
             copy_mtime(file_path, ohlcv_path)
 
+            # Generate extra fields sidecar CSV if source has extra columns
+            extra_csv_path = file_path.with_suffix('.extra.csv')
+            if detected_format in ('csv', 'txt'):
+                self._generate_extra_csv(file_path, ohlcv_path, extra_csv_path,
+                                         detected_format == 'txt')
+                if extra_csv_path.exists():
+                    copy_mtime(file_path, extra_csv_path)
+
             # Generate TOML symbol info file if needed and not already loaded
             # skip_toml_generation is set earlier if TOML already exists (line 129-134)
             if symbol and not skip_toml_generation and (force or not toml_path.exists()):
@@ -239,14 +248,99 @@ def convert_to_ohlcv(
                     pass
 
         except Exception as e:
-            # Clean up output file on error
-            if ohlcv_path.exists():
-                try:
-                    ohlcv_path.unlink()
-                except OSError:
-                    pass
+            # Clean up output files on error
+            for cleanup_path in (ohlcv_path, file_path.with_suffix('.extra.csv')):
+                if cleanup_path.exists():
+                    try:
+                        cleanup_path.unlink()
+                    except OSError:
+                        pass
             raise ConversionError(f"Failed to convert {file_path}: {e}") from e
 
+    # Column names that are part of standard OHLCV data (not extra fields)
+    _OHLCV_COLUMNS = {
+        'timestamp', 'time', 'date', 'datetime',
+        'open', 'high', 'low', 'close', 'volume',
+    }
+
+    def _generate_extra_csv(
+            self,
+            source_path: Path,
+            ohlcv_path: Path,
+            extra_csv_path: Path,
+            is_txt: bool = False
+    ) -> None:
+        """
+        Generate a sidecar .extra.csv file with non-OHLCV columns from the source data.
+        The sidecar is position-aligned with the binary OHLCV file (including gap-filled rows).
+
+        :param source_path: Path to the original CSV/TXT file
+        :param ohlcv_path: Path to the generated binary OHLCV file
+        :param extra_csv_path: Path for the output sidecar CSV
+        :param is_txt: True if source is TXT format (auto-detect delimiter)
+        """
+        # Detect delimiter for TXT files
+        delimiter = ','
+        if is_txt:
+            with open(source_path, 'r') as f:
+                first_line = f.readline().strip()
+                for delim in ['\t', ';', '|']:
+                    if delim in first_line:
+                        delimiter = delim
+                        break
+
+        # Read source headers and identify extra columns
+        with open(source_path, 'r', newline='') as f:
+            reader = csv.reader(f, delimiter=delimiter)
+            raw_headers = next(reader, None)
+            if not raw_headers:
+                return
+
+            headers_lower = [h.lower().strip() for h in raw_headers]
+            extra_indices = [
+                i for i, h in enumerate(headers_lower)
+                if h not in self._OHLCV_COLUMNS
+            ]
+
+            if not extra_indices:
+                return
+
+            extra_headers = [raw_headers[i].strip() for i in extra_indices]
+
+            # Collect extra values from all source rows (in order)
+            source_extra_rows: list[list[str]] = []
+            for row in reader:
+                if is_txt:
+                    row = [field.strip() for field in row]
+                extra_row = [row[i] if i < len(row) else '' for i in extra_indices]
+                source_extra_rows.append(extra_row)
+
+        if not source_extra_rows:
+            return
+
+        # Align with OHLCV binary (which may have gap-filled rows)
+        with OHLCVReader(ohlcv_path) as ohlcv_reader:
+            total_positions = ohlcv_reader.size
+            empty_row = [''] * len(extra_headers)
+            source_idx = 0
+
+            with open(extra_csv_path, 'w', newline='') as out_f:
+                writer = csv.writer(out_f)
+                writer.writerow(extra_headers)
+
+                for pos in range(total_positions):
+                    ohlcv = ohlcv_reader.read(pos)
+                    if ohlcv.volume < 0:
+                        # Gap-filled row — write empty values
+                        writer.writerow(empty_row)
+                    else:
+                        # Real data row — consume next source row
+                        if source_idx < len(source_extra_rows):
+                            writer.writerow(source_extra_rows[source_idx])
+                            source_idx += 1
+                        else:
+                            writer.writerow(empty_row)
+
     @staticmethod
     def detect_format(file_path: Path) -> Literal['csv', 'txt', 'json', 'ohlcv', 'unknown']:
         """
diff --git a/src/pynecore/core/ohlcv_file.py b/src/pynecore/core/ohlcv_file.py
@@ -1365,7 +1365,8 @@ class OHLCVReader:
     Very fast OHLCV data reader using memory mapping.
     """
 
-    __slots__ = ('path', '_file', '_mmap', '_size', '_start_timestamp', '_interval')
+    __slots__ = ('path', '_file', '_mmap', '_size', '_start_timestamp', '_interval',
+                 '_extra_data', '_extra_headers')
 
     def __init__(self, path: str | Path):
         self.path = str(path)
@@ -1374,6 +1375,8 @@ def __init__(self, path: str | Path):
         self._size = 0
         self._start_timestamp = None
         self._interval = None
+        self._extra_data: list[dict[str, int | float | str]] | None = None
+        self._extra_headers: list[str] | None = None
 
     def __enter__(self):
         self.open()
@@ -1467,8 +1470,59 @@ def open(self) -> 'OHLCVReader':
                 second_timestamp = struct.unpack('I', cast(Buffer, self._mmap[RECORD_SIZE:RECORD_SIZE + 4]))[0]
                 self._interval = second_timestamp - self._start_timestamp
 
+        self._load_extra_csv()
+
         return self
 
+    def _load_extra_csv(self) -> None:
+        """
+        Load extra fields from sidecar .extra.csv file if it exists.
+        The sidecar is position-aligned with the binary OHLCV file.
+        """
+        extra_path = Path(self.path).with_suffix('.extra.csv')
+        if not extra_path.exists():
+            return
+
+        with open(extra_path, 'r', newline='') as f:
+            reader = csv.reader(f)
+            headers = next(reader, None)
+            if not headers:
+                return
+
+            self._extra_headers = headers
+
+            # Detect each column's type from its first non-empty value
+            rows_raw: list[list[str]] = []
+            col_is_numeric: list[bool | None] = [None] * len(headers)
+
+            for row in reader:
+                rows_raw.append(row)
+                for i, val in enumerate(row[:len(headers)]):  # ignore cells beyond the header count
+                    if col_is_numeric[i] is None and val and val.lower() not in ('nan', 'na'):
+                        try:
+                            float(val)
+                            col_is_numeric[i] = True
+                        except ValueError:
+                            col_is_numeric[i] = False
+
+            # Default undetected columns to string
+            col_is_numeric = [v if v is not None else False for v in col_is_numeric]
+
+            # Parse all rows with detected types
+            self._extra_data = []
+            for row in rows_raw:
+                parsed: dict[str, int | float | str] = {}
+                for i, header in enumerate(headers):
+                    val = row[i] if i < len(row) else ''
+                    if col_is_numeric[i]:
+                        if not val or val.lower() in ('nan', 'na', ''):
+                            parsed[header] = float('nan')
+                        else:
+                            parsed[header] = float(val)
+                    else:
+                        parsed[header] = val
+                self._extra_data.append(parsed)
+
     def __iter__(self) -> Iterator[OHLCV]:
         """
         Iterate through all candles
@@ -1487,7 +1541,12 @@ def read(self, position: int) -> OHLCV:
 
         offset = position * RECORD_SIZE
         data = struct.unpack(STRUCT_FORMAT, self._mmap[offset:offset + RECORD_SIZE])
-        return OHLCV(*data, extra_fields={})
+
+        extra = {}
+        if self._extra_data is not None and position < len(self._extra_data):
+            extra = self._extra_data[position]
+
+        return OHLCV(*data, extra_fields=extra)
 
     def read_from(self, start_timestamp: int, end_timestamp: int | None = None, skip_gaps: bool = True) \
             -> Iterator[OHLCV]:
@@ -1524,6 +1583,8 @@ def close(self):
         if self._file:
             self._file.close()
             self._file = None
+        self._extra_data = None
+        self._extra_headers = None
 
     def get_positions(self, start_timestamp: int | None = None, end_timestamp: int | None = None) -> tuple[int, int]:
         """
diff --git a/src/pynecore/core/script_runner.py b/src/pynecore/core/script_runner.py
@@ -98,6 +98,7 @@ def _set_lib_properties(ohlcv: OHLCV, bar_index: int, tz: 'ZoneInfo', lib: Modul
     lib.close = _round_price(ohlcv.close)
 
     lib.volume = ohlcv.volume
+    lib.extra_fields = ohlcv.extra_fields if ohlcv.extra_fields else {}
 
     lib.hl2 = (lib.high + lib.low) / 2.0
     lib.hlc3 = (lib.high + lib.low + lib.close) / 3.0
@@ -162,6 +163,7 @@ def _reset_lib_vars(lib: ModuleType):
     lib._time = 0
     lib._datetime = datetime.fromtimestamp(0, UTC)
 
+    lib.extra_fields = {}
     lib._lib_semaphore = False
 
     lib.barstate.isfirst = True
diff --git a/src/pynecore/lib/__init__.py b/src/pynecore/lib/__init__.py
@@ -95,6 +95,9 @@
 # Stores data to plot
 _plot_data: dict[str, Any] = {}
 
+# Extra fields from CSV data (beyond OHLCV), populated each bar by ScriptRunner
+extra_fields: dict[str, Any] = {}
+
 # Lib semaphore - to prevent lib's main function from doing things it must not (plot, strategy things, etc.)
 _lib_semaphore = False
 
diff --git a/src/pynecore/transformers/lib_series.py b/src/pynecore/transformers/lib_series.py
diff --git a/tests/t00_pynecore/core/data/extra_fields.csv b/tests/t00_pynecore/core/data/extra_fields.csv
diff --git a/tests/t00_pynecore/core/test_007_extra_fields.py b/tests/t00_pynecore/core/test_007_extra_fields.py